Extension:UniversalLanguageSelector/Fonts for Chinese wikis

Introduction

A webfont file that includes every Chinese character is too large to download on each page view. Instead, we can tailor the font file for each page so that it contains only the characters used on that page. Once finished, this feature can be applied to other languages facing the same problem, such as Japanese.

As of writing, there is no free font that is "good" enough and includes all Chinese characters in Unicode. The "wiki" concept itself encourages collaborative content creation, so it would be nice to invite users to create a glyph whenever the system encounters a character with no existing glyph data.

Proposal

Go to Proposal

Mentors

DChan, Liangent

Repository

Font Tailor

Tofu Detection

Demo Site

Go to Demo Site

With a debug tool you can see that only 20KB is downloaded. (You can also click on the red tofu to see the new tofu-detection feature, which is more accurate because it compares pixels.)

Another example has more characters. About 40KB is downloaded.

To avoid conflicts, please create a new page to try it out. You can write things like:

<p style="font-family:WenQuanYi Micro Hei">SOMETHING YOU WANT</p>

Development Report

Go to Development Report

Milestones

  • May 19: Start coding.
    • Warm up with code and development tool set
    • Clarify what to do next
  • June 22: Mid-term evaluation: Finish the prototype of Font Tailor
  • July 20: Finish the Font Tailor (ttf tailor finished and well tested; svg/woff/eot tailor finished but with no guarantee)
  • Aug 11: Pencils down: Tofu detection with font family settings
  • Aug 22: Final evaluation: Documents (the page you're reading)

Next Step

  • Product Implementation

Resolve the known issues described below.

  • Future Mentoring

As I graduated this year, I can no longer participate in GSoC as a student. But I'd like to be a mentor here and help others with language-related projects.

Font Tailor Implementation

Dynamic WebFonts

With the standard WebFonts service, a static font file is downloaded. The @font-face rule looks like this:

@font-face {
    font-family: WenQuanYi;
    ...
    src: url('fontspath/wenquanyi.ttf') format('truetype'), ...;
}

Now we should return a different font for each page, tailored to contain all of, and only, the characters in that page. So we change the URL to:

@font-face {
    font-family: WenQuanYi;
    ...
    src: url('FontRequest.php?font=WenQuanYi...') format('truetype'), ...;
}

When the page is visited, the browser sends a font request to FontRequest.php. The script gets the information it needs from the request parameters. If a tailored font file exists and is up to date, it is returned as an attachment:

header( "Content-Type: application/octet-stream" );
header( "Content-Disposition: attachment; filename=\"$wanted_filename\"" );
readfile( $tailored_fontfile );

If no tailored font file is available, or the existing one is out of date, the script generates one.

Tailored Font Management

Under the font's path, there are three subtrees:

  • tailored/
    • 02c68248c6b40670c2889218987af948.ttf
    • 9efbe2b03fd390fa3e4bec7d65b36f46.ttf
    • ...
  • tailored_for_title/
    • Main_Page_17.ttf -> 02c68248c6b40670c2889218987af948.ttf
    • Main_Page_16.ttf -> 02c68248c6b40670c2889218987af948.ttf
    • ...
  • tailored_for_url/
    • %2Fwiki%2FMain_Page.ttf -> 02c68248c6b40670c2889218987af948.ttf
    • %2Fwiki%2FTest%3Fdebug%3Dtrue.ttf -> 9efbe2b03fd390fa3e4bec7d65b36f46.ttf
    • ...

The tailored tree contains all the real tailored font files. Each distinct set of characters maps to one tailored font; for example, 'abcde' and 'abcdef' map to different files. The file name is the MD5 hash of the character sequence.

The tailored_for_title and tailored_for_url trees contain soft links to tailored font files.
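The naming scheme can be sketched as follows. This is only an illustration in JavaScript (Node.js), not the extension's PHP code. The function names are hypothetical, and the exact character normalisation and the `Title_Revision` link format are assumptions inferred from the directory listing above.

```javascript
const crypto = require('crypto');

// Reduce a page's text to its sorted set of distinct characters.
function charSequence(text) {
  // Iterating a string yields whole code points, so characters outside
  // the Basic Multilingual Plane are not split in half.
  return [...new Set(text)].sort().join('');
}

// File name under tailored/: MD5 of the character sequence (assumed UTF-8).
function tailoredName(text) {
  const hash = crypto.createHash('md5').update(charSequence(text), 'utf8');
  return hash.digest('hex') + '.ttf';
}

// Link name under tailored_for_title/, e.g. "Main_Page_17.ttf".
function titleLinkName(title, revision) {
  return `${title}_${revision}.ttf`;
}

// Link name under tailored_for_url/, e.g. "%2Fwiki%2FMain_Page.ttf".
function urlLinkName(pageUrl) {
  const { pathname, search } = new URL(pageUrl);
  return encodeURIComponent(pathname + search) + '.ttf';
}
```

Because the name depends only on the set of characters, two pages that use the same characters in a different order resolve to the same file under tailored/.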

Font Tailor Workflow

  • Trigger the Tailor

By hooking the ArticleViewHeader event, FontTailor checks whether the tailored fonts are ready for an article when it is requested. It does so by checking whether the <Title, Revision> pair is present under the tailored_for_title tree. If not, it starts the tailor.

  • Do Tailoring

Get the article's content and generate its character set by sorting the characters and removing duplicates. Look up the MD5 hash of that set under the tailored tree to see whether a tailored font already exists. If not, call php-font-lib to generate one there. This mechanism is somewhat like Git: files are addressed by a hash of their content, so different articles or revisions may share the same tailored font.

Create a soft link under the tailored_for_title tree so that future requests will hit it.

Create or update a soft link under the tailored_for_url tree; it is used in the next step. Note that the same URL may present different article revisions over time, so we should always update this soft link, whether or not a real tailoring happened.

  • Request Tailored Font

When the article is ready on the client, the browser requests the font. Because we changed the @font-face rule, this request goes to FontRequest.php instead of a static font file. The script reads $_SERVER['HTTP_REFERER'] to get the requesting page's URL and looks it up under the tailored_for_url tree.

  • Download and Render
 
[Figure: FontTailor Demo]

If everything goes well, you'll see a properly rendered page like the one in the attachment. The tailored font contains only the characters in the page, reducing the download size from 4.5MB to 20KB.
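The whole workflow can be sketched with in-memory maps standing in for the three directory trees. This is a simplified model in JavaScript, not the extension's PHP code: the function names are hypothetical, `subset` is a placeholder callback for the php-font-lib call, and soft links are modelled as map entries that point to an MD5 key.

```javascript
const crypto = require('crypto');

// In-memory stand-ins for the three trees (assumption: the real
// implementation uses files and soft links on disk).
const tailored = new Map();         // md5 -> tailored font
const tailoredForTitle = new Map(); // "Title_Revision" -> md5
const tailoredForUrl = new Map();   // page URL -> md5

function charsetKey(content) {
  const chars = [...new Set(content)].sort().join('');
  const md5 = crypto.createHash('md5').update(chars, 'utf8').digest('hex');
  return { chars, md5 };
}

// "Trigger the Tailor" and "Do Tailoring": run when an article is viewed.
function onArticleView(title, revision, url, content, subset) {
  const titleKey = `${title}_${revision}`;
  let md5 = tailoredForTitle.get(titleKey);
  if (md5 === undefined) {
    const key = charsetKey(content);
    md5 = key.md5;
    if (!tailored.has(md5)) {
      tailored.set(md5, subset(key.chars)); // the only expensive step
    }
    tailoredForTitle.set(titleKey, md5);
  }
  // Always refresh the URL link: the same URL may show another revision later.
  tailoredForUrl.set(url, md5);
  return md5;
}

// "Request Tailored Font": look the font up by the referring page URL.
function onFontRequest(refererUrl) {
  const md5 = tailoredForUrl.get(refererUrl);
  return md5 === undefined ? null : tailored.get(md5);
}
```

In this model, two revisions of a page that use the same characters trigger only one call to `subset`, which is the sharing behaviour described above.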

Known issues

  • php-font-lib bug

Strangely, the font file output by php-font-lib does not work as a webfont. But if you open it in another font editor (FontCreator or FontForge) and save it to another file, it works. The two files differ, but I don't know why yet. If someone has knowledge of TTF fonts, please take a look:

- Output TTF of php-font-lib

- Fixed TTF by FontForge

The current solution is to run an extra fix-up script:

#!/usr/bin/env fontforge
Open('input.ttf',1)
SelectAll()
Copy()
Generate('output.ttf')
Close()

It's ugly to call exec in PHP, and it's also ugly to require FontForge. So I want to fix the problem in php-font-lib if possible.

  • Concurrent Requests

As described above, a FontTailor request tailors a font, writes it to disk, and creates two soft links. The whole process can take 3 seconds or more. In a production environment, many concurrent requests are likely to arrive within that time, and several tailoring runs may start for the same font. So we need some kind of lock when tailoring.
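One possible shape for such a lock is an exclusive lock file, sketched here in JavaScript (Node.js). This is an assumption about how the problem could be approached, not something the project implements, and the function names are hypothetical. It relies on the `'wx'` open flag, which fails if the file already exists.

```javascript
const fs = require('fs');

// Try to take the lock for one tailored font. Returns true if this caller
// now owns the lock, false if another request already holds it.
function tryLock(lockPath) {
  try {
    // 'wx' creates the file and fails with EEXIST if it already exists,
    // so only one caller can succeed.
    fs.closeSync(fs.openSync(lockPath, 'wx'));
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err;
  }
}

function unlock(lockPath) {
  fs.unlinkSync(lockPath);
}

// Run `tailor` only if nobody else is tailoring; otherwise skip it.
function tailorOnce(lockPath, tailor) {
  if (!tryLock(lockPath)) return false;
  try {
    tailor();
    return true;
  } finally {
    unlock(lockPath); // release even if tailoring throws
  }
}
```

A request that fails to get the lock could wait and retry, or fall back to the full font. A real deployment would also need to deal with stale lock files left behind by a crashed process.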

  • Additional Content Loaded with AJAX

We don't handle such complicated scenarios currently. If the new content contains characters that were not in the original HTML, it may not be rendered as expected.

  • Redundant Subsetting

Currently every subsetted font contains every character in the page. For example,

<p style="font-family:Font1">ABC</p>
<p style="font-family:Font2">DEF</p>

We tailor Font1 and Font2, but both contain all of the characters ABCDEF rather than only the ones each font needs (ABC for Font1, DEF for Font2).

Note that I have solved this problem by walking the DOM tree in pure JavaScript. See here.
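The idea of per-font subsetting can be illustrated with a small JavaScript sketch. It is not the linked solution: `charsPerFont` is a hypothetical helper, and it assumes the DOM walk has already produced a list of `{ fontFamily, text }` pairs, one per run of text.

```javascript
// Collect the characters needed by each font family.
// `runs` is a list of { fontFamily, text } pairs, for example gathered by
// walking the DOM and reading each text node's font-family (assumption).
function charsPerFont(runs) {
  const byFont = new Map();
  for (const { fontFamily, text } of runs) {
    if (!byFont.has(fontFamily)) byFont.set(fontFamily, new Set());
    for (const ch of text) byFont.get(fontFamily).add(ch);
  }
  // Turn each set into a sorted string of distinct characters.
  const result = {};
  for (const [font, chars] of byFont) {
    result[font] = [...chars].sort().join('');
  }
  return result;
}

const needed = charsPerFont([
  { fontFamily: 'Font1', text: 'ABC' },
  { fontFamily: 'Font2', text: 'DEF' },
]);
// needed.Font1 is 'ABC' and needed.Font2 is 'DEF'
```

Each font can then be tailored with its own character set instead of the character set of the whole page.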

Tofu Detection with FontFamily

If a Chinese character is rendered as a tofu, the reason is that its glyph is not available in any of the fonts in use, whether from the WebFonts service or from the system. According to task T65122, the most reliable way to detect a tofu is to compare its image with the image of a known tofu, such as U+0D00.

However, you cannot do that with a fixed fontFamily like sans-serif, because the WebFonts service may render the character properly with remote fonts. So the current detectTofu() method may report false positives. We should detect tofu using the text's real fontFamily setting. Tofus also look different from font to font, as you can see below:

  • ഀ [sans-serif tofu]

  • ഀ [Linux Libertine tofu]

  • ഀ [宋体 tofu]

  • ഀ [Georgia tofu]

Detect Tofu by Comparing Image

Use HTML5's canvas element to draw each character, and compare with the tofu's image.

This was introduced in another patch of mine; see task T65122 and patch 122277.
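A minimal sketch of the pixel comparison is shown below. It is not the code from the patch: the function names, the canvas size, and the font size are all illustrative assumptions. `renderPixels` and `isTofu` need a browser, while `samePixels` is plain JavaScript.

```javascript
// Compare two pixel buffers (for example ImageData.data) byte by byte.
function samePixels(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Browser only: draw one character with the given font-family and return
// its RGBA pixels. `fontFamily` is written as it would appear in CSS.
function renderPixels(ch, fontFamily, size = 32) {
  const canvas = document.createElement('canvas');
  canvas.width = size * 2;
  canvas.height = size * 2;
  const ctx = canvas.getContext('2d');
  ctx.font = `${size}px ${fontFamily}`;
  ctx.textBaseline = 'top';
  ctx.fillText(ch, 0, 0);
  return ctx.getImageData(0, 0, canvas.width, canvas.height).data;
}

// Treat a character as tofu if it renders exactly like the reference
// character (U+0D00 by default) under the same font-family.
function isTofu(ch, fontFamily, reference = '\u0D00') {
  return samePixels(
    renderPixels(ch, fontFamily),
    renderPixels(reference, fontFamily)
  );
}
```

Passing the element's real font-family, rather than a fixed one like sans-serif, is what lets the comparison account for fonts loaded by the WebFonts service.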

Traverse the DOM tree to find all text nodes, mark the tofus in red, and bind a click event that opens a popup showing each tofu's information. In the future we can guide users to the font's contribution page or to our own glyph-contribution page.
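The traversal step might look like the sketch below. The helper names are hypothetical, and `isTofu` is a placeholder predicate (for example, a canvas-based pixel comparison). The walk only relies on `nodeType`, `childNodes`, and `nodeValue`, so it works on real DOM nodes as well as on plain objects of the same shape.

```javascript
const TEXT_NODE = 3; // same value as Node.TEXT_NODE in browsers

// Depth-first walk that returns every text node under `root`.
function collectTextNodes(root, out = []) {
  if (root.nodeType === TEXT_NODE) {
    out.push(root);
  } else {
    for (const child of Array.from(root.childNodes || [])) {
      collectTextNodes(child, out);
    }
  }
  return out;
}

// Positions of the characters in a text node that `isTofu` flags.
function findTofus(textNode, isTofu) {
  const hits = [];
  let index = 0;
  for (const ch of textNode.nodeValue) {
    if (isTofu(ch)) hits.push({ ch, index });
    index += ch.length; // characters outside the BMP take two UTF-16 units
  }
  return hits;
}
```

In a browser, each hit could then be wrapped in its own element, styled red, and given a click handler that opens the popup.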
