Extension:UniversalLanguageSelector/Fonts for Chinese wikis
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. This was a Google Summer of Code/2014 project/proposal.
Introduction
Including all Chinese characters makes a webfont file too large. Instead, we can tailor the font file for each page based on the characters used on that page. Once finished, this feature can be applied to other languages facing the same problem, such as Japanese.
At the time of writing, no free font of sufficient quality covers all Chinese characters in Unicode. Because the "wiki" concept itself encourages collaborative content creation, it would be nice to invite users to create a glyph whenever the system encounters a character that has no glyph data.
Proposal
Mentors
Repository
Demo Site
With a debug tool you can see that only 20KB is downloaded. (You can also click on the red tofu to see the new tofu-detection feature, which is more accurate because it compares pixels.)
Another example has more characters. About 40KB is downloaded.
To avoid conflicts, please create a new page when you try it out. You can write markup such as:
<p style="font-family:WenQuanYi Micro Hei">SOMETHING YOU WANT</p>
Development Report
Milestones
- May 19: Start coding.
  - Warm up with the code and the development tool set
  - Clarify what to do next
- June 22: Mid-term evaluation: Finish the prototype of Font Tailor
- July 20: Finish the Font Tailor (TTF tailor finished and well tested; SVG/WOFF/EOT tailors finished but without guarantee)
- Aug 11: Pencils down: Tofu detection with font family settings
- Aug 22: Final evaluation: Documentation (the page you are reading)
Next Step
- Product Implementation
Address the issues described under Known issues.
- Future Mentoring
Since I graduated this year, I can no longer participate in GSoC as a student. I would like to be a mentor here instead, to help others with language-related projects.
Font Tailor Implementation
Dynamic WebFonts
With a standard WebFonts service, a static font file is downloaded. The @font-face rule looks like this:
@font-face {
    font-family: WenQuanYi;
    ...
    src: url('fontspath/wenquanyi.ttf') format('truetype'), ...;
}
Instead, we want to return a font that is tailored to contain all of the characters on that page, and only those. So we change the URL to:
@font-face {
    font-family: WenQuanYi;
    ...
    src: url('FontRequest.php?font=WenQuanYi...') format('truetype'), ...;
}
When the page is visited, a font request is sent to FontRequest.php. The script gets the information it needs from the request parameters. If a tailored font file exists and is up to date, the script returns it as an attachment:
header( "Content-Type: application/octet-stream" );
header( "Content-Disposition: attachment; filename=\"$wanted_filename\"" );
readfile( $tailored_fontfile );
If no tailored font file is available, or the existing one is outdated, the script should generate one.
Tailored Font Management
Under the font's path, there are three subtrees:
- tailored/
  - 02c68248c6b40670c2889218987af948.ttf
  - 9efbe2b03fd390fa3e4bec7d65b36f46.ttf
  - ...
- tailored_for_title/
  - Main_Page_17.ttf -> 02c68248c6b40670c2889218987af948.ttf
  - Main_Page_16.ttf -> 02c68248c6b40670c2889218987af948.ttf
  - ...
- tailored_for_url/
  - %2Fwiki%2FMain_Page.ttf -> 02c68248c6b40670c2889218987af948.ttf
  - %2Fwiki%2FTest%3Fdebug%3Dtrue.ttf -> 9efbe2b03fd390fa3e4bec7d65b36f46.ttf
  - ...
The tailored tree contains all the actual tailored font files. Each distinct set of characters maps to its own tailored font; for example, 'abcde' and 'abcdef' map to different files. The file name is the MD5 value of the character sequence.
The tailored_for_title and tailored_for_url trees contain soft links to files in the tailored tree.
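The layout above is a form of content-addressed storage. The following is a minimal Python sketch of the idea, not the extension's PHP code. The function names charset_key and link_to_tailored are hypothetical, and the exact input to MD5 (UTF-8 bytes of the sorted, de-duplicated characters) is an assumption, since the page only says the name is the MD5 of the character sequence.

```python
import hashlib
import os


def charset_key(text: str) -> str:
    """MD5 hex digest of the sorted, de-duplicated characters of `text`.

    Assumption: the characters are joined and encoded as UTF-8 before hashing.
    """
    chars = "".join(sorted(set(text)))
    return hashlib.md5(chars.encode("utf-8")).hexdigest()


def link_to_tailored(font_dir: str, subtree: str, name: str, key: str) -> str:
    """Create or replace the soft link <subtree>/<name>.ttf -> ../tailored/<key>.ttf."""
    link_dir = os.path.join(font_dir, subtree)
    os.makedirs(link_dir, exist_ok=True)
    link_path = os.path.join(link_dir, name + ".ttf")
    target = os.path.join("..", "tailored", key + ".ttf")

    # Build the link under a temporary name, then rename it over the old one,
    # so that an existing link is updated rather than causing an error.
    tmp_path = link_path + ".tmp"
    if os.path.lexists(tmp_path):
        os.remove(tmp_path)
    os.symlink(target, tmp_path)
    os.replace(tmp_path, link_path)
    return link_path
```

Two pages that use the same set of characters get the same key and therefore share one file under tailored/, while each keeps its own link under tailored_for_title/ or tailored_for_url/.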
Font Tailor Workflow
- Trigger the Tailor
By hooking the ArticleViewHeader event, FontTailor checks whether a tailored font is ready for an article when the article is requested. It does so by checking whether the <Title, Revision> pair already exists under the tailored_for_title tree. If it does not, the tailor is triggered.
- Do Tailoring
Get the article's content and build its character set by sorting and de-duplicating the characters. Look up the MD5 value of that set under the tailored tree to see whether a tailored font already exists. If not, call php-font-lib to generate one there. This mechanism is somewhat like Git: different articles or revisions may share the same tailored font.
Create a soft link under the tailored_for_title tree so that future requests will hit it.
Create or update a soft link under the tailored_for_url tree; it is used in the next step. Note that the same URL may present different article revisions over time, so the soft link should always be updated, whether or not an actual tailoring took place.
- Request Tailored Font
When the article is loaded on the client, it fires a request for the font, which we have changed from a static font file to FontRequest.php. The script reads $_SERVER['HTTP_REFERER'] to get the requesting page's URL and looks it up under the tailored_for_url tree.
- Download and Render
If everything goes well, you will see a properly rendered page like the attachment. The tailored font contains only the characters on the page, reducing the download size from 4.5MB to 20KB.
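The two lookups in this workflow depend on how link names are derived. The Python sketch below shows one derivation that reproduces the example file names listed under Tailored Font Management. It is an illustration only: the function names are hypothetical, and the encoding rule (percent-encode the path and query of the referer, or join title and revision with an underscore) is inferred from those example names, not taken from the extension's source.

```python
from urllib.parse import quote, urlsplit


def url_link_name(referer: str) -> str:
    """Link name under tailored_for_url/ for a requesting page URL.

    Assumption: only the path and query are used, fully percent-encoded.
    """
    parts = urlsplit(referer)
    path_and_query = parts.path
    if parts.query:
        path_and_query += "?" + parts.query
    # safe="" makes quote() encode "/", "?" and "=" as well.
    return quote(path_and_query, safe="") + ".ttf"


def title_link_name(title: str, revision: int) -> str:
    """Link name under tailored_for_title/ for a <Title, Revision> pair."""
    return f"{title.replace(' ', '_')}_{revision}.ttf"


print(url_link_name("http://example.org/wiki/Test?debug=true"))
print(title_link_name("Main Page", 17))
```

Because the link under tailored_for_url/ is keyed only by URL, it has to be rewritten on every view, as noted in the Do Tailoring step.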
Known issues
- php-font-lib bug
Strangely, the font file produced by php-font-lib does not work as a WebFont. However, if you open it in another font editor (FontCreator or FontForge) and save it to a new file, the new file works. The two files differ somewhat, but I do not yet know why. If you have knowledge of TTF fonts, please take a look.
The current workaround is to run an extra fix-up script:
#!/usr/bin/env fontforge
Open('input.ttf',1)
SelectAll()
Copy()
Generate('output.ttf')
Close()
Calling exec from PHP is ugly, and so is requiring FontForge as a dependency, so I would like to fix the problem in php-font-lib if possible.
- Concurrent Requests
As described above, a FontTailor request tailors a font, writes it to disk, and creates two soft links. The whole process can take 3 seconds or more. In a production environment, many concurrent requests are likely to arrive within that time, and several tailoring runs for the same font may be started. So we need some kind of lock while tailoring.
- Additional Content Loaded with AJAX
We do not currently handle such complicated scenarios. If the new content contains characters that were not in the original HTML, they may not be rendered as expected.
- Redundant Subsetting
Currently every subsetted font contains every character in the page. For example,
<p style="font-family:Font1">ABC</p>
<p style="font-family:Font2">DEF</p>
We tailor both Font1 and Font2, but each tailored font contains all of the characters ABCDEF rather than only the ones it needs (ABC for Font1, DEF for Font2).
Note that I have since solved this problem by walking the DOM tree in pure JavaScript.
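To illustrate the Redundant Subsetting item, here is a minimal Python sketch that collects a separate character set per font family. It is not the JavaScript solution mentioned above. It assumes that font-family is set only through inline style attributes, that only the first family in a list matters, and that the HTML is well nested; the class and function names are hypothetical.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

FONT_FAMILY_RE = re.compile(r"font-family\s*:\s*([^;]+)", re.IGNORECASE)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FontCharCollector(HTMLParser):
    """Collect the characters used under each inline font-family."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # effective font family per open element
        self.chars = defaultdict(set)   # font family -> set of characters

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_FAMILY_RE.search(style)
        if match:
            # Keep only the first family in a comma-separated list.
            family = match.group(1).split(",")[0].strip().strip("'\"")
        else:
            # No inline setting: inherit from the parent element.
            family = self.stack[-1] if self.stack else None
        self.stack.append(family)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        family = self.stack[-1] if self.stack else None
        if family:
            self.chars[family].update(c for c in data if not c.isspace())


def chars_per_font(html: str) -> dict:
    """Return {font family: sorted string of characters rendered in it}."""
    collector = FontCharCollector()
    collector.feed(html)
    collector.close()
    return {family: "".join(sorted(cs)) for family, cs in collector.chars.items()}
```

Each per-family string can then be hashed and subsetted on its own, so that in the example above Font1 would carry only ABC and Font2 only DEF.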
Tofu Detection with FontFamily
If a Chinese character is rendered as a tofu, the reason is that its glyph is not available in any of the fonts, whether from the WebFonts service or from the system. According to task T65122, the most reliable way to detect a tofu is to compare its image with the image of a known tofu, such as U+0D00.
However, you cannot do that with a fixed fontFamily such as sans-serif, because a WebFonts service may render the character properly with a remote font. As a result, the current detectTofu() method may produce false positives. We should detect a tofu using the element's actual fontFamily setting. Tofus also look different from font to font, as you can see below:
ഀ [sans-serif tofu]
ഀ [Linux Libertine tofu]
ഀ [宋体 tofu]
ഀ [Georgia tofu]
Detect Tofu by Comparing Image
Use the HTML5 canvas element to draw each character, and compare the result with the tofu's image.
This was introduced in another patch of mine; see task T65122 and patch 122277.
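The comparison logic can be sketched independently of the canvas. The Python example below is a sketch under stated assumptions, not the code from patch 122277: `render` is a hypothetical callback standing in for "draw this character on a canvas with this font family and return its pixels", and U+0D00 is used as the probe character only because it is the example given on this page.

```python
from typing import Callable, List, Sequence

Bitmap = Sequence[int]                    # flattened pixel values
Renderer = Callable[[str, str], Bitmap]   # (character, font family) -> pixels

PROBE = "\u0d00"  # character assumed to render as tofu in the fonts under test


def find_tofus(render: Renderer, text: str, font_family: str,
               probe: str = PROBE) -> List[str]:
    """Return the distinct characters of `text` that render as tofu.

    The reference tofu image is rendered with the same font family as the
    text, because tofu shapes differ from font to font.
    """
    tofu_pixels = list(render(probe, font_family))
    tofus = []
    for ch in text:
        if ch.isspace() or ch == probe or ch in tofus:
            continue
        if list(render(ch, font_family)) == tofu_pixels:
            tofus.append(ch)
    return tofus


if __name__ == "__main__":
    # Fake renderer for demonstration: each font "supports" a few characters
    # and draws a font-specific tofu bitmap for everything else.
    supported = {"FontA": "你好", "FontB": "你"}

    def fake_render(ch: str, family: str) -> Bitmap:
        if ch in supported.get(family, ""):
            return [ord(ch) & 0xFF, (ord(ch) >> 8) & 0xFF]
        return [len(family), 0]

    print(find_tofus(fake_render, "你好", "FontB"))
```

In a browser the callback would draw onto a canvas using the element's computed font family, which avoids the false positives that a fixed sans-serif comparison can produce.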
Popup to Show Tofu Information
Traverse the DOM tree to find all text nodes, mark the detected tofus in red, and bind a click event that opens a popup showing each tofu's information. In the future we can guide users to the font's contribution page or to our own glyph-contribution page.