Unicode normalization considerations


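What is this?

As of version 1.4, MediaWiki applies normalization form C (NFC) to Unicode text input. Some reasons for normalizing:

  • Avoid conflicts between page titles that have the same characters but different composition forms.
    • A long-standing problem has been media files uploaded from Safari; the file name, and therefore the page title, arrives in decomposed form, whereas most other tools submit text in composed form.
  • Make searches work as expected, regardless of the composition form of the text input.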
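
To make the title-conflict case concrete, here is a minimal Python sketch. It uses the standard `unicodedata` module rather than MediaWiki's own normalizer, and the sample title and the `nfc` helper are purely illustrative assumptions.

```python
import unicodedata

# Two visually identical titles with different composition forms.
composed = "Caf\u00e9"      # precomposed é (U+00E9)
decomposed = "Cafe\u0301"   # e (U+0065) + combining acute accent (U+0301)

def nfc(text: str) -> str:
    """Hypothetical helper: normalize text to form C."""
    return unicodedata.normalize("NFC", text)

# Compared code point by code point, the raw strings differ ...
same_before = composed == decomposed
# ... but after normalization they are the same string, so they name one page.
same_after = nfc(composed) == nfc(decomposed)
```

Without normalization the two strings would be treated as distinct titles even though a reader cannot tell them apart.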

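We chose form C because:

  • The vast majority of input data is already in form C, using precomposed characters.
  • Form C is supposed to be relatively lossless, the only changes being invisible conversions between base character + combining character sequences and precomposed characters. In theory, text should never change its appearance as a result of being normalized to form C.
  • Form C is also endorsed by the W3C.

MediaWiki does not apply any normalization to its output. For example, cafe<nowiki/>́ becomes “café” (U+0065 U+0301 in a row, with no precomposed character such as U+00E9 appearing).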
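
The effect can be imitated outside MediaWiki. The sketch below is only an approximation under stated assumptions: Python's `unicodedata` stands in for the input normalizer, and a plain string replacement stands in for the parser removing the `<nowiki/>` tag.

```python
import unicodedata

# Wikitext as typed: the empty tag separates "e" from the combining accent.
source = "cafe<nowiki/>\u0301"

# Input normalization finds nothing to compose, because the accent follows ">".
stored = unicodedata.normalize("NFC", source)

# Crude stand-in for rendering: drop the tag, with no normalization afterwards.
rendered = stored.replace("<nowiki/>", "")

# The last two code points of the rendered text stay decomposed.
codepoints = [f"U+{ord(c):04X}" for c in rendered[-2:]]
is_nfc = rendered == unicodedata.normalize("NFC", rendered)
```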

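When MediaWiki displays internal links, page titles are also normalized to form C, even if they are encoded using HTML entities, character references, or most other workarounds that would avoid the corresponding conversion in the source text. However, no NFC conversion is applied to characters embedded in page titles via percent-encoding (as of MediaWiki 1.35.0), for example %E1%BD%B5.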
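
To see why that particular percent-encoded sequence is a useful test case, the following Python sketch decodes it and shows what NFC would do to the result. It says nothing about MediaWiki's internal handling; it only uses the standard `urllib.parse` and `unicodedata` modules.

```python
import unicodedata
from urllib.parse import unquote

# Decode the percent-encoded UTF-8 bytes from the example above.
decoded = unquote("%E1%BD%B5")

# NFC replaces this character with its canonical equivalent.
normalized = unicodedata.normalize("NFC", decoded)

decoded_name = unicodedata.name(decoded)        # GREEK SMALL LETTER ETA WITH OXIA
normalized_name = unicodedata.name(normalized)  # GREEK SMALL LETTER ETA WITH TONOS
```

The decoded character (U+1F75) is one that NFC changes to a different code point (U+03AE), so a title that keeps it unchanged has evidently not been normalized.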

Problems

However, a few problems cropped up after this had been running for a while.

  • Some Arabic, Persian and Hebrew combining vowel markers are sorted incorrectly by canonical ordering.
    • Some of these are just buggy fonts or renderers and only affect some platforms.
    • A few cases, however, can produce incorrect text, because the defined classifications don't include enough distinctions to produce semantically correct ordering. This affects primarily older texts such as Biblical Hebrew.
  • A surprising composition exclusion in Bangla.
    • The result doesn't render correctly with some tools, probably again a platform-specific bug.
    • Some third-party search tools apparently don't know how to normalize and fail to locate texts so normalized.
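
Both kinds of problem can be reproduced with Python's `unicodedata` module. The examples are assumptions chosen for illustration: the Hebrew sequence is an often-cited Biblical Hebrew case, and the Bengali letter is one known composition exclusion, not necessarily the one the list above has in mind.

```python
import unicodedata

# Hebrew: lamed followed by patah, then hiriq, in the order the text needs.
LAMED, PATAH, HIRIQ = "\u05dc", "\u05b7", "\u05b4"
typed = LAMED + PATAH + HIRIQ

# Canonical ordering sorts combining marks by combining class.
# Patah has class 17 and hiriq has class 14, so the two marks swap places.
patah_class = unicodedata.combining(PATAH)
hiriq_class = unicodedata.combining(HIRIQ)
reordered = unicodedata.normalize("NFC", typed)

# Bengali: U+09DC (letter RRA) is a composition exclusion, so NFC
# *decomposes* it into U+09A1 + U+09BC instead of keeping one code point.
RRA = "\u09dc"
rra_nfc = unicodedata.normalize("NFC", RRA)
```

In the first case the normalized text no longer records the order the author wrote; in the second, a tool that does not normalize its own input will fail to match the stored text.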

The rendering and third-party search problems are annoying, though if we stay on our high horse we can try to ignore them and let the other parties fix their broken software over time.

The canonical ordering problems are a harder issue; you simply can't get these right by following the current specs. Unicode won't change the ordering definitions because it would break their compatibility rules, so unless they introduce *new* characters with the correct values... Well, it's not clear this is going to happen.

What can we do about it?

We can either ignore it and hope it goes away (easy, but entails dealing with ongoing complaints from particular linguistic groups), or we can give up on comprehensive normalization and change how we use it to maximize the benefits while minimizing the problems.

If we consider normalization form C (NFC) to be destructive (though not as much as its evil little sister NFKC), one possible plan might look like this:

  • Remove the normalization check on all web input; replace it with a more limited check for UTF-8 validity but allow funny composition forms through, as is.
  • Apply NFC directly in the places where it's most needed:
    • Page title normalization in Title::secureAndSplit()
    • Search engine index generation
    • Search engine queries

This is minimally invasive, allowing page text to contain arbitrary composition forms while ensuring that linking and internal search continue to work. It requires no database format changes, and could be switched on without service disruption.
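
The plan could be sketched roughly as follows. This is an illustrative Python outline, not MediaWiki code: the function names `accept_web_input`, `normalize_title` and `search_key` are hypothetical, and the real title handling in Title::secureAndSplit() does considerably more than normalization.

```python
import unicodedata

def accept_web_input(raw: bytes) -> str:
    """Hypothetical input filter: reject invalid UTF-8, keep composition as is."""
    # Strict decoding raises UnicodeDecodeError on malformed byte sequences.
    return raw.decode("utf-8", errors="strict")

def normalize_title(title: str) -> str:
    """Hypothetical title step: NFC is applied to page titles only."""
    return unicodedata.normalize("NFC", title)

def search_key(text: str) -> str:
    """Hypothetical search step: used for both the index and the query."""
    return unicodedata.normalize("NFC", text)

# Page text passes through with its decomposed form intact ...
page_text = accept_web_input("cafe\u0301".encode("utf-8"))
# ... while titles and search keys are still brought to a common form.
title = normalize_title(page_text)
```

Because the index and the query go through the same `search_key` step, a search for either composition form finds text stored in the other.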

However, it does leave visible page titles in the normalized, potentially ugly or incorrect form.

Longer term

A further possibility would be to allow page titles to be displayed in non-normalized forms. This might be done in concert with allowing arbitrary case forms ('iMonkey' instead of 'IMonkey').

In this case, the page table might be changed to include a display title form:

  page_title:         'IMonkey'
  page_display_title: 'iMonkey'

or perhaps even scarier case-folded stuff:

  page_title:         'imonkey'
  page_display_title: 'iMonkey'

The canonical and display titles would always be transformable to one another to maintain purity of wiki essence; you should be able to copy the title with your mouse and paste it into a [[link]] and expect it to work.
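
One way to picture the transformability requirement is as a check that a display title maps back to its canonical title. The Python sketch below covers only the display-to-canonical direction and rests on assumptions: the function names are hypothetical, and the canonical form is taken to be NFC plus an uppercased first letter (or, for the scarier variant, NFC plus case folding).

```python
import unicodedata

def canonical_title(display_title: str) -> str:
    """Hypothetical mapping: NFC, then uppercase the first character."""
    nfc = unicodedata.normalize("NFC", display_title)
    return nfc[:1].upper() + nfc[1:]

def folded_title(display_title: str) -> str:
    """Hypothetical mapping for the case-folded variant."""
    return unicodedata.normalize("NFC", display_title).casefold()

def is_valid_display_title(canonical: str, display: str) -> bool:
    """A display title is acceptable only if it maps back to the canonical one."""
    return canonical_title(display) == canonical

ok = is_valid_display_title("IMonkey", "iMonkey")
bad = is_valid_display_title("IMonkey", "iDonkey")
```

A display title that passes this check can be copied into a [[link]] and will still resolve to the same page.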

These kinds of changes could be more disruptive, requiring changes to the database structure and possibly massive swapping of data around in the tables from one form to another, so we might avoid it unless there are big benefits to be gained.

Other normalization forms

NFC was originally chosen because it's supposed to be semantically lossless, but experience has shown that that's not quite as true as we'd hoped.

We may then consider NFKC, the compatibility composition form, for at least some purposes. It's more explicitly lossy; the compatibility forms are recommended for performing searches since they fold additional characters, such as plain Latin and "full-width" Latin letters, together.

It would likely be appropriate to use NFKC for building the search index and to run on search input to get some additional matches on funny stuff. I'm not sure if it's safe enough for page titles, though; perhaps with a display title, but probably not without.
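
A small Python sketch of what NFKC would add for search, again using the standard `unicodedata` module; the helper name `search_fold` is hypothetical.

```python
import unicodedata

def search_fold(text: str) -> str:
    """Hypothetical search key: compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# Full-width Latin letters spelling "Wiki" (U+FF37 U+FF49 U+FF4B U+FF49).
fullwidth = "\uff37\uff49\uff4b\uff49"

# NFC leaves the full-width letters alone; NFKC folds them to plain Latin.
nfc_result = unicodedata.normalize("NFC", fullwidth)
nfkc_result = search_fold(fullwidth)

# The same folding is why NFKC is lossy: a superscript two becomes a plain digit.
squared = search_fold("x\u00b2")
```

The last line shows the risk for page titles: "x²" and "x2" would become the same title under NFKC.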

Normalization and unicodification can both be done by bots. While no bot is yet known to "normalize", it would be possible to write one. The "Curpsbot-unicodify" bot has unicodified various articles on Wikipedia, and this should not be undone.

See also