User:TJones (WMF)/Notes/What the Heck Does ICU Normalization Do, Anyway?

March 2020 — See TJones_(WMF)/Notes for other projects. See also T238151.

Background

We're considering using ICU normalization in the Glent project to normalize strings (currently lowercasing is the only normalization done). But that raises the question of exactly what ICU normalization does. It's not well documented, so I set out to test it by feeding it individual Unicode characters and seeing what the results are.

ICU normalization is available as a token filter and as a character filter. Since some of the normalizations performed could interact with tokenization, I looked at both.

For example, ICU normalization converts ⑴⑵⑶ to (1)(2)(3), but the standard tokenizer drops characters like ⑴⑵⑶. So, with the ICU normalization token filter, ⑴⑵⑶ generates no tokens. With the character filter, it generates three: "1", "2", and "3". With the whitespace tokenizer, the character filter gets the same three tokens, while the token filter generates one token, "(1)(2)(3)". I'm not sure what the best answer is (though I think "123" is my naive preference), but zero tokens is probably not it.
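
To make the comparison concrete, here is a minimal sketch of how the two setups might look in Elasticsearch index settings, written as a Python dict (the analyzer and filter names are made up for illustration, and it assumes the analysis-icu plugin is installed):

  # ICU normalization as a character filter (before tokenization) vs. as a
  # token filter (after tokenization). Both use the analysis-icu plugin's
  # "icu_normalizer" type; the custom names here are illustrative only.
  settings = {
      "analysis": {
          "char_filter": {
              "icu_norm_char": {"type": "icu_normalizer"}
          },
          "filter": {
              "icu_norm_token": {"type": "icu_normalizer"}
          },
          "analyzer": {
              "icu_as_char_filter": {          # normalize, then tokenize
                  "char_filter": ["icu_norm_char"],
                  "tokenizer": "standard"
              },
              "icu_as_token_filter": {         # tokenize, then normalize
                  "tokenizer": "standard",
                  "filter": ["icu_norm_token"]
              }
          }
      }
  }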

Data & Comparison

I wrote a little script to print out characters from the Basic Multilingual Plane (BMP, U+0000–U+FFFF), the Supplementary Multilingual Plane (SMP, U+10000–U+1FFFF), and the Supplementary Ideographic Plane (SIP, U+20000–U+2FFFF).

The high and low surrogates (U+D800–U+DBFF and U+DC00–U+DFFF) are invalid on their own, so I skipped them. I also skipped the non-character code points at U+FDD0–U+FDEF, the two reserved code points in each plane, U+xFFFE and U+xFFFF (for x ∈ {∅, 1, 2}), and the large section of unassigned code points in the SIP, from U+2FA20–U+2FFFD.

That gives a total of 193,019 characters to investigate.
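
The script itself isn't reproduced here, but the enumeration amounts to something like this minimal Python sketch of the same skip rules:

  # Enumerate the BMP, SMP, and SIP, skipping surrogates, non-characters,
  # the reserved U+xFFFE/U+xFFFF code points, and the unassigned SIP block.
  def test_chars():
      for cp in range(0x0000, 0x30000):
          if 0xD800 <= cp <= 0xDFFF:              # high and low surrogates
              continue
          if 0xFDD0 <= cp <= 0xFDEF:              # non-characters
              continue
          if cp % 0x10000 in (0xFFFE, 0xFFFF):    # reserved in each plane
              continue
          if 0x2FA20 <= cp <= 0x2FFFD:            # unassigned SIP code points
              continue
          yield chr(cp)

  for ch in test_chars():
      print('U+%04X\t%s' % (ord(ch), ch))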

Results

59,271 characters were exact matches before and after normalization.

An additional 127,502 were matches after accounting for the Java encoding (for example, 𤋮 (U+FA6C) is encoded as a pair of high and low surrogates: "\uD850\uDEEE"). As long as we convert these back to their original Unicode characters, they should be fine.
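
If we do end up needing to undo the Java escaping ourselves, something along these lines should work (a Python sketch; the function name is mine):

  import re

  def unjava(s):
      # Decode Java-style \uXXXX escapes, then rejoin any surrogate pairs
      # into single supplementary-plane characters.
      decoded = re.sub(r'\\u([0-9A-Fa-f]{4})',
                       lambda m: chr(int(m.group(1), 16)), s)
      return decoded.encode('utf-16', 'surrogatepass').decode('utf-16')

  print(unjava(r'\uD850\uDEEE'))  # U+242EE, the character behind the surrogate pair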

Another 929 were matches after accounting for lowercasing, which is another thing ICU normalization does. So A → a is a perfectly reasonable result. This included Latin, Greek, Coptic, Cyrillic, Armenian, Georgian, Glagolitic, Deseret, and parts of their various extended character sets.

That leaves 5,317 changes to review, plus two characters that I skipped—single and double quote—because they are messy on the command line. Fortunately many come in big chunks and a lot of the review is easy to do by eye (if you have enough specialty fonts installed)—though the stragglers took forever.

Full list of characters converted by ICU Normalization.

(254) A number of less commonly used scripts or characters get converted to lowercase variants that just weren't known to my lowercase function, including Cyrillic Ԩ/ԩ, Latin Ꞗ/ꞗ, Greek Ϳ/ϳ, Osage 𐒰/𐓘, Old Hungarian 𐲀/𐳀, Warang Citi 𑢠/𑣀, Medefaidrin 𖹀/𖹠, Adlam 𞤀/𞤢, Georgian Ა/ა.

(86) Cherokee lowercase letters are less commonly used, so they are converted to uppercase (ᏹ/Ᏹ).

(467) Some "variant" characters were converted to a more common form (that is sometimes visually indistinguishable, depending on fonts): µ (U+B5) → μ (U+3BC), ſ (U+17F) → s, ς (U+3C2) → σ (U+3C3), ﯼ (U+FBFC) → ی (U+6CC), ﹉ (U+FE49) → ̅, ﹖ (U+FE56) → ?, ά (U+1F71) → ά (U+3AC), ㄱ (U+3131) → ᄀ (U+1100), ︵ (U+FE35) → (, ﬠ (U+FB20) → ע (U+5E2), 〈 (U+2329) → 〈 (U+3008), 〸 (U+3038) → 十 (U+5341), ; (U+37E) → ;, ༌ (U+F0C) → ་ (U+F0B), ̀ (U+340) → ̀ (U+300), ` (U+1FEF) → ` (U+60). This includes some that have multiple characters in one symbol, like ̈́ (U+344) → ̈́.

(868) Many codepoints get expanded into their constituent parts (lowercased, if applicable), including symbols, punctuation, ligatures, presentation forms, digraphs, etc.: ¼ (U+BC) → 1/4, ij (U+133) → ij, DŽ (U+1C4) → dž, և (U+587) → եւ, ٵ (U+675) → اٴ, क़ (U+958) → क़, ਲ਼ (U+A33) → ਲ਼, ଡ଼ (U+B5C) → ଡ଼, ำ (U+E33) → ํา, ໜ (U+EDC) → ຫນ, གྷ (U+F43) → གྷ, ″ (U+2033) → ′′, ⁉ (U+2049) → !?, Ⅲ (U+2162) → iii, ∯ (U+222F) → ∮∮, ㎂ (U+3382) → μa, ffl (U+FB04) → ffl, ㏢ (U+33E2) → 3日, ﯪ (U+FBEA) → ئا, ㍹ (U+3379) → dm3, שׁ (U+FB2A) → שׁ, ⫝̸ (U+2ADC) → ⫝̸, ゟ (U+309F) → より, ‥ (U+2025) → .., ﱞ (U+FC5E) → ٌّ. (A rough approximation of these expansions is sketched after this list.)

(2) ß (U+DF) and ẞ (U+1E9E) are converted to ss, which is more or less their constituent parts (and it puts ß and s at a smaller edit distance from each other than they would otherwise be).

(1) Oddly, ẚ (U+1E9A) gets broken up to aʾ. It's a rare character, but I'm not sure what the logic here is.

(3) Backslash (\) gets escaped (\\), so that's fine. Double quote (") and single quote (') I tested manually because escaping them correctly on the command line is a pain. Single quote comes back as is, and double quote is escaped (\"), which is also fine.

(188) Superscripts, subscripts, and modifier variants—like ª (U+AA), ᵢ (U+1D62), ʵ (U+2B5), ㆕ (U+3195), ꚜ (U+A69C), ꝰ (U+A770)—get converted to their regular counterparts, including some that have multiple characters in one symbol, like 🅫 (U+1F16B).

(63) Similarly, a fair number of Greek characters with a "built-in" iota subscript get expanded to include a full iota, like ᾀ (U+1F80) → ἀι.

(1179) Upper- and lowercase math bold/italic/script/fraktur/double-struck/sans-serif/monospace Latin and Greek letters, digits, and other symbols—like 𝐀 (U+1D400), 𝐛 (U+1D41B), 𝐶 (U+1D436), 𝑑 (U+1D450), 𝑬 (U+1D46C), 𝒇 (U+1D487), 𝒢 (U+1D4A2), 𝒽 (U+1D4C0), 𝓘 (U+1D4D8), 𝓳 (U+1D4F3), 𝔎 (U+1D50E), 𝔩 (U+1D529), 𝕄 (U+1D544), 𝕟 (U+1D55F), 𝕺 (U+1D57A), 𝖕 (U+1D595), 𝖰 (U+1D5B0), 𝗋 (U+1D5CB), 𝗦 (U+1D5E6), 𝘁 (U+1D601), 𝘜 (U+1D61C), 𝘷 (U+1D637), 𝙒 (U+1D652), 𝙭 (U+1D66D), 𝚈 (U+1D688), 𝚣 (U+1D6A3), 𝚨 (U+1D6A8), 𝛃 (U+1D6C3), 𝛤 (U+1D6E4), 𝛿 (U+1D6FF), 𝜠 (U+1D720), 𝜻 (U+1D73B), 𝝜 (U+1D75C), 𝝷 (U+1D777), 𝞘 (U+1D798), 𝞳 (U+1D7B3), 𝟎 (U+1D7CE), 𝟙 (U+1D7D9), 𝟤 (U+1D7E4), 𝟯 (U+1D7EF), 𝟺 (U+1D7FA), ℵ (U+2135), 𞺯 (U+1EEAF)—are converted to their expected lowercase/plain versions. (I did not expect the list to go on that long—there are well over a thousand of them!!)

(319) Encircled or "ensquared" letters, numbers, and CJK characters are converted to their plain counterparts—⑬ (U+246C), Ⓧ (U+24CD), ㉄ (U+3244), ㉠ (U+3260), ㋐ (U+32D0), 🄺 (U+1F13A), 🈂 (U+1F202), including some that have multiple characters in one symbol, like 🄭 (U+1F12D), 🅎 (U+1F14E), 🈁 (U+1F201).

(180) Numbers and letters with parentheses or tortoise shell brackets, and numbers with periods or commas are converted to plain counterparts, with their accompanying punctuation—which is why we probably want ICU normalization as a character filter before tokenization: ⒇ (U+2487) → (20), ⒎ (U+248E) → 7., ⒯ (U+24AF) → (t), ㈀ (U+3200), ㈠ (U+3220), 🄁 (U+1F101), 🉀 (U+1F240), including some that have multiple characters in one symbol, like ㈝ (U+321D) → (오전).

(216) All 214 Kangxi radicals are converted to the corresponding and (nearly) identical Han character, like ⼄ (U+2F04) → 乙 (U+4E59). Other radicals are also converted to Han characters, like ⺟ (U+2E9F) → 母 (U+6BCD).

(1002) CJK compatibility ideographs get converted to their corresponding CJK unified ideographs, which are similar or identical, such as 丽 (U+2F800) → 丽 (U+4E3D). A fair number (112) are Java encoded, such as 𠄢 (U+2F803) → 𠄢 (U+20122), though it is encoded as "\uD840\uDD22".

(221) CJK halfwidth and fullwidth forms get converted into whichever is more canonical, i.e., plain forms for punctuation and Latin characters, and fullwidth for halfwidth CJK characters. E.g., ! (U+FF01) → !, 3 (U+FF13) → 3, g (U+FF47) → g, ヲ (U+FF66) → ヲ (U+30F2), ᄀ (U+FFA1) → ᄀ (U+1100)

(94) CJK "square" characters get broken up into their constituent parts—e.g., ㌀ (U+3300) → アパート, ㍿ (U+337F) → 株式会社, 🈀 (U+1F200) → ほか

(13) Codepoints for musical notes and symbols (which I don't think many people have the fonts for) also get converted into parts. 𝅘𝅥𝅮 (U+1D160) (an eighth note) gets converted into 𝅘 + 𝅥 + 𝅮 (U+1D158, U+1D165, U+1D16E == notehead black + combining stem + combining flag-1), all of which are Java encoded, giving "\uD834\uDD58\uD834\uDD65\uD834\uDD6E".

(32) The control characters (U+00–U+1F) were generally deleted. Other than space (U+20) and tab (U+9), I'm not sure if these were deleted by the whitespace tokenizer or the ICU normalizer, but it doesn't matter much since they aren't really normal text.

(77) A lot of characters at first seemed to get converted to spaces, but most of that turned out to be because of weird processing and/or parsing them as single letters. In context between other characters (e.g., "ab<char>cd"), most were simply deleted. These all seem reasonable.

(18) Characters that actually were converted to spaces. These all seem reasonable.

(33) Some plain diacritics are unexpectedly converted into their combining forms, but with spaces before them so they can't actually combine with anything. With the whitespace tokenizer and the character filter version of ICU normalization, this results in the tokens being split before the converted combining diacritic ("a¨b" → "a","̈b"); with the token filter version, the token has a space in it. With the standard tokenizer, the diacritics disappear in either case. When the combining forms are used correctly, ICU normalization converts <letter>+<combining> to the unified form when one exists (see the composition example in the sketch after this list). This is the case for a number of non-combining diacritical characters, including some script-specific ones (Greek and Katakana): ¨ ¯ ´ ¸ ˘ ˙ ˚ ˛ ˜ ˝ ΄ ΅ ᾽ ᾿ ῀ ῁ ῍ ῎ ῏ ῝ ῞ ῟ ῭ ΅ ´ ῾ ‗ ‾  ̄ ゛ ゜ ゙ ゚ (the last four are fullwidth and halfwidth forms).

Some leftover bits and bobs:

(2) The combining form ͅ (U+345) and regular form ι (U+1FBE) of the subscript iota are converted to an iota, but...

(1) The Greek character ͺ (U+37A), which is more-or-less a subscript iota, gets converted to a space+iota, which is weird, but sort of follows the pattern of other combining subscripts above, except that the combining form gets converted to a simple iota. My guess is the conversions chain together.

To my amazement, the total of numbers in parens is 5,319 (the original 5,317 + two quote characters). Gotta catch 'em all!
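
As a rough guide to where most of these mappings come from: the ICU normalizer's default mode (nfkc_cf) is approximately Unicode NFKC normalization plus case folding, so many of the conversions above, including the letter-plus-combining composition mentioned above, can be previewed in plain Python. This is only an approximation; the real ICU normalizer applies additional mappings:

  import unicodedata

  def approx_icu(s):
      # Rough stand-in for ICU's nfkc_cf: compatibility (de)composition plus case folding.
      return unicodedata.normalize('NFKC', s).casefold()

  print(approx_icu('Ⅲ'))   # 'iii' -- Roman numeral expanded and lowercased
  print(approx_icu('ß'))   # 'ss'
  print(approx_icu('ﬄ'))   # 'ffl' -- ligature broken into constituent letters
  print(unicodedata.normalize('NFC', 'a\u0308'))  # 'ä' -- letter + combining diaeresis composed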

Character Filter vs Token Filter

The one place where the character filter and the token filter differ in their output (given single characters as input) is with the complex Arabic ligatures ﷺ (U+FDFA) and ﷻ (U+FDFB), which get expanded to multiple words—"صلى الله عليه وسلم" and "جل جلاله", respectively.

In the case of the token filter, the multiword string is a single token, with spaces. For the character filter, the words get broken up into multiple tokens by the tokenizer.

For the purposes of Glent, the difference doesn't matter, since the result of tokenization is the same—a series of words separated by spaces.
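
As a sanity check, the two behaviors can be compared directly with the _analyze API. A quick sketch, assuming an index named glent_test was created with the (illustrative) analyzer names from the earlier configuration sketch:

  import requests

  # Compare how the two analyzers handle the ligature U+FDFA: the character
  # filter version should yield several tokens, the token filter version a
  # single multiword token.
  for analyzer in ('icu_as_char_filter', 'icu_as_token_filter'):
      resp = requests.post('http://localhost:9200/glent_test/_analyze',
                           json={'analyzer': analyzer, 'text': '\uFDFA'})
      print(analyzer, [t['token'] for t in resp.json()['tokens']])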

Concerns for Glent

  • Java encoding of high/low surrogate pairs—like 𤋮 (U+FA6C) → "\uD850\uDEEE"—needs to be addressed. Converting one character into a 12-character escape sequence is more than a little crazy, and it would also lead to all sorts of false similarities, as we saw in T168427.
  • I don't love the conversion of the stand-alone diacritical characters to space+combining form, but it's probably a low-frequency problem.
  • ẚ (U+1E9A) getting broken up to aʾ is also weird, but also probably not a big problem.

Overall, if we can handle the Java encoding, we'll be fine. I think I prefer using the character filter over the token filter to get a more consistent treatment of similar characters by the tokenizer.