User:TJones (WMF)/Notes/Khmer Reordering Analysis Analysis

January 2021 — See TJones_(WMF)/Notes for other projects. See also T185721, and notes on Syllable Re-Writing to Improve Khmer Search Performance.

Background

See the Background section of Syllable Re-Writing to Improve Khmer Search Performance for more details on the Khmer writing system.

The relevant summary of Khmer script is:

  • Khmer is written without spaces between most words.
  • The writing system is non-linear, in that letters and diacritics added to a base character can go above, below, to the left of, or to the right of the base character.
  • Historically, Khmer fonts have not been strict about enforcing the canonical order of letters or diacritics, and may render "incorrect" orders or multiple diacritics reasonably well.
  • Khmer script often contains invisible/zero-width characters—like zero-width space, zero-width non-joiner, zero-width joiner, soft hyphen, invisible separator—that can be copied and pasted from other sources unintentionally.
  • Khmer has some deprecated characters that, depending on fonts, can look identical to their appropriate replacement characters.
  • For search, this is a problem roughly analogous to what it would be like in English if strap, srtap, satrp, and straaaaap all looked identical on the page/screen (something like rStaP).

A quick example: depending on your font support, these may all look the same, though they are underlyingly in different orders:

  • ង្រ្កា ( ង + ្រ + ្ក + ា )
  • ង្រា្ក ( ង + ្រ + ា + ្ក )
  • ង្ក្រា ( ង + ្ក + ្រ + ា )

Based on my research into the ideal canonical ordering of Khmer letters and diacritics, and tests on text from Khmer Wikipedia and Wiktionary, I developed an algorithm and prototype to canonically reorder Khmer syllables for the purposes of search (and remove or replace deprecated and invisible characters).
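The actual reordering lives in the plugin described below; as a rough illustration of the core idea only, here is a minimal Python sketch that keeps each coeng + consonant pair together and stable-sorts everything after the base character into a fixed category order. The ranks and the codepoint tie-break are simplifications of my own, chosen just to make the output deterministic:

    # Minimal sketch of canonically reordering one orthographic syllable.
    # The category ranks and the codepoint tie-break are simplified
    # stand-ins for the researched canonical order.

    COENG = "\u17d2"  # KHMER SIGN COENG: subscripts the following consonant

    def sort_key(unit):
        if unit.startswith(COENG):
            return (0, ord(unit[1]))           # subscript consonants first
        if 0x17b6 <= ord(unit[0]) <= 0x17c5:
            return (1, 0)                      # then dependent vowels
        return (2, 0)                          # then any other signs

    def reorder_syllable(syll):
        base, rest = syll[0], syll[1:]
        units, i = [], 0
        while i < len(rest):
            if rest[i] == COENG and i + 1 < len(rest):
                units.append(rest[i : i + 2])  # keep coeng + consonant together
                i += 2
            else:
                units.append(rest[i])
                i += 1
        return base + "".join(sorted(units, key=sort_key))

    # The three visually identical spellings above all normalize identically:
    for s in ["ង្រ្កា", "ង្រា្ក", "ង្ក្រា"]:
        print(reorder_syllable(s))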

Data

I pulled 5,000 articles each from Khmer Wikipedia and Wiktionary, with the usual sanitization: mostly removing HTML markup and deduplicating lines (so that the equivalent of "References" and "Noun" don't skew the sample).
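In outline, the sanitization is something like this (a simplified sketch; the real cleanup handles more kinds of markup):

    import re

    def sanitize(lines):
        """Strip HTML markup and deduplicate lines; a simplified sketch of
        the cleanup described above."""
        seen, out = set(), []
        for line in lines:
            line = re.sub(r"<[^>]+>", " ", line).strip()
            if line and line not in seen:
                seen.add(line)
                out.append(line)
        return out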

The New Filter and the Old Tokenizer

The ICU tokenizer breaks up Khmer text into tokens; it seems to use a dictionary. It makes sense that canonically reordering Khmer text prior to tokenization would improve dictionary matches.

Early analysis showed that doing so would have a big impact, so I built an Elasticsearch plugin (extra-analysis-khmer) to do the reordering as a character filter (khmer_syll_reorder) so that it happens before tokenization.
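In index-settings form, the character filter slots in ahead of the tokenizer, which is what guarantees the ICU tokenizer only ever sees reordered text. A hypothetical Python sketch, with the analyzer name khmer_text made up for illustration:

    # Hypothetical index settings; the analyzer name khmer_text is made up.
    # A char_filter runs before the tokenizer, so the ICU tokenizer only
    # ever sees canonically reordered text.
    settings = {
        "analysis": {
            "analyzer": {
                "khmer_text": {
                    "type": "custom",
                    "char_filter": ["khmer_syll_reorder"],
                    "tokenizer": "icu_tokenizer",
                }
            }
        }
    }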

Side Quests: A Series of Unfortunate Errors

A few complications contributed to this analysis taking longer than I wanted.

It was already slow because Khmer can be difficult to read when you don't know it very well, especially since subscripted consonants don't always look like their plain counterparts (though, to be fair, A/a, D/d, G/g, and R/r aren't super similar either, just familiar).

Adding to the complexity, we have...

An old frenemy returns

A while back (Nov 2017) I found an error in the ICU tokenizer. Briefly, changes in character set can trigger differences in tokenization across spaces: the 14th is tokenized as the | 14th, but ァ 14th is tokenized as ァ | 14 | th, because ァ and th are in different character sets. (My guess is that some "previous character set" flag is not reset at the space, and numbers are not in a character set, so t is compared to ァ and they are not the same, but I'm not sure.)
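If you want to see the inconsistency yourself, a request like the following against any Elasticsearch instance with the ICU plugin installed should reproduce it (the URL is illustrative):

    import requests

    # Assumes a local Elasticsearch with the ICU plugin installed.
    for text in ["the 14th", "ァ 14th"]:
        resp = requests.get(
            "http://localhost:9200/_analyze",
            json={"tokenizer": "icu_tokenizer", "text": text},
        )
        print(text, "->", [t["token"] for t in resp.json()["tokens"]])

    # Expected: "the 14th" -> ['the', '14th']
    #           "ァ 14th"  -> ['ァ', '14', 'th']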

It turns out there is also chunking in the tokenizer that breaks input into 4k chunks, with the added assumption that it is always okay to split chunks on spaces.

Because the Khmer reordering plugin deletes some duplicate and invisible characters, the boundaries of some of the chunks changed just so, and "fixed" the tokenizer error by doing the equivalent of splitting ァ and 14th into different chunks. This is a lot easier in Khmer because it's mostly spaceless, so there are lots of long strings without spaces, which makes the chunking more irregular.
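Here's a toy illustration of the boundary effect, with 4096 standing in for the chunk size and a zero-width space standing in for the deleted character:

    CHUNK = 4096

    # 4,000 characters, one zero-width space, then 200 more characters.
    text = "x" * 4000 + "\u200b" + "y" * 200
    cleaned = text.replace("\u200b", "")  # the kind of deletion the filter makes

    # Deleting one character shifts every later chunk boundary by one.
    print(text[:CHUNK].count("y"))     # 95 y's fall in the first chunk
    print(cleaned[:CHUNK].count("y"))  # 96 y's fall in the first chunk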

Anyway, this change (which I first noticed as a difference in tokenizing 57x57px into either one token or two) freaked me out, because the new filter should only have been affecting Khmer characters (and invisibles), so I had to make sure my plugin wasn't broken. That's when I discovered the above. I removed the tiny bits of offending text from my corpus to do my analysis.

Upstream tickets: LUCENE-9754, ES-27290

A new challenger arises

Unfortunately, I got tripped up by another unexpected twist while reviewing the reordering results.

First, the ICU tokenizer does some weird things when processing Khmer text. There's a character called a coeng in Unicode (ceung on English Wikipedia) that is used to indicate that the following consonant should be subscripted. The ICU tokenizer sometimes (albeit rarely) splits tokens after a coeng. That doesn't make any sense; it's like splitting haček into haˇ and cek, or almost as bad as splitting tower into tov and ver; it just ain't right!

Also, tracking the individual characters of the original text through the syllable reordering transformation is dodgy at best. Normally this isn't a huge problem because orthographic syllables aren't supposed to be broken up into pieces, and "normal" syllable boundaries are not affected by the reordering (there are some edge cases, but they start with malformed syllables anyway: garbage in, garbage out). So, generally, the boundaries of a syllable stay the same.

The way these two issues interact is that a sequence like srt gets rewritten, correctly, as str, but then gets incorrectly tokenized as s_ and tr. The tokens get mapped back into the original string as s_ and rt, and it is all very confusing when looking at the analysis results.

My parsing of the input text for my analysis is similar to the process of highlighting, so I expect a small number of potentially weird highlights to result from the reordering process. I expect the better search results will make up for it.

Analysis Analysis

Enabling the syllable re-ordering resulted in about 3.7% fewer tokens (112,970 out of 3,085,337) in my Khmer Wikipedia sample and 10.7% fewer tokens (13,642 out of 127,092) in my Khmer Wiktionary sample. My earlier syllable-based analysis showed roughly 0.2% of syllables were reordered, but we see a much bigger impact here. This makes sense because when a syllable gets properly reordered, it can be recognized as part of a larger word. For example, with hypothetical processing of English words and syllables, misspelled reordre might not be recognized as a word, so it gets tokenized as re, ord, re, whereas reorder is recognized as one token (so we have two fewer tokens).

The impact on "collisions" (when tokens that were not previously analyzed as the same now are) is even higher. Roughly a fifth of analysis groups picked up at least one new distinct token, and 40-50% of tokens are in a group that picked up a new distinct token.

For example, there are 16 variants of ប្រកបដោយ that are now grouped together. They are visually identical in the default font in Chrome, but they all have various instances of zero-width spaces in them. There are ក្លាយទៅជា and កា្លយទៅជា, where the second one has the subscripted consonant and dependent vowel in the wrong order; or ចាប់ផ្តើម and ចាប់ផ្តើ­ម, where the second has a soft hyphen hiding in it; or ជាច្រើន and ជាច្រេីន, where the second has the vowel sign ( ើ ) broken into two vowel signs ( េ + ី ).

Others show visual differences in my version of Chrome, but look fine (or at least not obviously broken) in other Khmer fonts, such as កម្ពុជា and កម្្ពុជា, where the second one has a double coeng ( ្ ), which should not happen (coeng should always be followed by a consonant to be subscripted).
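The character-level fixes behind these collisions look roughly like the following sketch; the invisible-character list and the two mappings are illustrative examples, not the plugin's full inventory:

    import re

    # ZWSP, ZWNJ, ZWJ, soft hyphen, invisible separator
    INVISIBLES = re.compile("[\u200b\u200c\u200d\u00ad\u2063]")

    def cleanup(text):
        text = INVISIBLES.sub("", text)                # drop invisible characters
        text = text.replace("\u17c1\u17b8", "\u17be")  # េ + ី -> ើ
        text = re.sub("\u17d2+", "\u17d2", text)       # collapse doubled coeng
        return text

    print(cleanup("ជាច្រេីន") == "ជាច្រើន")   # True: split vowel repaired
    print(cleanup("កម្្ពុជា") == "កម្ពុជា")   # True: double coeng collapsed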

There were also a vanishingly small number of splits (tokens that used to be analyzed the same no longer are).

A small number of token types (<0.5%) increased their token counts, and lots of token types (15-20%) decreased their token counts. I put this down to improvements in parsing by the tokenizer; as an analogy, if realtable (parsed as real, table) becomes relatable, which is parsed as a single word, then both real and table have fewer tokens.

Numerals

I remembered late in the game that I had also planned to add to the Khmer analysis chain a character filter mapping Khmer numerals (០១២៣៤៥៦៧៨៩) to Arabic numerals (0123456789).

I added the filter and diffed the results against having just the reordering filter, and it was about what you would expect. There are a lot of direct number mappings (១២៧៩ → 1279), some mixed number mappings (so that 0,0៦ and ០,០៦ are now the same, both indexed as 0,06), etc.
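The mapping itself is trivial; as a Python sketch it's just a translation table (Khmer digits are U+17E0 through U+17E9):

    KHMER_TO_ARABIC = str.maketrans("០១២៣៤៥៦៧៨៩", "0123456789")

    print("១២៧៩".translate(KHMER_TO_ARABIC))  # 1279
    print("០,០៦".translate(KHMER_TO_ARABIC))  # 0,06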

Next Steps

  • ✓ Merge the patch that activates the new Khmer analysis chain if the extra-analysis-khmer plugin is available.
  • ✓ Build and deploy the Khmer plugin.
  • ✓ Reindex Khmer-language wikis.