User:TJones (WMF)/Notes/Analyzer Analysis for Elasticsearch Upgrade from 6.8 to 7.10

August 2022 — See TJones_(WMF)/Notes for other projects. See also T301131 and my 6.5 to 6.8 analysis. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Summary

  • There are no changes to most analyzers between 6.8 and 7.10.
  • The most impactful (and most debatable) changes to the Nori (Korean) tokenizer made between 6.5 and 6.8 have been reverted (keeping the smaller, better changes).
  • The Thai tokenizer now allows some less commonly used Unicode characters through, where before it would delete/ignore them.
  • The problem of narrow non-breaking spaces (NNBSP) that existed in the 6.5 ICU tokenizer and that was introduced in the 6.8 standard tokenizer persists, so I'm going to patch it.

Background & Data

We've already seen some changes from 6.5 to 6.8, but we decided to wait until we finished upgrading to 7.10 to handle the new problems (particularly with narrow no-break spaces (NNBSP, U+202F)). (Some good changes include many less commonly used characters no longer being deleted/ignored by the tokenizer. Oooo, foreshadowing!)

I used the same data sets that I used in the 6.5-to-6.8 analysis: 500 random documents each from the Wikipedia and Wiktionary for the following 47 languages: Arabic, Bulgarian, Bangla, Bosnian, Catalan, Czech, Danish, German, Greek, English, Esperanto, Spanish, Estonian, Basque, Persian, Finnish, French, Irish, Galician, Hebrew, Hindi, Croatian, Hungarian, Armenian, Indonesian, Italian, Japanese, Javanese, Khmer, Korean, Lithuanian, Latvian, Malay, Dutch, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo-Croatian, Slovak, Serbian, Swedish, Thai, Turkish, Ukrainian, and Chinese; plus 500 random Wikipedia articles from the following 5 languages (which do not have Wiktionaries): Tibetan, Central Kurdish (aka Sorani), Gan Chinese, Mirandese, Rusyn; and 150 random articles from the Dzongkha Wikipedia (which does not yet have 500 articles).

A subset of these languages was originally chosen because they had an interesting mix of writing systems and language analysis configurations. The rest were added after my initial findings showed something interesting was going on, so that every language with some sort of custom language analysis is covered.

Most Things Remain the Same

It seems that the standard tokenizer and ICU tokenizer haven't changed further, and so there are no changes in most language analyzers.

However, there seem to be a very small number of changes for the Hebrew Wiktionary data, a few in the Thai and Korean Wiktionary data, and a lot in the Korean Wikipedia data.

Thai Changes

The changes to Thai were all good ones, and involved the Thai tokenizer no longer deleting/ignoring less commonly used characters. Tokens now allowed through include Ahom (e.g., 𑜉𑜣, 𑜌𑜤𑜉, 𑜎𑜀) and New Tai Lue (e.g., ᦶᦔ, ᦷᦈᦑᦓᦱ, ᦺᦎᧈᦎᦱᧄ) tokens, a character from the Kana Supplement Unicode Block (𛂕), and a fair number of Ideographic characters, including some I don't even have fonts for (e.g., 𫥭, 𫦔, 𫧛, 𫯺, 𫹋, 𫿆, 𬂎, 𬄒, 𬇊, 𬈌; I'm missing 4 of those 10!). (Though none show up on-wiki... weird.)
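
These are easy to spot-check with the _analyze API. A quick sketch like the one below (assuming a local Elasticsearch 7.10 node; the URL and the sample strings are only illustrative) shows whether the stock thai tokenizer now lets a given character through:

  import requests  # assumes a local Elasticsearch 7.10 node at the placeholder URL below

  ES = "http://localhost:9200"  # placeholder URL

  # Ahom, New Tai Lue, and Kana Supplement samples from the examples above
  for text in ["𑜌𑜤𑜉", "ᦷᦈᦑᦓᦱ", "𛂕"]:
      resp = requests.post(f"{ES}/_analyze",
                           json={"tokenizer": "thai", "text": text})
      print(text, "→", [t["token"] for t in resp.json()["tokens"]])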

The impact of the changes is small—no changes in my Wikipedia sample, and only 0.8% more tokens in the Wiktionary sample.

Korean Changes

Our Korean analysis chain uses the Nori tokenizer, which apparently has been changed to split tokens on the transition from text to numbers. So, tokens like پایتخت1, лык1, πιστεύω1, 15장, すいか1, and wai3man6 are split up into numeric and non-numeric tokens. In a lot of cases this seems good, but less so in cases like P2P being split up into p, 2, and p. However, the plain field will take up some of the slack in cases like that.
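
To see the new number splits directly, something like the sketch below can be run against a local Elasticsearch 7.10 node with the analysis-nori plugin installed; the localhost URL and the sample strings are placeholders for illustration:

  import requests  # assumes a local Elasticsearch 7.10 node with the analysis-nori plugin

  ES = "http://localhost:9200"  # placeholder URL

  # mixed text+number tokens from the examples above
  for text in ["P2P", "лык1", "15장", "wai3man6"]:
      resp = requests.post(f"{ES}/_analyze",
                           json={"tokenizer": "nori_tokenizer", "text": text})
      tokens = [t["token"] for t in resp.json()["tokens"]]
      print(text, "→", tokens)  # e.g., P2P is expected to come back as P, 2, P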

A small number of tokens, on the order of a dozen, are being filtered by the Nori Part-of-Speech token filter (nori_posfilter), presumably because their parts of speech have been changed.

For example, prefixed "제" is being filtered; it is a prefix that marks ordinals (like the th in 4th, 5th, etc.), so that doesn't seem like a bad thing. It was previously tagged as "NP(Pronoun)" and regularized into two overlapping tokens (제 and 저); now it is tagged as "XPN(Prefix)" and filtered. The tagging changes are small in scope, so I'm not going to worry about them too much.
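
For reference, the part-of-speech filtering itself is configurable through the stock nori_part_of_speech token filter. Below is a minimal sketch of an analyzer that drops XPN(Prefix) tokens; the index, filter, and analyzer names, and the one-tag stoptags list, are illustrative only and not our production nori_posfilter configuration:

  import requests  # assumes a local Elasticsearch 7.10 node with the analysis-nori plugin

  ES = "http://localhost:9200"  # placeholder URL

  settings = {
      "settings": {
          "analysis": {
              "filter": {
                  "drop_prefixes": {              # illustrative name, not our actual nori_posfilter
                      "type": "nori_part_of_speech",
                      "stoptags": ["XPN"]         # XPN = prefix, the tag 제 now gets
                  }
              },
              "analyzer": {
                  "korean_sketch": {
                      "type": "custom",
                      "tokenizer": "nori_tokenizer",
                      "filter": ["drop_prefixes", "lowercase"]
                  }
              }
          }
      }
  }

  requests.put(f"{ES}/nori_sketch", json=settings)        # scratch index for testing
  resp = requests.post(f"{ES}/nori_sketch/_analyze",
                       json={"analyzer": "korean_sketch", "text": "제5장"})
  print([t["token"] for t in resp.json()["tokens"]])      # 제 should be filtered out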

The impact on the Korean Wiktionary sample is small, only 0.8% more tokens. But the Korean Wikipedia sample had 6.7% more tokens, which is a much bigger increase! The vast majority of changes are because of the numeral/text token split.

A Look Back in Time

Looking at my notes on Korean from the 6.5 to 6.8 update, I see that this change is a partial reversal of changes made then!

I think the developers were trying to fix the problem of splitting too aggressively on character set changes. That is, splitting a token as you switch from Latin to Cyrillic is a defensible position (I disagree, but I see the point), but splitting as you switch from Latin to Extended Latin is probably not what was intended. They did fix that in 6.8; however, not splitting on numbers came along for the ride, and they have now undone that. Reviewing my previous examples...

  • The tokenizer generally breaks words on character set changes, but earlier versions of the tokenizer were a little too fine-grained in assigning character classes, so that "Extended" character sets (e.g., "Extended Latin" or "Extended Greek"), IPA, and others were treated as completely different character sets, causing mid-word token breaks. Some examples:
    • νοῦθοσ → was: νο, ῦ, θοσ; now: νοῦθοσ  
    • пєԓӈа → was: пє, ԓ, ӈа; now: пєԓӈа  
    • suɏong → was: su, ɏ, ong; now: suɏong  
    • hɥidɯɽɦɥidɯɭ → was: h, ɥ, id, ɯɽɦɥ, id, ɯɭ; now: hɥidɯɽɦɥidɯɭ  
    • boːneŋkai → was: bo, ː, neŋkai; now: boːneŋkai  
  • On the other hand, numbers are now considered to not be part of any character set, so tokens no longer split on numbers. The standard tokenizer does this, too. The aggressive_splitting filter breaks them up for English and Italian. Some Korean examples:
    • 1145년 → was: 1145, 년; now: 1145년 (7.10: back to 1145, 년)
    • 22조의2 → was: 22, 조의, 2; now: 22조의2 (7.10: back to 22, 조의, 2)
    • лык1 → was: лык, 1; now: лык1 (7.10: back to лык, 1)
    • dung6mak6 → was: dung, 6, mak, 6; now: dung6mak6 (7.10: back to dung, 6, mak, 6)
  • The Korean tokenizer still splits on character set changes. This generally makes sense for Korean text, which often does not have spaces between words. However, it can give undesirable results for non-CJK mixed-script tokens, including words with homoglyphs and stylized mixed-script words. Some examples, with Cyrillic in bold (see the sketch after this list):
    • chocоlate → choc, о, late  
    • KoЯn → ko, я, n  
    • NGiИX → ngi, и, x  
  • The changes result in:
    • 6.4% fewer tokens in my Korean Wikipedia sample, with more distinct tokens (e.g., x and 1 were already tokens, now so is x1.)
      • 7.10 note: This is very close to the exact inverse of the 6.7% more tokens we see in the 7.10 update; not everything was reverted.
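
The splitting behavior reviewed above is easy to spot-check with the _analyze API. This sketch (again assuming a local Elasticsearch 7.10 node with the analysis-nori plugin; the URL and sample strings are only illustrative) compares the standard tokenizer and the Nori tokenizer on a few of the tokens from the list:

  import requests  # assumes a local Elasticsearch 7.10 node with the analysis-nori plugin

  ES = "http://localhost:9200"  # placeholder URL

  samples = ["KoЯn", "chocоlate", "1145년", "22조의2"]
  for tok in ["standard", "nori_tokenizer"]:
      for text in samples:
          resp = requests.post(f"{ES}/_analyze",
                               json={"tokenizer": tok, "text": text})
          tokens = [t["token"] for t in resp.json()["tokens"]]
          # the standard tokenizer keeps the mixed-script tokens whole;
          # nori_tokenizer splits on script changes (and, again, on numbers)
          print(f"{tok:>14}  {text} → {tokens}")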

Hebrew Changes

Hmm. I originally had some changes showing two empty tokens sneaking into the output for Hebrew Wiktionary. However, I can't reproduce the problem. I was having a different problem with my analysis tool generating too many tokens—ES7 introduced a token limit for the API, which my code would overrun once in a while. Maybe there was a glitch and some unexpected extra info got parsed wrong by my script. Seems unlikely, but I have no other answer at the moment.

With a much higher limit for API tokens, I can't reproduce the problem, and it was only two tokens out of 85K tokens, so I'm not going to sweat it right now.

Note: There is no limit for internal analysis, so this isn't a problem for our data in production.
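
For reference, the limit ES7 added to the _analyze API is the index-level setting index.analyze.max_token_count (10,000 by default). One way to raise it for testing is to run _analyze against a scratch index with a bigger value, roughly as in this sketch; the index name and limit value are illustrative:

  import requests  # assumes a local Elasticsearch 7.10 node

  ES = "http://localhost:9200"  # placeholder URL

  # scratch index whose only job is to carry a higher _analyze token limit
  requests.put(f"{ES}/analyze_scratch",
               json={"settings": {"index.analyze.max_token_count": 1000000}})

  long_text = "ipsum " * 20000  # illustrative stand-in for a big batch of article text
  resp = requests.post(f"{ES}/analyze_scratch/_analyze",
                       json={"analyzer": "standard", "text": long_text})
  print(len(resp.json()["tokens"]))  # would trip the default 10,000-token limit otherwise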

NNBSP: Old Problems Are New Again

As noted in my 6.5-to-6.8 analysis:

A fair number of new tokens with spaces appeared in my output. These are coming from narrow no-break spaces (NNBSP, U+202F) in the input, which are no longer treated as word boundaries by the standard tokenizer. The ICU normalizer, when present, converts NNBSP to a regular space.

This results in multi-word tokens with spaces in the middle, and single-word tokens with spaces at either end—or both!

Since NNBSPs look like thin spaces, they are very difficult to detect while reading, which makes text containing them effectively unsearchable.

The standard tokenizer acquired this unwanted behavior in ES6.8. The ICU tokenizer already had it, but we hadn't noticed yet. We can fix it for both by adding a custom character filter that maps NNBSP => space.
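
A minimal sketch of that fix, using the stock mapping character filter (the index, filter, and analyzer names here are placeholders, not the actual CirrusSearch configuration), might look like this:

  import requests  # assumes a local Elasticsearch node; the mapping char filter works the same in 6.8 and 7.10

  ES = "http://localhost:9200"  # placeholder URL

  settings = {
      "settings": {
          "analysis": {
              "char_filter": {
                  "nnbsp_norm": {                     # placeholder name
                      "type": "mapping",
                      "mappings": ["\u202F=>\u0020"]  # Python escapes give the literal NNBSP => space mapping
                  }
              },
              "analyzer": {
                  "nnbsp_sketch": {
                      "type": "custom",
                      "char_filter": ["nnbsp_norm"],
                      "tokenizer": "standard",
                      "filter": ["lowercase"]
                  }
              }
          }
      }
  }

  requests.put(f"{ES}/nnbsp_test", json=settings)     # scratch index for testing
  resp = requests.post(f"{ES}/nnbsp_test/_analyze",
                       json={"analyzer": "nnbsp_sketch", "text": "100\u202F000 km"})
  print([t["token"] for t in resp.json()["tokens"]])  # the NNBSP now acts as a word boundary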

Aggressive Splitting: English and Italian use the standard tokenizer, but don't have the NNBSP problem because the aggressive_splitting filter breaks them up early in the analysis chain.

Custom Tokenizers: Chinese, Hebrew, and Korean use custom tokenizers that don't have this problem, and so don't need any adjustments.

Monolithic Analyzers: Arabic, Armenian, Bulgarian, Central Kurdish (aka Sorani), Hungarian, Japanese, Latvian, Lithuanian, Persian, Romanian, Thai, Turkish, and Ukrainian are currently still using monolithic analyzers, so we can't customize their behavior. All except Thai use the standard tokenizer, and all can be fixed (and should be fixed automatically) when unpacked.

Plain Field: All of these languages use either the ICU or standard tokenizer in their plain index field, and so we should adjust all of those, too.

I'm going to make the fix in the 6.8 environment because there isn't a stable 7.10 environment yet (my dev environment is a mess!); the fix applies there, too, and is the same for both 6.8 and 7.10.