User:TJones (WMF)/Notes/Language Analyzer Harmonization Notes
May 2023 — See TJones (WMF)/Notes for other projects. See also T219550. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.
Intro, Goals, Caveats
The goal of bringing language analyzers "into harmony" is to have as many of the non–language-specific elements of the analyzers be the same as possible. Some split words on underscores and periods, some don't. Some split CamelCase words and some don't. Some use ASCII folding, some use ICU folding, and some don't use either. Some preserve the original word and have two outputs when folding, and some don't. Some use the ICU tokenizer and some use the standard tokenizer (for no particular reason—there are good reasons to use the ICU, Hebrew, Korean, or Chinese tokenizers in particular cases). When there is no language-specific reason for these differences, it's confusing, and we clearly aren't using analysis best practices everywhere.
My design goal is to have all of the relevant upgrades made by default across all language analysis configurations, with only the exceptions having to be explicitly configured.
Our performance goal is to reduce zero-results rate and/or increase the number of results returned for 75% of languages with relevant queries. This goal comes with some caveats, left out of the initial statement to keep it reasonably concise.
- "All wikis" is, in effect, "all reasonably active wikis"—if a wiki has only had twelve searches last month, none with apostrophes, it's hard to meaningfully measure anything. More details in "Data Collection" below.
- I'm also limiting my samples to Wikipedias because they have the most variety of content and queries, and to limit testing scope, allowing more languages to be included.
- I'm going to ignore wikis with unchanged configs (some elements are already deployed on some wikis), since they will have approximately 0% change in results (there's always a bit of noise).
- "Relevant" queries are those that have the feature being worked on. So, I will have a collection of queries with apostrophe-like characters in them to test improved apostrophe handling, and a collection of queries with acronyms to test better acronym processing. I'll still test general query corpora to get a sense of the overall impact, and to look for cases where queries without the feature being worked on still get more matches (for example, searching for NASA should get more matches to N.A.S.A. in articles).
- I'm also applying my usual filters (used for all the unpacking impact analyses) to queries, mostly to filter out porn and other junk. For example, I don't think it is super important whether the query s`wsdfffffffsf actually gets more results once we normalize the backtick/grave accent to an apostrophe.
- Smaller/lower-activity wikis may get filtered out for having too few relevant queries for a given feature.
- We are averaging rates across wikis so that wiki size isn't a factor (and neither is sample rate—so, I can oversample smaller wikis without having to worry about a lot of bookkeeping).
Data Collection
I started by including all Wikipedias with 10,000 or more articles. I also gathered the number of active editors and the number of full-text queries (with the usual anti-bot filters) for March 2023. I dropped those with fewer than 700 monthly queries and fewer than 50 active editors. My original ideas for thresholds had been ~1000 monthly queries and ~100 active editors, but I didn't want or need a super sharp cutoff. Limiting by very low active editor counts meant fewer samples to get at the query-gathering step, which is somewhat time-consuming. Limiting by query count also meant less work at the next step of filtering queries, and all later steps, too.
I ran my usual query filters (as mentioned above), and also dropped wikis with fewer than 700 unique queries after filtering. That left 90 Wikipedias to work with. In order of number of unique filtered monthly queries, they are: English, Spanish, French, German, Russian, Japanese, Chinese, Italian, Portuguese, Polish, Arabic, Dutch, Czech, Korean, Indonesian, Turkish, Persian, Vietnamese, Swedish, Hebrew, Ukrainian, Igbo, Finnish, Hungarian, Romanian, Greek, Norwegian, Catalan, Hindi, Thai, Simple English, Danish, Bangla, Slovak, Bulgarian, Swahili, Croatian, Serbian, Tagalog, Slovenian, Lithuanian, Georgian, Tamil, Malay, Uzbek, Estonian, Albanian, Azerbaijani, Latvian, Armenian, Marathi, Burmese, Malayalam, Afrikaans, Urdu, Basque, Mongolian, Telugu, Sinhala, Kazakh, Macedonian, Khmer, Kannada, Bosnian, Egyptian Arabic, Galician, Cantonese, Icelandic, Gujarati, Central Kurdish, Serbo-Croatian, Nepali, Latin, Kyrgyz, Belarusian, Esperanto, Norwegian Nynorsk, Assamese, Tajik, Punjabi, Odia, Welsh, Asturian, Belarusian-Taraškievica, Scots, Luxembourgish, Irish, Alemannic, Breton, & Kurdish.
- Or, in language codes: en, es, fr, de, ru, ja, zh, it, pt, pl, ar, nl, cs, ko, id, tr, fa, vi, sv, he, uk, ig, fi, hu, ro, el, no, ca, hi, th, simple, da, bn, sk, bg, sw, hr, sr, tl, sl, lt, ka, ta, ms, uz, et, sq, az, lv, hy, mr, my, ml, af, ur, eu, mn, te, si, kk, mk, km, kn, bs, arz, gl, zh-yue, is, gu, ckb, sh, ne, la, ky, be, eo, nn, as, tg, pa, or, cy, ast, be-tarask, sco, lb, ga, als, br, ku.
I sampled 1,000 unique filtered queries from each language (except for those that had fewer than 1000).
I also pulled 1,000 articles from each Wikipedia to use for testing.
I used a combined corpus of the ~1K queries and the 1K articles for each language to test analysis changes. This allows me to see interactions between words/characters that occur more in queries and words/characters that occur more in articles.
Relevant Query Corpora
For each task, I plan to pull a corpus of "relevant" queries for each language for before-and-after impact assessment, by grepping for the relevant characters. For each corpus, I'll also do some preprocessing to remove queries that are unchanged by the analysis upgrades being made.
For example, when looking at apostrophe-like characters, ICU folding already converts typical curly quotes (‘’) to straight quotes ('), so for languages with ICU folding enabled, curly quotes won't be treated any differently, and I plan to remove those queries as "irrelevant". Another example is reversed prime (‵), which causes a word break with the standard tokenizer; apostrophes are stripped at the beginning or ending of words, so reversed prime at the edge of a word isn't actually treated differently from an apostrophe in the same place—though the reasons are very different.
For very large corpora (≫1000, for sure), I'll probably sample the corpus down to a more reasonable size after removing "irrelevant" queries.
I'm going to keep (or sample) the "irrelevant" queries (e.g., words with straight apostrophes or typical curly quotes handled by ICU folding) for before-and-after analysis, because they may still get new matches on words in wiki articles that use the less-common characters, though there are often many, many fewer such words on-wiki—because the WikiGnomes are always WikiGnoming!
Another interesting wrinkle is that French and Swedish use ICU folding with "preserve original", so that both the original form and folded form are indexed (e.g., l’apostrophe is indexed as both l’apostrophe and l'apostrophe). This doesn't change matching, but it may affect ranking. I'm going to turn off the "preserve original" filter for the purpose of removing "irrelevant" queries, since we are focused on matching here.
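Conceptually, "preserve original" just means the folding step emits the untouched token alongside its folded form at the same position. Here is a minimal Python sketch of the idea (this is not the actual Elasticsearch/CirrusSearch implementation, and the fold() stand-in only roughly approximates ICU folding):

```python
# Conceptual sketch of "folding with preserve original": emit the original
# token as well as its folded form whenever folding changes anything.
import unicodedata

def fold(token: str) -> str:
    # Stand-in for ICU folding: strip combining marks and straighten curly quotes.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("\u2018", "'").replace("\u2019", "'")

def fold_preserving_original(tokens):
    for token in tokens:
        folded = fold(token)
        if folded != token:
            yield token       # keep the original form for exact matching/ranking
        yield folded

print(list(fold_preserving_original(["l’apostrophe", "élan"])))
# ['l’apostrophe', "l'apostrophe", 'élan', 'elan']
```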
Some Observations
After filtering porn and likely junk queries and uniquifying queries, the percentage of queries remaining generally ranged from 94.52% (Icelandic—so many unique queries!) to 70.58% (Persian), with a median of 87.31% (Simple English), and a generally smooth distribution across that range.
There were three outliers:
- Swahili (57.51%) and Igbo (37.56%) just had a lot of junk queries.
- Vietnamese was even lower at 30.03%, with some junk queries but also an amazing number of repeated queries, many of which are quite complex (not like everyone is searching for just famous names or movie titles or something "simple"). A few queries I looked up on Google seem to exactly match titles or excerpts of web pages. I wonder if there is a browser tool or plugin somewhere that is automatically doing wiki searches based on page content.
Re-Sampling & Zero-Results Rate
I found a bug in my filtering process, which did not properly remove certain very long queries that get 0 results, which I classify as "junk". These accounted for less than 1% of any given sample, but it was still weird to have many samples ranging from 990–999 queries instead of the desired 1,000. Since I hadn't used my baseline samples for anything at that point, I decided to re-sample them. This also gave me an opportunity to compare zero-results rates (ZRR) between the old and new samples.
In the case of very small query corpora, the old and new samples may largely overlap, or even be identical. (For example, if there are only 800 queries to sample from, my sample "of 1000" is going to include all of them, every time I try to take a sample.) Since this ZRR comparison was not the point of the exercise, I'm just going to throw out what I found as I found it, and not worry about any sampling biases—though they obviously include overlapping samples, and potential effects of the original filtering error.
The actual old/new ZRR for these samples ranged from 6.3%/6.2% (Japanese) to 75.4%/76.1% (Igbo—wow!). The zero-results rate differences from the old to the new sample ranged from -4.2% (Gujarati, 64.3% vs 60.1%) to +5.6% (Dutch, 22.1% vs 27.7%), with a median of 0.0% and mean of -0.2%. Proportional rates ranged from -19.9% (Galician, 17.5% vs 14.6%) to +20.2% (Dutch, 22.1% vs 27.7%, again), with a median of 0.0%, and a mean of -0.5%.
Looking at the graph, there are some minor outliers, but nothing ridiculous, which is nice to see.
"Infrastructure"
I've built up some temporary "infrastructure" to support impact analysis of the harmonization changes. Since every or almost every wiki will need to be reindexed to enable harmonization changes, timing the "before and after" query analyses for the 90 sampled wikis would be difficult.
Instead, I've set up a daily process that runs all 90 samples each day. There's an added bonus of seeing the daily variation in results without any changes.
I will also pull relevant sub-samples for each of the features (apostrophes, acronyms, word_break_helper, etc.) being worked on and run them daily as well.
There's a rather small chance of having a reindexing finish while a sample is being run, so that half the sample is "before" and half is "after". If that happens, I can change my monitoring cadence to every other day for that sample for comparison's sake and it should be ok.
There are some pretty common apostrophe variations that we see all the time, particularly the straight vs curly apostrophes—e.g., ain't vs ain’t. And of course people (or their software) will sometimes curl the apostrophe the wrong way—e.g., ain‘t. But lots of other characters regularly (and some irregularly) get used as apostrophes, or apostrophes get used for them—e.g., Hawai'i or Hawai’i or Hawai‘i when the correct Hawaiian letter is the okina: Hawaiʻi.
A while back, we worked on a ticket (T311654) for the Nias Wikipedia to normalize some common apostrophe-like variants, and at the time I noted that we should generalize that across languages and wikis as much as possible. ICU normalization and ICU folding already do some of this (see the table below)—especially for the usual ‘curly’ apostrophes/single quotes, but those cases are common enough that we should take care of them even when the ICU plugin is not available. It'd also be nice if the treatment of these characters was more consistent across languages, and not dependent on the specific tokenizer and filters configured for a language.
There are many candidate "apostrophe-like" characters. The list below is a distillation of the list of Unicode Confusables for apostrophe, characters I had already known were potential candidates from various Phab tickets and my own analysis experience (especially working on Turkish apostrophes), and the results of data-mining for apostrophe-like contexts (e.g., Hawai_i).
x'x | Desc. | #q | #wiki samp | UTF | Example | std tok (is) | icu tok (my) | heb tok (he) | nori tok (ko) | smart cn (zh) | icu norm (de) | icu fold (de) | icu norm (wsp) | icu norm + fold (wsp) | icu fold (wsp) | Nias | apos-like | transitive | apos is x-like? | final fold |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a‵x | reversed prime | 0 | 0 | U+2035 | Ocean‵s | split | split | split | split | split/keep → , | → ' | → ' | – | + | – | + | ||||
bꞌx | Latin small letter saltillo | 0 | 0 | U+A78C | Miꞌkmaq | split/keep | → ' | → ' | → ' | + | + | |||||||||
c‛x | single high-reversed-9 quotation mark | 0 | 1 | U+201B | Het‛um | split | split | → ' | split | split/keep → , | → ' | → ' | + | + | ||||||
dߴx | N'ko high tone apostrophe | 1 | 0 | U+07F4 | памߴятки | split/keep | split/keep | split/keep | delete | delete | delete | – | – | – | ||||||
e῾x | Greek dasia | 1 | 2 | U+1FFE | Ch῾en | split | split | split | split | split/keep | → [ ̔] (sp + U+314) | → sp | delete | – | – | – |
fʽx | modifier letter reversed comma | 1 | 8 | U+02BD | Geʽez | split/keep | delete | delete | delete | + | + | |||||||||
g᾿x | Greek psili | 1 | 11 | U+1FBF | l᾿ancienne | split | split | split | split | split/keep | → [ ̓] (sp + U+313) | → sp | delete | – | – | – |
h᾽x | Greek koronis | 3 | 3 | U+1FBD | Ma᾽laf | split | split | split | split | split/keep | → [ ̓] (sp + U+313) | → sp | delete | – | – | – |
i՚x | Armenian apostrophe | 8 | 1 | U+055A | Nobatia՚s | split | split | split | split | split/keep | – | + | + | |||||||
j`x | fullwidth grave accent | 11 | 0 | U+FF40 | JOLLY`S | split | split | split | split | split/keep → , | → ` | → ` | delete | – | + | + | ||||
k՝x | Armenian comma | 12 | 4926 | U+055D | People՝s | split | split | split | split | split/keep | → [ ́] (sp + U+301) | → sp | – | – | – |
lʾx | modifier letter right half ring | 18 | 90 | U+02BE | Beʾer | split/keep | delete | delete | delete | ✓ | + | + | ||||||||
mˈx | modifier letter vertical line | 21 | 1041 | U+02C8 | Meˈyer | split/keep | delete | delete | delete | – | – | – | ||||||||
n'x | fullwidth apostrophe | 28 | 0 | U+FF07 | China's | → ' | split | split/keep → , | → ' | → ' | → ' | → ' | + | + | ||||||
oʹx | modifier letter prime | 63 | 16 | U+02B9 | Kuzʹmina | split/keep | delete | delete | delete | + | + | |||||||||
pʿx | modifier letter left half ring | 71 | 166 | U+02BF | Baʿath | split/keep | delete | delete | delete | ✓ | + | + | ||||||||
q′x | prime | 93 | 1133 | U+2032 | People′s | split | split | split | split | split/keep → , | → ' | → ' | + | + | ||||||
rˊx | modifier letter acute accent | 107 | 0 | U+02CA | kāˊvya | split/keep | delete | delete | delete | – | – | – | ||||||||
sˋx | modifier letter grave accent | 118 | 0 | U+02CB | Sirenˋs | split/keep | delete | delete | delete | + | + | |||||||||
t΄x | Greek tonos | 132 | 856 | U+0384 | Adelberg΄s | split | split | split | split | split/keep | delete | – | – | – | ||||||
uʼx | modifier letter apostrophe | 154 | 1665 | U+02BC | Baháʼí | split/keep | delete | delete | delete | ✓ | + | + | ||||||||
v׳x | Hebrew punctuation geresh | 389 | 54 | U+05F3 | Alzheimer׳s | split/keep | → ' | split | split/keep | → ' | → ' | → ' | – | + | + | |||||
wʻx | modifier letter turned comma | 824 | 14734 | U+02BB | Chʻeng | split/keep | delete | delete | delete | + | + | |||||||||
x´x | acute accent | 2769 | 229 | U+00B4 | Cetera´s | split | split | split | split | split/keep → , | → [ ́] (sp + U+301) | → sp | delete | + | + |
y`x | grave accent | 2901 | 862 | U+0060 | she`s | split | split | split | split | split/keep → , | delete | delete | ✓ | + | + | |||||
z‘x | left single quotation mark | 3571 | 4977 | U+2018 | Hawai‘i | → ' | split | split/keep → , | → ' | → ' | → ' | ✓ | + | + | ||||||
za’x | right single quotation mark | 35333 | 18472 | U+2019 | Angola’s | → ' | split | split/keep → , | → ' | → ' | → ' | ✓ | + | + | ||||||
zb'x | apostrophe | 114116 | 148698 | U+0027 | apostrophe's | split | split/keep → , | == | == | |||||||||||
zcיx | Hebrew letter yod | 142513 | 261471 | U+05D9 | Archיologiques | split/keep | split/keep | split/keep | – | – | – |
Key
- x'x—It's hard to visually distinguish all the vaguely apostrophe-like characters on-screen, so after ordering them, I put a letter (or two) before them and an x after them. The letter before makes it easier to see where each one is/was when looking at the analysis output, and the x after doesn't seem to be modified by any of the analyzers I'm working with. And x'x is an easy shorthand to refer to a character without having to specify its full name.
- Also, apostrophe-like characters sometimes get treated differently at the margins of a word. (Schrodinger's apostrophe: inside a word it's an apostrophe, at the margins, it's a single quote.) Putting it between two alpha characters gives it the most apostrophe-like context.
- Desc.—The Unicode description of the character
- #q—The number of occurrences of this character (in any usage) in my 90-language full query sample. Samples can be heavily skewed: Hebrew letter yod occurs a lot in Hebrew queries—shocker! Big wiki samples are larger, so English is over-represented. Primary default sort key.
- #wiki samp—The number of occurrences of this character in my 90-language 1K Wikipedia sample. Samples can be skewed by language (as with Hebrew yod above), but less so by sample size. All samples are 1K articles, but some wikis have longer average articles. Secondary default sort key.
- UTF—UTF codepoint for the character. Tertiary default sort key.
- Example—An actual example of the character being used in an apostrophe-like way. Most come from English Wikipedia article or query samples. Others I had to look harder to find—in other samples, or using on-wiki search.
- Just because a word or a few words exist with the character used in an apostrophe-like way doesn't mean it should be treated as an apostrophe. When looking for words matching the Hawai_i pattern, I found Hawai*i, Hawai,i, and Hawai«i, too. I don't think anyone would suggest that asterisks, commas, or guillemets should be treated as apostrophes.
- I never found a real example of Hebrew yod being used as an apostrophe. I only found two instances of it embedded in a Latin-script word (e.g. Archיologiques), and there it looked like an encoding error, since it has clearly replaced é. I fixed both of those (through my volunteer account).
- I really did find an example of apostrophe's using a real apostrophe!
- std tok (is)—What does the standard tokenizer (exemplified by the is/Icelandic analyzer) do to this character?
- icu tok (my)—What does the ICU tokenizer (exemplified by the my/Myanmar analyzer) do to this character?
- heb tok (he)—What does the HebMorph tokenizer (exemplified by the he/Hebrew analyzer) do to this character?
- nori tok (ko)—What does the Nori tokenizer (exemplified by the ko/Korean analyzer) do to this character?
- smart cn (zh)—What does the SmartCN tokenizer (exemplified by the zh/Chinese analyzer) do to this character?
- icu norm (de)—What does the ICU normalizer filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
- icu fold (de)—What does the ICU folding filter (exemplified by the de/German analyzer) do to this character (after going through the standard tokenizer)?
- icu norm (wsp)—What does the ICU normalizer filter do to this character, after going through a whitespace tokenizer? (The whitespace tokenizer just splits on spaces, tabs, newlines, etc. There's no language for this, so it was a custom config.)
- icu norm + fold (wsp)—What does the ICU normalizer filter + the ICU folding filter do to this character, after going through a whitespace tokenizer? (We never enable the ICU folding filter without enabling ICU normalization first—so this is a more "typical" config.)
- icu fold (wsp)—What does the ICU folding filter do to this character, after going through a whitespace tokenizer, without ICU normalization first?
- Tokenizer and Normalization Sub-Key
- split means the tokenizer splits on this character—at least in the context of being between Latin characters. Specifically, non-Latin (script-specific) characters between Latin characters generally get split off by the ICU tokenizer, because it always splits on script changes. (General punctuation doesn't belong to a specific script.) So, the standard tokenizer splits a‵x to a and x.
- split/keep means the tokenizer splits before and after the character, but keeps the character. So, the ICU tokenizer splits dߴx to d, ߴ, and x.
- → ? means the tokenizer or filter converts the character to another character. So, the HebMorph tokenizer tokenizes c‛x as c'x (with an apostrophe).
- The most common conversion is to an apostrophe. The SmartCN tokenizer converts most punctuation to a comma. The ICU normalizer converts some characters to space plus another character (I don't get the reasoning, so I wonder if this might be a bug); I've put those in square brackets, though the space doesn't really show up, and put a mini-description in parens, e.g. "(sp + U+301)". Fullwidth grave accent gets normalized to a regular grave accent by ICU normalization.
- split/keep → ,—which is common in the SmartCN tokenizer column—means that text is split before and after the character, the character is not deleted, but it is converted to a comma. So, the SmartCN tokenizer tokenizes a‵x as a + , + x.
- delete means the tokenizer or filter deletes the character. So, ICU folding converts dߴx to dx.
- Nias—For reference, these are the characters normalized specifically for nia/Nias in Phab ticket T311654.
- apos-like—After reviewing the query and Wikipedia samples, this character does seem to commonly be used in apostrophe-like ways. (In cases of the rarer characters, like bꞌx, I had to go looking on-wiki for examples.)
- + means it is, – means it isn't, == means this is the row for the actual apostrophe!
- transitive—This character is not regularly used in an apostrophe-like way, but it is normalized by a tokenizer or filter into a character that is regularly used in an apostrophe-like way.
- apos is x-like?—While the character is not used in apostrophe-like way (i.e., doesn't appear in Hawai_i, can_t, don_t, won_t, etc.), apostrophes are used where this character should be.
- + means it is, – means it isn't, blank means I didn't check (because it was already apostrophe-like or transitively apostrophe-like).
- final fold—Should this character get folded to an apostrophe by default? If it is apostrophe-like, transitively apostrophe-like, or apostrophes get used where it gets used—i.e., a + in any of the three previous columns—then the answer is yes (+).
Character-by-Character Notes
- a‵x (reversed prime): This character is very rarely used anywhere, but it is normalized to apostrophe by ICU folding.
- bꞌx (Latin small letter saltillo): This is used in some alphabets to represent a glottal stop, and apostrophes are often used to represent a glottal stop, so they are mixed up. In the English Wikipedia article for Mi'kmaq (apostrophe in the title), miꞌkmaq (with saltillo) is used 144 times, while mi'kmaq (with apostrophe) is used 78 times—on the same page!
- c‛x (single high-reversed-9 quotation mark): used as a reverse quote and an apostrophe.
- dߴx (N'ko high tone apostrophe): This seems to be an N'ko character almost always used for N'ko things. It's uncommon off the nqo/N'ko Wikipedia, and on the nqo/N'ko Wikipedia the characters do not seem to be interchangeable.
- e῾x (Greek dasia): A Greek character almost always used for Greek things.
- fʽx (modifier letter reversed comma): Commonly used in apostrophe-like ways.
- g᾿x (Greek psili): A Greek character almost always used for Greek things.
- h᾽x (Greek koronis): A Greek character almost always used for Greek things.
- i՚x (Armenian apostrophe): An Armenian character almost always used for Armenian things, esp. in Western Armenian—however, the non-Armenian apostrophe is often used for the Armenian apostrophe.
- j`x (fullwidth grave accent): This is actually pretty rare. It is mostly used in kaomoji, like (*´ω`*), and for quotes. But it often gets normalized to a regular grave accent, so it should be treated like one, i.e., folded to an apostrophe.
- It's weird that there's no fullwidth acute accent in Unicode.
- k՝x (Armenian comma): An Armenian character almost always used for Armenian things, and it generally appears at the edge of words (after the words), so it would usually be stripped as an apostrophe, too.
- lʾx (modifier letter right half ring): On the Nias list, and frequently used in apostrophe-like ways.
- mˈx (modifier letter vertical line): This is consistently used for IPA transcriptions, and apostrophes don't show up there very often.
- n'x (fullwidth apostrophe): Not very common, but does get normalized to a regular apostrophe by ICU normalization and ICU folding, so why fight it?
- oʹx (modifier letter prime): Consistently used on-wiki as palatalization in Slavic names, but apostrophes are used for that, too.
- pʿx (modifier letter left half ring): On the Nias list, and frequently used in apostrophe-like ways.
- q′x (prime): Consistently used for coordinates, but so are apostrophes.
- rˊx (modifier letter acute accent): Used for bopomofo to mark tone; only occurs in queries from Chinese Wikipedia.
- sˋx (modifier letter grave accent): Used as an apostrophe in German and Chinese queries.
- t΄x (Greek tonos): A Greek character almost always used for Greek things.
- uʼx (modifier letter apostrophe): Not surprising that an apostrophe variant is used as an apostrophe.
- v׳x (Hebrew punctuation geresh): A Hebrew character almost always used for Hebrew things... however, it is converted to apostrophe by both the Hebrew tokenizer and ICU folding.
- wʻx (modifier letter turned comma): Often used as an apostrophe.
- x´x (acute accent): Often used as an apostrophe.
- y`x (grave accent): Often used as an apostrophe.
- z‘x (left single quotation mark): Often used as an apostrophe.
- za’x (right single quotation mark): The curly apostrophe, so of course it's used as an apostrophe.
- zb'x (apostrophe): The original!
- zcיx (Hebrew letter yod): A Hebrew character almost always used for Hebrew things. It has the most examples because it is an actual Hebrew letter. It showed up on the confusables list, but is never used as an apostrophe. The only examples are encoding issues: Palיorient, Archיologiques → Paléorient, Archéologiques
Apostrophe-Like Characters, The Official List™
The final set of 19 apostrophe-like characters to be normalized is [`´ʹʻʼʽʾʿˋ՚׳‘’‛′‵ꞌ'`]—i.e.:
- ` (U+0060): grave accent
- ´ (U+00B4): acute accent
- ʹ (U+02B9): modifier letter prime
- ʻ (U+02BB): modifier letter turned comma
- ʼ (U+02BC): modifier letter apostrophe
- ʽ (U+02BD): modifier letter reversed comma
- ʾ (U+02BE): modifier letter right half ring
- ʿ (U+02BF): modifier letter left half ring
- ˋ (U+02CB): modifier letter grave accent
- ՚ (U+055A): Armenian apostrophe
- ׳ (U+05F3): Hebrew punctuation geresh
- ‘ (U+2018): left single quotation mark
- ’ (U+2019): right single quotation mark
- ‛ (U+201B): single high-reversed-9 quotation mark
- ′ (U+2032): prime
- ‵ (U+2035): reversed prime
- ꞌ (U+A78C): Latin small letter saltillo
- ' (U+FF07): fullwidth apostrophe
- ` (U+FF40): fullwidth grave accent
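In the analysis config this is a character-level mapping; here is the equivalent transformation as a rough Python sketch (the function name is just for illustration, and the character list is the 19 characters above):

```python
# Map each of the 19 apostrophe-like characters above to a plain apostrophe
# (U+0027) before tokenization.
APOSTROPHE_LIKE = (
    "\u0060\u00B4\u02B9\u02BB\u02BC\u02BD\u02BE\u02BF\u02CB"
    "\u055A\u05F3\u2018\u2019\u201B\u2032\u2035\uA78C\uFF07\uFF40"
)
APOSTROPHE_MAP = str.maketrans({c: "'" for c in APOSTROPHE_LIKE})

def normalize_apostrophes(text: str) -> str:
    return text.translate(APOSTROPHE_MAP)

print(normalize_apostrophes("Hawaiʻi / Angola’s / o`sha"))
# Hawai'i / Angola's / o'sha
```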
Other Observations
- Since ICU normalization converts some of the apostrophe-like characters above to ́ (U+301, combining acute accent), ̓ (U+313, combining comma above), and ̔ (U+314, combining reversed comma above), I briefly investigated those, too. They are all used as combining accent characters and not as separate apostrophe-like characters. The combining commas above are both used in Greek, which makes sense, since they are on the list because Greek accents are normalized to them.
- In French examples, I sometimes see 4 where I'd expect an apostrophe, especially in all-caps. Sure enough, looking at the AZERTY keyboard you can see that 4 and the apostrophe share a key!
- The hebrew_lemmatizer in the Hebrew analyzer often generates multiple output tokens for a given input token—this is old news. However, looking at some detailed examples, I noticed that sometimes the multiple tokens (or some subset of the multiple tokens) are the same! Indexing two copies of a token on top of each other doesn't seem helpful—and it might skew token counts for relevance.
apostrophe_norm
The filter for Nias that normalized some of the relevant characters was called apostrophe_norm. Since the new filter is a generalization of that, it is also called apostrophe_norm. There's no conflict with the new generic apostrophe_norm, as there's no longer a need for a Nias-specific filter, or any Nias-specific config at all.
I tested the new apostrophe_norm filter on a combination of ~1K general queries and 1K Wikipedia articles per language (across the 90 harmonization languages). The corpus for each language was run through the analysis config for that particular language. (Languages that already have ICU folding, for example, already fold typical ‘curly’ quotes, so there'd be no change for them, but for other languages there would be.)
I'm not going to give detailed notes on all 90 languages, just note general trends and highlight some interesting examples.
- In general, there are lots of names and English, French, & Italian words with apostrophes everywhere (O´Reilly, R`n`R, d‘Europe, dell’arte).
- There are also plenty of native apostrophe-like characters in some languages; the typical right curly apostrophe (’) is by far the most common. (e.g., ইক'নমিক vs ইক’নমিক, з'ездам vs з’ездам, Bro-C'hall vs Bro-C’hall)
- Plenty of coordinates with primes (e.g., 09′15) across many languages—though coordinates with apostrophes are all over, too.
- Half-rings (ʿʾ) are most common in Islamic names.
- Encoding errors (e.g., Р’ Р±РѕР№ РёРґСѓС‚ РѕРґРЅРё «старики» instead of В бой идут одни «старики») sometimes have apostrophe-like characters in them. Converting them to apostrophes doesn't help.. it's just kinda funny.
- Uzbek searchers really like to mix it up with their apostrophe-like options. The apostrophe form o'sha will now match o`sha, oʻsha, o‘sha, o’sha, o`sha, oʻsha, o‘sha, and o’sha—all of which exist in my samples!
I don't always love how the apostrophes are treated (e.g., aggressive_splitting in English is too aggressive), but for now it's good that all versions of a word with different apostrophe-like characters in it are at least treated the same.
There may be a few instances where the changes decrease the number of results a query gets, but it is usually an increase in precision. For example, l´autre would no longer match autre because the tokenizer isn't splitting on ´. However, it will match l'autre. Having to choose between them isn't great—I'm really leaning toward enabling French elision processing everywhere—but in a book or movie title, an exact match is definitely better. (And having to randomly match l to make the autre match is also arguably worse.)
aggressive_splitting is enabled on English- and Italian-language wikis, so—assuming it does a good job and in the name of harmonization—we should look at enabling it everywhere. In particular, it splits up CamelCase words, which is generally seen as a positive thing, and was the original issue in the Phab ticket.
word_delimiter(_graph)
The aggressive_splitting filter is a word_delimiter token filter, but the word_delimiter docs say it should be deprecated in favor of word_delimiter_graph. I made the change and ran a test on 1K English Wikipedia articles, and there were no changes in the analysis output, so I switched to the _graph version before making any further changes.
Also, aggressive_splitting, as a word_delimiter(_graph) token filter, needs to be the first token filter if possible. (We already knew it needed to come before homoglyph_norm.) If any other filter makes any changes, aggressive_splitting can lose the ability to track offsets into the original text. Being able to track those changes gives better (sub-word) highlighting, and probably better ranking and phrase matching.
So Many Options, and Some Zombie Code
The word_delimiter(_graph) filter has a lot of options! Options enabling catenate_words, catenate_numbers, and catenate_all are commented out in our code, saying they are potentially useful, but they cause indexing errors. The word_delimiter docs say they cause problems for match_phrase queries. The word_delimiter_graph docs seem to say you can fix the indexing problem with the flatten_graph filter, but still warn against using them with match_phrase queries, so I think we're just gonna ignore them (and remove the commented-out lines from our code).
Apostrophes, English Possessives, and Italian Elision
In the English analysis chain, the possessive_english stemmer currently comes after aggressive_splitting, so it does nothing, since aggressive_splitting splits on apostrophes. However, word_delimiter(_graph) has a stem_english_possessive setting, which is sensibly off by default, but we can turn that on, just for English, which results in a nearly 90% reduction in s tokens.

After too much time looking at apostrophes (for Turkish and in general), always splitting on apostrophes seems like a bad idea to me. We can disable it in aggressive_splitting by recategorizing apostrophes as "letters", which is nice, but that also disables removing English possessive –'s ... so we can put possessive_english back... what a rollercoaster!
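For reference, here is roughly what those two alternatives look like as word_delimiter_graph settings, sketched as a Python dict (the option names are from the Elasticsearch docs; the real CirrusSearch aggressive_splitting configuration may differ and has more settings than shown):

```python
# Sketch of the two alternatives discussed above for English, expressed as
# Elasticsearch word_delimiter_graph settings (not the full config).
aggressive_splitting_en = {
    "type": "word_delimiter_graph",
    # Alternative 1: keep splitting on apostrophes, and let the filter strip
    # English possessive -'s itself.
    "stem_english_possessive": True,
    # Alternative 2: recategorize the apostrophe as a letter so the filter
    # never splits on it (then possessive_english has to come back to strip -'s).
    # "type_table": ["' => ALPHA"],
}
```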
In the Italian analysis chain, italian_elision comes before aggressive_splitting, and has been set to be case sensitive. That's kind of weird, but I've never dug into it before—though I did just blindly reimplement it as-is when I refactored the Italian analysis chain. All of our other elision filters are case insensitive and the Elastic monolithic analyzer reimplementation/unpacking specifies case insensitivity for Italian, too. I think it was an error a long time ago because the default value is case sensitive, and I'm guessing someone just didn't specify it explicitly, and unintentionally got the case sensitive version.

Anyway, aggressive_splitting splits up all of the leftover capitalized elision candidates, which makes the content part more searchable, but with a lot of extra bits. The italian_stop filter removes some of them, but not all. Making italian_elision case insensitive seems like the right thing to do, and as I mentioned above, splitting on apostrophes seems bad in general.
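The fix itself is a one-line settings change; here is a sketch of the elision filter config as a Python dict (articles_case is the real Elasticsearch option name; the article list below is abbreviated and only illustrative):

```python
# Sketch of the italian_elision fix: the elision token filter is case
# sensitive unless articles_case is set, so capitalized forms like D', L',
# and DELL' were being left alone before.
italian_elision = {
    "type": "elision",
    "articles_case": True,  # the actual fix
    # abbreviated; the real config lists the full set of Italian articles
    "articles": ["l", "d", "dell", "all", "nell", "sull", "un"],
}
```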
Apostrophe Hell, Part XLVII
Now may not be the time to add extra complexity, but I can't help but note that d'– and l'– are overwhelmingly French or Italian, and dell'— is overwhelmingly Italian. Similarly, –'s, –'ve, –'re, and –'ll are overwhelmingly English. Some of the others addressed in Turkish are also predominantly in one language (j'n'–, j't'–, j'–, all'–, nell'–, qu'–, un'–, sull'–, dall'–)... though J's and Nell's exist, just to keep things complicated.
All that said, a simple global French/Italian elision filter for d'– and l'– and English possessive filter for –'s would probably improve recall almost everywhere.
CamelCase (via aggressive_splitting)
Splitting CamelCase seems like a good idea in general (it was the original issue in what became the aggressive_splitting Phab ticket). In the samples I have, actual splits seem to be largely Latin script, with plenty of Cyrillic, and some Armenian, too.
Splitting CamelCase isn't great for Irish, because inflected capitalized words like bhFáinní get split into bh + Fáinní. Normally the stemmer would remove the bh, so the end result isn't terrible, but all those bh–'s are like having all the English possessive –'s in the index. However, we already have some hyphenation cleanup to remove stray h, n, and t, so adding bh (and b, g, and m, which are similar CamelCased inflection bits) to that mini stop word list works, and the plain index can still pick up instances like B.B. King.
Irish also probably has more McNames than other wikis, but they are everywhere. Proximity and the plain index will boost those reasonably well.
Splitting CamelCase often splits non-homoglyph multi-script tokens, like OpenМировой—some of which may be parsing errors in my data, but any of which could be real or even typos on-wiki. Anyway, splitting them seems generally good, and prevents spurious homoglyph corrections.
Splitting CamelCase is not great for iPad, LaTeX, chemical formulas, hex values, saRcAStiC sPonGEboB, and random strings of ASCII characters (as in URLs, sometimes), but proximity and the plain index take care of them, and we take a minor precision hit (mitigated by ranking) for a bigger, better recall increase.
Splitting CamelCase is good.
Other Things That Are Aggressively Split
aggressive_splitting definitely lives up to its name. Running it on non-English, non-Italian samples showed just how aggressive it is.
The Good
- Splits web domains on periods, so en.wikipedia.org → en + wikipedia + org
- Splits on colons
The Bad (or at least not Good)
- Splitting between letters and numbers is okay sometimes, but often bad, e.g. j2se → j + 2 + se
- Splitting on periods in IPA is not terrible, since people probably don't search it much; ˈsɪl.ə.bəl vs ˈsɪləbəl already don't match anyway.
- Splitting on periods and commas in numbers is.. unclear. Splitting on the decimal divider isn't terrible, but breaking up longer numbers into ones, thousands, millions, etc. sections is not good.
- On the other hand, having some systems use periods for decimals and commas for dividing larger numbers (3,141,592.653) and some doing it the other way around (3.141.592,653), and the Indian system (31,41,592.653)—plus the fact that the ones, thousands, millions, etc. sections are sometimes also called periods—makes it all an unrecoverable mess anyway.
The Ugly
- Splitting acronyms, so N.A.S.A. → N + A + S + A —Nooooooooooo!
- (Spoiler: there's a fix coming!)
- Splitting on soft hyphens is terrible—an invisible character with no semantic meaning can un pre dictably and ar bi trar i ly break up a word? Un ac cept able!
- Splitting on other invisibles, like various joiners and non-joiners and bidi marks, seems pretty terrible in other languages, especially in Indic scripts.
Summary So Far (Not a Conclusion)
aggressive_splitting splits on all the things word_break_helper splits on, so early on I was thinking I could get rid of word_break_helper (and repurpose the ticket for just dealing with acronyms), but aggressive_splitting splits too many things, including invisibles, which ICU normalization handles much more nicely.

I could configure away all of aggressive_splitting's bad behavior, but given the overlap between aggressive_splitting, word_break_helper, and the tokenizers, it looks to be easiest to reimplement the CamelCase splitting, which is the only good thing aggressive_splitting does that word_break_helper doesn't do or can't do.
So, the plan is...
- Disable aggressive_splitting for English and Italian (but leave it for short_text and short_text_search, used by the ShortTextIndexField, because I'm not aware of all the details of what's going on over there).
- Create and enable a CamelCase filter to pick up the one good thing that aggressive_splitting does that word_break_helper can't do.
- Enable word_break_helper and the CamelCase filter everywhere.
  - Create an acronym filter to undo the bad things word_break_helper—and aggressive_splitting!—do to acronyms.
- Fix italian_elision to be case insensitive.
At this point, disabling aggressive_splitting and enabling a new CamelCase filter on English and Italian are linked to prevent a regression, but the CamelCase filter doesn't depend on word_break_helper or the acronym filter.
Enabling word_break_helper and the new acronym filter should be linked, though, to prevent word_break_helper from doing bad things to acronyms. (Example Bad Things: searching for N.A.S.A. on English Wikipedia does bring up NASA as the first result, but the next few are N/A, S/n, N.W.A, Emerald Point N.A.S., A.N.T. Farm, and M.A.N.T.I.S. Searching for M.A.N.T.I.S. brings up Operation: H.O.T.S.T.U.F.F./Operation: M.I.S.S.I.O.N., B.A.T.M.A.N., and lots of articles with "A B C D E F G H I J K L M N O P Q R S T U V W X Y Z" navigation in them, among others.)
I had linked word_break_helper and aggressive_splitting in my head, because they both split up acronyms, but since the plan is to not enable aggressive_splitting in any text filters, we don't need the acronym fix to accompany it.
But Wait, There's More: CamelCase Encore
So, I created a simple CamelCase pattern_replace filter, split_camelCase. After my experience with Thai, I was worried about regex lookaheads breaking offset tracking. (At first I thought, nope, they're evil, but now I wonder if in the Thai case it's because I merged three pattern_replace filters into one for efficiency.)
However, the Elastic docs provide a simple but very general CamelCase char filter:
"pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
"replacement": " "
My original formulation was pretty similar, except I used \p{Ll} and \p{Lu}, and no lookahead, instead capturing the uppercase letter. But I tested their method, and it works fine in terms of offset mapping. (Apparently, I was wildly mistaken, and lookaheads aren't as evil as I feared.)
However, there are rare cases[†] where CamelCase chunks end in combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). Fortunately \p{M} and \p{Cf} cover pretty much the right things. I tried adding [\\p{M}\\p{Cf}]* to the lookbehind, but it was really, really sloooooooow. However, allowing 0–9 combining marks or invisibles seems like overkill when you spell it out like that, and there was no noticeable speed difference using {0,9} instead of * on my machine. Adding the possessive quantifier (overloaded +—why do they do that?) to the range should only make it faster. My final pattern, with lookbehind, a capture in place of the lookahead, and optional possessive combining marks and invisibles:
'pattern' => '(?<=\\p{Ll}[\\p{M}\\p{Cf}]{0,9}+)(\\p{Lu})',
'replacement' => ' $1'
(Overly observant readers will note a formatting difference. The Elastic example is a JSON snippet, mine is a PHP snippet. I left it because it amuses me, and everything should be clear from context.)
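As a sanity check, here is the final pattern applied outside of Elasticsearch, using the third-party Python regex module (pip install regex) as a stand-in for the Java regex engine Elasticsearch uses; it supports the same \p{...} classes and possessive quantifiers, though Python's replacement syntax is \1 rather than $1:

```python
import regex

# The final camelCase-splitting pattern from above: an uppercase letter
# preceded by a lowercase letter plus up to nine combining marks or invisibles.
CAMEL = regex.compile(r"(?<=\p{Ll}[\p{M}\p{Cf}]{0,9}+)(\p{Lu})")

for text in ["PalmerPenguins", "iPad", "saRcAStiC", "OpenМировой", "NASA"]:
    print(text, "->", CAMEL.sub(r" \1", text))
# PalmerPenguins -> Palmer Penguins
# iPad -> i Pad
# saRcAStiC -> sa Rc ASti C
# OpenМировой -> Open Мировой
# NASA -> NASA   (no lowercase-to-uppercase transition, so no split)
```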
[†] As noted above in the Not A Conclusion, I had originally linked the CamelCase filter with word_break_helper and the acronym filter. The combining diacritics and common invisibles are much more relevant to acronym processing—which I've already worked on as I'm writing this, and which made me go back and look for CamelCase cases—of which there are a few.
In Conclusion—No, Really, I Mean It!
So, the plan for this chunk of harmonization is:
- Disable aggressive_splitting for English and Italian.
- Create and enable split_camelCase.
- Fix italian_elision to be case insensitive.
And we can worry about enabling word_break_helper and handling acronyms in the next chunk of harmonization.
Appendix: CamelCase Observations
(Technically this doesn't violate the terms of the conclusion being conclusive, since it's just some extra observations about the data, for funsies.)
The kinds of things that show up in data focused on examples of CamelCase—some we can help with (✓), some we cannot (✗):
- ✓ Cut and PasteCut and Paste / Double_NamebotDouble_Namebot
- ✗ mY cAPSLCOCK kEY iS bROKEN
- ✓ mySpaceBarIsBrokenButMyShiftKeyIsFine
- ✓ lArt dElision sans lApstrophe
- ✗ кодиÑование is hard.. oops, I mean, кодирование is hard
- ✗ "Wiki Loves Chemistry"?: 2CH2COOH + NaCO2 → 2CH2COONa + H2O + CO2
- ✗ WéIRD UPPèRCäSE FUñCTIôNS / wÉird lowÈrcÄse fuÑctiÔns
- ✓ Названиебот,ПробелПоедание (Namebot,SpaceEating)
- ✓ Lots of English examples in languages without an upper/lowercase distinction
I think the CamelCase fix is going to be very helpful for people who double-paste something (if it starts with uppercase and ends with lowercase, like Mr RogersMr Rogers). On the one hand, it's probably a rare mistake for any given person, but on the other, it still happens many times per day.
We—especially David and I—have been talking about "fixing acronyms" for years. On all wikis, NASA and N.A.S.A. do not match. And while they are not technically acronyms, the same problem arises for initials in names, such as J.R.R. Tolkien and JRR Tolkien; those ought to match! (I'd like to get J. R. R. Tolkien (with spaces) in on the game, too, but that's a different and more difficult issue.)
Long before either David or I were on the search team, English and Italian were configured to use word_break_helper in the text field. Generally this is a good thing, because it breaks up things like en.wikipedia.org and word_break_helper into searchable pieces. However, it also breaks up acronyms like N.A.S.A. into single letters. This is especially egregious for NASA on English-language wikis, where a is a stop word (and thus not strictly required)—lots of one-letter words are stop words in various languages, so it's not just an English problem.
Anyway... there are three goals for this task:
- Merge acronyms into words (so NASA and N.A.S.A. match).
- Apply word_break_helper everywhere (once acronyms are mostly safe).
- Extend word_break_helper to any other necessary characters, particularly colon (:).
Merging Acronyms
I originally thought I would have to create a new plugin with a new filter to handle acronyms. Certainly the basic pattern of letter-period-letter-period... would be easy to match. However, I realized we could probably get away with a regular expression in a character filter, which would also avoid some potential tokenization problems that might prevent some acronyms from being single tokens.
We can't just delete periods between letters, since that would convert en.wikipedia.org to enwikipediaorg. Rather, we want to delete a period only when it is between two single letters. Probably. Certainly, that does the right thing for N.A.S.A. (converts to NASA.) and en.wikipedia.org (nothing happens).
However... and there is always a however... as noted above in the camelCase discussion, sometimes our acronyms can have combining diacritics or common invisibles (joiners, non-joiners, zero-width spaces, soft hyphens, and bidi marks being the most common). A simple example would be something like T.É.T.S or þ.á.m. or İ.T.Ü.—except that in those cases, Latin characters with diacritics are normalized into single code points.
Indic languages written with abugidas are a good example where more complex units than single letters can be used in acronyms or initials. We'll come back to that in more detail later.
So, what we need are single (letter-based) graphemes separated by periods. Well, and the fullwidth period (.), obviously... and maybe... sigh.
I checked the Unicode confusables list for period and got a lot of candidates, including Arabic-Indic digit zero (٠), extended Arabic-Indic digit zero (۰), Syriac supralinear full stop (܁), musical symbol combining augmentation dot (𝅭), Syriac sublinear full stop (܂), one-dot leader (․), Kharoshthi punctuation dot (𐩐), Lisu letter tone mya ti (ꓸ), and middle dot (·). Vai full stop (꘎) was also on the list, but that does not look like something someone would accidentally use as a period. Oddly, fullwidth period is not on the confusables list.
Given an infinite number of monkeys typing on an infinite number of typewriters (or a large enough number of WikiGnomes cutting-and-pasting), you will find examples of anything and everything, but the only characters I found regularly being used as periods in acronym-like contexts were actually fullwidth periods across languages, and one-dot leaders in Armenian. (Middle dot also gets used more than the others, but not a whole lot, and in both period-like and comma-like ways, so I didn't feel comfortable using it as an acronym separator.)
So, we want single graphemes—consisting of a letter, zero or more combining characters or invisibles—separated by periods or fullwidth periods (or one-dot leaders in the case of Armenian). A "single grapheme" is one that is not immediately preceded or followed by another letter-based grapheme (which may also be several Unicode code points). We also have to take into account the fact that an acronym could be the first or last token in a string being processed, and we have to explicitly account for "not immediately preceded or followed by" to include the case when there is nothing there at all—at the beginning or end of the string.
For Armenian, it turns out that one-dot leader is used pretty much anywhere periods are, though only about 10% as often, so I added a filter to convert one-dot leaders to periods for Armenian.
My original somewhat ridiculous regex started off with a look-behind for 1.) a start of string (^) or non-letter (\P{L}), followed by 2.) a letter-based grapheme—a letter (\p{L}), followed by optional combining marks (\p{M}) or invisibles (\p{Cf})—then 3.) the period or fullwidth period ([..]), followed by 4.) optional invisibles, then a capture group with 5.) another letter-based grapheme; and a look-ahead for 6.) a non-letter or end of string ($).
Some notes:
- In all its hideous, color-coded glory:
(?<=(?:^|\P{L})\p{L}[\p{M}\p{Cf}]{0,9}+)[..]\p{Cf}*+(\p{L}[\p{M}\p{Cf}]*+)(?=\P{L}|$)
- (1) and (2) in the look-behind aren't part of the matching string, (3) is the period we are trying to drop, (4) is invisible characters we drop anyway, (5) is the following letter, which we want to hold on to, and (6) is in the look-ahead, and not part of the matching string. In the middle of a simple acronym, (1) is the previous period and (2) is the previous letter, and (6) is the next period.
- For reasons of efficiency, possessive matching is used for the combining marks and invisibles, and combining marks and invisibles are limited to no more than 9 in the look-behind. (I have seen 14 Khmer diacritics stacked on top of each other, but that kind of thing is pretty rare.)
- The very simple look-ahead does not mess up the token's character offsets—phew!
- And finally—this doesn't work for certain cases that are not unheard of in Brahmic scripts!—though they are hard to find in Latin texts.
- Ugh.
First, an example using Latin characters. We want e.f.g. to be treated as an acronym and converted to efg. We don't want ef.g to be affected. As mentioned above, we want to handle diacritics, such as é.f.g. and éf.g, which are not actually a problem because é is a single code point. However, something like e̪ is not. It can only be represented as e + ̪. Within an acronym, we've got that covered, and d.e̪.f. is converted to de̪f. just fine. But ̪ is technically "not a letter" so the period in e̪f.g would get deleted, because f is preceded by "not a letter" and thus appears to be a single letter/single grapheme.
In some languages using Brahmic scripts (including Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Odia, Punjabi, Sinhala, Tamil, Telugu, and Thai), letters followed by separate combining diacritics are really common, because it's the most typical way of doing things. Basic consonant letters include an inherent vowel—Devanagari/Hindi स is "sa", for example. To change the vowel, add a diacritic: सा (saa) सि (si) सी (sii) सु (su) सू (suu) से (se) सै (sai) सो (so) सौ (sau).
Acronyms with periods in these languages aren't super common, but when they occur, they tend to / seem to / can use the whole grapheme (e.g., से, not स for a word starting with से). The problem is that the vowel sign (e.g., े) is "not a letter", just like ̪. So—randomly stringing letters together—सेफ.म would have its period removed, because फ is preceded by "not a letter".
The regex to fix this scenario is a little complicated—we need "not a letter", possibly followed by combining chars (rare, but does happen, as in 9̅) or invisibles (also rare, but they are sneaky and can show up anywhere since you can cut-n-paste them without knowing it). The regex that works—instead of (1) above—is something that is not a letter, not a combining mark, and not an invisible ([^\p{L}\p{M}\p{Cf}])—optionally followed by combining marks or invisibles. That allows us to recognize e̪ or से as a grapheme before another letter.
Some notes:
- Updated, in all its hideous, color-coded glory:
(?<=(?:^|[^\p{L}\p{M}\p{Cf}])[\p{M}\p{Cf}]{0,9}+\p{L}[\p{M}\p{Cf}]{0,9}+)[..]\p{Cf}*+(\p{L}[\p{M}\p{Cf}]*+)(?=\P{L}|$)
- The more complicated regex (and all in the look-behind!) didn't noticeably change the indexing time on my laptop.
- While Latin cases like e̪f.g are possible, the only language samples affected in my test sets were the ones listed above: Assamese, Gujarati, Hindi, Kannada, Khmer, Malayalam, Marathi, Nepali, Odia, Punjabi, Sinhala, Tamil, Telugu, and Thai. The changes in token counts ranged from 0 to 0.06%, with most below 0.03%—so this is not a huge problem.
- Hindi, the language with the 0% change in token counts, still had changes. You can change the tokens themselves without changing the number of tokens—they just get split in different places (see e.e.cummings, et al., below)—though not splitting is the more common scenario.
- In the Kannada sample—the one with the most changes from the regex upgrade—there were some clear examples where the new regex still doesn't work in every case.
- Ugh.
- However, these cases seem to be another order of magnitude less common, so I'm going to let them slide for now.
- Ugh.
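To make the behavior concrete, here is the updated regex run outside of Elasticsearch, again using the third-party Python regex module (pip install regex) as a stand-in for the Java regex engine; the expected outputs match the cases discussed in this section:

```python
import regex

# The updated acronym regex: drop a (fullwidth) period only when it sits
# between two single letter-based graphemes.
ACRONYM = regex.compile(
    r"(?<=(?:^|[^\p{L}\p{M}\p{Cf}])[\p{M}\p{Cf}]{0,9}+\p{L}[\p{M}\p{Cf}]{0,9}+)"
    r"[.．]\p{Cf}*+(\p{L}[\p{M}\p{Cf}]*+)(?=\P{L}|$)"
)

for text in ["N.A.S.A.", "en.wikipedia.org", "J.R.R.Tolkien", "सेफ.म"]:
    print(text, "->", ACRONYM.sub(r"\1", text))
# N.A.S.A. -> NASA.                      (trailing period left for the tokenizer)
# en.wikipedia.org -> en.wikipedia.org   (periods not between single letters)
# J.R.R.Tolkien -> JRR.Tolkien           (word_break_helper handles the rest)
# सेफ.म -> सेफ.म                          (से is a whole grapheme, not a bare letter)
```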
However-however (However2?), I am going to document the scenario that still slips through the cracks in the regex, in case it is a bigger deal than it currently seems. (Comments to that effect from speakers of the languages are welcome!)
As mentioned before, Brahmic scripts have an inherent vowel, so स is "sa". The inherent vowel can be suppressed entirely with a virama. So स् is just "s"—and can be used to make consonant clusters.
- स (sa) + त (ta) + र (ra) + ी (ii) = सतरी ("satarii", though the second "a" may get dropped in normal speech.. I'm not sure), and it may or may not be a real word.
- स (sa) + ् (virama) + त (ta) + ् (virama) + र (ra) + ी (ii) = स्त्री (strii/stree), which means "woman".
So, we have a single grapheme that is letter + virama + letter + virama + letter + combining vowel mark. So, basically, we could allow an extra (letter + virama)+ or maybe (letter + virama){0,2} in several places in our regex—though there are almost 30 distinct virama characters across scripts, and optional characters in the look-behind are complicated.
Plus—just to make life even more interesting!—in Khmer the virama-like role is played by the coeng, and conceptually it seems like it comes before the letter it modifies rather than after... though I guess in a sense both the virama and coeng come between the letters they interact with. (I do recall that for highlighting purposes, you want the virama with the letter before, and the coeng with the letter after. So I guess typographically they break differently.)
Anyway, adjusting the regex for these further cases probably isn't worth it at the moment—though, again, if there are many problem cases, we can look into it. (It might take a proper plugin with a new filter instead of a pattern-replace regex filter... though the interaction of such a filter with word_break_helper would be challenging.)
More notes:
- Names with connected initials like J.R.R.Tolkien and e.e.cummings are converted to JRR.Tolkien and ee.cummings—which isn't great—until word_break_helper comes along and breaks them up properly!
- I don't love that acronyms go through stemming and stop word filtering, but that's what happens to non-acronym versions (now both SARS and S.A.R.S. will be indexed as sar in English, for example)—they do match each other, though, which is the point.
- If you have an acronymic stop word, like F.O.R., it will get filtered as a stop word. The plain field has to pick up the slack, where it gets broken into individual letters. There's no great solution here.
word_break_helper, at Long Last
Now that most acronyms won't be exploded into individual letters/graphemes, we can get word_break_helper up and running.
The current word_break_helper converts underscores, periods, and parentheses to spaces. My planned upgrade was to add colon, and fullwidth versions of underscore, period, and colon. What could be simpler? (Famous last words!)
Chinese and Korean say, "Not So Fast!"
editI ran some tests, not expecting anything.. unexpected.. to happen. To my surprise, there were some non-obvious changes in my Korean and Chinese samples. Upon further investigation, I discovered that both the Nori (Korean) and SmartCN (Chinese) tokenizers/segmenters take punctuation into account when parsing words—but often not spaces!
The simplest example is that "仁(义)" would be tokenized in Chinese as two different words, 仁 and 义, while "仁义" is tokenized as one. So far, so good. However, "仁 义" (with a space—or with five spaces) will also be tokenized as one word: "仁义".
Another Chinese example—"陈鸿文 (中信兄弟)":
- With parens, 陈 / 鸿 / 文 / 中信 / 兄弟
- Without parens, 陈 / 鸿 / 文 / 中 / 信 / 兄弟
Korean examples are similar. "(970 마이크로초각) 오차범위":
- With parens, 970 / 마이크 / 초각 / 오차 / 범위
- Without parens, 970 / 마이크로초 / 오차 / 범위
Other Korean examples may have less impact on search, because some Korean phrases get indexed once as a full phrase and once as individual words (in English, this would be like indexing football as both football and foot / ball)—"9. 보건복지부차관":
- With period, 9 / 보건복지부 / 보건 / 복지 / 부 / 차관
- Without period, 9 / 보건 / 복지 / 부 / 차관
Somehow, the lack of period blocks the interpretation of "보건복지부" as a phrase. My best guess for both Chinese and Korean is that punctuation resets some sort of internal sentence or phrase boundary.
One more Korean example shows a bigger difference—"국가관할권 (결정원칙)":
- With parens, 국가관할권 / 국가 / 관할권 / 결정 / 원칙
- Without parens, 국가 / 관할 / 권 / 결정 / 원칙
This one is extra interesting, because the paren after "국가관할권" affects whether or not it is treated as a phrase, but also whether it is broken into two tokens or three.
I found a workaround that works with Nori and SmartCN tokenizers, as well as the standard tokenizer and the ICU tokenizer: replacing punctuation with the same punctuation, but with spaces around it. So wikipedia.org would become wikipedia . org, causing a token split, while "仁(义)" would become "仁 ( 义 ) ", which still blocks the token merger.
It works, but I really don't like it, because it is a lot of string manipulation to add spaces around, for example, every period in English Wikipedia for no real reason (replacing a single character with a different character happens in place and is much less computationally expensive).
I already knew that the SmartCN tokenizer converts almost all punctuation into tokens with a comma as their text. (We filter those.)
I specifically tested the four tokenizers (SmartCN, Nori, ICU, and standard) on parens, period, comma, underscore, colon, fullwidth parens, fullwidth period (.), ideographic period (。), fullwidth comma, fullwidth underscore, and fullwidth colon.
SmartCN and Nori split on all of them. The standard tokenizer and ICU tokenizer do not split on period, underscore, or colon, or their fullwidth counterparts. (They do strip regular and fullwidth periods and colons at word edges, so x. and .x are tokenized as x by itself, while _x and x_ are tokenized with their underscores. x.x and x_x are tokenized with their punctuation characters.)
The easiest way to solve all of these problems was to make sure word_break_helper includes regular and fullwidth variants of period, underscore, and colon, and prevent word_break_helper from being applied to Chinese or Korean.
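For concreteness, here's a rough sketch of what that upgraded word_break_helper could look like as an Elasticsearch mapping character filter, written as a Python dict of index settings. This is an illustration, not the actual CirrusSearch config—the real mappings list and analyzer wiring may differ in detail.
<syntaxhighlight lang="python">
# Rough sketch only—not the actual CirrusSearch config. This is the upgraded
# word_break_helper idea expressed as an Elasticsearch "mapping" character filter:
# regular and fullwidth periods, underscores, and colons (plus parens) become spaces.
word_break_helper = {
    "type": "mapping",
    "mappings": [
        "_ => \\u0020",        # underscore
        ". => \\u0020",        # period
        ": => \\u0020",        # colon
        "( => \\u0020",        # open paren
        ") => \\u0020",        # close paren
        "\\uFF3F => \\u0020",  # fullwidth underscore ＿
        "\\uFF0E => \\u0020",  # fullwidth period ．
        "\\uFF1A => \\u0020",  # fullwidth colon ：
    ],
}

# A language analyzer would then list it under char_filter—except for Chinese and
# Korean, where the SmartCN/Nori segmenters need the punctuation left in place.
analysis_settings = {
    "analysis": {
        "char_filter": {"word_break_helper": word_break_helper},
        "analyzer": {
            "text": {
                "type": "custom",
                "char_filter": ["word_break_helper"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}
</syntaxhighlight>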
Finally, We Can Do as Asked
editWith that done, I ran some more tests, and everything looked very good. The only place where the results were suboptimal was in IPA transcriptions, where a colon (:) is sometimes used for the triangular colon (ː), which is used to indicate vowel length.
Add remove_duplicates to Hebrew
edit
I got ahead of myself a little and looked into adding remove_duplicates to the Hebrew analysis chain. My analysis analysis tools assumed that there wouldn't be two identical tokens (i.e., identical strings) on top of each other. In the Hebrew analyzer, though, that's possible—common, even! I made a few changes, and the impact of adding remove_duplicates is much bigger and easier to see.
The Hebrew tokenizer assigns every token a type—Hebrew, NonHebrew, or Numeric are the ones I've seen so far.
The Hebrew lemmatizer adds one or more tokens of type Lemma for each Hebrew or NonHebrew token. The problem arises when the lemma output is the same as the token input—which is true for many Hebrew tokens, and true for every NonHebrew token.
I hadn't noticed these before because the majority of tokens from Hebrew-language projects are in Hebrew, and I don't read Hebrew, so I can't trivially notice that two tokens are the same.
Adding remove_duplicates removes Lemma tokens that are the same as their corresponding Hebrew/NonHebrew token.
For a 10K sample of Hebrew Wikipedia articles, the number of tokens decreased by 19.1%! For 10K Hebrew Wiktionary entries, it was 22.7%!
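For illustration, here's roughly where remove_duplicates slots into a Hebrew-style chain, sketched as a Python dict of analyzer settings. remove_duplicates is a stock Elasticsearch token filter; the tokenizer and lemmatizer names below are stand-ins for the Hebrew plugin's components, not necessarily the exact production names.
<syntaxhighlight lang="python">
# Illustrative sketch of where remove_duplicates slots into a Hebrew-style chain.
# remove_duplicates is a stock Elasticsearch token filter that drops a token whose
# text and position match the previous token—exactly what happens when the
# lemmatizer emits a Lemma identical to its source token.
hebrew_text_analyzer = {
    "type": "custom",
    "tokenizer": "hebrew",      # assumes the Hebrew analysis plugin is installed
    "filter": [
        "hebrew_lemmatizer",    # stand-in name; emits one or more Lemma tokens per input token
        "remove_duplicates",    # new: drop Lemma tokens identical to the original token
        "lowercase",
    ],
}
</syntaxhighlight>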
Refactoring and Reimplementing, a.k.a., Plugin! The Musical!†
edit† No actual music was made (or harmed) in the making of this plugin.
While I was on vacation, Erik kindly started the global reindexing to enable apostrophe_norm, camelCase splitting, acronym handling, and word_break_helper updates. He noticed after a while that it was going reeeeeeeeeeeeeeally sloooooooooooow. He ran some tests and found it was taking about 3 times as long with the new filters in place. We decided to halt the reindexing while I looked into it some more. We also decided to leave the slower rebuilt indexes in place. Monitoring graphs maybe showed a little more lag, but nothing egregious. (Reindexing a document here and there and analyzing short query strings is not in the same CPU usage ballpark as cramming tens of millions of documents through the reindexing pipeline as fast as possible.) We did decide to temporarily semi-revert the code so that any reindexing during the time I was developing an alternative would be close enough to the old speed.
I hadn't noticed the slowdown before because my analysis analysis tools are built to easily gather stats on tokens, not push text through the pipeline as fast as possible. The overhead of the tools dwarfs the reindexing time. I've since mirrored Erik's timing framework, which nukes the index every time, and times directly stuffing thousands of documents (and tens of MB at a time) into a new index as fast as possible, with as few API calls as possible.
Notes on Timings
editThe timings are necessarily fuzzy. Erik averaged six reloads per timing, while I eventually limited myself to three, sometimes four, because I was doing a lot more testing.
I typically load 2500 documents at a go, which for my English sample is about 72MB of data. The limit for a single data transfer to Elastic in our configuration is about 100MB, so ~70MB is in the ballpark, but can fairly reliably be loaded as a single action. I also have used samples of French and Korean (5K documents) and Hebrew (3K documents) that are from 50–100MB.
My numbers are all relative. Some days my laptop seems randomly faster than others. I'm not sure whether it's the first few loads after rebooting Elastic, or the first few minutes after rebooting Elastic, but for a while there things are just slower, so I always do a few throw-away loads before taking timings. I also used whatever baseline was convenient when doing comparative timings, and as I progressed through adding new plugins, the baseline generally became longer, because I wanted to compare the potential next config to the most current new config.
So, I might say one filter added 3% to load time and another added 6%, versus a given baseline. Later, on a bad hair day slow laptop day, with a different baseline, the numbers might be 4% and 6.5%. Anything with a clear and consistent gap between them was worth taking into account.
To my happy surprise, filter load time increases added up fairly linearly on general English data. (And French, Hebrew, and Korean, too.) That's not guaranteed, since the output of one filter can change the input of the next, giving it more or less to do than it would by itself. But those effects, if any, are lost in the general noise of timing.
In many cases, the fine details don't matter at all. The regex-based filters added around 249% to load time (i.e., ~3½x slower). Whether loading takes 240% or 260% longer doesn't really matter... it's waaaaaay too slow!
How Slow Is Slow, Anyway?
editAgainst my initial baseline:
filter/task | slowdown | notes
word_break_helper | 2% | a few character substitutions
apostrophe_norm | 5% | many character substitutions
regex camelCase | 35% |
regex acronyms | 206% | Holy guacamole, Batman!
My first approach, now that I had a decent ability to estimate the impact/expense of any changes, was to try to tune the regexes and other config elements to be more efficient.
The regexes for acronyms and camelCase both used lookaheads, lookbehinds, and many optional characters (allowing 0–9‡ non-letter diacritics or common invisibles (e.g., soft hyphens or bidirectional markers)), which could allow for lots of regex backtracking, which can be super slow.
‡ Note that I had originally allowed "0 or more" diacritics or invisibles, but that was so slow that I noticed even with my not-very-sensitive set up. I should have paid more attention and thought harder about that, it seems.
CamelCase
editDecreasing the number of optional diacritics and invisibles from 0–9 to 0–3 improved the camelCase regex filter from a 32% increase in load time (against a different baseline than above) to 18%.
However, the biggest win for the regex-based camelCase filter was to remove the lookbehind. I had done the acronym regex first, and it required the lookbehind because the contexts of periods could overlap. (e.g., F.Y and Y.I overlap in F.Y.I.) It was easy to do the camelCase regex in an analogous way, but not necessary. That improved it from a 32% increase in load time to merely 12%.
Combining both changes—no lookbehind and only 0–3 optional diacritics and invisibles—did not improve on just getting rid of the lookbehind.
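To make the tuning concrete, here's an illustrative (not production) pattern_replace character filter in the spirit of the camelCase regex—no lookbehind, and only 0–3 optional combining marks or invisibles between the lowercase and uppercase letters. The "invisibles" character class is a stand-in.
<syntaxhighlight lang="python">
# Illustration only—not the production regex. Java regex syntax, as used by
# Elasticsearch's pattern_replace character filter: a lowercase letter (plus up to
# three optional combining marks or invisibles) followed by an uppercase letter
# gets a space inserted between them. No lookbehind needed.
camelcase_split = {
    "type": "pattern_replace",
    "pattern": "(\\p{Ll}[\\p{M}\\u00AD\\u200E\\u200F]{0,3})(\\p{Lu})",
    "replacement": "$1 $2",
}
</syntaxhighlight>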
Acronyms
editWith a new baseline, the regex-based acronym filter increased runtime by 200%. Allowing only 0–3 diacritics and invisibles dropped the increase to 98%—a huge improvement, but still not acceptable. Limiting it to 0–2 diacritics and invisibles brought the increase down to 69%, and 0–1 brought it to 49%.
Allowing no diacritics or invisibles, which would work correctly more than 90% of the time, but which could fail in ways that are opaque to a reader or searcher, especially in non-alphabetic scripts, increased load time by only 24%.
A ¼ increase in load times might be acceptable on a smaller corpus. If you really worry about acronyms,§ a load time of 2½ hours instead of 2 hours might be worth it. However, our baseline load time for all wikis is 2 weeks. An extra 3½ days is a lot to ask for good acronym processing.
§ David and I have been complaining to each other about acronym processing for more than 5 years, easy, so we do really worry about them.
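For reference, here's the general shape of the regex-based acronym filter—again an illustration with a stand-in class for marks and invisibles, not the production pattern. The bounded {0,3} is exactly the knob the timings above are about; the lookbehind is needed because period contexts can overlap (F.Y and Y.I in F.Y.I).
<syntaxhighlight lang="python">
# Rough illustration of the *shape* of the regex-based acronym filter—not the
# production pattern. It deletes a period between two single letters, each with a
# non-letter (or a text boundary) on its other side, allowing a few optional
# combining marks or invisibles on either side of the period.
acronym_fix = {
    "type": "pattern_replace",
    "pattern": (
        "(?<=(?:^|\\P{L})\\p{L}[\\p{M}\\u00AD]{0,3})"  # isolated letter (plus marks) before
        "\\."                                          # the period to delete
        "(?=[\\p{M}\\u00AD]{0,3}\\p{L}(?:\\P{L}|$))"   # isolated letter (plus marks) after
    ),
    "replacement": "",
}
# e.g., " N.A.S.A. " comes out as " NASA. "; the trailing period is later stripped
# at the token edge anyway.
</syntaxhighlight>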
Apostrophes, Etc.
editAgainst a very stripped down baseline (just the standard tokenizer and the default lowercase filter—which ran in ⅔ the time of earlier baselines, so all the numbers below are proportionally higher), I got these comparative numbers:
filter/task | type | slowdown | notes
nnbsp_norm | char filter | 4.9% | one character substitution (narrow non-breaking space)
word_break_helper | char filter | 5.8% | a few character substitutions
apostrophe_norm | char filter | 7.4% | many character substitutions
ICU normalization | token filter | 6.1% | lots of Unicode normalizations
homoglyph_norm | token filter | 20.1% | normalizes Latin/Cyrillic multiscript tokens
default en analyzer (unpacked) | analyzer | 50.2% |
Observations:
- The reciprocal of ⅔ is 1½ (i.e., 1 + 50%), so the default English analyzer running 50% slower than the stripped down version is just about perfect, mathematically speaking.
- The homoglyph normalization filter is pretty expensive!
- I was also surprised at how expensive the one-character character filter map for narrow non-breaking spaces is. That is the simplest possible mapping character filter: a one-character-to-one-character mapping.
- The overall machinery for general mapping filters is pretty complex, since it can map overlapping strings (e.g., ab, bc, cd, and abcd can all have mappings) of any practical length to replacement strings of any practical length. However, we mostly use it to map single characters to either single characters or the empty string. Most of our other mappings are at most two characters being mapped from or to. I was wondering whether it would be possible to get that per–char filter overhead down for simple mappings like most of ours.
Using a similarly stripped-down baseline on a different day, I also evaluated English-specific components from our unpacked English analyzer and the default English analyzer (except for the keyword filter), plus some other filters we use or have used across multiple analyzers:
filter/task | type | slowdown | notes
possessive_english | token filter | 0.60% | strips -'s
english_stop | token filter | 0.40% | removes English stop words
kstem | token filter | 6.20% | general English stemmer
custom_stem (for English) | token filter | 6.30% | just stems guidelines => guideline
ASCII folding | token filter | 2.30% | flattens many diacritics
ASCII preserve orig | token filter | 4.30% | keeps both ASCII folded and original tokens
ICU folding | token filter | 1.30% | aggressive Unicode diacritic squashing, etc.
ICU preserve orig | token filter | 3.20% | keeps both ICU folded and original tokens
kana_map | char filter | 4.20% | map hiragana to katakana, currently only for English, but plan to be global (except Japanese)
nnbsp_norm + apostrophe_norm + word_break_helper | char filters | 15.30% | three separate filters
nnbsp_norm ∪ apostrophe_norm ∪ word_break_helper | char filter | 6.80% | one filter with the same mappings as the three filters
Observations:
- The custom_stem filter could have other mappings added to it to handle things kstem doesn't. It predates my time on the Search team, but kstem does still get guidelines wrong.
- I'm surprised that the ICU filters are faster than the ASCII filters. ICU folding is a superset of ASCII folding, but ICU is still faster. ICU preserve is our homegrown parallel to ASCII preserve, so I'm surprised it is noticeably faster.
- In real life, we probably can't merge nnbsp_norm, apostrophe_norm, and word_break_helper because word_break_helper has to come after acronym processing, but the experiment is enlightening. 8.5% overhead by having three separate mapping filters instead of one is a lot! nnbsp_norm and apostrophe_norm could readily be merged, though, and with appropriate comments in the code it would only be somewhat confusing. :þ
I also ran some timings using the current English, French, Hebrew, and Korean analysis chains mostly on their own Wikipedia samples, but also a few tests running some of each through the others.
Despite different baseline analyzer configs, the timings for English and French were all within 1–2% of each other.
The camelCase filter had a much smaller impact on Hebrew and Korean, presumably because the majority of their text (in Hebrew and Korean script) doesn't have any capital letters for the regex to latch onto. The maximally simplified acronym regex ran much faster on Hebrew and Korean (~7.8% vs ~21.6% on English and French). I'm not 100% sure why. But it was still the most expensive component of the ones tested (apostrophe_norm, word_break_helper, camelCase, acronyms).
These timings of various components aren't directly relevant to the efficiency of the new filters, but they serve as a nice baseline for intuitions about how much processing time a component should take.
It's Plugin' Time!
editThe various improvements were quite good, relatively speaking, but after some discussion we set a rough heuristic threshold of 10% load time increase as warranting looking at a custom plugin. Both camelCase and acronyms were well over that, and together they were a very hefty increase, indeed.
We discussed folding the new custom filters for acronyms and camelCase into the existing extra plugin, but decided against it: the code complexity wasn't worth it (having to check not only for the existence of a plugin, but also for a specific version of it), the maintenance burden of deploying one more plugin isn't too high (there's a list of plugins, and the work is pretty much the same regardless of the length of the list, within reason), and there doesn't seem to be a run-time performance impact to having more plugins installed (yet). So, the acronym and camelCase handling are in a new plugin called extra-analysis-textify.
Acronyms
editAfter considering various options for the acronym filter—most importantly whether a token filter or character filter would be better—I decided on a finite-state machine inside a character filter. Being aware of the efficiency issues in large documents, I was trying to avoid the common character filter approach of reading in all the text, modifying it, and then writing it back out one character at a time. The finite-state machine does a good job of processing character-by-character, but it also needs to be able to look ahead to know that a given period is in the right context to be deleted.
The simplest context for period deletion is, as previously discussed in the original acronym write-up above, a period between two letters, with each having a non-letter (space, text boundary, punctuation, etc.) on their other side. So, we'd delete the period in " A.C " or "#r.o.", but not the one in "ny.m".
The extra complication I insisted on inflicting on myself was dealing with combining diacritics and common invisible characters, so that the period would be rightly deleted in "#A`˛.-C¸≫!"—where ` ˛ and ¸ are combining diacritics, the hyphen is a normally invisible soft hyphen, and ≫ represents a left-to-right bidi mark—so the text looks like this: "#Ą̀.Ç!".
We can maintain the correct state in place up to the period, but after that we have to read ahead to see what comes after. The most ridiculous set of diacritics I've seen in real data on-wiki was 14 of the same Khmer diacritic on a given character; they rendered on top of each other, so the tiny ring just looked slightly bold. I rounded up to allow buffering up to 25 characters (including the actual letter after the period), so even mildly G̶̼͉̘̬̯̔.L̵͔̦͓͖͇͙͝.I̴̛̖͒̑̎̄͝.T̵͖̤̻̓̏͊͒̾̚.C̸̱̳̬͎̜͂̌.H̶͉̝͚̪͙̒̒̄̒̒̈͜.Ȳ̶̢̺̘̠͙̅͐̇͜. "Zalgo" acronyms will be processed properly, though that was not a primary concern. (I chose to buffer the readahead text within the filter rather than trying to get the stream it's reading from to back up properly. It's theoretically possible to do that, but not every stream supports it, etc., etc., so it just seemed safer to handle it internally.)
I also decided to handle 32-bit characters correctly. Elasticsearch/Lucene internals are old-school character-by-character (which I learned in Java is 2 bytes / 16 bits at a time, unlike the really oldskool byte-by-byte C that I cut my teeth on), so we also have to read ahead when we see a high surrogate to get the following low surrogate, munge them together, and find out what kind of character that is.
We can properly process all sorts of fancy 𝒜͍.𝕮.ℝ̫.𝓞̥.𝘕́.𝚈̘.𝕄⃞.𝔖. now!
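The real filter is a Java CharFilter with offset corrections and surrogate handling, but the core idea fits in a short, hedged Python sketch: read character by character, and when a period shows up after an "isolated" letter, do a bounded lookahead to decide whether to delete it. Everything here (names, the set of invisibles, the state tracking) is simplified for illustration.
<syntaxhighlight lang="python">
# Much-simplified Python sketch of the idea; the real filter is a Java CharFilter
# with offset corrections, surrogate-pair handling, and a 25-character lookahead
# buffer. Names, the set of "invisibles", and the state tracking are all simplified.
import unicodedata

MAX_LOOKAHEAD = 25
IGNORABLE = {"\u00AD", "\u200E", "\u200F"}  # soft hyphen, bidi marks (illustrative subset)

def _is_letter(ch):
    return unicodedata.category(ch).startswith("L")

def _skippable(ch):
    return ch in IGNORABLE or unicodedata.category(ch).startswith("M")

def strip_acronym_periods(text):
    out = []
    prev_is_letter = False       # last base (non-mark) character was a letter...
    prev_was_isolated = False    # ...and the base character before it was not a letter
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "." and prev_is_letter and prev_was_isolated:
            # bounded lookahead: marks/invisibles, a letter, marks/invisibles, then non-letter or end
            j, steps = i + 1, 0
            while j < n and steps < MAX_LOOKAHEAD and _skippable(text[j]):
                j, steps = j + 1, steps + 1
            if j < n and _is_letter(text[j]):
                k = j + 1
                while k < n and _skippable(text[k]):
                    k += 1
                if k >= n or not _is_letter(text[k]):
                    # acronym period: drop it (the real filter also records an offset correction)
                    prev_is_letter, prev_was_isolated = False, False
                    i += 1
                    continue
        if not _skippable(ch):
            prev_was_isolated = not prev_is_letter
            prev_is_letter = _is_letter(ch)
        out.append(ch)
        i += 1
    return "".join(out)

# strip_acronym_periods("J.R.R.Tolkien and N.A.S.A.") == "JRR.Tolkien and NASA."
</syntaxhighlight>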
I did a regression test against the regex-based acronym filter, and the only thing that came up was in the bookkeeping for the deleted periods. (Frankly, I'm always amazed that the pattern_replace filter that uses the regexes can keep all the bookkeeping straight.) The offset corrections for the regex (which is following very general principles) differ from my special-purpose offsets, which means that there could be extra periods highlighted for the regex-generated tokens in specific cases (notably in non-alphabetic Asian languages where words get tokenized after being de-acronymed). The results are differences in highlights like អ.ស. vs អ.ស. in Khmer. Only a typography nerd would notice, and almost no one would care, I think.
The speed improvement was amazing! Against a specific baseline of the English analyzer with apostrophe handling added, the fully complex acronym regex filter added 274.1% to the load time, the maximally simplified acronym regex added 27.9%, and the plugin version added only 4.4% (i.e., more than 98% faster, making it the same or slightly faster than a one-character mapping filter, such as the narrow non-breaking space fix).
I also configured our code to fall back to the maximally simplified acronym regex if the textify plugin is unavailable.
CamelCase
editThe camelCase plugin was more straightforward, but I also used a finite-state machine inside a character filter.
To recap, the simplest camelCase approach is to put a space between a lowercase letter and a following uppercase letter, but complications include combining diacritics and invisibles, plus 32-bit characters with high and low surrogate characters. Compared to acronyms, though, it was much more straightforward—only the low surrogate and the capital letter following a lowercase letter (and its diacritics and invisibles) need to be buffered. Easy peasy.
Unicode Nerd Fact of the Day: In addition to UPPERCASE and lowercase letters, there are also a few letters that are TitleCase. The examples I found are characters that are made up of multiple letters. So, Serbo-Croatian has a Cyrillic letter љ which corresponds to lj in the Latin alphabet. It also comes as a single character: ǉ. It comes in uppercase form, Ǉ, for all-uppercase words, and in title case form, ǈ, if a word happens to start with that letter. Thus, in theory, you want to split words between lowercase and either UPPERCASE or TitleCase—not that it comes up very often. Also, as TitleCase letters are essentially UPPERCASE on the front side and lowercase on the back side, they can make a whole camel hump by themselves. "ǉǈǈ"—rather, "ǈǈǈ" (3 TitleCase characters)—should be camelCased to "ǈ ǈ ǈ". (This is a weird example, but it makes sense (as much sense as it can) because the string "LjLjLj" (6 MiXeD cAsE characters) would be camelCased to "Lj Lj Lj", and the TitleCase version is ICU normalized to that exact UPPER+lower version. So, both rule orderings—ICU normalization before or after camelCasing—give the same result, and that result matches the often visually identical UPPER+lower version.)
I did a regression test against the regex-based camelCase filter, and the only difference I found was in camelCase words with 32-bit characters. The uppercase Unicode regex pattern \p{Lu} correctly matches uppercase 32-bit characters, but only modifies the offset by 1 character/2 bytes/16 bits. So, 𝗥𝗮𝗱𝗶𝗼𝗨𝘁𝗼𝗽𝗶𝗮 (in Unicode math characters) gets correctly split into 𝗥𝗮𝗱𝗶𝗼 and 𝗨𝘁𝗼𝗽𝗶𝗮, but the offset splits the 32-bit 𝗨 in half (and apparently throws away the other half), so the offset/highlight would just include 𝘁𝗼𝗽𝗶𝗮. (So highlighting would be 𝗥𝗮𝗱𝗶𝗼𝗨𝘁𝗼𝗽𝗶𝗮—not great, but hardly a disaster.)
The plugin version of the camelCase splitter doesn't have that problem.
The speed improvement was not as impressive as the acronym speedup, but still great. The regex-based camelCase filter was 19.9% slower than the baseline. The plugin camelCase filter was only 3.6% slower—again faster than a mapping filter!
I also configured our code to fall back to a fairly reasonable camelCase regex if the textify plugin is unavailable.
Also, I thought of a weird corner case... a.c.r.o.C.a.m.e.l. Should we de-acronym first, so the camelCase can be split? I found some real examples—S.p.A., G.m.b.H., and S.u.S.E.—and while I don't love what camelCase splitting does to them after de-acronyming them, it's the same as what happens to SpA, GmbH, and SuSE, so that seems as right as possible.
Apostrophes, et al., and Limited Mappings
editAs mentioned above, I was surprised that the simplest possible mapping character filter (mapping one letter to one letter) increases loading time by 4–5%. A lot of our mapping filters are one-char to one-char.
- apostrophe_norm and nnbsp_norm are universally applied.
- word_break_helper, the dotted_I_fix, and the upcoming change to kana_map are or will be applied to most languages.
- Language-specific maps for Armenian, CJK, French, German, Persian, and Ukrainian are all one-char to one-char, as are numeral mappings for Japanese and Khmer.
- near_space_flattener and word_break_helper_source_text are not used in the text field, but they are also one-char to one-char.
The Korean/Nori-specific character filter is mostly one-char to one-char, though it also has one-char to zero-char/empty string mappings (i.e., delete certain single characters).
Other languages have very simple mappings:
- Irish uses one-char to two-char mappings.
- Romanian and Russian use two-char to two-char mappings.
Even our complex mappings aren't that complex:
- Chinese uses some two-char to longer string mappings.
- Thai uses some three-char to two-char mappings.
In the simplest case—one-char to one-char mappings—there's no offset bookkeeping to be done, and the mapping can be stored in a simple hash table for fast lookups.
It seems like it should be possible to create a filter for some sort of limited_mapping that is faster for these simple cases. We can have many mapping filters in a given language analyzer: apostrophe_norm, nnbsp_norm, kana_map, word_break_helper, dotted_I_fix, and a language-specific mapping, for up to 6 mappings total in the text field. An improvement of 1% per filter would basically be "buy 5, get 1 free!"
So, I looked into it! I have to give credit to the developers of the general-purpose mapping filter. I read over it, and they use some heavy-duty machinery to build complex data structures that are very efficient at the more complex task they've set themselves. My code does a moderately efficient job at a much less complex task.
And for all but the simplest mappings—involving two characters in either the mapping from or the mapping to direction—the generic mapping filter was the same speed, modulo the fuzziness of the timings. Impressive! However, on the 1-to-1 mappings, my limited_mapping approach was ~50% faster!
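The gist of why the 1-to-1 case can be so much cheaper, sketched in Python (the real limited_mapping is, again, a Java character filter in the textify plugin): with only one-char-to-one-char (or one-char-to-nothing) mappings there's no overlap handling and no offset bookkeeping—just one table lookup per character. The example mappings below are illustrative, not the real apostrophe_norm list.
<syntaxhighlight lang="python">
# Python sketch of the limited_mapping idea—illustrative mappings only.
EXAMPLE_MAPPINGS = {
    "\u2019": "'",   # right single quotation mark => apostrophe
    "\u02BC": "'",   # modifier letter apostrophe => apostrophe
    "`": "'",        # grave accent/backtick => apostrophe
    "\u00AD": None,  # soft hyphen => deleted (one-char-to-zero-char)
}
LIMITED_TABLE = str.maketrans(EXAMPLE_MAPPINGS)

def limited_map(text):
    # one table lookup per character; no offset bookkeeping needed
    return text.translate(LIMITED_TABLE)

# limited_map("O`Brien\u2019s") == "O'Brien's"
</syntaxhighlight>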
I decided that rather than trying to automatically detect instances of mapping where limited_mapping could be used (which would require pre-parsing the mappings with new code in PHP), I'd let the humans (just me for now) mark things as being ok for a limited_mapping filter, and then convert all of them back to plain mapping filters if the extra-analysis-textify plugin isn't available.
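The fallback itself is simple; the real version lives in CirrusSearch's PHP analyzer-building code, but the idea looks something like this Python sketch (the function name is invented):
<syntaxhighlight lang="python">
# Sketch of the fallback idea: if the extra-analysis-textify plugin isn't
# available, any limited_mapping char filter is downgraded to a stock mapping
# filter, which accepts the same "x => y" rules.
def downgrade_limited_mappings(analysis_settings, textify_available):
    if textify_available:
        return analysis_settings
    for definition in analysis_settings.get("char_filter", {}).values():
        if definition.get("type") == "limited_mapping":
            definition["type"] = "mapping"
    return analysis_settings
</syntaxhighlight>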
Plugin Development and Deployment
editThe patches for the new textify plugin and the code to use it when building analyzers are still in review, but it should all get straightened out and merged soon.
In keeping with the notion that adding components to existing, established plugins can add unwanted complexity, I've decided to leave the ASCII-folding/ICU-folding task on hold¶ and move up the ICU token repair project, since that will also require a custom filter. Then we can do one plugin deployment and one reindex afterwards to finally get all of these updates and upgrades into the hands of on-wiki searchers!
¶ It got placed there to deal with this reindexing speed problem that necessitated the plugin development so far.
Background
editThe ICU tokenizer is much better than the standard tokenizer at processing "some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables."
However, it has some undesirable idiosyncrasies.
(Note that we're using the ICU tokenizer that is compatible with Elasticsearch 7.10, which is the last version of Elasticsearch with a Wikimedia-compatible open-source license.)
UAX #29
editUAX #29 is a Unicode specification for text segmentation, which the ICU tokenizer largely implements. However, it does not quite follow word boundary rule 8 (WB8), which has this comment: Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
Given the following input text, "3д 3a 3a 3д", the default ICU tokenizer will generate the tokens 3д, 3, a, 3a, 3, and д. While this does, however opaquely, follow the internal logic of the ICU tokenizer, it is hard to imagine that this inconsistency is what typical users expect.
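If you want to poke at this behavior yourself, the _analyze API makes it easy—assuming a cluster with the analysis-icu plugin and the Python elasticsearch client (the exact client call varies a bit by version):
<syntaxhighlight lang="python">
# Assumes an Elasticsearch cluster with the analysis-icu plugin installed and the
# Python elasticsearch client; illustration only.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
resp = es.indices.analyze(body={"tokenizer": "icu_tokenizer", "text": "3д 3a 3a 3д"})
print([t["token"] for t in resp["tokens"]])
# expected, per the example above: ['3д', '3', 'a', '3a', '3', 'д']
</syntaxhighlight>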
More Detailed Examples
editLet's look at a similar example with different numbers and letters for ease of reference. With input "1я 2a 3x 4д", the ICU tokenizer gives these tokens: 1я, 2, a, 3x, 4, д.
One of the ICU tokenizer's internal rules is to split on character set changes. Problems arise because numbers do not have an inherent character set. (This is also true for punctuation, emoji, and some other non–script-specific characters, many of which are called either "weak" or "neutral" in the context of bidirectional algorithms, and which I generally refer to collectively as "weak" when talking about the ICU tokenizer.)
In the case of a token like z7, the 7 is considered to be "Latin", like the z. Similarly, in щ8, the 8 is "Cyrillic", like the щ. In "1я 2a 3x 4д", the 2 is considered "Cyrillic" because it follows я, and the 4 is considered "Latin" because it follows x, even though there are spaces between them. Thus—according to the internal logic of the ICU tokenizer—the "Cyrillic" 2 and Latin a should be split, and the "Latin" 4 and Cyrillic д should be split.
This effect can span many non-letter tokens. Given the string "д ... 000; 456/789—☂☀☂☀☂☀☂☀ 3a", the ICU tokenizer assigns all the numbers and emoji between д and a to be "Cyrillic". (The punctuation characters are discarded, correctly, by the tokenizer.) As a result, the last two tokens generated from the string are 3 (which is "Cyrillic") and a (which is Latin). Changing the first letter of the string to x—i.e., "x ... 000; 456/789—☂☀☂☀☂☀☂☀ 3a"—results in the last token being 3a. This kind of inconsistency based on a long-distance dependency seems sub-optimal.
As a more real-world example, in a text like Напиток 7Up использует слоган "Drink 7Up" (which is a machine translation of the sentence The beverage 7Up uses the slogan "Drink 7Up"), the first 7Up is split into two tokens (7, Up), while the second is left as one token. Similar discussions of 3M, A1 steak sauce, or 23andMe in Armenian, Bulgarian, or Greek texts are subject to this kind of inconsistency.
Homoglyphs
editAnother important use case that spurred development of the icu_token_repair filter is homoglyphs. For example, the word "chocоlate"—where the middle о is actually Cyrillic—will be tokenized by the ICU tokenizer as choc, о, late. This seems to be contrary to WB5 in UAX #29 (Do not break between most letters), but the ICU tokenizer is consistent about it, and always makes the split, because there is definitely a legitimate character set change.
On Wikimedia wikis, such homoglyphs are sometimes present as the result of vandalism, but more often as the result of typing errors, lack of easily accessible accented characters or other uncommon characters when translating, or cutting-and-pasting errors from other sources. We have a token filter homoglyph_norm that is able to handle Cyrillic and Latin homoglyphs, and repair "chocоlate" to more typical "chocolate", but it only works on individual tokens, not across tokens that have already been split up.
Other Mixed-Script Tokens
editStylized, intentionally mixed-script text or names—such as "lιмιтed edιтιon", "NGiИX", or "KoЯn"—can also occur, and the ICU tokenizer consistently splits them into single-script sub-word tokens.
Sometimes mixed-script numerals, like "2١١2", occur. The ICU tokenizer treats ١ as Arabic, but 2 is still a weak character, so depending on the preceding context, the number could be kept as a single token, or split into 2 and ١١2.
Not a <NUM>ber
edit
Another issue discovered during development is that the ICU tokenizer will label tokens that end with two or more digits with the token type <NUM> rather than <ALPHANUM>. So, among the tokens abcde1, abcde12, 12345a, a1b2c3, h8i9j10, д1, д12, অ১, অ১১, क१, क११, ت۱, and ت۱۱, the ones ending in two or more digits (abcde12, h8i9j10, д12, অ১১, क११, and ت۱۱) are <NUM> and the rest are <ALPHANUM>. This seems counterintuitive.
This can become particularly egregious in cases of scripts without spaces between words. The Khmer phrase និងម្តងទៀតក្នុងពាក់កណ្តាលចុងក្រោយនៃឆ្នាំ១៩៩២ ("and again in the last half of 1992") ends with four Khmer numerals (១៩៩២, underlined because bolding isn't always clear in Khmer text). It is tokenized (quite nicely—this is why we like the ICU tokenizer!) as និង, ម្តង, ទៀត, ក្នុង, ពាក់កណ្តាល, ចុងក្រោយ, នៃ, and ឆ្នាំ១៩៩២. The bad part is that all of these tokens are given the type <NUM>, even though only the last one has any numerals in it!
If you don't do anything in particular with token types, this doesn't really matter, but parts of the ICU token repair algorithm use the token types to decide what to do, and they can go off the rails a bit when tokens like abcde12 are labelled <NUM>.
The Approach
editThe plan is to repair tokens incorrectly split by the ICU tokenizer. To do this, we cache each token, fetch the next one, and decide whether to merge them. If we don't merge, emit the old one and cache the new one. If we do merge, cache the merged token and fetch the next token, and repeat.
The phrase "decide whether to merge them" is doing a lot of the heavy lifting here.
- Tokens must be adjacent, with the end offset of the previous one being equal to the start offset of the following one. No space, punctuation, ignored characters, etc., can intervene. This immediately rejects the vast majority of token pairs in languages with spaces, since tokens are not adjacent.
- Tokens must be in different scripts. If you have two Latin tokens in a row, they weren't split because of bad behavior by the ICU tokenizer. (Maybe camelCase processing got them!)
- Tokens must not be split by camelCase processing. By default, if one token ends with a lowercase letter and the next starts with an uppercase letter—ignoring certain invisible characters and diacritics—we don't rejoin them. A token like ВерблюжийCase should be split for camelCase reasons, not mixed-script reasons. This can be disabled.
- Tokens must be of allowable types. By default, <EMOJI>, <HANGUL>, and <IDEOGRAPHIC> tokens cannot be rejoined with other tokens.
- <HANGUL> and <IDEOGRAPHIC> need to be excluded because they are split from numbers by the ICU tokenizer, regardless of apparent script. So, given the text "3년 X 7년", the ICU tokenizer generates the tokens 3 + 년 + X + 7 + 년: the 3 is "Hangul" (because the nearest script is Hangul) and the 7 is "Latin" (because it follows the Latin X), but both are separated and neither should be repaired. Allowing <HANGUL> tokens to merge would only trigger repairing 7년, but not 3년. The <IDEOGRAPHIC> situation is similar.
- <EMOJI> are excluded because in most cases they are not intended to be part of words.
- The different scripts to be joined must be on a list of acceptable pairs of scripts. I originally didn't have this requirement, but after testing it became clear that, based on frequency and appropriateness of rejoined tokens, it is possible to make a list of desirable pairs or groups of scripts to be joined that cover almost all of the desirable cases and exclude many undesirable cases.
- The compatible script requirement is ignored if one of the tokens is of type <NUM> (corrected <NUM>, after fixing types that should be <ALPHANUM>; see below). <NUM> tokens still can't merge with disallowed types, like <HANGUL>.
- Joined tokens shouldn't be absurdly long. The threshold for absurdity is subjective, but there are very few tokens over 100 characters long that are valuable, search-wise. The ICU tokenizer itself limits tokens to 4096 characters. Without some limit, arbitrarily long tokens could be generated by alternating sequences like xχxχxχ... (Latin x and Greek χ). I've set the default maximum length to 100, but it can be increased up to 5000 (chosen as a semi-random round number greater than 4096).
Merging tokens is relatively straightforward, though there are a few non-obvious steps:
- Token strings are simply concatenated, and offsets span from the start of the earlier token to the end of the later one.
- Position increments are adjusted so that the new tokens are counted correctly and if the unjoined tokens were sequential, the joined tokens will be sequential.
- Merged multi-script tokens generally get a script of "Unknown". (Values are limited to constants defined by IBM's UScript library, so there's no way to specify "Mixed" or joint "Cyrillic/Latin".) If they have different types (other than exceptions below), they get a merged type of <OTHER>.
- The Standard tokenizer labels tokens with mixed Hangul and other alphanumeric scripts as <ALPHANUM>, so we say <HANGUL> + <ALPHANUM> = <ALPHANUM>, too.
- When merging with a "weak" token (like numbers or emoji), the other token's script and type values are used. For example, merging "Cyrillic"/<NUM> 7 with Latin/<ALPHANUM> x gives Latin/<ALPHANUM> 7x—rather than Unknown/<OTHER> 7x.
- "Weak" tokens that are not merged are given a script of "Common", overriding any incorrect specific script they may have had. (This is the script label they get if they are the only text analyzed.)
- <NUM> tokens that also match the Unicode regex pattern \p{L} are relabelled as <ALPHANUM>. (This applies primarily to mixed letter/number tokens that end in two or more digits, such as abc123, or longer strings of tokens from a spaceless string of text, as in the Khmer example above.)
Data, Examples, Testing
editMore Data for Script Testing
editFor testing purposes, I used the samples I had pulled from 90 different Wikipedias for general harmonization testing, plus an extra thirteen new Wikipedia samples with scripts not covered by the original 90. The new ones include Amharic, Aramaic, Tibetan, Cherokee, Divehi, Gothic, Inuktitut, Javanese, Lao, Manipuri, N’Ko, Santali, and Tamazight (scripts: Ethiopic, Syriac, Tibetan, Cherokee, Thaana, Gothic, Canadian Aboriginal Syllabics, Javanese, Lao, Meetei Mayek, N'Ko, Ol Chiki, Tifinagh; codes: am, arc, bo, chr, dv, got, iu, jv, lo, mni, nqo, sat, zgh).
Spurious Mergers
editI've complained before about felonious erroneous word mergers that occur during export, but now it has a phab ticket (T311051), so it's not just me. A fair number of the dubious mixed-script tokens I found in my samples came from these export errors. The most common type were from bulleted lists, where the intro to the list is in one script ("..Early Greek television shows included") and the first item in the list is in another script ("Το σπίτι με τον φοίνικα"), leading to an apparent token like includedΤο (where the Το is Greek). This also happens to other same-script tokens (e.g., both merged words are in English, or both in Greek, etc.), but it isn't apparent in the current analysis. It's good that these mixed-script word-merging tokens are less common in real data than in my exports, because they would benefit from script-changing token splits if they were real, undermining the motivation for repairing ICU tokens.
What Can Merge With What?
editMixed-script tokens like piligrimనందు (Latin/Telugu), NASDAவின் (Latin/Tamil), WWEच्या (Latin/Marathi) are real, not spurious, and seem to be correct, in that they are English words with non-English/non-Latin inflections on them because they appear on non-English Wikipedias. However, splitting on scripts here is a feature, not a bug, since it makes foreign terms more findable on these wikis.
On the other hand, mixed-script tokens like Vərəϑraγna (Greek/Latin), Boйткo (Cyrillic/Latin), λειτóς (Greek/Latin), Метамόрфоси (Cyrillic/Greek), abаγad (Cyrillic/Greek/Latin) Ξασσօμχατօι (Armenian/Greek), M𐌹𐌺𐌷𐌰𐌹𐌻 (Gothic/Latin), KᎤᎠᏩᏎ (Cherokee/Latin), ⵉⵃEEⵓ (Latin/Tifinagh) are the kind we want to keep together. Breaking the words up into random bits isn't going to help anything. Some, like Vərəϑraγna, are correct; it's a pronunciation guide, using Latin and Greek symbols (though the unusual theta ϑ is pretty weird if you aren't familiar with it). Others, like Boйткo can be fixed by our Cyrillic/Latin homoglyph processing, and λειτóς, Метамόрфоси, and abаγad are definitely on the list for future Greek homoglyph processing. The Armenian and Gothic examples are currently farther down the list for homoglyph processing, but if we can keep it efficient, I'd love to be able to handle a much larger many-to-many homoglyph mapping.
Based on the frequency and appropriateness (and spuriousness!) of the kinds of mergers I saw when allowing all scripts to merge, I came up with the following groups to merge: Armenian/Coptic/Cyrillic/Greek/Latin, Lao/Thai, Latin/Tifinagh, Cherokee/Latin, Gothic/Latin, Canadian Aboriginal/Latin. (Groups with more than two scripts mean each of the scripts in the group is allowed to merge with any of the other scripts in the group.)
These groups are in fact mostly based on the presence of homoglyphs, which seems like the most obvious reason for not splitting on scripts. Alphabetic scripts also seem to like to mix-n-match in stylistic non-homoglyph ways, as in lιмιтed edιтιon. Technical non-homoglyph examples include words like sьrebro, which is a reconstructed Proto-Slavic form, ΑΝΤⲰΝΙΝΟϹ, which mixes Greek and Coptic to transcribe words on an ancient coin, and sciency usages like Δv, μm, or ΛCDM.
Timing & Efficiency
editUsing an uncustomized analysis config with the ICU tokenizer—in this case Tibetan/bo—adding the default icu_token_repair filter increased the load time of a sizable chunk of English text by an average of 6.01% across 4 reloads. Using the more complex English analysis config with the ICU tokenizer enabled as a baseline, the increase was only 3.95%, averaged across 4 reloads. So, "about 5% more" is a reasonable estimate of the runtime cost of implementing icu_token_repair.
Limitations and Edge Cases
edit
- The icu_token_repair filter should probably be as early as possible in the token filter part of the analysis chain, both because other filters might do better working on its output (e.g., homoglyph normalization), and because they might hamper its ability to make repairs (e.g., stemming).
- When numerals are also script-specific—like Devanagari २ (digit two)—they can be rejoined with other tokens, despite not being in the list of allowable scripts because they have type
<NUM>
. So, x२ will be split and then rejoined. This is certainly a feature rather than a bug in the case of chemical formulas and numerical dimensions, like CH৩CO২, C۱۴H۱۲N۴O۲S, or ૪૦૩X૧૦૩૮—especially when there is a later decimal normalization filter that converts them to ch3co2, c14h12n4o2s, and 403x1038.- On the other hand, having the digits in a token like २২੨૨᠒᥈߂᧒᭒ (digit two in Devanagari, Bengali, Gurmukhi, Gujarati, Mongolian, Limbu, N'ko, New Tai Lue, and Balinese) split and then rejoin doesn't seem particularly right or wrong, but it is what happens.
- Similarly, splitting apart and then rejoining the text x5क5x5x5क5क5д5x5д5x5γ into the tokens x5, क5, x5x5, क5क5, д5x5д5x5γ isn't exactly fabulous, but at least it is consistent (tokens are split after numerals, mergeable scripts are joined), and the input is kind of pathological anyway.
- Script-based splits can put apostrophes at token edges, where they are dropped, blocking remerging. rock'ո'roll (Armenian ո) or О'Connor (Cyrillic О) cannot be rejoined because the apostrophe is lost during tokenization (unlike all-Latin rock'n'roll or O'Connor).
- And, of course, typos always exist, and sometimes splitting on scripts fixes that, and sometimes it doesn't. I didn't see evidence that real typos (i.e., not data export problems) were a huge problem that splitting on scripts would readily fix.
In my previous analysis building and testing the icu_token_repair filter above, I replaced the standard tokenizer with the ICU tokenizer and used that as my baseline, looking at what icu_token_repair would and could do to the tokens that were created by the icu_tokenizer.
In this analysis, I'm starting with the production status quo and replacing the standard tokenizer with the icu_tokenizer‖ plus icu_token_repair, and adding icu_token_repair anywhere the icu_tokenizer is already enabled. This gives different changes to review, because changes made by the icu_tokenizer that are not affected by icu_token_repair are now relevant. (For example, tokenizing Chinese text or—foreshadowing—deleting © symbols.)
‖ Technically, it's the textify_icu_tokenizer, which is just a clone of the icu_tokenizer that lives within the right scope for icu_token_repair to be able to read and modify its data.
Lost and Found: ICU Tokenizer Impact
edit
It's a very rough measure, but we can get a sense of the impact of enabling the ICU tokenizer (or, for some languages,# where it was already enabled, just adding icu_token_repair) by counting the number of "lost" and "found" tokens in a sample. Lost tokens are ones that do not exist at all after a change is made, while found tokens are ones that did not exist at all before a change is made.
# The ICU tokenizer is always enabled for Buginese, Burmese, Cantonese, Chinese, Classical Chinese, Cree, Dzongkha, Gan Chinese, Hakka, Japanese, Javanese, Khmer, Lao, Min Dong, Min Nan, Thai, Tibetan, and Wu. Some of these have other language-specific tokenizers in the text field, but still use the icu_tokenizer in the plain and suggest fields.
For example, when using the standard tokenizer, the text "维基百科" would be tokenized as 维 + 基 + 百 + 科. Switching to the ICU tokenizer, it would be tokenized as 维基 + 百科. So, in our diffs, 维, 基, 百, and 科 might be lost tokens (if they didn't occur anywhere else) and 维基 and 百科 would be found tokens (because they almost certainly don't exist anywhere else in the text, since all Chinese text was broken up into single-character unigrams by the standard tokenizer).
Another example: if the ICU tokenizer is already being used, and icu_token_repair is enabled, the text "KoЯn" might go from being tokenized as Ko + Я + n to being tokenized as KoЯn. Because, in a large sample, Я and n are likely to exist as single-letter tokens (say, discussing the Cyrillic alphabet for Я and doing math with n), maybe only Ko would be a lost token, while KoЯn would be a found token.
So.. the median percentage of lost and found tokens in my 97 Wikipedia samples with changes (a handful had no changes at all) after turning on the ICU tokenizer + icu_token_repair is 0.04% (1 in 2500) for lost tokens, and 0.02% (1 in 5000) for found tokens. The maximums were 0.34% (1 in ~300) for lost tokens and 0.29% (1 in ~350) for found tokens (for Malay and Malayalam, respectively). More than 85% of samples were at or below 0.10% (1 in 1000) for both. Note that text on wikis is much more carefully edited and refined than queries on wikis, so we'd expect a higher impact on queries, where typos/editos/cut-n-paste-os are more common. The low impact is generally good, because we're generally trying to do a better job of parsing uncommon text (e.g., CJK text on a non-CJK wiki) and fixing uncommon errors that come with that (splitting multi-script words with or without homoglyphs).
Survey of Changes
editParsing the Unparsed
editThe most common changes are the ones we expect to see, and the reason we want to use the ICU tokenizer over the standard tokenizer:
- Hanzi/kanji/hanja (Chinese characters in Chinese, Japanese, or Korean contexts) and Hiragana: unigrams become words!
- Hangul, Katakana, Lao, Myanmar, Thai: long tokens become shorter tokens (e.g., Thai tokens in the English Wikipedia sample had an average length of 11.6 with the standard tokenizer, but only 3.5 with the ICU tokenizer)
- Some mixed Han/Kana tokens are parsed out of CJK text, too.
- It's odd that the standard tokenizer splits most Hiragana to unigrams, but lumps most Katakana into long tokens.
Unsplitting the Split
editSome less common changes are the ones we see in languages that already use the ICU tokenizer, so the only change we see is from icu_token_repair.
- Tokens with leading numbers are no longer incorrectly split by the ICU tokenizer, like 3rd, 4x, 10x10. Some of these tokens are not "found" because other instances exist in contexts where they don't get split.
- Reminder: the ICU tokenizer splits on script changes, but numbers inherit script from the text before them, even across white space! Thus, the text "3rd αβγ 3rd" is tokenized as 3rd + αβγ + 3 + rd because the second 3 is considered "Greek" because it follows "αβγ". icu_token_repair fixes this to the more intuitive tokenization 3rd + αβγ + 3rd.
Unsmushing the Smushed
editThere are also a fair number of dual-script tokens that are made up of two strings in different scripts joined together (e.g., Bibliometricsதகவல்), rather than characters of two or more scripts intermixed (e.g., KoЯn). These are properly split by the icu_tokenizer, and for many script pairs they are not repaired by icu_token_repair. These come in a few flavors:
- Typos—someone left out a space, and the ICU tokenizer puts it back. Yay!
- In the cases where icu_token_repair rejoins them, we have a case of "garbage in, garbage out", so it's a wash.
- Spaceless languages—the space doesn't need to be there in the given language, and we don't have a language-specific tokenizer. Splitting them up does make the foreign word from the non-spaceless language more findable.
- Inflected foreign words: some languages will add their morphology to foreign words, even in another script. As an unlikely but more understandable example, in English one could pluralize космонавт (Russian "cosmonaut") and talk about космонавтs (with a Latin plural -s on the end, there). Breaking these up for certain script pairs makes the foreign word more findable.
- Export errors: These look like one of the cases above, particularly the first one, but are actually artefacts in my data. Bibliometricsதகவல் is an example: it appears on-wiki as Bibliometrics on one line of a bullet list and தகவல் on the next line. These are impossible to detect automatically, though I have investigated them in some cases when there are a lot of them. The underlying error has been reported on Phabricator in T311051.
Miscellaneous Errors, Changes, and Observations
editSome things still don't work the way we might want or expect them to. Some of these I may try to address now or in future patches, others I may just accept as a rare exception to an exception to an exception that won't work right all the time.
Some cases, examples, and observations (roughly in order of how much they annoy me) include:
- Symbols can be split off like numbers, but are harder to detect and correct.
- Example: µl (with µ, the micro sign) is distinct from μl (with μ, Greek mu), but both are used for "microliters". As a symbol, the micro sign has no inherent script (it is labelled "Common" if it is the only text analyzed). When preceded by non-Latin characters, the micro sign gets a script label matching those non-Latin characters and gets split off from the Latin l. On the Tamil Wikipedia, for example, the micro sign (coming after Tamil characters) is tagged as "Tamil" script, and we don't normally want to rejoin Tamil and Latin tokens. For numbers, we avoid this problem because the token type is <NUM>, which overrides the Tamil/Latin script mismatch. For the micro sign, the token type is <ALPHANUM>, the same as most normal words, so it can't override the script mismatch.
- Another example: Phonetic transcriptions like ˈdʒɒdpʊər lose the initial stress mark (ˈ) when the preceding word is not in a script that can merge with Latin script. Of course, the use of stress marks is also inconsistent. Maybe we should just nuke them all! (That would also prevent us from indexing tokens that are just stress marks.)
- In theory, we could look for tokens that are all symbols or all Common characters or something similar, but I'm hesitant to start extensively and expensively re-parsing so many tokens. (Maybe I should have tried to hack the ICU tokenizer directly. Ugh.)
- Word-internal punctuation can prevent repairs. Some punctuation marks are allowed word-internally but not at word boundaries, so text like "he 'can't' do it" will generate the token can't rather than 'can't'.
- The punctuation marks like this that I've found so far are straight and curly apostrophes/single quotes (' ‘ ’), middle dots (·), regular and fullwidth periods (. .), and regular and fullwidth colons (: :). Two instances of punctuation will cause a token split. Regular and fullwidth underscores (_ _) are retained before and after letters, even if there are multiple underscores.
- Interestingly, for Hanzi, Hiragana, Katakana, and Hangul (i.e., CJK characters), word-internal punctuation generally seems to be treated as a token boundary. (Underscores can glom on to Katakana in my minimal test.) Arabic, Cyrillic, and Greek, as well as spacelessΔ Khmer, Lao, Myanmar, and Thai are treated the same as Latin text with respect to word-internal punctuation.
Δ These languages aren't all 100% spaceless; some use spaces between phrases, but in general many words are run together with their neighbors.
- The periods and colons generally aren't a problem for our analysis, since word_break_helper converts them to spaces. The curly quotes are straightened by the apostrophe normalization. The middle dots should probably get added to word_break_helper. (They do get used as syl·la·ble separators, but so do hy-phens and en dash–es, so breaking on them would make it more consistent. They can also be used for multiplication in equations or units, like kW·h, but dot operators or spaces (kW⋅h or kW h) can be used, too. (Can't do anything about unspaced versions like kWh, though.) On balance, splitting seems like a good choice.)
- So, ideally, only apostrophes should be in this category of internally-allowed punctuation. If there's a script change on one side of an internal apostrophe, the tokenizer splits there; because the apostrophe is then at a token boundary, it is discarded. And because there is a character (the apostrophe) between the tokens, they can't be rejoined.
- There are many examples from the Macedonian Wikipedia sample, though most turned out to be from a single article about a particular dialect of Macedonian. The article uses Latin ä, ą, á, é, ó, ú and surprisingly Greek η along with general Cyrillic for transcription of the dialect.
- An actual homoglyph error of this type is O'Хара (a transliteration of O'Hara or O'Hare, but the O is Latin). Once split by the ICU tokenizer, it can't be rejoined by icu_token_repair (and thus it can't be fixed by our homoglyph processing).
- Microsoft'тун in Kyrgyz (ky), where -тун is a genitive inflection—similar to English -'s, plus it looks like it uses the apostrophe because Microsoft is foreign, a proper name, or non-Cyrillic.
- The Tamazight (zgh) Wikipedia, generally written in Tifinagh (ⵜⵉⴼⵉⵏⴰⵖ), quotes a French source (written in Latin) about liturgical Greek (written in Greek), so we have l'Ακολουθική Ελληνική (« grec liturgique ») embedded in the Tamazight/Tifinagh text!
- Numbers are separated by the
icu_tokenizer
from words in some spaceless languages/scripts, but not others. Numbers are separated from CJK scripts, but not Khmer, Lao, Myanmar, and Thai. (These line up with the ones that are split on all internal punctuation and those that are not. Must be some category internal to the ICU tokenizer.)- Having Arabic (0-9) or same-script numbers (e.g., Lao numbers between Lao words) in the text seems to glue together the word on each side of it. With other-script numbers (e.g., Khmer numbers between Lao words),
icu_token_repair
rejoins them, which is consistent in a way, but not really desirable.
- This is a regression for Thai, since the Thai tokenizer does split numbers (Arabic, Thai, and others) away from Thai script characters, and I added a filter to reimplement that when we switched to the otherwise generally better ICU tokenizer. However,
icu_token_repair
makes this inconsistent again, because Thai digits are "Thai" and can't rejoin, but Arabic digits are "Common" and can. I can hack it by pre-converting Arabic digits to Thai digits, blocking the rejoining, and letting the Thai analyzer's decimal_digit
convert them back to Arabic.
- However, that would introduce a further potential wrinkle.. The Thai analyzer would take Arabic digits embedded in Lao (or Khmer, or Myanmar) text, and convert them to Thai digits; the
icu_tokenizer
would split them, but icu_token_repair
would put them back together. Again, it's consistent in that Arabic, Thai, and Lao digits between Lao words would all be treated equally poorly by the Thai tokenizer. Maybe it's worth it to improve the treatment of Arabic and Thai digits in Thai text, or maybe I should let them all (Khmer, Lao, Myanmar, and Thai) suck equally for now and fix icu_token_repair
in the future. Or fix it now (which means redeploying again before reindexing—that'd be popular!) Epicycles within epicycles.
- Arabic thousand separators (٬ U+066C) are marked as particularly Arabic rather than as "Common" punctuation, which means that "abc 123٬456٬789" is tokenized as abc (Latin) + 123 (Latin) + 456٬789 (Arabic), while dropping the abc or replacing it with Arabic script keeps 123٬456٬789 as one token. Since we see all combinations of comma or Arabic thousand separator and Western/Eastern Arabic digits—like 1٬234, ۱٬۲۳۴, 1,234, and ۱,۲۳۴—and a few cases where the Arabic thousand separator is used like a comma, it makes sense to normalize it to a comma.
- I also found that the Arabic comma (، U+060C) is used between numbers. Converting it to a comma seems like it wouldn't hurt.
- Other intentional multi-script words
- Dhivehi Wikipedia seems to use the Arabic word for "God" (usually as the single-character ligature ﷲ) in Arabic names that contain "Allah", such as ޢަބްދިﷲ & ޢަބްދުﷲ (both seem to be used as forms of "Abdallah"). In other places where Dhivehi and Arabic text come together, they don't seem to intentionally be one word. Neither the desirable nor undesirable mixed script tokens seem to be very common, and ﷲ by itself is uncommon, so I'm not too worried about this case.
- Hiragana and Katakana are not treated as consistently as I would expect. Sometimes they get split into unigrams. Perhaps that's what happens when no word is recognized by the ICU tokenizer.
- The Arabic tatweel character is used to elongate Arabic words for spacing or justification. It also seems to be labelled as "Common" script by the ICU tokenizer, despite being very Arabic. As such, it picks up a script label from nearby words, like numbers do. The odd result is that in a string like "computerـەوە" (which, despite however it is displayed here, is "computer" + tatweel (probably for spacing) + Sorani/ckb grammar bits in Arabic script) the tatweel gloms on to computer, giving computerـ. In the Sorani analysis chain, the tatweel gets stripped later anyway (as it does in several other Arabic-script analysis chains).
- In the abstract, this falls into the same category as µ and ˈ, but in the most common cases it gets used in places where it will eventually be deleted.
- The ICU tokenizer deletes ©, ®, and ™. The
standard
tokenizer classifies these as <EMOJI>
and splits them off from words they may be attached to, which seems like the right thing to do. The ICU tokenizer ignores them entirely, which doesn't seem horrible, either, though doing so will make them unsearchable with quotes (e.g., searching "MegaCorp™" (with quotes) won't find instances with ™.. you'll have to use insource
).
- Empty tokens can occur—though this is not specific to the ICU tokenizer. Certain characters like tatweel and Hangul Choseong Filler (U+115F) and Hangul Jungseong Filler (U+1160) can be parsed as individual tokens if they appear alone, but are deleted by the
icu_normalizer
, leaving empty tokens. We can filter empty tokens—and we do after certain filters, like icu_folding
, where they seem more likely to pop up, but we don't currently filter them everywhere by default.
- Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E) are split points in the
standard
tokenizer. This is not so good because they are diacritics that are placed on a vowel for stress (emphasis mark) or just the last vowel of the word they apply to (exclamation and question marks). So, it's as if English wrote questions like "What happeneˀd". The ICU tokenizer doesn't split on them, which is better, but dropping them seems to be an even better option.
- The
standard
tokenizer has a token length limit of 255. Tokens longer than that get truncated. The ICU tokenizer's length limit is 4096, so the rare Very Long Token is allowed by the ICU tokenizer.
- Language-specific diacritics can glom on to foreign script characters: ó್ (Kannada virama), a् (Devanagari virama), gੰ (Gurmukhi tippi).. on the other hand, "Latin" diacritics glom on to non-Latin scripts, too: უ̂ (Georgian + circumflex), ܪ̈ (Syriac + diaeresis), 𐌹̈ (Gothic + diaeresis). There's no obvious better thing to do if someone decides to use something like this to notate something or other—it's just confusing and looks like a probable error when it pops up. When it is an error (a Latin character popping up in the middle of a Kannada word, for example), there's no real right thing to do... garbage in, garbage out. The
standard
tokenizer splits on spaces and gives a garbage token; the ICU tokenizer splits on spaces and scripts, and gives three garbage tokens. (The extra garbage tokens are arguably an extra expense to index, but the numbers are minuscule.)
- The
standard
tokenizer splits on a couple of rare and archaic Malayalam characters, the vertical bar virama ഻ (U+0D3B) and circular virama ഼ (U+0D3C). They are so rare that my generally extensive font collection doesn't cover them, so I had an excuse to install a new font, which always makes for a fun day! The vertical bar virama is so rare that I couldn't find any instances on the Malayalam Wikipedia, Wiktionary, or Wikisource! (with a quick insource
regex search) Anyway, the ICU tokenizer keeps these viramas and makes better tokens, even if they do look like tofu on non–Malayalam readers' screens.
Analysis Updates to Make
editBased on all of the above, these are the changes I plan to make.
- Add middle dot (· U+00B7) to
word_break_helper
. The standard
tokenizer doesn't split on them, either.
- Delete primary (ˈ U+02C8) and secondary (ˌ U+02CC) stress markers, since they are inconsistently used across phonetic transcriptions, and the
icu_tokenizer
will generate tokens that are just stress marks. (Out of 938 examples with stress marks across my Wikipedia samples, all but two cases of either stress mark appear to be IPA, and one of the other two was clearly a non-IPA pronunciation.)
- The remaining inconsistency would come from using apostrophe (' U+0027) as a primary stress mark, but that already doesn't match (ˈ U+02C8), so it isn't any worse than the status quo.
- Delete tatweel (ـ U+0640) everywhere early on. This shouldn't cause any problems, but it would require checking to be sure.
- This might be a chance to merge some small, universal mappings (e.g., nnbsp_norm, apostrophe_norm) into one mapping to save on overhead. (See globo_norm below.)
- Update the Armenian character map to delete Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E).
- On second thought, we can apply this globally. These characters occur in small numbers on non-Armenian wikis, usually in Armenian text. Global application as part of
globo_norm
(see below) would have a low marginal cost and would make the affected Armenian words on those wikis more searchable.
- Convert Arabic thousands separator (٬ U+066C) to comma. Might as well convert Arabic comma (، U+060C) while we are here, too.
- Convert Arabic digits to Thai digits in the Thai analyzer, so the digit hack we have there can do the right thing without
icu_token_repair
"fixing" the wrong thing.
- Convert µ (micro sign) to μ (Greek mu) because it is the most commonly used "symbol" that gets split off of tokens in my samples.
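As a concrete illustration of the Thai digit item above (this is a sketch, not the deployed config, which is generated by AnalysisConfigBuilder), the pre-conversion could be a tiny mapping char filter, here written as a Python dict in Elasticsearch settings form:

```python
# Hypothetical sketch: map ASCII digits to Thai digits before tokenization so that
# icu_token_repair can't re-glue number+Thai-word tokens; the Thai analyzer's
# existing decimal_digit token filter converts them back to 0-9 later in the chain.
arabic_to_thai_digits = {
    "type": "mapping",
    "mappings": [f"{d}=>{chr(0x0E50 + d)}" for d in range(10)],  # "0=>๐", ..., "9=>๙"
}
```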
It makes sense to create a globo_norm
char filter, combining small universal mappings: fold in nnbsp_norm
& apostrophe_norm
, delete primary (ˈ U+02C8) and secondary (ˌ U+02CC) stress markers, delete tatweel (ـ U+0640), convert Arabic thousand separator (٬ U+066C) and Arabic comma (، U+060C) to comma, delete Armenian emphasis marks (՛ U+055B), exclamation marks (՜ U+055C) and question marks (՞ U+055E), and convert µ (micro sign) to μ (Greek mu).
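As a sketch of the idea (not the exact deployed definition; the first two mappings below only stand in for the fuller nnbsp_norm and apostrophe_norm sets), globo_norm could be a single mapping char filter, spot-checked here with the _analyze API against a local Elasticsearch with the ICU plugin:

```python
import json
import requests

# Sketch of a combined "globo_norm" mapping char filter (illustrative, not the real config).
globo_norm = {
    "type": "mapping",
    "mappings": [
        "\u202f=>\u0020",  # narrow no-break space -> space (from nnbsp_norm)
        "\u2019=>'",       # right single quotation mark -> apostrophe (one of apostrophe_norm's mappings)
        "\u02c8=>",        # delete primary stress mark
        "\u02cc=>",        # delete secondary stress mark
        "\u0640=>",        # delete tatweel
        "\u066c=>,",       # Arabic thousands separator -> comma
        "\u060c=>,",       # Arabic comma -> comma
        "\u055b=>",        # delete Armenian emphasis mark
        "\u055c=>",        # delete Armenian exclamation mark
        "\u055e=>",        # delete Armenian question mark
        "\u00b5=>\u03bc",  # micro sign -> Greek mu
    ],
}

# Spot check a few of the mappings (assumes Elasticsearch with the ICU plugin on localhost).
resp = requests.post(
    "http://localhost:9200/_analyze",
    headers={"Content-Type": "application/json"},
    data=json.dumps({
        "tokenizer": "icu_tokenizer",
        "char_filter": [globo_norm],
        "text": "1٬234 ˈwɔːtər that՛s it",
    }),
)
print([t["token"] for t in resp.json()["tokens"]])
```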
Testing all those little changes showed nothing unexpected! Whew!
Timing Tests
editMy usual procedure for timing tests involves deleting all indices in the CirrusSearch docker image on my laptop and then timing bulk loading between 1,000 and 5,000 documents, depending on language.◊ There is definitely variation in load times, so this time I ran five reloads and averaged the four fastest for the timing of a given config. Successive reloads are often very similar, though having one outlier is common. Re-running the same config 20 minutes later can easily result in variation in the 1–2% range, which is as big as some of the effect sizes we are looking at, so all of the seemingly precise numbers should be taken with a grain of salt.
◊ The bulk load can only handle files less than 100MB in a single action. Some of my exports are limited by number of documents (max 5K, though I often aimed for a smaller sample of 2.5K), and some are limited by file size (because the average document length is greater for a given wiki or because the encoding of the script uses more bytes per character).
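For reference, the reload-timing loop is roughly the following sketch (assuming a local Elasticsearch, a hypothetical test index already created with the analysis config under test, and a pre-generated NDJSON bulk export):

```python
import time
import requests

ES = "http://localhost:9200"     # assumed local Elasticsearch/CirrusSearch instance
INDEX = "timing_test"            # hypothetical index, created beforehand with the config under test
EXPORT = "wiki-sample.ndjson"    # hypothetical bulk export (action line + document line pairs)

def one_reload() -> float:
    """Empty the test index, then time a single bulk load of the export."""
    requests.post(f"{ES}/{INDEX}/_delete_by_query?refresh=true",
                  json={"query": {"match_all": {}}}).raise_for_status()
    with open(EXPORT, "rb") as f:
        body = f.read()
    start = time.monotonic()
    resp = requests.post(f"{ES}/{INDEX}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return time.monotonic() - start

# Five reloads; report the average of the four fastest, as described above.
times = sorted(one_reload() for _ in range(5))
print(f"average of 4 fastest: {sum(times[:4]) / 4:.1f}s  (all: {[round(t, 1) for t in times]})")
```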
I ran some timing tests for different languages, comparing the old config and the new config for that specific language, and using data from a recent dump for the corresponding Wikipedia (1K–5K articles, as above). When sorted by load time increases, the languages fall into nice groupings, based on how much has changed for the given language analyzer—though there is still a fair amount of variation.
Δ Load Time | Wikipedia Sample | Analyzer Notes |
3.20% | zh/ Chinese | uses other tokenizer for text field; already uses icu_tokenizer elsewhere; add icutokrep_no_camel_split |
7.23% | ja/ Japanese | |
7.76% | he/ Hebrew | uses other tokenizer for text field; adding icu_tokenizer elsewhere, plus icu_token_repair + camel-safe version |
8.97% | ko/ Korean | |
11.89% | th/ Thai | already uses icu_tokenizer in text field and elsewhere; adding icu_token_repair + camel-safe version |
18.40% | bo/ Tibetan | |
19.86% | de/ German | introducing icu_tokenizer in text field and elsewhere; adding icu_token_repair + camel-safe version |
22.48% | hy/ Armenian | |
23.08% | ru/ Russian | |
23.62% | en/ English | |
24.73% | it/ Italian | |
25.61% | ar/ Arabic | |
27.24% | fr/ French | |
27.63% | es/ Spanish |
Roughly, indexing time has increased by 7–12% for analyzers using the ICU tokenizer, and 20–27% for those that switch from the standard
tokenizer to the ICU tokenizer. That's more than expected based on my previous timing tests—which I realized is because the earlier tests only covered adding ICU token repair to the text fields, not the cost of the ICU tokenizer itself... it's a standard option, so it shouldn't be wildly expensive, right? (When working on the current config updates, I realized I should also take advantage of the opportunity to use the icu_tokenizer
everywhere the standard
tokenizer was being used, if icu_token_repair
is available, which further increased the computational cost a bit.)
English, at ~24% increase in indexing time, is close to the median for analyzers among the eight largest wikis (Arabic, Armenian, English, French, German, Italian, Russian, Spanish) that are getting the ICU tokenizer for the first time. So, I did a step-by-step analysis of the elements of the config update in the English analysis chain to see where increased load time is coming from.
My initial analysis was done over a larger span of time, and the stages were done not in any particular order, as I was still teasing out the individual pieces of the upgrade that needed accounting for. I later re-ran the analysis with only two reloads per stage, and re-ran all stages in as quick succession as I could. I also ran two extra reloads for the beginning and ending stages (i.e., the old and new configs) at the beginning and end of the re-run timings, and they showed 1–1.5% variability in these timings taken ~30 minutes apart.
The table below shows the somewhat idealized, smoothed merger of the two step-by-step timing experiments. As such, the total of the stages shown adds up to 24.5%, rather than the 23.62% in the table above.
In summary, for English, using the ICU tokenizer instead of the standard
tokenizer everywhere (3 & 4) is about 7% of the increase (it does more, so it costs more); adding ICU token repair to the text
field (6) is about 4% (below previous estimates); adding ICU token repair to the plain
field (7) is 9.5%; adding ICU token repair to the suggest
, source_text_plain
, and word_prefix
fields (8) is only about 2.5%. Merging character filter normalizers (2) is a net gain in speed (–2.5%), though we spend some of it (+1.5%) by increasing the number of characters we want to normalize (5).
Stage | vs baseline | vs previous stage | Full re-index estimate
---|---|---|---
(1) Baseline: old config with standard tokenizer | — | — | 14–17.5 days
(2) Combine nnbsp_norm and apostrophe_norm | –2.5% | –2.5% |
(3) Switch from standard tokenizer to icu_tokenizer | +7% | +9.5% |
(4) Switch from icu_tokenizer to textify_icu_tokenizer | +7% | 0% |
(5) Add other character mappings | +8.5% | +1.5% |
(6) Enable icu_token_repair for text field | +12.5% | +4% |
(7) Enable icu_token_repair for text & plain fields | +22% | +9.5% |
(8) Enable icu_token_repair for text, plain, suggest, source_text_plain, word_prefix fields | +24.5% | +2.5% | 17.4–21.8 days
In recent times, we've said that reindexing all wikis takes "about two weeks", though it can be quite a bit longer if too many indexes fail (or large indexes like Commons or Wikidata fail). It may be a little longer than two weeks because no one minds when reindexes run a little longer overnight or over the weekend, and since we don't babysit them the whole time, we might not notice if they take a day or two longer than exactly two weeks. To put the increased load time in context: estimating "about two weeks" as 14–17.5 days, a ~24% increase would be 17.4–21.7 days, or "about 2½ to 3 weeks".↓
↓ The overall average index time increase should be a little bit less, since some large wikis (Chinese, Japanese, Korean) have smaller increases (~3–9%). We shall see!
Next Steps
edit- ✔ Get the
icu_token_repair
plugin ("textify", which also features acronym and camelCase processing) through code review. - ✔ Deploy the textify plugin.
- ✔ Configure the AnalysisConfigBuilder to be aware of the textify plugin and enable the various features appropriately (ICU tokenizer, ICU token repair, acronyms, camelCase, etc.).
- This includes testing the ICU tokenizer itself. For the ICU token repair, I enabled the ICU tokenizer almost everywhere as a baseline. I probably won't actually want to enable it where we currently use other custom tokenizers.
- Note to Future Self: I'm going to have to figure out the current Thai number splitting filter, which was put in place because it was a feature of the Thai tokenizer that the ICU tokenizer didn't have. ICU token repair can, in the right contexts, reassemble some of the number+Thai word tokens we tried to avoid creating. I'll have to see whether not splitting numbers or not repairing tokens is the best compromise for Thai. (Or, updating
icu_token_repair
to be aware of this issue for Thai.. and Lao, and others...)
- Reindex all the wikis and enable all this great stuff (T342444)—and incidentally take care of some old index issues (T353377).
- Try not to think too hard about whether I should have gone and just learned the ICU tokenizer Rule-Based Break Iterator syntax and spec rather than treating the ICU tokenizer as a black box. (I'm sure there's a conversation I had in a comment section somewhere that stated or implied that those rules wouldn't be able to solve the split-token problem... so I probably did the right thing....)
Background
editSee the Data Collection and "Infrastructure" sections above for more background.
To summarize, I took large samples of queries from 90 "reasonably active" wikis, and created samples of approximately 1000 queries for evaluation. Those queries are run daily, and each day's stats can be compared to the previous day's stats.
The daily run after reindexing for a given wiki lets us see the impact of the new analysis chain on that wiki by comparing it to the previous day. We also have daily runs for several days before and after reindexing to gauge what the typical amount of variability ("noise") is for a given wiki.
Examples of noise:
- The zero-results rate is very stable from day to day for a fixed set of queries for virtually all wikis. Occasionally ZRR drifts slowly (usually downward, as the amount of on-wiki text a query can match tends to grow over time), but rarely by more than 0.1% over a couple of weeks.
- The percentage of queries that increase their result counts day-to-day rarely gets over 1% for some wikis, while for others it can reliably be over 20% every day.
- Similarly, the percentage of queries that change their top result day-to-day is usually under 1%, but for some wikis it is typically 3–5%.
- Note that index shards may play a role in top result changes for large, active wikis, as a different shard may serve the query on a different day. When there's no obvious best result (i.e., for mediocre queries), the top result can change from shard-to-shard because the shards have slightly different term weight statistics.
- Results lower down the list may also swap around more easily, even for good queries with an obvious top result, but that doesn't affect our stats here. Smaller wikis only have one shard, and so are less prone to this kind of search/re-search volatility.
- English Wikipedia is very active and somewhat volatile. Over a 3-week period, 21.2–27.9% of queries in a sample of 1000 increased their number of results returned every day. 3.1–8.3% changed their top result every day.
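To be concrete about what gets compared: each daily run records, per query, the number of results and the top result, and the day-over-day comparison produces the three numbers used throughout these notes. A rough sketch (assuming the same fixed query set on both days):

```python
def daily_diff(yesterday: dict, today: dict) -> dict:
    """Each input maps query -> (result_count, top_result_id) for one day's run.
    Returns today's zero-results rate, plus the share of queries that got more
    results than yesterday and the share whose top result changed."""
    n = len(today)
    return {
        "zrr": sum(1 for count, _ in today.values() if count == 0) / n,
        "more_results": sum(1 for q in today if today[q][0] > yesterday[q][0]) / n,
        "top_changed": sum(1 for q in today if today[q][1] != yesterday[q][1]) / n,
    }
```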
Samples: Relevant and "Irrelevant"
editThere is a lot of data to review! We have a general sample for each language wiki, plus "relevant" samples (see Relevant Query Corpora above) for as many languages as possible for each of acronyms, apostrophes, camelCase, word_break_helper
, and the ICU tokenizer (with token repair). Relevant samples have a feature, and their analysis results changed when the config changed. (For example, a word with curly apostrophes, like aujourd’hui, would change to a word with straight apostrophes—aujourd'hui—so that counts as "relevant".)
I also held on to the "irrelevant" queries—those that had a feature but didn't change their analysis output for a given set of changes. For example a word with a straight apostrophe didn't change with apostrophe handling, but it might now match a word on-wiki that had a curly apostrophe. (I originally downplayed this a bit, but I think that not all wikis have managed to be as strict about on-wiki content as the bigger wikis, and some non-standard stuff always slips through.)
And of course in some cases queries without a given feature could match on-wiki text with that feature. For example, after the updates, the query Fred Rogers could match FredRogers or fred_rogers in an article, hence the general sample for each language.
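In practice, "relevant" vs. "irrelevant" comes down to whether a query's analysis output changes between the old and new configs. A minimal sketch of that check, using the _analyze API against two test indexes (the index names here are hypothetical, and I'm assuming the main text analyzer is exposed under the name "text"):

```python
import requests

ES = "http://localhost:9200"  # assumed local Elasticsearch instance

def analyze(index: str, text: str, analyzer: str = "text") -> list[str]:
    """Return the token stream for `text` under the named analyzer on `index`."""
    resp = requests.post(f"{ES}/{index}/_analyze",
                         json={"analyzer": analyzer, "text": text})
    resp.raise_for_status()
    return [t["token"] for t in resp.json()["tokens"]]

def is_relevant(query: str) -> bool:
    """A query counts as 'relevant' if its analysis changes under the new config."""
    return analyze("old_config_index", query) != analyze("new_config_index", query)
```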
Earlier Reindexing: A False Start
editThe early regex-based versions of acronym handling and camelCase handling were waaaaaaay tooooooo sloooooooow, and reindexing was stopped after it got through the letter e.
The new plugin-based versions of acronym handling and camelCase handling have almost identical results (they can handle a few more rare corner cases that are really, really far into the corner), and the apostrophe normalization and word_break_helper
changes are functionally the same.
As a result, most general samples for wikis starting with a–e and feature-specific samples for everything but the ICU tokenizer upgrade (again, for wikis starting with a–e) are mostly unchanged with the most recent reindexing. I have daily diffs from the time of the earlier reindexing, though, so I will be looking at both.
Heuristics and Metrics
editBecause there is so much data to review, I'm making some assumptions and using some heuristics to streamline the effort.
Increased recall is the general goal, so any sample that has a net measurable decrease in the zero-results rate (i.e., by at least 0.1%) the day after re-indexing counts as a success. If there's no change in ZRR, then a marked increase in the number of queries getting more results counts as a somewhat lesser success. If there is no increase in the number of queries getting more results, I noted a marked increase in the number of queries changing their top result as a potential precision success (though it requires inspection).
Since the typical daily change in increased results varies wildly by language, I'm looking for a standout change the day after reindexing. Some are 5x the max for other days and thus pretty obvious, others are a bit more subjective. I'm counting a marked increase as either (i) a change of more than ~5 percentage points over the max (e.g., 20% to 25%) or (ii) more than ~1.5x the max (e.g., 1.2% to 1.8%), where the max is from the ~10 days before and after reindexing.
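A rough encoding of that heuristic (not the exact script used; the thresholds are deliberately fuzzy):

```python
def marked_increase(day_after: float, surrounding_days: list[float]) -> bool:
    """True if the day-after-reindexing value stands out against the max of the
    ~10 days before and after: more than ~5 percentage points above it, or more
    than ~1.5x it. All values are percentages of queries in the sample."""
    peak = max(surrounding_days)
    return day_after > peak + 5 or day_after > peak * 1.5

# e.g., 20% -> 25.1% (absolute jump) and 1.2% -> 1.9% (relative jump) both count
assert marked_increase(25.1, [18.0, 19.5, 20.0])
assert marked_increase(1.9, [0.9, 1.0, 1.2])
```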
Sometimes ZRR increases, which is not what we generally want. But in those cases, I'm looking at the specifics to see if there is something that makes sense as an increase in precision. That tends to happen with an analysis change that prevents things from being split apart. For example, before the ICU tokenizer, 维基百科 is split into four separate characters, which could match anywhere in a document; with the ICU tokenizer, it's split into two 2-character tokens (维基 & 百科), so it may match fewer documents, but presumably with higher precision/relevance. Similarly, pre-acronym-handling, N.A.S.A. matches any document with an n and s in it (a is a stop word!).. either of which can show up as initials in references, for example. So, treating W.E.I.R.D.A.C.R.O.N.Y.M. as "weirdacronym" and getting no results is actually better than treating it as a collection of individual letters that matches A–Z navigation links on list pages willy-nilly.
Also, when ZRR increases, I will also note marked increases in the number of queries getting more results or marked increase in the number of queries changing their top result, so the total tagged samples can add up to more than 100%.
I'm putting samples with fewer than 30 queries into the "small sample" category. They are interesting to look at, but will not count (for or against) the goal of reducing the zero-results rate and/or increasing the number of results returned for 75% of languages with relevant queries. I filtered as much junk as I could from the samples, but I can't dig into every language to sort reasonable queries (FredRogers) from unreasonable queries (FrrrredddЯojerzz was here). Small samples (especially < 5) can easily be dominated by junky queries I couldn't or didn't have time to identify.
I'm going to ignore "small" "irrelevant" samples with no changes... a tiny sample where nothing happened is not too interesting, so I didn't even annotate them.
General Sample Results
editIn the general larger samples (~1K queries) from each of 90 Wikipedias, we had the following results (net across acronym handling, apostrophe normalization, camelCase handling, word_break_helper
changes, and adding ICU tokenization with multiscript token repair):
The day after reindexing...
- 59 (66%) had their zero-results rate decrease
- 22 (24%) had the number of queries getting more results increase
- 6 (7%) had their zero-results rate increase
- 6 (7%) had no noticeable change in ZRR, number of results, or change in top result
The total is more than 90 (i.e., more than 100%) because a handful of wikis are in multiple groups. A few wikis that had their ZRR increase also had the number of queries getting more results increase, for example. Esperanto wiki (eo) was one of the wikis that was caught up in the earlier reindexing false start. The earlier reindexing caused a decrease in ZRR, while the later reindexing caused an increase in ZRR, so I just counted it once for each.
Of the six wikis that had their zero-results rate increase, the queries that got no results fell into a few overlapping groups:
- English, Esperanto, Italian, and Vietnamese all had Chinese or Japanese text that was no longer being broken up into single characters, so that's a likely increase in precision (and a drop in recall).
- The English and Italian language configs (which apply to the English, Simple English, and Italian Wikipedias) previously had
aggressive_splitting
enabled, which would also split words with numbers (e.g., 3D would become 3, D). Disabling that generally improved precision (with a related decrease in recall).
- Russian had a query using an acute accent like an apostrophe. The acute accent caused a word break with the old config, while it was converted to an apostrophe with the new config, creating a single longer token. (This is not a clear win, but Russian is also in the group that had the number of queries getting more results increase.)
So, five of the six cases of ZRR increases reflect improved precision, and the sixth had an increase in the number of results, so I'm calling all of those good. That leaves only six samples with no changes: Belarusian-Taraškievica, Hebrew, Japanese, Khmer, Kurdish, and Chinese (be-tarask, he, ja, km, ku, zh).
Thus 93% of general samples show positive effects overall from the combined harmonization efforts so far.
Note: The English general sample really highlights the limitation of looking at a single number (though that is all that is generally feasible with so much data and so little time to analyze it). Looking at everything but the ICU tokenizer (earlier reindexing), ZRR decreased. With only the ICU tokenizer, ZRR increased. General samples starting with f–z don't have this split, but still have the same ZRR tension internally. In the case of those where the decreased ZRR won out, I didn't look any closer. In the cases where increased ZRR dominated, I did, and most were cases of improved precision dampening recall. There are likely many more cases of improved precision dampening recall, but improved recall winning out overall in these samples.
Reminders & Results Key
edit- The queries in a relevant sample showed changes in analysis to the query itself. In the irrelevant sample the queries had the relevant feature (e.g., something that matched an acronym-seeking grep regex) but had no change in analysis to the query itself—so there might be interesting changes in matches, but we aren't particularly expecting them.
- Samples that are small have fewer than 30 queries in them, and are reported for the sake of interest, but are not officially part of the stats we are collecting. Small samples with no changes (noΔ) are noted in brackets, but are not counted as part of the "small" total, since the reason nothing interesting happened is likely that the samples are too small.
- Net zero-results rate decreases (ZRR↓) are assumed to be a sign of good things happening.
- Net zero-results rate increases (ZRR↑) require investigation. They could be bad news, or they could be the result of restricting matches to improve precision (no result is better than truly terrible nonsensical results). Zero-results rate increases that are most likely a net improvement will be re-counted under ZRR↑+, and discussed in the notes.
- If ZRR did not go down (including ZRR going up—ZRR↑), but the number of queries getting more results markedly increased (Res↑), that is assumed to be a sign of somewhat lesser good things happening—but good things nonetheless.
- If ZRR did not go down, and the number of queries getting more results did not go up, but the number of queries changing their top result (topΔ) markedly increased, those samples require investigation as possibly improved results. The same or fewer results, but with changed ranking could be the result of restricting or re-weighting matches to improve precision. Top result changes that are most likely the result of improvements in ranking or increased precision will be re-counted under topΔ+, and discussed in the notes.
Acronym Results
editThe table below shows the stats for the acronym-specific samples.
rel | rel (sm) | irrel | irrel (sm) | |
total | 40 | 26 | 1 | 0 |
ZRR↓ | 27 (68%) | 5 (19%) | ||
Res↑ | 7 (18%) | 20 (77%) | 1 (100%) | |
topΔ | 2 (5%) | 1 (4%) | ||
— topΔ+ | 1 (3%) | |||
ZRR↑ | 4 (10%) | |||
— ZRR↑+ | 4 (10%) | |||
noΔ | [14] |
Notes:
- Zero-results rate increases (ZRR↑) look like generally improved precision for English and Italian (though there are also a few queries from each that are affected by no longer splitting a1ph4num3ric words). The Korean queries are 40% improved precision, 20% improved recall, and 40% junk, so overall an improvement. Chinese are similar: 60% improved precision, 20% improved recall, 20% junk.
- The two acronym samples with changes to the top result are Simple English and Thai. Simple English has some good improvements, but also a lot of garbage-in-garbage-out queries in this sample, so I'm calling it a wash. The Thai examples I looked at seem to be more accurate. Thai acronyms seem unlikely to get zero results because each character has a good chance of occurring somewhere else. English acronyms on Thai Wikipedia, though, don't necessarily match as well. For example, the query R.M.S Lusitania originally only matched the article for the RMS Olympic (which mentions the Lusitania, and has individual R, M, and S in it); the de-acronymized RMS Lusitania matches the English name in the opening text of the correct Thai article.
- Irrelevant acronym queries are weird. There really shouldn't have been any, because acronyms are de-acronymized. The only way it happens is if both the acronym parts and the de-acronymized whole are removed from the query. This can happen in a few ways.
- In several languages, there were just one or two queries with a backslash that I didn't properly escape during the relevant/irrelevant determination. This caused an Elastic error, resulting in 0 tokens. With the new config, there is still an error, still resulting in 0 tokens.
- In Italian, there are non-Italian acronyms, like L.A. where the individual letters (l and a) and the de-acronymized token (la) are all stop words, so they generate no tokens!
- Polish is somewhat similar, in that all of the unchanged queries contain Sp. z o.o., which is the equivalent to English LLC. Without acronym processing or
word_break_helper
, o.o. is a stop word. With word_break_helper
, each individual o is a stop word. With acronym processing, the two-letter oo gets removed by a filter set up to tamp down on the idiosyncrasies of the statistical stemmer. In all cases, no tokens are generated, so there are no changes.
- Polish is somewhat similar, in that all of the unchanged queries contain Sp. z o.o., which is the equivalent to English LLC. Without acronym processing or
- In Japanese, there are a couple of multiscript acronyms, like N.デ., which is split up by the tokenizer it uses (the
standard
tokenizer).
- Thai is somewhat similar, with a de-acronymized Thai/Latin token split by the ICU tokenizer, which is a combo that is not allowed to be repaired by
icu_token_repair
. (The other Thai example isn't actually real because my acronym-seeking regex failed.. Thai is hard, man!)
- Thai is somewhat similar, with a de-acronymized Thai/Latin token split by the ICU tokenizer, which is a combo that is not allowed to be repaired by
- The Chinese tokenizer (
smartcn_tokenizer
) is even more aggressive and splits up any non-Chinese/non-ASCII characters into individual tokens, so Cyrillic, Thai, and Latin mixed ASCII/non-ASCII de-acronymized tokens get broken up (along with regular Cyrillic, Thai, and Latin mixed ASCII/non-ASCII words). Parsing of Chinese characters ignores punctuation, so acronyms and de-acronymized tokens are usually treated the same. (Acronym processing still works on all-Latin acronyms, like N.A.S.A.)
Apostrophe Normalization Results
editThe table below shows the stats for the apostrophe normalization–specific samples.
rel | rel (sm) | irrel | irrel (sm) | |
total | 16 | 23 | 25 | 4 |
ZRR↓ | 10 (63%) | 16 (70%) | 11 (44%) | |
Res↑ | 4 (25%) | 6 (26%) | 9 (36%) | 4 (100%) |
topΔ | 1 (6%) | 1 (4%) | ||
— topΔ+ | 1 (6%) | |||
ZRR↑ | 5 (31%) | 1 (4%) | 5 (20%) |
— ZRR↑+ | 4 (25%) | |||
noΔ | [25] |
- Czech had both an increase in the zero-results rate and an increase in changes to the top result. Almost all relevant queries used an acute accent (´) for an apostrophe, which causes a word split without apostrophe normalization. In a query like don´t know why, searching for don't, know, and why is obviously better than searching for don, t, know, and why (and not being able to match don't). These are precision improvements for both topΔ and ZRR↑.
- German's net increase in ZRR is a mix of recall and precision improvements, all from acute (´) and grave (`) accents being used mostly for apostrophes. A small number are typos or trying to use the accent as an accent, but most, like l`auberge and l´ame are clearly apostrophes.
- Spanish and Portuguese are the same.
- English increase in ZRR comes down to one example, hoʻomū, which uses a modifier letter turned comma instead of an apostrophe. Without apostrophe normalization, it gets deleted by
icu_folding
and matches hoomu. Looking at the top result changes for English, I see several modifier letters (modifier letter turned comma, modifier letter apostrophe, modifier letter reversed comma, modifier letter right half ring) used as apostrophes. In English-language queries, the change to apostrophe is a help (jacobʻs ladder). In non-English names and words (on English Wikipedia) it's a mixed bag. For example, both Oahu and O'ahu are used to refer to the Hawaiian island; without apostrophe processing, oʻahu matches Oahu (deleted byicu_folding
), but with apostrophe processing it matches O'ahu. Visually, I guess matching the version with an apostrophe is better. I think this is an improvement, but we were looking at ZRR, not top result changes, so I'll call this one a miss.
CamelCase Results
editThe table below shows the stats for the camelCase–specific samples.
rel | rel (sm) | irrel | irrel (sm) | |
total | 27 | 38 | 2 | |
ZRR↓ | 27 (100%) | 34 (87%) | ||
Res↑ | 4 (10%) | 2 (100%) | ||
topΔ | ||||
— topΔ+ | ||||
ZRR↑ | ||||
— ZRR↑+ | ||||
noΔ | [13] |
CamelCase is an easy one, apparently!
Italian and English had some minor activity, but since they both had aggressive_splitting
enabled before, they had no camelCase-related changes. Italian has one query with both camelCase and an a1ph4num3ric word, and the change in the a1ph4num3ric word was actually the cause of the change in results.
ICU Tokenization Results
editThe table below shows the stats for the samples specific to ICU tokenization (with ICU token repair).
rel | rel (sm) | irrel | irrel (sm) | |
total | 24 | 24 | 31 | 11 |
ZRR↓ | 8 (33%) | 4 (17%) | 7 (23%) | 7 (64%) |
Res↑ | 1 (4%) | 16 (52%) | 2 (18%) | |
topΔ | 1 (4%) | 1 (3%) | 1 (9%) | |
— topΔ+ | ||||
ZRR↑ | 16 (67%) | 18 (75%) | 16 (52%) | 1 (9%) |
— ZRR↑+ | 16 (67%) | |||
noΔ | 6 (19%) |
In all cases where the zero-results rate increased for relevant queries, the queries are Chinese characters, or Japanese Hiragana, which the standard tokenizer splits up into single characters and which the ICU tokenizer parses more judiciously, so these are all precision improvements. Most of them also feature recall improvements, from Thai, Myanmar, or Japanese Katakana—which the standard tokenizer lumps into long strings—being parsed into smaller units; or, from incompatible mixed-script tokens (i.e., spacing errors) being split up.
word_break_helper
Results
edit
The table below shows the stats for the word_break_helper
–specific samples.
rel | rel (sm) | irrel | irrel (sm) | |
total | 49 | 16 | 25 | 17 |
ZRR↓ | 39 (80%) | 8 (50%) | 14 (56%) | 1 (6%) |
Res↑ | 6 (12%) | 4 (25%) | 7 (28%) | 15 (88%) |
topΔ | 3 (6%) | 4 (25%) | 1 (6%) | |
— topΔ+ | 1 (2%) | |||
ZRR↑ | 4 (16%) | |||
— ZRR↑+ | ||||
noΔ | 1 (2%) | [15] |
Two of the three with changes to the top result seemed to be mediocre queries (Marathi and Myanmar, but in English/Latin script!) that had random-ish changes. The other (Malayalam, though also largely with English queries) showed some fairly clear precision improvements.
Summary Results
editBelow are the summary results for the general samples and the relevant non-"small" samples for each feature. net+ shows the percentage that had a positive impact (ZRR↓, Res↑, topΔ+, or ZRR↑+). Note that the categories can overlap (esp. ZRR↑(+) with Res↑ or topΔ(+)), so net+ is not always a simple sum.
ZRR↓ | Res↑ | topΔ | topΔ+ | ZRR↑ | ZRR↑+ | noΔ | net+ | |
general | 66% | 24% | — | — | 7% | 6% | 7% | 93% |
acronym | 68% | 18% | 5% | 3% | 10% | 10% | — | 98% |
apostrophe | 63% | 25% | 6% | 6% | 31% | 25% | — | 94% |
camelCase | 100% | — | — | — | — | — | — | 100% |
ICU tokens | 33% | — | — | — | 67% | 67% | — | 100% |
wbh | 80% | 12% | 6% | 2% | — | — | 2% | 94% |
Our target goal of 75% improvement in either zero-results rate (ZRR↓, preferred) or total results returned (Res↑) held true for all but the introduction of ICU tokenization, where we expected (and found) improvements in both recall and precision, depending on which "foreign" script a query and the on-wiki text are in.
For each collection of targeted queries for a given feature, we saw an improvement across more than 90% of query samples.
For the general random sample of Wikipedia queries, we also saw improvement across more than 90% of language samples, mostly in direct improvements to the zero-results rate—indicating that the features we are introducing to all (or almost all) wikis are useful upgrades!
Background
editSome Turkic languages (like Turkish) use a different uppercase form of i and a different lowercase form of I, so that the upper/lowercase pairs are İ/i & I/ı. It's common in non-Turkish wikis to see İstanbul, written in the Turkish way, along with names like İbrahim, İlham, İskender, or İsmail.
On Turkish wikis, we want İ/i & I/ı to be lowercased correctly. The Turkish version of the lowercase
token filter does exactly that, so it is enabled for Turkish.
On non-Turkish wikis, we want everything to be normalized to i, since most non-Turkish speakers can't type İ or ı, plus, they may know the names without the Turkish letters, like the common English form of Istanbul.
The default lowercase
token filter lowercases both İ and I to i. However, the ICU upgrade filter, icu_normalizer
does not normalize İ to i. It generates a lowercase i with an extra combining dot above, as in i̇stanbul. The extra dot is rendered differently by different fonts, apps, and OSes. (On my Macbook, it looks especially terrible here in italics!) Sometimes it lands almost exactly on the dot of the i (or perhaps replaces it); other times it is placed above it. Sometimes it is a slightly different size, or a different shape, or slightly off-center. See the sample of fonts below.
Of course, by default search treats i and i-with-an-extra-dot as different letters.
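The extra dot is plain Unicode behavior and easy to reproduce outside Elasticsearch; Python's str.lower(), for example, applies the same Unicode full lowercase mapping to U+0130:

```python
import unicodedata

lowered = "İstanbul".lower()
print(lowered, len(lowered))                       # i̇stanbul 9  (one character longer!)
print([unicodedata.name(c) for c in lowered[:2]])  # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
print(lowered == "istanbul")                       # False: plain i vs. i + combining dot
```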
When unpacking the various monolithic Elasticsearch analyzers, I noticed this problem, and started adding dotted_i_fix
(a char_filter that maps İ to I, to preserve any camelCase or other case-dependent processing that may follow) to unpacked analyzers by default. I left dotted_i_fix
out of any analyzers that I only refactored.
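The fix itself is tiny. A sketch of a dotted_i_fix–style mapping char filter (the real definition is generated by our config code and may differ in detail):

```python
# İ (U+0130) -> I, applied as a char filter before lowercasing, so that later
# case-sensitive steps (camelCase splitting, etc.) still see an ordinary capital I.
dotted_i_fix = {
    "type": "mapping",
    "mappings": ["\u0130=>I"],
}
```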
In our default analyzer, we upgrade lowercase
to icu_normalizer
if ICU components are available, which means default wikis have the dotted-I problem, too, but not the fix.
İ ❤️ Turkıc Languages
editLooking at the English Wikipedia pages for İ and ı, it looks like exceptions should be made for Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar, and Karakalpak (az, crh, gag, kk, tt, kaa)—all Turkic languages—and that they should use Turkish lowercasing.
When testing, I saw that the Karakalpak results didn't look so great, and upon further research, I discovered that Karakalpak uses I/i and İ/ı (sometimes I/ı, and maybe formerly I/i and Í/ı)... it's listed on the wiki page for ı, but not İ, so I should have known something was up! I took it off the list.
The others now have minimally customized analyzers to use Turkish lowercase
—which is applied before the icu_normalizer
, preventing any double-dot-i problems.
Things Not to Do: Before settling on a shared custom config for Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar, I tried adding a limited character filter mapping I=>ı
and İ=>i
, which seems like it could be more lightweight than an extra lowercase
token filter (icu_normalizer
still needs to run afterward for the more "interesting" normalizations, so it's an addition, not a replacement). However, it can cause camelCase processing to split or not split incorrectly, including during ICU token repair, it can interact oddly with homoglyph processing, and the exact ordering was brittle, so it hurt more than it helped. Turkish lowercase
is just easier.
Global Custom Filters to the Rescue
editThe heart of the solution for non-Turkic languages is to add dotted_I_fix
to the list of Global Custom Filters, which are generally applied to every analyzer, modulo any restrictions specified in the config. A few character normalizations, acronym handling, camelCase handling, homoglyph normalization, and ICU token repair are already all on the list of Global Custom Filters.
I discovered that not only does lowercase
not cause the problem, but icu_folding
solves it, too—converting i-with-an-extra-dot to plain i. I hadn't given this a lot of thought previously because in my step-by-step unpacking analysis, I fixed the icu_normalizer
problem before enabling icu_folding
.
Letting icu_folding
handle İ instead of dotted_I_fix
causes a few changes.
- If
icu_folding
comes after stemming and stop word filtering, then words like İN are not dropped as stop words, and words like HOPİNG are not properly stemmed. This affects very few words across my samples in ~100 languages, and it is in line with what already happens to words like ın and hopıng, as well as unrelated diacriticized words like ín, ÎN, hopïng, HOPÌNG, ĩn, ĬN, hopīng, HOPǏNG, etc. - Rarely, homoglyph processing is thwarted. A word like КОПİЙОК, which is all Cyrillic except for the Latin İ, doesn't get homoglyphified, but that's because
homoglyph_norm
doesn't currently link Latin İ and Cyrillic І̇. - (These issues occur in lots of languages, though many of my examples here are just English.)
However, as icu_folding
needs to be custom configured per language, it isn't available everywhere, so dotted_I_fix
is still needed, especially in the default config.
There are also some unusual legacy configs out there! For example, Italian does not upgrade asciifolding
to icu_folding
. It got customized (along with English) more than 9 years ago, and I haven't worked on it since, so while I've refactored the code, I never had time to tweak and re-test the resulting config. (Another task, T332342 "Standardize ASCII-folding/ICU-folding across analyzers", will address and probably change this—but I didn't want to look at it now. Got to keep the bites of the elephant reasonably small, eh?)
Other languages with custom config but without ICU folding: Chinese, Indonesian, Japanese/CJK, Khmer, Korean, Malay, Mirandese, and Polish. These were mostly unpacked (many by me, some before I got here, like Italian) before adding icu_folding
became a standard part of unpacking. Mirandese has a mini-analysis chain with no stemmer and didn't get the full work up. These should all generally get dotted_I_fix
in the new scheme.
Sometimes we have asciifolding_preserve
, which we upgrade to preserve_original
+ icu_folding
, which then holds on to the i-with-an-extra-dot, which could improve precision in some cases, I guess. (Making a firm decision is left as an exercise for the reader (probably me) in T332342.)
Overall, removing dotted_I_fix
in cases where icu_folding
is available—which in the past has often been on larger wikis in major languages with Elasticsearch/Lucene analyzers available—may marginally improve efficiency.
So, with all that in mind, next comes the fun part: determining how to decide whether dotted_I_fix
should be applied in a given analyzer.
I originally tried excluding certain languages (Azerbaijani, Crimean Tatar, Gagauz, Kazakh, Tatar, and Turkish, obviously, but also Greek and Irish, because they have language-specific lowercasing). And I could skip enabling dotted_I_fix
if certain filters were used—but ascii_folding
is only upgraded if the ICU plugin is available.. except for Italian, which doesn't upgrade.. though that may change when T332342 gets worked on. Have I mentioned the epicycles?
Eventually I realized that (İ) if I moved enabling Global Custom Filters to be the very last upgrade step, then (ıı) I wouldn't have to guess whether lowercase
or ascii_folding
is upgraded, or (İIİ) try to maintain the long-distance tight coupling between language configs and the dotted_I_fix
config in Global Custom Filters (which can be tricky—Ukrainian sneaks in ICU folding only if it is unpacked, which is dependent on the right plugin being installed!).
Finally, all I actually had to do was block dotted_I_fix
on the presence of lowercase
or icu_folding
in the otherwise final analyzer config. Phew!
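A hypothetical Python rendering of that final check (the real implementation lives in the CirrusSearch config-building code; the helper and its arguments are made up for illustration):

```python
def wants_dotted_i_fix(analyzer: dict, custom_filters: dict) -> bool:
    """Apply dotted_I_fix only if neither plain `lowercase` nor `icu_folding` appears
    in the finished analyzer. `analyzer` is one analyzer definition; `custom_filters`
    maps custom filter names to their definitions, so renamed or derived filters can
    be resolved back to their underlying types."""
    filter_names = analyzer.get("filter", [])
    filter_types = {custom_filters.get(name, {}).get("type", name) for name in filter_names}
    return not ({"lowercase", "icu_folding"} & filter_types)
```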
Analysis Results
editOverall, the mergers in non-Turkic languages are what we want to see: İstanbul analyzed the same as Istanbul, istanbul, and ıstanbul.
For the most part, the mergers in Turkic languages are also good, though there are occasional "impossible" triples. Given English Internet and Turkish İnternet, which should merge with lowercase internet? On a Turkish wiki, Turkish İnternet wins by default.
(I also rediscovered that the Chinese smartcn_tokenizer
splits on non-ASCII Latin text, so as noted in my write up, fußball → fu, ß, ball.. or RESPUBLİKASI → respubl, İ, kasi (I also-also just realized that the smartcn_tokenizer
lowercases A-Z, but not diacriticized uppercase Latin.. and it splits on every diacritical uppercase letter, so ÉÎÑŠȚÏẼǸ → É, Î, Ñ, Š, Ț, Ï, Ẽ, Ǹ—yikes!) With dotted_I_fix
, RESPUBLİKASI does okay in Chinese, though.)
Extreme Miscellany
editOut of nowhere, I noticed that the generic code for Norwegian (no) doesn't have a config, though the two more specific codes—Norwegian Bokmål (nb) and Norwegian Nynorsk (nn)—do have configs. no.wikipedia.org
uses nb, and nn.wikipedia.org
uses nn (and nb.wikipedia.org
rolls over to no.wikipedia.org
). Both nb and nn have the same config. (There are nn-specific stemmers available, but they don't get used at the moment.) We probably won't use the no config, but since I figured all this out, I added it in anyway, because no, nb, and nn all using the same config is probably not ideal, but it is less dumb than nb and nn using the same config while no doesn't have any config at all!
Background
editA long time ago (2017) in a Phabricator ticket far away (T176197), we enabled a mapping from Japanese hiragana to katakana for English-language wikis, to make it easier to find words that could be written in either (such as "wolf", which could be either オオカミ or おおかみ, both "ōkami").
After much discussion on various village pumps, the consensus was that this wasn't a good idea on Japanese wikis, but would probably be helpful on others, and it got positive feedback (or at least no negative feedback) on French, Italian, Russian, and Swedish Wikipedias and Wiktionaries.
Since then, much water has flowed under the proverbial bridge—and many code and config changes have flowed through Gerrit—which has changed the situation and the applicability of the hiragana-to-katakana mapping in our current NLP context.
Back in the old days, under the rule of the standard
tokenizer, hiragana was generally broken up into single characters, and katakana was generally kept as a single chunk. On non-Japanese wikis, both are much more likely to occur as single words, so converting おおかみ (previously indexed as お + お + か + み) to オオカミ (indexed as オオカミ) not only allowed cross-kana matching, but also improved precision, since we weren't trying to match individual hiragana characters.
Recently, though, we upgraded to the ICU tokenizer, exactly because it is better at parsing Japanese (and Chinese, Khmer, Korean, Lao, Myanmar, and Thai) and usually works much better for those languages, especially on wikis for other languages (e.g., parsing Japanese, et al., on English, Hungarian, or Serbian wikis).
Enabling the Kana Map
editMy first test was to enable the previously English-only kana_map
character filter almost everywhere. Note that character filters apply before tokenization, and treat the text as one big stream of characters.
As above, it makes sense to skip Japanese—and I was considering not enabling kana_map
for Korean and Chinese, too, because they have different tokenizers—but the big question was how all the other languages behaved with it enabled.
The result was a bit of a mess. There were a lot of unexpected parsing changes, rather than just mappings between words written in different kana. Hmmmm.
I See You, ICU
editThe ICU tokenizer uses a dictionary of Chinese and Japanese words to parse Chinese and Japanese text. (The dictionary is available as a ~2.5MB plain text file. This may not be the exact version we are using on our projects since we are currently paused on Elastic 7.10 and its compatible ICU components, but there have probably not been any huge changes.)
One of the big changes I was seeing after enabling the kana_map
was that words with mixed kanji (Chinese characters used in Japanese) and hiragana were being parsed differently. As a random example, 内回り (two kanji and one hiragana) is in the ICU dictionary, so it is parsed as one token. However, the version with katakana, 内回リ (i.e., the output of kana_map
), is not in the dictionary, so it gets broken up into three tokens: 内 + 回 + リ.
Sometimes this kind of situation also resulted in the trailing katakana character being grouped with following characters. So, where we might have had something tokenized like "CCH+KK" before, with the hiragana-to-katakana mapping applied, we get "C+C+KK+K"... not only are words broken up, word boundaries are moving around. That's double plus ungood.
A New Hope.. uh.. New Plan
editSo, converting hiragana to katakana before tokenization isn't working out; what about converting it after tokenization?
This should have been easy, but it wasn't. There is no generic character-mapping token filter, though there are lots of language-specific normalization token filters that do one-to-one (and sometimes more complex) mappings, all of which we use for various languages. I looked at the code for a couple of them, and they all use hard-coded switch statements rather than a generic mapping capability, presumably for speed.
It took a fair amount of looking, but I found a well-hidden feature of the ICU plugin, the icu_transform
filter, which can link together various conditions and pre-defined transformations and transliterations... and Hiragana-Katakana is one of them! ICU Transforms are a generic and powerful capability, which means the filter is probably pretty expensive, but it would do for testing, for sure!
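A sketch of the test configuration: the analysis-icu plugin's icu_transform token filter, using the built-in Hiragana-Katakana transform:

```python
# Token filter from the analysis-icu plugin; "Hiragana-Katakana" is one of ICU's
# built-in transliterations. As a token filter it runs after tokenization, so it
# can't change how the ICU tokenizer segments the text.
hiragana_to_katakana = {
    "type": "icu_transform",
    "id": "Hiragana-Katakana",
}
```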
The results were... underwhelming. All of the parsing problems went away, which was nice, but there were very few cross-kana mappings, and the large majority of those were single letters (e.g., お (hiragana "o") and オ (katakana "o") both being indexed as オ).
I See You, ICU—Part 2
editSo that Chinese/Japanese dictionary that the ICU tokenizer uses... it has almost 316K words (and phrases) in it. Many have hiragana (44.5K), many have katakana (22K), a majority have hanzi/kanji (Chinese characters—287K), and—based on adding up those values—a fair number have various mixes of the three (including, for example, 女のコ—an alternate spelling of "girl"—which uses all three in one word!).
I pulled out the words with kana (58,921 of them) and converted all of the hiragana in that list to katakana, and then looked for duplicates, of which there were 2,268 (only 3.85%).
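A rough sketch of that check (assuming a local copy of the dictionary with one entry at the start of each line; the exact counts depend on how you define "contains kana"):

```python
from collections import Counter

def hira_to_kata(word: str) -> str:
    """Shift hiragana (U+3041–U+3096) to the corresponding katakana (U+30A1–U+30F6)."""
    return "".join(chr(ord(c) + 0x60) if 0x3041 <= ord(c) <= 0x3096 else c for c in word)

with open("cjdict.txt", encoding="utf-8") as f:  # hypothetical local copy of the ICU CJ dictionary
    words = [line.split()[0] for line in f if line.strip() and not line.startswith("#")]

kana_words = [w for w in words if any(0x3041 <= ord(c) <= 0x30FF for c in w)]
normalized = Counter(hira_to_kata(w) for w in kana_words)
dupes = sum(1 for count in normalized.values() if count > 1)
print(len(kana_words), dupes)
```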
So, there are only about two thousand possible words that the ICU tokenizer could conceivably parse as both containing hiragana and as containing katakana, and then match them up after the hiragana-to-katakana mapping. And the original example, オオカミ vs おおかみ, is not even on the list. In Chrome or Safari on a Mac, if you double click on the end of オオカミ, it highlights the whole word. If you double click on the end of おおかみ, it only highlights the last three characters. That's because オオカミ is in the dictionary, but おおかみ is not. おかみ, meaning "landlady", is in the list, as is お, meaning "yes" or "okay". So the ICU tokenizer treats おおかみ as お + おかみ (which probably reads like terrible machine translation for "yes, landlady"). Converting お + おかみ to オ + オカミ still won't match オオカミ.
Also, the icu_transform
filter is pretty slow. In a very minimal timing test, adding it to the English analysis chain makes loading a big chunk of English text take 10.7% longer, so a custom hiragana/katakana mapping token filter would probably be much faster.
No Hope.. No Plan
edit(That section title is definitely a bit dramatic.)
On the one hand, mapping hiragana and katakana is clearly worth doing to people other than us. Some browsers support this in on-page searching. For example, in Safari and Chrome on a Mac, searching on the page for おおかみ finds オオカミ. Though in Firefox it doesn't. (I haven't tested other browsers or operating systems.)
On the other hand, Google Search is still treating them differently. Searching for おおかみ and オオカミ gives very different numbers of results—11.8M for おおかみ (which includes 狼, the Chinese character for "wolf") vs 33.1M for オオカミ. Yahoo! Japan gives 12.9M vs 28.2M. These tell the same story as the numbers from 2017 on Phab.
Comparing the analysis in English, with and without the kana_map
character filter enabled, it's definitely better to turn it off. And as with all the other non-CJK languages, enabling the icu_transform
solves the worst problems, but doesn't do much positive, and it's very expensive.
So, in conclusion, the best thing seems to be to do the opposite of the original plan, and disable the hiragana-to-katakana mapping in favor of getting the value of the ICU tokenizer parsing Japanese text (and text in other Asian languages).
Harmonization Post-Reindex Evaluation, Part II
editSee Part I for background, including details on data collection, the definitions of relevant and "irrelevant", metrics, etc.
This second, smaller evaluation was planned to cover the universal application of the dotted I (İ) fix (T358495) and the hiragana-to-katakana mapping (T180387). However, the result of the hiragana-to-katakana mapping analysis was to not enable it everywhere, but rather to disable it for English (the only place it was enabled).
So, dotted I is all we have to look at.
General Sample
editUnfortunately, when I put together my general query samples across ~90 wikis in the summer of 2023, I normalized the queries a little too aggressively for proper dotted I testing. I generally normalized white space and lowercased everything, so that "Mr Rogers" and " mr rogers " (lowercased, and with extra spaces) would count as the same query. This turned out to be a problem for dotted I examples because the whole point is that İ is not lowercased correctly by software, or consistently across languages.
As a result, most potential sample queries that originally had İ now have double-dotted i (i.e., i̇, which may or may not display in a discernible way—it's i (U+0069) + ̇ (U+0307)).
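For illustration, default (non-Turkic) lowercasing, shown here in Python (most software behaves the same way), turns İ into that two-character sequence:

```python
# Default Unicode lowercasing of dotted I (U+0130) produces i + combining
# dot above, which is exactly what ended up in the over-normalized samples.
s = "İstanbul"
print([hex(ord(c)) for c in s.lower()[:2]])  # ['0x69', '0x307']
```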
Dotted I occurs on many wikis (for example, in Turkish proper names, such as İstanbul), but it doesn't occur too often in queries on wikis for languages that don't use it. Bosnian, Malay, Scots, and Uzbek had one query each (out of ~1K sample queries) with double-dotted i in them. The Kurdish sample had 7, the Azerbaijani sample had 39, and the Turkish sample had 41. Azerbaijani and Turkish make the İ/i vs I/ı distinction.
So, dotted I changes generally had no effect in the general samples.
The one exception was Azerbaijani, which showed a decrease in recall and increase in zero-results rate (46.4% to 48.7% (2.3% absolute change; 5.0% relative change)). 22 of 26 queries that went from some results to no results had improperly lowercased double-dotted i in them, which correctly doesn't match anything when proper Turkic lowercasing is used. Three others are actual Azerbaijani queries, and one other includes English internet which, as a side-effect of Turkic lowercasing, no longer matches uppercase English Internet.
Dotted I Results
editA few reminders:
- Relevant samples showed changes in the analysis of the query itself; irrelevant queries had dotted I's but showed no analysis changes.
- Small samples have fewer than 30 queries in them, and are reported for interest, but are not officially part of the stats we are collecting. Small samples with no changes (noΔ) are noted in brackets, but are not counted as part of the "small" total.
See Reminders & Results Key above for full details.
The table below shows the stats for the dotted-I–specific samples.
| | rel | rel (sm) | irrel | irrel (sm) |
| total | | 7 | 3 | 2 |
| ZRR↓ | | 6 (86%) | 2 (67%) | |
| Res↑ | | 1 (14%) | | |
| topΔ | | | | |
| — topΔ+ | | | | |
| ZRR↑ | | | | 2 (100%) |
| — ZRR↑+ | | | | |
| noΔ | | [5] | 1 (33%) | [28] |
- The Turkish sample had no changes, as expected, so it was excluded from the count.
- The one (non-small) irrelevant sample with no changes is English, which already had good dotted I support.
- Mixed script КОПİЙОК (Cyrillic with Latin İ) resulted in an increased ZRR for one "irrelevant" sample (Bulgarian). As part of the dotted I changes, we removed the
dotted_i_fix
filter for analysis chains with ICU folding. İ still gets mapped to I, butdotted_i_fix
did it beforehomoglyph_norm
, while ICU folding does it after. In general, our homoglyph processing doesn't get every accented version of every letter (which I might like to address in the future). While this is sub-optimal, it is more consistent with our processing of other accented characters. КОПİЙОК, КОПÍЙОК, and КОПĨЙОК are all treated the same. (On the other hand, КОПÏЙОК gets homoglyphed correctly because there is a single Cyrillic letter to map to: Ï ↔︎ Ї.) - German İm also causes a small uptick in ZRR for the German sample because the normalization to im—which is a stop word—happens after stop word filtering. This is generally consistent with other unexpected non-German accents, such as Ím or Îm—though those two are actually normalized by the German stemmer—presumably because, after lowercasing, they are still single characters, and not characters with a combining diacritic. Hmm.
Limited Post-Harmonization Impact
editThere aren't that many examples of dotted I outside of languages that use it, especially in queries (as noted above, it does regularly show up on-wiki in Turkic proper names).
Many languages (including all that had previously been unpacked) already had good dotted I support. Changes generally come from reordering analysis chain components so that dotted I normalization happens later (when other diacritic normalization happens), changing whether a few words get homoglyphed, stemmed, or stopped. This is also why there are no non-small relevant samples!
The purpose of the harmonization was to make sure that most stray dotted I's get treated well in other languages, since we generally upgraded to ICU normalization over simple lowercasing, and icu_norm just doesn't handle it.
There were two "semi-small" relevant sets that didn't meet the non-small criteria of 30+ queries but still had 10+ queries: Italian, with 19, and Chinese with 11.
Both of these had custom analysis configurations and thus needed harmonization. Italian was fairly customized, like English, before I started working on the Search team (which was back in 2015!), while Chinese needs very different processing than typical European languages, and so doesn't use all the same kinds of components as other languages.
Adding proper dotted I support to these two previously excluded languages gives us a sense of how dotted I support likely affects other languages:
- Italian: The zero results rate (for unique queries with dotted I) dropped from 78.9% to 10.5% (-68.4% absolute change; -86.7% relative change).
- Chinese: The zero results rate (for unique queries with dotted I) dropped from 63.6% to 36.4% (-27.2% absolute change; -42.8% relative change).
I think most query examples come from searchers forgetting that they are using a Turkish keyboard (or similar) while typing in another language. That kind of scenario explains queries on Italian Wikipedia like İtaliani, İnterlingua, and GUİLİANO GUİLİANİ.
Having proper dotted I support across all languages has benefited, and will continue to benefit, searchers accidentally using Turkic keyboards (unintentionally typing İtaliani), or searching for Turkic results (İstanbul on-wiki matching Istanbul as a query).
Background
editThe idea behind character folding is that you want to "fold" different characters together, and treat them as the same. A simple example is case folding, where uppercase and lowercase are treated as the same—usually either by converting everything to lowercase or converting everything to uppercase.
In English, accent folding is pretty straightforward, because accents are generally not distinctive or required, so removing them works well. ASCII folding is, in general, folding any non-ASCII character into the closest ASCII character, if such a thing exists. ICU folding is more general folding of Unicode characters to their "simplest" or "most basic" form, and it applies across a large number of characters and writing systems.
Note: ICU normalization is a much less aggressive form of folding, where very similar (up to visually identical) characters are folded together (e.g., µ → μ), and rare and historical forms of letters are converted to their typical or modern forms (e.g., ſ → s). ICU folding is much more aggressive and will convert, for example, almost everything that looks like an e (è, é, ê, ẽ, ē, ĕ, ė, ë, ẻ, ě, ȅ, ȇ, ę, ḙ, ḛ, ề, ế, ễ, ể, ḕ, ḗ, ệ, ḝ, ɇ, ɛ, ⱸ, though not ꬲ, ꬳ, or ꬴ) into a plain e.
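As a rough illustration of the difference (using Python's unicodedata as a stand-in; this is not what the ICU filters actually do internally):

```python
# Rough stand-in for the two levels of aggressiveness: a light normalization
# only canonicalizes characters (ICU normalization also does things like
# µ → μ and ſ → s), while an aggressive, folding-style step strips accents.
import unicodedata

def light_normalize(s: str) -> str:
    # canonical composition only; accents survive
    return unicodedata.normalize("NFC", s)

def aggressive_fold(s: str) -> str:
    # decompose, then drop combining marks: è é ê ẽ ē ... all become plain e
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(light_normalize("Zoë"))   # Zoë
print(aggressive_fold("Zoë"))   # Zoe
print(aggressive_fold("ế"))     # e
```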
The Elasticsearch filter asciifolding
has an option to preserve the original token when folding, so that, for example, zoë would be indexed as both zoe and zoë. We have named our version that preserves the original token asciifolding_preserve
.
If the ICU plugin is available, we automatically upgrade the asciifolding
filter to the icu_folding
filter. The icu_folding
filter does not have a "preserve original" option, so we wrap icu_folding
in two other filters, one that records the token before icu_folding
folds it, and another that comes right after icu_folding
and restores the original token if it is different from the output of icu_folding
. (For example, zoe is the same after folding, so the "original" token isn't restored.) We sometimes collectively refer to these filters as "icu_folding_preserve" by analogy to asciifolding_preserve
, but there is no single filter with that name.
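Sketched out, the two flavors look something like this (the asciifolding option is standard Elasticsearch; the wrapper filter names follow the description above, but this is not the exact CirrusSearch config):

```python
# asciifolding has a built-in option to keep the original token alongside
# the folded one.
asciifolding_preserve = {
    "type": "asciifolding",
    "preserve_original": True,   # emit both zoë and zoe
}

# icu_folding has no such option, so it gets wrapped in two other filters
# (names here follow the description above; treat them as illustrative).
icu_folding_preserve_chain = [
    "preserve_original_recorder",  # remember the token before folding
    "icu_folding",                 # fold it
    "preserve_original",           # re-emit the original if folding changed it
]
```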
Status Quo
editFor reference, the text
field indexes the text with maximal general and language-specific analysis, including apostrophe normalization, camelCase splitting, acronym handling, word_break_helper
, homoglyph normalization, ICU normalization, and others, and, if available, stemming, stop word filtering, and other language-specific processing. The plain
field has minimal language analysis: often just word_break_helper
and ICU normalization.
Normal queries use both the text
and plain
fields, with plain
trying to get something close to an exact match (better precision), and text
allowing for more general and expansive matches (increased recall). Queries in quotes use only the plain
field (and also require search terms in a single set of quotes to be in order in the text—but that's a different kettle of fish).
To further split hairs, the text
and plain
analyzers are used on the text found on-wiki, while text_search
and plain_search
are used on users' queries. text
and text_search
are generally the same (possibly always the same.. but there is always a chance that an exception will rear its ugly head if I dare say otherwise). plain
and plain_search
are generally the same except that plain_search
doesn't have any folding beyond lowercasing/ICU normalization; if plain
doesn't have any extra folding—which is the case for our default analyzers for uncustomized languages—plain
and plain_search
are the same.
(The intuition on the plain
/plain_search
difference for languages that have it is that either an editor (or the editor's source) or a searcher may leave off the non-native diacritics, so when a search term is quoted, we should still match both exactly and undiacritically, but not "cross-diacritically"—e.g., "zoë" (in quotes) does not match zoé. Without quotes (i.e., using both text
and plain
), on English wikis, zoë and zoé do match (though exact matches are ranked higher, everything else being equal).)
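The actual CirrusSearch query is much more elaborate, but the general shape of the quoted/unquoted split described above is something like this sketch (field names follow the discussion; the boost is made up):

```python
# Unquoted: search both fields, with plain acting as the "closer to exact"
# signal and text providing the expansive matches.
unquoted = {
    "query": {
        "multi_match": {
            "query": "zoe",
            "fields": ["text", "plain^2"],
        }
    }
}

# Quoted: only the plain field, as a phrase.
quoted = {
    "query": {
        "match_phrase": {
            "plain": "zoë"
        }
    }
}
```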
Idiosyncrasies
editEnglish & Italian
Lost to the mists of time (i.e., before any of the current team members joined the Search team), The Ancients configured English and Italian to use asciifolding_preserve
in the plain
field, but mere asciifolding
in the text
field.
So, Zoë, on English wikis, would be indexed in the plain
field as both zoë and zoe. When searching the plain
field (for general matches, or when a term is quoted), Zoë would only be analyzed as zoë, so it would only match Zoë or zoë in the original document. However, Zoe would match any of Zoe, zoe, Zoé, zoé, Zoë, or zoë. In general, this means that non-native diacritics (and other foldable variations) can be found if you search without them, even with quotes: "zoe" finds zoë, zoe, and zoé. But if you search with them, you only match the exact forms: "zoë" finds zoë, but not zoe or zoé.
Later—but still way back in the Early Days—we made the decision to upgrade asciifolding
to ICU folding, but in an abundance of caution, we upgraded English, but not Italian. It seems we never revisited that decision, though all other languages that have had asciifolding
added have had the ICU folding upgrade enabled. We also made enabling ICU folding for the plain
analyzer but not the plain_search
analyzer the default for all languages that have ICU folding enabled.
French & Swedish
The Ancients also originally configured the lowercase_keyword
analyzer (used for matching templates and categories with hastemplate
and incategory
) to use asciifolding
for English, French, and Italian. These were later upgraded to asciifolding_preserve
. By analogy to the others, I added the same config to Swedish when I first unpacked its text
analyzer to add asciifolding_preserve
.
Greek
Also in the Early Days (phab:T132637) we tried to help out the Greek-language searchers looking for Ancient Greek words, which have waaaaay more (small, similar, hard-to-read) accents than their modern counterparts. We made some changes to accommodate accent folding, including adding Greek to the list of languages that can have asciifolding
upgraded to icu_folding
. We enabled the upgrade, but forgot to add the baseline asciifolding
to be upgraded.
Summary
The current state of folding across the text
, text_search
, plain
, plain_search
, and lowercase_keyword
analyzers splits into two broad groups: ❶ uncustomized languages with no extra folding, and ❷ standard, generally unpacked & further customized languages. The idiosyncratic outliers include ❸ Greek, ❹ Italian, ❺ French & Swedish, and ❻ English.
| Status Quo | text | text_search | plain | plain_search | lowercase_keyword |
| ∅ | ❶❸ | ❶❸ | ❶ | ❶❷❸❹❺❻ | ❶❷❸ |
| asciifolding | ❹ | ❹ | | | |
| asciifolding_preserve | | | ❹ | | ❹ |
| icu_folding | ❷❻ | ❷❻ | | | |
| "icu_folding_preserve" | ❺ | ❺ | ❷❸❺❻ | | ❺❻ |
Key—
- ❶ Azerbaijani, Kazakh, Lao, Min Nan, Mirandese, Nias, Odia, Slovenian, Telugu, Tibetan, Vietnamese, etc.
- ① Chinese, Indonesian, Khmer, Korean, Polish
- ❷ Arabic, Armenian, Bangla, Czech, Esperanto, German, Hebrew, Hindi, Hungarian, Irish, Japanese, Persian, Portuguese, Russian, Serbian, Slovak, Thai, Turkish, Ukrainian, etc.
- ❸ Greek
- ❹ Italian
- ❺ French, Swedish
- ❻ English
Certain languages in the uncustomized Group ❶ surprised me, because they have had considerable customization, but no asciifolding
or icu_folding
—these are listed in sub-group ①.
Desired Harmony
editOur desired state of affairs is represented in this table:
| Desired Harmony | text | text_search | plain | plain_search | lowercase_keyword |
| ∅ | ❶ | ❶ | ❶ | ❶❷ | ❶ |
| icu_folding | ❷ | ❷ | | | ❷ |
| "icu_folding_preserve" | | | ❷ | | |
Key—
- ❶ Azerbaijani, Kazakh, Lao, Min Nan, Mirandese, Nias, Odia, Slovenian, Telugu, Tibetan, Vietnamese, etc.
- ❷ Arabic, Armenian, Bangla, Chinese, Czech, English, Esperanto, French, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Khmer, Korean, Persian, Polish, Portuguese, Russian, Serbian, Slovak, Swedish, Thai, Turkish, Ukrainian, etc.
Group ❶, the uncustomized languages, is unchanged.
Group ❷, the standard customized languages, has icu_folding
added to the lowercase_keyword
filter, so that categories and templates with non-native diacritics are easier to find (e.g., Curaçao, Württemberg, or Tōhoku on enwiki). English, French, and Swedish have this at the moment (and Italian has the asciifolding
version), and it's a good thing.
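Roughly, that means a keyword-style analyzer along these lines (not the exact CirrusSearch definition):

```python
# Sketch of a keyword-style analyzer for category/template matching with
# folding added; the real lowercase_keyword definition differs in details.
lowercase_keyword_with_folding = {
    "type": "custom",
    "tokenizer": "keyword",                   # the whole name is one token
    "filter": ["lowercase", "icu_folding"],   # so Curaçao matches curacao, Tōhoku matches tohoku
}
```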
The miscellaneous outliers (❸ Greek, ❹ Italian, ❺ French & Swedish, and ❻ English) have all been folded into Group ❷, as have the more familiar & customized languages from ①.
Ideally, I'd love to customize icu_folding
for all languages, so that Group ❶ is empty (or only contains languages that have recently had wikis created).
Incidental Discoveries
editSome additional things I figured out or noticed along the way.
- I incidentally discovered an extra filter, dedup_asciifolding, that we added to prevent asciifolding_preserve from duplicating tokens (i.e., zoe getting indexed twice, once for the "original" and once for the "folded" (but unchanged) copy). We opened a ticket upstream for that and they eventually fixed it (in Lucene 6.3—we're at 8.7), so this is now unneeded vestigial code.
- The problem with Greek was caused in part because we have to add
asciifolding
when we only wanticu_folding
. Sometimes we want asciifolding if that's all we have, and we'll take the upgrade toicu_folding
if available. Other times—as in thetext
,text_search
,plain
, andlowercase_keyword
analyzers—we really only wanticu_folding
, and if it's not available, we don't want to fall back onasciifolding
, because it isn't customized per language and so it is too aggressive. We should refactor to useasciifolding
when we want it (with possible upgrade) and useicu_folding
when that's what we mean. The current state is also potentially sub-optimal for third-party MediaWiki users who don't install the ICU plugin and don't customize their analyzers. They can get stuck with the overly aggressive instances ofasciifolding
.
- This may be the nerd-snipiest nerd-sniping to ever nerd-snipe, but I noticed that
preserve_original_recorder
—one of the actual components used in the multi-stage "icu_folding_preserve"—somehow blocksicu_folding
from updating the byte codes in the token info. So, zoë has bytes[7a 6f c3 ab]
. Whenicu_folding
converts it to zoe, it keeps those bytes (rather than becoming[7a 6f 65]
). Withoutpreserve_original_recorder
before it,icu_folding
makes the expected update to the byte array. This doesn't seem to have any repercussions in actual search, so I'll leave it as an odd side note for now.
The Plan
editThis whole thing has turned out to be fairly complicated, so I'm breaking it up into stages. Once all the pieces are done, it'll be time to reindex and make the changes go live.
- Part 1: The miscellaneous outliers (❸ Greek, ❹ Italian, ❺ French & Swedish, and ❻ English) all get folded into Group ❷, the standard customized languages. (Merged)
- Part 2: Apply "icu_folding_preserve" to
lowercase_keyword
for all languages whereicu_folding
is enabled (i.e., Group ❷). This is a small amount of code, but a lot of changes in our test suite, so it is best done alone. (Merged)
- Part 3: Remove
dedup_asciifolding
. This also generates a lot of changes in our test suite so it gets to be its own piece. (Merged)
- Part 4: Refactoring to disintermediate
asciifolding
when we really wanticu_folding
. This will probably also generate a lot of changes asasciifolding
will disappear from a lot of non-production test configurations. (Merged)
- Part 5: Apply icu_folding for the Group ① languages (Chinese, Indonesian, Khmer, Korean, Polish), and possibly others in Group ❶. If I have to time box this effort I may work on them by Wikipedia size—though I don't love that heuristic. If anyone reading this now has ideas for languages that need
icu_folding
, let me know. If you are reading in The Future™, open a ticket on Phab and tag me. (If you are reading in The Truly Distant Future, don't forget to take your copy of the wiki dumps with you when you leave Earth before the expanding sun consumes the planet.) (Merged) - Part 6: Apply icu_folding to more languages. Part 5 was already broken into 3 parts on Gerrit, so it has done enough work. (Merged)
Part 4—Refactoring & Analysis Notes
editRefactoring to disintermediate asciifolding
was a bit complicated, but it also led to some additional non-refactoring analysis, which led to more refactoring. (The circle of life is complete.)
Thinking about English, it's clear that asciifolding
is fine on its own, and not just as a pointer to icu_folding
, because English generally doesn't care about any accents. We did originally add asciifolding
to English (and French) before the upgrade to icu_folding
was available.
During unpacking, we did add asciifolding
to analyzers primarily to trigger the upgrade to icu_folding
, but a bunch of languages don't have any language-specific exceptions to icu_folding
, so asciifolding
would still benefit those languages. (asciifolding
"converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the 'Basic Latin' Unicode block) into their ASCII equivalents, if one exists." See the Lucene docs for more.)
Looking at the custom configs we have for various languages, they fell into three groups:
Allow icu_folding
only, no asciifolding
- These languages use the Latin script, and have language-specific ICU folding exceptions for some letters in their alphabets: Basque, Bosnian / Croatian / Serbian / Serbo-Croatian, Czech, Danish, Esperanto, Estonian, Finnish, Galician, German, Hungarian, Latvian, Lithuanian, Norwegian, Romanian, Spanish, Swedish, Turkish.
- Note that the Bosnian, Serbian, and Serbo-Croatian wikis use both Latin and Cyrillic scripts, but everything is converted to Latin by the stemmer we use, to facilitate cross-script searching.
For these languages, we will configure icu_folding
directly, and remove it if the ICU plugin is not available.
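A sketch of this first pattern: icu_folding configured directly, with a per-language exception set. (The parameter has been spelled both unicodeSetFilter and unicode_set_filter across versions; the Swedish-style exception set here is just for illustration.)

```python
# Sketch only: fold everything except the native letters of the language.
icu_folding_sv_sketch = {
    "type": "icu_folding",
    "unicode_set_filter": "[^åäöÅÄÖ]",   # leave the (illustrative) native letters alone
}
```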
Allow icu_folding
and asciifolding
- These languages use the Latin alphabet, but don't have any diacritical letters in their alphabets subject to
asciifolding
: Brazilian Portuguese, Catalan, Dutch, English, French, Irish, Italian, Portuguese. - Slovak does have diacritical letters subject to
asciifolding
, but doesn't make exceptions for any of them; speakers prefer search to ignore diacritics. - These languages use non-Latin alphabets, and so their native words are not affected by
asciifolding
(though it is still useful for normalizing Latin text, which is everywhere, and some non-alphabetic characters): Arabic / Egyptian Arabic / Moroccan Arabic, Armenian, Bengali, Bulgarian, Greek, Hebrew, Hindi, Japanese (CJK), Persian, Russian, Sorani, Thai, Ukrainian.- The currently deployed analyzer for Japanese uses the CJK analyzer.
- Here, Ukrainian refers to the unpacked Ukrainian analyzer. The monolithic Ukrainian analyzer may be used instead depending on what plugins are installed, but it cannot be customized with folding, so it isn't relevant.
For these languages, we will configure asciifolding
and allow it to upgrade to icu_folding
if the ICU plugin is available.
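A sketch of the second pattern as build-time logic, in pseudo-Python (the real logic lives in CirrusSearch's PHP config-building code, so this is only the shape of the idea):

```python
from typing import Optional

# Hypothetical helper: decide which folding filter(s) a language gets,
# based on whether the ICU plugin is available and whether plain ASCII
# folding is acceptable as a fallback for that language.
def folding_filters(wants_ascii_baseline: bool, icu_available: bool,
                    exceptions: Optional[str] = None) -> list:
    if icu_available:
        icu = {"type": "icu_folding"}
        if exceptions:
            icu["unicode_set_filter"] = exceptions
        return [icu]
    if wants_ascii_baseline:
        # languages where plain ASCII folding is safe keep this fallback
        return [{"type": "asciifolding"}]
    # icu_folding-only languages get no extra folding without the ICU plugin
    return []
```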
Out of scope
- These languages have some custom analysis config, and do not currently have any folding enabled: Azerbaijani / Crimean Tatar / Gagauz / Kazakh / Tatar, Chinese, Khmer, Korean, Mirandese.
- These languages were unpacked, but have no extra folding (
asciifolding
oricu_folding
) enabled: Indonesian / Malay, Polish. - We have configuration for Japanese using the Kuromoji analyzer, but we don't currently use it, so it is also out of scope for now.
- Part 5 will probably add customized
icu_folding
for unpacked Indonesian / Malay, and Polish, as well as custom Chinese, Khmer, and Korean, and possibly others.
Bonus Updates
editAs a fun bonus, automating the inclusion of remove_empty
after icu_folding
(because it can sometimes fold certain non-letter tokens into empty tokens) revealed that it wasn't manually included everywhere it should have been. Automation for the win!
Part 5—Apply icu_folding
to more languages
edit
I started with the Group ① languages above—Chinese, Indonesian / Malay, Khmer, Korean, Polish. (Previously, I had incorrectly included Slovak in this group, but it is already set up as it should be.)
Next I looked at other languages that already have some customization in our code, to set the standard that if we work on customizing anything for a given language, we should try to enable its ICU folding exceptions, too. This group included Mirandese, and the Turkic languages that all needed Turkish lowercasing to handle I/ı & İ/i: Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar.
For the next tranche of customized languages, I looked at my list of 90 languages I'd sampled for testing, which I'd primarily sorted by number of unique queries (i.e., the amount of data I'd have to work with—see Data Collection above... waaaaaaay at the top). These nine round out the top 50 on my list: Vietnamese, Igbo, Swahili, Tagalog, Slovenian, Georgian, Tamil, Uzbek, and Albanian.
☞ This adds support for customized ICU folding to 20 new languages, and expands ICU folding coverage to the languages of the top 50 Wikipedias and 61 of the top 90 in my list (by unique query volume), and to the languages of 65 Wikipedias overall. (Mirandese, Crimean Tatar, Gagauz, and Tatar are not in the top 90, but already had custom config.)
To make things easier to review, I'm going to group languages by the kind of changes we expect to see.
(i) asciifolding
and icu_folding
These languages usually have either a non-Latin script (like Korean), or they have no accented characters in their Latin alphabet (like Indonesian), so asciifolding
all Latin characters is still helpful, even if icu_folding
is not available.
- Indonesian / Malay and Swahili have no accented characters in their alphabet, so folding all the diacritics should be fine.
- Korean and Georgian each use a non-Latin alphabet, and didn't need any exceptions for
icu_folding
. - Khmer uses a non-Latin alphabet, but did need exceptions for
icu_folding
:icu_folding
removes all diacritics and combining marks in Khmer, including ones that change the vowel in the syllable or control placement of letters.. so those diacritics matter! I used the most of the Khmer range ([ក-៝]) because it is an easier regex. I skipped some symbols the tokenizer currently deletes. I skipped Khmer numbers, because if our current early numeral normalization were ever disabled, we'd still want number normalization.
- Chinese, in theory, should be able to take advantage of
asciifolding
, but thesmartcn_tokenizer
breaks up all non-Chinese characters into single letters, so zoë gets tokenized as z + o + ë, which will match any document with a z, an o, and an ë in it. It's not clear that expanding that out to any document with an e in it is going to help anything. Fortunately, theplain
field uses the ICU tokenizer and has "icu_folding_preserve" and that does a decent job on many non-Chinese words. - Tamil uses a non-Latin alphabet, and I needed to add one exception for ் (Tamil virama), which
icu_folding
deletes. It looks likeicu_folding
has it out for viramas in most scripts! - Uzbek uses the Cyrillic alphabet, and needs a few exceptions: Ёё, Йй, Ўў, Ққ, Ғғ, Ҳҳ.
(ii) icu_folding
only
These languages use a Latin script and need folding exceptions for certain characters. Since asciifolding
doesn't allow for exceptions, we only configure them to use icu_folding
(when the ICU plugin is available).
- Polish needs folding exceptions for these letters: Ąą, Ćć, Ęę, Łł, Ńń, Óó, Śś, Źź, Żż.
- Mirandese needs folding exceptions for just Çç.
- Azerbaijani needs folding exceptions for Çç, Əə, Ğğ, Iı, İi, Öö, Şş, Üü.
- Technically plain ASCII I and i don't need folding exceptions, but it is clearer to have the Iı & İi pair in the list, here and elsewhere.
- Crimean Tatar exceptions: Cyrillic Ёё & Йй, Latin Çç, Ğğ, Iı, İi, Ññ, Öö, Şş, Üü.
- Gagauz exceptions: Ää, Çç, Êê, Iı, İi, Öö, Şş, Ţţ, Üü.
- Gagauz seems to suffer the inverse problem of Romanian, and S and T with commas (Șș, Țț) are often used for S & T with cedillas (Şş, Ţţ). For example, all of the following are on Gagauz Wikipedia: Şveţiya, Şvețiya, Șveţiya, Șvețiya ("Sweden"), with the incorrect comma version being the most likely, by far! It makes sense to map Șș → Şş and Țț → Ţţ, and treat those as localized normalizations for the letters with commas.
- Kazakh exceptions: Cyrillic Ғғ, Ёё, Йй, Ққ, Ңң, Ұұ; Latin Ää, Ğğ, Iı, İi, Ññ, Öö, Şş, Ūū, Üü.
- While Kazakh wikis are mostly written in Cyrillic, they also have the comma/cedilla problem for S. I found two examples of a Cyrillic word and its Latin transliteration appearing next to each other, but one with a comma, one with a cedilla: Бокша / Bocșa and Бокша / Bocşa. Again, Șș → Şş makes sense as a local normalization.
- Tatar exceptions: Cyrillic Ёё, Җҗ, Йй, Ңң; Latin Ää, Çç, Ğğ, Iı, İi, Ññ, Öö, Şş, Üü.
- Again the local normalization of Șș → Şş makes sense, though here most examples I found were of Romanian text in the Tatar wiki, but written with both characters. There are many examples of și and şi (Romanian "and") in the same article. (There's a sketch of this kind of mapping just after this list.)
- Vietnamese has 12 vowels: a, ă, â, e, ê, i, o, ô, ơ, u, ư, y. Each vowel can have one of 5 tones added to it (or no tone, giving 6 options): ơ, ớ, ờ, ở, ỡ, ợ. And of course, each can be either upper- or lowercase: Ểể... giving 12×6×2 = 144 vowel symbols! Vietnamese also uses the diacritical consonant Đđ. Fortunately, a fairly large chunk of Vietnamese letters are in one section of Unicode, so we can cover many with the regex span [Ạ-ỹ]. Our final exception regex is still a whopper: [ÁáÀàÃãĂăÂâĐđÉéÈèÊêÍíÌìĨĩÓóÒòÕõÔôƠơÚúÙùŨũƯưÝýẠ-ỹ]
- While the letters with tone marks are not distinct letters in the Vietnamese alphabet, the tone marks are critical to distinguishing written words, which are often simple consonant-vowel(-consonant) syllables. A six-way example: ma "ghost", má "cheek", mà "however", mã "cipher", mạ "rice seedlings", mả "tomb". (Some of these words have many more than one meaning, too—even when distinguishing tones there are homophones. I just listed the first one from English Wiktionary to demonstrate that they are not at all related.)
- I noticed that the capital form of Đđ ("d with stroke") looks a lot like the capital form of Ðð ("eth"), and sure enough I found some capitalized words on the Vietnamese Wikipedia (Ðại, Ðến, Ðó, Ðơn, Ðức) that when lowercased reveal their eth-y nature: (ðại, ðến, ðó, ðơn, ðức). A local normalization of Ð → Đ makes sense, though we'll have to bring ð → đ along for the ride so upper and lowercase forms of words with legitimate Ðð can find each other.
- Igbo exceptions: Ịị, Ṅṅ, Ọọ, Ụụ.
- Igbo optionally uses ácute and gràve accents to mark tones, so correctly folding those will be useful for matching words with and without the tone marks, such as ụmụntakịrị and ụ́mụ̀ńtàkị́rị́.
- Tagalog exceptions: Latin Ññ; Baybayin ᜔.
- Slovenian exceptions: Čč, Šš, Žž; Ćć, Đđ.
- This was a tough one, and I wouldn't be shocked if we have to revisit it in the future. Only Čč, Šš, Žž are part of the Slovene alphabet, but Serbian Ćć, Đđ show up often enough that they are alphabetized separately, unlike other foreign diacritical letters. Ćć & Đđ seem to be on par with Qq & Ww as foreign letters that are allowed to hang out with the locals.
- Albanian exceptions: Çç, Ëë.
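The comma-to-cedilla normalizations mentioned for Gagauz, Kazakh, and Tatar above would be a small character-level mapping applied before tokenization, something like this sketch (not the exact CirrusSearch definition):

```python
# Sketch only: localized normalization of S/T with comma below to the
# cedilla forms used in these languages' orthographies.
comma_to_cedilla = {
    "type": "mapping",
    "mappings": [
        "Ș=>Ş", "ș=>ş",   # S with comma below -> S with cedilla
        "Ț=>Ţ", "ț=>ţ",   # T with comma below -> T with cedilla (Gagauz)
    ],
}
```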
Part 6—Apply icu_folding
to more languages
edit
The next three languages on my list of 90 (by unique query volume) were all Indic languages using Brahmic scripts, as were 11 of the next 20, so I decided to work on those 11 together, since they are likely to have similar concerns, while being relatively straightforward because they are non-Latin. (Past Trey says, "I am not jinxing it!") The list: Marathi, Burmese, Malayalam, Telugu, Sinhala, Kannada, Gujarati, Nepali, Assamese, Punjabi, and Odia.
...
Only one paragraph (but several days) later and this is Future Trey and I was totally jinxed! Right out of the gate with Marathi! Marathi is usually written in Devanagari, the same script as Hindi, so I thought it would be easy. Instead, I have new questions about Hindi (see below), and my current heuristic for figuring out which characters should or should not be folded—i.e., what's in a language's alphabet—doesn't really work for complex cases of non-alphabetic scripts (Devanagari is an abugida). Tamil, another Indic abugida, only had its virama folded by icu_folding
—and viramas generally should not be folded—so it was much more straightforward.
My new generic non-alphabetic heuristic is that if I can't easily determine the right answer, leave it as is for now. If it isn't currently folded, not folding it isn't making anything worse, and all the folding of foreign script diacritics (and, often, local numerals) will be a net win.
- Marathi needs exceptions for the Devanagari virama ्, Devanagari rra ऱ, and the Modi virama 𑘿.
- Marathi uses a form of Devanagari called "Balbodh" (and, historically, Modi). Modi is straightforward, and I added an exception for the Modi virama, even though it doesn't seem to actually be used on the Marathi Wikipedia (even in the article on the Modi script).
- For Balbodh/Marathi Devanagari, I stripped nukta/nuqta, as it is generally not used in Marathi (though not for rra). I also stripped stress markers ॑ (U+0951), ॒ (U+0952), ॓ (U+0953), and various Extended Devanagari characters, including cantillation marks used for chanting vedas.
- I added a mapping for a rare variant of the Marathi-specific "eyelash reph / raphar", which is used to indicate that a mid-word r-sound is at the beginning of the next syllable (e.g. सुऱ्या = su-ryā, while सुर्या is sur-yā). It usually uses a variant of rra (e.g., ऱ्य (ऱ[rra] + ्[virama] + य)), but can be written with ra and a zero-width joiner (e.g., र्य (र[ra] + ्[virama] + [ZWJ] + य)).
- Since rra [ऱ] is important in Marathi, I also added a map from ऱ (र[ra] + ़[nukta]) to ऱ[rra]—even though
icu_normalizer
will do that, too—and added an exception foricu_folding
for rra ऱ.
- Burmese needs exceptions for virama, vowel signs, and other combining diacritics: ် ္ ့ း ာ ါ ေ ဲ ု ူ ိ ီ ွ ံ ၖ ၗ.
- Rather than except the whole Mon-Burmese Unicode block (other than numerals), I only excepted the combining characters specific to the Burmese alphabet. Mon-Burmese is also used to write other languages, and they include non-Burmese characters—like Shan tone markers—that are still "foreign" to Burmese.
- Malayalam needs exceptions for its virama, vowel signs, etc.: ് ി ു ൃ ൢ െ ൊ ാ ീ ൂ ൄ ൣ േ ോ ൈ.
- In reading up on Malayalam and looking at my sample, I noticed a few things that need mapping:
- ൌ (U+0D4C) is an archaic form for the 'au' vowel, while ൗ (U+0D57) is the modern form. In isolation, and depending on the font, they look the same. In a word, they do not. However, forms that alternate between them exist, and I found on-wiki discussions on which form to use for particular words, so it seems like mapping the archaic form to the modern form is a good idea.
- The older dot reph ൎ (U+0D4E) is usually replaced with chillu rr ർ (U+0D7C) in the reformed orthography, and while it seems rare, there are examples on-wiki of the same word using each, so mapping the old to the new again seems like a good idea.
- There are two archaic virama characters, ഻ (U+0D3B) and ഼ (U+0D3C), that are also rarely used, but it makes sense to map them to the modern virama ് (U+0D4D). I also found that double viramas ( ് + ്) and ( ് + ഼) both occur in my sample, so simplifying them down to one modern virama makes sense.
- In reading up on Malayalam and looking at my sample, I noticed a few things that need mapping:
- Telugu seems to be treated more like Tamil, and the only exception it needs is for its virama: ్.
- Sinhala needs exceptions for its virama-like character ( ්, called "al-lakuna" in the Unicode viewer on my laptop, and "hal kirīma" on English Wikipedia), and for a couple of other characters that are composed of parts that include the hal kirīma: ෝ and ේ.
- These composed characters, and a few others, can theoretically be written in at least a couple of different ways. These generally look good in the four Sinhala fonts I have, but your operating system / browser / font may differ:
- චේ (ච + ේ) vs චේ (ච + ෙ + ්)
- චො (ච + ො) vs චො (ච + ෙ + ා)
- චෞ (ච + ෞ) vs චෞ (ච + ෙ + ෟ)
- චෝ (ච + ෝ) vs චෝ (ච + ො + ්) vs චෝ (ච + ෙ + ා + ්)
- Fortunately, these are actually pretty hard to type (at least on a Macintosh in Firefox) and there are many layers of processing that try to normalize them to the single-character version. For testing purposes, I had to use Java \u encoding to reliably get them to Elasticsearch, and I used some hacky regex tricks to look for them on-wiki with
insource
. If they do get through, they get normalized byicu_normalizer
, so all is well there. - There are two other doubled combining characters ( ෛ and ෲ) that are much easier to type/input, and which do occur on-wiki, and which are not corrected by
icu_normalizer
. They look okay in some fonts but not others, so again your mileage may vary with the examples below:- චෛ (ච + ෛ) vs චෙෙ (ච + ෙ + ෙ)
- චෲ (ච + ෲ) vs චෘෘ (ච + ෘ + ෘ)
- I added a mapping to normalize these, and found a few instances of the first one ( ෙ + ෙ) in my Sinhala sample, and a few examples of the other ( ෘ + ෘ) on-wiki.
- These composed characters, and a few others, can theoretically be written in at least a couple of different ways. These generally look good in the four Sinhala fonts I have, but your operating system / browser / font may differ:
- Kannada only needs an exception for its virama: ್
- Similar to Sinhala, Kannada has some composed characters that can be written with single or multiple combining characters, but as before these are already normalized by
icu_normalizer
:- ದೀ (ದ + ೀ) vs ದೀ (ದ + ಿ + ೕ)
- ದೈ (ದ + ೈ) vs ದೈ (ದ + ೆ + ೖ)
- ದೊ (ದ + ೊ) vs ದೊ (ದ + ೆ + ೂ)
- ದೇ (ದ + ೇ) vs ದೇ (ದ + ೆ + ೕ)
- ದೋ (ದ + ೋ) vs ದೋ (ದ + ೊ + ೕ) vs ದೋ (ದ + ೆ + ೂ + ೕ)
- As with Marathi, I let
icu_folding
strip the Kannada nukta/nuqta ( ಼), which is used to indicate similar but non-native sounds in transliterations. It is used in Kannada, but not consistently, as in ಫ಼್ರಾನ್ಸಿಸ್ (frānsis) / ಫ್ರಾನ್ಸಿಸ್ (phrānsis), both "Francis", as in "Francis Crick".
- Similar to Sinhala, Kannada has some composed characters that can be written with single or multiple combining characters, but as before these are already normalized by
- Gujarati only needs an exception for its virama: ્
- Other factors are familiar: composed characters that can be written with single or multiple combining characters, though none normalized by
icu_normalizer
(probably because they render poorly (on purpose) in most fonts—though these variants to occur on-wiki). I added a mapping to normalize these:- કૉ (ક + ૉ) vs કાૅ (ક + ા + ૅ)
- કો (ક + ો) vs કાે (ક + ા + ે)
- કૌ (ક + ૌ) vs કાૈ (ક + ા + ૈ)
- As with other languages, I let
icu_folding
strip the Gujarati nukta/nuqta ( ઼), which is used to indicate non-native sounds, but not consistently, as in ફ઼્રાંસિસી (frānsisī) / ફ્રાંસિસી (phrānsisī), both "French".
- Other factors are familiar: composed characters that can be written with single or multiple combining characters, though none normalized by
- Nepali, written in Devanagari, only needs an exception for its virama: ्.
- Nepali is sometimes written in Tibetan script, but on-wiki it seems to only be used sparingly, so I didn't add any Tibetan exceptions for Nepali wikis.
- Assamese, written in the Bengali–Assamese script, needs
icu_folding
exceptions for its virama ্.- To complicate matters, Assamese uses three characters with nukta (ড়, ঢ়, য়), which can be encoded as either single characters, or as base characters (ড, ঢ, য) plus a nukta ( ়). As far as I can tell, there are few words that differ only by these nukta, and informal and digital writing (i.e., user queries) are likely to omit the nukta. The Bengali analysis chain also normalizes away the nukta, though that was approved by our Bengali speaker at the time.
- So, I've decided to allow the nukta to be stripped from these characters for now. If we decide to reverse that decision, it's a bit complicated, especially if we want to strip other nukta (which would be in words from Bengali, for example), but very doable. We need to map the decomposed versions (ড + ়) to the precomposed versions (ড়) in a character filter, and add a folding exception for the precomposed versions to both
icu_folding
andicu_normalizer
. Then the internal version will be the precomposed version instead of the decomposed version, but that shouldn't matter after reindexing. - I looked at the Bengali analysis chain's behavior on these characters, and it's also a bit ridiculous: precomposed characters (ড়) get decomposed by
icu_normalizer
(ড + ়); decomposed versions (both input that way and the ones modified byicu_normalizer
) get recomposed byindic_normalization
, and thenicu_folding
strips off the nukta anyway.
- Punjabi, written in Gurmukhi, did not need any exceptions!
- Punjabi has a nukta, but its use is inconsistent (ਅਫਗਾਨਿਸਤਾਨ (aphagānisatāna) vs ਅਫ਼ਗ਼ਾਨਿਸਤਾਨ (afaġānisatāna) vs ਅਫ਼ਗਾਨਿਸਤਾਨ (afagānisatāna)), so stripping it is fine.
- Punjabi has a virama (called halantă or halandă), but it is generally not commonly used except in "Sanskritized text" or dictionaries for extra phonetic info.
- However!—why is there so often a "however"?—there are subscript versions of three letters that are used for consonant clusters or to mark tone. The virama/halantă is used in Unicode, as in other Indic scripts, to indicate that the subscript version of the character should be used. However, it seems to be inconsistently used—for example ਪਰਦਰਸ਼ਿਤ / ਪ੍ਰਦਰਸ਼ਿਤ / ਪ੍ਰਦ੍ਰਸ਼ਿਤ all seem to mean "show / exhibit / display".
- Odia only needs an exception for its virama ୍.
- Odia has a nukta and a couple of characters with nuktas (ଡ଼ & ଢ଼), but like other languages, it seems to be inconsistently used in informal writing, so we are stripping it.
- Odia also has some combining vowel marks that are the same as multiple others, e.g. ୌ vs େ + ୗ.
icu_normalizer
takes care of those. In some fonts, though, some of them will display okay even when reversed, and a few of these exist on-wiki.- ୖ + େ vs ୈ
- ା + େ vs ୋ
- ୗ + େ vs ୌ
- I've added a custom character map to handle these cases.
In summary...
“Experience is something you don't get until just after you need it.” —Steven Wright
Within the Indic abugidas, there are patterns—somewhat inconsistent patterns, but patterns nonetheless: ICU folding has a strong dislike for viramas; in some Indic scripts vowel signs are clobberized, in others they are untouched, so you gotta check; nuktas are generally okay to strip, though there can be specific exceptions; "composed" diacritics are sometimes written in pieces, and the number of fonts that render underlyingly different strings the same is a rough predictor of how likely icu_normalizer
is to fix them; normalizing numerals is a good thing.
☞ With previous updates this adds support for customized ICU folding to 31 new languages, with ICU folding coverage for the languages of the top 50 Wikipedias and 72 of the top 90 in my list (by unique query volume), and for the languages of 76 Wikipedias overall.
Future Work
editI'm going to stop here for now. Note to future self: these remain in the top 90 languages in my list:
- (Latin script) Afrikaans/af, Icelandic/is, Latin/la, Welsh/cy, Asturian/ast, Scots/sco, Luxembourgish/lb, Alemannic/als, Breton/br
- (Cyrillic) Mongolian/mn, Macedonian/mk, Kyrgyz/ky, Belarusian/be, Belarusian-Taraškievica/be-tarask, Tajik/tg (cy/la)
- (Arabic script) Urdu/ur, Kurdish/ku (ar/la)
- (CJK) Cantonese/zh-yue
To finish off languages with Wikipedias with 100,000 or more articles, we'd need to cover these, too:
- (Latin script) Cebuano/ceb, Waray/war, Min Nan/zh-min-nan, Ladin/lld, Minangkabau/min
- (Cyrillic) Chechen/ce
- (Arabic script) South Azerbaijani/azb
ICU Folding Results
editThe table below shows the before-and-after analysis stats for the ICU-folding–specific samples, and the generic samples, for the 31/32 languages above. (Indonesian and Malay share a configuration, but have different wikis.) See Reminders & Results Key above for details on labels and analysis.
| | rel | rel (sm) | irrel | irrel (sm) | general |
| total | 25 | 6 | 28 | 1 | 32 |
| ZRR↓ | 25 (100%) | 3 (50%) | 11 (39%) | 1 (100%) | 24 (75%) |
| Res↑ | | 3 (50%) | 4 (14%) | | 2 (6%) |
| topΔ | | | 1 (4%) | | 1 (3%) |
| noΔ | | [1] | 12 (43%) | [3] | 5 (16%) |
Notes:
- All of the non-small samples (30+ queries) that had tokens affected by folding ("rel(evant)") showed improvement in their zero-results rate. Of the small samples with changes, half showed improvements to ZRR, and half had a marked increase in results.
- More than half of the "irrel(evant)" samples (those with diacritical tokens that were nonetheless not parsed differently) still had an improvement in ZRR, total results, or changes in top result.
- In the general sample (unweighted but filtered samples for each language Wikipedia, see above), 84% showed the same kinds of improvements, indicating that "foreign" diacritics and other variant characters are relatively common (≥ 0.1% of queries) in general search.
Excursus on weighted and unweighted zero-results rate calculations
In a discussion with Erik between computing earlier results and computing the folding results, we noticed a discrepancy between his and my general ZRR estimates. We eventually determined that I was counting queries and he was counting search sessions, so our rates were not directly comparable.
However, I also realized that I was only looking at unique queries. So, for ICU folding, I computed ZRR on both unique queries (unweighted) and on all queries (weighted). The total number of queries was often the same for targeted samples—that is, queries in need of folding tended to be unique in a small sample. In the general samples, the total number of queries was up to ~15% higher, and ZRR often varied by 1–2%, and sometimes 3–4% (which is also 10–15% of the ZRR; e.g., 30% unweighted ZRR vs 33% weighted ZRR has a 3 percentage point difference, but also a difference of 10% of the ZRR).
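Roughly, the two calculations differ like this (made-up numbers):

```python
# Each row: (query, number of times it was searched, whether it got zero results).
sample = [
    ("query a", 1, True),
    ("query b", 3, False),
    ("query c", 20, True),   # a frequently repeated zero-results query
]

# Unweighted: each unique query counts once.
unweighted_zrr = sum(zero for _, _, zero in sample) / len(sample)

# Weighted: each query counts as many times as it was searched.
weighted_zrr = (sum(n for _, n, zero in sample if zero)
                / sum(n for _, n, _ in sample))

print(f"{unweighted_zrr:.1%} vs {weighted_zrr:.1%}")  # 66.7% vs 87.5%
```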
The one wild outlier was Vietnamese. The general unweighted sample has 1000 queries and a ZRR of 19.6%. The same sample, weighted by frequency, has 15,836 queries and a ZRR of 82.6%. The bulk of the difference is one query repeated over 12,000 times; it gets zero results, and it matches the title of a social media scandal post. Smells like a bot or a tool of some sort. The five queries repeated more than 100 times are all similar.
I think I like the unweighted metric—though maybe I just need to get better at filtering queries.
Other Things to Do
editA list of incidental things to do that I noticed while working on the more focused sub-projects above.
The first list is relatively simple things that should definitely be done.
- Add Latin İ / Cyrillic І̇ to the
homoglyph_norm
list. (Added to Trey's list.) - ✔ Enable
dotted_I_fix
(almost?) everywhere, and maybe enable Turkishlowercase
for languages that distinguish I/ı and İ/i. - ✔ Add
remove_duplicates
afterhebrew_lemmatizer
in the Hebrew analysis chain, to remove exact duplicates. (See "Addremove_duplicates
to Hebrew" above). - ✔ Refactor English, Japanese, etc. configs to use AnalysisBuilder
- ✔ Merge
mapping
filters when possiblennbsp_norm
andapostrophe_norm
are universal and (can) occur first, so merging makes sense.nnbsp_norm
is used in other places so it needs to exist on its own, too, though.kana_map
will not be universal (it will not be used in Japanese), so merging it would be... tricky? I've thought about trying to build up singlemapping
filter with all the general and language-specific mappings in it, but it might be too much maintenance burden and code complexity. And it could cause a mess in multi-lingual configs like Wikidata.
- ✔ Enable a config for no/Norwegian.
- ✔ I saw a lot of Greek/Latin and Greek/Cyrillic homoglyphs go by—again!—so I opened T373471 to take homoglyphs off my 10% project list and made it a real task.
The second list involves somewhat more complicated issues or lower priority issues that could use looking at.
- When unpacking Hindi (written in Devanagari), ICU folding did not require any exceptions (i.e., enabling ICU folding did not seem to have any effect on Hindi tokens). At the time, I didn't think anything of it. Here, every Indic script seems to need at least an exception for its virama. In fact, Marathi, also written in the Devanagari script, needs an exception for the virama. It turns out that filter
hindi_normalization
provided by Lucene via Elastic, strips viramas as part of a larger set of normalizations. This seems contrary to my understanding of Indic scripts in general, and the advice I got from Santhosh on viramas in general. It is also reminiscent of the overly aggressive normalization done by thebengali_normalization
filter, which we disabled. This needs to be investigated and at least briefly discussed with Hindi-speaking wiki users, but not during the current task. (T375565) - Relatedly, I ended up looking at the code for
indic_normalization
, and it is definitely pretty complex. It also explicitly calls out several scripts: Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Tamil, and Telugu. It would definitely make sense to test whether it is generally useful on languages using those scripts. (It is already used as part of the Bengali and Hindi analyzers. In Bengali it didn't seem to cause any problems, but we didn't necessarily look at a significant amount of text in Gujarati, Kannada, or Tamil scripts when reviewing the Bengali analyzer.) (T375567) - Either add a reverse number hack and number/period split similar to the one for Thai to Khmer, Lao, and Myanmar, or change
icu_token_repair
to not merge those specific scripts with adjacent numbers. - Investigate identifying tokens generated by the ICU tokenizer that are all symbols or all Common characters or something similar, and allowing them to merge in certain situations (to handle the micro sign case more generally).
- See if any parts of the Armenian (hy) analysis chain can do useful things for Western Armenian (hyw) wikis.
- Consider a minor global French/Italian elision filter for d'– and l'– and English possessive filter for –'s (almost?) everywhere.
- Try to make homoglyph norm more efficient, especially if I ever get around to expanding it to include Greek.
- Do some timings on the Khmer syllable reordering, just to see how terrible it is!
- Investigate nn/Nynorsk stemmers for nnwiki.
- Look into
smartcn_tokenizer
token repair (similar to ICU token repair), which would put non-Chinese tokens back together, so that zoë would be tokenized as zoë and not z + o + ë. Alternatively, is theicu_tokenizer
good enough at Chinese to use instead of thesmartcn_tokenizer
?- Alternatively, look at fixing it / opening a ticket upstream.
- Also, should the Chinese
plain
field also convert everything to simplified characters before tokenizing?