How Many Languages Does Wikimedia Search Support?
July–September 2024—TL;DR: On-wiki search "supports" a lot of "languages"! "Search supports more than 50 language varieties" is a defensible position to take. "Search supports more than 40 languages" is 100% guaranteed. Precise numbers present a philosophical conundrum.
This is a squishy question!
The definition of what qualifies as a language is very squishy. We can try to avoid some of the debate by outsourcing the decision to the language codes we use—different codes equal different languages—though it won't save us.
Another squishy concept is what we mean by "support", since the level of language-specific processing provided for each language varies wildly, and even what it means to be "language-specific" is open to interpretation. But before we irretrievably careen off into the land of philosophy of language, let's tackle the easier parts of the question.
Full Support
edit"Full" support for many languages means that we have a stemmer or tokenizer, a stop word list, and we do any necessary language-specific normalization. (See the Anatomy of Search series of blog posts, or the Bare-Bones Basics of Full-Text Search video for technical details on stemmers, tokenizers, stop words, normalization, and more.)
CirrusSearch/Elasticsearch/Lucene
The wiki-specific custom component of on-wiki search is called CirrusSearch, which is built on the Elasticsearch search engine, which in turn is built on the Apache Lucene search library.
Out of the box, Elasticsearch 7.10 supports these 32 languages, so CirrusSearch does, too.
- Arabic, Armenian, Basque, Bengali, Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Portuguese, Romanian, Russian, Sorani (Central Kurdish), Spanish, Swedish, Thai, and Turkish.
Notes:
- Sorani has language code ckb, and it is often called Central Kurdish in English.
- Thai does not have a stemmer, but that seems to be because it doesn't need one.
Running Count: 32
Elasticsearch 7.10 also has three other language analyzers:
- The "Brazilian" analyzer is for Brazilian Portuguese, which is represented by a sub-language code (pt-br). However, the Brazilian analyzer has all separate components, and we do use it for the brwikimedia wiki ("Wiki Movimento Brasil").
- The Persian analyzer has some normalization and a stop word filter, but no stemmer. Persian would benefit from a stemmer, so it doesn't count as "full" support.†
- The "CJK" (which stands for "Chinese, Japanese, and Korean") analyzer only normalizes non-standard half-width and fixed-width characters (ア→ア and A→A), breaks up CJK characters into overlapping bigrams (e.g., ウィキペディア is indexed as ウィ, ィキ, キペ, ペデ, ディ, and ィア), and applies some English stop words. That's not really "full" support, so we won't count it here. (We also don't use it for Chinese or Korean.)
We will count Brazilian Portuguese as a language that we support, but also keep a running sub-tab of "maybe only sort of distinct" language varieties.
We'll come back to Chinese, Japanese, Korean, and the CJK analyzer a bit later.
[†] Later versions of Elasticsearch look like they will support Persian stemming, but we can't deploy those versions for licensing reasons. Fortunately, the underlying functionality comes from Lucene, which is license compatible. Longer term, if we switch to OpenSearch, it will have the newer functionality from Lucene; or—as a last resort—we can create our own simple plugin to make it available for our users.
Running Count: 32–33 (32 languages + 1 major language variety)
Other Open Source Analysis
We have found some open source software that does stemming or other processing for particular languages: some of it as Elasticsearch plugins, some as stand-alone Java code, and some in other programming languages. We have used, wrapped, or ported it as needed to make the algorithms available for our wikis.
- We have open-source Serbian, Esperanto, and Slovak stemmers that we ported to Elasticsearch plugins.
- There are currently no stop word lists for these languages. However, for a typical significantly inflected alphabetic Indo-European language,‡ a decent stemmer is the biggest single improvement that can be added to an analysis chain for that language. Stop words are very useful, but general word statistics will discount them even without an explicit stop word list (the sketch below, after the footnote, shows why).
- Having a stemmer (for a language that needs one) can count as the bare minimum for "full" support.
[‡] English is weird in that it is not significantly inflected. Non-Indo-European languages can have very different inflection patterns (like Inuit—so much!—or Chinese—so little!), and non-alphabetic writing systems (like Arabic or Chinese) can have significantly different needs beyond stemming to count as "fully" supported.
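Here's the promised sketch of why missing stop word lists aren't fatal: in a BM25-style ranking formula, the IDF weight of a word that appears in nearly every document is already close to zero, so stop words get discounted whether or not we explicitly filter them. (The formula is the standard BM25 IDF; the document counts are made up.)

```python
import math

def bm25_idf(docs_with_term: int, total_docs: int) -> float:
    """Standard BM25 inverse document frequency."""
    n, N = docs_with_term, total_docs
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

print(bm25_idf(9_990, 10_000))  # a stop-word-ish term: ~0.001 (nearly ignored)
print(bm25_idf(50, 10_000))     # a rare, contentful term: ~5.3
```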
- For Chinese (Mandarin) we have something beyond the not-so-smart (but much better than nothing!) CJK analyzer provided by Elasticsearch/Lucene. Chinese doesn't really need a stemmer, but it does need a good tokenizer to break up strings of text without spaces into words. That's the most important component for Chinese, and we found an open-source plugin to do that. Our particular instantiation of Chinese comes with additional complexity because we allow both Traditional and Simplified characters, often in the same sentence. We have an additional open-source plugin to convert everything to Simplified characters internally. (A toy sketch of these two steps appears after this list.)
- For Hebrew we found an open-source Elasticsearch plugin that does stemming. It also handles the ambiguity caused by the lack of vowels in Hebrew (by sometimes generating more than one stem).
- For Korean, we have another open-source plugin that is much better than the very basic processing provided by the CJK analyzer. It does tokenizing and part-of-speech tagging and filtering.
- For Polish and Ukrainian, we found an open-source plugin for each that provides a stemmer and stop word list. They both needed some tweaking to handle odd cases, but overall both were successes.
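And the promised toy sketch of the two Chinese-specific steps: fold Traditional characters to Simplified, then segment the spaceless text into words. The tiny character table and the greedy longest-match segmenter are illustrative stand-ins for what the actual plugins do.

```python
# Toy Traditional-to-Simplified mapping; the real plugin knows thousands.
T2S = {"維": "维", "學": "学", "體": "体"}

def to_simplified(text: str) -> str:
    return "".join(T2S.get(ch, ch) for ch in text)

def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match word segmentation with a toy lexicon."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):           # try the longest match first
            if text[i:j] in lexicon or j == i + 1:  # fall back to one character
                words.append(text[i:j])
                i = j
                break
    return words

print(segment(to_simplified("維基百科"), {"维基", "百科"}))
# ['维基', '百科']
```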
Running Count: 40–41 (40 languages + 1 major language variety)
Shared Configs
Some languages come in different varieties. As noted before, the distinction between "closely related languages" and "dialects" is partly historical, political, and cultural. Below are some named language varieties with distinct language codes that share language analysis configuration with another language. How you count these is a philosophical question, so we'll incorporate them into our numerical range.
- Egyptian Arabic and Moroccan Arabic use the same configuration as Standard Arabic. Originally they had some extra stop words, but it turned out to be better to use those stop words in Standard Arabic, too. Add two languages/language varieties.
- Serbo-Croatian—also called Serbo-Croat, Serbo-Croat-Bosnian (SCB), Bosnian-Croatian-Serbian (BCS), and Bosnian-Croatian-Montenegrin-Serbian (BCMS)—is a pluricentric language with four mutually intelligible standard varieties, namely Serbian, Croatian, Bosnian, and Montenegrin. For various historical and cultural reasons, we have Serbian, Croatian, and Bosnian (but no Montenegrin) wikis, as well as Serbo-Croatian wikis. The Serbian and Serbo-Croatian Wikipedias support Latin and Cyrillic, while the Croatian and Bosnian Wikipedias are generally in Latin script. The Bosnian, Croatian, and Serbo-Croatian wikis use the same language analyzer as the Serbian wikis. Add three languages/language varieties.
- Malay is very closely related to Indonesian—close enough that we can use the Elasticsearch Indonesian analyzer for Malay. (Indonesian is a standardized variety of Malay.) Add another language/language variety.
- Norwegian has two written standards, Bokmål and Nynorsk, and we have wikis in each. It's even harder than usual to map them to "languages" because there are long, complex histories and cultural associations for each, and they are primarily written forms that correspond to lots of varied spoken dialects. We use the default Norwegian language analyzer (Bokmål) for both, even though Nynorsk components exist. (We have a Phab task to look at using the Nynorsk components on the Nynorsk wikis, but haven't gotten to it yet.) The two standards are close enough that the sub-optimal components are much better than nothing, but the proper ones would likely be even better. Add half a language/language variety.
Running Count: 40–47½ (40 languages + 7½ major language varieties)
Moderate Language-Specific Processing
These languages have some significant language-specific(ish) processing that improves search, while still lacking some obvious component (like a stemmer or tokenizer).
- For Japanese, we currently use the CJK analyzer (described above). This is the bare minimum of custom configuration that might be considered "moderate" support. It also stretches the definition of "language-specific", since bigram tokenizing—which would be useful for many languages without spaces—isn't really specific to any language, though the decision to apply it is language-specific.
- There is a "full" support–level Japanese plugin (Kuromoji) that we tested years ago (and have configured in our code, even), but we decided not to use it because of some problems. We have a long-term plan to re-evaluate Kuromoji (and our ability to customize it for our use cases) and see if we could productively enable it for Japanese.
- The Khmer writing system is very complex and—for Historical Technological Reasons™—there are lots of ways to write the same word that all look the same, but are underlyingly distinct sequences of characters. We developed a very complex system that normalizes most sequences to a canonical order. The ICU Tokenizer breaks up Khmer text (which doesn't use spaces between words) into orthographic syllables, which are very often smaller than words. It's somewhat similar to breaking up Chinese into individual characters—many larger "natural" units are lost, but all of their more easily detected sub-units are indexed for searching.
- This is probably the maximum level of support that counts as "moderate". It's tempting to move it to "full" support, but true full support would require tokenizing the Khmer syllables into Khmer words, which requires a dictionary and more complex processing. On the other hand, our support for the wild variety of ways people can (and do!) write Khmer is one place where we currently outshine the big internet search engines. (A toy sketch of canonical reordering follows this list.)
- For Mirandese, we were able to work with a community member to set up elision rules (for word-initial l', d', etc., as in some other Romance languages) and translate a Portuguese stop word list.
- As mentioned above, version 7.10 of Elasticsearch supports normalization and stop words for Persian.
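As promised, here's a toy sketch of Khmer canonical reordering. It enforces exactly one rule (a coeng "subscript consonant" pair moves in front of a neighboring dependent vowel); the rule and the code are illustrative, not our actual algorithm, which handles vastly more cases.

```python
COENG = "\u17d2"                                  # invisible "stacker" character
VOWELS = {chr(c) for c in range(0x17B6, 0x17C6)}  # dependent vowel signs

def reorder_khmer(text: str) -> str:
    """Toy rule: coeng+consonant pairs go before adjacent dependent vowels."""
    chars = list(text)
    i = 0
    while i < len(chars) - 2:
        if chars[i] in VOWELS and chars[i + 1] == COENG:
            # vowel, coeng, consonant -> coeng, consonant, vowel
            chars[i], chars[i + 1], chars[i + 2] = chars[i + 1], chars[i + 2], chars[i]
        else:
            i += 1
    return "".join(chars)

# A misordered ខែ្មរ (vowel typed before the subscript consonant) can look
# just like ខ្មែរ "Khmer"; normalization makes the two match.
print(reorder_khmer("\u1781\u17c2\u17d2\u1798\u179a")
      == "\u1781\u17d2\u1798\u17c2\u179a")  # True
```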
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
Minimal Language-Specific Processing
Azerbaijani, Crimean Tatar, Gagauz, Kazakh, and Tatar have the smallest possible amount of language-specific processing. Like Turkish, they use the uppercase/lowercase pairs İ/i and I/ı, so they have the Turkish version of lowercasing configured.
However, Tatar is generally written in Cyrillic (at least on-wiki). Kazakh is also generally in Cyrillic on-wiki, and the switch to using İ/i and I/ı in the Kazakh Latin script was only made in 2021, so maybe we should count that as half?
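For the curious, here's the dotted/dotless i problem in miniature. (In Elasticsearch, the fix is just the lowercase token filter's Turkish option; the Python below only shows why generic lowercasing gets it wrong.)

```python
def turkish_lower(text: str) -> str:
    """Turkish-style lowercasing: İ pairs with i, and I pairs with ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()

print("ISTANBUL".lower())         # istanbul   (generic: I wrongly becomes i)
print(turkish_lower("ISTANBUL"))  # ıstanbul   (Turkish-correct dotless ı)
```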
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 4½–5
(Un)Intentional Specific Generic Support
Well, there's a noun phrase you don't see every day—what does it even mean?
Sometimes a language-specific (or wiki community–specific) issue gets generalized to the point where there's no trace of the motivating source. Conversely, a generic improvement can have an outsized impact on a specific language, wiki, or community.
For example, the Nias language uses lots of apostrophes, and some of the people in its Wikipedia community are apparently more comfortable composing articles in word processors, with the text then being copied to the Nias Wikipedia. Some word processors like to "smarten" quotes and apostrophes, automatically replacing them with the curly variants. This kind of variation makes searching hard. When I last looked (some time ago now), it had also resulted in Nias Wikipedia having article titles that differ only by apostrophe curliness—I assume people couldn't find the one, so they created the other. Once we got the Phab ticket, we added some Nias-specific apostrophe normalization that fixed a lot of their problems.
Does Nias-specific apostrophe normalization count as supporting Nias? It might arguably fall into the "minimal" category.
About a year later, we cautiously and deliberately tested similar apostrophe normalization for all wikis, and eventually added it as a default, which removed all Nias-specific config in our code.
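For illustration, apostrophe normalization amounts to folding the curly and look-alike characters to one canonical form. The mapping below is a sample, not our exact production list:

```python
# Fold common apostrophe look-alikes to the plain ASCII apostrophe.
APOSTROPHE_LOOKALIKES = {
    "\u2019": "'",  # right single quotation mark (the "smart" apostrophe)
    "\u2018": "'",  # left single quotation mark
    "\u02bc": "'",  # modifier letter apostrophe
    "\u0060": "'",  # grave accent, sometimes typed as a quote
    "\u00b4": "'",  # acute accent, sometimes typed as a quote
}

def normalize_apostrophes(text: str) -> str:
    return text.translate(str.maketrans(APOSTROPHE_LOOKALIKES))

print(normalize_apostrophes("l\u2019eau") == "l'eau")  # True
```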
Does general normalization inspired by a strong need from the Nias Wiki community (but not really inherent to the Nias language) count as supporting Nias? I don't even know.
Another time, I extended some general normalization upgrades that remove "non-native" diacritics to a bunch of languages. An unexpectedly large benefit showed up in Basque: Basque searchers often leave out Spanish diacritics on Spanish words, while editors use the correct diacritics in articles, creating a mismatch.
If I hadn't bothered to do some analysis after going live, I wouldn't have known about this specific noticeable improvement. On the other hand, if I'd known about the specific problem and there wasn't a semi-generic solution, I would've wanted to implement something Basque-specific to solve it.
Does a general improvement that turns out to strongly benefit Basque count as supporting Basque? I don't even know! (In practice, this is a slightly philosophical question, since Basque has a stemmer and stop word list, too, so it's already otherwise on the "full support" list.)
I can't think of any other language-specific cases that generalized so well—though Nias wasn't the first or only case of apostrophe-like characters needing to be normalized.
Of course, general changes that were especially helpful to a particular language are easy to miss, if you don't go looking for them. Even if you do, they can be subtle. The Basque case was much easier for me, personally, to notice, because I don't speak Basque, but I know a little Spanish, so the Spanish words really stood out as such when looking at the data.
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 4½–5
—I Don't Even Know: 2+
Vague Categorical Support
It's easy enough to say that the CJK analyzer supports Japanese (where we are currently using it) and that it would be supporting Chinese and Korean if we were using it for those languages—in small part because it has limited scope, and in large part because it seems specific to Chinese, Japanese, and Korean because of the meaning of "CJK".
But what about a configuration that is not super specific, but still applied to a subset of languages?
Back in the day, we identified that "spaceless languages" (those whose writing system doesn't put spaces between words) could benefit from (or be harmed by) specific configurations.
We identified the following languages as "spaceless". We initially passed on enabling an alternate ranking algorithm (BM25) for them (Phab T152092), though we did deploy the ICU tokenizer for them by default.
- Tibetan, Dzongkha, Gan, Japanese, Khmer, Lao, Burmese, Thai, Wu, Chinese, Classical Chinese, Cantonese, Buginese, Min Dong, Cree, Hakka, Javanese, and Min Nan.
14 of those are new.
We eventually did enable BM25 for them, but this list has often gotten special consideration and testing to make sure we don't unexpectedly do bad things to them when we make changes that seem fine for languages with clearer word boundaries (like Phab T266027).
And what about the case where the "category" we are trying to support is "more or less all of them"? Our recent efforts at cross-wiki "harmonization"—making all language processing that is not language-specific as close to the same as possible on all wikis (see Phab T219550)—were a rising language tide that lifted all/most/many language boats. (An easy-to-understand example is acronym processing, so that NASA and N.A.S.A. can match more easily; see the sketch below. However, some languages—because of their writing systems—have few if any native acronyms. Foreign acronyms (like N.A.S.A.) still show up, though.)
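Acronym processing, as a toy regex (the real implementation is a character filter with more careful rules about what counts as an acronym):

```python
import re

ACRONYM_RE = re.compile(r"\b(?:\w\.){2,}")  # N.A.S.A., U.S., etc.

def fold_acronyms(text: str) -> str:
    """Remove the periods inside acronyms so N.A.S.A. matches NASA."""
    return ACRONYM_RE.sub(lambda m: m.group().replace(".", ""), text)

print(fold_acronyms("N.A.S.A. was founded in 1958."))
# NASA was founded in 1958.
```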
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 4½–5
—I Don't Even Know: 0–∞
Beyond Language Analysis
So far we've focused on the most obviously languagey of the language support in Search, which is language analysis. However, there are other parts of our system that support particular wikis in a language-specific way.
Learning to Rank
editLearning to Rank (LTR) is a plugin that uses machine learning—based on textual properties and user behavior data—to re-rank search results to move better results higher in the result list.
It makes use of many ranking signals, including making wiki-specific interpretations of textual properties—like word frequency stats, the number of words in a query or document, the distribution of matching terms, etc.
Arguably some of what the model learns is language-specific. Some is probably wiki-specific (say, because Wikipedia titles are organized differently than Wikisource titles), and some may be community-specific (say, searchers search differently on Wikipedia than they do on Wiktionary).
The results are the same or better than our previously hand-tuned ranking, and the models are regularly retrained, allowing them to keep up with changes to the way searchers behave in those languages on those wikis.
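To give a cartoon picture of what re-ranking looks like: compute features for each query/document pair, then score them with a learned model. Everything below (the features, the weights, the linear model itself) is made up for illustration; the real plugin uses trained gradient-boosted trees and many more features.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str
    incoming_links: int
    bm25_score: float   # base text-match score from the search engine

def ltr_score(query: str, doc: Doc) -> float:
    features = [
        doc.bm25_score,
        float(query.lower() in doc.title.lower()),  # query appears in title?
        min(doc.incoming_links, 1000) / 1000,       # popularity proxy, capped
    ]
    weights = [0.6, 2.0, 1.5]   # a real model learns these per wiki
    return sum(w * f for w, f in zip(weights, features))

docs = [Doc("Apollo program", 800, 7.1), Doc("Apollo", 950, 6.8)]
print(max(docs, key=lambda d: ltr_score("apollo", d)).title)  # Apollo
```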
Does that count as minimal language-specific support? Maybe? Probably?
We tested the LTR plugin on 18 wikis:
- Arabic, Chinese, Dutch, Finnish, French, German, Hebrew, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Swedish, and Vietnamese.
One of those, Vietnamese, is new to the list.
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 4½–6
—I Don't Even Know: 0–∞
Cross-Language Searching
Years ago we worked on a project on some wikis to do language detection on queries that got very few or no results, to see if we could provide results from another wiki. The process was complicated, so we only deployed it to nine of the largest (by search volume) Wikipedias:
- Dutch, English, French, German, Italian, Japanese, Portuguese, Russian, and Spanish.
Those are all covered by language analyzers above. However, for each of those wikis, we limited the specific languages that could be identified by the language-ID tool (called TextCat), to maximize accuracy and relevance.
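TextCat is a classic character n-gram language identifier. Here's a toy version of the idea, with two tiny made-up sample texts standing in for real trained language profiles:

```python
from collections import Counter

def profile(text: str, n: int = 3, top: int = 300) -> list[str]:
    """Character n-grams of a text, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(query_prof: list[str], lang_prof: list[str]) -> int:
    """Classic TextCat-style rank distance between two profiles."""
    ranks = {g: r for r, g in enumerate(lang_prof)}
    penalty = len(lang_prof)  # cost of an n-gram the language profile lacks
    return sum(abs(r - ranks.get(g, penalty)) for r, g in enumerate(query_prof))

samples = {
    "french": "le chat est sur la chaise et le chien dort",
    "english": "the cat is on the chair and the dog sleeps",
}
profiles = {lang: profile(text) for lang, text in samples.items()}
query = profile("le chien")
print(min(profiles, key=lambda lang: out_of_place(query, profiles[lang])))
# french
```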
The specific languages allowed to be identified per wiki are listed in a table in a write-up about the project.
The consolidated list of those languages is:
- Afrikaans, Arabic, Armenian, Bengali, Breton, Burmese, Chinese, Croatian, Czech, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Tagalog, Telugu, Thai, Ukrainian, Urdu, and Vietnamese.
Ten of those are not covered by the language analyzers, and nine are covered by neither the analyzers nor the LTR plugin: Afrikaans, Breton, Burmese, Georgian, Icelandic, Latin, Tagalog, Telugu, and Urdu. (Vietnamese is covered both by Learning to Rank and TextCat.)
Does sending queries from the largest wikis to other wikis count as some sort of minimal support? Maybe. Arguably. Perhaps.
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 4½–15
—I Don't Even Know: 0–∞
Updates
In order to make updates easier to find and to keep the bookkeeping simple—or at least simpler—I'm going to add new info here at the end.
ICU Folding—Minimal Inside-Out Language-Specific Processing
(Sept 2024) We do some basic normalization on all languages, like lowercasing and converting fairly outré text like 𝒩ⓞ𝖗𝚖𝑎𝖑ᵢ𝔷𝕒𝐭𝒊𝕠ⁿ to the more pedestrian normalization.
We also have some fairly aggressive normalization (ICU folding) available, but we can't enable it everywhere because it will normalize things that should not be normalized in a given language, like converting Russian й to и, Swedish å to a, and Khmer ខ្មែរ to ខមែរ.
On English-language wikis, these conversions aren't terrible (and worth it to usefully conflate outré and outre). There's not an overwhelming amount of Russian, Swedish, or Khmer text on English-language wikis, and the simplified or more basic forms can be easier to type. But on Swedish-language wikis we want to preserve å (and ä and ö), and on Russian-language wikis we want to preserve й (but not ё, for historical reasons), and on Khmer-language wikis... ICU folding needs to keep its hands to itself!
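Mechanically, the way we keep ICU folding's hands off particular characters is the ICU analysis plugin's unicode_set_filter option, which restricts which characters the folding filter is allowed to touch. Here's a sketch of a Swedish-style configuration (simplified, and shown as a Python dict of Elasticsearch-style settings; the names are illustrative):

```python
swedish_folding = {
    "analysis": {
        "filter": {
            "swedish_icu_folding": {
                "type": "icu_folding",
                # Fold everything EXCEPT å, ä, and ö (either case).
                "unicode_set_filter": "[^åäöÅÄÖ]",
            }
        },
        "analyzer": {
            "swedish_text": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "swedish_icu_folding"],
            }
        },
    }
}
# With this config, señor still folds to senor, but smörgåsbord keeps
# its ö and å.
```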
Fortunately, we can customize ICU folding like this for each language, though working out the right exceptions takes a little investigation. I did some work on harmonizing our folding configurations (converting four custom configs to one of the two standard configs), and then on expanding ICU folding to 31½ additional languages:
- Albanian, Assamese, Azerbaijani, Burmese, Chinese, Crimean Tatar, Gagauz, Georgian, Gujarati, Igbo, Indonesian / Malay, Kannada, Kazakh, Khmer, Korean, Malayalam, Marathi, Mirandese, Nepali, Odia, Polish, Punjabi, Sinhala, Slovenian, Swahili, Tagalog, Tamil, Tatar, Telugu, Uzbek, and Vietnamese.
Of those, these 15 are new to our accounting here:
- Albanian, Assamese, Gujarati, Igbo, Kannada, Malayalam, Marathi, Nepali, Odia, Punjabi, Sinhala, Slovenian, Swahili, Tamil, and Uzbek
At a finer grain of detail, though, not all of the 31½ languages added (or the 15 new ones) feel like they got the same level of support.
- Some, like Swahili, Indonesian, and Punjabi, needed no customization. They just got added to the "ICU folding is ok" list.
- Most got a list of language-specific folding exceptions of varying length—from one character for several Indic scripts, to 134 characters for Vietnamese!
On the other hand, ICU folding is really a sort of inside-out support for a language, and the length of the list of exceptions isn't the best metric. We are generally attempting to not do anything to words in the "host" language, while we clobber the diacritics and variant characters in every other language or script we can. The benefit is really to the "foreign" text on those wikis—and for the readers who will have an easier time with unfamiliar diacritics, and will not have to keep track of which flavor of Zoe/Zoë/Zöe/Zoé/Zoê they are dealing with on any given day.
Some languages really did get a little extra support, though—it's still fairly minimal, but every little bit helps! Sometimes, while I was eyeball-deep in their orthography, I learned about some historically rare or obsolete character, or some variation in the way important diacritics and other combining characters can be stacked, and right then seemed like the best time to throw in a little mapping hack to handle those cases more transparently... so I did!
Running Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 19½–30
—I Don't Even Know: 0–∞
Conclusions?
What, if any, specific conclusions can we draw? Let's look again at the list we have so far (even though it is also right above).
"Final" Count:
—Full: 40–47½ (40 languages + 7½ major language varieties)
—Moderate: 4
—Minimal: 19½–30
—I Don't Even Know: 0–∞
We have good to great support ("moderate" or "full") for 44 inarguably distinct languages, though it's very reasonable to claim 51½ named language varieties.
The Search Platform team loves to make improvements to on-wiki search that are relevant to all or almost all languages (like acronym handling) or that help all wikis (like very basic parsing for East Asian languages on any wiki). So, how many on-wiki communities does the Search team support? All of them, of course!
Exactly how many languages is that? I don’t even know.