User:TJones (WMF)/Notes/Fallback Redux
September/October 2017 — See TJones_(WMF)/Notes for other projects. See also T147959 and Disabling Messaging Fallbacks for Language Analysis.
Background
As noted in my write-up from last year, messaging fallback languages that make sense geographically and historically, but not necessarily linguistically, are also being used to enable language analyzers in places where they don't make a ton of sense.
Data
I did a quick-n-dirty analysis of languages as configured in code last time, but this time I pulled out the actual live configuration for Wikipedias in every language, and for the "Other Wikimedia projects" listed on the Special:SiteMatrix page on mediawiki, where possible. For private wikis, I used the info on the main page of the wiki and the config under wgLanguageCode in wmf-config/InitialiseSettings.php (very large link).
There are a few mismatches between config in code and the live config in production, probably caused by fallback languages being configured after the wikis were started; those wikis haven't been re-indexed yet, so the new fallback config hasn't had a chance to take effect.
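A minimal sketch of how such mismatches could be spotted, assuming the code config and the live config have already been collected into a simple mapping. The dictionary layout and the function name are hypothetical; the three entries are taken from the table below.

```python
# Hypothetical layout: wiki domain -> (analyzer configured in code,
#                                      analyzer actually live in production).
# The values are copied from the table on this page.
ANALYZERS = {
    "atj.wikipedia.org": ("French", "ICU normalizer + Standard tokenizer"),
    "kbp.wikipedia.org": ("French", "ICU normalizer + Standard tokenizer"),
    "kl.wikipedia.org": ("Danish", "Danish"),
}


def pending_reindex(analyzers):
    """Return the domains whose live analyzer differs from the code config."""
    return sorted(
        domain
        for domain, (configured, live) in analyzers.items()
        if configured != live
    )


if __name__ == "__main__":
    for domain in pending_reindex(ANALYZERS):
        print(domain)
```

With these three entries, only the Atikamekw and Kabiye wikis are reported, which matches the rows marked # in the table.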
Analysis
The table below has all the wikis I looked at, grouped by language configured. For wikis with fallback language analyzers enabled, I also listed the number of articles on the wikis and the percentage of search traffic for each wiki. The numbers are snapshots so the links have changed, but they should give workable estimates.
The columns include:
- Compatibility (and some other info), indicated by the following codes:
  - + = configured language matches content, or is ICU default
  - ! = plausible, as languages are listed as mutually intelligible in written form, but not guaranteed
  - ? = genetic relation to fallback; may be useful (but I'm extremely doubtful)
  - x = no genetic relation to fallback
  - # = should be in configured language, and would be if re-indexed, but is not currently
  - - = wiki is closed
- WP Articles—count of articles in Wikipedia in that language
- Search Volume—percentage of search volume from Discovery dashboards; 3% is high for anything other than English
- Lg of Wiki—Language of the Wiki in question.
- Lg Used—Language analyzer configured. Note that CJK is a generic processor for Chinese, Japanese, and Korean. ICU is an open-source library for Unicode processing.
- Wiki domain—the domain of the wiki
- Notes—Notes on mutual intelligibility, differences in code/live configuration, etc. * indicates that I had to get the language used from code or the main page of wiki, since the live config was unavailable since it's a private wiki.
For each language group there is a summary row. The row lists totals for displayed article counts and search volumes (i.e., those from potentially incompatible wikis). Language families of the unrelated languages are also listed (e.g., Eskimo-Aleut is listed in the row for Danish because Greenlandic, which falls back to Danish, is an Eskimo-Aleut language, while Danish is not—it's Germanic).
The language groups with potential problems are listed here alphabetically. The others are listed at the end of the page—provided for completeness, but not very interesting.
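As a rough illustration of how a summary row is put together, the sketch below re-adds the displayed values for the Dutch group. The tuple layout and the function name are made up for illustration; the numbers are copied from the Dutch rows in the table.

```python
# Rows for wikis that fall back to the Dutch analyzer, copied from the table:
# (compatibility code, article count, search volume in percent, language).
DUTCH_ROWS = [
    ("?", 6952, 0.001, "Dutch Low Saxon"),
    ("?", 12043, 0.004, "Limburgish"),
    ("x", 1059, 0.000, "Sranan"),
    ("?", 6209, 0.004, "West Flemish"),
    ("?", 4379, 0.000, "Zeelandic"),
]


def summarize(rows):
    """Total the displayed article counts and search volumes for one group."""
    total_articles = sum(row[1] for row in rows)
    # Round to three decimals, the precision used in the table.
    total_volume = round(sum(row[2] for row in rows), 3)
    return total_articles, total_volume


if __name__ == "__main__":
    articles, volume = summarize(DUTCH_ROWS)
    print(f"Dutch | {articles:,} | {volume:.3f}%")
```

The totals come out to 30,642 articles and 0.009% of search volume, the same as the Dutch summary row. Wikis marked + contribute nothing because their counts are not displayed.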
Notes
- Having a "genetic relation" to a fallback (indicated by ?) means the languages may be only as closely related as Spanish and Romanian or German and English—so they are not necessarily very similar.
- Mutual intelligibility (indicated by !) between languages means that speakers are usually clever enough to figure out how to understand one another. It does not mean that software will necessarily have a similarly easy time.
  - As a hypothetical example, a simple variant of English replacing all plural -s with -z, the endings -ing and -ed with -in and -t, -able/-ible with -oble, -ly with -like, and all ws with vs vould be perfectlike understandoble (that is, very highlike mutuallike intelligoble) to most English speakerz, but our English language analysis softvare vould be confust by the changez and vould be makin many mistakez.
  - On the other hand, sometimes two very similar "dialects" are varieties of the same language separated for historical, political, or cultural reasons.
- The French Wikipedia has the French language analyzer and ICU folding enabled, while all others with French only have the French language analyzer.
- Atikamekw and Kabiye have French configured as their fallback, but are not using it in production.
- A new Hebrew language analyzer has recently been deployed. The Yiddish Wikipedia is configured to use it, but it will not be enabled until the wiki is re-indexed.
- I've given some examples of poor and pointless processing on the main community communication page.
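The hypothetical English variant from the notes above can be sketched in a few lines. Everything here is an assumption made for illustration: the regex rules are a crude approximation (for instance, every word-final s becomes z, not just plurals), and `toy_stem` is a deliberately naive suffix stripper standing in for a real English analyzer, not the one used in production.

```python
import re

# Spelling changes of the hypothetical variant, applied in order to a
# lowercase word. Simplification: "s$" rewrites any final s, not only plurals.
VARIANT_RULES = [
    (r"ing$", "in"),
    (r"ed$", "t"),
    (r"[ai]ble$", "oble"),
    (r"ly$", "like"),
    (r"s$", "z"),
    (r"w", "v"),
]

# Suffixes the toy stemmer knows about (standard English spellings only).
SUFFIXES = ("ing", "ed", "ly", "s")


def to_variant(word):
    """Rewrite a standard English word into the hypothetical variant."""
    for pattern, replacement in VARIANT_RULES:
        word = re.sub(pattern, replacement, word)
    return word


def toy_stem(word):
    """Strip one known suffix, if the remaining stem keeps 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for word in ["change", "changes", "making", "perfectly"]:
        variant = to_variant(word)
        print(word, toy_stem(word), variant, toy_stem(variant))
```

In standard spelling the toy stemmer maps "change" and "changes" to the same stem. After the rewrite, "changez" no longer ends in a suffix the stemmer recognizes, so the two forms stop matching, even though a human reader has no trouble with either.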
The Table
Compatibility | WP Articles | Search Volume | Lg of Wiki | Lg Used | Wiki domain | Notes |
Arabic | 17,186 | 0.033% | ||||
+ | Arabic | Arabic | ar.wikipedia.org | |||
? | 17,186 | 0.033% | Egyptian Arabic | Arabic | arz.wikipedia.org | |
Catalan | 83,546 | 0.027% | ||||
+ | Catalan | Catalan | ca.wikipedia.org | |||
! | 83,546 | 0.027% | Occitan | Catalan | oc.wikipedia.org | (high mutual intelligibility) |
Czech | 222,068 | 0.276% | ||||
+ | Czech | Czech | cs.wikipedia.org | |||
! | 222,068 | 0.276% | Slovak | Czech | sk.wikipedia.org | (significant mutual intelligibility) |
+ | Czech | Czech | arbcom-cs.wikipedia.org | * | ||
Danish | 1,646 | 0.001% | (Eskimo–Aleut) | |||
+ | Danish | Danish | da.wikipedia.org | |||
x | 1,646 | 0.001% | Greenlandic | Danish | kl.wikipedia.org | |
+ | Danish | Danish | dk.wikimedia.org | |||
Dutch | 30,642 | 0.009% | ||||
+ | Dutch | Dutch | nl.wikipedia.org | |||
? | 6,952 | 0.001% | Dutch Low Saxon | Dutch | nds-nl.wikipedia.org | |
? | 12,043 | 0.004% | Limburgish | Dutch | li.wikipedia.org | |
x | 1,059 | 0.000% | Sranan | Dutch | srn.wikipedia.org | |
? | 6,209 | 0.004% | West Flemish | Dutch | vls.wikipedia.org | |
? | 4,379 | 0.000% | Zeelandic | Dutch | zea.wikipedia.org | |
+ | Dutch | Dutch | nl.wikimedia.org | |||
+ | Dutch | Dutch | arbcom-nl.wikipedia.org | * | ||
Finnish | 2,308 | 0.000% | ||||
+ | Finnish | Finnish | fi.wikipedia.org | |||
!? | 2,308 | 0.000% | Livvi-Karelian | Finnish | olo.wikipedia.org | (dialect of Karelian, which is highly mutually intelligible with Finnish) |
+ | Finnish | Finnish | fi.wikimedia.org | |||
+ | Finnish | Finnish | arbcom-fi.wikipedia.org | * | ||
French | 233,042 | 0.070% | (Niger-Congo, Austronesian, Algic, Celtic) | |||
x | 3,090 | 0.000% | Bambara | French | bm.wikipedia.org | |
x | 63,035 | 0.018% | Breton | French | br.wikipedia.org | |
! | 2,627 | 0.000% | Franco-Provençal | French | frp.wikipedia.org | |
+ | French | French + ICU Folding | fr.wikipedia.org | |||
x | 220 | 0.000% | Fula | French | ff.wikipedia.org | |
? | 51,518 | 0.002% | Haitian | French | ht.wikipedia.org | |
x | 2,915 | 0.001% | Lingala | French | ln.wikipedia.org | |
x | 84,634 | 0.010% | Malagasy | French | mg.wikipedia.org | |
? | 3,627 | 0.000% | Norman | French | nrm.wikipedia.org | |
? | 3,525 | 0.031% | Picard | French | pcd.wikipedia.org | |
x | 253 | 0.000% | Sango | French | sg.wikipedia.org | |
x | 1,191 | 0.000% | Tahitian | French | ty.wikipedia.org | |
? | 14,631 | 0.004% | Walloon | French | wa.wikipedia.org | |
x | 1,157 | 0.003% | Wolof | French | wo.wikipedia.org | |
x# | 79 | 0.000% | Atikamekw | French | atj.wikipedia.org | Using ICU normalizer + Standard tokenizer |
x# | 540 | 0.001% | Kabiye | French | kbp.wikipedia.org | Using ICU normalizer + Standard tokenizer |
German | 154,421 | 0.059% | (Slavic) | |||
? | 23,330 | 0.017% | Alemannic | German | als.wikipedia.org | |
? | 23,087 | 0.009% | Bavarian | German | bar.wikipedia.org | |
+ | German | German | de.wikipedia.org | |||
? | 26,703 | 0.008% | Low Saxon | German | nds.wikipedia.org | |
x | 3,088 | 0.001% | Lower Sorbian | German | dsb.wikipedia.org | |
! | 50,146 | 0.011% | Luxembourgish | German | lb.wikipedia.org | (partial mutual intelligibility) |
? | 5,303 | 0.001% | North Frisian | German | frr.wikipedia.org | |
? | 2,071 | 0.001% | Palatinate German | German | pfl.wikipedia.org | |
? | 1,800 | 0.001% | Pennsylvania German | German | pdc.wikipedia.org | |
? | 2,836 | 0.000% | Ripuarian | German | ksh.wikipedia.org | |
? | 3,786 | 0.001% | Saterland Frisian | German | stq.wikipedia.org | |
x | 12,271 | 0.009% | Upper Sorbian | German | hsb.wikipedia.org | |
+ | German | German | arbcom-de.wikipedia.org | * | ||
Greek | 453 | 0.000% | ||||
+ | Greek | Greek | el.wikipedia.org | |||
? | 453 | 0.000% | Pontic Greek | Greek | pnt.wikipedia.org | ("at best" partial mutual intelligibility) |
Hebrew | 14,101 | 0.008% | (Germanic) | |||
+# | Hebrew | Hebrew | he.wikipedia.org | Hebrew analysis not yet deployed | ||
x# | 14,101 | 0.008% | Yiddish | Hebrew | yi.wikipedia.org | Hebrew analysis not yet deployed |
+# | Hebrew | Hebrew | il.wikimedia.org | * / Hebrew analysis not yet deployed | ||
Hindi | 22,999 | 0.040% | ||||
+ | Hindi | Hindi | hi.wikipedia.org | |||
? | 11,817 | 0.002% | Maithili | Hindi | mai.wikipedia.org | |
? | 11,182 | 0.038% | Sanskrit | Hindi | sa.wikipedia.org | |
Indonesian | 347,328 | 0.027% | ||||
? | 7,228 | 0.000% | Acehnese | Indonesian | ace.wikipedia.org | |
? | 1,727 | 0.000% | Banjar | Indonesian | bjn.wikipedia.org | |
? | 13,285 | 0.001% | Banyumasan | Indonesian | map-bms.wikipedia.org | |
x | 14,120 | 0.000% | Buginese | Indonesian | bug.wikipedia.org | (partially in Lontara alphabet) |
+ | Indonesian | Indonesian | id.wikipedia.org | |||
? | 50,295 | 0.016% | Javanese | Indonesian | jv.wikipedia.org | |
? | 221,993 | 0.001% | Minangkabau | Indonesian | min.wikipedia.org | |
? | 38,680 | 0.009% | Sundanese | Indonesian | su.wikipedia.org | |
Italian | 181,597 | 0.041% | ||||
! | 5,454 | 0.013% | Corsican | Italian | co.wikipedia.org | |
? | 9,034 | 0.005% | Emilian-Romagnol | Italian | eml.wikipedia.org | |
? | 3,186 | 0.000% | Friulian | Italian | fur.wikipedia.org | |
+ | Italian | Italian | it.wikipedia.org | |||
? | 3,281 | 0.000% | Ligurian | Italian | lij.wikipedia.org | |
x | 36,147 | 0.003% | Lombard | Italian | lmo.wikipedia.org | (explicitly listed as not mutually intelligible with Italian) |
!x | 14,466 | 0.002% | Neapolitan | Italian | nap.wikipedia.org | (conflicting info on mutual intelligibility with Italian) |
? | 64,183 | 0.001% | Piedmontese | Italian | pms.wikipedia.org | |
!x | 25,642 | 0.014% | Sicilian | Italian | scn.wikipedia.org | (conflicting info on mutual intelligibility with Italian) |
? | 9,234 | 0.000% | Tarantino | Italian | roa-tara.wikipedia.org | |
x | 10,970 | 0.003% | Venetian | Italian | vec.wikipedia.org | (explicitly listed as not mutually intelligible with Italian) |
Latvian | 801 | 0.000% | ||||
? | 801 | 0.000% | Latgalian | Latvian | ltg.wikipedia.org | |
+ | Latvian | Latvian | lv.wikipedia.org | |||
Lithuanian | 16,128 | 0.000% | ||||
+ | Lithuanian | Lithuanian | lt.wikipedia.org | |||
? | 16,128 | 0.000% | Samogitian | Lithuanian | bat-smg.wikipedia.org | |
Norwegian | 134,828 | 0.038% | ||||
+ | Norwegian Bokmål | Norwegian | no.wikipedia.org | |||
? | 134,828 | 0.038% | Norwegian Nynorsk | Norwegian | nn.wikipedia.org | Elastic has a light_nynorsk stemmer |
+ | Norwegian | Norwegian | no.wikimedia.org | |||
+ | Norwegian | Norwegian | noboard-chapters.wikimedia.org | * | ||
Persian | 70,131 | 0.006% | (Turkic) | |||
? | 5,679 | 0.000% | Gilaki | Persian | glk.wikipedia.org | |
? | 12,539 | 0.002% | Mazandarani | Persian | mzn.wikipedia.org | |
? | 5,324 | 0.000% | Northern Luri | Persian | lrc.wikipedia.org | |
+ | Persian | Persian | fa.wikipedia.org | |||
x | 46,589 | 0.004% | Southern Azerbaijani | Persian | azb.wikipedia.org | |
Polish | 11,537 | 0.033% | ||||
x | 5,205 | 0.029% | Kashubian | Polish | csb.wikipedia.org | (explicitly listed as not mutually intelligible with Polish) |
+ | Polish | Polish | pl.wikipedia.org | |||
? | 6,332 | 0.004% | Silesian | Polish | szl.wikipedia.org | |
+ | Polish | Polish | pl.wikimedia.org | |||
Portuguese | 3,517 | 0.001% | ||||
! | 3,517 | 0.001% | Mirandese | Portuguese | mwl.wikipedia.org | |
+ | Portuguese | Portuguese | pt.wikipedia.org | |||
Romanian | 2,205 | 0.001% | (Indo-Iranian) | |||
? | 1,210 | 0.000% | Aromanian | Romanian | roa-rup.wikipedia.org | |
x- | 394 | 0.000% | Moldovan Cyrillic (Romanian) | Romanian | mo.wikipedia.org | |
x | 601 | 0.001% | Romani | Romanian | rmy.wikipedia.org | |
+ | Romanian | Romanian | ro.wikipedia.org | |||
Russian | 394,996 | 0.022% | (Turkic, Uralic, Mongolic) | |||
x | 3,220 | 0.001% | Abkhazian | Russian | ab.wikipedia.org | |
x | 2,312 | 0.000% | Avar | Russian | av.wikipedia.org | |
x | 39,808 | 0.007% | Bashkir | Russian | ba.wikipedia.org | |
x | 1,989 | 0.000% | Buryat | Russian | bxr.wikipedia.org | |
x | 164,314 | 0.002% | Chechen | Russian | ce.wikipedia.org | |
x | 40,620 | 0.001% | Chuvash | Russian | cv.wikipedia.org | |
x | 3,861 | 0.000% | Erzya | Russian | myv.wikipedia.org | |
x | 10,240 | 0.000% | Hill Mari | Russian | mrj.wikipedia.org | |
x | 2,074 | 0.000% | Kalmyk | Russian | xal.wikipedia.org | |
x | 2,019 | 0.001% | Karachay-Balkar | Russian | krc.wikipedia.org | |
x | 5,250 | 0.001% | Komi | Russian | kv.wikipedia.org | |
x | 3,448 | 0.000% | Komi-Permyak | Russian | koi.wikipedia.org | |
x | 1,213 | 0.000% | Lak | Russian | lbe.wikipedia.org | |
x | 3,846 | 0.000% | Lezgian | Russian | lez.wikipedia.org | |
x | 9,649 | 0.000% | Meadow Mari | Russian | mhr.wikipedia.org | |
x | 1,171 | 0.000% | Moksha | Russian | mdf.wikipedia.org | |
x | 10,529 | 0.001% | Ossetian | Russian | os.wikipedia.org | |
+ | Russian | Russian | ru.wikipedia.org | |||
x | 11,407 | 0.001% | Sakha | Russian | sah.wikipedia.org | |
x | 72,540 | 0.007% | Tatar | Russian | tt.wikipedia.org | |
x | 1,410 | 0.000% | Tuvan | Russian | tyv.wikipedia.org | |
x | 4,076 | 0.000% | Udmurt | Russian | udm.wikipedia.org | |
+ | Russian | Russian | ru.wikimedia.org | |||
Spanish | 128,134 | 0.057% | (Aymara, Tupi–Guarani, Uto-Aztecan, Quechuan) | |||
! | 32,383 | 0.011% | Aragonese | Spanish | an.wikipedia.org | |
! | 50,499 | 0.018% | Asturian | Spanish | ast.wikipedia.org | |
x | 4,250 | 0.001% | Aymara | Spanish | ay.wikipedia.org | |
? | 3,004 | 0.000% | Chavacano | Spanish | cbk-zam.wikipedia.org | |
? | 2,910 | 0.000% | Extremaduran | Spanish | ext.wikipedia.org | |
x | 3,209 | 0.023% | Guarani | Spanish | gn.wikipedia.org | |
? | 4,498 | 0.003% | Ladino | Spanish | lad.wikipedia.org | |
x | 7,113 | 0.000% | Nahuatl | Spanish | nah.wikipedia.org | |
x | 20,268 | 0.001% | Quechua | Spanish | qu.wikipedia.org | |
+ | Spanish | Spanish | es.wikipedia.org | |||
+ | Spanish | Spanish | ar.wikimedia.org | |||
+ | Spanish | Spanish | co.wikimedia.org | |||
+ | Spanish | Spanish | mx.wikimedia.org | |||
Turkish | 2,757 | 0.003% | ||||
!? | 2,757 | 0.003% | Gagauz | Turkish | gag.wikipedia.org | (partial mutual intelligibility) |
+ | Turkish | Turkish | tr.wikipedia.org | |||
+ | Turkish | Turkish | tr.wikimedia.org | |||
Ukrainian | 6,160 | 0.003% | ||||
? | 6,160 | 0.003% | Rusyn | Ukrainian | rue.wikipedia.org | |
+ | Ukrainian | Ukrainian | uk.wikipedia.org | |||
+ | Ukrainian | Ukrainian | ua.wikimedia.org |
Next Steps
There are 102 wikis with non-exact language analysis configurations:
- 47 are obvious linguistic mis-matches.
- 12 are configured with the analyzer for a reasonably mutually intelligible language and so have a reasonable potential to be doing more good than harm.
- The middle 43 are genetically related, but not really very likely on average to benefit hugely from having the wrong-language analyzer.
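The three counts can be re-derived from the compatibility codes in the table. The sketch below is one way to do that. The codes are copied from the table (rows marked + are left out), but the bucketing rule is an assumption inferred from the totals rather than something stated in these notes: any code containing ! counts as mutually intelligible, any other code containing x counts as a mismatch, and the rest count as related.

```python
from collections import Counter

# Compatibility codes of the non-"+" rows, grouped by configured analyzer.
CODES_BY_ANALYZER = {
    "Arabic": "?",
    "Catalan": "!",
    "Czech": "!",
    "Danish": "x",
    "Dutch": "? ? x ? ?",
    "Finnish": "!?",
    "French": "x x ! x ? x x ? ? x x ? x x# x#",
    "German": "? ? ? x ! ? ? ? ? ? x",
    "Greek": "?",
    "Hebrew": "x#",
    "Hindi": "? ?",
    "Indonesian": "? ? ? x ? ? ?",
    "Italian": "! ? ? ? x !x ? !x ? x",
    "Latvian": "?",
    "Lithuanian": "?",
    "Norwegian": "?",
    "Persian": "? ? ? x",
    "Polish": "x ?",
    "Portuguese": "!",
    "Romanian": "? x- x",
    "Russian": " ".join(["x"] * 21),
    "Spanish": "! ! x ? ? x ? x x",
    "Turkish": "!?",
    "Ukrainian": "?",
}


def bucket(code):
    """Assumed rule: '!' wins over 'x', and 'x' wins over '?'."""
    if "!" in code:
        return "mutually intelligible"
    if "x" in code:
        return "mismatch"
    return "related"


def tally(codes_by_analyzer):
    """Count wikis per bucket across all analyzer groups."""
    counts = Counter()
    for codes in codes_by_analyzer.values():
        counts.update(bucket(code) for code in codes.split())
    return counts


if __name__ == "__main__":
    for name, count in tally(CODES_BY_ANALYZER).most_common():
        print(name, count)
```

Under that assumed rule the tally gives 47 mismatches, 43 related, and 12 mutually intelligible, for 102 wikis in total.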
I've done a more detailed but still rough analysis of the similarity of the potential keepers, and asked the community for feedback on the following:
- Egyptian Arabic (Maṣri) as Arabic
- Gagauz as Turkish
- Limburgish as Dutch
- Livvi-Karelian as Finnish
- Mirandese as Portuguese
- Occitan as Catalan
- Slovak as Czech
We'll see what comes of those discussions. In the meantime I've configured these as exceptions in the [WIP] patch I've submitted to Gerrit.
The rest are scheduled to be disabled in the code the week of October 9th, though the actual re-indexing may take a while after that. Re-indexing is tracked on Phab task T177871.
The outline of the plan has been laid out on another page: Disabling Messaging Fallbacks for Language Analysis, which is where community discussion will be directed, though there are also links back to here and to Phab.
The Rest of the Table
This is the rest of the table from above, where nothing terribly exciting is happening. Everything is either using the appropriate language or the ICU default.
Compatibility | Lg of Wiki | Lg Used | Wiki domain | Notes |
Armenian | ||||
+ | Armenian | Armenian | hy.wikipedia.org | |
Basque | ||||
+ | Basque | Basque | eu.wikipedia.org | |
Brazilian Portuguese | ||||
+ | Brazilian Portuguese | Brazilian Portuguese | br.wikimedia.org | |
Bulgarian | ||||
+ | Bulgarian | Bulgarian | bg.wikipedia.org | |
Chinese | ||||
+ | Chinese | Chinese | zh.wikipedia.org | |
+ | Chinese | Chinese | cn.wikimedia.org | |
CJK | ||||
+ | Japanese | CJK | ja.wikipedia.org | |
+ | Korean | CJK | ko.wikipedia.org | |
English | ||||
+ | English | English | en.wikipedia.org | |
+ | English | English | simple.wikipedia.org | |
+ | English | English | nostalgia.wikipedia.org | |
+ | English | English | test.wikipedia.org | |
+ | English | English | test2.wikipedia.org | |
+ | English | English | be.wikimedia.org | (yep, the Wikimedia Belgium site is in English) |
+ | English | English | beta.wikiversity.org | |
+ | English | English | ca.wikimedia.org | |
+ | English | English | commons.wikimedia.org | |
+ | English | English | donate.wikimedia.org | |
+ | English | English | incubator.wikimedia.org | |
+ | English | English | labtestwikitech.wikimedia.org | |
+ | English | English | login.wikimedia.org | |
+ | English | English | meta.wikimedia.org | |
+ | English | English | nyc.wikimedia.org | |
+ | English | English | outreach.wikimedia.org | |
+ | English | English | species.wikimedia.org | |
+ | English | English | test.wikidata.org | |
+ | English | English | vote.wikimedia.org | |
+ | English | English | wikimania2017.wikimedia.org | |
+ | English | English | wikimediafoundation.org | |
+ | English | English | wikisource.org | |
+ | English | English | wikitech.wikimedia.org | |
+ | English | English | www.mediawiki.org | |
+ | English | English | www.wikidata.org | |
+- | English | English | ten.wikipedia.org | |
+- | English | English | advisory.wikimedia.org | |
+- | English | English | nz.wikimedia.org | |
+- | English | English | pa-us.wikimedia.org | |
+- | English | English | quality.wikimedia.org | |
+- | English | English | strategy.wikimedia.org | |
+- | English | English | usability.wikimedia.org | |
+- | English | English | wikimania2005.wikimedia.org | |
+- | English | English | wikimania2006.wikimedia.org | |
+- | English | English | wikimania2007.wikimedia.org | |
+- | English | English | wikimania2008.wikimedia.org | |
+- | English | English | wikimania2009.wikimedia.org | |
+- | English | English | wikimania2010.wikimedia.org | |
+- | English | English | wikimania2011.wikimedia.org | |
+- | English | English | wikimania2012.wikimedia.org | |
+- | English | English | wikimania2013.wikimedia.org | |
+- | English | English | wikimania2014.wikimedia.org | |
+- | English | English | wikimania2015.wikimedia.org | |
+- | English | English | wikimania2016.wikimedia.org | |
+ | English | English | affcom.wikimedia.org | * |
+ | English | English | arbcom-en.wikipedia.org | * |
+ | English | English | auditcom.wikimedia.org | * |
+ | English | English | chair.wikimedia.org | * |
+ | English | English | checkuser.wikimedia.org | * |
+ | English | English | collab.wikimedia.org | * |
+ | English | English | ec.wikimedia.org | * |
+ | English | English | exec.wikimedia.org | * |
+ | English | English | fdc.wikimedia.org | * |
+ | English | English | grants.wikimedia.org | * |
+ | English | English | iegcom.wikimedia.org | * |
+ | English | English | legalteam.wikimedia.org | * |
+ | English | English | office.wikimedia.org | * |
+ | English | English | ombudsmen.wikimedia.org | * |
+ | English | English | otrs-wiki.wikimedia.org | * |
+ | English | English | projectcom.wikimedia.org | * |
+ | English | English | searchcom.wikimedia.org | * |
+ | English | English | steward.wikimedia.org | * |
+ | English | English | transitionteam.wikimedia.org | * |
+ | English | English | wikimaniateam.wikimedia.org | * |
+ | English | English | zero.wikimedia.org | * |
+ | English | English | board.wikimedia.org | * |
+ | English | English | boardgovcom.wikimedia.org | * |
+ | English | English | internal.wikimedia.org | * |
+ | English | English | movementroles.wikimedia.org | * |
+ | English | English | spcom.wikimedia.org | * |
+ | English | English | techconduct.wikimedia.org | * |
+ | English | English | wg-en.wikipedia.org | * |
Galician | ||||
+ | Galician | Galician | gl.wikipedia.org | |
Hungarian | ||||
+ | Hungarian | Hungarian | hu.wikipedia.org | |
Irish | ||||
+ | Irish | Irish | ga.wikipedia.org | |
Sorani | ||||
+ | Sorani | Sorani | ckb.wikipedia.org | |
Swedish | ||||
+ | Swedish | Swedish | sv.wikipedia.org | |
+ | Swedish | Swedish | se.wikimedia.org | |
Thai | ||||
+ | Thai | Thai | th.wikipedia.org | |
ICU normalizer + ICU tokenizer | ||||
+ | Tibetan | ICU normalizer + ICU tokenizer | bo.wikipedia.org | |
+ | Min Dong | ICU normalizer + ICU tokenizer | cdo.wikipedia.org | |
+ | Cree | ICU normalizer + ICU tokenizer | cr.wikipedia.org | |
+ | Dzongkha | ICU normalizer + ICU tokenizer | dz.wikipedia.org | |
+ | Gan | ICU normalizer + ICU tokenizer | gan.wikipedia.org | |
+ | Hakka | ICU normalizer + ICU tokenizer | hak.wikipedia.org | |
+ | Khmer | ICU normalizer + ICU tokenizer | km.wikipedia.org | |
+ | Lao | ICU normalizer + ICU tokenizer | lo.wikipedia.org | |
+ | Burmese | ICU normalizer + ICU tokenizer | my.wikipedia.org | |
+ | Wu | ICU normalizer + ICU tokenizer | wuu.wikipedia.org | |
+ | Classical Chinese | ICU normalizer + ICU tokenizer | zh-classical.wikipedia.org | |
+ | Min Nan | ICU normalizer + ICU tokenizer | zh-min-nan.wikipedia.org | |
+ | Cantonese | ICU normalizer + ICU tokenizer | zh-yue.wikipedia.org | |
ICU normalizer + Standard tokenizer | ||||
+ | Adyghe | ICU normalizer + Standard tokenizer | ady.wikipedia.org | |
+ | Afrikaans | ICU normalizer + Standard tokenizer | af.wikipedia.org | |
+ | Akan | ICU normalizer + Standard tokenizer | ak.wikipedia.org | |
+ | Amharic | ICU normalizer + Standard tokenizer | am.wikipedia.org | |
+ | Anglo-Saxon | ICU normalizer + Standard tokenizer | ang.wikipedia.org | |
+ | Aramaic | ICU normalizer + Standard tokenizer | arc.wikipedia.org | |
+ | Assamese | ICU normalizer + Standard tokenizer | as.wikipedia.org | |
+ | Azerbaijani | ICU normalizer + Standard tokenizer | az.wikipedia.org | |
+ | Central Bicolano | ICU normalizer + Standard tokenizer | bcl.wikipedia.org | |
+ | Belarusian-Taraškievica | ICU normalizer + Standard tokenizer | be-tarask.wikipedia.org | |
+ | Belarusian | ICU normalizer + Standard tokenizer | be.wikipedia.org | |
+ | Bihari | ICU normalizer + Standard tokenizer | bh.wikipedia.org | |
+ | Bislama | ICU normalizer + Standard tokenizer | bi.wikipedia.org | |
+ | Bengali | ICU normalizer + Standard tokenizer | bn.wikipedia.org | |
+ | Bishnupriya Manipuri | ICU normalizer + Standard tokenizer | bpy.wikipedia.org | |
+ | Bosnian | ICU normalizer + Standard tokenizer | bs.wikipedia.org | |
+ | Cebuano | ICU normalizer + Standard tokenizer | ceb.wikipedia.org | |
+ | Chamorro | ICU normalizer + Standard tokenizer | ch.wikipedia.org | |
+ | Cherokee | ICU normalizer + Standard tokenizer | chr.wikipedia.org | |
+ | Cheyenne | ICU normalizer + Standard tokenizer | chy.wikipedia.org | |
+ | Crimean Tatar | ICU normalizer + Standard tokenizer | crh.wikipedia.org | |
+ | Old Church Slavonic | ICU normalizer + Standard tokenizer | cu.wikipedia.org | |
+ | Welsh | ICU normalizer + Standard tokenizer | cy.wikipedia.org | |
+ | Dinka | ICU normalizer + Standard tokenizer | din.wikipedia.org | |
+ | Zazaki | ICU normalizer + Standard tokenizer | diq.wikipedia.org | |
+ | Doteli | ICU normalizer + Standard tokenizer | dty.wikipedia.org | |
+ | Divehi | ICU normalizer + Standard tokenizer | dv.wikipedia.org | |
+ | Ewe | ICU normalizer + Standard tokenizer | ee.wikipedia.org | |
+ | Esperanto | ICU normalizer + Standard tokenizer | eo.wikipedia.org | |
+ | Estonian | ICU normalizer + Standard tokenizer | et.wikipedia.org | |
+ | Võro | ICU normalizer + Standard tokenizer | fiu-vro.wikipedia.org | |
+ | Fijian | ICU normalizer + Standard tokenizer | fj.wikipedia.org | |
+ | Faroese | ICU normalizer + Standard tokenizer | fo.wikipedia.org | |
+ | West Frisian | ICU normalizer + Standard tokenizer | fy.wikipedia.org | |
+ | Scottish Gaelic | ICU normalizer + Standard tokenizer | gd.wikipedia.org | |
+ | Goan Konkani | ICU normalizer + Standard tokenizer | gom.wikipedia.org | |
+ | Gothic | ICU normalizer + Standard tokenizer | got.wikipedia.org | |
+ | Gujarati | ICU normalizer + Standard tokenizer | gu.wikipedia.org | |
+ | Manx | ICU normalizer + Standard tokenizer | gv.wikipedia.org | |
+ | Hausa | ICU normalizer + Standard tokenizer | ha.wikipedia.org | |
+ | Hawaiian | ICU normalizer + Standard tokenizer | haw.wikipedia.org | |
+ | Fiji Hindi | ICU normalizer + Standard tokenizer | hif.wikipedia.org | |
+ | Croatian | ICU normalizer + Standard tokenizer | hr.wikipedia.org | |
+ | Interlingua | ICU normalizer + Standard tokenizer | ia.wikipedia.org | |
+ | Interlingue | ICU normalizer + Standard tokenizer | ie.wikipedia.org | |
+ | Igbo | ICU normalizer + Standard tokenizer | ig.wikipedia.org | |
+ | Inupiak | ICU normalizer + Standard tokenizer | ik.wikipedia.org | |
+ | Ilokano | ICU normalizer + Standard tokenizer | ilo.wikipedia.org | |
+ | Ido | ICU normalizer + Standard tokenizer | io.wikipedia.org | |
+ | Icelandic | ICU normalizer + Standard tokenizer | is.wikipedia.org | |
+ | Inuktitut | ICU normalizer + Standard tokenizer | iu.wikipedia.org | |
+ | Jamaican Patois | ICU normalizer + Standard tokenizer | jam.wikipedia.org | |
+ | Lojban | ICU normalizer + Standard tokenizer | jbo.wikipedia.org | |
+ | Georgian | ICU normalizer + Standard tokenizer | ka.wikipedia.org | |
+ | Karakalpak | ICU normalizer + Standard tokenizer | kaa.wikipedia.org | |
+ | Kabyle | ICU normalizer + Standard tokenizer | kab.wikipedia.org | |
+ | Kabardian | ICU normalizer + Standard tokenizer | kbd.wikipedia.org | |
+ | Kongo | ICU normalizer + Standard tokenizer | kg.wikipedia.org | |
+ | Kikuyu | ICU normalizer + Standard tokenizer | ki.wikipedia.org | |
+ | Kazakh | ICU normalizer + Standard tokenizer | kk.wikipedia.org | |
+ | Kannada | ICU normalizer + Standard tokenizer | kn.wikipedia.org | |
+ | Kashmiri | ICU normalizer + Standard tokenizer | ks.wikipedia.org | |
+ | Kurdish | ICU normalizer + Standard tokenizer | ku.wikipedia.org | |
+ | Cornish | ICU normalizer + Standard tokenizer | kw.wikipedia.org | |
+ | Kirghiz | ICU normalizer + Standard tokenizer | ky.wikipedia.org | |
+ | Latin | ICU normalizer + Standard tokenizer | la.wikipedia.org | |
+ | Luganda | ICU normalizer + Standard tokenizer | lg.wikipedia.org | |
+ | Maori | ICU normalizer + Standard tokenizer | mi.wikipedia.org | |
+ | Macedonian | ICU normalizer + Standard tokenizer | mk.wikipedia.org | |
+ | Malayalam | ICU normalizer + Standard tokenizer | ml.wikipedia.org | |
+ | Mongolian | ICU normalizer + Standard tokenizer | mn.wikipedia.org | |
+ | Marathi | ICU normalizer + Standard tokenizer | mr.wikipedia.org | |
+ | Malay | ICU normalizer + Standard tokenizer | ms.wikipedia.org | |
+ | Maltese | ICU normalizer + Standard tokenizer | mt.wikipedia.org | |
+ | Nauruan | ICU normalizer + Standard tokenizer | na.wikipedia.org | |
+ | Nepali | ICU normalizer + Standard tokenizer | ne.wikipedia.org | |
+ | Newar | ICU normalizer + Standard tokenizer | new.wikipedia.org | |
+ | Novial | ICU normalizer + Standard tokenizer | nov.wikipedia.org | |
+ | Northern Sotho | ICU normalizer + Standard tokenizer | nso.wikipedia.org | |
+ | Navajo | ICU normalizer + Standard tokenizer | nv.wikipedia.org | |
+ | Chichewa | ICU normalizer + Standard tokenizer | ny.wikipedia.org | |
+ | Oromo | ICU normalizer + Standard tokenizer | om.wikipedia.org | |
+ | Oriya | ICU normalizer + Standard tokenizer | or.wikipedia.org | |
+ | Punjabi | ICU normalizer + Standard tokenizer | pa.wikipedia.org | |
+ | Pangasinan | ICU normalizer + Standard tokenizer | pag.wikipedia.org | |
+ | Kapampangan | ICU normalizer + Standard tokenizer | pam.wikipedia.org | |
+ | Papiamentu | ICU normalizer + Standard tokenizer | pap.wikipedia.org | |
+ | Pali | ICU normalizer + Standard tokenizer | pi.wikipedia.org | |
+ | Norfolk | ICU normalizer + Standard tokenizer | pih.wikipedia.org | |
+ | Western Punjabi | ICU normalizer + Standard tokenizer | pnb.wikipedia.org | |
+ | Pashto | ICU normalizer + Standard tokenizer | ps.wikipedia.org | |
+ | Romansh | ICU normalizer + Standard tokenizer | rm.wikipedia.org | |
+ | Kirundi | ICU normalizer + Standard tokenizer | rn.wikipedia.org | |
+ | Kinyarwanda | ICU normalizer + Standard tokenizer | rw.wikipedia.org | |
+ | Sardinian | ICU normalizer + Standard tokenizer | sc.wikipedia.org | |
+ | Scots | ICU normalizer + Standard tokenizer | sco.wikipedia.org | |
+ | Sindhi | ICU normalizer + Standard tokenizer | sd.wikipedia.org | |
+ | Northern Sami | ICU normalizer + Standard tokenizer | se.wikipedia.org | |
+ | Serbo-Croatian | ICU normalizer + Standard tokenizer | sh.wikipedia.org | |
+ | Sinhalese | ICU normalizer + Standard tokenizer | si.wikipedia.org | |
+ | Slovenian | ICU normalizer + Standard tokenizer | sl.wikipedia.org | |
+ | Samoan | ICU normalizer + Standard tokenizer | sm.wikipedia.org | |
+ | Shona | ICU normalizer + Standard tokenizer | sn.wikipedia.org | |
+ | Somali | ICU normalizer + Standard tokenizer | so.wikipedia.org | |
+ | Albanian | ICU normalizer + Standard tokenizer | sq.wikipedia.org | |
+ | Serbian | ICU normalizer + Standard tokenizer | sr.wikipedia.org | |
+ | Swati | ICU normalizer + Standard tokenizer | ss.wikipedia.org | |
+ | Sesotho | ICU normalizer + Standard tokenizer | st.wikipedia.org | |
+ | Swahili | ICU normalizer + Standard tokenizer | sw.wikipedia.org | |
+ | Tamil | ICU normalizer + Standard tokenizer | ta.wikipedia.org | |
+ | Tulu | ICU normalizer + Standard tokenizer | tcy.wikipedia.org | |
+ | Telugu | ICU normalizer + Standard tokenizer | te.wikipedia.org | |
+ | Tetum | ICU normalizer + Standard tokenizer | tet.wikipedia.org | |
+ | Tajik | ICU normalizer + Standard tokenizer | tg.wikipedia.org | |
+ | Tigrinya | ICU normalizer + Standard tokenizer | ti.wikipedia.org | |
+ | Turkmen | ICU normalizer + Standard tokenizer | tk.wikipedia.org | |
+ | Tagalog | ICU normalizer + Standard tokenizer | tl.wikipedia.org | |
+ | Tswana | ICU normalizer + Standard tokenizer | tn.wikipedia.org | |
+ | Tongan | ICU normalizer + Standard tokenizer | to.wikipedia.org | |
+ | Tok Pisin | ICU normalizer + Standard tokenizer | tpi.wikipedia.org | |
+ | Tsonga | ICU normalizer + Standard tokenizer | ts.wikipedia.org | |
+ | Tumbuka | ICU normalizer + Standard tokenizer | tum.wikipedia.org | |
+ | Twi | ICU normalizer + Standard tokenizer | tw.wikipedia.org | |
+ | Uyghur | ICU normalizer + Standard tokenizer | ug.wikipedia.org | |
+ | Urdu | ICU normalizer + Standard tokenizer | ur.wikipedia.org | |
+ | Uzbek | ICU normalizer + Standard tokenizer | uz.wikipedia.org | |
+ | Venda | ICU normalizer + Standard tokenizer | ve.wikipedia.org | |
+ | Vepsian | ICU normalizer + Standard tokenizer | vep.wikipedia.org | |
+ | Vietnamese | ICU normalizer + Standard tokenizer | vi.wikipedia.org | |
+ | Volapük | ICU normalizer + Standard tokenizer | vo.wikipedia.org | |
+ | Waray | ICU normalizer + Standard tokenizer | war.wikipedia.org | |
+ | Xhosa | ICU normalizer + Standard tokenizer | xh.wikipedia.org | |
+ | Mingrelian | ICU normalizer + Standard tokenizer | xmf.wikipedia.org | |
+ | Yoruba | ICU normalizer + Standard tokenizer | yo.wikipedia.org | |
+ | Zhuang | ICU normalizer + Standard tokenizer | za.wikipedia.org | |
+ | Zulu | ICU normalizer + Standard tokenizer | zu.wikipedia.org | |
+- | Afar | ICU normalizer + Standard tokenizer | aa.wikipedia.org | |
+- | Choctaw | ICU normalizer + Standard tokenizer | cho.wikipedia.org | |
+- | Hiri Motu | ICU normalizer + Standard tokenizer | ho.wikipedia.org | |
+- | Herero | ICU normalizer + Standard tokenizer | hz.wikipedia.org | |
+- | Nuosu | ICU normalizer + Standard tokenizer | ii.wikipedia.org | |
+- | Kuanyama | ICU normalizer + Standard tokenizer | kj.wikipedia.org | |
+- | Kanuri | ICU normalizer + Standard tokenizer | kr.wikipedia.org | |
+- | Marshallese | ICU normalizer + Standard tokenizer | mh.wikipedia.org | |
+- | Muscogee | ICU normalizer + Standard tokenizer | mus.wikipedia.org | |
+- | Ndonga | ICU normalizer + Standard tokenizer | ng.wikipedia.org | |
+ | ICU normalizer + Standard tokenizer | bd.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | ee.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | mai.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | mk.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | pt.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | rs.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | wb.wikimedia.org | ||
+ | ICU normalizer + Standard tokenizer | wikimania2018.wikimedia.org |