User:TJones (WMF)/Notes/TextCat Optimization for frwiki eswiki itwiki and dewiki
April 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T132466)
TextCat Optimization for frwiki, eswiki, itwiki, and dewiki
Summary of Results
Using the default 3K models, the best options for each wiki are presented below:
frwiki
- languages: French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean
- lang codes: fr, en, ar, ru, zh, hy, th, el, he, ko
- relevant poor-performing queries: 29%
- f0.5: 89.0%
eswiki
- languages: Spanish, English, Russian, Chinese, Arabic, Japanese
- lang codes: es, en, ru, zh, ar, ja
- relevant poor-performing queries: 47%
- f0.5: 95.6%
itwiki
- languages: Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean
- lang codes: it, en, ru, ar, zh, ja, el, ko
- relevant poor-performing queries: 29%
- f0.5: 92.2%
dewiki
- languages: German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese
- lang codes: de, en, zh, el, ru, ar, hi, th, ko, ja
- relevant poor-performing queries: 35%
- f0.5: 88.2%
Background
I’ve previously looked at optimizing the set of languages to use with TextCat for language detection on enwiki, and we have an A/B test in the works.
The next step is to do a similar analysis for other big wikis, based on query volume. The next four wikis are Italian, German, Spanish, and French Wikipedias. Due to technical difficulties and personal preferences, I will be looking at French, Spanish, Italian, and then German.
I’ve also done some preliminary work—corpus creation and initial filtering—on the next four candidate wikis: Russian, Japanese, Portuguese, and Indonesian. This was useful for defining and streamlining the process, especially with non-Latin Russian and Japanese.
Query Corpus Creation
The first step for each wiki is to extract a recent corpus of relevant queries, and then do some initial filtering to remove some of the dreck and other undesirable queries in the corpora.
Random sampling
Select 10K random queries from a recent one-week period (generally in March 2016) that meet the following criteria:
- Query came from the search box on <wiki>.wikipedia.org
- Exclude any IP that made more than 30 queries per day
- Include not more than one query from any given IP for any given day
- Only the <wiki>_content index was searched (except for wikis that search multiple indexes by default)
- Query had < 3 results
On one occasion, the Hive query got stuck in the reduce stage. The only work-around I found was to use a different week-long time period from which to extract the queries.
The Hive query for frwiki is available as an example.
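As a rough illustration of the per-IP criteria above (not the Hive query itself), here is a minimal Python sketch. The row fields `ip`, `day`, `query`, and `hits`, and the function name `sample_queries`, are assumptions made for this example; the search-box and index criteria are assumed to have been applied upstream.

```python
import random
from collections import Counter, defaultdict

def sample_queries(rows, n=10_000, max_per_day=30, max_hits=3, seed=0):
    """Sample up to n poor-performing queries, at most one per IP per day.

    rows: iterable of dicts with hypothetical keys ip, day, query, hits.
    """
    rows = list(rows)
    rng = random.Random(seed)
    # Drop any IP that made more than max_per_day queries on any single day.
    per_ip_day = Counter((r["ip"], r["day"]) for r in rows)
    heavy_ips = {ip for (ip, _day), c in per_ip_day.items() if c > max_per_day}
    # Group the remaining poor-performing queries (< max_hits results) by IP and day.
    by_ip_day = defaultdict(list)
    for r in rows:
        if r["ip"] in heavy_ips or r["hits"] >= max_hits:
            continue
        by_ip_day[(r["ip"], r["day"])].append(r["query"])
    # Keep one query per IP per day, then take a random sample of the pool.
    pool = [rng.choice(queries) for queries in by_ip_day.values()]
    return rng.sample(pool, min(n, len(pool)))
```

The fixed seed is only there to make the sketch reproducible.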
Initial filtering
There is a lot of junk in our query logs, and some of it is relatively easy to identify with reasonable accuracy. The process is as follows:
- Extract queries that have the same 1-to-10 character sequence at least three times in a row and manually review—these are mostly junk, but there are some good ones. Put the good ones back.
- Extract queries that are nothing but consonants and spaces—there are more than you would think, and it works on non-Latin languages/wikis (like ru and ja), too! These are all junk, but they are reviewed anyway.
- Extract queries that have four Latin characters in a row that are not in [aeiouhy]. Again, more than you would think, most are junk. Works in non-Latin languages, too. Works less well in German, but still found a lot of junk.
- Review remaining queries. Sort and review:
- Remove most queries with www, http, @, .com, .org, .net, .mobi, .biz, .xxx, .co.uk, and common TLDs for the language under review.
- Remove queries that aren’t words—mostly numbers, things that look like serial numbers, ID numbers, phone numbers, addresses, etc.
- Remove queries that are mostly or completely emoji.
- Note any obvious “other” languages in use in the sample. Different scripts are really obvious because they are grouped when sorted. Other languages using the same script are hit or miss.
- Incidentally remove any unwanted queries as they go by: proper names, chemical names, any other obvious gibberish not caught by the gibberish filters above.
- Sort, uniq, and randomly order remaining queries.
This cuts down the query pool by 5-25% (median of 8 languages so far: about 11%).
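The three automatic extraction steps above could be sketched with regular expressions along the following lines. This is an illustrative approximation, not the script actually used: the function name `junk_flags` is hypothetical, and treating every Latin letter other than a, e, i, o, u as a consonant is an assumption. Flagged queries would still go to manual review.

```python
import re

# The same 1-to-10 character sequence at least three times in a row.
REPEATED = re.compile(r"(.{1,10})\1\1")
# Nothing but Latin consonants and whitespace (assumption: non-aeiou letters).
CONSONANTS_ONLY = re.compile(r"[bcdfghjklmnpqrstvwxyz\s]+", re.IGNORECASE)
# Four Latin letters in a row that are not in [aeiouhy].
CONSONANT_RUN = re.compile(r"[bcdfgjklmnpqrstvwxz]{4}", re.IGNORECASE)

def junk_flags(query):
    """Return the names of the junk heuristics that a query triggers."""
    flags = []
    if REPEATED.search(query):
        flags.append("repeat")
    if query.strip() and CONSONANTS_ONLY.fullmatch(query):
        flags.append("consonants_only")
    if CONSONANT_RUN.search(query):
        flags.append("consonant_run")
    return flags
```

For example, a German compound such as "Angstschweiß" trips the four-consonant rule (on "ngst"), which is consistent with the note that this filter works less well in German.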
Language annotation and further filtering
Once we’ve removed the really obvious junk, it’s time to manually review queries to create a corpus.
- Take the first 1000 queries from the filtered and randomized sample.
- Run current language identification on it, using the language of the wiki, English (which is everywhere), and any other languages noted during initial filtering. This is far from perfect, but when it works decently, it’s helpful and reduces context switching. For example, most of the non-junk queries identified from frwiki as French are in fact French.
- Skim language ID results and see if anything is obviously terrible (e.g., most of the “German” queries are obviously French) or obviously missing (oops, there’s a query in Armenian) and run again if necessary.
- Review and manually tag the queries, removing queries that are proper names (people, places, language names, companies, products, fictional characters, etc., etc.), acronyms, more gibberish (there’s always more gibberish), scientific terminology and other words that are extremely ambiguous and not specific to any one language, and anything that’s unidentifiable.
- Queries with typos are often left in, even though they make automatic identification hard(er).
- Longer queries that include a few “undesirable” words are kept. (e.g., “Le declin du système éducatif haïtien. Quelles en sont les causes fondamentales?” would be kept since it is mostly French, but “Haïti” would not because it’s a name.)
- Proper names that are made up of common nouns are kept. (e.g., names of movies, like “Seeking a Friend for the End of the World”, are often phrases made up of normal words, and are kept. Similarly country names made up of normal words are kept: "Costa Rica", "Puerto Rico", "Côte d'Ivoire", etc.)
For French, this cut the query pool by about two thirds, leaving only one third (three hundred and something) of the queries. So the process was repeated on the next 1000 queries from the filtered and randomized sample. For Spanish, just less than half of queries were eliminated, so I stopped with 520. The goal is > 500 queries.
For frwiki, the result is a corpus of 682 queries (from 2000 reviewed, after ~15% were previously removed).
Thus, for French, only ~30% of the queries that meet the criteria for possible language detection (< 3 results) are actually in an identifiable language. In production, much of the other 70% would also often be labelled as being in a particular language, but those results (on names, acronyms, gibberish, etc.) are unpredictable and any results from another wiki may or may not be helpful. Hence the need for A/B testing after this analysis is done.
Corpus size
While 682 (the size of the French query corpus) is not a huge sample, it’s enough to get a sense of what languages are commonly present among these poor-performing queries, and to optimize the choice of which languages to detect. The 95% margin of error on a proportion is largest when the proportion is 50%. For a sample of size 500, that’s 4.38%. For a smaller proportion, the error is smaller in absolute terms, but larger relative to the proportion (e.g., 0.87% for a proportion of 1% out of 500).
Overall, though, that’s good enough for us to say things like, “Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.”—which is enough for us to optimize the languages to be used for language detection for accuracy and run-time performance.
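The margin-of-error figures quoted here are consistent with the usual normal approximation for a proportion, z * sqrt(p * (1 - p) / n) with z = 1.96 for 95%. A minimal sketch, assuming that is the formula behind the numbers (the function name is made up for this example):

```python
import math

def margin_of_error(count, n, z=1.96):
    """Normal-approximation margin of error for the proportion count / n."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

# 50% of 500 -> about 4.38%; 1% of 500 -> about 0.87%
print(f"{margin_of_error(250, 500):.2%}")
print(f"{margin_of_error(5, 500):.2%}")
```

The same function gives about 3.48% for 468 French queries out of 682.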
French Results
About 15% of the original 10K corpus was removed in the initial filtering. A 2,000-query random sample was taken, and about 66% of those queries were discarded, leaving a 682-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.
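The 29% figure combines the two filtering stages: the share of queries surviving initial filtering, times the share of reviewed queries kept in the corpus. A small sketch of that arithmetic, using the approximate percentages reported for each wiki in these notes (the function name is illustrative):

```python
def identifiable_share(removed_frac, corpus_size, reviewed):
    """Share of poor-performing queries that end up in the tagged corpus."""
    return (1 - removed_frac) * corpus_size / reviewed

# (fraction removed in initial filtering, corpus size, queries reviewed)
wikis = {
    "frwiki": (0.15, 682, 2000),
    "eswiki": (0.10, 520, 1000),
    "itwiki": (0.15, 550, 1600),
    "dewiki": (0.06, 520, 1400),
}
for wiki, args in wikis.items():
    print(wiki, f"{identifiable_share(*args):.0%}")
```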
Other languages searched on frwiki
Based on a sample of 682 poor-performing queries on frwiki that are in some language, about 70% are in French, 10-15% are in English, about 7-12% are in Arabic, fewer than 3% are in Portuguese, German, and Spanish, and there are a handful of other languages present.
Below are the results for frwiki, with raw counts, percentage, and 95% margin of error.
count | lg | % | +/- |
468 | fr | 68.62% | 3.48% |
89 | en | 13.05% | 2.53% |
66 | ar | 9.68% | 2.22% |
12 | pt | 1.76% | 0.99% |
11 | de | 1.61% | 0.95% |
10 | es | 1.47% | 0.90% |
5 | ru | 0.73% | 0.64% |
4 | zh | 0.59% | 0.57% |
3 | nl | 0.44% | 0.50% |
2 | pl | 0.29% | 0.41% |
2 | it | 0.29% | 0.41% |
2 | co | 0.29% | 0.41% |
1 | th | 0.15% | 0.29% |
1 | sw | 0.15% | 0.29% |
1 | sv | 0.15% | 0.29% |
1 | la | 0.15% | 0.29% |
1 | is | 0.15% | 0.29% |
1 | hy | 0.15% | 0.29% |
1 | hu | 0.15% | 0.29% |
1 | br | 0.15% | 0.29% |
In order, those are French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Italian, Corsican, Thai, Swahili, Swedish, Latin, Icelandic, Armenian, Hungarian, Breton.
We don’t have query-trained language models for all of the languages represented here, such as Corsican, Swahili, Breton, Icelandic, Latin, or Hungarian. Since these each represent very small slices of our corpus (1-2 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8,517 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Hebrew, and Korean queries.
Analysis and Optimization
Using all of the language models available, the performance report (for the 3000-ngram models* we use in enwiki) is below.
* I also ran tests on other model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. 3000 is still the best model size.
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 83.0% | 83.1% | 83.1% | 83.1% | 83.0% | 681 | 566 | 116 |
French | 95.5% | 91.7% | 88.1% | 85.9% | 98.3% | 468 | 402 | 7 |
English | 80.6% | 75.5% | 70.9% | 68.2% | 84.5% | 88 | 60 | 11 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 62.5% | 66.7% | 71.4% | 75.0% | 60.0% | 12 | 9 | 6 |
German | 44.3% | 50.0% | 57.4% | 63.6% | 41.2% | 11 | 7 | 10 |
Spanish | 15.4% | 20.8% | 32.1% | 50.0% | 13.2% | 10 | 5 | 33 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 21.3% | 28.6% | 43.5% | 66.7% | 18.2% | 3 | 2 | 9 |
Corsican | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | 6.1% | 9.1% | 17.9% | 50.0% | 5.0% | 2 | 1 | 19 |
Polish | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 2 | 2 | 6 |
Armenian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 7.7% | 11.8% | 25.0% | 100.0% | 6.2% | 1 | 1 | 15 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Spanish, Dutch, Italian, Polish, and Swedish do very poorly. They have too few actual instances that they can get correct, which are heavily outweighed by the false positives they do get.
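For readers unfamiliar with the report columns, the figures appear consistent with recall = hits / total and precision = hits / (hits + misses), with "misses" counting false positives for that language. That reading is inferred from the numbers, not stated in the report. Under that assumption, the F-scores follow from the standard F-beta formula, sketched here with a hypothetical helper:

```python
def f_beta(hits, misses, total, beta):
    """F-beta score, reading 'misses' as false positives (an inference)."""
    if hits == 0:
        return 0.0
    precision = hits / (hits + misses)
    recall = hits / total
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# French row of the all-models frwiki report: total 468, hits 402, misses 7
for beta in (0.5, 1, 2):
    print(beta, f"{f_beta(402, 7, 468, beta):.1%}")
```

F0.5 weights precision more heavily than recall, which is why languages with many false positives (such as Spanish on frwiki) score so poorly on it.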
Portuguese and German are not great, either. I reran the analysis without Portuguese and German, and it was better. I added them each back into the mix separately and in both cases the results were worse.
As noted above, Greek, Hebrew, and Korean are present in the larger sample, and from earlier work on the balanced query sets, our models for these languages are very high accuracy.
So, I dropped Portuguese and German, added Greek, Hebrew, and Korean, and re-ran the performance report with the 3000-ngram models (to check the performance and double-check that Greek, Hebrew, and Korean aren’t causing problems). The results are below:
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 89.0% | 89.1% | 89.1% | 89.1% | 89.0% | 681 | 607 | 75 |
French | 94.8% | 95.1% | 95.5% | 95.7% | 94.5% | 468 | 448 | 26 |
English | 67.0% | 74.9% | 84.9% | 93.2% | 62.6% | 88 | 82 | 49 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 12 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 11 | 0 | 0 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 10 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Corsican | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Armenian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Recall went up and precision went down for French and English, but overall performance improved. Queries in unrepresented languages were all identified as either French or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
frwiki: Best Options
The optimal settings for frwiki, based on these experiments, would be to use models for French, English, Arabic, Russian, Chinese, Armenian, Thai, Greek, Hebrew, Korean (fr, en, ar, ru, zh, hy, th, el, he, ko), using the default 3000-ngram models.
Spanish Results
About 10% of the original 10K corpus was removed in the initial filtering. A 1,000-query random sample was taken, and 48% of those queries were discarded, leaving a 520-query corpus. Thus only about 47% of poor-performing queries are in an identifiable language.
Other languages searched on eswiki
Based on the sample of 520 poor-performing queries on eswiki that are in some language, about 90% are in Spanish, 4-8% are in English, and fewer than 2% each are in a handful of other languages.
Below are the results for eswiki, with raw counts, percentage, and 95% margin of error.
count | lg | % | +/- |
476 | es | 91.54% | 2.39% |
32 | en | 6.15% | 2.07% |
3 | la | 0.58% | 0.65% |
2 | ru | 0.38% | 0.53% |
1 | zh | 0.19% | 0.38% |
1 | pt | 0.19% | 0.38% |
1 | it | 0.19% | 0.38% |
1 | gn | 0.19% | 0.38% |
1 | fr | 0.19% | 0.38% |
1 | de | 0.19% | 0.38% |
1 | ca | 0.19% | 0.38% |
In order, those are Spanish, English, Latin, Russian, Chinese, Portuguese, Italian, Guarani*, French, German, Catalan.
* Mbaé’chepa!
We don’t have query-trained language models for all of the languages represented here, such as Latin, Guarani, and Catalan. Since these each represent very small slices of our corpus (1-3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 9,003 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Arabic and Japanese queries, and one each for Cherokee and Aramaic (for which we do not have models).
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:
model size | 3000 | 3500 | 6000 | 7000 | 8000 | 9000 | 10000 |
TOTAL | 83.2% | 84.4% | 84.8% | 85.0% | 85.4% | 85.9% | 86.3% |
Spanish | 91.6% | 92.2% | 92.3% | 92.6% | 92.8% | 93.2% | 93.4% |
English | 73.0% | 75.0% | 75.8% | 73.8% | 75.0% | 76.2% | 76.2% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
French | 0.0% | 0.0% | 33.3% | 20.0% | 22.2% | 22.2% | 22.2% |
German | 22.2% | 22.2% | 25.0% | 22.2% | 20.0% | 20.0% | 20.0% |
Italian | 8.0% | 9.1% | 9.1% | 10.0% | 11.8% | 11.8% | 11.1% |
Portuguese | 4.7% | 4.8% | 4.8% | 5.3% | 5.0% | 5.3% | 5.7% |
Performance details for the 3K models are here (details for larger models are similar in terms of which language models perform the most poorly):
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 83.2% | 83.2% | 83.2% | 83.2% | 83.2% | 519 | 432 | 87 |
Spanish | 96.3% | 91.6% | 87.3% | 84.7% | 99.8% | 476 | 403 | 1 |
English | 73.7% | 73.0% | 72.3% | 71.9% | 74.2% | 32 | 23 | 8 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 7 |
German | 15.2% | 22.2% | 41.7% | 100.0% | 12.5% | 1 | 1 | 7 |
Italian | 5.2% | 8.0% | 17.9% | 100.0% | 4.2% | 1 | 1 | 23 |
Portuguese | 3.0% | 4.7% | 10.9% | 100.0% | 2.4% | 1 | 1 | 41 |
Italian, Portuguese, French, and German all do very poorly, with too many false positives.
As noted above, Arabic and Japanese are present in the larger sample, and as our models for these languages are high accuracy, I’ve included them.
The final language set is Spanish, English, Russian, Chinese, Arabic, and Japanese. As above, 3K is not the optimal model size—my current unsupported hypothesis is that 3K isn’t the best here because there are really only two languages in contention. The 3K results are shown below along with the best performing model sizes:
model size | 1500 | 2000 | 2500 | 3000 | 9000 | 10000 |
TOTAL | 96.5% | 96.1% | 96.0% | 95.8% | 96.0% | 96.1% |
Spanish | 98.7% | 98.5% | 98.4% | 98.2% | 98.3% | 98.4% |
English | 78.9% | 76.3% | 75.3% | 76.5% | 76.9% | 77.9% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
However, the accuracy is very high, and the differences are not huge, so it makes sense to stick with the default 3K models for now, but I'll continue to keep an eye out for significant performance improvements with other model sizes when working with other corpora.
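The comparison behind this decision can be made explicit with a few lines of Python: find the best-scoring model size and see how far the default lags behind it. The scores are the TOTAL row for the final eswiki language set; the helper name is hypothetical.

```python
def best_model_size(scores, default=3000):
    """Return the best-scoring model size and its lead over the default size."""
    best = max(scores, key=scores.get)
    return best, round(scores[best] - scores[default], 1)

# TOTAL scores (%) by model size for the final eswiki language set
eswiki_total = {1500: 96.5, 2000: 96.1, 2500: 96.0, 3000: 95.8, 9000: 96.0, 10000: 96.1}
print(best_model_size(eswiki_total))  # (1500, 0.7)
```

A gap of under one percentage point on a 520-query corpus is small next to the sampling error discussed earlier, which supports staying with the default.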
The detailed report for the 3K model is here*:
[* I inadvertently forgot to include "Guarani" as a known language, so Spanish totals were off by one (519 instead of 520). Since we don't have a Guarani language detector, it is of course incorrect, slightly lowering the overall score, but not really changing the final recommendations. The report below is corrected, those above are not.]
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 95.6% | 95.6% | 95.6% | 95.6% | 95.6% | 520 | 497 | 23 |
Spanish | 98.8% | 98.2% | 97.6% | 97.3% | 99.1% | 476 | 463 | 4 |
English | 66.8% | 75.6% | 87.1% | 96.9% | 62.0% | 32 | 31 | 19 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Recall went up and precision went down for Spanish and English, but overall performance improved. Queries in unrepresented languages were all identified as either Spanish or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
eswiki: Best Options
Non-optimal settings for eswiki (while being consistent with other wikis using 3K models), based on these experiments, would be to use models for Spanish, English, Russian, Chinese, Arabic, Japanese (es, en, ru, zh, ar, ja), using the default 3000-ngram models.
Italian Results
About 15% of the original 10K corpus was removed in the initial filtering. A 1,600-query random sample was taken, and 65% of those queries were discarded, leaving a 550-query corpus. Thus only about 29% of poor-performing queries are in an identifiable language.
Other languages searched on itwiki
Based on the sample of 550 poor-performing queries on itwiki that are in some language, about 75% are in Italian, 20% are in English, and fewer than 1% each are in a handful of other languages.
Below are the results for itwiki, with raw counts, percentage, and 95% margin of error.
count | lg | % | +/- |
404 | it | 73.45% | 3.69% |
109 | en | 19.82% | 3.33% |
8 | es | 1.45% | 1.00% |
6 | de | 1.09% | 0.87% |
4 | la | 0.73% | 0.71% |
4 | fr | 0.73% | 0.71% |
3 | ru | 0.55% | 0.62% |
3 | ro | 0.55% | 0.62% |
3 | pt | 0.55% | 0.62% |
3 | ar | 0.55% | 0.62% |
1 | zh | 0.18% | 0.36% |
1 | pl | 0.18% | 0.36% |
1 | cs | 0.18% | 0.36% |
In order, those are Italian, English, Spanish, German, Latin, French, Russian, Romanian, Portuguese, Arabic, Chinese, Polish, Czech.
We don’t have query-trained language models for all of the languages represented here, such as Latin and Romanian. Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8,533 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek and Japanese queries, and one each for Korean and Bengali, and one for Punjabi (for which we do not have a model).
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:
model size | 3000 | 3500 | 4000 | 4500 | 10000 |
TOTAL | 83.5% | 84.0% | 84.0% | 83.8% | 84.0% |
Italian | 93.1% | 93.4% | 93.4% | 93.8% | 93.8% |
English | 80.9% | 82.1% | 80.4% | 77.8% | 77.4% |
Spanish | 33.3% | 28.6% | 22.9% | 27.8% | 32.4% |
German | 46.2% | 46.2% | 42.9% | 40.0% | 41.4% |
French | 20.7% | 22.2% | 32.0% | 32.0% | 33.3% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Portuguese | 19.0% | 20.0% | 30.0% | 31.6% | 27.3% |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Czech | 25.0% | 25.0% | 0.0% | 0.0% | 0.0% |
Polish | 50.0% | 50.0% | 66.7% | 66.7% | 66.7% |
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 83.5% | 83.5% | 83.5% | 83.5% | 83.5% | 550 | 459 | 91 |
Italian | 96.2% | 93.1% | 90.2% | 88.4% | 98.3% | 404 | 357 | 6 |
English | 89.4% | 80.9% | 73.8% | 69.7% | 96.2% | 109 | 76 | 3 |
Spanish | 25.0% | 33.3% | 50.0% | 75.0% | 21.4% | 8 | 6 | 22 |
German | 34.9% | 46.2% | 68.2% | 100.0% | 30.0% | 6 | 6 | 14 |
French | 14.4% | 20.7% | 36.6% | 75.0% | 12.0% | 4 | 3 | 22 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 13.3% | 19.0% | 33.3% | 66.7% | 11.1% | 3 | 2 | 16 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 17.2% | 25.0% | 45.5% | 100.0% | 14.3% | 1 | 1 | 6 |
Polish | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 1 | 1 | 2 |
Spanish, German, French, and Portuguese all do very poorly, with too many false positives. Czech and Polish aren’t terrible in terms of raw false positives, but aren’t great, either.
As noted above, Greek, Japanese, and Korean are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them. I did not include Bengali because it hasn't been well tested at this point.
The final language set is Italian, English, Russian, Arabic, Chinese, Japanese, Greek, and Korean. As above, 3K is not the optimal model size, but it is within 0.2%. The 3K results are shown below along with the best performing model sizes:
model size | 3000 | 3500 | 4000 | 4500 | 10000 |
TOTAL | 92.2% | 92.4% | 92.2% | 91.8% | 92.2% |
Italian | 96.7% | 96.9% | 96.6% | 96.4% | 96.6% |
English | 87.3% | 87.8% | 87.8% | 86.8% | 87.7% |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
The detailed report for the 3K model is here:
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 92.2% | 92.2% | 92.2% | 92.2% | 92.2% | 550 | 507 | 43 |
Italian | 95.4% | 96.7% | 98.1% | 99.0% | 94.6% | 404 | 400 | 23 |
English | 84.9% | 87.3% | 89.9% | 91.7% | 83.3% | 109 | 100 | 20 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Recall went up and precision went down for Italian and English, but overall performance improved. Queries in unrepresented languages were all identified as either Italian or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
itwiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for itwiki, based on these experiments, would be to use models for Italian, English, Russian, Arabic, Chinese, Japanese, Greek, Korean (it, en, ru, ar, zh, ja, el, ko), using the default 3000-ngram models.
German Results
About 6% of the original 10K corpus was removed in the initial filtering. A 1,400-query random sample was taken, and ~63% of those queries were discarded, leaving a 520-query corpus. Thus only about 35% of poor-performing queries are in an identifiable language.
Other languages searched on dewiki
Based on the sample of 520 poor-performing queries on dewiki that are in some language, about 70% are in German, about 25% are in English, and fewer than 2% each are in a handful of other languages.
Below are the results for dewiki, with raw counts, percentage, and 95% margin of error.
count | lg | % | +/- |
360 | de | 69.23% | 3.97% |
123 | en | 23.65% | 3.65% |
8 | la | 1.54% | 1.06% |
8 | it | 1.54% | 1.06% |
8 | es | 1.54% | 1.06% |
5 | fr | 0.96% | 0.84% |
2 | zh | 0.38% | 0.53% |
2 | pl | 0.38% | 0.53% |
1 | vi | 0.19% | 0.38% |
1 | tr | 0.19% | 0.38% |
1 | sv | 0.19% | 0.38% |
1 | nl | 0.19% | 0.38% |
In order, those are German, English, Latin, Italian, Spanish, French, Chinese, Polish, Vietnamese, Turkish, Swedish, Dutch.
We don’t have query-trained language models for all of the languages represented here, in particular Latin. Since it represents a very small slice of our corpus (8 queries), we aren’t going to worry about it, and accept that it will not be detected correctly.
Looking at the larger corpus of 9,439 remaining queries after the initial filtering, focusing on queries in other writing systems, there are also a small number of Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese queries, and one Odia query (for which we do not have a model).
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and some of the models that did better, are here:
model size | 2500 | 3000 | 3500 | 4000 | 6000 | 7000 | 8000 | 9000 | 10000 |
TOTAL | 74.6% | 74.4% | 74.6% | 75.3% | 75.7% | 76.3% | 76.9% | 77.0% | 77.6% |
German | 88.9% | 88.9% | 89.2% | 89.5% | 90.0% | 90.5% | 90.9% | 90.9% | 91.2% |
English | 73.8% | 74.5% | 74.1% | 75.7% | 74.9% | 74.0% | 74.0% | 73.8% | 74.1% |
Italian | 25.0% | 40.0% | 40.0% | 38.5% | 37.0% | 42.9% | 46.2% | 48.0% | 51.9% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Spanish | 50.0% | 48.3% | 50.0% | 46.2% | 43.5% | 45.5% | 47.6% | 47.6% | 50.0% |
French | 34.8% | 36.4% | 34.8% | 33.3% | 32.0% | 32.0% | 33.3% | 32.0% | 32.0% |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Dutch | 6.9% | 5.9% | 6.2% | 6.2% | 7.1% | 7.4% | 7.1% | 6.9% | 7.1% |
Polish | 25.0% | 25.0% | 22.2% | 25.0% | 28.6% | 28.6% | 22.2% | 22.2% | 25.0% |
Swedish | 5.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5.7% | 5.9% | 5.9% |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 22.2% | 22.2% |
Vietnamese | 40.0% | 50.0% | 50.0% | 50.0% | 66.7% | 100.0% | 100.0% | 100.0% | 100.0% |
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 74.5% | 74.4% | 74.3% | 74.2% | 74.5% | 520 | 386 | 132 |
German | 94.5% | 88.9% | 83.9% | 80.8% | 98.6% | 360 | 291 | 4 |
English | 85.6% | 74.5% | 66.0% | 61.3% | 95.0% | 124 | 76 | 4 |
Italian | 32.9% | 40.0% | 51.0% | 62.5% | 29.4% | 8 | 5 | 12 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 38.0% | 48.3% | 66.0% | 87.5% | 33.3% | 8 | 7 | 14 |
French | 27.4% | 36.4% | 54.1% | 80.0% | 23.5% | 5 | 4 | 13 |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 3.8% | 5.9% | 13.5% | 100.0% | 3.0% | 1 | 1 | 32 |
Polish | 17.2% | 25.0% | 45.5% | 100.0% | 14.3% | 1 | 1 | 6 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 38 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 7 |
Vietnamese | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 1 | 1 | 2 |
Dutch and Swedish do very poorly, with too many false positives. Italian, Spanish, French, Polish, Turkish, and Vietnamese aren’t terrible in terms of raw false positives, but aren’t great, either.
As noted above, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese are present in the larger sample, and as our models for these languages are very high accuracy, I’ve included them.
The final language set is German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, and Japanese. As above, 3K is not the optimal model size, but it is within half a percent. The 3K results are shown below along with the best performing model sizes:
model size | 3000 | 4000 | 4500 | 5000 | 9000 | 10000 |
TOTAL | 88.2% | 88.7% | 88.7% | 88.7% | 88.9% | 88.7% |
German | 94.8% | 95.3% | 95.2% | 95.4% | 95.5% | 95.2% |
English | 81.9% | 82.8% | 83.2% | 82.7% | 83.3% | 83.2% |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
The accuracy is very high, and the differences are very small, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
The detailed report for the 3K model is here:
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 88.2% | 88.2% | 88.1% | 88.1% | 88.2% | 520 | 458 | 61 |
German | 93.9% | 94.8% | 95.8% | 96.4% | 93.3% | 360 | 347 | 25 |
English | 77.9% | 81.9% | 86.3% | 89.5% | 75.5% | 124 | 111 | 36 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Recall went up and precision went down for German and English, but overall performance improved. Queries in unrepresented languages were all identified as either German or English (decreasing precision for both), but those now unused models are no longer generating lots of false positives and bringing down precision overall.
Observations
Interestingly, even though Chinese was enabled, neither of the two queries in Chinese was tagged as such. Generally, Chinese is a relatively high-accuracy identifier. In this case, the queries include a random string of numbers and Latin letters, and one includes ".html". They also include a number of less common Chinese characters. As a result, the less common Chinese characters get the same score from the Chinese and German language detectors (the maximum penalty for an "unknown" character), and the individual letters score well in German, while the known Chinese characters score less well in Chinese. The Chinese model includes not only individual characters, but also bigrams and larger n-grams, so there aren't even 3,000 singleton Chinese characters in the model.
The Chinese model did a better job on the other Chinese examples in the larger un-tagged dewiki sample.
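To make the scoring argument concrete, here is a deliberately simplified sketch of rank-based ("out-of-place") scoring. It uses single characters only, two tiny made-up models, and a maximum penalty assumed equal to the model size, so it illustrates the effect described above without reproducing TextCat's real models or exact scoring.

```python
from collections import Counter

MAX_PENALTY = 3000  # assumption: penalty for an unknown character = model size

# Toy rank tables (rank 0 = most frequent); hypothetical, not the real models.
MODELS = {
    "de": {ch: rank for rank, ch in enumerate("enisrathdulcgmob")},
    "zh": {ch: rank for rank, ch in enumerate("的一是不了人我在有他")},
}

def profile(text):
    """Characters of the text ordered by descending frequency (ties by code point)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

def cost(text, model):
    """Sum of rank differences; characters missing from the model cost MAX_PENALTY."""
    return sum(
        abs(i - model[ch]) if ch in model else MAX_PENALTY
        for i, ch in enumerate(profile(text))
    )

def detect(text, models=MODELS):
    """Pick the language whose model gives the lowest cost."""
    return min(models, key=lambda lang: cost(text, models[lang]))

# Rare Chinese characters plus Latin letters: German wins on the letters.
print(detect("html 的龘靐"))  # de
print(detect("我是人"))        # zh
```

In the first query the rare characters cost the maximum penalty under both models, so the four Latin letters decide the outcome.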
dewiki: Best Options
The barely sub-optimal settings (though consistent with others using 3K models) for dewiki, based on these experiments, would be to use models for German, English, Chinese, Greek, Russian, Arabic, Hindi, Thai, Korean, Japanese (de, en, zh, el, ru, ar, hi, th, ko, ja), using the default 3000-ngram models.
Next Up
- English re-do, Russian, Japanese, Portuguese, and Indonesian (See T138315)
- Others, if we continue. (See T121541)
Other thoughts
- It would be easy to build high-accuracy identifiers for languages that have unique character sets, or at least character sets that are effectively unique in practice. For example, Yiddish can be written with the Hebrew alphabet, but on most wikis, we'd expect most Hebrew characters to actually be Hebrew (and identifying Yiddish as Hebrew is better than identifying it as anything else other than Yiddish). Similarly, Persian and Arabic writing share many characters, but on frwiki, for example, we only see Arabic. Cherokee is rare, but examples have shown up in our samples. Korean, Armenian, Hebrew, Greek, Georgian, Thai, and others could be used on most wikis because they are low risk, if the run-time cost of enabling them is not too high.
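One way such a script-based identifier might look, as a sketch only: map the script named in each character's Unicode name to a single language. The mapping below, the choice to return on the first recognized letter, and the function name are all assumptions for illustration; nothing here is an existing TextCat feature.

```python
import unicodedata

# Scripts treated as "effectively unique" to one language (an assumption).
SCRIPT_TO_LANG = {
    "HANGUL": "ko",
    "HEBREW": "he",
    "GREEK": "el",
    "ARMENIAN": "hy",
    "THAI": "th",
    "GEORGIAN": "ka",
    "CHEROKEE": "chr",
}

def lang_by_script(query):
    """Return a language code for the first letter in a known script, else None."""
    for ch in query:
        if not ch.isalpha():
            continue
        # The first word of a Unicode character name is usually its script.
        script = unicodedata.name(ch, "").split(" ")[0]
        lang = SCRIPT_TO_LANG.get(script)
        if lang:
            return lang
    return None

print(lang_by_script("שלום"))   # he
print(lang_by_script("Paris"))  # None
```

A check like this costs one pass over the query, so it would be cheap compared with scoring against n-gram models.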