User:TJones (WMF)/Notes/TextCat Re-optimization for enwiki
June 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T138315)
Background & Highlights
I’m posting the results for optimizing TextCat for enwiki separately from the others in the same Phab ticket because this is a re-evaluation of English using different criteria to extract a sample. The good news is that while the selection criteria were fairly different and the specifics of the long tail differ, the sample extracted has a fairly similar distribution of languages represented, the optimized set of languages for identification is compatible, and the previous set of languages performs quite well on the current sample. See “Comparison to Earlier Analysis” below for more details.
See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpus was created.
Summary of Results
Using the default 3K models, the best options for enwiki are presented below:
enwiki
- languages: English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian; possibly Bulgarian, Tamil, and Portuguese
- lang codes: en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka; possibly bg, ta, pt
- relevant poor-performing queries: 31%
- f0.5: 83.0%
English Results
About 13% of the original 10K corpus was removed in the initial filtering. A 2000-query random sample was taken from the remainder, and 64% of those queries were discarded, leaving a 721-query corpus. Thus only about 31% of poor-performing queries are in an identifiable language.
Other languages searched on enwiki
Based on the sample of 721 poor-performing queries on enwiki that are in some language, about 70% are in English, 3–5% each are in Chinese, Spanish, Arabic, and German, and fewer than 1–2% each are in a large number of other languages.
Below are the results for enwiki, with raw counts, percentage, and 95% margin of error.
count | lang | % | +/- |
500 | en | 69.35% | 3.37% |
32 | zh | 4.44% | 1.50% |
27 | es | 3.74% | 1.39% |
25 | ar | 3.47% | 1.34% |
23 | de | 3.19% | 1.28% |
11 | fa | 1.53% | 0.89% |
10 | fr | 1.39% | 0.85% |
7 | vi | 0.97% | 0.72% |
7 | ru | 0.97% | 0.72% |
7 | pl | 0.97% | 0.72% |
7 | id | 0.97% | 0.72% |
6 | it | 0.83% | 0.66% |
5 | pt | 0.69% | 0.61% |
5 | ja | 0.69% | 0.61% |
4 | cs | 0.55% | 0.54% |
3 | sv | 0.42% | 0.47% |
3 | no | 0.42% | 0.47% |
3 | ms | 0.42% | 0.47% |
3 | hr | 0.42% | 0.47% |
3 | he | 0.42% | 0.47% |
3 | bn | 0.42% | 0.47% |
2 | tr | 0.28% | 0.38% |
2 | tl | 0.28% | 0.38% |
2 | th | 0.28% | 0.38% |
2 | nl | 0.28% | 0.38% |
2 | la | 0.28% | 0.38% |
2 | is | 0.28% | 0.38% |
2 | az | 0.28% | 0.38% |
2 | af | 0.28% | 0.38% |
1 | ur | 0.14% | 0.27% |
1 | uk | 0.14% | 0.27% |
1 | sw | 0.14% | 0.27% |
1 | sk | 0.14% | 0.27% |
1 | rw | 0.14% | 0.27% |
1 | ko | 0.14% | 0.27% |
1 | km | 0.14% | 0.27% |
1 | hu | 0.14% | 0.27% |
1 | ha | 0.14% | 0.27% |
1 | ga | 0.14% | 0.27% |
1 | am | 0.14% | 0.27% |
In order, those are English, Chinese, Spanish, Arabic, German, Persian, French, Vietnamese, Russian, Polish, Indonesian, Italian, Portuguese, Japanese, Czech, Swedish, Norwegian, Malay, Croatian, Hebrew, Bengali, Turkish, Tagalog, Thai, Dutch, Latin, Icelandic, Azerbaijani, Afrikaans, Urdu, Ukrainian, Swahili, Slovak, Kinyarwanda, Korean, Khmer, Hungarian, Hausa, Irish, and Amharic.
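The “+/-” column above matches the standard normal-approximation 95% margin of error for a proportion. As a sketch (my reconstruction for illustration, not necessarily the exact script used to build the table):

```python
import math

def margin_of_error_95(count, n):
    """Normal-approximation 95% margin of error for a proportion count/n."""
    p = count / n
    return 1.96 * math.sqrt(p * (1 - p) / n)

# English: 500 of 721 queries -> about 0.0337 (3.37%, as in the table)
# Chinese:  32 of 721 queries -> about 0.0150 (1.50%, as in the table)
```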
We don’t have query-trained language models for many of the languages in the long tail. Since these each represent very small slices of our corpus (<= 3 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8,727 queries remaining after the initial filtering, and focusing on queries in other writing systems, there are also small numbers of Greek, Telugu, Georgian, and Hindi queries, as well as Malayalam, Amharic, and Khmer queries (for which we do not have models).
Analysis and Optimization
Using all of the language models available, I ran tests on various model sizes, in increments of 500 up to 5,000 and increments of 1,000 up to 10,000. Results for the 3K models, and for the model sizes that did as well or better, are below:
model size | 3000 | 3500 |
TOTAL | 74.2% | 74.4% |
English | 84.2% | 84.3% |
Chinese | 93.3% | 93.3% |
Spanish | 61.3% | 61.3% |
Arabic | 100.0% | 100.0% |
German | 59.0% | 60.0% |
Persian | 95.7% | 95.7% |
French | 34.0% | 34.0% |
Indonesian | 48.0% | 46.2% |
Polish | 63.2% | 70.0% |
Russian | 92.3% | 92.3% |
Vietnamese | 100.0% | 100.0% |
Italian | 20.5% | 20.5% |
Japanese | 90.9% | 90.9% |
Portuguese | 53.3% | 53.3% |
Czech | 36.4% | 36.4% |
Bengali | 100.0% | 100.0% |
Croatian | 0.0% | 0.0% |
Hebrew | 100.0% | 100.0% |
Malay | 0.0% | 0.0% |
Norwegian | 0.0% | 0.0% |
Swedish | 16.7% | 16.7% |
Afrikaans | 0.0% | 0.0% |
Azerbaijani | 0.0% | 0.0% |
Dutch | 16.7% | 18.2% |
Icelandic | 0.0% | 0.0% |
Latin | 0.0% | 0.0% |
Tagalog | 0.0% | 0.0% |
Thai | 100.0% | 100.0% |
Turkish | 33.3% | 33.3% |
Amharic | 0.0% | 0.0% |
Hausa | 0.0% | 0.0% |
Hungarian | 0.0% | 0.0% |
Irish | 0.0% | 0.0% |
Khmer | 0.0% | 0.0% |
Kinyarwanda | 0.0% | 0.0% |
Korean | 100.0% | 100.0% |
Slovak | 0.0% | 0.0% |
Swahili | 0.0% | 0.0% |
Ukrainian | 66.7% | 50.0% |
Urdu | 0.0% | 0.0% |
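For context, the “model size” being varied is the number of top-ranked character n-grams kept per language, in the style of the Cavnar & Trenkle (1994) “out-of-place” ranking that TextCat is based on. The following is a simplified illustrative sketch of that ranking, not the production implementation; the function names are mine:

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=3000):
    """Build a language profile: character n-grams (1..max_n) ranked by
    frequency, keeping the top_k most frequent ('model size' above)."""
    text = "_" + text.replace(" ", "_") + "_"  # mark word boundaries
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i+n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(query_profile, model_profile, top_k=3000):
    """Sum of rank differences between query and model profiles; n-grams
    absent from the model get the maximum penalty. Lower = better match."""
    return sum(abs(rank - model_profile.get(gram, top_k))
               for gram, rank in query_profile.items())
```

The detected language is the one whose model yields the smallest out-of-place distance for the query’s profile.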
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
lang | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 74.5% | 74.2% | 73.9% | 73.6% | 74.7% | 721 | 531 | 180 |
English | 92.7% | 84.2% | 77.1% | 73.0% | 99.5% | 500 | 365 | 2 |
Chinese | 97.2% | 93.3% | 89.7% | 87.5% | 100.0% | 32 | 28 | 0 |
Spanish | 56.9% | 61.3% | 66.4% | 70.4% | 54.3% | 27 | 19 | 16 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 25 | 25 | 0 |
German | 51.4% | 59.0% | 69.2% | 78.3% | 47.4% | 23 | 18 | 20 |
Persian | 93.2% | 95.7% | 98.2% | 100.0% | 91.7% | 11 | 11 | 1 |
French | 24.7% | 34.0% | 54.2% | 90.0% | 20.9% | 10 | 9 | 34 |
Indonesian | 38.0% | 48.0% | 65.2% | 85.7% | 33.3% | 7 | 6 | 12 |
Polish | 54.5% | 63.2% | 75.0% | 85.7% | 50.0% | 7 | 6 | 6 |
Russian | 96.8% | 92.3% | 88.2% | 85.7% | 100.0% | 7 | 6 | 0 |
Vietnamese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 7 | 7 | 0 |
Italian | 14.5% | 20.5% | 35.1% | 66.7% | 12.1% | 6 | 4 | 29 |
Japanese | 86.2% | 90.9% | 96.2% | 100.0% | 83.3% | 5 | 5 | 1 |
Portuguese | 44.4% | 53.3% | 66.7% | 80.0% | 40.0% | 5 | 4 | 6 |
Czech | 31.2% | 36.4% | 43.5% | 50.0% | 28.6% | 4 | 2 | 5 |
Bengali | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Hebrew | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Malay | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Norwegian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Swedish | 11.5% | 16.7% | 30.3% | 66.7% | 9.5% | 3 | 2 | 19 |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Azerbaijani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 11.1% | 16.7% | 33.3% | 100.0% | 9.1% | 2 | 2 | 20 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Tagalog | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Turkish | 23.8% | 33.3% | 55.6% | 100.0% | 20.0% | 2 | 2 | 8 |
Amharic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hausa | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Irish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Khmer | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Kinyarwanda | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Slovak | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Ukrainian | 55.6% | 66.7% | 83.3% | 100.0% | 50.0% | 1 | 1 | 1 |
Urdu | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
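For reference, the F-scores in the tables combine precision and recall as Fβ = (1 + β²)·P·R / (β²·P + R); F0.5 weights precision more heavily than recall, which is why it is the headline metric here (false positives are costly). A minimal sketch:

```python
def f_beta(precision, recall, beta):
    """F_beta score: harmonic-style mean of precision and recall,
    with recall weighted beta times as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# English in the optimized 3K run: prec 92.3%, recall 93.0%
# f_beta(0.923, 0.930, 0.5) -> ~0.924 (92.4%, as in the table)
```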
French, German, Italian, Swedish, and Dutch all do very poorly, with too many false positives. Turkish isn’t terrible in terms of raw false positives, but isn’t great, either. Once French and Italian are eliminated, Portuguese does very poorly, too.
As noted above, Greek, Telugu, and Georgian are present in the larger sample, and since our models for these languages are highly accurate, I’ve included them.
The final language set is English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian. With this language set, 3K is the optimal model size.
The detailed report for the 3K model is here:
lang | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 83.0% | 82.8% | 82.6% | 82.5% | 83.1% | 721 | 595 | 121 |
English | 92.4% | 92.6% | 92.9% | 93.0% | 92.3% | 500 | 465 | 39 |
Chinese | 97.2% | 93.3% | 89.7% | 87.5% | 100.0% | 32 | 28 | 0 |
Spanish | 47.5% | 58.1% | 74.9% | 92.6% | 42.4% | 27 | 25 | 34 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 25 | 25 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 23 | 0 | 0 |
Persian | 93.2% | 95.7% | 98.2% | 100.0% | 91.7% | 11 | 11 | 1 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 10 | 0 | 0 |
Indonesian | 21.6% | 30.0% | 49.2% | 85.7% | 18.2% | 7 | 6 | 27 |
Polish | 35.4% | 46.7% | 68.6% | 100.0% | 30.4% | 7 | 7 | 16 |
Russian | 96.8% | 92.3% | 88.2% | 85.7% | 100.0% | 7 | 6 | 0 |
Vietnamese | 81.4% | 87.5% | 94.6% | 100.0% | 77.8% | 7 | 7 | 2 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
Japanese | 86.2% | 90.9% | 96.2% | 100.0% | 83.3% | 5 | 5 | 1 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Bengali | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Hebrew | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Malay | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Norwegian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Azerbaijani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Tagalog | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Amharic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hausa | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Irish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Khmer | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Kinyarwanda | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Slovak | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Ukrainian | 55.6% | 66.7% | 83.3% | 100.0% | 50.0% | 1 | 1 | 1 |
Urdu | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Recall went up and precision went down for English, Spanish, Indonesian, Polish, Vietnamese, and others, but overall performance improved. Queries in unrepresented languages were most often identified as English, Spanish, or Indonesian (decreasing precision for all three), but the now-unused models are no longer generating lots of false positives and dragging down precision overall.
Comparison to Earlier Analysis
Previously, we had been using a very different data source for optimizing the TextCat languages for enwiki. In my original analysis for enwiki I used a 1K query set gathered for a general review of enwiki usage. It was sampled from a single day, included API requests (which made up about 2/3 of the queries), and had none of the simple anti-bot precautions we use now (e.g., taking queries only from the search box, excluding users with more than 30 queries/day, taking only one query from any IP per day, etc.). It was also limited to queries that got zero results, rather than the current criterion of fewer than three results (i.e., “poorly performing”). It also had significantly fewer “junk” queries, which I hypothesize is due to the inclusion of API queries, but that’s just a guess.
The proportions of queries in different languages for the previous and current samples are below. Given the differences in the sources, significant differences would not be surprising, but only English, Arabic, and German have non-overlapping 95% confidence intervals (using the Wilson score interval, which “has good properties even for a small number of trials and/or an extreme probability” (i.e., it won’t give negative numbers), instead of the simple margin-of-error calculations I have been using, as in the table above). The Arabic 95% intervals miss overlapping by less than 0.01%, and all languages overlap in their 99% confidence intervals.
previous | current | lang |
77.32% | 69.35% | English |
2.58% | 4.44% | Chinese |
5.54% | 3.74% | Spanish |
1.29% | 3.47% | Arabic |
1.03% | 3.19% | German |
0.52% | 1.53% | Persian |
1.29% | 1.39% | French |
0.52% | 0.97% | Indonesian |
0.13% | 0.97% | Polish |
0.64% | 0.97% | Russian |
0.97% | Vietnamese | |
0.26% | 0.83% | Italian |
0.13% | 0.69% | Japanese |
2.45% | 0.69% | Portuguese |
0.55% | Czech | |
0.26% | 0.42% | Bengali |
0.13% | 0.42% | Croatian |
0.42% | Hebrew | |
0.77% | 0.42% | Malay |
0.26% | 0.42% | Norwegian |
0.13% | 0.42% | Swedish |
0.28% | Afrikaans | |
0.28% | Azerbaijani | |
0.13% | 0.28% | Dutch |
0.28% | Icelandic | |
0.13% | 0.28% | Latin |
1.16% | 0.28% | Tagalog |
0.13% | 0.28% | Thai |
0.64% | 0.28% | Turkish |
0.14% | Amharic | |
0.14% | Hausa | |
0.14% | Hungarian | |
0.14% | Irish | |
0.14% | Khmer | |
0.14% | Kinyarwanda | |
0.39% | 0.14% | Korean |
0.14% | Slovak | |
0.52% | 0.14% | Swahili |
0.14% | Ukrainian | |
0.14% | Urdu | |
0.26% | Bulgarian | |
0.13% | Estonian | |
0.13% | Finnish | |
0.13% | Greek | |
0.26% | Hindi | |
0.13% | Hmong | |
0.13% | Kannada | |
0.13% | Serbian | |
0.13% | Somali | |
0.13% | Tamil | |
0.13% | Uzbek | |
776 | 721 | sample size |
The long tails are noisy and differ, but given the limited sample sizes, that’s to be expected.
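The Wilson score interval used for the comparison above can be sketched as follows (an illustrative reconstruction, not the exact script used); unlike the normal-approximation margin of error, its lower bound never goes below zero, even for single-query languages:

```python
import math

def wilson_interval(count, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion count/n.
    z=1.96 gives a 95% interval; z=2.576 gives 99%."""
    p = count / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)
```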
Based on the current sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Chinese, English, Georgian, Greek, Hebrew, Hindi, Indonesian, Japanese, Korean, Persian, Polish, Russian, Spanish, Telugu, Thai, Ukrainian, and Vietnamese, with F0.5 of 83.0%.
Based on the previous sample, the best set of languages for enwiki is (alphabetically) Arabic, Bengali, Bulgarian, Chinese, English, Greek, Hindi, Japanese, Korean, Persian, Portuguese, Russian, Spanish, Tamil, and Thai, with a slightly higher F0.5 of 83.1%.
The difference is the addition of Georgian, Hebrew, Indonesian, Polish, Telugu, Ukrainian, and Vietnamese, and the removal of Bulgarian, Portuguese, and Tamil. Why these changes?
The previous sample had no Hebrew, Ukrainian, or Vietnamese, and the newer sample had no Bulgarian or Tamil. Georgian and Telugu were added because they are present in the much larger 100K unreviewed sample, and cause no false recall problems when added.
That leaves Portuguese (removed), and Indonesian and Polish (added). Interestingly, there’s a pattern in the percentage of queries in the sample and the direction of change: the percentage of Portuguese queries decreased, while the percentages of Indonesian and Polish queries increased. My hypothesis is that having more queries (especially more than just one) to potentially get correct can offset the generally more stable number of false positives among better-represented languages.
In each case, the effect is quite small. Removing Indonesian doesn’t change the overall score for the evaluation set (the errors just shift around, and I prefer using more languages to fewer); adding Portuguese back in decreases F0.5 by 0.4%. Removing Polish has a small effect, too, decreasing F0.5 by 0.2%.
These minor differences probably represent some overfitting to these particular samples.
Running the current sample with the optimized list from the previous sample gives an F0.5 score of 81.4%, further indicating that we’re probably overfitting a bit, and that it doesn’t matter too much.
enwiki: Best Options
The optimal settings for enwiki, based on these experiments, would be to use models for English, Chinese, Spanish, Arabic, Persian, Vietnamese, Russian, Polish, Indonesian, Japanese, Bengali, Hebrew, Korean, Thai, Ukrainian, Hindi, Greek, Telugu, and Georgian (en, zh, es, ar, fa, vi, ru, pl, id, ja, bn, he, ko, th, uk, hi, el, te, ka), using the default 3000-ngram models.
Based on information from earlier experiments, including Bulgarian, Tamil, and even Portuguese (bg, ta, pt) would not be amiss.
So far, English Wikipedia has the most diverse collection of languages represented in its queries. If the cost of running so many models (19 or 22 models!) is too high, it would be least damaging to drop Ukrainian, Hindi, Greek, Telugu, Georgian, Bulgarian, Tamil, Portuguese, Korean, and Thai.