User:TJones (WMF)/Notes/TextCat Optimization for plwiki arwiki zhwiki and nlwiki
September 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T142140)
TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki
Summary of Results
Using the default 3K models, the best options for each wiki are presented below:
nlwiki
- languages: Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, Russian
- lang codes: nl, en, zh, ar, ko, el, he, ja, ru
- relevant poor-performing queries: 36%
- f0.5: 82.3%
Background
See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpora were created.
Dutch Results
About 16.8% of the original 10K corpus was removed in the initial filtering. A 1,200-query random sample was taken from the remainder, and 57.1% of those queries were discarded, leaving a 515-query corpus. Thus only about 35.7% of poor-performing queries (83.2% retained by the initial filtering × 42.9% retained after review) are in an identifiable language.
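To make the arithmetic explicit, here is a minimal sketch (Python; the counts are the ones reported above) of how the 35.7% figure falls out of the two retention rates:

<syntaxhighlight lang="python">
# Share of poor-performing nlwiki queries that are in an identifiable language,
# reconstructed from the figures reported above.

sample_size = 1200                    # random sample taken for manual review
kept_after_filtering = 1 - 0.168      # ~83.2% of the 10K corpus survived initial filtering
kept_after_review = 1 - 0.571         # ~42.9% of the sample was in an identifiable language

print(round(sample_size * kept_after_review))             # 515 queries left in the sample
print(f"{kept_after_filtering * kept_after_review:.1%}")  # 35.7% of poor-performing queries
</syntaxhighlight>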
Other languages searched on nlwiki
Based on the sample of 515 poor-performing queries on nlwiki that are in some language, about 63% are in Dutch, 25% in English, 2-3% each in French and German, and less than 2% each in a handful of other languages.
Below are the results for nlwiki, with raw counts, percentage, and 95% margin of error.
count | lang | % | +/- |
326 | nl | 63.30% | 4.16% |
128 | en | 24.85% | 3.73% |
16 | fr | 3.11% | 1.50% |
11 | de | 2.14% | 1.25% |
6 | es | 1.17% | 0.93% |
5 | it | 0.97% | 0.85% |
4 | la | 0.78% | 0.76% |
3 | zh | 0.58% | 0.66% |
2 | tr | 0.39% | 0.54% |
2 | pl | 0.39% | 0.54% |
2 | fi | 0.39% | 0.54% |
2 | ar | 0.39% | 0.54% |
1 | vi | 0.19% | 0.38% |
1 | pt | 0.19% | 0.38% |
1 | my | 0.19% | 0.38% |
1 | ko | 0.19% | 0.38% |
1 | hr | 0.19% | 0.38% |
1 | da | 0.19% | 0.38% |
1 | cs | 0.19% | 0.38% |
1 | af | 0.19% | 0.38% |
In order, those are Dutch, English, French, German, Spanish, Italian, Latin, Chinese, Turkish, Polish, Finnish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Afrikaans.
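The ± column above is consistent with the usual normal-approximation margin of error for a proportion; a small sketch (assuming that formula) that reproduces the reported values:

<syntaxhighlight lang="python">
import math

def margin_of_error(count: int, n: int, z: float = 1.96) -> float:
    """95% margin of error for a proportion (normal approximation)."""
    p = count / n
    return z * math.sqrt(p * (1 - p) / n)

n = 515  # identifiable-language queries in the nlwiki sample
for count, lang in [(326, "nl"), (128, "en"), (16, "fr")]:
    print(f"{lang}: {count / n:.2%} +/- {margin_of_error(count, n):.2%}")
# nl: 63.30% +/- 4.16%
# en: 24.85% +/- 3.73%
# fr: 3.11% +/- 1.50%
</syntaxhighlight>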
We don’t have query-trained language models for all of the languages represented here, such as Afrikaans, Danish, Finnish, Croatian, Latin, and Burmese (af, da, fi, hr, la, my). Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.
Looking at the larger corpus of 8,323 queries remaining after the initial filtering, and focusing on queries in other writing systems, there are also small numbers of Russian, Hebrew, Greek, and Japanese queries, plus a few in Amharic (for which we do not have a model).
Analysis and Optimization
Using all of the available language models, I ran tests at various model sizes: in increments of 500 up to 5,000, and in increments of 1,000 from there up to 10,000. Results for the 3K models and some of the model sizes that did better are below (the values shown are f1 scores):
lang | 3000 | 3500 | 4000 | 4500 | 5000 | 6000 | 9000 |
TOTAL | 74.5% | 74.5% | 75.5% | 76.1% | 77.0% | 78.3% | 78.7% |
Dutch | 88.2% | 88.6% | 89.2% | 88.8% | 89.3% | 89.9% | 89.9% |
English | 71.3% | 69.8% | 71.0% | 73.6% | 75.1% | 75.9% | 78.2% |
French | 57.1% | 58.3% | 59.6% | 66.7% | 64.0% | 71.1% | 71.1% |
German | 32.1% | 32.7% | 34.0% | 32.7% | 34.6% | 39.2% | 37.7% |
Spanish | 46.2% | 44.4% | 44.4% | 46.2% | 46.2% | 46.2% | 50.0% |
Italian | 20.7% | 21.4% | 14.8% | 14.8% | 15.4% | 13.8% | 13.3% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Chinese | 80.0% | 80.0% | 80.0% | 80.0% | 80.0% | 100.0% | 100.0% |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Finnish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Polish | 50.0% | 50.0% | 66.7% | 66.7% | 66.7% | 66.7% | 66.7% |
Turkish | 57.1% | 50.0% | 50.0% | 40.0% | 44.4% | 44.4% | 44.4% |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Burmese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Czech | 22.2% | 25.0% | 28.6% | 40.0% | 50.0% | 50.0% | 50.0% |
Danish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Portuguese | 25.0% | 20.0% | 22.2% | 25.0% | 22.2% | 28.6% | 28.6% |
Vietnamese | 40.0% | 40.0% | 40.0% | 40.0% | 50.0% | 66.7% | 66.7% |
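As a reminder of what "model size" means in these tables: TextCat-style identification (in the Cavnar & Trenkle sense) builds a ranked list of each language's most frequent character n-grams and scores a query by how far "out of place" its own n-grams are; the model size is how many top-ranked n-grams are kept per language. The sketch below is a toy illustration of that scoring, not the production TextCat code:

<syntaxhighlight lang="python">
from collections import Counter

def ngram_profile(text: str, max_size: int) -> list:
    """Most frequent character 1- to 5-grams of the text, Cavnar & Trenkle style."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(1, 6):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(max_size)]

def out_of_place(query: str, model: list, max_size: int) -> int:
    """Sum of rank differences between the query profile and a language model."""
    ranks = {gram: r for r, gram in enumerate(model)}
    score = 0
    for r, gram in enumerate(ngram_profile(query, max_size)):
        if gram in ranks:
            score += abs(ranks[gram] - r)
        else:
            score += max_size  # fixed maximum penalty for n-grams not in the model
    return score

def detect(query: str, models: dict, max_size: int = 3000) -> str:
    """Pick the language whose model is closest to the query profile."""
    return min(models, key=lambda lang: out_of_place(query, models[lang], max_size))
</syntaxhighlight>

In this toy version a model is just a ranked n-gram list, so moving from 3K to 9K models only changes how long those lists are (and, with them, the maximum penalty for unseen n-grams).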
Performance details for the 3K model are here (details for larger models are similar in terms of which language models perform the most poorly):
lang | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 74.6% | 74.5% | 74.4% | 74.4% | 74.7% | 515 | 383 | 130 |
Dutch | 93.7% | 88.2% | 83.3% | 80.4% | 97.8% | 326 | 262 | 6 |
English | 80.2% | 71.3% | 64.2% | 60.2% | 87.5% | 128 | 77 | 11 |
French | 47.3% | 57.1% | 72.2% | 87.5% | 42.4% | 16 | 14 | 19 |
German | 23.6% | 32.1% | 50.6% | 81.8% | 20.0% | 11 | 9 | 36 |
Spanish | 34.9% | 46.2% | 68.2% | 100.0% | 30.0% | 6 | 6 | 14 |
Italian | 14.9% | 20.7% | 34.1% | 60.0% | 12.5% | 5 | 3 | 21 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Chinese | 90.9% | 80.0% | 71.4% | 66.7% | 100.0% | 3 | 2 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Finnish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 2 | 2 | 4 |
Turkish | 45.5% | 57.1% | 76.9% | 100.0% | 40.0% | 2 | 2 | 3 |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Burmese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Czech | 15.2% | 22.2% | 41.7% | 100.0% | 12.5% | 1 | 1 | 7 |
Danish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Portuguese | 17.2% | 25.0% | 45.5% | 100.0% | 14.3% | 1 | 1 | 6 |
Vietnamese | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 1 | 1 | 3 |
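For reference, the precision, recall, and F-scores in these reports can be reproduced from the total / hits / misses columns, where "misses" evidently counts false positives for the language (queries from other languages identified as it), since that reading reproduces the reported precision. A sketch, using the Dutch row of the 3K report:

<syntaxhighlight lang="python">
def f_score(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Dutch row of the 3K report: total = 326, hits = 262, misses (false positives) = 6
total, hits, false_positives = 326, 262, 6
recall = hits / total                        # 80.4%
precision = hits / (hits + false_positives)  # 97.8%

for beta in (0.5, 1, 2):
    print(f"f{beta}: {f_score(precision, recall, beta):.1%}")
# f0.5: 93.7%
# f1: 88.2%
# f2: 83.3%
</syntaxhighlight>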
French, German, Spanish, and Italian all do very poorly, with too many false positives. (When Spanish and Italian are disabled, French does even worse.) Polish, Turkish, Czech, Portuguese, and Vietnamese aren't terrible in terms of raw false positives, but they aren't great, either.
As noted above, Greek, Hebrew, Japanese, and Russian are present in the larger sample, and since our models for these languages are very accurate, I've included them.
The final language set is Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian. As above, 3K is not the optimal model size, but it is within 1.5% of the best. The 3K results (again f1 scores) are shown below along with the best-performing model sizes:
lang | 3000 | 3500 | 6000 | 7000 | 9000 |
TOTAL | 82.3% | 82.5% | 82.9% | 83.3% | 83.7% |
Dutch | 92.1% | 92.4% | 92.8% | 92.9% | 92.7% |
English | 76.0% | 76.2% | 76.5% | 77.6% | 79.1% |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Arabic | 80.0% | 80.0% | 80.0% | 80.0% | 80.0% |
Finnish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Burmese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Danish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% |
The accuracy is very high, and the differences are <2%, so it makes sense to stick with the default 3K models for now, but keep an eye out for significant performance improvements with other model sizes.
The detailed report for the 3K model is here:
lang | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 82.3% | 82.3% | 82.3% | 82.3% | 82.3% | 515 | 424 | 91 |
Dutch | 92.4% | 92.1% | 91.9% | 91.7% | 92.6% | 326 | 299 | 24 |
English | 68.5% | 76.0% | 85.4% | 93.0% | 64.3% | 128 | 119 | 66 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 16 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 11 | 0 | 0 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Arabic | 71.4% | 80.0% | 90.9% | 100.0% | 66.7% | 2 | 2 | 1 |
Finnish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Afrikaans | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Burmese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Danish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Korean | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Recall went up and precision went down for both Dutch and English, but overall performance improved. Queries in unrepresented languages were almost all identified as either Dutch or English (decreasing precision for both), but the now-unused models are no longer generating lots of false positives and dragging down overall precision. (The one query in Burmese was identified as Arabic, probably because it scored the same in all languages, with the maximum "unknown" score, and Arabic is alphabetically first among the contenders.)
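To illustrate the tie-breaking behaviour described in the parenthetical above (the scores here are made up, and the alphabetical tie-break is inferred from the Burmese example, not taken from the TextCat source):

<syntaxhighlight lang="python">
# A query in a language none of the active models cover (e.g. the one Burmese
# query) gets the same maximum "unknown" score from every model.
scores = {"nl": 9000, "en": 9000, "zh": 9000, "ar": 9000, "ko": 9000,
          "el": 9000, "he": 9000, "ja": 9000, "ru": 9000}

# With all scores tied, taking the minimum over alphabetically sorted language
# codes returns "ar" (Arabic), matching the behaviour described above.
print(min(sorted(scores), key=scores.get))  # ar
</syntaxhighlight>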
nlwiki: Best Options
Based on these experiments, the barely sub-optimal settings for nlwiki (though consistent with the other wikis in using 3K models) would be to use models for Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian (nl, en, zh, ar, ko, el, he, ja, ru), with the default 3,000-ngram models.
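As a usage sketch only (tying the recommendation back to the toy detector above; the model file layout and loader here are hypothetical, not the production CirrusSearch/TextCat configuration):

<syntaxhighlight lang="python">
from pathlib import Path

# Recommended nlwiki language set, with the default 3K models.
NLWIKI_LANGS = ["nl", "en", "zh", "ar", "ko", "el", "he", "ja", "ru"]
MODEL_SIZE = 3000

def load_model(lang: str, size: int, model_dir: str = "models") -> list:
    """Hypothetical loader: one n-gram per line, already ranked by frequency."""
    lines = Path(model_dir, f"{lang}.lm").read_text(encoding="utf-8").splitlines()
    return lines[:size]

models = {lang: load_model(lang, MODEL_SIZE) for lang in NLWIKI_LANGS}
# `detect` is the toy out-of-place scorer sketched in the Analysis section above.
print(detect("fiets kopen in amsterdam", models, max_size=MODEL_SIZE))
</syntaxhighlight>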