User:TJones (WMF)/Notes/Balanced Language Identification Evaluation Set for Queries
February 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T121539)
Balanced Language Identification Evaluation Set for Queries
Building the Corpus
The goal of this task was to create a balanced language identification evaluation set for queries for the top 21 wikis by query volume. It would have been the top 20, but I accidentally grabbed the top 20 after English, so we get 21. The purpose of a hand-selected balanced query set is to be able to test the accuracy of language identification where all languages are competing equally (by volume) and all queries are decent exemplars of the language in question.
The 21 languages are: Arabic, Chinese, Czech, Dutch, English, French, German, Hebrew, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, Ukrainian, and Vietnamese.
I extracted a few days’ worth of full-text queries from all wikis (19,273,806 queries total). For each of the 21 languages, I randomly selected several hundred queries and whittled them down to 200, removing queries that were composed primarily of names of people, places, or products; were in the wrong language; were badly misspelled; consisted of numbers or acronyms; or appeared bot-like (i.e., a very large number of very similar queries), etc. Names made up of normal words were kept: “The Revenant” and “Bridge of Spies” are names of movies, but they are made up of non-name words. Longer queries were allowed a small bit of text from any of the unacceptable categories.
I did not filter out queries that would obviously be hard for language identification, such as very short, unaccented queries in the Latin script, like Portuguese os (“the”), Swedish ur (“from”), English the, and French rue (“street”). The longest queries are hundreds of characters.
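The random-selection step can be sketched roughly as follows. This is a minimal illustration rather than the script actually used: the function name, the `queries_by_wiki` mapping, the default of 500 candidates, and the fixed seed are all assumptions, and the whittling down to 200 queries per language was done by hand.

```python
import random


def sample_candidates(queries_by_wiki, per_language=500, seed=0):
    """Draw up to `per_language` random queries per wiki for manual review.

    `queries_by_wiki` is a hypothetical mapping from a wiki's language name
    to the list of full-text queries extracted for that wiki.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    candidates = {}
    for language, queries in queries_by_wiki.items():
        k = min(per_language, len(queries))  # small wikis may have fewer queries
        candidates[language] = rng.sample(queries, k)
    return candidates
```

Sampling more candidates than the 200 needed leaves room to discard names, misspellings, and bot-like queries during review.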
TextCat Evaluation
I tested TextCat against the balanced corpus of 200 queries in each of 21 languages (4,200 queries total) in two ways:
- against the known list of 21 languages
- against the full list of 59 languages for which language models have been built on query data.
Note that some of the full set of 59 models are known to be pretty poor (Igbo has way too much English in the training data, for example) and part of the purpose of this set is to let us better evaluate these models.
In each case, I tested language models in increments of 500 ngrams up to 10,000 ngrams. Previous work on a sample derived from enwiki queries showed an optimal model size of 3,000 ngrams (on messy data that was also heavily unbalanced, i.e., mostly English). In this case, surprisingly, the best results came from the maximum 10,000-ngram models! However, the improvement probably isn’t enough to warrant the extra cost in speed and memory of using the 10K model: compared with the 3,000-ngram models, it is no more than 4 percentage points of F0.5 score (86.5% vs. 88.3% against the 21 known languages, and 74.7% vs. 78.5% against all 59).
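To make the model-size sweep concrete, here is a toy sketch of a rank-order n-gram classifier in the style of Cavnar and Trenkle, the general approach TextCat is based on. It is not the PHP TextCat code: the function names, the word-boundary padding, the n-gram lengths of 1 to 5, and the tiny training strings are all assumptions made for illustration.

```python
from collections import Counter


def ngram_ranking(text, max_n=5):
    """List the character n-grams of `text`, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break frequency ties alphabetically so the ranking is deterministic.
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def out_of_place(query_ranking, model_ranking, size):
    """Sum of rank differences against the top `size` n-grams of a model."""
    ranks = {gram: rank for rank, gram in enumerate(model_ranking[:size])}
    # N-grams missing from the truncated model get the maximum penalty.
    return sum(abs(rank - ranks[gram]) if gram in ranks else size
               for rank, gram in enumerate(query_ranking))


def identify(query, models, size):
    """Return the language whose truncated model is closest to the query."""
    ranking = ngram_ranking(query)
    return min(models, key=lambda lang: out_of_place(ranking, models[lang], size))


def accuracy_by_size(labeled_queries, models, sizes):
    """Fraction of queries identified correctly at each model size."""
    results = {}
    for size in sizes:
        correct = sum(identify(q, models, size) == lang
                      for q, lang in labeled_queries)
        results[size] = correct / len(labeled_queries)
    return results


# Toy models built from a few words each; real models are trained on query logs.
models = {
    "English": ngram_ranking("hello there hello world"),
    "Russian": ngram_ranking("привет мир привет всем"),
}
labeled = [("hello world", "English"), ("привет мир", "Russian")]
scores = accuracy_by_size(labeled, models, range(500, 10001, 500))
```

Truncating each model to its top `size` n-grams is what trades accuracy against speed and memory: a larger model recognizes more of a query's rarer n-grams, but takes longer to load and compare.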
Looking at Model Sizes
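F0.5 scores against the known 21 languages, by model size in ngrams. (For the TOTAL row, precision equals recall, so all F-scores coincide; the per-language values match the f1 column of the detailed 3,000-ngram report below.)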
| ngrams | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000 |
|---|---|---|---|---|---|---|---|---|---|---|
| TOTAL | 84.0% | 85.6% | 86.5% | 87.1% | 87.2% | 87.5% | 87.8% | 87.9% | 88.2% | 88.3% |
| Arabic | 92.6% | 92.1% | 92.5% | 92.7% | 93.0% | 93.9% | 93.9% | 93.9% | 93.7% | 93.7% |
| Chinese | 81.5% | 85.5% | 86.9% | 87.7% | 87.1% | 87.9% | 89.0% | 89.0% | 89.0% | 89.4% |
| Czech | 89.9% | 91.1% | 92.9% | 91.9% | 91.8% | 92.6% | 93.2% | 93.0% | 93.8% | 93.8% |
| Dutch | 72.8% | 75.6% | 78.0% | 78.3% | 78.2% | 79.4% | 79.6% | 80.2% | 80.9% | 81.1% |
| English | 77.6% | 83.7% | 86.6% | 87.3% | 86.2% | 84.9% | 85.4% | 85.3% | 86.2% | 86.8% |
| French | 85.7% | 88.2% | 89.3% | 88.9% | 89.0% | 90.1% | 88.7% | 88.9% | 89.6% | 88.8% |
| German | 75.7% | 77.6% | 80.5% | 79.3% | 79.5% | 80.2% | 80.8% | 81.7% | 82.6% | 82.8% |
| Hebrew | 99.3% | 100.0% | 100.0% | 99.8% | 99.8% | 99.8% | 100.0% | 100.0% | 100.0% | 99.8% |
| Indonesian | 80.5% | 83.1% | 83.4% | 83.9% | 85.4% | 86.1% | 86.3% | 86.7% | 86.7% | 86.1% |
| Italian | 74.8% | 74.6% | 73.3% | 74.6% | 76.3% | 77.3% | 78.2% | 78.2% | 78.6% | 78.8% |
| Japanese | 79.9% | 83.9% | 85.4% | 87.2% | 86.2% | 86.7% | 87.9% | 87.9% | 88.5% | 88.8% |
| Korean | 99.2% | 99.5% | 99.7% | 99.7% | 99.5% | 99.7% | 99.7% | 99.7% | 99.7% | 99.7% |
| Persian | 91.9% | 91.7% | 92.0% | 92.0% | 92.5% | 93.6% | 93.5% | 93.1% | 93.0% | 93.0% |
| Polish | 90.0% | 91.7% | 93.3% | 93.6% | 93.3% | 94.3% | 94.1% | 94.8% | 95.3% | 96.0% |
| Portuguese | 73.4% | 73.1% | 74.6% | 76.1% | 77.1% | 75.8% | 78.4% | 78.4% | 79.7% | 79.5% |
| Russian | 85.5% | 84.8% | 85.0% | 85.0% | 84.4% | 84.4% | 83.9% | 83.9% | 83.9% | 84.5% |
| Spanish | 72.5% | 74.2% | 73.4% | 78.1% | 78.7% | 77.4% | 78.1% | 77.9% | 78.0% | 78.4% |
| Swedish | 68.9% | 72.3% | 76.6% | 77.5% | 78.0% | 78.0% | 78.9% | 79.1% | 80.1% | 80.4% |
| Turkish | 89.6% | 92.4% | 92.1% | 93.4% | 93.2% | 93.6% | 93.6% | 93.9% | 93.6% | 93.1% |
| Ukrainian | 82.9% | 81.0% | 82.1% | 82.4% | 81.6% | 81.3% | 80.8% | 80.8% | 81.2% | 81.7% |
| Vietnamese | 97.5% | 98.5% | 99.0% | 99.3% | 99.0% | 98.8% | 98.8% | 98.3% | 98.0% | 97.8% |
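F0.5 scores against all 59 available languages, by model size in ngrams. (For the TOTAL row, precision equals recall, so all F-scores coincide; the per-language values match the f1 column of the detailed 3,000-ngram report for the 59 languages below.)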
| ngrams | 1000 | 2000 | 3000 | 4000 | 5000 | 6000 | 7000 | 8000 | 9000 | 10000 |
|---|---|---|---|---|---|---|---|---|---|---|
| TOTAL | 69.8% | 73.3% | 74.7% | 76.0% | 76.6% | 76.9% | 77.4% | 77.5% | 78.1% | 78.5% |
| Arabic | 92.2% | 91.5% | 91.9% | 92.6% | 93.6% | 93.8% | 94.1% | 93.9% | 93.6% | 93.9% |
| Chinese | 52.7% | 55.8% | 58.0% | 61.1% | 60.2% | 60.6% | 62.2% | 64.3% | 65.9% | 67.2% |
| Czech | 83.9% | 87.5% | 88.6% | 87.6% | 88.3% | 87.7% | 87.4% | 87.6% | 87.9% | 87.7% |
| Dutch | 67.3% | 71.1% | 76.1% | 75.7% | 76.6% | 77.6% | 78.5% | 79.1% | 80.0% | 79.6% |
| English | 66.7% | 72.3% | 73.2% | 74.5% | 74.2% | 74.2% | 72.8% | 72.2% | 72.8% | 74.2% |
| French | 84.6% | 87.6% | 86.7% | 87.2% | 87.7% | 87.3% | 87.3% | 87.7% | 88.0% | 87.9% |
| German | 74.5% | 76.3% | 80.0% | 79.7% | 81.5% | 81.6% | 80.9% | 82.2% | 83.6% | 83.9% |
| Hebrew | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| Indonesian | 42.1% | 50.7% | 54.1% | 58.7% | 62.6% | 66.0% | 68.7% | 68.7% | 69.7% | 71.2% |
| Italian | 71.9% | 74.6% | 74.4% | 75.1% | 76.9% | 78.4% | 79.5% | 78.8% | 78.5% | 78.5% |
| Japanese | 80.4% | 82.9% | 85.2% | 86.2% | 85.9% | 86.4% | 87.6% | 87.6% | 88.2% | 88.2% |
| Korean | 99.5% | 99.5% | 99.7% | 99.7% | 99.7% | 99.7% | 99.7% | 99.7% | 99.7% | 99.7% |
| Persian | 85.3% | 86.6% | 87.0% | 87.6% | 87.6% | 88.4% | 88.0% | 87.4% | 86.8% | 86.8% |
| Polish | 92.1% | 93.1% | 93.8% | 93.9% | 93.8% | 94.1% | 94.1% | 95.4% | 95.6% | 96.4% |
| Portuguese | 70.2% | 70.4% | 72.2% | 74.0% | 74.5% | 74.7% | 76.6% | 77.1% | 78.2% | 78.2% |
| Russian | 72.8% | 77.0% | 76.8% | 79.6% | 77.5% | 77.2% | 77.8% | 78.4% | 78.7% | 79.1% |
| Spanish | 66.7% | 70.4% | 72.0% | 74.6% | 75.5% | 75.8% | 76.5% | 75.7% | 78.4% | 78.7% |
| Swedish | 55.2% | 59.2% | 62.0% | 62.8% | 65.3% | 65.9% | 65.2% | 66.3% | 66.9% | 68.5% |
| Turkish | 85.8% | 88.8% | 89.8% | 90.7% | 91.5% | 91.0% | 91.3% | 91.2% | 90.2% | 90.5% |
| Ukrainian | 78.9% | 79.9% | 80.2% | 80.8% | 79.7% | 78.5% | 78.5% | 77.8% | 77.9% | 77.9% |
| Vietnamese | 97.5% | 99.0% | 99.0% | 99.2% | 98.7% | 98.5% | 98.5% | 98.5% | 98.5% | 98.5% |
Obviously, performance is noticeably worse when additional “spoiler” languages are available to be selected: with 3,000-ngram models, for example, the total score drops from 86.5% to 74.7%.
Looking at Languages with a 3,000-ngram model
Since we are using 3,000-ngram models for our current A/B tests, we’ll evaluate those models by language.
21 Known Languages
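Here is the detailed accuracy report by language when using the set of 21 known languages, with 3,000-ngram models. In this report, “hits” are queries correctly identified as the language, and “misses” are queries in other languages incorrectly identified as that language (false positives):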
| language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
|---|---|---|---|---|---|---|---|---|
| TOTAL | 86.5% | 86.5% | 86.5% | 86.5% | 86.5% | 4200 | 3635 | 565 |
| Arabic | 90.8% | 92.5% | 94.3% | 95.5% | 89.7% | 200 | 191 | 22 |
| Chinese | 83.8% | 86.9% | 90.2% | 92.5% | 81.9% | 200 | 185 | 41 |
| Czech | 91.9% | 92.9% | 93.8% | 94.5% | 91.3% | 200 | 189 | 18 |
| Dutch | 81.6% | 78.0% | 74.6% | 72.5% | 84.3% | 200 | 145 | 27 |
| English | 90.4% | 86.6% | 83.2% | 81.0% | 93.1% | 200 | 162 | 12 |
| French | 86.9% | 89.3% | 91.8% | 93.5% | 85.4% | 200 | 187 | 32 |
| German | 81.1% | 80.5% | 79.9% | 79.5% | 81.5% | 200 | 159 | 36 |
| Hebrew | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 200 | 200 | 0 |
| Indonesian | 80.9% | 83.4% | 86.1% | 88.0% | 79.3% | 200 | 176 | 46 |
| Italian | 72.4% | 73.3% | 74.3% | 75.0% | 71.8% | 200 | 150 | 59 |
| Japanese | 91.0% | 85.4% | 80.5% | 77.5% | 95.1% | 200 | 155 | 8 |
| Korean | 99.9% | 99.7% | 99.6% | 99.5% | 100.0% | 200 | 199 | 0 |
| Persian | 93.6% | 92.0% | 90.5% | 89.5% | 94.7% | 200 | 179 | 10 |
| Polish | 92.6% | 93.3% | 94.0% | 94.5% | 92.2% | 200 | 189 | 16 |
| Portuguese | 75.8% | 74.6% | 73.3% | 72.5% | 76.7% | 200 | 145 | 44 |
| Russian | 81.3% | 85.0% | 89.1% | 92.0% | 79.0% | 200 | 184 | 49 |
| Spanish | 70.6% | 73.4% | 76.4% | 78.5% | 68.9% | 200 | 157 | 71 |
| Swedish | 79.0% | 76.6% | 74.4% | 73.0% | 80.7% | 200 | 146 | 35 |
| Turkish | 91.3% | 92.1% | 92.9% | 93.5% | 90.8% | 200 | 187 | 19 |
| Ukrainian | 86.6% | 82.1% | 78.0% | 75.5% | 89.9% | 200 | 151 | 17 |
| Vietnamese | 98.7% | 99.0% | 99.3% | 99.5% | 98.5% | 200 | 199 | 3 |
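The score columns in this report can be recomputed from the three count columns. The sketch below assumes that “misses” counts false positives (queries in other languages that were labeled with this language), a reading inferred from the numbers rather than stated by the report; the function names are illustrative only.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def language_report(total, hits, misses):
    """Scores for one language, reading `misses` as false positives."""
    recall = hits / total if total else 0.0
    claimed = hits + misses  # every query labeled with this language
    precision = hits / claimed if claimed else 0.0
    return {
        "f0.5": f_beta(precision, recall, 0.5),  # favors precision
        "f1": f_beta(precision, recall, 1.0),
        "f2": f_beta(precision, recall, 2.0),    # favors recall
        "recall": recall,
        "prec": precision,
    }


arabic = language_report(total=200, hits=191, misses=22)
print({name: f"{value:.1%}" for name, value in arabic.items()})
# {'f0.5': '90.8%', 'f1': '92.5%', 'f2': '94.3%', 'recall': '95.5%', 'prec': '89.7%'}
```

This reproduces the Arabic row: recall is 191/200 and precision is 191/(191 + 22). For the TOTAL row every miss for one language is also a false positive for another, so precision and recall are equal and all three F-scores collapse to the same value.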
The poorest performers in recall are Dutch (72.5%), Portuguese (72.5%), Swedish (73.0%), Italian (75.0%), Ukrainian (75.5%), Japanese (77.5%), and Spanish (78.5%).
The poorest performers in precision are Spanish (68.9%), Italian (71.8%), Portuguese (76.7%), and Russian (79.0%).
Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group.
Most common identification errors (query language: language it was misidentified as, with counts):
- Arabic: Persian (9)
- Persian: Arabic (21)

- Chinese: Japanese (8)
- Japanese: Chinese (41)

- Dutch: German (17)
- German: Dutch (16)

- French: Italian (4)
- Italian: Spanish (12), Indonesian (11), Portuguese (11)
- Portuguese: Spanish (37), Italian (11)
- Spanish: Portuguese (21), Italian (11)

- Russian: Ukrainian (16)
- Ukrainian: Russian (49)

- Czech: Polish (4)
- English: Dutch (5), French (5), German (5), Spanish (5)
- Indonesian: Italian (6)
- Korean: Turkish (1)
- Polish: Indonesian (3)
- Swedish: Indonesian (15)
- Turkish: Indonesian (3), Swedish (3)
- Vietnamese: Italian (1)
So, confusion among Arabic/Persian, Chinese/Japanese, Dutch/German, French/Italian/Portuguese/Spanish, and Russian/Ukrainian is not too surprising.
Indonesian seems to be the most obvious outlier here, incorrectly claiming a fair number of Italian and Swedish queries.
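The selection rule used for the error lists (all confusions with a count of at least 10, plus the most frequent confusion for each language) can be sketched as below. The function name and the `(actual, predicted)` pair format are assumptions; the demo counts are a handful taken from the 21-language list above.

```python
from collections import Counter


def common_errors(pairs, min_count=10):
    """Map each query language to its most common misidentifications.

    `pairs` is an iterable of (actual_language, predicted_language).
    """
    confusions = Counter((a, p) for a, p in pairs if a != p)
    by_actual = {}
    for (actual, predicted), n in confusions.items():
        by_actual.setdefault(actual, []).append((predicted, n))
    report = {}
    for actual, errors in by_actual.items():
        # Most frequent first; ties broken alphabetically.
        errors.sort(key=lambda e: (-e[1], e[0]))
        top = errors[0][1]
        # Keep everything at or above the threshold, plus the top entry.
        report[actual] = [e for e in errors if e[1] >= min_count or e[1] == top]
    return report


counts = {
    ("Arabic", "Persian"): 9,
    ("Persian", "Arabic"): 21,
    ("Italian", "Spanish"): 12,
    ("Italian", "Indonesian"): 11,
    ("Korean", "Turkish"): 1,
}
pairs = [pair for pair, n in counts.items() for _ in range(n)]
errors = common_errors(pairs)
```

Keeping the top confusion even when it falls below the threshold is what lets languages with very few errors, such as Korean, still appear in the list.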
59 Available Language Models
Keep in mind that some of these are known to be a bit dodgy.
Here is the detailed accuracy report by language when using the full set of 59 languages, with 3,000 ngram models:
| language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
|---|---|---|---|---|---|---|---|---|
| TOTAL | 74.7% | 74.7% | 74.7% | 74.7% | 74.7% | 4200 | 3138 | 1062 |
| Arabic | 91.0% | 91.9% | 92.9% | 93.5% | 90.3% | 200 | 187 | 20 |
| Chinese | 67.5% | 58.0% | 50.9% | 47.0% | 75.8% | 200 | 94 | 30 |
| Czech | 94.8% | 88.6% | 83.2% | 80.0% | 99.4% | 200 | 160 | 1 |
| Dutch | 82.9% | 76.1% | 70.4% | 67.0% | 88.2% | 200 | 134 | 18 |
| English | 85.0% | 73.2% | 64.3% | 59.5% | 95.2% | 200 | 119 | 6 |
| French | 88.0% | 86.7% | 85.4% | 84.5% | 88.9% | 200 | 169 | 21 |
| German | 84.1% | 80.0% | 76.3% | 74.0% | 87.1% | 200 | 148 | 22 |
| Hebrew | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 200 | 200 | 0 |
| Indonesian | 66.1% | 54.1% | 45.8% | 41.5% | 77.6% | 200 | 83 | 24 |
| Italian | 80.5% | 74.4% | 69.1% | 66.0% | 85.2% | 200 | 132 | 23 |
| Japanese | 91.8% | 85.2% | 79.4% | 76.0% | 96.8% | 200 | 152 | 5 |
| Korean | 99.9% | 99.7% | 99.6% | 99.5% | 100.0% | 200 | 199 | 0 |
| Persian | 91.7% | 87.0% | 82.6% | 80.0% | 95.2% | 200 | 160 | 8 |
| Polish | 95.6% | 93.8% | 92.1% | 91.0% | 96.8% | 200 | 182 | 6 |
| Portuguese | 76.9% | 72.2% | 68.0% | 65.5% | 80.4% | 200 | 131 | 32 |
| Russian | 80.7% | 76.8% | 73.2% | 71.0% | 83.5% | 200 | 142 | 28 |
| Spanish | 74.6% | 72.0% | 69.5% | 68.0% | 76.4% | 200 | 136 | 42 |
| Swedish | 71.7% | 62.0% | 54.5% | 50.5% | 80.2% | 200 | 101 | 25 |
| Turkish | 92.5% | 89.8% | 87.2% | 85.5% | 94.5% | 200 | 171 | 10 |
| Ukrainian | 87.9% | 80.2% | 73.8% | 70.0% | 94.0% | 200 | 140 | 9 |
| Vietnamese | 99.0% | 99.0% | 99.0% | 99.0% | 99.0% | 200 | 198 | 2 |
| Albanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 12 |
| Azerbaijani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 11 |
| Basque | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 25 |
| Bengali | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 16 |
| Bulgarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 50 |
| Cantonese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 107 |
| Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 39 |
| Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 7 |
| Danish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 26 |
| Estonian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 7 |
| Finnish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 25 |
| Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 7 |
| Igbo | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 37 |
| Kazakh | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 8 |
| Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 65 |
| Latvian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 10 |
| Lithuanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 8 |
| Macedonian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 21 |
| Malay | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 85 |
| Malayalam | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 1 |
| Mongolian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 2 |
| Norwegian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 28 |
| Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 29 |
| Serbian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 2 |
| Serbo-Croatian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 15 |
| Slovak | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 27 |
| Slovenian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 14 |
| Tagalog | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 19 |
| Tamil | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 2 |
| Urdu | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 25 |
The poorest performers in recall are Indonesian (41.5%), Chinese (47.0%), Swedish (50.5%), English (59.5%), Portuguese (65.5%), Italian (66.0%), Dutch (67.0%), and Spanish (68.0%).
The poorest performers in precision are Chinese (75.8%), Spanish (76.4%), and Indonesian (77.6%).
The poorest performers in terms of false positives among the languages not in the balanced query set are Cantonese (107), Malay (85), Latin (65), Bulgarian (50), Catalan (39), and Igbo (37).
Below are the most common identification errors for each language (all cases ≥10, plus highest for each language), grouped by similarity (language and/or script family) when there is considerable confusion within the group.
Most common identification errors (query language: language it was misidentified as, with counts):
- Arabic: Persian (8)
- Persian: Arabic (20), Urdu (20)

- Chinese: Cantonese (94)
- Japanese: Chinese (30), Cantonese (13)

- Dutch: German (12)
- German: Dutch (11)

- French: Catalan (7)
- Italian: Latin (10)
- Portuguese: Spanish (24), Latin (17)
- Spanish: Portuguese (18), Catalan (13)

- Russian: Bulgarian (28), Macedonian (15)
- Ukrainian: Russian (28), Bulgarian (22)

- Czech: Slovak (20)

- Indonesian: Malay (75)

- Swedish: Norwegian (17), Danish (11)

- English: Igbo (32)

- Korean: Azerbaijani (1)
- Polish: Latin (3), Serbo-Croatian (3)
- Turkish: Azerbaijani (7)
- Vietnamese: Italian (1), Latin (1)
As before, confusion among Arabic/Persian/Urdu, Chinese/Japanese/Cantonese, Dutch/German, French/Italian/Portuguese/Spanish/Catalan/Latin, and Russian/Ukrainian/Bulgarian/Macedonian is not too surprising. Nor is confusion among Czech/Slovak, Indonesian/Malay, or Swedish/Norwegian/Danish.
English/Igbo would be a surprise, but we already know there’s a lot of English in the Igbo training data.
Conclusions
For the 21 languages we should be able to release these query-based models and include them with the PHP version of TextCat used for our A/B tests.
Indonesian needs the most work, since it is performing poorly in unexpected ways (i.e., with Swedish and Italian).
The other language/script families that perform poorly may also benefit from additional work to improve the quality of their training data.
For the full list of 59 languages, Igbo sticks out as the worst performing. As expected, language/script families are generally more easily confused.
Next Steps
To Do:
- Release the rest of the 21 languages in the balanced query set, because they seem to be working reasonably well on reasonably clean and balanced data. T121539
To Consider:
- Try to improve the training data for Indonesian, and re-assess against this test set. T121547
- Try to improve the training data for the various language/script families, and re-assess against this test set. Also T121547
- Release improved models.
- Add to the balanced test set additional languages, based on query volume, the uniqueness of the language-script mapping (e.g., Thai, Armenian), by language family, or some other criteria of desirability. Assess performance on this set.
- Determine which models need improvement, and release the acceptable models.