User:TJones (WMF)/Notes/Favoring Recall in Language Identification
May 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T134431)
TL;DR
I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.
Introduction
Unfortunately, language detection generally becomes more difficult as strings become shorter: ambiguity increases (see “a” in Wiktionary for an extreme example), and language-level statistics on letters and letter combinations are less reliable because of the small sample size available in only a few words. Thus deciding what to do about the tradeoff between recall and precision has a sizable effect on the language identification results when working with queries.
In my various analyses related to Language Identification I have been favoring precision (F0.5) over recall (F2—note that F1 is balanced between the two). My thought was that it is better to give fewer correct answers than to make lots of “silly”[1] mistakes. However, the contrary philosophy, favoring recall, is also reasonable: i.e., provide enough answers and there’s a better chance you’ll provide something that’s useful.
In the context of the performance of language detection on the annotated query sets I’ve been using, overall recall and precision are tightly coupled when only one language is allowed per query. This is because any false positive for one language is a false negative for another language. Recall and precision for individual languages can vary wildly (and they do!), but overall recall and precision are the same, except for occasional rounding errors and very rare cases where no language is detected (resulting in a penalty to recall but not precision). Similarly, F0.5, F1, and F2 are approximately the same at the overall level because a weighted average of two nearly identical values still has to be between the two values.
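For reference, the F-measures mentioned above are weighted harmonic means of precision and recall. The following is a minimal sketch of the standard definition; the function name and the sample values are illustrative only and are not taken from the analysis.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall.

    beta < 1 favors precision (e.g. F0.5); beta > 1 favors recall (e.g. F2).
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values, chosen only to show how the weighting behaves.
precision, recall = 0.60, 0.90
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
```

When precision and recall are equal, every F-measure collapses to that same value, which is why F0.5, F1, and F2 are nearly identical at the overall level when only one language is allowed per query.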
Approaches
I’m investigating two main approaches for improving recall. The first, suggested by David, is to ignore the language of the wiki. The second is to provide more than one language as a result for the language detected.
Ignoring the Language of the Wiki
Even though we are generally looking at doing language detection only on poor-performing queries (e.g., those that get fewer than three results), most of the queries that are in a language[2] are in the language of the wiki we are looking at. That is, for example, most of the poor-performing queries on French Wikipedia are still in French, and it’s the same for the other wikis I’ve looked at.
In order to improve precision, I’ve always included the language of the wiki I’m looking at among the language options[3] for a given wiki. David pointed out that we obviously aren’t going to get many results (fewer than three in fact) in the language of the wiki we are on, so we might as well ignore the language of the wiki we are on and look for results elsewhere—that is, worry less about precision (i.e., about avoiding “silly” results), and focus more on recall (i.e., offering some sort of result, because no result is not very helpful).
Of course, if about 70% of poor-performing queries on French Wikipedia are in French, then most of our language detection guesses will be wrong if French is not among the possible answers. However, we have a better chance of finding something relevant on another wiki if we are looking.
Returning Multiple Languages
Another approach to improve recall is to give more answers. At a ridiculous extreme, we could return every known language as an answer for every query. The right answer would be in there, but it wouldn’t be very helpful. However, returning two or three languages (i.e., including the language detector’s second or third choice option if provided) is more manageable.
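TextCat, for example, returns the best-matching language, plus any alternatives that are, by default, within 5% of the best match. Returning two results improves the chances that at least one of them is right, though it does not literally double them, since the second choice is usually less likely to be correct than the first.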
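The "within 5% of the best match" rule can be sketched as follows. This is not TextCat's actual code: the function name, the `ratio` and `max_results` parameters, and the sample scores are all hypothetical, and the sketch assumes that lower scores are better (a distance-like cost).

```python
def candidates_within_ratio(
    scores: dict[str, float], ratio: float = 1.05, max_results: int = 3
) -> list[str]:
    """Return the best language plus alternatives scoring within `ratio` of it.

    Assumes lower scores are better (a distance-like cost).
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best_score = ranked[0][1]
    close = [lang for lang, score in ranked if score <= best_score * ratio]
    return close[:max_results]


# Made-up scores for illustration only.
print(candidates_within_ratio({"fr": 1000.0, "en": 1030.0, "es": 1200.0}))
```

With these made-up scores, English is close enough to French to be returned as a second choice, while Spanish is not.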
Returning multiple languages is terrible for precision, of course. If we give two languages as results for every query, then many more may be “silly”, and at least half of them will be wrong—and so precision would max out at 50%. In practice, not every query will get multiple results (despite the difficulties overall, some queries are actually pretty easy to get right), so precision higher than 50% is possible.
Another option is to tune individual languages based on position in the returned list. On the German Wikipedia data, for example, detecting Polish may be 85% accurate when Polish is the first option, but only 3% accurate when it is the second option, in which case it makes sense to only accept Polish as an answer when it is the first option returned. English, on the other hand, may be 85% accurate in the first position, but still 80% accurate in the third position, in which case it makes sense to consider English results even in the third position. This is of course more complex to optimize for and to implement than just returning a fixed number of results.
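A per-language position limit of this kind can be expressed as a simple filter over the ranked results. The sketch below is illustrative only: the function name and the `limits` mapping are hypothetical, and languages without an entry in the mapping are assumed to be never accepted.

```python
def filter_by_position(ranked: list[str], max_position: dict[str, int]) -> list[str]:
    """Keep a language only if its 1-based rank is within its allowed position."""
    return [
        lang
        for position, lang in enumerate(ranked, start=1)
        if position <= max_position.get(lang, 0)
    ]


# Illustrative limits echoing the example above: Polish only in first
# position, English anywhere in the top three.
limits = {"pl": 1, "en": 3}
print(filter_by_position(["pl", "en"], limits))
print(filter_by_position(["en", "pl"], limits))
```

In the second call Polish is dropped because it appears in second position, while English is kept in either position.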
Note that with multiple lang-ID results per query, recall and precision are no longer tightly coupled. It’s possible for any given query to get no answer (false negative),[4] one right answer (true positive), or one wrong answer (false positive and false negative), or two wrong answers (two false positives and one false negative), or one right and one wrong (true positive and false positive).
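The outcomes listed above can be counted per query with a small helper. This is a sketch under the assumption that every query has exactly one correct language; the function name is hypothetical and this is not the evaluation code used for the analysis.

```python
from typing import List, Tuple


def score_query(actual: str, predicted: List[str]) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives) for one query."""
    unique = set(predicted)
    tp = 1 if actual in unique else 0
    fp = len(unique) - tp          # every returned language that is not the actual one
    fn = 0 if tp else 1            # the actual language was not returned
    return tp, fp, fn


print(score_query("en", []))            # no answer
print(score_query("en", ["en"]))        # one right answer
print(score_query("en", ["es"]))        # one wrong answer
print(score_query("en", ["es", "pt"]))  # two wrong answers
print(score_query("en", ["en", "es"]))  # one right and one wrong
```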
Combining Both Approaches
Of course, it is possible to combine both approaches: ignoring the language of the wiki and returning multiple results.
The combination also allows yet another permutation: include but ignore the language of the wiki while allowing multiple results.
For example, on the German Wikipedia, as in the example above, we might have decided that Polish is only allowed as the first result, while English is allowed as the first or second result.
If we include German among the languages being detected, we have two possibilities when both German and Polish are detected: either Polish is first, or German is first. In this case, we might say that if the query looks more Polish than German, we will treat it as Polish. But if it looks more German than Polish, we will ignore the Polish result, even though we are ignoring all German results, too. On the other hand, we’ll consider English whether it comes first or second to German.
Another possible outcome is that the matching on German is so good that no other languages are considered reasonable alternatives (i.e., none score within 5% of German), resulting in no language detection for a given query.
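The "include but ignore" behavior described above can be sketched as a filter in which the home language still occupies a position in the ranking but is never reported. The function name and the limits are hypothetical, chosen to mirror the German Wikipedia example.

```python
def detect_but_ignore_home(
    ranked: list[str], home: str, max_position: dict[str, int]
) -> list[str]:
    """Apply per-language position limits, then drop the home language.

    Positions are counted with the home language still in the list, so a
    home-language match can push other languages past their limit.
    """
    kept = []
    for position, lang in enumerate(ranked, start=1):
        if lang == home:
            continue  # detected, but never reported
        if position <= max_position.get(lang, 0):
            kept.append(lang)
    return kept


limits = {"pl": 1, "en": 2}  # illustrative limits only
print(detect_but_ignore_home(["pl", "de"], "de", limits))  # looks more Polish
print(detect_but_ignore_home(["de", "pl"], "de", limits))  # looks more German
print(detect_but_ignore_home(["de", "en"], "de", limits))  # English second
print(detect_but_ignore_home(["de"], "de", limits))        # only German matched
```

The second and fourth calls return an empty list, which corresponds to the "no language detection" outcome.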
Tangent on User Interface
I think the way the language detection information is used and presented affects and is affected by whether we favor recall or precision. If we have high-precision results and only one language detected, it may make more sense to just provide the cross-wiki results right on the wiki page.
On the other hand, if we are maximizing recall and providing, say, up to three language detection results per query, it might make more sense to only provide a link that says something like, “Would you like to see results on Spanish Wikipedia, French Wikipedia, or Portuguese Wikipedia?”
“Silly” language detection results may be more tolerable to users when they only result in an extra link or two on the results page for poor-performing queries.
Some users may find “silly” results confusing in all cases, and some may never mind them. We’ll definitely need to consult the user community and try out various options as A/B tests before coming to a final decision on how best to show results.
Overview of Options
In the analysis that follows, I’m going to consider 7 (!) options for each wiki. (It may turn out that there isn’t much difference among them.)
1) Ignore Home Language: Ignore the language of the wiki we are on when doing language detection to increase the chances of finding something on another wiki. We will only allow one result per query. We will note but not consider as errors misidentification on queries in the language of the wiki. (e.g., on German Wikipedia, a query in German identified as English isn’t counted as an error, though we will note how often it happens.)
2) Allow Multiple Lang-ID Results: Even though with multiple results, all but one of them must be wrong, the chances of getting that one correct result increase when we allow more results. Precision will take a hit, but we’ll pay more attention to recall using F2. This includes the language of the wiki.
3) Allow Multiple Lang-ID Results, with per-Language Thresholds: Allow multiple lang-ID results per query, but limit whether languages are considered based on their position within the results. (e.g., Polish counts if it is the best result, but not the second best, while English counts in either place.) This includes the language of the wiki.
4) Allow Multiple Lang-ID Results, Ignoring Home Language: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection. We will note but not consider as errors misidentification on queries in the language of the wiki.
5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection, and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.
6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language: Allow multiple lang-ID results per query, including the language of the wiki; however, ignore results in the language of the wiki for the purposes of calculating recall and precision, which may result in “no result” or in other languages being pushed down a position in the ranked results. We will note but not consider as errors misidentification on queries in the language of the wiki.
7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, including the language of the wiki—however, ignore results in the language of the wiki for the purposes of calculating recall and precision—and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.
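The seven options differ along three axes: whether more than one result is allowed, how the home language is handled, and whether per-language position thresholds are applied. One way to see this at a glance is the small table-as-code below; the class and field names are hypothetical labels, not code from the analysis.

```python
from dataclasses import dataclass
from enum import Enum


class HomeLanguage(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude from detection"
    DETECT_BUT_IGNORE = "detect but ignore"


@dataclass(frozen=True)
class Option:
    number: int
    multiple_results: bool
    home_language: HomeLanguage
    per_language_thresholds: bool


OPTIONS = [
    Option(1, False, HomeLanguage.EXCLUDE, False),
    Option(2, True, HomeLanguage.INCLUDE, False),
    Option(3, True, HomeLanguage.INCLUDE, True),
    Option(4, True, HomeLanguage.EXCLUDE, False),
    Option(5, True, HomeLanguage.EXCLUDE, True),
    Option(6, True, HomeLanguage.DETECT_BUT_IGNORE, False),
    Option(7, True, HomeLanguage.DETECT_BUT_IGNORE, True),
]

for opt in OPTIONS:
    print(opt.number, opt.multiple_results, opt.home_language.value,
          opt.per_language_thresholds)
```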
Notes
Apples & Oranges
In the analysis summaries, the results are divided into those that include the “home” language of the wiki, and those that don’t. These can’t necessarily be compared directly to each other. The home language is always the largest category of queries, and removing it takes away the best source of correct language identification, and can drastically change the interactions among the remaining languages, especially if the second most common language is not as dominant over the remainder as the home language is over all the others. Another factor, for these samples, is that the non-home language sample size can be smaller than I’d like (esp. for Spanish, which is ridiculously small!). The samples were taken to reach a target of 500+ annotated queries, but not with any minimum non-home language sample size.
Coverage vs Recall
Another way of looking at the problem is as one of “coverage” rather than recall. In this sense, coverage indicates the number of queries that return some language that can be used for cross-wiki searching, even if it isn’t the correct one. This could be called a desperate attempt to return anything at all. In the options above, (1), (4), and (5) have at least 99% coverage: some result that is not the home wiki language is returned for almost all queries. If coverage rather than recall is important, we can choose the option from among (1), (4), and (5) with the best F2 score.
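Coverage in this sense is straightforward to compute from per-query results. A minimal sketch, with a hypothetical function name and made-up data, expressing coverage as a fraction of queries:

```python
def coverage(results: list[list[str]], home: str) -> float:
    """Fraction of queries with at least one non-home language returned."""
    if not results:
        return 0.0
    covered = sum(1 for langs in results if any(lang != home for lang in langs))
    return covered / len(results)


# Made-up results for four queries on a French wiki.
sample = [["en"], ["fr"], [], ["fr", "es"]]
print(coverage(sample, home="fr"))
```

Note that correctness plays no part here: a query counts as covered as soon as any usable language is returned.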
“Extra” Language Models
For the present analysis I’m not including the “extra” language models that I included in the precision-favoring analysis. These were high-accuracy models for queries that were found in the larger sample for each wiki, but not in the hand-coded sample used for optimization. They could and should be included (after a quick check that they don’t cause unexpected problems) if any recommendations are taken from this analysis.
Results
Compare to precision-favoring results.
French
0) Precision-Favoring Results—F2: 89.1%
1) Ignore Home Language—F2: 84.9%
The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 84.7% | 84.8% | 84.9% | 85.0% | 84.6% | 213 | 181 | 33 |
English | 91.0% | 88.8% | 86.6% | 85.2% | 92.6% | 88 | 75 | 6 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 70.3% | 72.0% | 73.8% | 75.0% | 69.2% | 12 | 9 | 4 |
German | 52.6% | 62.5% | 76.9% | 90.9% | 47.6% | 11 | 10 | 11 |
Spanish | 54.1% | 61.5% | 71.4% | 80.0% | 50.0% | 10 | 8 | 8 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Corsican | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 2 | 2 | 4 |
Armenian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (221), Spanish (100), German (74), Portuguese (68), Polish (5)
2) Allow Multiple Lang-ID Results—F2: 89.1%
The best language set, with a threshold of 1 language, is French, English, Arabic, Russian, Chinese, Thai, Armenian. (fr, en, ar, ru, zh, th, hy)
This is the same result as (0), because the optimal threshold is 1.
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 89.0% | 89.1% | 89.1% | 89.1% | 89.0% | 681 | 607 | 75 |
French | 94.8% | 95.1% | 95.5% | 95.7% | 94.5% | 468 | 448 | 26 |
English | 67.0% | 74.9% | 84.9% | 93.2% | 62.6% | 88 | 82 | 49 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 12 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 11 | 0 | 0 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 10 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Corsican | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Armenian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (20)
3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.7%
The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 4 | 79.3% | 84.2% | 89.7% | 93.8% | 76.3% | 681 | 639 | 198 |
French | 4 | 94.1% | 94.7% | 95.3% | 95.7% | 93.7% | 468 | 448 | 30 |
English | 3 | 61.4% | 70.1% | 81.8% | 92.0% | 56.6% | 88 | 81 | 62 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 2 | 37.2% | 47.8% | 67.1% | 91.7% | 32.4% | 12 | 11 | 23 |
German | 1 | 43.7% | 52.9% | 67.2% | 81.8% | 39.1% | 11 | 9 | 14 |
Spanish | 2 | 20.0% | 28.6% | 50.0% | 100.0% | 16.7% | 10 | 10 | 50 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 1 | 93.8% | 85.7% | 78.9% | 75.0% | 100.0% | 4 | 3 | 0 |
Dutch | 1 | 18.2% | 25.0% | 40.0% | 66.7% | 15.4% | 3 | 2 | 11 |
Corsican | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 1 | 23.8% | 33.3% | 55.6% | 100.0% | 20.0% | 2 | 2 | 8 |
Armenian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (50), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)
4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 84.9%
The best language set, with a threshold of 1 language, is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)
Note that this is the same as (1) above since the threshold was just one language. Allowing 3 languages offered the same F2 score (to one decimal place), with moderately higher recall and a good deal lower precision (the unbalanced trade-off being the nature of the weighted F2 measure).
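language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL (213) | 1 | 84.7% | 84.8% | 84.9% | 85.0% | 84.6% | 213 | 181 | 33 |
TOTAL (213) | 2 | 72.1% | 77.7% | 84.2% | 89.2% | 68.8% | 213 | 190 | 86 |
TOTAL (213) | 3 | 69.6% | 76.5% | 84.9% | 91.5% | 65.7% | 213 | 195 | 102 |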
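The F2 values for the three thresholds can be recomputed from the counts in the table. The sketch below assumes that recall is hits / total and that precision is hits / (hits + misses), which is consistent with the percentages reported in these rows; the function name is hypothetical.

```python
def f_beta_from_counts(total: int, hits: int, misses: int, beta: float = 2.0) -> float:
    """F-beta from table counts, assuming `misses` are incorrect results returned."""
    if hits == 0:
        return 0.0
    recall = hits / total
    precision = hits / (hits + misses)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (total, hits, misses) for thresholds 1, 2, and 3, copied from the table.
rows = {1: (213, 181, 33), 2: (213, 190, 86), 3: (213, 195, 102)}
for thresh, (total, hits, misses) in rows.items():
    print(thresh, f"{100 * f_beta_from_counts(total, hits, misses):.1f}%")
```

Going from a threshold of 1 to 3 adds 14 hits at the cost of 69 additional incorrect results, and F2 weights recall heavily enough that the two configurations come out even to one decimal place.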
5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 88.1%
The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 3 | 80.2% | 84.0% | 88.1% | 91.1% | 77.9% | 213 | 194 | 55 |
English | 3 | 84.7% | 88.4% | 92.5% | 95.5% | 82.4% | 88 | 84 | 18 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 2 | 59.8% | 68.7% | 80.9% | 91.7% | 55.0% | 12 | 11 | 9 |
German | 1 | 52.6% | 62.5% | 76.9% | 90.9% | 47.6% | 11 | 10 | 11 |
Spanish | 2 | 49.0% | 60.6% | 79.4% | 100.0% | 43.5% | 10 | 10 | 13 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Corsican | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 1 | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 2 | 2 | 4 |
Armenian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (357), Spanish (184), Portuguese (161), German (74), Polish (5)
6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 83.7%
The best language set, with a threshold of 2 languages, is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, th, hy)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 70.6% | 76.6% | 83.7% | 89.2% | 67.1% | 213 | 190 | 93 |
English | 86.5% | 88.5% | 90.6% | 92.0% | 85.3% | 88 | 81 | 14 |
Arabic | 98.8% | 99.2% | 99.7% | 100.0% | 98.5% | 66 | 66 | 1 |
Portuguese | 62.5% | 71.0% | 82.1% | 91.7% | 57.9% | 12 | 11 | 8 |
German | 26.1% | 36.1% | 58.5% | 100.0% | 22.0% | 11 | 11 | 39 |
Spanish | 51.0% | 62.5% | 80.6% | 100.0% | 45.5% | 10 | 10 | 12 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 4 | 4 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Corsican | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Armenian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 19 |
The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (45), Spanish (38), German (21), Portuguese (18)
7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 87.8%
The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 4 | 81.3% | 84.4% | 87.8% | 90.1% | 79.3% | 213 | 192 | 50 |
English | 4 | 86.9% | 89.1% | 91.5% | 93.2% | 85.4% | 88 | 82 | 14 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 66 | 66 | 0 |
Portuguese | 2 | 65.5% | 73.3% | 83.3% | 91.7% | 61.1% | 12 | 11 | 7 |
German | 1 | 57.0% | 64.3% | 73.8% | 81.8% | 52.9% | 11 | 9 | 8 |
Spanish | 2 | 51.0% | 62.5% | 80.6% | 100.0% | 45.5% | 10 | 10 | 12 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 5 | 5 | 0 |
Chinese | 1 | 93.8% | 85.7% | 78.9% | 75.0% | 100.0% | 4 | 3 | 0 |
Dutch | 1 | 32.3% | 40.0% | 52.6% | 66.7% | 28.6% | 3 | 2 | 5 |
Corsican | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Italian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Polish | 1 | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 2 | 2 | 4 |
Armenian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Breton | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Hungarian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Icelandic | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swahili | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Thai | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (56), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)
Summary
Configurations that include reporting French (n = 681 samples), by F2:
89.7% | 3) Allow Multiple Lang-ID Results, with per-Language Thresholds |
89.1% | 0) Precision-Favoring Results |
89.1% | 2) Allow Multiple Lang-ID Results |
Configurations that ignore French (n = 213), by F2:
88.1% | 5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds |
87.8% | 7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds |
84.9% | 1) Ignore Home Language |
84.9% | 4) Allow Multiple Lang-ID Results, Ignoring Home Language |
83.7% | 6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language |
Spanish
0) Precision-Favoring Results—F2: 95.6%
1) Ignore Home Language—F2: 79.5%
The best language set is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 79.5% | 79.5% | 79.5% | 79.5% | 79.5% | 44 | 35 | 9 |
English | 86.1% | 89.9% | 93.9% | 96.9% | 83.8% | 32 | 31 | 6 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 1 | 1 | 3 |
The 476 queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (440), English (36)
2) Allow Multiple Lang-ID Results—F2: 96.1%
The best language set, with a threshold of 2 languages, is Spanish, English, Russian, Chinese. (es, en, ru, zh)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 91.3% | 93.7% | 96.1% | 97.9% | 89.8% | 520 | 509 | 58 |
Spanish | 97.8% | 98.4% | 99.1% | 99.6% | 97.3% | 476 | 474 | 13 |
English | 47.1% | 58.7% | 78.0% | 100.0% | 41.6% | 32 | 32 | 45 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)
3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 96.9%
The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 2 | 94.8% | 95.8% | 96.9% | 97.7% | 94.1% | 520 | 508 | 32 |
Spanish | 2 | 97.8% | 98.4% | 99.1% | 99.6% | 97.3% | 476 | 474 | 13 |
English | 1 | 66.8% | 75.6% | 87.1% | 96.9% | 62.0% | 32 | 31 | 19 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (13).
4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.5%
The best language set, with a threshold of 1 language, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)
This is the same as (1).
5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 79.5%
The best language set, with a threshold of 1 for every result, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)
This is also the same as (1).
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 1 | 79.5% | 79.5% | 79.5% | 79.5% | 79.5% | 44 | 35 | 9 |
English | 1 | 86.1% | 89.9% | 93.9% | 96.9% | 83.8% | 32 | 31 | 6 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 1 | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 1 | 1 | 3 |
6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.3%
The best language set, with a threshold of 1 language, is Spanish, English, Russian, Chinese, Portuguese. (es, en, ru, zh, pt)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 77.3% | 77.3% | 77.3% | 77.3% | 77.3% | 44 | 34 | 10 |
English | 85.2% | 88.2% | 91.5% | 93.8% | 83.3% | 32 | 30 | 6 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 1 | 1 | 2 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 2 |
The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (46), English (9)
7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.3%
The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 2 | 82.5% | 81.4% | 80.3% | 79.5% | 83.3% | 44 | 35 | 7 |
English | 2 | 85.1% | 90.1% | 95.8% | 100.0% | 82.1% | 32 | 32 | 7 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Catalan | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
German | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Guarani | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Portuguese | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)
Summary
Configurations that include reporting Spanish (n=520), by F2:
96.9% | 3) Allow Multiple Lang-ID Results, with per-Language Thresholds |
96.1% | 2) Allow Multiple Lang-ID Results |
95.6% | 0) Precision-Favoring Results |
Configurations that ignore Spanish (n=44), by F2:
[Note that this sample is very small, probably too small to draw any strong conclusions from.]
80.3% | 7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds |
79.5% | 1) Ignore Home Language |
79.5% | 4) Allow Multiple Lang-ID Results, Ignoring Home Language |
79.5% | 5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds |
77.3% | 6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language |
Italian
0) Precision-Favoring Results—F2: 92.2%
1) Ignore Home Language—F2: 79.5%
The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese. (en, es, ru, ro, pt, ar, zh)
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 79.5% | 79.5% | 79.5% | 79.5% | 79.5% | 146 | 116 | 30 |
English | 89.1% | 90.1% | 91.1% | 91.7% | 88.5% | 109 | 100 | 13 |
Spanish | 51.5% | 60.9% | 74.5% | 87.5% | 46.7% | 8 | 7 | 8 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 21.3% | 28.6% | 43.5% | 66.7% | 18.2% | 3 | 2 | 9 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (196), English (114), Portuguese (94)
2) Allow Multiple Lang-ID Results—F2: 92.2%
The best language set, with a threshold of 1 language, is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh)
This is the same result as (0), because the optimal threshold is 1.
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 92.2% | 92.2% | 92.2% | 92.2% | 92.2% | 550 | 507 | 43 |
Italian | 95.4% | 96.7% | 98.1% | 99.0% | 94.6% | 404 | 400 | 23 |
English | 84.9% | 87.3% | 89.9% | 91.7% | 83.3% | 109 | 100 | 20 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (27)
3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 92.2%
The best language set is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh). Thresholds are shown in the table below.
The same F2 score can be had with the thresh for English set to 1, which is then the same as (2) and (0).
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 2 | 88.6% | 90.3% | 92.2% | 93.5% | 87.4% | 550 | 514 | 74 |
Italian | 1 | 95.4% | 96.7% | 98.1% | 99.0% | 94.6% | 404 | 400 | 23 |
English | 2 | 72.2% | 80.1% | 90.1% | 98.2% | 67.7% | 109 | 107 | 51 |
Spanish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
German | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Romanian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (4).
4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 80.5%
The best language set, with a threshold of 2 languages, is English, Spanish, Russian, Romanian, Arabic, Chinese (en, es, ru, ro, ar, zh).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 74.0% | 77.1% | 80.5% | 82.9% | 72.0% | 146 | 121 | 47 |
English | 87.0% | 90.6% | 94.5% | 97.2% | 84.8% | 109 | 106 | 19 |
Spanish | 26.3% | 36.4% | 58.8% | 100.0% | 22.2% | 8 | 8 | 28 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (364), English (245)
5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 82.3%
The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese (en, es, ru, ro, pt, ar, zh). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 3 | 78.8% | 80.5% | 82.3% | 83.6% | 77.7% | 146 | 122 | 35 |
English | 3 | 87.6% | 91.0% | 94.6% | 97.2% | 85.5% | 109 | 106 | 18 |
Spanish | 1 | 51.5% | 60.9% | 74.5% | 87.5% | 46.7% | 8 | 7 | 8 |
German | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 1 | 21.3% | 28.6% | 43.5% | 66.7% | 18.2% | 3 | 2 | 9 |
Romanian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (222), Spanish (196), Portuguese (94)
6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.1%
The best language set, with a threshold of 4 languages, is Italian, English, Russian, Arabic, Chinese, Spanish, Portuguese (it, en, ru, ar, zh, es, pt).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 60.4% | 67.8% | 77.1% | 84.9% | 56.4% | 146 | 124 | 96 |
English | 88.8% | 91.8% | 95.0% | 97.2% | 86.9% | 109 | 106 | 16 |
Spanish | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 8 | 8 | 24 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 14.0% | 20.7% | 39.5% | 100.0% | 11.5% | 3 | 3 | 23 |
Romanian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 33 |
The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (69), Portuguese (45), English (21)
7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 82.5%
The best language set is Italian, English, Spanish, Russian, Portuguese, Arabic, Chinese (it, en, es, ru, pt, ar, zh). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 4 | 79.6% | 81.1% | 82.5% | 83.6% | 78.7% | 146 | 122 | 33 |
English | 4 | 88.8% | 91.8% | 95.0% | 97.2% | 86.9% | 109 | 106 | 16 |
Spanish | 1 | 62.5% | 66.7% | 71.4% | 75.0% | 60.0% | 8 | 6 | 4 |
German | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 6 | 0 | 0 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 4 | 0 | 0 |
Arabic | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Portuguese | 2 | 22.4% | 31.6% | 53.6% | 100.0% | 18.8% | 3 | 3 | 13 |
Romanian | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 3 | 0 | 0 |
Russian | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 3 | 3 | 0 |
Chinese | 1 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 1 | 1 | 0 |
Czech | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Portuguese (32), English (21), Spanish (20)
Summary
Configurations that include reporting Italian (n=550), by F2:
92.2% | 0) Precision-Favoring Results |
92.2% | 2) Allow Multiple Lang-ID Results |
92.2% | 3) Allow Multiple Lang-ID Results, with per-Language Thresholds |
Configurations that ignore Italian (n=146), by F2:
82.5% | 7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds |
82.3% | 5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds |
80.5% | 4) Allow Multiple Lang-ID Results, Ignoring Home Language |
79.5% | 1) Ignore Home Language |
77.1% | 6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language |
German
0) Precision-Favoring Results—F2: 88.1%
1) Ignore Home Language—F2: 78.1%
The best language set is English, Italian, Spanish, Chinese (en, it, es, zh).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 78.1% | 78.1% | 78.1% | 78.1% | 78.1% | 160 | 125 | 35 |
English | 92.1% | 91.4% | 90.8% | 90.3% | 92.6% | 124 | 112 | 9 |
Italian | 39.5% | 48.0% | 61.2% | 75.0% | 35.3% | 8 | 6 | 11 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 36.5% | 46.7% | 64.8% | 87.5% | 31.8% | 8 | 7 | 15 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (278), Italian (72), Spanish (10)
2) Allow Multiple Lang-ID Results—F2: 88.3%
The best language set, with a threshold of 1 language, is German, English, Chinese (de, en, zh).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 88.3% | 88.3% | 88.3% | 88.3% | 88.3% | 520 | 459 | 61 |
German | 94.0% | 95.0% | 96.0% | 96.7% | 93.3% | 360 | 348 | 25 |
English | 77.9% | 81.9% | 86.3% | 89.5% | 75.5% | 124 | 111 | 36 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (12)
3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.1%
The best language set is German, English, Italian, Spanish, Chinese, Vietnamese (de, en, it, es, zh, vi). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 4 | 78.1% | 83.2% | 89.1% | 93.5% | 75.0% | 520 | 486 | 162 |
German | 4 | 87.5% | 91.4% | 95.6% | 98.6% | 85.1% | 360 | 355 | 62 |
English | 3 | 69.6% | 77.4% | 87.1% | 95.2% | 65.2% | 124 | 118 | 63 |
Italian | 1 | 26.0% | 33.3% | 46.3% | 62.5% | 22.7% | 8 | 5 | 17 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 1 | 32.4% | 42.4% | 61.4% | 87.5% | 28.0% | 8 | 7 | 18 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 1 | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 1 | 1 | 2 |
The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)
4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.1%
The best language set, with a threshold of 3 languages, is English, Spanish, Chinese (en, es, zh).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 71.8% | 75.3% | 79.1% | 81.9% | 69.7% | 160 | 131 | 57 |
English | 89.2% | 92.4% | 95.9% | 98.4% | 87.1% | 124 | 122 | 18 |
Italian | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 18.2% | 25.9% | 44.9% | 87.5% | 15.2% | 8 | 7 | 39 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (356), Spanish (114)
5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 83.5%
The best language set is English, Italian, Spanish, Chinese (en, it, es, zh). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 4 | 77.8% | 80.6% | 83.5% | 85.6% | 76.1% | 160 | 137 | 43 |
English | 3 | 89.7% | 92.8% | 96.1% | 98.4% | 87.8% | 124 | 122 | 17 |
Italian | 1 | 39.5% | 48.0% | 61.2% | 75.0% | 35.3% | 8 | 6 | 11 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 1 | 36.5% | 46.7% | 64.8% | 87.5% | 31.8% | 8 | 7 | 15 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 4 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 2 | 2 | 0 |
Dutch | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (345), Italian (72), Spanish (10)
6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 74.7%
The best language set, with a threshold of 2 languages, is German, English, Italian, Spanish, Chinese, Vietnamese (de, en, it, es, zh, vi).
language | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 58.1% | 65.3% | 74.7% | 82.5% | 54.1% | 160 | 132 | 112 |
English | 91.8% | 92.4% | 93.1% | 93.5% | 91.3% | 124 | 116 | 11 |
Italian | 27.0% | 37.2% | 59.7% | 100.0% | 22.9% | 8 | 8 | 27 |
Latin | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 29.2% | 38.9% | 58.3% | 87.5% | 25.0% | 8 | 7 | 21 |
French | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 29.4% | 40.0% | 62.5% | 100.0% | 25.0% | 1 | 1 | 3 |
German | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0 | 0 | 50 |
The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (45), Italian (14), Spanish (6), Vietnamese (1)
7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.8%
The best language set is German, English, Italian, Spanish, Chinese, Vietnamese (de, en, it, es, zh, vi). Thresholds are shown in the table below.
language | thresh | f0.5 | f1 | f2 | recall | prec | total | hits | misses |
TOTAL | 3 | 77.6% | 79.2% | 80.8% | 81.9% | 76.6% | 160 | 131 | 40 |
English | 3 | 90.5% | 92.2% | 93.9% | 95.2% | 89.4% | 124 | 118 | 14 |
Italian | 1 | 34.7% | 41.7% | 52.1% | 62.5% | 31.2% | 8 | 5 | 11 |
Latin | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 8 | 0 | 0 |
Spanish | 1 | 39.8% | 50.0% | 67.3% | 87.5% | 35.0% | 8 | 7 | 13 |
French | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 5 | 0 | 0 |
Chinese | 1 | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 2 | 0 | 0 |
Dutch | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Polish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Swedish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Turkish | - | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1 | 0 | 0 |
Vietnamese | 1 | 38.5% | 50.0% | 71.4% | 100.0% | 33.3% | 1 | 1 | 2 |
The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)
Summary
Configurations that include reporting German (n=520), by F2:
89.1% | 3) Allow Multiple Lang-ID Results, with per-Language Thresholds |
88.3% | 2) Allow Multiple Lang-ID Results |
88.1% | 0) Precision-Favoring Results |
Configurations that ignore German (n=160), by F2:
83.5% | 5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds |
80.8% | 7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds |
79.1% | 4) Allow Multiple Lang-ID Results, Ignoring Home Language |
78.1% | 1) Ignore Home Language |
74.7% | 6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language |
Discussion
Among configurations that include reporting the home language (0, 2, 3), we see the expected increase in F2 score over the baseline (0), from allowing multiple languages to be reported (2), or optimizing how far down the list to consider each language independently (3). Sometimes there is no difference between the options, but when there is, the order is always the same, and the improvement is minor (< 1.5%).
Option (2) usually results in either maintaining the threshold of one language, or increasing it a bit to 2 or 3.
Option (3) usually results in increasing the threshold for more strongly represented languages (the home language and the second most common), while keeping the others at 1. Thus, for the less frequent languages, it’s best to only accept that answer if it’s the best guess, while for the more common languages, a less confident guess is still a good one.
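As a rough illustration of what a per-language threshold does, the sketch below reads a threshold as the deepest position in the ranked guess list at which a language is still accepted. The function `accept_languages` and the ranked lists are hypothetical; this is not TextCat's actual interface, and the thresholds are the ones reported for Italian in (3).

```python
def accept_languages(ranked, thresholds):
    """Return the languages from a ranked guess list that pass their threshold.

    ranked     = language codes, best guess first (hypothetical detector output)
    thresholds = {language: deepest rank at which it is still accepted}
    Languages absent from `thresholds` are never reported.
    """
    return [
        lang
        for rank, lang in enumerate(ranked, start=1)
        if rank <= thresholds.get(lang, 0)
    ]


# Thresholds from the Italian (3) table: English may be accepted from
# second place, every other language only as the top guess.
italian_thresholds = {"it": 1, "en": 2, "ru": 1, "ar": 1, "zh": 1}

print(accept_languages(["it", "en", "ru"], italian_thresholds))  # ['it', 'en']
print(accept_languages(["en", "it", "ru"], italian_thresholds))  # ['en']
```

In the second call Italian is dropped because it is only the second guess and its threshold is 1, while English in second place in the first call is still accepted.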
Among configurations that ignore the home language (1, 4, 5, 6, 7), the two that allow for per-language thresholds (5 and 7) are consistently the best, which makes sense as they can be more finely tuned (or perhaps overfitted!). Among the others, (6) is consistently the worst. There’s a consistent partial ordering, in that (5) and (7) are the same or better than (4), which is the same or better than (1), which is better than (6).
The overall span in F2 scores for this group isn’t huge, but is sometimes moderate (3-9%). F2 scores for these configs are consistently worse than those including the home language (0, 2, 3, above), but they are apples and oranges (see Notes above).
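The orderings described in this Discussion can be checked mechanically against the Italian and German Summary lists above. The snippet below is only a sanity check on those two wikis, with the F2 values copied from the summaries; the helper names are made up for illustration.

```python
# F2 scores (percent) copied from the two Summary lists, keyed by option number.
f2 = {
    "Italian": {0: 92.2, 2: 92.2, 3: 92.2, 1: 79.5, 4: 80.5, 5: 82.3, 6: 77.1, 7: 82.5},
    "German": {0: 88.1, 2: 88.3, 3: 89.1, 1: 78.1, 4: 79.1, 5: 83.5, 6: 74.7, 7: 80.8},
}


def orderings_hold(s):
    """Check (0) <= (2) <= (3) and (5), (7) >= (4) >= (1) > (6)."""
    with_home = s[0] <= s[2] <= s[3]
    without_home = min(s[5], s[7]) >= s[4] >= s[1] > s[6]
    return with_home and without_home


def span_without_home(s):
    """Best minus worst F2 among the options that ignore the home language."""
    vals = [s[k] for k in (1, 4, 5, 6, 7)]
    return round(max(vals) - min(vals), 1)


for wiki, scores in f2.items():
    print(wiki, orderings_hold(scores), span_without_home(scores))
```

Both wikis satisfy the two orderings, and the spans among the home-language-ignoring options (5.4 and 8.8 points) fall inside the 3-9% range quoted above.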
Conclusions
For raw F2 score, including the home language gives the highest score, but can’t be directly compared to scores ignoring the home language. Allowing for multiple results is the best way to increase overall F2 score. Per-language threshold tweaking is often slightly better, but may not be worth the complexity.
In terms of coverage (see Notes above), we get nearly full coverage (some non-home language alternative is offered for every query) with options (1), (4), and (5). Option (4), allowing multiple language results and ignoring the home language, is the best middle ground. It gives more accurate results than (1), while being less complex than (5).
I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.
Footnotes
- ↑ It’s hard to know what to call these kinds of errors. To a human, some of the incorrect results can seem ridiculous. Mistaking a one-word query in Spanish for Portuguese is not so much of a “silly” mistake. Mistaking a string of Chinese characters followed by what looks like a serial number with Latin characters and numbers for German is. I can explain why it happened, based on language model sizes, rarity of the Chinese characters used, the relative frequencies of individual characters in different writing systems, and the presence of three characters, “aus”, which are very characteristically German out of context, but it still looks ridiculous to a normal human user. We’ll go with “silly” for now.
- ↑ Overall, most of the poorly performing queries are not in a language. There’s a lot of junk (e.g., “fhdjskhfdsjkhfjdks”), and a lot of names (of people, places, products, books, movies, songs, bands, etc.) that aren’t really in any language. (Names that are made up of common words of a language, such as “The Rolling Stones”, “A Hundred Years of Solitude”, “Fly by Night”, or “So I Married an Axe Murderer”, are counted as being in a language.) There are also a few transliterations, particularly of films (e.g., Naadodigal, a transliteration of Tamil நாடோடிகள், isn’t in Tamil, but isn’t in any particular language that uses the Latin alphabet, either), serial numbers and product codes, heavily ambiguous words (e.g., “a”), etc.
- ↑ It is necessary to restrict the possible languages to be reported by the language detection to the “plausible” ones. For example, an early language detector we tested reported that 38% of ~100K queries on English Wikipedia were in Romanian. I’m sure at least one of them was, but in a test set of over 1,000 queries, I didn’t see any Romanian. Clearly, leaving Romanian out of the mix improves overall results. Other cases are less clear-cut, but a language detector that gets many many more false positives than true positives is hurting more than it is helping, so it is best left out. The right blend of languages depends on the wiki in question.
- ↑ Recall and precision don’t take into account true negatives. In many information retrieval contexts, there are lots of results that are correctly not returned. So many, in fact, that it’s often hard to get below 95% accuracy on measures that include them, making those measures basically useless. In the context of language detection, for example, you can start with a list of a hundred languages, and be sure in every case that at least 99 of them are not the language of the query. If you return three lang-ID results, even if they are all wrong, there are 97 others that you correctly did not return (i.e., true negatives), giving a true negative rate (or “specificity”) of 97%. (See the definitions of various accuracy measures for more.) For search engine accuracy metrics, the imbalance is even more lopsided. For any query there are thousands, millions, or even billions of irrelevant documents that are correctly not returned in the top 10 results, every time.
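For reference, the specificity figure in the last footnote follows from the standard definition of the true negative rate, using the footnote's own counts (97 languages not returned, 3 returned wrongly):

```latex
\text{specificity} = \frac{TN}{TN + FP} = \frac{97}{97 + 3} = 97\%
```

Strictly speaking, one of the 97 unreturned languages is the correct one and so is a false negative rather than a true negative; counted that way the rate is 96/(96 + 3), which is still about 97%.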