User:TJones (WMF)/Notes/Favoring Recall in Language Identification

May 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T134431)

TL;DR

I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.

Introduction

Unfortunately language detection generally becomes more difficult as strings become shorter—ambiguity increases (see “a” in Wiktionary for an extreme example) and language-level statistics on letters and letter combinations are less reliable because of the small sample size available in only a few words. Thus deciding what to do about the tradeoff between recall and precision has a sizable effect on the language identification results when working with queries.

In my various analyses related to Language Identification I have been favoring precision (F0.5) over recall (F2—note that F1 is balanced between the two). My thought was that it is better to give fewer correct answers than to make lots of “silly”^[1] mistakes. However, the contrary philosophy, favoring recall, is also reasonable: i.e., provide enough answers and there’s a better chance you’ll provide something that’s useful.

In the context of the performance of language detection on the annotated query sets I’ve been using, overall recall and precision are tightly coupled when only one language is allowed per query. This is because any false positive for one language is a false negative for another language. Recall and precision for individual languages can vary wildly (and they do!), but overall recall and precision are the same, except for occasional rounding errors and very rare cases where no language is detected (resulting in a penalty to recall but not precision). Similarly, F0.5, F1, and F2 are approximately the same at the overall level because a weighted average of two nearly identical values still has to be between the two values.
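As a concrete illustration of the F-measures mentioned above, here is a minimal Python sketch of the standard F-beta formula. The function name `f_beta` is hypothetical. The counts come from the TOTAL row of French option (1) in the Results section, on the assumption that the "misses" column counts wrong answers (false positives), which is what the reported precision implies.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Standard F-beta: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Counts from the French "Ignore Home Language" TOTAL row:
# 213 queries, 181 hits, 33 wrong answers (assumed to be false positives).
total, hits, wrong = 213, 181, 33
recall = hits / total
precision = hits / (hits + wrong)

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g}: {f_beta(precision, recall, beta):.1%}")
```

With these counts the three scores come out at 84.7%, 84.8%, and 84.9%, matching that row. They are nearly identical because recall and precision are themselves nearly identical.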

Approaches

I’m investigating two main approaches for improving recall. The first, suggested by David, is to ignore the language of the wiki. The second is to provide more than one language as a result for the language detected.

Ignoring the Language of the Wiki

Even though we are generally looking at doing language detection only on poor-performing queries (e.g., those that get fewer than three results), most of the queries that are in a language^[2] are in the language of the wiki we are looking at. That is, for example, most of the poor-performing queries on French Wikipedia are still in French, and it’s the same for the other wikis I’ve looked at.

In order to improve precision, I’ve always included the language of the wiki I’m looking at among the language options^[3] for a given wiki. David pointed out that we obviously aren’t going to get many results (fewer than three in fact) in the language of the wiki we are on, so we might as well ignore the language of the wiki we are on and look for results elsewhere—that is, worry less about precision (i.e., about avoiding “silly” results), and focus more on recall (i.e., offering some sort of result, because no result is not very helpful).

Of course, if about 70% of poor-performing queries on French Wikipedia are in French, then most of our language detection guesses will be wrong if French is not among the possible answers. However, we have a better chance of finding something relevant on another wiki if we are looking.

Returning Multiple Languages

Another approach to improve recall is to give more answers. At a ridiculous extreme, we could return every known language as an answer for every query. The right answer would be in there, but it wouldn’t be very helpful. However, returning two or three languages (i.e., including the language detector’s second or third choice option if provided) is more manageable.

TextCat, for example, returns the best-matching language, plus any alternatives that are, by default, within 5% of the best match. Returning two results increases the chances that at least one of them is right.
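The "within 5% of the best match" rule can be sketched roughly as follows. This is not TextCat's actual code. The function name, the `margin` and `max_results` parameters, and the example scores are all hypothetical, and the sketch assumes scores are distances where lower is better.

```python
def candidates_within_margin(scores, margin=0.05, max_results=None):
    """Return language codes whose score is within `margin` of the best.

    `scores` maps language code -> distance (lower is better; an assumption).
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1])
    best = ranked[0][1]
    cutoff = best * (1 + margin)
    kept = [lang for lang, score in ranked if score <= cutoff]
    return kept if max_results is None else kept[:max_results]


# Hypothetical scores for a single query.
scores = {"fr": 1000, "es": 1030, "pt": 1049, "en": 1200}
print(candidates_within_margin(scores))                 # ['fr', 'es', 'pt']
print(candidates_within_margin(scores, max_results=2))  # ['fr', 'es']
```

Capping the list with `max_results` corresponds to allowing only a second (or third) choice rather than every close alternative.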

Returning multiple languages is terrible for precision, of course. If we give two languages as results for every query, then many more may be “silly”, and at least half of them will be wrong—and so precision would max out at 50%. In practice, not every query will get multiple results (despite the difficulties overall, some queries are actually pretty easy to get right), so precision higher than 50% is possible.

Another option is to tune individual languages based on position in the returned list. On the German Wikipedia data, for example, detecting Polish may be 85% accurate when Polish is the first option, but only 3% accurate when it is the second option, in which case it makes sense to only accept Polish as an answer when it is the first option returned. English, on the other hand, may be 85% accurate in the first position, but still 80% accurate in the third position, in which case it makes sense to consider English results even in the third position. This is of course more complex to optimize for and to implement than just returning a fixed number of results.
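A minimal sketch of per-language position thresholds, using the German Wikipedia example above. The function name and the threshold values are hypothetical and are not taken from the tuned results.

```python
def apply_position_thresholds(ranked, max_position):
    """Keep a language only if its 1-based rank is within its threshold.

    Languages absent from `max_position` are never accepted.
    """
    return [
        lang
        for position, lang in enumerate(ranked, start=1)
        if position <= max_position.get(lang, 0)
    ]


# Hypothetical thresholds for German Wikipedia, following the example above:
# Polish only in first place, English in any of the first three places.
thresholds = {"de": 3, "pl": 1, "en": 3}

print(apply_position_thresholds(["pl", "de"], thresholds))        # ['pl', 'de']
print(apply_position_thresholds(["de", "pl", "en"], thresholds))  # ['de', 'en']
```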

Note that with multiple lang-ID results per query, recall and precision are no longer tightly coupled. It’s possible for any given query to get no answer (false negative),^[4] one right answer (true positive), or one wrong answer (false positive and false negative), or two wrong answers (two false positives and one false negative), or one right and one wrong (true positive and false positive).
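The bookkeeping for those cases can be sketched as below. The function name is hypothetical, and the sketch assumes each query has exactly one annotated language and that the returned list contains no duplicates.

```python
def count_outcomes(truth, predicted):
    """Return (true_pos, false_pos, false_neg) for one query.

    `truth` is the annotated language; `predicted` is the list of
    languages returned by the detector (possibly empty).
    """
    tp = 1 if truth in predicted else 0
    fp = sum(1 for lang in predicted if lang != truth)
    fn = 1 - tp  # the correct language was not among the answers
    return tp, fp, fn


examples = [
    ("no answer", "pl", []),
    ("one right", "pl", ["pl"]),
    ("one wrong", "pl", ["en"]),
    ("two wrong", "pl", ["en", "de"]),
    ("one right, one wrong", "pl", ["pl", "en"]),
]
for label, truth, predicted in examples:
    print(label, count_outcomes(truth, predicted))
```

Summing these tuples over all queries gives the counts for overall precision, tp / (tp + fp), and recall, tp / (tp + fn). With at most one answer per query the two denominators differ only by the unanswered queries, which is why the two measures are tightly coupled in that case.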

Combining Both Approaches

Of course, it is possible to combine both approaches—ignoring the language of the wiki and returning multiple results.

The combination also allows yet another permutation: include but ignore the language of the wiki while allowing multiple results.

For example, on the German Wikipedia, as in the example above, we might have decided that Polish is only allowed as the first result, while English is allowed as the first or second result.

If we include German among the languages being detected, we have two possibilities when both German and Polish are detected: either Polish is first, or German is first. In this case, we might say that if the query looks more Polish than German, we will treat it as Polish. But if it looks more German than Polish, we will ignore the Polish result, even though we are ignoring all German results, too. On the other hand, we’ll consider English whether it comes first or second to German.

Another possible outcome is that the matching on German is so good that no other languages are considered reasonable alternatives (i.e., none score within 5% of German), resulting in no language detection for a given query.
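The "include but ignore" behavior can be sketched as follows. The function name and the limits are hypothetical. The sketch assumes that positions are counted with the home language still in the list, which is how I read the German/Polish example above.

```python
def detect_but_ignore_home(ranked, max_position, home):
    """Apply per-language position limits, then drop the home language.

    Positions are counted while the home language is still in the list
    (an assumption based on the German/Polish example).
    """
    accepted = [
        lang
        for position, lang in enumerate(ranked, start=1)
        if position <= max_position.get(lang, 0)
    ]
    return [lang for lang in accepted if lang != home]


# Hypothetical limits for German Wikipedia: Polish only in first place,
# English in first or second place.
limits = {"de": 2, "pl": 1, "en": 2}

print(detect_but_ignore_home(["pl", "de"], limits, home="de"))  # ['pl']
print(detect_but_ignore_home(["de", "pl"], limits, home="de"))  # []
print(detect_but_ignore_home(["de", "en"], limits, home="de"))  # ['en']
print(detect_but_ignore_home(["de"], limits, home="de"))        # []
```

The second and fourth calls show the "no language detected" outcomes: either the only alternative falls outside its allowed position, or German is the only match.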

Tangent on User Interface

I think the way the language detection information is used and presented affects and is affected by whether we favor recall or precision. If we have high-precision results and only one language detected, it may make more sense to just provide the cross-wiki results right on the wiki page.

On the other hand, if we are maximizing recall and providing, say, up to three language detection results per query, it might make more sense to only provide a link that says something like, “Would you like to see results on Spanish Wikipedia, French Wikipedia, or Portuguese Wikipedia?”

“Silly” language detection results may be more tolerable to users when they only result in an extra link or two on the results page for poor-performing queries.

Some users may find “silly” results confusing in all cases, and some may never mind them. We’ll definitely need to consult the user community and try out various options as A/B tests before coming to a final decision on how best to show results.

Overview of Options

In the analysis that follows, I’m going to consider 7 (!) options for each wiki (though it may turn out that there isn’t much difference among them):

1) Ignore Home Language: Ignore the language of the wiki we are on when doing language detection to increase the chances of finding something on another wiki. We will only allow one result per query. We will note but not consider as errors misidentification on queries in the language of the wiki. (e.g., on German Wikipedia, a query in German identified as English isn’t counted as an error, though we will note how often it happens.)

2) Allow Multiple Lang-ID Results: Even though with multiple results, all but one of them must be wrong, the chances of getting that one correct result increase when we allow more results. Precision will take a hit, but we’ll pay more attention to recall using F2. This includes the language of the wiki.

3) Allow Multiple Lang-ID Results, with per-Language Thresholds: Allow multiple lang-ID results per query, but limit whether languages are considered based on their position within the results. (e.g., Polish counts if it is the best result, but not the second best, while English counts in either place.) This includes the language of the wiki.

4) Allow Multiple Lang-ID Results, Ignoring Home Language: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection. We will note but not consider as errors misidentification on queries in the language of the wiki.

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, but do not consider the language of the wiki during detection, and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language: Allow multiple lang-ID results per query, including the language of the wiki; however, ignore results in the language of the wiki for the purposes of calculating recall and precision, which may result in “no result” or other languages being pushed down a position in the ranked results. We will note but not consider as errors misidentification on queries in the language of the wiki.

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds: Allow multiple lang-ID results per query, including the language of the wiki—however, ignore results in the language of the wiki for the purposes of calculating recall and precision—and limit whether languages are considered based on their position within the results. We will note but not consider as errors misidentification on queries in the language of the wiki.
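One way to see how the seven options relate is as combinations of three settings: how the home language is handled, whether more than one result is allowed, and whether per-language position thresholds are used. The sketch below encodes that reading. The class and field names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class HomeLanguage(Enum):
    INCLUDED = "included"                # detected and reported
    EXCLUDED = "excluded"                # not among the detection candidates
    DETECTED_BUT_IGNORED = "detected but ignored"


@dataclass(frozen=True)
class Option:
    number: int
    home: HomeLanguage
    multiple_results: bool
    per_language_thresholds: bool


OPTIONS = [
    Option(1, HomeLanguage.EXCLUDED, False, False),
    Option(2, HomeLanguage.INCLUDED, True, False),
    Option(3, HomeLanguage.INCLUDED, True, True),
    Option(4, HomeLanguage.EXCLUDED, True, False),
    Option(5, HomeLanguage.EXCLUDED, True, True),
    Option(6, HomeLanguage.DETECTED_BUT_IGNORED, True, False),
    Option(7, HomeLanguage.DETECTED_BUT_IGNORED, True, True),
]

for option in OPTIONS:
    print(option.number, option.home.value,
          option.multiple_results, option.per_language_thresholds)
```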

Notes

Apples & Oranges

In the analysis summaries, the results are divided into those that include the “home” language of the wiki, and those that don’t. These can’t necessarily be compared directly to each other. The home language is always the largest category of queries, and removing it takes away the best source of correct language identification, and can drastically change the interactions among the remaining languages, especially if the second most common language is not as dominant over the remainder as the home language is over all the others. Another factor, for these samples, is that the non-home language sample size can be smaller than I’d like (esp. for Spanish, which is ridiculously small!). The samples were taken to reach a target of 500+ annotated queries, but not with any minimum non-home language sample size.

Coverage vs Recall

Another way of looking at the problem is as one of “coverage” rather than recall. In this sense, coverage indicates the number of queries that return some language that can be used for cross-wiki searching, even if it isn’t the correct one. This could be called a desperate attempt to return anything at all. In the options above, (1), (4), and (5) have at least 99% coverage: some result that is not the home wiki language is returned for almost all queries. If coverage rather than recall is important, we can choose the option from among (1), (4), and (5) with the best F2 score.
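Coverage in this sense can be computed without any annotations, since it only asks whether something usable came back. A minimal sketch follows, with a hypothetical function name and made-up example data, expressing coverage as a fraction of queries rather than a raw count.

```python
def coverage(results_per_query, home):
    """Fraction of queries with at least one non-home language returned."""
    if not results_per_query:
        return 0.0
    covered = sum(
        1 for langs in results_per_query if any(lang != home for lang in langs)
    )
    return covered / len(results_per_query)


# Hypothetical detector output for four queries on French Wikipedia.
results = [["en"], ["fr", "es"], ["fr"], []]
print(f"{coverage(results, home='fr'):.0%}")  # 50%
```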

“Extra” Language Models

For the present analysis I’m not including the “extra” language models that I included in the precision-favoring analysis. These were high-accuracy models for queries that were found in the larger sample for each wiki, but not in the hand-coded sample used for optimization. They could and should be included (after a quick check that they don’t cause unexpected problems) if any recommendations are taken from this analysis.

Results

Compare to precision-favoring results.

French

0) Precision-Favoring Results—F2: 89.1%

1) Ignore Home Language—F2: 84.9%

The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    84.7%   84.8%   84.9%   85.0%   84.6%  213     181     33
    English    91.0%   88.8%   86.6%   85.2%   92.6%  88      75      6
     Arabic   100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese    70.3%   72.0%   73.8%   75.0%   69.2%  12      9       4
     German    52.6%   62.5%   76.9%   90.9%   47.6%  11      10      11
    Spanish    54.1%   61.5%   71.4%   80.0%   50.0%  10      8       8
    Russian   100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese   100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch     0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Corsican     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish    38.5%   50.0%   71.4%  100.0%   33.3%  2       2       4
   Armenian   100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai   100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (221), Spanish (100), German (74), Portuguese (68), Polish (5)

2) Allow Multiple Lang-ID Results—F2: 89.1%

The best language set, with a threshold of 1 language, is French, English, Arabic, Russian, Chinese, Thai, Armenian. (fr, en, ar, ru, zh, th, hy)

This is the same result as (0), because the optimal threshold is 1.

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     89.0%   89.1%   89.1%   89.1%   89.0%  681     607     75
     French     94.8%   95.1%   95.5%   95.7%   94.5%  468     448     26
    English     67.0%   74.9%   84.9%   93.2%   62.6%  88      82      49
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  12      0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  11      0       0
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  10      0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Corsican      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
   Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (20)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.7%

The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    4        79.3%   84.2%   89.7%   93.8%   76.3%  681     639     198
     French    4        94.1%   94.7%   95.3%   95.7%   93.7%  468     448     30
    English    3        61.4%   70.1%   81.8%   92.0%   56.6%  88      81      62
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese    2        37.2%   47.8%   67.1%   91.7%   32.4%  12      11      23
     German    1        43.7%   52.9%   67.2%   81.8%   39.1%  11      9       14
    Spanish    2        20.0%   28.6%   50.0%  100.0%   16.7%  10      10      50
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    1        93.8%   85.7%   78.9%   75.0%  100.0%  4       3       0
      Dutch    1        18.2%   25.0%   40.0%   66.7%   15.4%  3       2       11
   Corsican    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish    1        23.8%   33.3%   55.6%  100.0%   20.0%  2       2       8
   Armenian    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (50), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 84.9%

The best language set, with a threshold of 1 language, is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy)

Note that this is the same as (1) above since the threshold was just one language. Allowing 3 languages offered the same F2 score (to one decimal place), with moderately higher recall and considerably lower precision (the unbalanced trade-off being the nature of the weighted F2 measure).

  thresh  f0.5    f1      f2      recall  prec    total   hits    misses
  TOTAL (213)
  1        84.7%   84.8%   84.9%   85.0%   84.6%  213     181     33
  2        72.1%   77.7%   84.2%   89.2%   68.8%  213     190     86
  3        69.6%   76.5%   84.9%   91.5%   65.7%  213     195     102

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 88.1%

The best language set is English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Polish, Thai, Armenian. (en, ar, pt, de, es, ru, zh, pl, th, hy). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    3        80.2%   84.0%   88.1%   91.1%   77.9%  213     194     55
    English    3        84.7%   88.4%   92.5%   95.5%   82.4%  88      84      18
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese    2        59.8%   68.7%   80.9%   91.7%   55.0%  12      11      9
     German    1        52.6%   62.5%   76.9%   90.9%   47.6%  11      10      11
    Spanish    2        49.0%   60.6%   79.4%  100.0%   43.5%  10      10      13
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch    1         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Corsican    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish    1        38.5%   50.0%   71.4%  100.0%   33.3%  2       2       4
   Armenian    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The 468 queries that are actually French are tagged as (i.e., potentially “silly” results): English (357), Spanish (184), Portuguese (161), German (74), Polish (5)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 83.7%

The best language set, with a threshold of 2 languages, is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, th, hy)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     70.6%   76.6%   83.7%   89.2%   67.1%  213     190     93
    English     86.5%   88.5%   90.6%   92.0%   85.3%  88      81      14
     Arabic     98.8%   99.2%   99.7%  100.0%   98.5%  66      66      1
 Portuguese     62.5%   71.0%   82.1%   91.7%   57.9%  12      11      8
     German     26.1%   36.1%   58.5%  100.0%   22.0%  11      11      39
    Spanish     51.0%   62.5%   80.6%  100.0%   45.5%  10      10      12
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  4       4       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Corsican      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
   Armenian    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       19
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (45), Spanish (38), German (21), Portuguese (18)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 87.8%

The best language set is French, English, Arabic, Portuguese, German, Spanish, Russian, Chinese, Dutch, Polish, Thai, Armenian. (fr, en, ar, pt, de, es, ru, zh, nl, pl, th, hy). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    4        81.3%   84.4%   87.8%   90.1%   79.3%  213     192     50
    English    4        86.9%   89.1%   91.5%   93.2%   85.4%  88      82      14
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  66      66      0
 Portuguese    2        65.5%   73.3%   83.3%   91.7%   61.1%  12      11      7
     German    1        57.0%   64.3%   73.8%   81.8%   52.9%  11      9       8
    Spanish    2        51.0%   62.5%   80.6%  100.0%   45.5%  10      10      12
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  5       5       0
    Chinese    1        93.8%   85.7%   78.9%   75.0%  100.0%  4       3       0
      Dutch    1        32.3%   40.0%   52.6%   66.7%   28.6%  3       2       5
   Corsican    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Italian    -         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish    1        38.5%   50.0%   71.4%  100.0%   33.3%  2       2       4
   Armenian    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     Breton    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Hungarian    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
  Icelandic    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swahili    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
       Thai    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually French are tagged as (i.e., potentially “silly” results): English (56), Spanish (38), Portuguese (16), Dutch (6), German (6), Polish (4)

Summary

Configurations that include reporting French (n = 681 samples), by F2:

89.7%	3) Allow Multiple Lang-ID Results, with per-Language Thresholds
89.1%	0) Precision-Favoring Results
89.1%	2) Allow Multiple Lang-ID Results

Configurations that ignore French (n = 213), by F2:

88.1%	5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds
87.8%	7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds
84.9%	1) Ignore Home Language
84.9%	4) Allow Multiple Lang-ID Results, Ignoring Home Language
83.7%	6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language

Spanish

0) Precision-Favoring Results—F2: 95.6%

1) Ignore Home Language—F2: 79.5%

The best language set is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     79.5%   79.5%   79.5%   79.5%   79.5%  44      35      9
    English     86.1%   89.9%   93.9%   96.9%   83.8%  32      31      6
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese     29.4%   40.0%   62.5%  100.0%   25.0%  1       1       3
              f0.5    f1      f2      recall  prec    total   hits    misses

The 476 queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (440), English (36)

2) Allow Multiple Lang-ID Results—F2: 96.1%

The best language set, with a threshold of 2 languages, is Spanish, English, Russian, Chinese. (es, en, ru, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     91.3%   93.7%   96.1%   97.9%   89.8%  520     509     58
    Spanish     97.8%   98.4%   99.1%   99.6%   97.3%  476     474     13
    English     47.1%   58.7%   78.0%  100.0%   41.6%  32      32      45
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 96.9%

The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    2        94.8%   95.8%   96.9%   97.7%   94.1%  520     508     32
    Spanish    2        97.8%   98.4%   99.1%   99.6%   97.3%  476     474     13
    English    1        66.8%   75.6%   87.1%   96.9%   62.0%  32      31      19
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (13).

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.5%

The best language set, with a threshold of 1 language, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

This is the same as (1).

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 79.5%

The best language set, with a threshold of 1 for every result, is English, Russian, Chinese, Portuguese. (en, ru, zh, pt)

This is also the same as (1).

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    1        79.5%   79.5%   79.5%   79.5%   79.5%  44      35      9
    English    1        86.1%   89.9%   93.9%   96.9%   83.8%  32      31      6
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese    1        29.4%   40.0%   62.5%  100.0%   25.0%  1       1       3
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.3%

The best language set, with a threshold of 1 language, is Spanish, English, Russian, Chinese, Portuguese. (es, en, ru, zh, pt)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     77.3%   77.3%   77.3%   77.3%   77.3%  44      34      10
    English     85.2%   88.2%   91.5%   93.8%   83.3%  32      30      6
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese     38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       2
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): Portuguese (46), English (9)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.3%

The best language set is Spanish, English, Russian, Chinese. (es, en, ru, zh). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    2        82.5%   81.4%   80.3%   79.5%   83.3%  44      35      7
    English    2        85.1%   90.1%   95.8%  100.0%   82.1%  32      32      7
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Catalan    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     German    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Guarani    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Portuguese    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Spanish are tagged as (i.e., potentially “silly” results): English (38)

Summary

Configurations that include reporting Spanish (n=520), by F2:

96.9%	3) Allow Multiple Lang-ID Results, with per-Language Thresholds
96.1%	2) Allow Multiple Lang-ID Results
95.6%	0) Precision-Favoring Results

Configurations that ignore Spanish (n=44), by F2:

[Note that this sample is very small, probably too small to draw any strong conclusions from.]

80.3%	7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds
79.5%	1) Ignore Home Language
79.5%	4) Allow Multiple Lang-ID Results, Ignoring Home Language
79.5%	5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds
77.3%	6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language

Italian

0) Precision-Favoring Results—F2: 92.2%

1) Ignore Home Language—F2: 79.5%

The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese. (en, es, ru, ro, pt, ar, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     79.5%   79.5%   79.5%   79.5%   79.5%  146     116     30
    English     89.1%   90.1%   91.1%   91.7%   88.5%  109     100     13
    Spanish     51.5%   60.9%   74.5%   87.5%   46.7%  8       7       8
     German      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese     21.3%   28.6%   43.5%   66.7%   18.2%  3       2       9
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (196), English (114), Portuguese (94)

2) Allow Multiple Lang-ID Results—F2: 92.2%

The best language set, with a threshold of 1 language, is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh)

This is the same result as (0), because the optimal threshold is 1.

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     92.2%   92.2%   92.2%   92.2%   92.2%  550     507     43
    Italian     95.4%   96.7%   98.1%   99.0%   94.6%  404     400     23
    English     84.9%   87.3%   89.9%   91.7%   83.3%  109     100     20
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
     German      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (27)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 92.2%

The best language set is Italian, English, Russian, Arabic, Chinese. (it, en, ru, ar, zh). Thresholds are shown in the table below.

The same F2 score can be had with the thresh for English set to 1, which is then the same as (2) and (0).

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    2        88.6%   90.3%   92.2%   93.5%   87.4%  550     514     74
    Italian    1        95.4%   96.7%   98.1%   99.0%   94.6%  404     400     23
    English    2        72.2%   80.1%   90.1%   98.2%   67.7%  109     107     51
    Spanish    -         0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
     German    -         0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Romanian    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (4).

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 80.5%

The best language set, with a threshold of 2 languages, is English, Spanish, Russian, Romanian, Arabic, Chinese. (en, es, ru, ro, ar, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     74.0%   77.1%   80.5%   82.9%   72.0%  146     121     47
    English     87.0%   90.6%   94.5%   97.2%   84.8%  109     106     19
    Spanish     26.3%   36.4%   58.8%  100.0%   22.2%  8       8       28
     German      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (364), English (245)

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 82.3%

The best language set is English, Spanish, Russian, Romanian, Portuguese, Arabic, Chinese. (en, es, ru, ro, pt, ar, zh). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    3        78.8%   80.5%   82.3%   83.6%   77.7%  146     122     35
    English    3        87.6%   91.0%   94.6%   97.2%   85.5%  109     106     18
    Spanish    1        51.5%   60.9%   74.5%   87.5%   46.7%  8       7       8
     German    1         0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese    1        21.3%   28.6%   43.5%   66.7%   18.2%  3       2       9
   Romanian    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The 404 queries that are actually Italian are tagged as (i.e., potentially “silly” results): English (222), Spanish (196), Portuguese (94)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 77.1%

The best language set, with a threshold of 4 languages, is Italian, English, Russian, Arabic, Chinese, Spanish, Portuguese. (it, en, ru, ar, zh, es, pt)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     60.4%   67.8%   77.1%   84.9%   56.4%  146     124     96
    English     88.8%   91.8%   95.0%   97.2%   86.9%  109     106     16
    Spanish     29.4%   40.0%   62.5%  100.0%   25.0%  8       8       24
     German      0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese     14.0%   20.7%   39.5%  100.0%   11.5%  3       3       23
   Romanian      0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       33
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Spanish (69), Portuguese (45), English (21)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 82.5%

The best language set is Italian, English, Spanish, Russian, Portuguese, Arabic, Chinese. (it, en, es, ru, pt, ar, zh). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    4        79.6%   81.1%   82.5%   83.6%   78.7%  146     122     33
    English    4        88.8%   91.8%   95.0%   97.2%   86.9%  109     106     16
    Spanish    1        62.5%   66.7%   71.4%   75.0%   60.0%  8       6       4
     German    -         0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
     Arabic    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
 Portuguese    2        22.4%   31.6%   53.6%  100.0%   18.8%  3       3       13
   Romanian    -         0.0%    0.0%    0.0%    0.0%    0.0%  3       0       0
    Russian    1       100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
    Chinese    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
      Czech    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    1         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually Italian are tagged as (i.e., potentially “silly” results): Portuguese (32), English (21), Spanish (20)

Summary

Configurations that include reporting Italian (n=550), by F2:

92.2%	0) Precision-Favoring Results
92.2%	2) Allow Multiple Lang-ID Results
92.2%	3) Allow Multiple Lang-ID Results, with per-Language Thresholds

Configurations that ignore Italian (n=146), by F2:

82.5%	7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds
82.3%	5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds
80.5%	4) Allow Multiple Lang-ID Results, Ignoring Home Language
79.5%	1) Ignore Home Language
77.1%	6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language

German

0) Precision-Favoring Results—F2: 88.1%

1) Ignore Home Language—F2: 78.1%

The best language set is English, Italian, Spanish, Chinese. (en, it, es, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    78.1%   78.1%   78.1%   78.1%   78.1%  160     125     35
    English    92.1%   91.4%   90.8%   90.3%   92.6%  124     112     9
    Italian    39.5%   48.0%   61.2%   75.0%   35.3%  8       6       11
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish    36.5%   46.7%   64.8%   87.5%   31.8%  8       7       15
     French     0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (278), Italian (72), Spanish (10)

2) Allow Multiple Lang-ID Results—F2: 88.3%

The best language set, with a threshold of 1 language, is German, English, Chinese. (de, en, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     88.3%   88.3%   88.3%   88.3%   88.3%  520     459     61
     German     94.0%   95.0%   96.0%   96.7%   93.3%  360     348     25
    English     77.9%   81.9%   86.3%   89.5%   75.5%  124     111     36
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
     French      0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (12)

3) Allow Multiple Lang-ID Results, with per-Language Thresholds—F2: 89.1%

The best language set is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    4        78.1%   83.2%   89.1%   93.5%   75.0%  520     486     162
     German    4        87.5%   91.4%   95.6%   98.6%   85.1%  360     355     62
    English    3        69.6%   77.4%   87.1%   95.2%   65.2%  124     118     63
    Italian    1        26.0%   33.3%   46.3%   62.5%   22.7%  8       5       17
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish    1        32.4%   42.4%   61.4%   87.5%   28.0%  8       7       18
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese    1         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese    1        38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)

4) Allow Multiple Lang-ID Results, Ignoring Home Language—F2: 79.1%

The best language set, with a threshold of 3 languages, is English, Spanish, Chinese. (en, es, zh)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     71.8%   75.3%   79.1%   81.9%   69.7%  160     131     57
    English     89.2%   92.4%   95.9%   98.4%   87.1%  124     122     18
    Italian      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish     18.2%   25.9%   44.9%   87.5%   15.2%  8       7       39
     French      0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (356), Spanish (114)

5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds—F2: 83.5%

The best language set is English, Italian, Spanish, Chinese. (en, it, es, zh). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    4        77.8%   80.6%   83.5%   85.6%   76.1%  160     137     43
    English    3        89.7%   92.8%   96.1%   98.4%   87.8%  124     122     17
    Italian    1        39.5%   48.0%   61.2%   75.0%   35.3%  8       6       11
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish    1        36.5%   46.7%   64.8%   87.5%   31.8%  8       7       15
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese    4       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
      Dutch    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The 360 queries that are actually German are tagged as (i.e., potentially “silly” results): English (345), Italian (72), Spanish (10)

6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language—F2: 74.7%

The best language set, with a threshold of 2 languages, is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi)

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     58.1%   65.3%   74.7%   82.5%   54.1%  160     132     112
    English     91.8%   92.4%   93.1%   93.5%   91.3%  124     116     11
    Italian     27.0%   37.2%   59.7%  100.0%   22.9%  8       8       27
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish     29.2%   38.9%   58.3%   87.5%   25.0%  8       7       21
     French      0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese     29.4%   40.0%   62.5%  100.0%   25.0%  1       1       3
     German      0.0%    0.0%    0.0%    0.0%    0.0%  0       0       50
              f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (45), Italian (14), Spanish (6), Vietnamese (1)

7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds—F2: 80.8%

The best language set is German, English, Italian, Spanish, Chinese, Vietnamese. (de, en, it, es, zh, vi). Thresholds are shown in the table below.

               thresh  f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    3        77.6%   79.2%   80.8%   81.9%   76.6%  160     131     40
    English    3        90.5%   92.2%   93.9%   95.2%   89.4%  124     118     14
    Italian    1        34.7%   41.7%   52.1%   62.5%   31.2%  8       5       11
      Latin    -         0.0%    0.0%    0.0%    0.0%    0.0%  8       0       0
    Spanish    1        39.8%   50.0%   67.3%   87.5%   35.0%  8       7       13
     French    -         0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
    Chinese    1         0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
      Dutch    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Polish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Swedish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Turkish    -         0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese    1        38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
               thresh  f0.5    f1      f2      recall  prec    total   hits    misses

The remaining queries that are actually German are tagged as (i.e., potentially “silly” results): English (49), Italian (6), Spanish (5)

Summary

Configurations that include reporting German (n=520), by F2:

89.1%	3) Allow Multiple Lang-ID Results, with per-Language Thresholds
88.3%	2) Allow Multiple Lang-ID Results
88.1%	0) Precision-Favoring Results

Configurations that ignore German (n=160), by F2:

83.5%	5) Allow Multiple Lang-ID Results, Ignoring Home Language, with per-Language Thresholds
80.8%	7) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language, with per-Language Thresholds
79.1%	4) Allow Multiple Lang-ID Results, Ignoring Home Language
78.1%	1) Ignore Home Language
74.7%	6) Allow Multiple Lang-ID Results, Detecting But Ignoring Home Language

Discussion

Among configurations that include reporting the home language (0, 2, 3), we see the expected increase in F2 score over the baseline (0), from allowing multiple languages to be reported (2), or optimizing how far down the list to consider each language independently (3). Sometimes there is no difference between the options, but when there is, the order is always the same, and the improvement is minor (< 1.5%).
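The F scores compared here can be recomputed from the count columns of the tables above. The sketch below is illustrative only: the function names are hypothetical, and it assumes (as the table values suggest) that recall is hits / total and precision is hits / (hits + misses), i.e., that "misses" counts false positives.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_row(total: int, hits: int, misses: int) -> dict:
    """Recompute the percentage columns of one table row from its counts.

    Assumption inferred from the tables, not stated by the author:
    recall = hits / total, precision = hits / (hits + misses).
    """
    recall = hits / total if total else 0.0
    reported = hits + misses
    precision = hits / reported if reported else 0.0
    return {
        "f0.5": round(100 * f_beta(precision, recall, 0.5), 1),
        "f1": round(100 * f_beta(precision, recall, 1.0), 1),
        "f2": round(100 * f_beta(precision, recall, 2.0), 1),
        "recall": round(100 * recall, 1),
        "prec": round(100 * precision, 1),
    }


# TOTAL row of Italian option (3): total=550, hits=514, misses=74
italian_total = score_row(550, 514, 74)
print(italian_total)
```

With more than one result allowed per query, precision and recall are no longer forced to be equal, which is why F0.5, F1, and F2 diverge in the multi-result tables.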

Option (2) usually results in either maintaining the threshold of one language, or increasing it a bit to 2 or 3.

Option (3) usually results in increasing the threshold for more strongly represented languages (the home language and the second most common), while keeping the others at 1. Thus, for the less frequent languages, it’s best to only accept that answer if it’s the best guess, while for the more common languages, a less confident guess is still a good one.
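One way to picture per-language thresholds is as a filter over the ranked candidate list. The following is a minimal sketch, not the actual TextCat code: the function name and the ranked lists are hypothetical, and it assumes that every candidate ranked within its own language's threshold is reported. The thresholds are the ones from the German option (3) table.

```python
def accept_languages(ranked, thresholds):
    """Keep candidates whose 1-based rank is within their language's threshold.

    ranked: language codes ordered from best guess to worst (hypothetical input).
    thresholds: language code -> deepest rank at which that language is accepted.
    Languages with no threshold (shown as "-" in the tables) are never reported.
    """
    return [
        lang
        for rank, lang in enumerate(ranked, start=1)
        if rank <= thresholds.get(lang, 0)
    ]


# Thresholds from the German option (3) table.
DE_THRESHOLDS = {"de": 4, "en": 3, "it": 1, "es": 1, "zh": 1, "vi": 1}

# Hypothetical ranked outputs for three queries.
print(accept_languages(["it", "de", "en"], DE_THRESHOLDS))  # all three are within threshold
print(accept_languages(["en", "es", "de"], DE_THRESHOLDS))  # Spanish is only accepted at rank 1
print(accept_languages(["en", "vi", "it"], DE_THRESHOLDS))  # only the top guess survives
```

Setting every threshold to the same value gives the single-threshold behavior of option (2).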

Among configurations that ignore the home language (1, 4, 5, 6, 7), the two that allow for per-language thresholds (5 and 7) are consistently the best, which makes sense as they can be more finely tuned (or perhaps overfitted!). Among the others, (6) is consistently the worst. There’s a consistent partial ordering: (5) and (7) are each the same as or better than (4), which is the same as or better than (1), which is better than (6).

The overall span in F2 scores for this group isn’t huge, but is sometimes moderate (3-9%). F2 scores for these configs are consistently worse than those including the home language (0, 2, 3, above), but they are apples and oranges (see Notes above).
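The partial ordering and the span described above can be checked against the summary lists. This sketch covers only the Spanish, Italian, and German summaries immediately above; the variable and function names are hypothetical.

```python
# F2 scores (percent) for the configurations that ignore the home language,
# copied from the Spanish, Italian, and German summaries above.
F2_IGNORING_HOME = {
    "Spanish": {1: 79.5, 4: 79.5, 5: 79.5, 6: 77.3, 7: 80.3},
    "Italian": {1: 79.5, 4: 80.5, 5: 82.3, 6: 77.1, 7: 82.5},
    "German": {1: 78.1, 4: 79.1, 5: 83.5, 6: 74.7, 7: 80.8},
}


def ordering_holds(f2):
    """True if (5) and (7) >= (4) >= (1) > (6) for one wiki's scores."""
    return (
        f2[5] >= f2[4]
        and f2[7] >= f2[4]
        and f2[4] >= f2[1]
        and f2[1] > f2[6]
    )


def spread(f2):
    """Difference between the best and worst F2 score, in percentage points."""
    return round(max(f2.values()) - min(f2.values()), 1)


for wiki, scores in F2_IGNORING_HOME.items():
    print(wiki, ordering_holds(scores), spread(scores))
```

The ordering holds for all three wikis here, and the spreads fall inside the 3-9% range.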

Conclusions

For raw F2 score, including the home language gives the highest score, but can’t be directly compared to scores ignoring the home language. Allowing for multiple results is the best way to increase overall F2 score. Per-language threshold tweaking is often slightly better, but may not be worth the complexity.

In terms of coverage (see Notes above), we get nearly full coverage (some non-home language alternative is offered for every query) with options (1), (4), and (5). Option (4), allowing multiple language results and ignoring the home language, is the best middle ground. It gives more accurate results than (1), while being less complex than (5).

I prefer per-wiki tuning, but it seems like a reasonable generic recommendation for improving recall and/or coverage would be to allow a second language result from TextCat, and if you prefer coverage over accuracy, ignore the home language of the wiki.

Footnotes

↑ It’s hard to know what to call these kinds of errors. To a human, some of the incorrect results can seem ridiculous. Mistaking a one-word query in Spanish for Portuguese is not so much of a “silly” mistake. Mistaking a string of Chinese characters followed by what looks like a serial number with Latin characters and numbers for German is. I can explain why it happened, based on language model sizes, rarity of the Chinese characters used, the relative frequencies of individual characters in different writing systems, and the presence of three characters, “aus” which are very characteristically German out of context—but it still looks ridiculous to a normal human user. We’ll go with “silly” for now.
↑ Overall, most of the poorly performing queries are not in a language. There’s a lot of junk (e.g., “fhdjskhfdsjkhfjdks”), and a lot of names (of people, places, products, books, movies, songs, bands, etc., etc.) that aren’t really in any language. (Names that are made up of common words of a language, such as “The Rolling Stones”, “A Hundred Years of Solitude”, “Fly by Night”, or “So I Married an Axe Murderer”, are counted as being in a language.) There are also a few transliterations, particularly of films (e.g., Naadodigal, a transliteration of Tamil நாடோடிகள், isn’t in Tamil, but isn’t in any particular language that uses the Latin alphabet, either), serial numbers and product codes, heavily ambiguous words (e.g., “a”), etc.
↑ It is necessary to restrict the possible languages to be reported by the language detection to the “plausible” ones. For example, an early language detector we tested reported that 38% of ~100K queries on English Wikipedia were in Romanian. I’m sure at least one of them was, but in a test set of over 1,000 queries, I didn’t see any Romanian. Clearly, leaving Romanian out of the mix improves overall results. Other cases are less clear-cut, but a language detector that gets many many more false positives than true positives is hurting more than it is helping, so it is best left out. The right blend of languages depends on the wiki in question.
↑ Recall and precision don’t take into account true negatives. In many information retrieval contexts, there are lots of results that are correctly not returned. So many, in fact, that it’s often hard to get below 95% accuracy on measures that include them, making those measures basically useless. In the context of language detection, for example, you can start with a list of a hundred languages, and be sure in every case that at least 99 of them are not the language of the query. If you return three lang-ID results, even if they are all wrong, there are 97 others that you correctly did not return (i.e., true negatives), giving a true negative rate (or “specificity”) of 97%. (See the definitions of various accuracy measures for more.) For search engine accuracy metrics, the imbalance is even more lopsided. For any query there are thousands, millions, or even billions of irrelevant documents that are correctly not returned in the top 10 results, every time.

[1] It’s hard to know what to call these kinds of errors. To a human, some of the incorrect results can seem ridiculous. Mistaking a one-word query in Spanish for Portuguese is not so much of a “silly” mistake. Mistaking a string of Chinese characters followed by what looks like a serial number (Latin characters and numbers) for German is. I can explain why it happened, based on language model sizes, rarity of the Chinese characters used, the relative frequencies of individual characters in different writing systems, and the presence of the three characters “aus”, which are very characteristically German out of context, but it still looks ridiculous to a normal human user. We’ll go with “silly” for now.

[2] Overall, most of the poorly performing queries are not in any language. There’s a lot of junk (e.g., “fhdjskhfdsjkhfjdks”), and a lot of names (of people, places, products, books, movies, songs, bands, etc., etc.) that aren’t really in any language. (Names that are made up of common words of a language, such as “The Rolling Stones”, “A Hundred Years of Solitude”, “Fly by Night”, or “So I Married an Axe Murderer”, are counted as being in a language.) There are also a few transliterations, particularly of films (e.g., Naadodigal, a transliteration of Tamil நாடோடிகள், isn’t in Tamil, but isn’t in any particular language that uses the Latin alphabet, either), serial numbers and product codes, heavily ambiguous words (e.g., “a”), etc.

[3] It is necessary to restrict the languages that language detection can report to the “plausible” ones. For example, an early language detector we tested reported that 38% of ~100K queries on English Wikipedia were in Romanian. I’m sure at least one of them was, but in a test set of over 1,000 queries, I didn’t see any Romanian. Clearly, leaving Romanian out of the mix improves overall results. Other cases are less clear-cut, but a language detector that gets many, many more false positives than true positives is hurting more than it is helping, so it is best left out. The right blend of languages depends on the wiki in question.

[4] Recall and precision don’t take into account true negatives. In many information retrieval contexts, there are lots of results that are correctly not returned. So many, in fact, that it’s often hard to get below 95% accuracy on measures that include them, making those measures basically useless. In the context of language detection, for example, you can start with a list of a hundred languages, and be sure in every case that at least 99 of them are not the language of the query. If you return three lang-ID results, even if they are all wrong, there are 97 others that you correctly did not return (i.e., true negatives), giving a true negative rate (or “specificity”) of 97%. (See the definitions of various accuracy measures for more.) For search engine accuracy metrics, the imbalance is even more lopsided. For any query there are thousands, millions, or even billions of irrelevant documents that are correctly not returned in the top 10 results, every time.
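
The arithmetic in note [4] can be checked with a small sketch. This is only an illustration: the function name `specificity` and the two scenarios are assumptions made for the example, not part of the original analysis. It uses the standard definition, specificity = TN / (TN + FP). Strictly, if the query's actual language is among the hundred and was not returned, that language is a false negative rather than a true negative, so the count of true negatives drops to 96; the result still rounds to 97%.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


# Scenario A (assumed): the query is in none of the 100 candidate languages.
# Three wrong languages are returned (FP = 3); the other 97 are correctly
# withheld (TN = 97).
no_language = specificity(true_negatives=97, false_positives=3)

# Scenario B (assumed): the query is in one of the 100 languages, and that
# language is not among the three returned. It counts as a false negative,
# so only 96 of the 99 wrong languages are true negatives.
one_language = specificity(true_negatives=96, false_positives=3)

print(f"{no_language:.1%}")   # 97/100
print(f"{one_language:.1%}")  # 96/99
```

Either way, three wrong answers out of three still score about 97% on this measure, which is why measures that include true negatives say so little here.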
