User:TJones (WMF)/Notes/TextCat Optimization for plwiki arwiki zhwiki and nlwiki

September 2016 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T142140)

TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki

Summary of Results

Using the default 3K models, the best options for each wiki are presented below:

nlwiki

  • languages: Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, Russian
  • lang codes: nl, en, zh, ar, ko, el, he, ja, ru
  • relevant poor-performing queries: 36%
  • f0.5: 82.3%

Background

See the earlier report on frwiki, eswiki, itwiki, and dewiki for information on how the corpora were created.

Dutch Results

About 16.8% of the original 10K corpus was removed in the initial filtering. A 1,200-query random sample was taken from the remaining queries, and 57.1% of those were discarded as not being in any identifiable language, leaving a 515-query corpus. Thus only about 35.7% of low-performing queries are in an identifiable language.
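
As a quick check on that arithmetic, here is a minimal sketch (assuming, as the numbers suggest, that the 1,200-query sample was drawn from the queries that survived the initial filtering):

# fraction of poor-performing queries that are in an identifiable language
kept_after_filtering = 1 - 0.168   # initial filtering removed ~16.8%
kept_after_review = 515 / 1200     # manual review kept 515 of the 1,200 sampled
print(f"{kept_after_filtering * kept_after_review:.1%}")   # -> 35.7%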

Other languages searched on nlwiki

Based on the sample of 515 poor-performing queries on nlwiki that are in some identifiable language, about 63% are in Dutch, 25% in English, 2-3% each in French and German, and less than 2% each in a handful of other languages.

Below are the results for nlwiki, with raw counts, percentages, and 95% margins of error (a sketch of the margin-of-error calculation follows below).

 count   lang       %      +/-
   326   nl      63.30%   4.16%
   128   en      24.85%   3.73%
    16   fr       3.11%   1.50%
    11   de       2.14%   1.25%
     6   es       1.17%   0.93%
     5   it       0.97%   0.85%
     4   la       0.78%   0.76%
     3   zh       0.58%   0.66%
     2   tr       0.39%   0.54%
     2   pl       0.39%   0.54%
     2   fi       0.39%   0.54%
     2   ar       0.39%   0.54%
     1   vi       0.19%   0.38%
     1   pt       0.19%   0.38%
     1   my       0.19%   0.38%
     1   ko       0.19%   0.38%
     1   hr       0.19%   0.38%
     1   da       0.19%   0.38%
     1   cs       0.19%   0.38%
     1   af       0.19%   0.38%

In order, those are Dutch, English, French, German, Spanish, Italian, Latin, Chinese, Turkish, Polish, Finnish, Arabic, Vietnamese, Portuguese, Burmese, Korean, Croatian, Danish, Czech, Afrikaans.
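
The margin-of-error column appears to follow the usual normal approximation for a binomial proportion; here is a minimal sketch (assuming z = 1.96 for 95% confidence and n = 515):

from math import sqrt

N = 515  # size of the identifiable-language sample

def margin_of_error(count, n=N, z=1.96):
    """95% margin of error for a proportion, via the normal approximation."""
    p = count / n
    return z * sqrt(p * (1 - p) / n)

print(f"{margin_of_error(326):.2%}")  # Dutch:   ~4.16%, matching the table
print(f"{margin_of_error(128):.2%}")  # English: ~3.73%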

We don’t have query-trained language models for all of the languages represented here, such as Afrikaans, Danish, Finnish, Croatian, Latin, and Burmese (af, da, fi, hr, la, my). Since these each represent very small slices of our corpus (< 5 queries each), we aren’t going to worry about them, and accept that they will not be detected correctly.

Looking at the larger corpus of 8,323 queries remaining after the initial filtering, and focusing on queries in other writing systems, there are also small numbers of Russian, Hebrew, Greek, and Japanese queries, as well as a few in Amharic (for which we do not have a model).

Analysis and Optimization

Using all of the available language models, I ran tests on various model sizes, in increments of 500 up to 5,000 and in increments of 1,000 from 5,000 up to 10,000. Results (F1 scores) for the 3K models and some of the model sizes that did better are shown below:

               3000    3500    4000    4500    5000    6000    9000    
      TOTAL    74.5%   74.5%   75.5%   76.1%   77.0%   78.3%   78.7%   
      Dutch    88.2%   88.6%   89.2%   88.8%   89.3%   89.9%   89.9%   
    English    71.3%   69.8%   71.0%   73.6%   75.1%   75.9%   78.2%   
     French    57.1%   58.3%   59.6%   66.7%   64.0%   71.1%   71.1%   
     German    32.1%   32.7%   34.0%   32.7%   34.6%   39.2%   37.7%   
    Spanish    46.2%   44.4%   44.4%   46.2%   46.2%   46.2%   50.0%   
    Italian    20.7%   21.4%   14.8%   14.8%   15.4%   13.8%   13.3%   
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
    Chinese    80.0%   80.0%   80.0%   80.0%   80.0%  100.0%  100.0%  
     Arabic   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  
    Finnish     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
     Polish    50.0%   50.0%   66.7%   66.7%   66.7%   66.7%   66.7%   
    Turkish    57.1%   50.0%   50.0%   40.0%   44.4%   44.4%   44.4%   
  Afrikaans     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
    Burmese     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
   Croatian     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
      Czech    22.2%   25.0%   28.6%   40.0%   50.0%   50.0%   50.0%   
     Danish     0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    0.0%    
     Korean   100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  100.0%  
 Portuguese    25.0%   20.0%   22.2%   25.0%   22.2%   28.6%   28.6%   
 Vietnamese    40.0%   40.0%   40.0%   40.0%   50.0%   66.7%   66.7%   

Performance details for the 3K models are below (details for larger models are similar in terms of which languages perform most poorly); a sketch showing how these scores follow from the raw counts appears after the table:

               f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL     74.6%   74.5%   74.4%   74.4%   74.7%  515     383     130
      Dutch     93.7%   88.2%   83.3%   80.4%   97.8%  326     262     6
    English     80.2%   71.3%   64.2%   60.2%   87.5%  128     77      11
     French     47.3%   57.1%   72.2%   87.5%   42.4%  16      14      19
     German     23.6%   32.1%   50.6%   81.8%   20.0%  11      9       36
    Spanish     34.9%   46.2%   68.2%  100.0%   30.0%  6       6       14
    Italian     14.9%   20.7%   34.1%   60.0%   12.5%  5       3       21
      Latin      0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
    Chinese     90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
     Arabic    100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
    Finnish      0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish     38.5%   50.0%   71.4%  100.0%   33.3%  2       2       4
    Turkish     45.5%   57.1%   76.9%  100.0%   40.0%  2       2       3
  Afrikaans      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Burmese      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
   Croatian      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Czech     15.2%   22.2%   41.7%  100.0%   12.5%  1       1       7
     Danish      0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Korean    100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
 Portuguese     17.2%   25.0%   45.5%  100.0%   14.3%  1       1       6
 Vietnamese     29.4%   40.0%   62.5%  100.0%   25.0%  1       1       3
               f0.5    f1      f2      recall  prec    total   hits    misses
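
The per-language scores can be reproduced from the three count columns. Here is a minimal sketch, assuming "hits" counts queries of the language identified correctly (true positives) and "misses" counts queries of other languages incorrectly identified as that language (false positives):

def scores(hits, misses, total):
    """Precision, recall, and F-scores from the table's count columns."""
    precision = hits / (hits + misses) if hits + misses else 0.0
    recall = hits / total if total else 0.0

    def f_beta(beta):
        if precision + recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    return f_beta(0.5), f_beta(1), f_beta(2), recall, precision

# Dutch at 3K: 326 queries, 262 hits, 6 misses
print(["{:.1%}".format(x) for x in scores(262, 6, 326)])
# ['93.7%', '88.2%', '83.3%', '80.4%', '97.8%'] -- matches the Dutch row above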

French, German, Spanish, and Italian all do very poorly, with too many false positives. (When Spanish and Italian are disabled, French does even worse.) Polish, Turkish, Czech, Portuguese, and Vietnamese aren't terrible in terms of raw false positives, but they aren't great, either.

As noted above, Greek, Hebrew, Japanese, and Russian queries are present in the larger sample, and since our models for these languages are highly accurate, I've included them.

The final language set is Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian. As above, 3K is not the optimal model size, but it is within 1.5% of the best. The 3K results (F1 scores) are shown below, along with the best-performing model sizes:

               3000    3500    6000    7000    9000   
      TOTAL    82.3%   82.5%   82.9%   83.3%   83.7%   
      Dutch    92.1%   92.4%   92.8%   92.9%   92.7%   
    English    76.0%   76.2%   76.5%   77.6%   79.1%   
     French     0.0%    0.0%    0.0%    0.0%    0.0%    
     German     0.0%    0.0%    0.0%    0.0%    0.0%    
    Spanish     0.0%    0.0%    0.0%    0.0%    0.0%    
    Italian     0.0%    0.0%    0.0%    0.0%    0.0%    
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%    
    Chinese   100.0%  100.0%  100.0%  100.0%  100.0%  
     Arabic    80.0%   80.0%   80.0%   80.0%   80.0%   
    Finnish     0.0%    0.0%    0.0%    0.0%    0.0%    
     Polish     0.0%    0.0%    0.0%    0.0%    0.0%    
    Turkish     0.0%    0.0%    0.0%    0.0%    0.0%    
  Afrikaans     0.0%    0.0%    0.0%    0.0%    0.0%    
    Burmese     0.0%    0.0%    0.0%    0.0%    0.0%    
   Croatian     0.0%    0.0%    0.0%    0.0%    0.0%    
      Czech     0.0%    0.0%    0.0%    0.0%    0.0%    
     Danish     0.0%    0.0%    0.0%    0.0%    0.0%    
     Korean   100.0%  100.0%  100.0%  100.0%  100.0%  
 Portuguese     0.0%    0.0%    0.0%    0.0%    0.0%    
 Vietnamese     0.0%    0.0%    0.0%    0.0%    0.0%    


The accuracy is very high and the differences between model sizes are less than 2%, so it makes sense to stick with the default 3K models for now, while keeping an eye out for significant performance improvements at other model sizes.

The detailed report for the 3K model is here:

              f0.5    f1      f2      recall  prec    total   hits    misses
      TOTAL    82.3%   82.3%   82.3%   82.3%   82.3%  515     424     91
      Dutch    92.4%   92.1%   91.9%   91.7%   92.6%  326     299     24
    English    68.5%   76.0%   85.4%   93.0%   64.3%  128     119     66
     French     0.0%    0.0%    0.0%    0.0%    0.0%  16      0       0
     German     0.0%    0.0%    0.0%    0.0%    0.0%  11      0       0
    Spanish     0.0%    0.0%    0.0%    0.0%    0.0%  6       0       0
    Italian     0.0%    0.0%    0.0%    0.0%    0.0%  5       0       0
      Latin     0.0%    0.0%    0.0%    0.0%    0.0%  4       0       0
    Chinese   100.0%  100.0%  100.0%  100.0%  100.0%  3       3       0
     Arabic    71.4%   80.0%   90.9%  100.0%   66.7%  2       2       1
    Finnish     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
     Polish     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
    Turkish     0.0%    0.0%    0.0%    0.0%    0.0%  2       0       0
  Afrikaans     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
    Burmese     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
   Croatian     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
      Czech     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Danish     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
     Korean   100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
 Portuguese     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
 Vietnamese     0.0%    0.0%    0.0%    0.0%    0.0%  1       0       0
              f0.5    f1      f2      recall  prec    total   hits    misses


Recall went up and precision went down for both Dutch and English, but overall performance improved. Queries in unrepresented languages were almost all identified as either Dutch or English (decreasing precision for both), but the models that are no longer in use also no longer generate the many false positives that were dragging down overall precision. (The one query in Burmese was identified as Arabic, probably because it received the same maximum "unknown" score from every model and Arabic is alphabetically first among the contenders.)
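
A minimal sketch of that presumed tie-break follows (TextCat scores are distances, so lower is better; the placeholder scores and the alphabetical tie-break rule are assumptions for illustration, not a confirmed description of the implementation):

# All candidate languages return the same maximum "unknown" score for the
# Burmese query; breaking ties by language code picks 'ar' (Arabic) first.
scores = {"nl": 9999, "en": 9999, "zh": 9999, "ar": 9999, "ko": 9999,
          "el": 9999, "he": 9999, "ja": 9999, "ru": 9999}
best = min(scores, key=lambda lang: (scores[lang], lang))
print(best)  # -> 'ar'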

nlwiki: Best Options

The barely sub-optimal settings for nlwiki (though consistent with the other wikis using 3K models), based on these experiments, would be to use models for Dutch, English, Chinese, Arabic, Korean, Greek, Hebrew, Japanese, and Russian (nl, en, zh, ar, ko, el, he, ja, ru) at the default 3,000-ngram model size.