User:TJones (WMF)/Notes/Language Detection with TextCat

December 2015 — See TJones_(WMF)/Notes for other projects. (Phabricator ticket: T118287)

Language Detection with TextCat

Background

As previously noted, the ES Plugin in its default configuration doesn't do a great job of language detection on queries. At Oliver's suggestion, I looked into a paper by Řehůřek and Kolkus ("Language Identification on the Web: Extending the Dictionary Method", 2009). Řehůřek and Kolkus compare their technique against an n-gram method (TextCat). On small phrases (30 characters or less), I felt the n-gram method they used generally outperformed their own (typically similar or higher precision, though often with lower recall; see Table 2 in the paper).

I was familiar with TextCat, so I looked it up online, and it's available under GNU GPL, so I thought I'd give that a try.

TextCat

TextCat is based on a paper by Cavnar and Trenkle ("N-Gram-Based Text Categorization", 1994). The basic idea is to sort the sample's n-grams by frequency, then compare their rank order against the n-gram profile for a given language. For each n-gram, one point is added for every position by which its rank in the sample differs from its rank in the language profile, and the lowest total score wins.
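
To make the rank-order comparison concrete, here is a minimal sketch in Python (the real TextCat is in Perl, and the names and details here are only illustrative; the "_" padding marks word boundaries, as in TextCat's models, and the penalty for unknown n-grams is discussed further below):

 from collections import Counter

 def ngram_profile(text, max_n=5, top=400):
     """Rank n-grams of length 1..max_n by frequency, most frequent first,
     and keep only the top entries (the stock models keep 400)."""
     counts = Counter()
     for word in text.split():
         padded = "_" + word + "_"   # mark word boundaries
         for n in range(1, max_n + 1):
             for i in range(len(padded) - n + 1):
                 counts[padded[i:i + n]] += 1
     ranked = [g for g, _ in counts.most_common()]
     return {g: rank for rank, g in enumerate(ranked[:top])}

 def out_of_place(sample_profile, lang_profile, penalty):
     """Sum of rank differences; an unknown n-gram gets the flat penalty.
     The language with the lowest total score wins."""
     score = 0
     for g, rank in sample_profile.items():
         score += abs(rank - lang_profile[g]) if g in lang_profile else penalty
     return score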

Unfortunately, the most current version of the original TextCat by Gertjan van Noord (in Perl) is pretty out of date (there are other implementations in other languages available, but I wanted to stick with the original if possible).

The provided language models for TextCat are non-Unicode, and there are even models for the same language in different encodings (e.g., Arabic iso8859_6 or windows1256). Also, as I discovered later, the language models are all limited to 400 n-grams.

Upgrades to TextCat and the Language Models

Since language detection on short strings is generally difficult and the existing models were non-Unicode, I decided to retrain the models on actual wiki query data. In addition to using Unicode input, models built on query data may have significantly different distributions of n-grams. For example, there may be significantly different proportions of punctuation, diacritics, question words (leading to different proportions of "wh"s in English or "q"s in Spanish, for example), verbs (affecting counts for conjugation endings in some languages), or different numbers of inflected forms in general. (I didn't try to empirically verify these ideas independently, but they are the motivation for the use of query data for re-training.)

I modified TextCat in several ways:

  • updated it to handle Unicode characters
  • modified the output to include scores (in case we want to limit based on the score)
  • pre-loaded all language models so that when processing line by line it is many times faster (a known deficiency mentioned in the comments of the original)
  • put in an alphabetic sub-sort after frequency sorting of n-grams (as noted in the comments of the original, skipping this is faster, but without it the ranking of equal-frequency n-grams is not unique and results can vary from run to run on the same input!); see the sketch below this list
  • removed the benchmark timers (after re-shuffling some parts of the code, they weren't in a convenient location anymore, so I just took them out)

The modified version is available on GitHub.
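
For example, the alphabetic sub-sort just means using the n-gram itself as a tie-breaker when frequencies are equal, so the ranked profile is deterministic (an illustrative Python sketch, not the actual Perl code):

 def ranked_ngrams(counts):
     """Sort by frequency (descending), then alphabetically, so n-grams with
     equal counts always come out in the same order from run to run."""
     return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]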

I also changed the way TextCat deals with the number of n-grams in a model and the number of n-grams in the sample. This requires a bit more explanation. The language models that come with TextCat have 400 n-grams (the 400 most frequent for each language), and by default TextCat considers the 400 most frequent n-grams from the sample to be identified. There is an option to use fewer n-grams from the sample (for speed, presumably), but the entire language model would still be used. There is a penalty for an unknown n-gram, which is the same as the number of n-grams used in the sample. Confusing, no?

As an example, if you have language models with 400 n-grams, but you choose to only look at the 20 most frequent n-grams in your sample (a silly thing to do), then any unknown n-gram would be given a penalty of 20 (penalties are based on difference in rank order). In this case, that's crazy, because a known n-gram in 50th place in the language model (i.e., with a penalty of at least 30) counts against a language more than an unknown n-gram (penalty of 20). In practice, I assume 300-500 sample n-grams would be used, and the penalty for an unknown n-gram would be more similar to that of a low frequency n-gram.

This makes sense when dealing with reasonably large texts, where the top most frequent n-grams really do the work of identifying a language, because they are repeated often. In really short samples (like most queries), the final decision may be made more on the basis of which language a string is least dissimilar to, rather than which it is most similar to, simply because it's too short to exhibit characteristic patterns. For example, in English, e is the most common letter, and is roughly 1.4 times as common as t, 1.6 times as common as a, and 1.7 times as common as o. You won't reliably get those proportions in a ten to twenty character string made up of English words.

As a result, it makes sense that very large language models, with thousands of n-grams, could be better at discriminating between languages, especially on very short strings. For example, while "ish_" (i.e., "ish" at the end of a word) is not super common in English (n-gram #1,014), it is even less common in Swedish (n-gram #4,100). In a long text, this wouldn't matter, because the preponderance of words ending in e, s, t, or y, or starting with s or t, or containing an, in, on, or er, or the relative proportions of single letters, or some other emergent feature would carry the day. But that's not going to happen when the string you are assessing is just "zebrafish".

I also modified TextCat to limit the size of the language model being used rather than using the whole model available (i.e., the model may have 5,000 n-grams in it, but we only want to look at the first 3,000). This means we can use the same n-gram file to test language models of various sizes without having to regenerate the models.

I made the penalty the size of the model we're using (i.e., if we look at 3,000 English n-grams, then any unknown n-gram gets treated as if it were #3000, regardless of how many n-grams we look at in the sample). The number of n-grams looked at in the sample is still configurable, but I set it to 1,000, which is effectively "all of them" for most query strings.
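
Put together, the modified scoring looks roughly like this (an illustrative Python sketch of the scheme described above, not the actual Perl or PHP code; model_size and sample_size are the knobs discussed here):

 def textcat_score(sample_ngrams, lang_model, model_size=3000, sample_size=1000):
     """sample_ngrams and lang_model are lists of n-grams ranked by frequency,
     most frequent first. The model is truncated to model_size entries, and an
     unknown n-gram gets a flat penalty equal to model_size. Low score wins;
     compute this for each candidate language."""
     model_rank = {g: r for r, g in enumerate(lang_model[:model_size])}
     score = 0
     for sample_rank, g in enumerate(sample_ngrams[:sample_size]):
         score += abs(sample_rank - model_rank[g]) if g in model_rank else model_size
     return score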

Using the entire available model (or even larger models) isn't necessarily a good idea. At some point random noise will begin to creep in, and with very low frequency counts, alphabetizing the n-grams may have as much of an effect as the actual frequency (i.e., an n-gram may be tied for 15,683rd place, but may show up in 16,592nd place because there are a thousand n-grams with a count of 1). Also, larger models (with more n-grams) are more coarse when built on smaller training data sets, further exaggerating the differences between models built on larger vs. smaller corpora.

Query Data Collection

I started with 46,559,669 queries extracted from a week's worth of query logs (11/10/2015 through 11/16/2015). I collated the queries by wiki (with the various wikis acting as an initial stand-in for the corresponding language). There were 59 query sets with at least 10,000 raw queries (up to 18M+, for English): Albanian, Arabic, Armenian, Azerbaijani, Basque, Bengali, Bosnian, Bulgarian, Cantonese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Igbo, Indonesian, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Mongolian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Serbo-Croatian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, and Vietnamese.

There's plenty of messiness in the queries, so I filtered queries according to a number of criteria:

  • deduplication: I deduped the queries. Even though the same query could come from multiple sources, the most commonly repeated queries in general come from bots. Others are driven by current events, and don't reflect more general language or query patterns. Deduping reduces their ability to skew the language model stats.
  • repetitive junk: A decent filter for junk (with very high precision) is to remove queries with the same character or two-letter sequence repeated at least four times in a row, or the same 3-6 character sequence repeated at least three times in a row (see the sketch after this list). I skimmed the queries being removed, and for some character sets my non-Unicode tool (grep) did some things not quite right, so I adjusted accordingly. But as a general heuristic, this is a good way of reducing noise.
  • inappropriate character set: For each language, I also filtered out queries that were entirely in an inappropriate character set. For example, a query with no Latin characters is not going to be in English. This is obviously much more precise for some languages (Thai, Greek), and since all query sets seem to have a fair number of English queries, it was also fairly effective for languages that don't use the Latin alphabet even if their writing system isn't unique to the language (Cyrillic, Arabic). I also filtered queries with < and > characters, since there were bits of HTML and XML in some queries.
  • bad key words: I took a look at the highest-frequency tokens across all queries and found a number of terms that were high-precision markers for "bad" queries, including insource, category, Cookbook, prefix, www, etc. These were all filtered out, too.
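
As a rough illustration of the repetitive-junk filter (the actual filtering was done with grep, so this hypothetical Python reconstruction may differ in detail from the patterns actually used):

 import re

 # Same character or two-character sequence repeated at least four times in a
 # row, or the same 3-6 character sequence repeated at least three times.
 REPETITIVE = re.compile(
     r"(.)\1{3,}"        # a character followed by 3+ more copies of itself
     r"|(..)\2{3,}"      # a 2-character sequence followed by 3+ more copies
     r"|(.{3,6})\3{2,}"  # a 3-6 character sequence followed by 2+ more copies
 )

 def looks_like_junk(query):
     return REPETITIVE.search(query) is not None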

After filtering, a lot of queries were removed. English was down to ~14M. Telugu had the fewest left, losing almost 70% and ending up with ~3,300 queries. The largest loss (by percentage and total) was Italian, which lost about 89% of its queries (5.8M). The largest factor for Italian is probably deduplication, since searches on itwiki are repeated on a number of other Italian wikis.

The data was still messy, but filtering should have improved the signal strength of the main language of the wiki, while preserving the idiosyncrasies of actual queries (vs., say, wiki text in that language).

Variants Tested

My primary variables were (a) language model size, (b) whether to use the sample n-gram count or the language model size (i.e., its n-gram count) as the unknown n-gram penalty, and (c) sample n-gram count. As noted above, using the language model size as the penalty (b) performed much better, and the sample n-gram count (c) seemed best when it was "all of them" (in practice, for queries, that's 1,000).

I tested model sizes with 100 to 2,000 n-grams (in increments of 100) and 2,000 to 5,000 n-grams (in increments of 500). In my experiments, 3,000 to 3,500 n-grams generally performed the best.

When I reviewed the results, there were clearly some detectors that performed very poorly. I was less concerned with recall (every right answer is a happy answer) and more concerned with precision. Some low-precision models are the result of poor training data (the Igbo wiki, for example, gets a lot of queries in English); others are apparently just hard, esp. on small strings (like French). I removed language models with poor precision, in the hopes that, for example, English queries identified as French or Igbo would be correctly identified once French and Igbo were removed as options. Removing options that had very low precision resulted in improved performance.

A number of languages were dropped because there were no examples in the evaluation set, meaning they could only be wrong (and many were). Others, like French, Tagalog, and German, were dropped even though they could theoretically help, because they got so many misses (false positives). The final list of languages used included: English, Spanish, Chinese, Portuguese, Arabic, Russian, Persian, Korean, Bengali, Bulgarian, Hindi, Greek, Japanese, Tamil, and Thai. The language models for Hebrew, Armenian, Georgian, and Telugu were also used, but didn't detect anything (i.e., they weren't problematic, so they weren't removed).

Some of these have high accuracy because their writing systems are very distinctive: Armenian, Bengali, Chinese (esp. when not trying to distinguish Cantonese), Georgian, Greek, Hebrew, Hindi (in this set of languages), Korean, Tamil, Telugu, and Thai. Bulgarian and Portuguese (potentially confused with Russian and Spanish, respectively) actually didn't do particularly well, but their negatives were on a fairly small scale.

Best Options

The best-performing setup for enwiki, then, is: language models with 3,000 n-grams, built on the filtered query set, setting the unknown n-gram penalty to the language model size, and limiting the languages to those that are very high precision or very useful for enwiki: English, Spanish, Chinese, Portuguese, Arabic, Russian, Persian, Korean, Bengali, Bulgarian, Hindi, Greek, Japanese, Tamil, and Thai.
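
As a compact summary of that setup (the keys below are purely illustrative and hypothetical, not actual CirrusSearch or TextCat configuration names):

 # Hypothetical summary of the best-performing enwiki setup described above.
 ENWIKI_TEXTCAT_SETUP = {
     "model_size": 3000,               # n-grams per language model
     "unknown_ngram_penalty": 3000,    # equal to the model size
     "sample_size": 1000,              # effectively "all" n-grams for most queries
     "training_data": "filtered query logs",
     "languages": [
         "English", "Spanish", "Chinese", "Portuguese", "Arabic", "Russian",
         "Persian", "Korean", "Bengali", "Bulgarian", "Hindi", "Greek",
         "Japanese", "Tamil", "Thai",
     ],
 }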

The Numbers

The baseline performance I am trying to beat is the ES Plugin (with spaces). A summary of F0.5 performance of the ES Plugin overall and for the most common languages in enwiki queries is provided below. (In these tables, total is the number of queries in the evaluation set that are actually in the given language, hits are true positives, and misses are false positives.)
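
For reference, F0.5 weights precision more heavily than recall. A quick sketch of how the numbers in these tables relate, using the English baseline row below as a check:

 def f_beta(precision, recall, beta=0.5):
     """F-beta score; beta < 1 weights precision more heavily than recall."""
     return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

 # English baseline row: 599 English queries, 205 hits (true positives),
 # 2 misses (false positives)
 recall = 205 / 599             # ~34.2%
 precision = 205 / (205 + 2)    # ~99.0%
 print(f_beta(precision, recall))  # ~0.718, i.e., the 71.8% in the table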

ES Plugin Baseline
            f0.5    recall  prec    total  hits  misses
TOTAL        54.4%   39.0%   60.4%  775    302   198
English      71.8%   34.2%   99.0%  599    205   2
Spanish      62.8%   58.1%   64.1%  43     25    14
Chinese      90.3%   65.0%  100.0%  20     13    0
Portuguese   44.0%   42.1%   44.4%  19     8     10
Arabic       95.2%   80.0%  100.0%  10     8     0
French       13.6%   30.0%   12.0%  10     3     22
Tagalog      31.0%   77.8%   26.9%  9      7     19
German       36.8%   62.5%   33.3%  8      5     10
Russian      88.2%   60.0%  100.0%  5      3     0
Persian      75.0%   75.0%   75.0%  4      3     1

The results for TextCat, showing sub-scores for languages with > 0% F0.5 (plus French, Tagalog, and German, for nostalgia's sake):

TextCat, limited to certain languages
             f0.5   recall  prec    total  hits  misses
TOTAL        83.1%   83.2%   83.1%  775    645   131
English      90.5%   93.3%   89.9%  599    559   63
Spanish      51.4%   74.4%   47.8%  43     32    35
Chinese      85.5%   65.0%   92.9%  20     13    1
Portuguese   37.4%   73.7%   33.3%  19     14    28
Arabic       87.0%   80.0%   88.9%  10     8     1
French        0.0%    0.0%    0.0%  10     0     0
Tagalog       0.0%    0.0%    0.0%  9      0     0
German        0.0%    0.0%    0.0%  8      0     0
Russian      95.2%   80.0%  100.0%  5      4     0
Persian      83.3%  100.0%   80.0%  4      4     1
Korean       90.9%   66.7%  100.0%  3      2     0
Bengali     100.0%  100.0%  100.0%  2      2     0
Bulgarian    55.6%  100.0%   50.0%  2      2     2
Hindi       100.0%  100.0%  100.0%  2      2     0
Greek       100.0%  100.0%  100.0%  1      1     0
Tamil       100.0%  100.0%  100.0%  1      1     0
Thai        100.0%  100.0%  100.0%  1      1     0

These results are comparable to (actually slightly better than, for F0.5) the results from using per-language thresholds with the ES Plugin (which were optimized on the evaluation set and are thus very much overfitted and brittle), with much, much better overall recall (83.2% vs 36.1%) and marginally worse precision (83.1% vs 90.3%).

Other Training Options Explored

  • I did initially and very optimistically build language models on the raw query strings for each language. The results were not better than the ES Plugin, hence the filtering.
  • I tried to reduce the noise in the training data. I chose English and Spanish because they are the most important languages for queries on enwiki. I manually reviewed 5,699 enwiki queries and reduced them to 1,554 English queries (so much junk!!), and similarly reduced 4,101 eswiki queries to 2,497 Spanish queries. I built models on these queries and used them with models for other languages built on the larger query sets above. They performed noticeably worse, probably because of the very small corpus size. It might be possible to improve the performance of lower-performing language models using this method, but it's a lot of work to build up sizable corpora.
  • I extracted text from thousands of Wiki articles for Arabic, German, English, French, Portuguese, Tagalog, and Chinese—the languages with the most examples in my test corpus for enwiki. I extracted 2.6MB of training data for each language, though it was obviously messy and included bits of text in other languages. I built language models on these samples, and used them in conjunction with the high-performing models for other wikis built on query data. The results were not as good as with the original models built on query data, regardless of how I mixed and matched them. So, query data does seem to have patterns that differ from regular text, at least Wikipedia article text. (Interestingly, these models were best at the max language model size of 5,000 n-grams, so I tested model sizes in increments of 500 up to 10,000 n-grams. Performance did in fact max out around 5,000.)
  • I looked at using the internal dissimilarity score (i.e., smaller is better) from TextCat as a threshold, but it didn't help.

Next Steps

There are a number of options we could explore from here, and these have been converted into Phabricator tickets for the Discovery Team.

  1. Stas has already started working on converting TextCat to PHP for use in Cirrus Search (available on GitHub), and he and Erik have been brainstorming on ways of making it more efficient, too. That needs some testing (e.g., Unicode compatibility) and comparison to the Perl version (i.e., same results on test queries). Phabricator: T121538
  2. Do a better assessment of the new language models to decide which ones are really not good (e.g., probably Igbo) and which ones are just not appropriate for enwiki (e.g., hopefully French and German). The obvious approach is to create a "fair" evaluation test set with equal numbers of examples for each language, and evaluate performance on that set. Phabricator: T121539
  3. Use the training data created here for training models for the ES Plugin / Cybozu. Perhaps its difficulties with queries are partly due to inaccurate general language models. This could also include looking at the internals and seeing if there is any benefit to changing the model size or other internal configuration, including optionally disabling "unhelpful" models (I'm looking at you, Romanian). Phabricator: T121540
  4. Create properly weighted evaluation sets for other wikis (in order by query volume) and determine the best mix of languages to use for each of them. Each evaluation set would be a set of ~1,000 zero-results queries from the given wiki, manually tagged by language. It takes half a day to do if you are familiar with the main language of the wiki, and evaluation on a given set of language models takes a couple of hours at most. (depends on 2 to make sure we aren't wasting time on a main language that does not perform well) Phabricator: T121541
  5. Do an A/B test on enwiki (or A/B/C test vs the ES Plugin) using the best config determined here. (A/B test depends on 1; A/B/C test could benefit from 3) Phabricator: T121542
  6. Do A/B tests on other wikis (depends on 4) Phabricator: T121543
  7. Create larger manually "curated" training sets for languages with really crappy training data (e.g., Igbo) that's contaminated with English and other junk. (could depend on and be gated by the results of 8; could be tested via re-test of data in 2) Phabricator: T121544
  8. See if Wikipedia-based language models for languages with crappy training data do better. (could obviate the need for 7 in some cases; could be tested via re-test of data in 2) Phabricator: T121545
  9. Experiment with equalizing training set sizes, since very small training sets may make for less accurate language models. That is, extract a lot more data for particular wikis with smaller training sets so their language models are more fine-grained. These languages ended up with less than 20K queries to build their language models on: Armenian, Bosnian, Cantonese, Hindi, Latin, Latvian, Macedonian, Malayalam, Mongolian, Serbo-Croatian, Swahili, Tamil, Telugu, Urdu. Some need it more than others—languages with distinctive character sets do well already. (could link up with 7, 8, and/or 2) Phabricator: T121546
  10. Improve training data via application of language models to the training data. For example, use all available high-precision language models except French on the French training data. Group the results by language and sort by score (for TextCat, smaller is better). Manually review the results and mark for deletion those that are not French. This should be much faster (though less exhaustive) than option 7, because most of the best-scoring queries will in fact not be French. Review can stop, say, when the incidence of non-French queries is less than half. This should remove the most distinctively English (or German, or whatever) queries from the French training set. Repeat on other languages, retrain all the language models, and if there is useful improvement, repeat the whole process. (depends on 2 for a reasonable evaluation set) Phabricator: T121547

Discussion Notes from IRC

TextCat and Language Detection

Back before the holidays (12/23/2015), Stas and Trey had a conversation on IRC about TextCat and Lang ID. There was lots of good stuff in the conversation, so the main points are summarized here, to record for posterity, and to open them up to further conversation if anyone has any additional ideas.

For reference, the main Phab ticket for language ID stuff is T118278: EPIC: Improve Language Identification for use in Cirrus Search[1]

Building Language Models: It seems like we should try to create language models to cover at least the same set of languages as the original TextCat. The original models were in various encodings, but we’d create (and have created) models in Unicode. In general, we saw better performance doing language detection on queries using models built on queries.[2] If we want to support general language identification, we could also build models based on text from Wikipedia (which we need to do for some languages anyway because the query data is so poor).[3] It’s a relatively straightforward task, compared to getting sufficiently high quality query data.[4]

Using Language Models: We get the biggest improvement in language detection accuracy (~20% increase in F0.5) from restricting the list of candidate languages based on their individual performance and the distribution of languages we encounter in real life, rather than using all available languages.[2][7] We need our new TextCat to support the ability to specify which models to use.[5] It makes sense to create models based on both query data (if we have it) and general text (from Wikipedia) and make them available, probably through Stas’s PHP version of TextCat on GitHub.[6] Trey will also be putting the Perl version and language models up on GitHub after a bit more cleanup.

Choosing Language Models: In order to choose which models to use on a particular wiki, we need to sample queries and manually identify the languages represented, and then experimentally determine the best set of language models to use.[8] We will do this for the wikis with the highest query volume, and see how far down the list we have time to work on. For any wikis we don’t get to, we can try using a generic set of languages, or just not do language detection for now, or make general capabilities available as an opt-in feature—though we need to think more carefully about how to handle smaller wikis, especially after we have more experience using TextCat on larger wikis.

In addition to evaluation sets for particular wikis, we have a task[9] to create a "balanced" set of queries in known languages for top wikis (by query volume) for general evaluation of language models, which can help us determine a generic set of more-or-less reliable languages. (These are smaller sets that let us gauge general performance, but not enough for training language models.)

Updating Language Model Choices: Trey's estimate/intuition (which could use some validation) is that the per-wiki language lists would need updating at most once a quarter, though it's possible that with appropriate metrics we could determine that we needed to do an update by a sudden or sustained gradual decrease in performance. We may need to think this through a bit more carefully, since different update patterns imply different places/ways to store the list of relevant language models. Stas says that quarterly updates are close enough to static to put language lists into some file in the Cirrus source, pretty much like we do with indexing profiles, etc. Alternatively, if updates are more frequent and per-wiki, we could store the list of languages to use in mediawiki-config.

[1] https://phabricator.wikimedia.org/T118278

[2] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_with_TextCat#Best_Options

[3] https://phabricator.wikimedia.org/T121545

[4] https://phabricator.wikimedia.org/T121547, etc. See [1] for more.

[5] https://phabricator.wikimedia.org/T121538

[6] https://github.com/smalyshev/textcat

[7] https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language_Detection_Evaluation#ElasticSearch_Plugin.E2.80.94Limiting_Languages_.26_Retraining

[8] https://phabricator.wikimedia.org/T121541

[9] https://phabricator.wikimedia.org/T121539

Extra Bits

Query Counts

The table below shows the (usually) two-letter language code, the name of the language, the scripts used by the language, the original number of queries from the week's sample, the number of queries left after filtering, and the percentage change between the raw and filtered query counts. It is sorted in descending order by filtered queries.

Code Language Script(s) Queries Filtered Queries Query diff
en English Latin 18651038 14139859 -24.19%
es Spanish Latin 3328264 2887172 -13.25%
de German Latin 3461455 2872488 -17.02%
fr French Latin 2098877 1774733 -15.44%
ru Russian Cyrillic 2043875 1416948 -30.67%
pt Portuguese Latin 1093626 969825 -11.32%
id Indonesian Latin 1072967 809114 -24.59%
ja Japanese Japanese 1134321 751109 -33.78%
it Italian Latin 6580737 727043 -88.95%
ar Arabic Arabic 912170 652108 -28.51%
nl Dutch Latin 655287 580785 -11.37%
pl Polish Latin 581678 511323 -12.10%
zh Chinese Chinese 868158 510544 -41.19%
cs Czech Latin 418310 365714 -12.57%
tr Turkish Latin 379098 331727 -12.50%
fa Farsi Arabic 343720 227553 -33.80%
sv Swedish Latin 246081 224244 -8.87%
ko Korean Korean 302097 194734 -35.54%
vi Vietnamese Latin 215934 177089 -17.99%
uk Ukrainian Cyrillic 160214 136523 -14.79%
fi Finnish Latin 145355 132250 -9.02%
tl Tagalog Latin 148668 129196 -13.10%
he Hebrew Hebrew 156793 117217 -25.24%
no Norwegian Latin 121937 112184 -8.00%
hu Hungarian Latin 122404 110642 -9.61%
ro Romanian Latin 111709 103912 -6.98%
ca Catalan Latin 110597 103442 -6.47%
el Greek Greek 118763 80685 -32.06%
da Danish Latin 74256 66455 -10.51%
th Thai Thai 94312 64934 -31.15%
sk Slovak Latin 48723 45972 -5.65%
ig Igbo Latin 45642 41606 -8.84%
hr Croatian Latin 39450 36964 -6.30%
sr Serbian Latin + Cyrillic 38063 35851 -5.81%
lt Lithuanian Latin 31658 29970 -5.33%
az Azerbaijani Latin 31055 28983 -6.67%
et Estonian Latin 29715 28017 -5.71%
bn Bengali Bengali 75130 27005 -64.06%
kk Kazakh Cyrillic 30648 26490 -13.57%
ka Georgian Georgian 42658 26141 -38.72%
ms Malay Latin 25195 23541 -6.56%
sl Slovenian Latin 24349 23151 -4.92%
sq Albanian Latin 24718 23108 -6.51%
bg Bulgarian Cyrillic 36478 22825 -37.43%
eu Basque Latin 21117 19759 -6.43%
sw Swahili Latin 16829 15395 -8.52%
hi Hindi Devanagari 58913 14511 -75.37%
lv Latvian Latin 14948 14288 -4.42%
hy Armenian Armenian 21918 13163 -39.94%
sh Serbo-Croatian Latin + Cyrillic 13344 12956 -2.91%
la Latin Latin 12741 10402 -18.36%
bs Bosnian Latin 10457 10004 -4.33%
zh_yue Cantonese Chinese 13748 9014 -34.43%
mk Macedonian Cyrillic 13590 8065 -40.65%
ur Urdu Arabic 22523 7511 -66.65%
mn Mongolian Cyrillic 10009 6815 -31.91%
ta Tamil Tamil 19479 5639 -71.05%
ml Malayalam Malayalam 23108 4079 -82.35%
te Telugu Telugu 10762 3304 -69.30%

Additional Non-Word Characters

Stas suggested that removing some punctuation symbols from texts might improve performance. He suggested adding period and parens to the list of non-word characters in TextCat (which was originally only numbers and whitespace characters). I thought periods might be useful: if certain languages tend to end sentences with certain kinds of words and those words have certain kinds of inflections, then certain letters and a period could be more characteristic of one language than another. Rather than debate it, we decided testing it would be more conclusive, so I tested it.
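
To illustrate what the non-word character list does (a simplified Python sketch; in TextCat the non-word characters effectively act as word boundaries before n-grams are counted, and the actual Perl handling may differ in detail):

 import re

 def words(text, non_word_chars=r"0-9\s"):
     """Split text into words on the configurable non-word character class."""
     return [w for w in re.split(r"[" + non_word_chars + r"]+", text) if w]

 print(words("zebrafish (Danio rerio) vs. goldfish"))
 # original class (digits and whitespace): ['zebrafish', '(Danio', 'rerio)', 'vs.', 'goldfish']
 print(words("zebrafish (Danio rerio) vs. goldfish", non_word_chars=r"0-9\s()"))
 # with parens added: ['zebrafish', 'Danio', 'rerio', 'vs.', 'goldfish']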

I started using the same training data and language set I'd previously optimized for enwiki (which is also what my evaluation set is based on). For reference, F0.5 performance for the previous system is 83.3%.

I retrained those language models on the same data, but added \.() to the list of non-word characters. The short version is that net F0.5 performance improved by 0.1% to 83.4% (a net of one more true positive and one less false positive).

I decided to take out the period and filter just the parens. Performance improved another 0.1% to 83.5%. (Or, adding parens to the non-word list improved performance 0.2%, and adding the period decreased performance by 0.1%.)

I'd already queued up a somewhat longer list of punctuation and special characters, adding question mark, exclamation mark, comma, colon, and semicolon, i.e.: 0-9\s\.\?!,:;\\() — F0.5 performance went back to 83.4%.

Since it only takes a little while to run a test, I decided to try an even larger collection of potential non-word characters (not including period, which seems useful): 0-9\s\?!,:;\\()"@#\$%^&\*_\+=\{\}\/<> — the results were terrible! F0.5 maxed out at 76.8%. (50/50 that's because I messed up the specs for all those non-word characters, though.)

Interestingly (and reassuringly), in all cases, 3000 n-grams was, as before, the optimal n-gram set size.

Conclusions

Looks like keeping periods as word characters helps a little, treating parens as word characters confuses things a little, and other random characters are pretty random, but the effect is very small (a net gain of 2 true positives and 2 fewer false positives out of 775 examples).

I suggest we exclude parens from being "word" characters, but keep periods. Cool! I'll commit the change to the PHP version.

If anyone has a semi-well-motivated list of non-word characters to try, let me know.

Lots of Details

More details on performance for each potential set of non-word characters are provided below.

0-9\s

The previous configuration.

                thresh  f0.5    f1      f2      recall  prec    total   hits    misses
TOTAL (775)     1        83.3%   83.3%   83.3%   83.4%   83.2%  775     646     130
English (599)   1        90.5%   91.6%   92.6%   93.3%   89.9%  599     559     63
Spanish (43)    1        51.4%   58.2%   66.9%   74.4%   47.8%  43      32      35
Chinese (20)    1        90.3%   78.8%   69.9%   65.0%  100.0%  20      13      0
Portuguese (19) 1        37.4%   45.9%   59.3%   73.7%   33.3%  19      14      28
Arabic (10)     1        87.0%   84.2%   81.6%   80.0%   88.9%  10      8       1
Russian (5)     1        95.2%   88.9%   83.3%   80.0%  100.0%  5       4       0
Persian (4)     1        83.3%   88.9%   95.2%  100.0%   80.0%  4       4       1
Korean (3)      1        90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
Bengali (2)     1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Bulgarian (2)   1        55.6%   66.7%   83.3%  100.0%   50.0%  2       2       2
Hindi (2)       1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Greek (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Japanese (1)    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Tamil (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Thai (1)        1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0

0-9\s()

The best config so far.

                thresh  f0.5    f1      f2      recall  prec    total   hits    misses
TOTAL (775)     1        83.5%   83.6%   83.6%   83.6%   83.5%  775     648     128
English (599)   1        90.6%   91.7%   92.7%   93.5%   89.9%  599     560     63
Spanish (43)    1        51.4%   58.2%   66.9%   74.4%   47.8%  43      32      35
Chinese (20)    1        90.3%   78.8%   69.9%   65.0%  100.0%  20      13      0
Portuguese (19) 1        41.0%   50.0%   64.1%   78.9%   36.6%  19      15      26
Arabic (10)     1        95.2%   88.9%   83.3%   80.0%  100.0%  10      8       0
Russian (5)     1        95.2%   88.9%   83.3%   80.0%  100.0%  5       4       0
Persian (4)     1        83.3%   88.9%   95.2%  100.0%   80.0%  4       4       1
Korean (3)      1        90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
Bengali (2)     1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Bulgarian (2)   1        55.6%   66.7%   83.3%  100.0%   50.0%  2       2       2
Hindi (2)       1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Greek (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Japanese (1)    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Tamil (1)       1        55.6%   66.7%   83.3%  100.0%   50.0%  1       1       1
Thai (1)        1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0

0-9\s\.()

The suggestion that started this

                thresh  f0.5    f1      f2      recall  prec    total   hits    misses
TOTAL (775)     1        83.4%   83.4%   83.5%   83.5%   83.4%  775     647     129
English (599)   1        90.5%   91.6%   92.6%   93.3%   89.9%  599     559     63
Spanish (43)    1        50.8%   57.7%   66.7%   74.4%   47.1%  43      32      36
Chinese (20)    1        90.3%   78.8%   69.9%   65.0%  100.0%  20      13      0
Portuguese (19) 1        41.0%   50.0%   64.1%   78.9%   36.6%  19      15      26
Arabic (10)     1        95.2%   88.9%   83.3%   80.0%  100.0%  10      8       0
Russian (5)     1        95.2%   88.9%   83.3%   80.0%  100.0%  5       4       0
Persian (4)     1        71.4%   80.0%   90.9%  100.0%   66.7%  4       4       2
Korean (3)      1        90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
Bengali (2)     1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Bulgarian (2)   1        55.6%   66.7%   83.3%  100.0%   50.0%  2       2       2
Hindi (2)       1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Greek (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Japanese (1)    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Tamil (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Thai (1)        1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0

0-9\s\.\?!,:;\\()

More! more! .... meh

                thresh  f0.5    f1      f2      recall  prec    total   hits    misses
TOTAL (775)     1        83.4%   83.4%   83.5%   83.5%   83.4%  775     647     129
English (599)   1        90.5%   91.6%   92.6%   93.3%   89.9%  599     559     63
Spanish (43)    1        51.4%   58.2%   66.9%   74.4%   47.8%  43      32      35
Chinese (20)    1        90.3%   78.8%   69.9%   65.0%  100.0%  20      13      0
Portuguese (19) 1        40.1%   49.2%   63.6%   78.9%   35.7%  19      15      27
Arabic (10)     1        95.2%   88.9%   83.3%   80.0%  100.0%  10      8       0
Russian (5)     1        95.2%   88.9%   83.3%   80.0%  100.0%  5       4       0
Persian (4)     1        83.3%   88.9%   95.2%  100.0%   80.0%  4       4       1
Korean (3)      1        90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
Bengali (2)     1        71.4%   80.0%   90.9%  100.0%   66.7%  2       2       1
Bulgarian (2)   1        55.6%   66.7%   83.3%  100.0%   50.0%  2       2       2
Hindi (2)       1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Greek (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Japanese (1)    1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Tamil (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Thai (1)        1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0

0-9\s\?!,:;\\()"@#\$%^&\*_\+=\{\}\/<>

Train wreck!

                thresh  f0.5    f1      f2      recall  prec    total   hits    misses
TOTAL (775)     1        76.8%   76.9%   76.9%   76.9%   76.8%  775     596     180
English (599)   1        88.1%   87.8%   87.6%   87.5%   88.2%  599     524     70
Spanish (43)    1        31.3%   35.7%   41.5%   46.5%   29.0%  43      20      49
Chinese (20)    1        85.9%   71.0%   60.4%   55.0%  100.0%  20      11      0
Portuguese (19) 1        22.3%   29.9%   45.1%   68.4%   19.1%  19      13      55
Arabic (10)     1        95.2%   88.9%   83.3%   80.0%  100.0%  10      8       0
Russian (5)     1        95.2%   88.9%   83.3%   80.0%  100.0%  5       4       0
Persian (4)     1        83.3%   88.9%   95.2%  100.0%   80.0%  4       4       1
Korean (3)      1        90.9%   80.0%   71.4%   66.7%  100.0%  3       2       0
Bengali (2)     1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Bulgarian (2)   1        55.6%   66.7%   83.3%  100.0%   50.0%  2       2       2
Hindi (2)       1       100.0%  100.0%  100.0%  100.0%  100.0%  2       2       0
Greek (1)       1        55.6%   66.7%   83.3%  100.0%   50.0%  1       1       1
Japanese (1)    1        38.5%   50.0%   71.4%  100.0%   33.3%  1       1       2
Tamil (1)       1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0
Thai (1)        1       100.0%  100.0%  100.0%  100.0%  100.0%  1       1       0