Talk:TextCat/Flow
Which languages does this software support? Can it be trained with all of Wikipedia's languages?
We currently have two classes of language models: those based on Wikipedia query strings (30 languages) and those based on Wikipedia text (70 models). We get better performance from the query-string models on new query strings, because queries differ from encyclopedic text in several ways (less formality, more question words, fewer diacritics in languages that use them, more nouns and fewer verbs, etc.). However, I'm working on a new config that will allow us to use both query-based and Wikitext-based models together (since, for example, the Oriya Wikitext model is probably good enough).
You can see the currently available list on github. The LM-query directory has the query-based models, and the LM directory has the Wiki-text–based models. They are named for their Wiki codes, which are usually but not always the same as or similar to current or former ISO 639 codes.
The query-based models require a lot of manual work, since a lot of queries are not in the language of the Wiki. (Igbo Wikipedia, for example, had about half its queries in English in a sample I took in 2015; English Wikipedia has lots of other languages show up, which is what started this project.) The Wiki-text models are less work, but still require validation. For smaller wikis there isn't enough text, and for the ones still in development (like Igbo Wikipedia), there's a lot of text that's not in the language of the wiki (often English).
For the larger, more well-developed Wikipedias, we could build models for all of them. But it does take some work, and so I haven't done it, though I'd like to.
I'd also like to cover a topic that I think is implicit in your question, but which you may not have intended. Having all those language models available wouldn't make it easy to detect all those languages. Running all those models would not only require more computational power, but also lead to worse results in language detection. As I mentioned, right now we don't detect French on English Wikipedia because there are too many false positives for French and too few actual queries in French. I will be able to turn on French on English Wikipedia soon, but having all the languages available would lead to too many errors (e.g., Scots vs. English, or all the Romance languages) in the general case. Instead, we enable at most the languages we see in the query logs for a given Wikipedia, minus the ones that give more errors than correct answers in that context.
Thanks for the questions! And please let me know if I can help explain anything else.
Thanks a lot for the detailed answer.
I'd love to see a page that lists suggestions for people who write in small language wikis about how to get their wikis' content usable for this. E.g., as obvious as it should sound, a suggestion to not write in English in the Igbo Wikipedia needs to be explicit.
Good point. And it's not so much that people are writing in English on the Igbo Wikipedia—though there can be lots of titles in English (say for an American actor or singer)—but also templates have English fallbacks when no Igbo translation is available. I'm not sure if that ends up in the extracted Wiki-text or not.
Where do you think would be a good place to put such a list of suggestions, if one were formulated?
Something like TextCat/Best practices for editors would be a good start.
How does this software treat languages where spaces between words are used differently from European languages? For example, Thai, Chinese, or Burmese?
Thai and Burmese could be detected by script alone (there are other languages using the same respective scripts, but there are no Wikipedia editions in them). Mandarin Chinese, however, could be confused with Japanese and with other Chinese varieties.
TextCat uses a very lightweight model based on n-grams (rather than, say, dictionaries), so it doesn't care much about spaces since it doesn't care about words as words. It just looks at sequences of 1, 2, 3, 4, and 5 characters, and spaces are just another character. Spaces are useful for distinguishing languages because characters appearing at the beginning and end of a word are more or less likely in different languages. For example, ng is much more likely at the end of a word in English than at the beginning, so "ng " is more characteristic of English than " ng", which is most common in a discussion of the sequence ng itself, or in names like Nguyen. As expected, "ng " is the 124th most common n-gram in the training data for English, while " ng" comes in at #6361. In Vietnamese, "ng " (#20) is still more common than " ng" (#79), but both rank very high.
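The n-gram extraction described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not TextCat's actual code; the function name, sample text, and the 3,000-n-gram cutoff (echoing the model sizes used in practice) are mine:

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=3000):
    """Build a ranked n-gram profile, TextCat-style (illustrative sketch).

    Spaces are treated as ordinary characters, so word-final "ng " and
    word-initial " ng" are counted as distinct n-grams.
    """
    padded = f" {text.lower()} "  # pad so word edges produce space n-grams
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # The language "profile" is the frequency-ranked list of n-grams.
    return [gram for gram, _ in counts.most_common(top_k)]

profile = ngram_profile("singing and hanging were ongoing")
# Word-final "ng " shows up in this profile; word-initial " ng" does not,
# since no word in the sample starts with ng.
```

A real model would be built from a large training corpus and keep only the top few thousand n-grams per language.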
You can try out TextCat with the demo. With the default settings, whatlanguageisthis? is identified as English, despite the lack of spaces.
There are some languages that could be identified with high accuracy by their scripts—Thai, Burmese, Korean, Japanese hiragana and katakana, Hebrew, Greek, and others. Some of those do get used for other languages (Japanese for Okinawan or Ainu, Hebrew for Yiddish), but those are rare in queries on English Wikipedia, for example.
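A rough version of that script check is easy to write. Here's a hypothetical helper (the mapping and function are my own illustration, not part of TextCat) that uses Unicode character names to spot a telltale script:

```python
import unicodedata

# Scripts whose presence is a near-certain language signal in typical
# queries, per the discussion above. Illustrative only: as noted, Hebrew
# script is also used for Yiddish, kana for Okinawan or Ainu, etc.
SCRIPT_HINTS = {
    "THAI": "Thai",
    "MYANMAR": "Burmese",
    "HANGUL": "Korean",
    "HIRAGANA": "Japanese",
    "KATAKANA": "Japanese",
    "HEBREW": "Hebrew",
    "GREEK": "Greek",
}

def script_guess(text):
    """Return a language hint if any character belongs to a telltale script.

    Unicode character names begin with the script name (e.g.
    "THAI CHARACTER KO KAI"), which is enough for a rough check.
    """
    for ch in text:
        first_word = unicodedata.name(ch, "").split(" ")[0]
        if first_word in SCRIPT_HINTS:
            return SCRIPT_HINTS[first_word]
    return None
```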
TextCat doesn't take the uniqueness of the writing system into account, and it really can't. A hybrid system could, but that's more complexity than we need most of the time. TextCat works well on unique alphabets and syllabaries, because all the characters are represented among the top thousand n-grams (and we use models of at least 3,000 n-grams). For Chinese, which is logographic, TextCat can get in trouble with short strings (like many queries) or strings of Chinese with a tiny sprinkling of Latin characters. Because there are so many Chinese characters, not all of them are in the Chinese language model. For relatively large samples of text—say, a paragraph—it can tell apart Cantonese and Mandarin (which gets labelled "Chinese") most of the time, based on the different patterns of co-occurrence of the more common characters. (You can test that with snippets from the front page of the respective Wikipedias and the TextCat demo.) Japanese is relatively easy if it's a longer sample, because there will likely be hiragana and katakana. For very short strings of hanzi and kanji, it's much harder.
Because of this confusion on short strings, we had to disable Chinese detection on Japanese Wikipedia (and similarly, French detection on English Wikipedia) with the original implementation. I've been working on improvements that should allow us to re-enable both of those relatively soon!
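For the curious: TextCat descends from Cavnar and Trenkle's n-gram text categorization, which compares the rank order of n-grams in the input against each language model using an "out-of-place" measure. A minimal sketch, with toy hand-made profiles rather than real model data:

```python
def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between document and language profiles.

    N-grams missing from the language model get a fixed maximum penalty.
    The language with the smallest total distance wins.
    """
    penalty = len(lang_profile)
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

# Toy ranked profiles, invented for illustration (real models rank
# thousands of n-grams trained from a corpus).
profiles = {
    "English": ["e", " t", "th", "the", "ng "],
    "Vietnamese": ["n", "ng", " ng", "g ", "nh"],
}
query_profile = ["th", "e", "ng ", "the", " t"]
best = min(profiles, key=lambda lang: out_of_place(query_profile, profiles[lang]))
```

Very short queries yield few n-grams, so the distances to competing models bunch together, which is one way to see why close pairs like short strings of hanzi and kanji are hard to separate.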
Oh! I didn't realize that it goes by character. I thought that it goes by word. This makes a lot of sense.
It would be great to resolve the Chinese-Japanese problem. There is research that shows that there is a lot of overlap in editors of these two wikis: https://arxiv.org/abs/1312.0976 , and I'd guess that there's some overlap in the readers, too.
Interesting paper. Thanks for the reference.
The Chinese-Japanese problem is definitely a hard one. I just more carefully checked my current best numbers in places where they are in competition, and generally even when both are enabled, only one is returning results (because the other is tuned to be more aggressive). This was surprising to me on the Japanese Wikipedia corpus I have. Bummer.
Just FYI and amusement. http://theweek.com/articles/617776/how-identify-language-glance "How to identify any language at a glance"
Cool! This is actually a much lower-resolution version of the ideas behind TextCat. Certain letters, combinations of letters, and the relative frequencies of those letters and combinations are how TextCat identifies languages. It's made more complicated by the length of the text we're dealing with; queries are often very short, and not all of the distinguishing features of a language will be present in a short query string.