Topic on Talk:TextCat/Flow

Languages with different spacing

Amire80

How does this software treat languages where spaces between words are used differently from European languages? For example, Thai, Chinese, or Burmese?

Thai and Burmese could be detected by the script (there are other languages using the same respective script, but there are no Wikipedia editions in them). Mandarin Chinese, however, could be confused with Japanese and with other Chinese varieties.

TJones (WMF)

TextCat uses a very lightweight model based on n-grams (rather than, say, dictionaries), so it doesn't care much about spaces since it doesn't care about words as words. It just looks at sequences of 1, 2, 3, 4, and 5 characters, and spaces are just another character. Spaces are useful for distinguishing languages because characters appearing at the beginning and ending of a word are more or less likely in different languages. For example, ng is much more likely at the end of a word in English than at the beginning, so "ng " is more characteristic of English than " ng", which is most common in a discussion of the sequence ng, or in names, like Nguyen. As expected, "ng " was the 124th most common n-gram in the training data for English, while " ng" comes in at #6361. In Vietnamese, "ng " is also more common than " ng": it ranks #20, while " ng" ranks #79, which is still very common.
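To make the "spaces are just another character" point concrete, here is a minimal sketch of character n-gram counting. It is not TextCat's actual code: the function names, the whitespace normalization, and the choice to pad the text with a single space on each side are assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, max_n=5):
    """Count every character n-gram of length 1..max_n.

    Assumption for this sketch: runs of whitespace are collapsed and the
    text is padded with one space on each side, so that word-initial and
    word-final sequences such as " ng" and "ng " show up as n-grams.
    """
    padded = " " + " ".join(text.split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

def ranked(counts):
    """N-grams sorted by descending count (ties broken alphabetically)."""
    return [g for g, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

counts = char_ngrams("sing a long song to the king")
print(counts["ng "], counts[" ng"])
```

In this toy English sentence "ng " occurs four times and " ng" never does, which is the same asymmetry the ranks above reflect on real training data.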

You can try out TextCat with the demo. With the default settings, whatlanguageisthis? is identified as English, despite the lack of spaces.

There are some languages that could be identified with high accuracy by their scripts—Thai, Burmese, Korean, Japanese hiragana and katakana, Hebrew, Greek, and others. Some of those do get used for other languages (Japanese for Okinawan or Ainu, Hebrew for Yiddish), but those are rare in queries on English Wikipedia, for example.
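As a rough illustration of what script-based identification could look like (this is not something TextCat does), the sketch below guesses the dominant script of a string from the first word of each letter's Unicode character name. The function name `dominant_script` and the "first word of the name" heuristic are assumptions for illustration; a real system would more likely use proper Unicode script properties.

```python
import unicodedata
from collections import Counter

def dominant_script(text):
    """Return a rough script label for the most frequent kind of letter.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, e.g. 'THAI', 'HEBREW', 'HIRAGANA', 'LATIN'.
    Non-letters (digits, punctuation, combining marks) are ignored.
    """
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None

for sample in ["สวัสดี", "שלום", "こんにちは", "hello"]:
    print(sample, dominant_script(sample))
```

A label like this only narrows things down to a script. It cannot separate Hebrew from Yiddish, or Mandarin from Cantonese or from Japanese written in kanji, which is where the n-gram statistics still have to do the work.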

TextCat doesn't take the uniqueness of the writing system into account, and it really can't. A hybrid system could, but that's more complexity than we need most of the time. TextCat works well on unique alphabets and syllabaries, because all the characters are represented among the top thousand n-grams (and we use models of at least 3,000 n-grams). For Chinese, which is logographic, TextCat can get into trouble with short strings (like many queries) or strings of Chinese with a tiny sprinkling of Latin characters. Because there are so many Chinese characters, not all of them are in the Chinese language model. For relatively large samples of text (say, a paragraph), it can tell apart Cantonese and Mandarin (which gets labelled "Chinese") most of the time, based on the different patterns of co-occurrence of the more common characters. (You can test that with snippets from the front page of the respective Wikipedias and the TextCat demo.) Japanese is relatively easy if it's a longer sample, because there will likely be hiragana and katakana. For very short strings of hanzi and kanji, it's much harder.
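Here is a sketch of why n-grams that are missing from a model hurt so much on short strings. It assumes the rank-order ("out-of-place") comparison from the classic n-gram text categorization approach, and uses tiny made-up rankings. The function name, the `max_penalty` parameter, and the toy data are hypothetical and not taken from TextCat's code.

```python
def out_of_place(sample_ranked, model_ranked, max_penalty=None):
    """Sum of rank differences between a sample and a language model.

    Both arguments are lists of n-grams ordered from most to least
    frequent. An n-gram absent from the model costs max_penalty
    (assumed here to default to the model size). Lower scores mean
    a closer match.
    """
    if max_penalty is None:
        max_penalty = len(model_ranked)
    model_pos = {g: i for i, g in enumerate(model_ranked)}
    score = 0
    for i, g in enumerate(sample_ranked):
        if g in model_pos:
            score += abs(i - model_pos[g])
        else:
            score += max_penalty  # unseen n-gram: maximum penalty
    return score

toy_model = ["a", "b", "c"]                    # hypothetical ranking
print(out_of_place(["b", "a"], toy_model))     # every n-gram is known
print(out_of_place(["x", "a"], toy_model))     # "x" is not in the model
```

With only a handful of n-grams in a short query, a few characters that fall outside the model can dominate the score, whereas in a paragraph-length sample the common characters outweigh them.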

Because of this confusion on short strings, we had to disable Chinese detection on Japanese Wikipedia (and similarly, French detection on English Wikipedia) with the original implementation. I've been working on improvements that should allow us to re-enable both of those relatively soon!

Amire80

Oh! I didn't realize that it goes by character. I thought that it goes by word. This makes a lot of sense.

It would be great to resolve the Chinese-Japanese problem. There is research that shows that there is a lot of overlap in editors of these two wikis: https://arxiv.org/abs/1312.0976, and I'd guess that there's some overlap in the readers, too.

TJones (WMF)

Interesting paper. Thanks for the reference.

The Chinese-Japanese problem is definitely a hard one. I just checked my current best numbers more carefully in places where the two are in competition, and generally, even when both are enabled, only one is returning results (because the other is tuned to be more aggressive). This was surprising to me on the Japanese Wikipedia corpus I have. Bummer.
