Topic on Extension talk:CirrusSearch/Flow

Search in Japanese incorrectly parsed

Screaming Bell (talkcontribs)

Currently, we are having issues on Ylvapedia with Japanese searches due to the way phrases and words are parsed. Users are reporting that their search phrases are being broken up into individual words, rather than searching the full query (such as in the case of 2+ kanji/kana words). What configuration changes should we make to better support Japanese searches? For example:

https://ylvapedia.wiki/index.php?search=きのこ

Compare to:

https://ylvapedia.wiki/index.php?search=%22きのこ%22

The expected result for either search is for a page with the string "きのこ" to be the first result, rather than pages merely containing き, の, and こ.

Our software version page can be found here:

https://ylvapedia.wiki/wiki/Special:Version

TJones (WMF) (talkcontribs)

TL;DR: Looks like you have some older configuration, and possibly some older code. I'd suggest updating both if possible. Your language processing could be improved for your multi-lingual data, but I'm not sure that's all you need, so I've pinged a couple of other people who might be able to help further.

From the menu on your wiki, you are supporting English, Spanish, Portuguese, Chinese, Japanese, and Korean on one wiki. That is a lot going on!

According to your Cirrus Settings you have the CJK analyzer enabled, with no customization,[*] for your "text" field, and for the "plain" field (which is used with quotes) you have a fairly bare-bones configuration. Slightly re-ordered and reformatted:

"plain": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "standard",
    "filter": [ "lowercase" ]
},
...
"text": {
    "type": "cjk",
    "char_filter": [ "word_break_helper" ]
},

[*] Note that "word_break_helper" doesn't actually do anything in the "text" field because you are using the monolithic CJK analyzer. (That is an old bug in the configuration that was fixed quite a while ago.) Generally, if your "type" is anything but "custom", extra specifications for "char_filter", "filter", "tokenizer", etc. are ignored. It ought to throw an error, or at least a warning, but it does not. Anyway, "word_break_helper" converts a few characters (_.():) to spaces, which it is doing in the "plain" field.
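
If you're curious, the "word_break_helper" definition in your settings dump should look roughly like this mapping char filter (I'm going from memory here, so the exact list of mappings may differ a bit):

"word_break_helper": {
    "type": "mapping",
    "mappings": [
        "_=>\u0020",
        ".=>\u0020",
        "(=>\u0020",
        ")=>\u0020",
        ":=>\u0020"
    ]
}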

For reference, the CJK analyzer in Elasticsearch 7.17 (from your versions page) also uses the standard tokenizer (which splits CJK strings into single characters). But it then uses "cjk_bigram", which puts CJK characters back together in overlapping bigrams.
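
You can see the bigramming for yourself with the analyze API. Assuming Elasticsearch is on localhost:9200 (adjust the host for your setup), something like:

curl -s 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '{ "analyzer": "cjk", "text": "きのこ" }'

should return the two tokens きの and のこ rather than the three individual characters.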

So, on Japanese Wikipedia (which also uses the CJK analyzer, though further customized), when searching for きのこ (without quotes) I'd expect it to actually be looking for きの and のこ (the two overlapping bigrams from the query). The "plain" field will be looking for き, の, and こ as separate tokens (according to the "standard" tokenizer), but should look for them all in a row, because of the quotes.

And that's what I see in the detailed results dump for Japanese Wikipedia—there are lines for "text:きの" and "text:のこ" (the overlapping bigrams). (You also see "text.plain:きのこ", but that's because it's using a different tokenizer for the plain field—more on that in a minute.) On Ylvapedia, I see "all.plain:の", "all.plain:き", and "all.plain:こ" for most results, and "all:のこ" and "all:きの" for a few.

The first unexpected thing is that the Ylvapedia results are coming from the "all" field. I also noticed, comparing the Ylvapedia Cirrus Config to Japanese Wikipedia's, that your profiles are different; I particularly noticed "CirrusSearchFullTextQueryBuilderProfile", though I'm not sure that's the most important thing. This is outside my area of expertise, because I'm usually optimizing for the config we have on Wikipedia and its sister projects. Overall, while you seem to be up to date on your software, the plain monolithic CJK config is pretty old, so it seems like you have some old configuration files... and maybe old code, too. I also see that you have Semantic MediaWiki enabled, which is another unknown for me. I'm not sure what to suggest, but maybe @DCausse (WMF) or @EBernhardson (WMF) will have some good ideas, or at least better questions.

I also don't know what Elastic plugins you have installed (on the command line, you can do something similar to bin/elasticsearch-plugin list, or curl elastichost:9200/_cat/plugins). Given your multilingual setup, if you can handle the complexity, I'd suggest separate indexes for each language, like we do for Commons, but that is definitely a lot. Otherwise, I'd suggest at least installing the ICU plugin and using the "icu_tokenizer": it has a decent dictionary for CJK and other East Asian spaceless languages, and gives more targeted results. (It isn't perfect, but any complex approach to CJK parsing is going to have some errors here and there.)

Maybe something like this:

"plain": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "icu_tokenizer",
    "filter": [ "lowercase" ],
    },
...
"text": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "icu_tokenizer",
    "filter": [ "cjk_width", "lowercase" ]
}
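
Before committing to anything, you can sanity-check how the ICU tokenizer segments a given string by passing an ad hoc analyzer to the analyze API (this assumes the ICU plugin is already installed, and again a localhost setup):

curl -s 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
    "tokenizer": "icu_tokenizer",
    "filter": [ "cjk_width", "lowercase" ],
    "text": "きのこ"
}'

Ideally きのこ comes back as a single token; as I said, dictionary-based segmentation isn't perfect, so it's worth trying a few representative queries.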

If you want to use a stop word list, you could use the one used by CJK (which is all English words), use the default English list (as used by the English analyzer), or build a custom stop word filter combining lists for English, Spanish, and Portuguese (possibly dropping any that look like they might cause cross-language interference; nothing really jumped out at me on a quick glance, though).
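
If you go the custom route, the filter itself is simple. A sketch, where "combined_stop" and the handful of words are just placeholders for whatever name you pick and the merged English/Spanish/Portuguese lists:

"filter": {
    "combined_stop": {
        "type": "stop",
        "stopwords": [ "the", "and", "of", "el", "la", "los", "de", "um", "uma" ]
    }
}

You'd then add "combined_stop" to the "filter" chain of the "text" analyzer above.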

I'd personally also prefer to use "icu_normalizer" instead of "lowercase" in both (and then you also shouldn't need "cjk_width", if I recall correctly).
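
With that change (the "icu_normalizer" token filter comes with the same ICU plugin), the "text" analyzer would look more like:

"text": {
    "type": "custom",
    "char_filter": [ "word_break_helper" ],
    "tokenizer": "icu_tokenizer",
    "filter": [ "icu_normalizer" ]
}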

Also, the "icu_normalizer" has a couple of problems with the standalone forms of handakuten and dakuten (they get regularized to the combining forms with an extra space), so you might want to copy the "cjk_charfilter" from either the Commons or Japanese configs if your editors commonly use those standalone forms; it seems like the combining forms are more common, though.
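
I don't have the exact "cjk_charfilter" definition in front of me, so copy it from one of those configs rather than from here, but my recollection is that it's a mapping char filter that maps the standalone marks (U+309B/U+309C) to the combining forms (U+3099/U+309A) before normalization, something roughly like:

"cjk_charfilter": {
    "type": "mapping",
    "mappings": [
        "\u309b=>\u3099",
        "\u309c=>\u309a"
    ]
}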
