Topic on Extension talk:CirrusSearch

Greek search no longer truly diacritics insensitive

9 comments • 08:07, 19 October 2024 1 month ago

Spiros71 (talkcontribs)

For example, go to https://en.wiktionary.org/wiki/Wiktionary:Main_Page and try inputting ανθρωπος. Two existing entries will not appear: άνθρωπος and ἄνθρωπος. The same can be seen in my recent upgrade to ElasticSearch 7.10.2, core ICU plugin and extra:7.10.2-wmf12. Go to https://lsj.gr/ and try inputting σιφων. Missing entries will appear when using σίφων (σίφων and σίφωνας). Any advice on how to remedy this would be warmly appreciated!

Reply 12:03, 13 October 2024 1 month ago

TJones (WMF) (talkcontribs)

After writing up everything below, I realized I'm diagnosing the current behavior because @EBernhardson (WMF) thought this might be related to some recent work I did on diacritic folding, but now I don't think that's it. The info below might still be helpful, though. It's possible that there have been some changes to the weighting of exact prefix matches in suggestions, so I'll also invite @DCausse (WMF) to weigh in. He's more likely to remember any autocomplete changes that weren't so recent.

I believe you are talking about the drop-down list of suggestions (which we call the "autocomplete" suggestions), since ἄνθρωπος and άνθρωπος are the top two results in the full search results list for ανθρωπος.

The autocomplete search isn't truly insensitive to anything—including case, spaces, punctuation, and diacritics—in that exact matches can always be ranked a little better than inexact matches.

For case, consider the autocomplete suggestions for hun, Hun, ȟun, and hün on English Wiktionary:

hun: hun, hunger, hunt, hundred, hund, Hund, Hun, hung...
Hun: Hun, hunger, hun, hunt, hundred, hund, Hund, hung...
ȟun: hunger, hun, hunt, hundred, hund, Hund, Hun, hung...
hün: hün, Hündin, Hüne, hünkâr, Hündchen, hündür, hünnap...

I think the ȟun results are the "truest" results for h+u+n because there are no exact matches. In the other cases, exact matches (hun, Hun) become the first result, and exact prefix matches (everything starting with hün..) can also rank higher.

Note that if you add spaces to hün and search for h ü n, you get the same list as for ȟun above, because spaces can be ignored, and there are no exact matches or exact prefix matches with those spaces.

The problem with ανθρωπος is that it has many exact prefix matches (for those following along who don't read Greek, it's "anthropos", which is the beginning of way more than ten other Greek words), so they rank higher than άνθρωπος and ἄνθρωπος and push them out of the top ten suggestions. If you instead search for ἇνθρωπος (analogous to ȟun in the examples above), you get the results that I think you expect, with ἄνθρωπος and άνθρωπος as the first two suggestions because there is no exact match or exact prefix match.
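The recall-then-rank behavior described above can be illustrated with a toy model. This is only a sketch, not CirrusSearch's actual completion code: the function names (`light_fold`, `full_fold`, `suggest`), the three-tier ranking, and the small title list are all assumptions made for illustration, and the limit is set to 2 so the truncation effect is easy to see.

```python
import unicodedata


def light_fold(text: str) -> str:
    """Case-fold only. Note that str.casefold() also maps final sigma (ς) to σ."""
    return unicodedata.normalize("NFC", text).casefold()


def full_fold(text: str) -> str:
    """Strip combining marks (accents, breathings), then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def suggest(query: str, titles: list[str], limit: int = 10) -> list[str]:
    """Recall on fully folded prefixes, then rank exact matches first."""
    q_light, q_full = light_fold(query), full_fold(query)

    # Recall phase: diacritics are ignored entirely.
    recalled = [t for t in titles if full_fold(t).startswith(q_full)]

    # Ranking phase (assumed tiers): exact match, exact prefix match, the rest.
    def rank(title: str) -> int:
        t = light_fold(title)
        if t == q_light:
            return 0
        if t.startswith(q_light):
            return 1
        return 2

    # sorted() is stable, so ties keep their original order.
    return sorted(recalled, key=rank)[:limit]


# Hypothetical index contents, chosen only to illustrate the effect.
TITLES = ["άνθρωπος", "ἄνθρωπος", "ανθρωποσφαγή", "ανθρωποσοφία", "ανθρωπολογία"]

if __name__ == "__main__":
    # Unaccented query: exact prefix matches crowd out the accented entries.
    print(suggest("ανθρωπος", TITLES, limit=2))
    # Query with an "odd" diacritic: no exact (prefix) matches, so nothing is boosted.
    print(suggest("ἇνθρωπος", TITLES, limit=2))
```

In this toy model the first query returns only the two unaccented longer words, while the second returns άνθρωπος and ἄνθρωπος, which mirrors the difference described above.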

Unless I'm misreading the diacritics (which is 100% possible with Greek diacritics!) it looks like σιφων does the right thing on both English Wiktionary and LSJ, presumably because there aren't as many exact prefix matches competing for space in the suggestions list.

As for remedies, it depends on what you are looking for. If you want exact prefix matches to count for less, or for diacritics to be completely ignored, I'm not sure there's anything to be done. It might help in this case, but it would cause problems in general.

If you want a solution that you, as a savvy searcher, can use in cases like this where you know or suspect that there might be relevant results that differ by diacritics but which are being swamped by exact prefix matches, you can use the space hack we used for h ü n: if you search for α ν θ ρ ω π ο ς (or less ridiculously, just α νθρωπος or ανθρωπο ς), then you get suggestions without any exact prefix re-ranking. Of course there is always the chance that you get some exact prefix matches after adding one space. If there are too many, add another space—not a great solution, but it works.

For less savvy searchers, hitting return will give you the full-text search results, which do not look for arbitrary prefix matches (though stemming matches can still be prefixes), and at least in this case, the desired results are the top two.

(Note: I've been testing in the search bar on the search results page rather than the search box at the top of the page. These are usually the same, but weird differences can occur. The only thing I've noticed today is that the two boxes seem to use different events to trigger autocomplete searches. Editing hun to hün gives different results because typing ü on my American keyboard uses dead keys, which trigger Javascript events in the big search results search box, but not in the search box at the top of the page. Historical UI cruft, that is. Sigh.)

Reply 21:27, 15 October 2024 1 month ago

Spiros71 (talkcontribs)

Tray, that is a very thorough and exhaustive reply as usual!

The points I am making are:

1) I can see a clear change on this from the times of the ElasticSearch 5.6 implementation, and

2) usability (for Greek and Ancient Greek)—being able to get what one is looking for with the minimum effort. When it comes to Ancient Greek (polytonic) many "weird" accents/spirits are used which are not readily available in most keyboards, cases, etc. and users prefer to omit them (this is also typical of how Greek users search on Google even for Modern Greek which only has one accent/spirit). So, in the specific example, using ανθρωπος I would expect to get two search results in autocomplete which are "perfect" matches (minus the diacritics of course). But I do not get these results! A savvy user or a scholar "might" use the full diacritics version (speaking of Ancient Greek here), but the average user will be dumbfounded as they get no results at all with the no-diacritics approach. Also, yes, one could hit search and still get them, but the point of autocomplete is faster access to information.

I am not advocating a sweeping approach here for all languages, as I am not an expert, but I can see clearly the benefit for Greek and Ancient Greek.

Reply 08:19, 16 October 2024 1 month ago

DCausse (WMF) (talkcontribs)

Regarding ανθρωπος and άνθρωπος and ἄνθρωπος on English Wiktionary:

These two results are found at position 11 and 12: https://en.wiktionary.org/w/api.php?action=opensearch&format=json&formatversion=2&search=%CE%B1%CE%BD%CE%B8%CF%81%CF%89%CF%80%CE%BF%CF%82&namespace=0&limit=12
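For anyone who wants to check positions beyond the ten displayed suggestions, the request above can be rebuilt programmatically. This is a minimal sketch using only the Python standard library; the helper names and the User-Agent string are placeholders I made up, and `fetch_titles` assumes the usual opensearch response shape in which the second element is the list of titles.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wiktionary.org/w/api.php"


def opensearch_url(query: str, limit: int = 12, namespace: int = 0) -> str:
    """Build the same opensearch URL as the one linked above."""
    params = {
        "action": "opensearch",
        "format": "json",
        "formatversion": 2,
        "search": query,
        "namespace": namespace,
        "limit": limit,
    }
    # urlencode percent-encodes the Greek query as UTF-8.
    return f"{API}?{urllib.parse.urlencode(params)}"


def fetch_titles(query: str, limit: int = 12) -> list[str]:
    """Fetch suggestion titles (performs a network request)."""
    request = urllib.request.Request(
        opensearch_url(query, limit),
        headers={"User-Agent": "autocomplete-check/0.1 (example script)"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data[1]  # assumed shape: [query, [titles], [descriptions], [urls]]


if __name__ == "__main__":
    print(opensearch_url("ανθρωπος"))
```

Raising `limit` above 10 is what makes the suggestions that were pushed off the visible list show up.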

Unfortunately we display only 10.

If you enter Special:Search these two should move back to the top: https://en.wiktionary.org/w/index.php?go=Go&search=%CE%B1%CE%BD%CE%B8%CF%81%CF%89%CF%80%CE%BF%CF%82&title=Special%3ASearch&ns0=1

Unfortunately, the completion search only ranks higher the one suggestion that is a perfect exact match. It does not rank suggestions that appear to be fully written titles above the ones that appear to be partially written. We know this is not quite perfect, but we don't yet have a solution for it.

Another cause is also that completion prefers suggestions that match a prefix with its accents:

ανθρωποσφαγή is preferred over άνθρωπος when searching ανθρωπος

Note that ς is just considered identical to σ here.

If this issue is quite recent, I'm not sure what could have caused it; I don't think anything changed in the software that could have directly caused this behavior. Could it be that more pages being added over time caused these suggestions to slip out of the 10 displayed results?

See phab:T132637 for when we first implemented diacritics folding for Greek; the example query αθανατος used at the time to report the bug is still working as expected.

Reply Edited 08:36, 16 October 2024 1 month ago

Spiros71 (talkcontribs)

Yes, David, you pointed very aptly to some of the culprits here:

Another cause is also that completion prefers suggestions that match a prefix with its accents:

ανθρωποσφαγή is preferred over άνθρωπος when searching ανθρωπος
note that ς is just considered identical to σ here

My point is that ς considered identical with σ is something that could resolve such cases. The former is of course only used at the end of a word. And quoting from that phab issue, I concur with Tray:

French speakers usually have no trouble typing French diacritics, but they may have no idea how to type Ancient Greek polytonic diacritics—which speakers of Modern Greek may also have trouble with, just as speakers of Modern English usually don't know how to type ð, þ, æ, or ē, despite them all being used in the first few lines of Beowulf! Hwæt! (You call me a language nerd, now I gotta act like one.)

Reply Edited 08:33, 16 October 2024 1 month ago

TJones (WMF) (talkcontribs)

I can see a clear change on this from the times of the ElasticSearch 5.6 implementation

Wow.. that was 5 years ago for us, so I can't recall every change that might have been relevant in that time. Not sure when it would have changed.

I understand your usability argument, but it is often the case in search engineering that optimizing for one use case breaks others. We are already ignoring the Greek diacritics for the recall phase, but the exact matches come into play for the ranking phase. It's an issue for ανθρωπος (ignoring final sigma, see below) because there are so many words without diacritics that match better.

There's been a similar complaint about overly exact case matching (T364888), but I don't think we can only ignore Greek diacritics or only ignore case for ranking, which, on English Wiktionary for example, would mean that typing an would give "exact" matches with an, àn, ån, án, än, ân, An, Ân, ãn, ān, ăn, ản, ǎn, Ấn, ấn, ẩn, and ắn. (Those are the top full-text results, though.. and I missed aN!) You could argue these results are less usable in autocomplete, since most people most of the time will not be looking for them (on English Wiktionary).

We also were tossing around ideas for improving full-title matches, which could have similar side-effects for short queries. (This also applies, albeit less voluminously, to queries longer than 2 letters, but I stopped looking for details because it's a lot of manual searching for examples since autocomplete doesn't work that way at the moment.)

There's always a trade-off, and having to fall over to full-text search is not the worst trade-off.

I've opened a ticket for the final/non-final sigma issue (T377495), though I'm not sure it will help you. It definitely makes sense on Greek-language wikis, but not as much for non-Greek wikis, like English Wiktionary. (LSJ looks to be using English as its analysis language, too.)

You should be able to set CirrusSearchICUNormalizationUnicodeSetFilter and CirrusSearchICUFoldingUnicodeSetFilter to "[^ς]" in mediawiki/extensions/CirrusSearch/extension.json in your LSJ installation to exempt ς from folding, but that would disable the ς to σ mapping everywhere (autocomplete, full-text, template lookups, etc.), and it still won't work if your language is set to Greek because the Greek-specific lowercase filter also maps ς to σ.. everyone really wants that mapping to happen! But it's an immediate config option that might help.
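To make the effect of exempting ς concrete, here is a small Python emulation. It is not the ICU filter or any CirrusSearch code: `fold` and its `exempt` parameter are hypothetical stand-ins for "fold everything except the characters in this set", which is roughly what a "[^ς]" filter asks for.

```python
import unicodedata


def fold(text: str, exempt: frozenset = frozenset()) -> str:
    """Strip diacritics and case-fold each character, skipping exempt ones."""
    out = []
    for ch in text:
        if ch in exempt:
            out.append(ch)  # left untouched, like a character outside the filter set
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        out.append(base.casefold())  # casefold() maps ς to σ
    return "".join(out)


KEEP_FINAL_SIGMA = frozenset("ς")

if __name__ == "__main__":
    query, longer, accented = "ανθρωπος", "ανθρωποσφαγή", "άνθρωπος"

    # Default folding: ς becomes σ, so the query is a prefix of the longer word.
    print(fold(longer).startswith(fold(query)))

    # With ς exempt, the final sigma no longer matches the medial σ...
    print(fold(longer, KEEP_FINAL_SIGMA).startswith(fold(query, KEEP_FINAL_SIGMA)))

    # ...while the accented and unaccented forms still fold to the same string.
    print(fold(accented, KEEP_FINAL_SIGMA) == fold(query, KEEP_FINAL_SIGMA))
```

The trade-off is the one mentioned above: once ς is exempt, a word typed with σ in final position stops matching the ς spelling everywhere that uses this folding.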

Reply Edited 19:32, 17 October 2024 1 month ago

Spiros71 (talkcontribs)

Interestingly, Tray, ανθρωπος appears not to be an issue in my case (https://ibb.co/StV1J6M). There is one other funny thing happening, though. I'm not sure if this is up your alley (David included), but I do get search results for a non-existent page (https://ibb.co/HgrzHpc): σῑ́φων.

Reply 20:43, 17 October 2024 1 month ago

TJones (WMF) (talkcontribs)

With respect to the autocomplete of ανθρωπος on LSJ, we know the recall portion of autocomplete gets everything we'd want, but the ranking is where things go awry. Would it make sense on LSJ for those ἄνθρωπος and άνθρωπος to be much more popular? I don't recall all the factors that go into ranking the autocomplete results, but different stats on your site could lead to different rankings that could overpower the exact prefix match advantage.

As for the non-existent page for σῑ́φων, I can't reproduce it, which I think is because I'm not logged in so I don't get offers to create pages. My guess is that either some normalization isn't happening (ι + ̄ + ́ vs ῑ + ́) or there's an invisible character (soft hyphens will cause this and are common enough in non-Greek contexts), etc. An example of a lack of normalization you can easily see is that searching for GrEeK LaNgUaGe on enwiki will offer to let you create that exact page.
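One way to check for either cause is to compare the two strings code point by code point. The sketch below is a diagnostic aid only, not what MediaWiki does internally; the helper names are made up, and stripping only the soft hyphen (U+00AD) is a simplifying assumption.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def codepoints(text: str) -> list[str]:
    """List code points so invisible or combining characters become visible."""
    return [f"U+{ord(ch):04X}" for ch in text]


def comparable(text: str) -> str:
    """Drop soft hyphens and apply NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text.replace(SOFT_HYPHEN, ""))


# Two spellings of the same word: iota + macron + acute vs. precomposed ῑ + acute.
decomposed = "σ\u03b9\u0304\u0301φων"
precomposed = "σ\u1fd1\u0301φων"

if __name__ == "__main__":
    print(codepoints(decomposed))
    print(codepoints(precomposed))
    print(decomposed == precomposed)                          # raw strings differ
    print(comparable(decomposed) == comparable(precomposed))  # equal once normalized
```

Pasting the page title and the search string into `codepoints` should show quickly whether the difference is a normalization form or a stray invisible character.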

(BTW, it's "Trey" with an "e".. cognate with τρεις, no less.)

Reply 17:12, 18 October 2024 1 month ago

Spiros71 (talkcontribs)

Τριάκις, thanks, Trey! Yes, I think the stats would make the difference as άνθρωπος is a very common word.

Reply 08:07, 19 October 2024 1 month ago
