Diacritic overfolding in Vietnamese

Mxn (talkcontribs)

Hi Trey, thanks for your recent blog post – it's a good overview of many of the challenges I encounter in multilingual text processing as a software engineer, not only in search.

Since you mentioned Vietnamese, I'd like to call your attention to phab:T78485: if the user enters search terms that contain diacritics, especially tone marks, MediaWiki should not direct the user to any other article title that matches only the base letters but not the diacritics. This is important because Vietnamese words pack a lot of meaning into diacritical marks. There are a great many minimal pairs of 6–12 three-character-long words that differ only by diacritics, especially in proper names.

You have a point that we can't rely on users to always enter all the diacritical marks. But if they enter any diacritical marks, they expect those particular diacritical marks to be respected for the most part. The impact of diacritic-folding already marked text is similar to redirecting a query for "résumé" to "resume" in English. Sometimes users enter the wrong diacritics, especially when using the VNI input method or the "VIQR" keyboard on iOS, which both place all the diacritic keys next to each other. But such mistakes can be counted less than a base-character difference when calculating edit distance; these typos don't necessarily require diacritic folding.
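To make the edit-distance point concrete, here is a minimal sketch of a weighted Levenshtein distance in which a substitution that changes only the diacritics costs less than one that changes the base letter. This is purely illustrative and not how MediaWiki or Elasticsearch computes anything; the names base_letter and weighted_distance and the 0.5 cost are assumptions made up for the example.

```python
import unicodedata

def base_letter(ch):
    # Decompose the character and drop every combining mark.
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def weighted_distance(a, b, diacritic_cost=0.5):
    # Work on NFC strings so each accented letter is a single code point.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [float(i)]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                sub = 0.0
            elif base_letter(ca) == base_letter(cb):
                sub = diacritic_cost  # same base letter, different marks
            else:
                sub = 1.0  # different base letter
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + sub))
        prev = cur
    return prev[-1]
```

With these assumed costs, "hộp" and "hợp" are closer to each other than "hộp" and "hấp". Note that đ does not decompose, so this sketch treats đ and d as different base letters.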

What's more, Vietnamese organizes the marks into two tiers: one tier (such as circumflexes) is considered part of the base letter, while another tier of tone marks is considered separate from the base letter. Traditionally, the tone marks apply to the word as a whole. Analytics bear out the fact that, in an autocompleting textbox, users commonly enter some diacritics while omitting the tone marks until after they spell out the whole word. So if anything, Vietnamese queries should be evaluated three times: first literally, then after folding the tone marks in the target text, then after folding all diacritics in the target text. But diacritics in the query should never be folded automatically.
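The three passes could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the tone marks are taken to be the five combining marks listed in the comment, and matching is whole-title equality rather than anything a real search index does.

```python
import unicodedata

# Combining marks that encode Vietnamese tones:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def nfc(text):
    return unicodedata.normalize("NFC", text)

def fold_tones(text):
    # Drop tone marks only; circumflex, breve and horn stay on the letter.
    decomposed = unicodedata.normalize("NFD", text)
    return nfc("".join(c for c in decomposed if c not in TONE_MARKS))

def fold_all(text):
    # Drop every combining mark, and map đ/Đ, which do not decompose.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return kept.replace("đ", "d").replace("Đ", "D")

def tiered_match(query, titles):
    # The query is never folded; only the target titles are.
    query = nfc(query)
    for fold in (nfc, fold_tones, fold_all):
        hits = [t for t in titles if fold(t) == query]
        if hits:
            return hits
    return []
```

For example, the query "trương hơp" would match the title "trường hợp" in the second pass, while the fully marked query "trường hộp" would not match "trường hợp" in any pass.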

TJones (WMF) (talkcontribs)

Hi Minh, sorry that phab task has been languishing for so long. Unfortunately it is from before the early days of what is now the Search Platform team, so it is on the wrong phab board... or, rather, it is not also on the right one. I've added it to our main board and put it in "needs triage" status so the team will discuss it next week. It still has to make its way through our triage and prioritization process, and it'll probably end up on the backlog for now, but I'd prioritize it as high.

I'm not 100% sure what's going on, but I think I've got it down to the fact that the "Go" feature (which is what you get when you hit the "search" button at the top of the page) has a secondary index that uses ASCII folding. The primary index is boosted more strongly, and it uses ICU normalization, which is much less aggressive.

A practical short-term fix would be to upgrade Vietnamese to use ICU folding, because that allows for folding exceptions. ASCII folding does not (which presents a potential problem for third party MediaWiki users without the ICU plugin, but we can think about that more later).

I may need to check with our Elastic expert but I'm not sure that we have separate analyzers for query text and title text in the Go feature (we do for full text, for example). He's not available this week, so I may not get a quick answer.

Assuming we don't have separate analyzers for the Go feature, would it be better for the secondary index to maintain the full diacritics (e.g., "trường hộp" & "trường hợp"), or to drop the tone diacritics but preserve the base letters ("trương hôp" & "trương hơp", if I did that correctly)? It sounds like the second option is better, but I defer to your opinion. Looking more carefully, the first option makes the secondary index identical to the primary index, other than a 5000 character limit, which shouldn't come up too much. Also, does the same level of folding (either none or just removing tone diacritics) make sense for full text results? The exceptions are linked in our current config.

If just adding the ICU upgrade and a mapping to remove tone diacritics is enough, I might be able to get to this much sooner as a 10% project outside our team's normal plans. (I've got a soft spot for language fixes that are approaching 10 years old! Ask me about Crimean Tatar transliteration!)

Final question for now! While I can see that this is terribly annoying, is it particularly common? It seems that it requires the exact match to fail and then for there to be exactly one alternative after folding. Given the number of similar syllables, I wouldn't think that happens a lot. I typed cờm and hit "search", and I got rolled over to full text results because there are at least 7 similarly-folded alternatives (cơm, cớm, còm, cộm, cốm, cỡm, cỏm). I guess it's more likely with longer titles like trường hộp / trường hợp because most similarly-folded combinations of syllables don't occur.
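As a rough sketch of the behavior described above (my reading of it, not the actual Go feature code): an exact title match wins, otherwise the query and titles are folded, and the user is redirected only if exactly one title survives. The names ascii_fold and go_target are hypothetical, and the folding here is a simple stand-in for real ASCII folding.

```python
import unicodedata

def nfc(text):
    return unicodedata.normalize("NFC", text)

def ascii_fold(text):
    # Stand-in for ASCII folding: strip combining marks and map đ/Đ.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace("đ", "d").replace("Đ", "D")

def go_target(query, titles):
    # Return the title to jump to, or None to fall back to full text results.
    query = nfc(query)
    titles = [nfc(t) for t in titles]
    if query in titles:
        return query
    folded = ascii_fold(query)
    candidates = [t for t in titles if ascii_fold(t) == folded]
    return candidates[0] if len(candidates) == 1 else None
```

Under these assumptions, "cờm" against the seven titles listed above returns None (too many folded alternatives), while "trường hộp" against a wiki that only has "trường hợp" jumps to the wrong title.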

dwim

קיפודנחש (talkcontribs)

hi. saw your message on arwiki.

did you create one for hebrew perchance? or if not, could you?

this is a small pebble in the shoe on hewiki, and maybe one more small excuse to stay with "old vector" for veterans who miss it on the new skin.

thanks for all your work,

peace -

קיפודנחש (talkcontribs)

i might have misunderstood. i thought your patch was for dwim on new vector/new search (it does not work). i see now that it was specific to issues with dwim in Arabic. apologies.


TJones (WMF) (talkcontribs)

I'm a language guy, not a UI guy, so I was only working on the Arabic part of the Arabic DWIM. I talked some with the team working on the new skin and asked them to try to keep DWIM working, but it was not a simple fix and not high enough priority. I'm hoping there will eventually be a straightforward way to keep it working for Hebrew and Russian, too.

קיפודנחש (talkcontribs)

there is an open task on phabricator to re-enable DWIM (which really means implementing a completely new design) for the new search.

i mistakenly associated your message on arwiki with this story, hence my first (erroneous) post here.

(as the original author of the "old dwim", i have some "emotional attachment" to the issue...)

peace -

Regex searches timing out

Speravir (talkcontribs)

It’s probably off topic in Shift sensitivity: I recently ran into a timeout with intitle on Commons. I searched for SVG files with the file extension in upper case letters, and even narrowing down to SVG files with filemime wasn’t enough (over 1.7 million files). Luckily I did not need all of them, just some examples for pointing to a script bug related to these upper case files. I probably thought of the timeout danger only because of that incident.

TJones (WMF) (talkcontribs)

That makes sense. There's always context and exceptions and exceptions to the exceptions. Commons has many more files than even the biggest Wikipedias, and the ratio of body text to title is much lower. And of course what you are searching for matters, too—I changed my example from dog (which doesn't time out) to mark (which does) because I can see wanting to distinguish mark/Mark. It's hard to give specific advice to an IP address user with so little context!

Followup on the link suggestion tool

Ostrzyciel (talkcontribs)

Hi! At the search team meeting you asked me why I decided to decline the titles we look for before searching instead of relying on a stemmer. I tried a few things and I seem to remember why now :D

The biggest problem is we're looking for exact matches of an inflected title, so we can't use the standard "no operator" search mode that uses stemming. A partial result or a result with two matching words separated by other words is useless in our case, so we have to use double quotes. I don't think it's possible to search the stemmed text with exact matching… or is it? To be honest I'm not much of an expert in Elastic, so I may be wrong here :)

Another problem is that we sometimes don't want certain words to be declined. For example, "Sejm Rzeczpospolitej Polskiej" (our parliament thing) has the "Rzeczpospolitej Polskiej" part fixed in that case. We would decline it as Sejm, Sejmu, Sejmowi, etc. without changing the rest of the name. This is a rare case though, as invalid forms that may stem from that are rather unlikely to occur in articles.

So that's why :)

TJones (WMF) (talkcontribs)

Thanks for the information! There are always interesting new use cases to be discovered.

Not surprisingly, we focus on search on Wikipedia and its sister projects, but we certainly want to help other users of MediaWiki when we can. The information below is based on my experience and testing with Wikipedia, etc., but may be helpful to you.

The tilde operator (~) is terribly and confusingly overloaded in our search. I think it means "do something different"—but what is different changes for each use case.

One use case is after a phrase in quotes, like "Sejmowi Rzeczpospolita polski"~. In this case, it maintains the order of the words and doesn't allow any words in between, but still allows stemming. So, on Polish Wikipedia, "Sejmowi Rzeczpospolita polski"~ brings up good-looking results even though there are no exact matches. The ranking for the phrase-with-tilde is not great because there aren't any exact phrase matches.

In general, you can add &cirrusDumpQuery to a query to see the full query we build up to send to Elasticsearch:

  • "Sejmowi Rzeczpospolita polski"~
    • this stems the words, but keeps them in order
    • the query string is "Sejmowi Rzeczpospolita polski"
  • "Sejmowi Rzeczpospolita polski"
    • this uses our "plain" fields, which don't do stemming
    • the query string is (title.plain:"Sejmowi Rzeczpospolita polski"~0^20 OR redirect.title.plain:"Sejmowi Rzeczpospolita polski"~0^15 OR category.plain:"Sejmowi Rzeczpospolita polski"~0^8 OR heading.plain:"Sejmowi Rzeczpospolita polski"~0^5 OR opening_text.plain:"Sejmowi Rzeczpospolita polski"~0^3 OR text.plain:"Sejmowi Rzeczpospolita polski"~0^1 OR auxiliary_text.plain:"Sejmowi Rzeczpospolita polski"~0^0.5)
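If you want to script that comparison, a small helper like the one below can build the URLs. This is a sketch under assumptions: the /w/index.php path is the usual Wikimedia layout, the parameter value "1" is arbitrary (only the presence of cirrusDumpQuery is mentioned above), and dump_query_url is a made-up name.

```python
from urllib.parse import urlencode

def dump_query_url(query, host="pl.wikipedia.org"):
    # Assumed URL layout; cirrusDumpQuery asks for the generated query
    # to be shown instead of the normal results page.
    params = {"search": query, "cirrusDumpQuery": "1"}
    return f"https://{host}/w/index.php?{urlencode(params)}"

# Phrase with tilde (stemmed, in order) versus the plain quoted phrase.
print(dump_query_url('"Sejmowi Rzeczpospolita polski"~'))
print(dump_query_url('"Sejmowi Rzeczpospolita polski"'))
```

Opening the two printed URLs side by side makes it easy to see which fields each form of the query is sent to.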

Maybe some of that will help if you decide to implement stemming on your project. Feel free to come back to our office hours and talk to us again if you want to chat more!

Ostrzyciel (talkcontribs)

Thanks! I had no idea CirrusSearch could do that.

CirrusSearch suggestion (^ and $ anchors)

Zabavuju flašku chlastu maskovanou jako zubní pastu (talkcontribs)

Hi, thank you for solving the "blocking" thread. And since I can see you are probably a CirrusSearch developer (?), I'd like to show you my suggestion to the Community Wishlist Survey 2020. What do you think? I personally ran across some cases where having ^ and $ anchors would have helped a lot.

TJones (WMF) (talkcontribs)

Thanks for contacting me! Yeah, I'm on the Search Platform team at WMF, so I do work with CirrusSearch and the underlying technology stack.

I think it makes a lot of sense for intitle searching, especially Wiktionary. I'm not sure about whole documents, but with the multiline option it could still be useful there. I've also run into some cases where it would have been helpful.

There are a couple of potential hold-ups. The Community Tech team, which sponsors the Community Wishlist, may not have the skills needed to work on this project, though we (the Search Platform team) do look at the Community Wishlist, too, and see if there are promising projects there that we should take on. We don't follow the Community Wishlist timeline, though. Also, it might turn out to be too expensive in multiline mode on large Wikipedia docs on big wikis; I'm not sure, and I don't think it should be, but it's possible.

That said, I do think it's definitely worth proposing and discussing!


Quotes in search

Alsee (talkcontribs)

Hi. I just came across your message on wikimedia-l, specifically the part about "removing quotes within queries by default".

I think the focus on avoiding zero-results is leading to a misstep here. The true goal is to have the best search engine with the most useful results. If I'm using quotes for an exact phrase search, and that phrase doesn't exist, then "zero hits" is the exact answer I wanted! That's far more valuable than digging through junk results, trying to figure out whether my quoted phrase exists.

If you still want to avoid a zero-result, just do what Google does. Give the zero-hit answer and re-run the search:

No results found for "foo bar baz".
Results for foo bar baz (without quotes):
TJones (WMF) (talkcontribs)

Hi @Alsee. Sorry for the confusion. The short version is that our long term plan is to do what you suggest.

The longer version:

While the zero results rate is an easy indicator to compute, we do recognize that it is low resolution and of limited value. A big swing up or down is a cause for concern—so it's a good metric to track on the dashboards—but getting it to zero is no longer a goal. One of my earliest write-ups covers lots of cases that do, in fact, deserve no results (the write up itself is a bit of a mess—sorry).

Mikhail's Zero to Hero analysis, which Deb linked to in the email, highlights the text characteristics that are most often associated with zero results. While zero results may be appropriate, a very high failure rate points to places where we could possibly make improvement.

An area that I'd identified earlier in my research was queries in the wrong language, so now we run language detection on poorly performing queries for some Wikipedias and search other more appropriate Wikipedias. As an example, a search in Russian on English Wikipedia can show results from Russian Wikipedia.

Two areas that Mikhail's report found as potentially high-impact (both relatively common and relatively unsuccessful query types) were queries with question marks and queries with quotation marks. I did a quick analysis of both and found that they did look promising. This led to a more thorough analysis of dropping question marks, and eventually a change in the question mark syntax that makes naive use of question marks behave as a naive searcher would expect.

Quotes are harder, because as you point out, the query you intend (with quotes) and the modified query (without quotes) are not the same query. We would of course want to show the "before" and "after" like we do with "Did You Mean" queries, and cross-wiki searches based on language detection (as above).

The actual implementation of quote removal is complicated by the fact that it could interact with Did You Mean, language detection, sister-wiki results (as discussed in Deb's email), the modified question mark syntax, and other forms of "second chance" searches we might implement in the future. We have an outline of the problem and the beginning of a discussion of how to deal with it, but it's not high on the priority list right now.

So, to sum up, what Deb was referring to in her email was the idea of automatically/"by default" taking poorly-performing queries that have quotes and re-processing them without the quotes, rather than relying on the user to do it, if they choose to and they realize that it could help. The UI for such a process would include the before and after versions, as you suggest. Again, sorry for the confusion.
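A minimal sketch of that kind of "second chance" search, keeping both the before and after queries so the UI can show them. This is not our implementation: search_with_quote_fallback is a hypothetical name, the search argument is any callable you supply, and "poorly performing" is simplified here to "zero results".

```python
def search_with_quote_fallback(query, search):
    # `search` is any callable that takes a query string and returns a list.
    results = search(query)
    if results or '"' not in query:
        # Good enough, or nothing to strip: report the query as entered.
        return {"ran": query, "rewritten_from": None, "results": results}
    unquoted = query.replace('"', "")
    # Keep the original query so the UI can say what was changed.
    return {"ran": unquoted, "rewritten_from": query, "results": search(unquoted)}
```

A result with rewritten_from set corresponds to the "No results found for ... / Results for ... (without quotes)" display suggested above.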

Alsee (talkcontribs)

Thanks, sounds good. The links you gave were interesting too. It's a very curious detail that zero-results from Ireland are 1/3 the zero rate from Australia. By the way, I found and fixed the 'paperr' typo mentioned in your research. Chuckle. It remained in that article for two years.

TJones (WMF) (talkcontribs)

I wonder if the Australian numbers are still coming from the National Library of Australia. They had a glitch that seemed to be converting quotation marks to &quot; and then presumably sanitizing that to plain quot, which makes for a poor search term.

Thanks for fixing paperr. I have to fight against the urge to go on an error-fixing spree when I stumble across them, especially semi-systematic ones. One of my favorites is when people accidentally type one letter in another character set. There are at least dozens of cases of Cyrillic о used in place of Latin o on English Wikipedia. Depending on your font, they are indistinguishable, but it messes up searching for those words.

Alsee (talkcontribs)

Error-fixing sprees are always welcome, chuckle. Tho I guess it's not supposed to interfere with your paid job. If you discover some group of errors that need clean up, you could post it to EN:WP:VPM (Village Pump Miscellaneous). There's a good chance someone will pick up the task.

I just tried looking for cases of Cyrillic о you mentioned. I get 10,595 hits for the letter. I tried to narrow the search somewhat, I got to 4,583 hits for о -Cyrillic -insource:"«о". Scanning through a bunch of the hits, all I could find was clear cases of Russian text containing the letter.

Do you have any suggestion on how to find cases that need cleanup?

TJones (WMF) (talkcontribs)

> Do you have any suggestion on how to find cases that need cleanup?

Sure! The idea is that it's very unlikely that you'd have a Cyrillic letter next to a Latin letter in a word: possible, but unlikely. So, you want an insource regex search for any Latin character next to a Cyrillic character, or vice versa. It's an expensive query and it times out (fortunately we now get partial results!), so you can break it into two pieces that are still too expensive, but less so than one combined regex:

  • insource:/[А-Яа-яЅІЈѕіј][A-Za-z]/ — a Cyrillic character followed by a Latin character
  • insource:/[A-Za-z][А-Яа-яЅІЈѕіј]/ — a Latin character followed by a Cyrillic character

You do get false positives like "KoЯn" and "NGiИX". You also get unexpected typos like "LГhomme Amérindien dans son environnement" which is almost certainly supposed to start with "L'homme".

I think the two most common sources of actual errors are probably

  • Users of phonetic keyboards—these keyboards have, for example, the Russian letters on the same keys as their English counterparts, so you can't tell if you mis-typed a Cyrillic or Latin o because they are the same key.
  • People working on Serbian or Macedonian topics that don't have ready access to keyboards for those languages and so substitute non-Russian Cyrillic ЅІЈѕіј with Latin SIJsij.

You could extend this with more accented Cyrillic and Latin letters, but this is a good start.
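If you have page text locally (from a dump, say), the same idea can be sketched outside of search with an ordinary regex. This uses the same character classes as the two insource queries above; the name mixed_script_words is just for illustration, and it will produce the same kinds of false positives.

```python
import re

CYRILLIC = "А-Яа-яЅІЈѕіј"
LATIN = "A-Za-z"
# A Cyrillic letter directly followed by a Latin one, or the reverse.
MIXED = re.compile(f"[{CYRILLIC}][{LATIN}]|[{LATIN}][{CYRILLIC}]")

def mixed_script_words(text):
    # Return the words that contain adjacent Latin and Cyrillic letters.
    return [w for w in re.findall(r"\w+", text) if MIXED.search(w)]
```

Words written entirely in one script are skipped, so ordinary Russian text inside an English article is not flagged.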
