User:TJones (WMF)/Notes/On Merging Apostrophes and Other Unicode Characters

August 2016 — See TJones_(WMF)/Notes for other projects. (T41501) For help with the technical jargon used in discussing Analysis Chains, check out the Language Analysis section of the Search Glossary.

Intro

I wrote this up for T41501, but wanted to put it with my other notes for easy finding later.

T41501 is a pretty old ticket, dating back to a 2009 discussion about straight and curly apostrophes in lsearchd, which has since been retired. Below is an abbreviated version of the ticket description.

Task Description

When doing a search with the apostrophe character U+0027 ("apostrophe/single quote", available on most keyboards), results should also match other Unicode apostrophe-like characters, like the preferred apostrophe U+2019, among others.

Basically, indexing should convert all apostrophes to U+0027, and searching should convert all apostrophes to U+0027. So articles containing U+2019, for example, would be matched when searching with U+0027, U+2019, or other apostrophes.

From the 2009 discussion, the list of apostrophes was:

  • U+0027 APOSTROPHE 
  • U+2018 LEFT SINGLE QUOTATION MARK 
  • U+2019 RIGHT SINGLE QUOTATION MARK 
  • U+201B SINGLE HIGH-REVERSED-9 QUOTATION MARK 
  • U+2032 PRIME 
  • U+00B4 ACUTE ACCENT 
  • U+0060 GRAVE ACCENT 
  • U+FF40 FULLWIDTH GRAVE ACCENT 
  • U+FF07 FULLWIDTH APOSTROPHE

I would add other characters for which U+0027 is often used as an accessible substitute, like some modifier letters and the saltillos:

  • U+02B9 MODIFIER LETTER PRIME
  • U+02BB MODIFIER LETTER TURNED COMMA
  • U+02BC MODIFIER LETTER APOSTROPHE
  • U+02BD MODIFIER LETTER REVERSED COMMA
  • U+02BE MODIFIER LETTER RIGHT HALF RING
  • U+02BF MODIFIER LETTER LEFT HALF RING
  • U+0384 GREEK TONOS
  • U+1FBF GREEK PSILI
  • U+A78B LATIN CAPITAL LETTER SALTILLO
  • U+A78C LATIN SMALL LETTER SALTILLO

My Response

There are a lot of connected issues here. I’ll try to untangle some of them.

In Elasticsearch, there are two steps in processing text that are particularly relevant here: tokenizing, which breaks up text into tokens to be indexed (usually what we think of as words), and ascii-folding, which converts non-plain ascii into plain ascii where possible (though, for example, you can't convert Chinese characters into plain ascii because there's no reasonable mapping).
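As a very rough sketch of those two steps (in plain Python with the standard library; this is not how Elasticsearch/Lucene actually implements them, it's just to illustrate the stages):

import re
import unicodedata

def tokenize(text):
    # Crude stand-in for a tokenizer: split on anything that isn't a
    # word character. Lucene's standard tokenizer is much more sophisticated.
    return [t for t in re.split(r'\W+', text.lower()) if t]

def ascii_fold(token):
    # Crude stand-in for ascii-folding: decompose (NFKD) and keep only
    # the ASCII characters, dropping the combining accents.
    decomposed = unicodedata.normalize('NFKD', token)
    return ''.join(c for c in decomposed if ord(c) < 128)

print([ascii_fold(t) for t in tokenize("pïćkętt‘s čhãrgè")])
# ['pickett', 's', 'charge']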

The rules Elasticsearch uses for tokenizing and other processing can differ by language, so I’ve only tested these on the English analysis chain for now.

A normal apostrophe is treated as a word break, so looking at prickett’s (from the phrase prickett’s charge, in the article from the Daily WTF), we get prickett and s as our terms to be indexed. Searching for prickett’s charge actually searches for three tokens: prickett, s, and charge. The obvious title comes up because that phrase, in that exact order, is the title of the article, which is usually a very good result.
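If you want to poke at this yourself, Elasticsearch's _analyze API shows what a given analyzer does with a piece of text. A minimal sketch using the official Python client (the index and analyzer names below are placeholders, and the exact client syntax varies by version):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster on the default port

resp = es.indices.analyze(index="enwiki_content", body={   # placeholder index name
    "analyzer": "text",                                     # placeholder analyzer name
    "text": "prickett’s charge",
})
print([t["token"] for t in resp["tokens"]])
# With the English chain described above, the apostrophe acts as a word
# break, so prickett’s comes back as prickett + s.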

Many of the apostrophe-like characters listed above also serve as word breaks in English. The ones listed here that are not word breaks include all the listed modifier letters and the small saltillo—oddly, the capital saltillo is a word break. Of course, in other languages the analysis could be different, though I checked Greek and the separate tonos is still a word breaker. (I think that's because it's not a modifier mark, since all the vowels with tonos have precomposed Unicode characters—but I'm guessing.)
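The Unicode general category of each character gives a rough idea why (the tokenizer actually works from the Unicode word-break properties, not the general category, but they mostly line up): the modifier letters are Lm and the saltillos are Lu/Ll, i.e., letters, while the quotation marks and accents are punctuation or symbols. A quick way to check in Python:

import unicodedata

chars = ['\u0027', '\u2019', '\u00B4', '\u02B9', '\u02BB', '\u02BC',
         '\u0384', '\uA78B', '\uA78C']

for c in chars:
    print(f'U+{ord(c):04X}  {unicodedata.category(c)}  {unicodedata.name(c)}')

# U+0027  Po  APOSTROPHE
# U+2019  Pf  RIGHT SINGLE QUOTATION MARK
# U+00B4  Sk  ACUTE ACCENT
# U+02B9  Lm  MODIFIER LETTER PRIME
# ... (note that the general category doesn't explain the odd capital
# saltillo behavior: it's a letter, Lu, just as the small one is Ll)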

For characters that are not word breaks, ascii-folding often does what you'd want, but not always. Ascii-folding is currently enabled on English Wikipedia, so searching for pïćkętt‘s čhãrgè works as expected. In my (not quite done) research into French (T142620), the Turkish dotted-I (İ) is properly folded to I by the default French analysis chain, but not by the explicit ascii-folding step. The French stemmer does some ascii-folding, but generally not as much as the explicit ascii-folding step (dotted-I notwithstanding).
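The _analyze API also works without any index, using the built-in tokenizers, filters, and analyzers, so it's easy to compare the explicit ascii-folding step against a language analyzer. A sketch with the Python client (using the built-in french analyzer as a stand-in for our French chain; exact client syntax varies by version):

from elasticsearch import Elasticsearch

es = Elasticsearch()

explicit = es.indices.analyze(body={
    "tokenizer": "standard",
    "filter": ["lowercase", "asciifolding"],
    "text": "İstanbul",
})
french = es.indices.analyze(body={
    "analyzer": "french",
    "text": "İstanbul",
})

# Compare how each handles the dotted-I.
print([t["token"] for t in explicit["tokens"]])
print([t["token"] for t in french["tokens"]])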

In general, the Elasticsearch ascii-folding is pretty good—though linguists cringe at folding ɰ to m. Undoubtedly there are other minor errors in the ascii-folding.

The tokenizer itself causes some of these problems, particularly with the multiplication sign, ×, which is a non-word character and so acts as a word break. With the multiplication sign, 3×4 is tokenized as two tokens: 3 and 4; with an x, 3x4 is tokenized as three tokens: 3, x, and 4.

We are currently doing explicit ascii-folding for English and Italian, and we're adding it for French (which will come with BM25). Some ascii-folding probably happens in other language-specific analysis chains, but we don't know exactly what or where without testing.

It is possible to add any of these others (x for ×, I for İ) as Elasticsearch character filters, which just uniformly map one character to another, but that could have unintended consequences. With such mappings we would definitely no longer be able to distinguish between the mapped characters, so we couldn't apply them universally; in Turkish, for example, the distinction between I and İ matters.
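For concreteness, here's a sketch of what such a mapping character filter could look like in index settings (written as a Python dict; the "mapping" char_filter type is standard Elasticsearch, but the filter and analyzer names are made up, and, as noted, these particular mappings wouldn't be appropriate everywhere):

settings = {
    "analysis": {
        "char_filter": {
            "fold_special_chars": {      # hypothetical name
                "type": "mapping",
                "mappings": [
                    "\u00D7 => x",       # U+00D7 MULTIPLICATION SIGN -> x
                    "\u0130 => I",       # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE -> I (wrong for Turkish!)
                ],
            },
        },
        "analyzer": {
            "folded_text": {             # hypothetical name
                "type": "custom",
                "char_filter": ["fold_special_chars"],
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"],
            },
        },
    },
}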

There can always be problems with particular “non-native” characters and particular symbols that the default tokenizing and ascii-folding don't handle as well as we'd like. More issues will come up, but I'd consider closing this specific task: it is based on the behavior of lsearchd, which is no longer around; all of the original apostrophe-like characters now behave like apostrophes; and we are looking into ICU folding (T137830), which is more appropriate for languages that don't use the Latin alphabet (it's already enabled for Greek).