Discovery/Status updates/2019-01-14

This is the weekly update for the week starting 2019-01-14.

Discussions

Search

Trey updated TextCat with models for detecting Russian typed on an English keyboard and vice-versa, and UTF-8 Russian text improperly encoded as Windows-1251, [1] as a precursor to providing wrong-keyboard/encoding detection and suggestion. [2]
Erik and the team did a lot of work on an epic ticket (with several subtasks) to explore and figure out next steps in using user click data to tune Wikidata search parameters [3] and [4]. The team will ship the newly tuned wbsearchentities profile for en soon, with de, fr, and es to follow.
The team also had lots of discussions and exploration on how to transform Wikidata autocomplete click logs into a useful dataset. They are now transformed: Relevance Forge now has a utility for taking in the Wikidata completion search logs and tuning the parameters of search based on those logs. [5]
David fixed a minor regression where search requests failed when offset+limit was out of bounds (cirrussearch-backend-error) [6]
Mathew discovered that the required metrics were being exposed by the Prometheus exporter but were not displaying, and fixed the issue with help from David and Gehel [7]
David reconfigured the Elasticsearch cross-cluster setup on production search servers to have persistent configs [8]
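
The two kinds of input error that the new TextCat models target can be reproduced in a few lines. The sketch below only illustrates what such text looks like; it is not how TextCat detects it. The key mapping (US QWERTY to standard Russian ЙЦУКЕН, lowercase keys only) and all function names are assumptions made for the example.

```python
# Illustrative only: reproduces the two error types, not TextCat's detection.
# Assumed layouts: US QWERTY and standard Russian ЙЦУКЕН (lowercase keys only).
LATIN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

# Translation table from each Latin key to the Cyrillic letter on the same key.
TO_CYRILLIC = str.maketrans(LATIN_KEYS, CYRILLIC_KEYS)


def fix_wrong_keyboard(text: str) -> str:
    """Map Latin keystrokes to the Cyrillic letters on the same keys."""
    return text.translate(TO_CYRILLIC)


def misdecode_as_cp1251(text: str) -> str:
    """Simulate UTF-8 bytes being read as Windows-1251.

    A strict decode can fail on bytes that Windows-1251 leaves unassigned,
    so this is a demonstration for simple lowercase words only.
    """
    return text.encode("utf-8").decode("cp1251")


def repair_cp1251_mojibake(text: str) -> str:
    """Undo the mis-decoding by reversing the two steps."""
    return text.encode("cp1251").decode("utf-8")
```

For instance, `fix_wrong_keyboard("ghbdtn")` gives "привет", and `repair_cp1251_mojibake(misdecode_as_cp1251("привет"))` gives back the original word. Detecting when such a repair is appropriate is the hard part, which is where a language-identification model comes in.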

WDQS

Stas & Guillaume finished moving the categories namespace into a separate Blazegraph instance [9]

Did you know?

English text, like many others, is written left-to-right (LTR), but some languages (most notably Arabic, Hebrew, Persian, and Urdu, but also many others [10]) are written right-to-left (RTL). To handle different writing directions, especially in mixed LTR and RTL texts, Unicode classifies characters as having "strong", "weak", or "neutral" directionality. Strong characters definitely go in a particular direction, like ABC or אבג. Weak characters, mostly numbers, have a "vague" directionality that can change in context. Neutral characters, like punctuation and whitespace characters used across scripts, pick up their directionality from context.
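
To see these categories for individual characters, Python's standard `unicodedata` module exposes each character's bidirectional class. The sketch below groups those classes into strong, weak, and neutral following the Unicode bidirectional algorithm's character type table; the function name `directionality` is made up for this example.

```python
import unicodedata

# Bidirectional classes grouped as in the Unicode bidirectional algorithm.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def directionality(ch: str) -> str:
    """Return 'strong', 'weak', 'neutral', or 'other' for one character."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    # Explicit formatting characters and unassigned code points land here.
    return "other"
```

With this grouping, "A" and "א" are strong, "7" is weak, and " " and "!" are neutral. Some punctuation, such as the comma, is classed as a weak number separator rather than neutral.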

Mirrored characters change the way they display based on context. For example, "A>B>C" and "א>ב>ג" both contain only the greater-than character (>), but, if you are reading this somewhere that follows the Unicode bidirectional algorithm, the ones between Latin letters point to the right and those between Hebrew letters point to the left.
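
Whether a character has the mirrored property can be checked with the standard `unicodedata` module. This is a small sketch; the helper name `is_mirrored` is made up for the example.

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if Unicode marks the character as mirrored in bidirectional text."""
    return unicodedata.mirrored(ch) == 1
```

Here `is_mirrored(">")` and `is_mirrored("(")` are true, while `is_mirrored("A")` is false. The property only says that a mirrored glyph should be used in right-to-left runs; the stored code point stays the same.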

The algorithms are complicated [11], and when they don't work, there are explicit characters that indicate things like "text should flow left to right from here". The explicit formatting characters have the most potential to cause trouble for search because they are usually invisible, and you can pick one up without realizing it. For example, when copying an Arabic word from a page in English, or a French word from a page in Hebrew, the word that is "the other way around" from the main text might have one of these marks at the beginning or end of it. Fortunately, we can usually identify them and strip them out.
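
As a rough illustration of the stripping step, the sketch below deletes the explicit bidirectional formatting characters from a string. It is a minimal example under stated assumptions, not the production search configuration: the list of code points and the function name `strip_bidi_controls` were chosen for this example.

```python
# Explicit bidirectional formatting characters (an assumed list for this sketch).
BIDI_CONTROLS = (
    "\u061c"                            # ARABIC LETTER MARK
    "\u200e\u200f"                      # LEFT-TO-RIGHT MARK, RIGHT-TO-LEFT MARK
    "\u202a\u202b\u202c\u202d\u202e"    # embeddings, overrides, pop formatting
    "\u2066\u2067\u2068\u2069"          # isolates, pop isolate
)

# str.translate deletes any character whose code point maps to None.
STRIP_TABLE = dict.fromkeys(map(ord, BIDI_CONTROLS))


def strip_bidi_controls(text: str) -> str:
    """Remove invisible directional formatting characters from text."""
    return text.translate(STRIP_TABLE)
```

A Hebrew word copied with a right-to-left mark in front of it comes back as just the visible letters, so it can match the same word typed directly.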

Finally, there are some scripts that have been written in other interesting directions. Scripts written vertically include Chinese, Japanese, and Korean, [12] and Mongolian. [13] Hanunó'o [14] and Ogham [15] were written bottom-to-top! My favorite "direction" is "boustrophedon," [16] which means "like an ox ploughs" and alternates left-to-right and right-to-left, and was used particularly in old manuscripts and inscriptions in many writing systems. Why jump from one side of the page to the other when you can just curve around where you are or flip to mirrored letters and keep going?!

--

View all open tickets related to Discovery.
Looking to get involved? See tasks marked as Easy or volunteer needed