Discovery/Status updates/2018-06-11

This is the weekly update for the week starting 2018-06-11.



  • Trey completed a technical review of the available Estonian morphological library with help from Guillaume and David. Unfortunately, it's not usable, and the stemming algorithm is not easily ported. See T178928. [1]
  • Trey did an analysis [2] of the effect of using the Elasticsearch Indonesian analysis chain on Malay-language data. (See Wikipedia [3] for details on Malay and Indonesian.) Next step is getting speaker review of the stemming quality, then hopefully on to reindexing wikis in both Malay and Indonesian.
  • Trey did a write-up about the weirdness that comes from searching for single punctuation characters without good redirect support [4], to explain why searching for a hyphen on Farsi Wikipedia redirects you to the article on the apostrophe. See also T196826. [5]
  • Erik and David looked at adding a 'type' field to store the same information that was in the es5 types in the metastore [6]
  • David worked on investigating (and implementing) how the prefix keyword should augment, not override, the list of requested namespaces [7]
  • Trey got the feedback he needed to go ahead and create and merge Croatian, Serbo-Croatian, and Bosnian Analysis Chains Using Serbian Morphological Libraries [8]
  • Gehel found that when we freeze writes to Elasticsearch, jobs pile up in the job queue, and we needed an alert to tell us that the writes aren't getting thawed in a timely manner [9]
  • Trey worked on moving Serbian-language wikis from extra-analysis to the extra-analysis-serbian plugin (it went into production a week ago with the re-indexing) [10]
  • Erik and Gehel resolved the current deprecation warnings in Elasticsearch 5 [11]
  • David worked on adding support for boosting keywords [12] and adding support for Filtering keyword (FilterQueryFeature) [13]
  • Erik did quite a bit of research on how to ensure that the regex highlighting doesn't always time out as expected because @ apparently matches "any string" in the Lucene regex syntax; Trey helped with the analysis and it got pushed into production in early June [14]
  • Stas added lemma & form representation texts to fulltext search index, which allows (very primitive) fulltext search for Lexemes [15]. Better search coming soon!

Other Noteworthy Stuff

  • Wikidata Quality Constraints violations can now be exported into RDF. Loading into Wikidata Query Service coming soon. [16]