User:TJones (WMF)/Notes/Khmer Reordering Before and After Reindexing Report

March 2021 — See TJones_(WMF)/Notes for other projects. See also T274205.

Data

I pulled a 10K sample of Khmer Wikipedia queries from 2021-02-01 to 2021-03-01. I used the usual sampling filters:

  • the sample is across four weeks, to account for any cyclical effects (weekend queries may be different from weekday queries)
  • sampled queries are limited to one query per IP per day (to reduce bots that slip through other filters, and so power users aren't overrepresented)
  • we exclude any IPs with more than 100 queries in a day (to reduce the number of bots and other atypical users)
  • we require the session to include a near-match query, which runs before a full-text query and originates in the search box in the upper right (or left) corner of the page (again, to filter in favor of human searchers)
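The first three sampling rules can be sketched roughly as follows. This is only an illustration, not the actual sampling code: the `(ip, day, query)` row format, the function name `sample_queries`, and the choice to drop a heavy IP from the whole sample (rather than only for its busy day) are all assumptions. The near-match session requirement is not modeled here because it depends on session data.

```python
import random
from collections import defaultdict


def sample_queries(log_rows, max_per_day=100, seed=0):
    """Pick one query per IP per day, skipping very active IPs.

    log_rows is an iterable of (ip, day, query) tuples (hypothetical format).
    """
    rng = random.Random(seed)
    by_ip_day = defaultdict(list)
    for ip, day, query in log_rows:
        by_ip_day[(ip, day)].append(query)

    # Assumption: an IP that exceeds the daily limit on any day is dropped entirely.
    heavy_ips = {ip for (ip, _day), qs in by_ip_day.items() if len(qs) > max_per_day}

    sampled = []
    for ip, day in sorted(by_ip_day):
        if ip in heavy_ips:
            continue
        # Keep a single randomly chosen query for this IP on this day.
        sampled.append(rng.choice(by_ip_day[(ip, day)]))
    return sampled
```

Running this over four weeks of rows gives at most one query per IP per day, which is what keeps power users from dominating the sample.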

The queries were lightly normalized—lowercased, strings of whitespace converted to a single space, and leading and trailing whitespace trimmed—and deduplicated.
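A minimal sketch of that normalization and deduplication, assuming the queries are plain Python strings (the function names are illustrative, not the ones actually used):

```python
import re


def normalize(query):
    """Lowercase, collapse runs of whitespace to one space, and trim the ends."""
    return re.sub(r"\s+", " ", query.lower()).strip()


def dedupe(queries):
    """Normalize every query and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(normalize(q) for q in queries))
```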

I also created some filters to remove generally low quality queries. In total, 3634 queries (36.34%) were filtered:

  • xxx: Variations on xxx, xnxx (a popular porn site), and the word porn accounted for 1925 queries.
  • www: Websites and URLs (www, .com, .net, .org, bit.ly) accounted for 1504 queries.
  • numbers: Queries consisting entirely of numbers (with an optional + at the beginning) accounted for 69 queries.
  • junk: Queries with the same character 4 times in a row (a fairly reliable sign of junk) accounted for 39 queries.
  • consonants: Queries with 5 consonants in a row (a fairly reliable sign of junk, except in German), or queries that are just 4 consonants, accounted for 90 queries.
  • punctuation: Queries with at least 2 characters that were all punctuation and spaces accounted for 7 queries.
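The filters above could be approximated with regular expressions along these lines. The exact patterns used for this report are not given, so everything below is a guess at their shape: the consonant set (Latin letters only, with y counted as a consonant), the ASCII-only punctuation class, and the first-match-wins ordering are all assumptions.

```python
import re
import string

# Assumption: Latin consonants only, with "y" treated as a consonant.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# Filters are tried in order; the first one that matches names the category.
FILTERS = [
    ("xxx", re.compile(r"xxx|xnxx|porn")),
    ("www", re.compile(r"www|\.com|\.net|\.org|bit\.ly")),
    ("numbers", re.compile(r"^\+?\d+$")),
    ("junk", re.compile(r"(.)\1{3}")),
    ("consonants", re.compile(rf"[{CONSONANTS}]{{5}}|^[{CONSONANTS}]{{4}}$")),
    ("punctuation", re.compile(rf"^[{re.escape(string.punctuation)}\s]{{2,}}$")),
]


def classify(query):
    """Return the name of the first matching filter, or None to keep the query."""
    for name, pattern in FILTERS:
        if pattern.search(query):
            return name
    return None
```

Because the queries are already lowercased at this point, the patterns only need to handle lowercase input.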

I lightly reviewed the filtered queries, and they were, on the whole, obviously junk.

5424 unique queries remained. From these I randomly sampled 1000 queries and ran them against the current production API before and after reindexing, 45 minutes apart.
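One way to collect the before and after numbers is sketched below. It assumes the public MediaWiki Action API search module (`list=search`) on Khmer Wikipedia. The report does not say which endpoint or tooling was actually used, so the URL, the User-Agent string, and the helper names are placeholders.

```python
import json
import random
import urllib.parse
import urllib.request

API_URL = "https://km.wikipedia.org/w/api.php"  # assumed endpoint


def pick_sample(queries, k=1000, seed=0):
    """Randomly choose k queries without replacement."""
    return random.Random(seed).sample(queries, k)


def build_search_url(query, api_url=API_URL):
    """Build a full-text search request that reports the total hit count."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srinfo": "totalhits",
        "srlimit": 1,
        "format": "json",
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def summarize(response):
    """Reduce a parsed API response to (total hits, top title or None)."""
    result = response.get("query", {})
    hits = result.get("searchinfo", {}).get("totalhits", 0)
    rows = result.get("search", [])
    top = rows[0]["title"] if rows else None
    return hits, top


def run_query(query):
    """Fetch one query's results; the User-Agent value is a placeholder."""
    request = urllib.request.Request(
        build_search_url(query),
        headers={"User-Agent": "search-sample-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return summarize(json.load(resp))
```

Calling `run_query` on every sampled query once before and once after reindexing yields two mappings from query to `(hits, top title)` that can then be compared.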

As a control, I pulled a similar sample of English Wikipedia queries from 2021-02-01 to 2021-02-08 (one week), and applied the same filters. Significantly fewer queries were filtered—360 in total.

  • xxx: 114
  • www: 54
  • numbers: 17
  • junk: 18
  • consonants: 157
  • punctuation: 0

9584 unique queries remained. From these I randomly sampled 1000 queries and ran them against the current production API before and after reindexing Khmer (i.e., with no change to the English indexing). I no longer have the exact timing, but the interval was slightly longer than for the Khmer sample: more than 45 minutes, but definitely less than 2 hours.

Of course, the comparison isn't exact, since English Wikipedia has more content and is more active than Khmer Wikipedia, but it gives a likely upper bound on random "natural" changes.

Khmer Stats

  • 319 (31.9%) queries originally got zero results
    • 29 (2.9%) went from 0 results to some results
      • from 1 to 209 new results
  • 280 (28%) got a different number of hits
    • from 1 to 678 more hits
    • 256 (25.6%) increased from non-zero to more results
      • from 0.10% (1994 to 1996) to 11400.00% (1 to 115)
  • 95 (9.5%) changed their top result (including ZRR changes)
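The figures above are simple counts and relative changes over the before and after runs. As a rough sketch, assuming each run is stored as a dict mapping a query to `(total_hits, top_title)` (a hypothetical layout, not necessarily how the data was kept), they could be computed like this:

```python
def percent_change(before_hits, after_hits):
    """Relative change in hit count, as a percentage of the 'before' count."""
    return (after_hits - before_hits) / before_hits * 100


def compare_runs(before, after):
    """Summarize two runs; each maps query -> (total_hits, top_title)."""
    total = len(before)
    zero_before = [q for q in before if before[q][0] == 0]
    zero_to_some = [q for q in zero_before if after[q][0] > 0]
    hits_changed = [q for q in before if before[q][0] != after[q][0]]
    increased = [q for q in before if 0 < before[q][0] < after[q][0]]
    top_changed = [q for q in before if before[q][1] != after[q][1]]
    changes = [percent_change(before[q][0], after[q][0]) for q in increased]

    def share(items):
        # Percentage of all sampled queries that fall in this group.
        return 100 * len(items) / total

    return {
        "zero_before_pct": share(zero_before),
        "zero_to_some_pct": share(zero_to_some),
        "hits_changed_pct": share(hits_changed),
        "increased_pct": share(increased),
        "top_changed_pct": share(top_changed),
        "increase_range_pct": (min(changes), max(changes)) if changes else None,
    }
```

For instance, `percent_change(1, 115)` gives 11400.0, matching the largest relative increase listed above.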

Observations

  • The largest change in results was for a numeric query (8), which went from 677 hits to 1355 hits because we are mapping Khmer numerals to Arabic numerals.
  • I checked the 8 zero-results queries that had the biggest numbers of new results, and 7 of them had the kinds of problems the Khmer syllable reordering was intended to correct:
    • split vowels
    • repeat diacritics
    • deprecated characters
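To illustrate the numeral mapping mentioned in the first observation: the Khmer digits occupy U+17E0 through U+17E9, so folding them to ASCII digits is a one-to-one character translation. The snippet below is only a stand-alone illustration of the idea; how the production analysis chain performs the mapping is not described here, and the function name is made up.

```python
# Khmer digits zero through nine are the code points U+17E0 .. U+17E9.
KHMER_DIGITS = "".join(chr(0x17E0 + i) for i in range(10))
KHMER_TO_ASCII = str.maketrans(KHMER_DIGITS, "0123456789")


def fold_khmer_digits(text):
    """Replace Khmer digits with ASCII digits, leaving everything else as is."""
    return text.translate(KHMER_TO_ASCII)
```

With a mapping like this, a query for the Khmer digit eight and a query for `8` end up matching the same indexed tokens, which is consistent with the hit count for that query roughly doubling.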

English Stats

  • 173 (17.3%) queries originally got zero results
    • 0 (0%) went from 0 results to some results
  • 133 (13.3%) got a different number of hits
    • from 16 fewer to 15 more hits
    • 33 (3.3%) decreased from non-zero to fewer results
      • from -1.89% (53 to 52) to -0% (122,708 to 122,707)
    • 100 (10%) increased from non-zero to more results
      • from +0% (1,640,797 to 1,640,802) to +1.44% (277 to 281)
  • 33 (3.3%) changed their top result

Observations

  • I checked a handful of the queries that changed their top result, and their top result changed randomly as I reloaded the search results page. They don't have an overwhelming top result and the final ranking depends on which shard handles the request.

Khmer vs English Analysis

  • Improvements to Khmer zero results queries are likely a direct result of the Khmer plugin!
  • The range of changes in the number of hits for Khmer (1 to 678 more hits) is very different from the "random" changes in English (16 fewer to 15 more hits), and so is also likely a direct result of the Khmer plugin!

Looks like the Khmer plugin is going to have a pretty big impact on Khmer searches!