User:TJones (WMF)/Permuting Khmer

Permuting Khmer: Restructuring Khmer Syllables for Search

Over the course of about a year and a half (from late 2019 to early 2021), I worked intermittently on a project to improve on-wiki search for Khmer-language wikis. Because of the way many Khmer fonts support the Khmer script, the same word can often be written in multiple ways that look identical to a reader, even though each version is different to the software that processes the text to make it searchable.
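As a rough illustration of the problem (a sketch, not code from the project), the Python snippet below builds two typings of one Khmer syllable that contain exactly the same characters in a different order. The example syllable and the variable names are assumptions chosen for illustration, and whether the two strings actually render identically depends on the font. The snippet also shows that standard Unicode NFC normalization does not make the two sequences equal, which is why plain string matching treats them as different words.

```python
import unicodedata

# Assumed example: one syllable typed two ways.
# Subscript consonant (COENG + RO) before the dependent vowel:
subscript_first = "\u179F\u17D2\u179A\u17B8"  # SA, COENG, RO, VOWEL SIGN II
# Dependent vowel before the subscript consonant:
vowel_first = "\u179F\u17B8\u17D2\u179A"      # SA, VOWEL SIGN II, COENG, RO


def describe(text):
    """Return a readable list of the code points in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Both strings contain the same characters ...
same_characters = sorted(subscript_first) == sorted(vowel_first)
# ... but they are different strings to software ...
equal_as_strings = subscript_first == vowel_first
# ... and NFC normalization does not reconcile the two orders.
equal_after_nfc = (
    unicodedata.normalize("NFC", subscript_first)
    == unicodedata.normalize("NFC", vowel_first)
)

if __name__ == "__main__":
    for label, text in (("subscript first", subscript_first),
                        ("vowel first", vowel_first)):
        print(label, describe(text))
    print("same characters:", same_characters)
    print("equal as strings:", equal_as_strings)
    print("equal after NFC:", equal_after_nfc)
```

A search index that stores one of these sequences will not match a query typed as the other unless something reorders the characters first.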

I developed and implemented an algorithm to reorder Khmer syllables that are not in the canonical order, as defined by the Unicode Standard. Below are various resources that describe and document the process of developing the algorithm, and provide several implementations:

  • A 5½-minute video presentation I gave at a WMF Tech Department meeting, which provides a quick sketch of the background and problem, and a brief overview of the impact of the changes made. (May 2021):
    [Video: Presentation on automatically restructuring Khmer syllables to improve search]
  • A 1600-word blog post that has additional detailed examples. (June 2020)
  • Four implementations are available:
    • An Elasticsearch character filter (in Java) in the Search Platform team’s search/extra repo. (Here are links to the Khmer-specific code and docs.)
    • Stand-alone implementations in Java, Python, and Perl at my GitHub repo. These operate on files, but are intended to be easily modified to be part of a larger application. The docs include a quick summary of the algorithm and relevant character classes.
  • Detailed development and evaluation notes on MediaWiki: