User:TJones (WMF)/Permuting Khmer
Permuting Khmer: Restructuring Khmer Syllables for Search
Over the course of about a year and a half (from late 2019 to early 2021), I worked off and on to improve on-wiki search for Khmer-language wikis. Because of the way many Khmer fonts support the Khmer script, the same word can often be written in multiple ways that look identical to a reader, even though each version is a different sequence of characters to the software that processes the text to make it searchable.
I developed and implemented an algorithm to reorder Khmer syllables that are not in the canonical order, as defined by the Unicode Standard. Below are various resources that describe and document the process of developing the algorithm, and provide several implementations:
- A 5½-minute video presentation I gave at a WMF Tech Department meeting, which provides a quick sketch of the background and problem and a brief overview of the impact of the changes made. (May 2021)
- A 1600-word blog post that has additional detailed examples. (June 2020)
- Four implementations are available:
  - An Elasticsearch character filter (in Java) in the Search Platform team’s search/extra repo. (Here are links to the Khmer-specific code and docs.)
  - Stand-alone implementations in Java, Python, and Perl at my GitHub repo. These operate on files, but are intended to be easily modified to be part of a larger application. The docs include a quick summary of the algorithm and relevant character classes.
- Detailed development and evaluation notes on MediaWiki:
  - Syllable Re-Writing to Improve Khmer Search Performance: Development and analysis of an algorithm to re-order ambiguous Khmer syllables. Includes lots of references. (September 2019)
  - Khmer Reordering Analysis Analysis: Analysis of the impact of adding the Khmer reordering plugin to the Khmer language analysis chain. (January 2021)
  - Khmer Reordering Before and After Reindexing Report: A first attempt at quantifying the impact of analysis changes and reindexing by running a sample right before and after reindexing. (March 2021)
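
As a rough illustration of the problem described above, the Python sketch below builds two code point sequences for the same syllable, one with the subscript consonant before the dependent vowel and one with the two swapped. It is a deliberately simplified toy, not the algorithm from the repos listed above: the name `toy_reorder`, the two regular expressions, and the choice to handle only subscript consonants and dependent vowels are all assumptions made for this example. It also assumes that the canonical order puts subscripts before dependent vowels, and that the two sequences render alike in many fonts.

```python
import re
import unicodedata

# Character classes taken from the Unicode Khmer block (U+1780..U+17FF).
COENG = "\u17D2"                 # KHMER SIGN COENG, introduces a subscript consonant
CONSONANT = "[\u1780-\u17A2]"    # KHMER LETTER KA .. KHMER LETTER QA
DEP_VOWEL = "[\u17B6-\u17C5]"    # KHMER VOWEL SIGN AA .. KHMER VOWEL SIGN AU

# Hypothetical, simplified cluster: a base consonant followed by any mix of
# subscripts (COENG + consonant) and dependent vowels. Real syllables can also
# contain register shifters, other signs, and more.
CLUSTER = re.compile(f"({CONSONANT})((?:{COENG}{CONSONANT}|{DEP_VOWEL})+)")
PIECE = re.compile(f"{COENG}{CONSONANT}|{DEP_VOWEL}")


def toy_reorder(text: str) -> str:
    """Move subscript consonants ahead of dependent vowels within each cluster."""

    def fix(match: re.Match) -> str:
        base, tail = match.group(1), match.group(2)
        pieces = PIECE.findall(tail)
        # Stable sort: subscripts (key 0) first, vowels (key 1) after,
        # keeping the original relative order inside each group.
        pieces.sort(key=lambda p: 0 if p.startswith(COENG) else 1)
        return base + "".join(pieces)

    return CLUSTER.sub(fix, text)


# SA + COENG + RO + vowel II (subscript first) versus SA + vowel II + COENG + RO.
canonical = "\u179F\u17D2\u179A\u17B8"
variant = "\u179F\u17B8\u17D2\u179A"

# Standard Unicode normalization does not merge the two spellings.
nfc_equal = unicodedata.normalize("NFC", canonical) == unicodedata.normalize("NFC", variant)
print("equal after NFC:", nfc_equal)
print("equal after toy_reorder:", toy_reorder(variant) == canonical)
```

The point of the sketch is only that a script-specific reordering step is needed before indexing, because general-purpose normalization leaves the two spellings distinct. For real use, the implementations and documentation linked above are the place to look.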