User:TJones (WMF)/Notes/Stempel Analyzer Analysis/Recompiling the Stemmer

Recompiling the Stemmer

The original tables of stems which were used to train Stempel are not presently available. However, the process outlined in the Apache documentation for Stempel is probably reproducible, using Stempel itself.

Based on email from Leo Galamboš, one of the authors of Egothor, on which Stempel is built (see Apache docs above), the process for creating a table is as follows:

Download http://www.getopt.org/stempel/stempel-1.0.jar
Prepare a table: the first term on each line is the lemma/stem, and the rest of the line contains all of its variants (example from the English Egothor table):

 A-bomb   A-bombs
 abacus   abacuses
 abandon   abandons abandoning abandoned
 abase   abases abasing abased
 abate   abates abating abated
 abbess   abbesses

Run java -cp stempel-1.0.jar org.egothor.stemmer.Compile -0E2 en_table

This compiles "en_table" using the "-0E2" method (Elasticsearch uses "-0ME2", which may not be better) and saves the result to the file en_table.out.
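The table-preparation step above can be sketched in Python. This is only an illustration: the function name `build_table` is hypothetical, and it assumes that separating terms with single spaces is acceptable to the compiler (the example table above shows the lemma followed by whitespace and its variants).

```python
from collections import defaultdict


def build_table(pairs):
    """Group (lemma, variant) pairs into table lines: lemma first, then variants."""
    variants_by_lemma = defaultdict(list)
    for lemma, variant in pairs:
        variants = variants_by_lemma[lemma]
        # Skip the lemma itself and variants that are already listed.
        if variant != lemma and variant not in variants:
            variants.append(variant)
    # Emit one line per lemma that has at least one variant, sorted by lemma.
    return [
        " ".join([lemma, *variants])
        for lemma, variants in sorted(variants_by_lemma.items())
        if variants
    ]


if __name__ == "__main__":
    sample = [
        ("abandon", "abandons"),
        ("abandon", "abandoning"),
        ("abacus", "abacuses"),
    ]
    print("\n".join(build_table(sample)))
```

The resulting lines could be written to a file such as en_table and passed to the Compile command shown above.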

We can't simply replace the old file with en_table.out, because en_table.out would first have to be converted to Stempel's format: it needs the UTF String header with the opt-method signature.
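To make the header requirement more concrete, here is a sketch of what such a header might look like at the byte level. It assumes that "UTF String header" means a string written with Java's `DataOutputStream.writeUTF` (a two-byte big-endian length followed by the encoded characters) and that the signature is the option string itself, such as "-0ME2". Neither assumption is confirmed here, and whether the remainder of en_table.out already matches Stempel's format is also unverified. The function name `java_utf_header` is hypothetical.

```python
import struct


def java_utf_header(text):
    """Encode text as Java's DataOutputStream.writeUTF would, for plain ASCII input."""
    # writeUTF uses "modified UTF-8"; for ASCII without NUL it is identical to ASCII,
    # so this sketch deliberately rejects anything else.
    if not text.isascii() or "\x00" in text:
        raise ValueError("this sketch only handles ASCII strings without NUL")
    data = text.encode("ascii")
    # Two-byte big-endian unsigned length, followed by the encoded bytes.
    return struct.pack(">H", len(data)) + data


if __name__ == "__main__":
    print(java_utf_header("-0ME2"))
```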

However, there is also a Compile class in the current Elasticsearch distribution, which likely performs the same process, but with the correct headers.

All words in the table will be transformed as the table specifies. Unknown words may be transformed incorrectly. So, if a word is stemmed incorrectly, we can add the correct word-stem pair to the table and recompile it, and that word will then be stemmed correctly.
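A minimal sketch of that correction step, operating on the uncompiled table lines before recompiling. The function name `add_correction` is hypothetical, and the sketch assumes each lemma appears on at most one line and that terms are whitespace-separated.

```python
def add_correction(lines, lemma, variant):
    """Return table lines with `variant` listed under `lemma` and nowhere else."""
    table = {}
    for line in lines:
        terms = line.split()
        if terms:
            # First term is the lemma; the rest are its variants.
            table[terms[0]] = terms[1:]
    # Remove the variant from any entry that currently lists it.
    for variants in table.values():
        if variant in variants:
            variants.remove(variant)
    # Add the variant under the correct lemma, creating the entry if needed.
    table.setdefault(lemma, [])
    if variant != lemma:
        table[lemma].append(variant)
    return [" ".join([head, *variants]) for head, variants in sorted(table.items())]


if __name__ == "__main__":
    old = ["abase abases abated", "abate abates abating"]
    print("\n".join(add_correction(old, "abate", "abated")))
```

After patching, the table would be recompiled with the same Compile command as before.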

The problem is getting appropriate data for the table of stems/lemmas. As the Apache documentation discusses, this data was originally derived from tagged corpora and the output of a different stemmer (SAM). We could generate a similar corpus using data from English and Polish Wiktionary, with the current version of Stempel playing the role that SAM played for the original Stempel.

Polish Wikipedia provides a ready source of frequency data for Polish words. Using lemmas for the most common words in Polish Wikipedia should provide ample training data for the stemmer, and having an uncompiled table available would allow us to update it to fix problem cases.
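One way to combine frequency data with a lemma source is sketched below. The names `top_word_pairs`, `tokens`, and `lemma_of` are hypothetical: `tokens` stands for words drawn from Polish Wikipedia text, and `lemma_of` for a word-to-lemma mapping obtained from some other source (for example Wiktionary or current Stempel output). Tokenization and case handling are left out.

```python
from collections import Counter


def top_word_pairs(tokens, lemma_of, limit):
    """Return (lemma, word) pairs for the `limit` most frequent words with a known lemma."""
    counts = Counter(tokens)
    pairs = []
    # most_common() yields words from most to least frequent.
    for word, _count in counts.most_common():
        if len(pairs) == limit:
            break
        lemma = lemma_of.get(word)
        if lemma is not None:
            pairs.append((lemma, word))
    return pairs


if __name__ == "__main__":
    tokens = ["kota", "kot", "kota", "psa", "kota", "psa", "xyz"]
    lemma_of = {"kota": "kot", "psa": "pies", "kot": "kot"}
    print(top_word_pairs(tokens, lemma_of, 2))
```

The resulting pairs could then be grouped by lemma into table lines in the format described above.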