Topic on User talk:TJones (WMF)/Notes/Esperanto Stemmer Analysis

First look / unua lego

Brooke Vibber (WMF) (talk | contribs)

Looks mostly good! The only stemming example that looks wrong is the shortening to "demokr-", which should stem as "demokrat-". This'll just have to be put in the dictionary I guess, since "-at-" is also the present passive participle suffix. A few others look like foreign names (French, German, Latin) that stem slightly oddly but acceptably.

Will take a quick look over the Java code shortly.

Brooke Vibber (WMF) (talk | contribs)

The stemming exceptions include alternate spellings, such as both "chio" and "ĉio". I'm not sure I understand what happens to these; are they just excluded from stemming and left as-is? Would "ĉion" get stemmed but not "ĉio"?

Brooke Vibber (WMF) (talk | contribs)

I'd probably recommend three small changes:

  • either ignore the alternate and incorrect spellings (like 'ghi' and 'gi' for 'ĝi') or normalize them before stemming
  • split the stemmingExceptions list into a list of short particles that should not get stemmed at all and a list of stems that should not be broken down further (eg 'demokratojn' should break down as 'demokrat-o-j-n' not 'demokr-at-o-j-n')
  • some of the stemExceptions have a missing-diacritic spelling (like 'kvazau') but not the version with correct diacritics ('kvazaŭ'); these need to be fixed.
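
As a rough illustration of the second suggestion, here is a minimal Python sketch of a stemmer with two separate lists. Everything in it is an assumption made for the example: the names UNSTEMMED_PARTICLES, PROTECTED_STEMS and SUFFIX_STEPS, the list contents, and the toy suffix rules are hypothetical and are not taken from the actual Java stemmer.

```python
# Hypothetical lists for illustration only; not the stemmer's real exception data.
UNSTEMMED_PARTICLES = {"ĝi", "tre", "kvazaŭ"}  # returned exactly as written
PROTECTED_STEMS = {"demokrat", "ĉio"}          # stripping stops once one is reached

# Toy suffix steps, applied in order: accusative, plural, word-class vowel, participle.
SUFFIX_STEPS = [("n",), ("j",), ("o", "a", "e", "i"), ("at", "it", "ot")]


def stem(word):
    """Strip toy Esperanto suffixes, honouring both exception lists."""
    word = word.lower()
    if word in UNSTEMMED_PARTICLES:
        return word
    for alternatives in SUFFIX_STEPS:
        if word in PROTECTED_STEMS:
            break  # do not break a protected stem down further
        for suffix in alternatives:
            # Keep at least two characters so very short words are not emptied.
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                word = word[: -len(suffix)]
                break
    return word
```

With these lists, stem("demokratojn") gives "demokrat" and stem("ĉion") gives "ĉio", while an unlisted word such as "aristokratojn" still loses its -at- and comes out as "aristokr".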

I'll be happy to provide pull reqs for diacritic corrections and see if I can find or pull a list of other word stems that break down weird. :)

TJones (WMF) (talk | contribs)

Thanks, Brion!

Even a nice carefully constructed language has irregularities—especially to a computer! Language is always messy.

I think the list of exceptions came from Wikipedia or Wikibooks, and I don't think the fact that they could be inflected was taken into account (typical English-speaker thinking on my part, at least; we have completely different pronoun forms, for no particularly good reason). The current stemming exceptions are just left unaltered. The goal was to keep ĉio from losing its -o, but ĉion should definitely be treated similarly. If it's easy to say which items in the list can be inflected and which can't, that would be great; otherwise I can try to work it out.

I've been thinking about the stemming options for demokrat-. Keeping in mind that the goal is not necessarily to get a correct stem, but rather a unique stem, maybe it doesn't matter. (Though in this case it picked up Demokrito, too—but stemming names is always a gamble.) It seems like it could get very complex to deal with in the general case. Productive prefixes give related forms like maldemokratia and pseŭdodemokratia—listing them all (either all the related forms or all the acceptable prefixes) would be annoying and prone to problems. On the other hand, while blocking ĉio from having the -o stripped off makes sense, it looks like other words end in -ĉio that are not related, so allowing any arbitrary prefix is ugly, too. Any thoughts on dealing with that? Maybe one way will seem obviously best if you can come up with any other potential problem cases.
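
One middle ground between listing every related form and allowing any arbitrary prefix would be a short whitelist of productive prefixes. The Python sketch below is only an illustration of that idea; PROTECTED_STEMS, PRODUCTIVE_PREFIXES and protected_stem are hypothetical names, and the prefix list is an assumption, not something the stemmer currently has.

```python
# Hypothetical data for illustration; the real lists would need to be curated.
PROTECTED_STEMS = {"demokrat"}
PRODUCTIVE_PREFIXES = ("mal", "pseŭdo", "ne")


def protected_stem(candidate):
    """Return (prefix, stem) if candidate is a protected stem, optionally
    preceded by exactly one whitelisted prefix; otherwise return None."""
    if candidate in PROTECTED_STEMS:
        return ("", candidate)
    for prefix in PRODUCTIVE_PREFIXES:
        rest = candidate[len(prefix):]
        if candidate.startswith(prefix) and rest in PROTECTED_STEMS:
            return (prefix, rest)
    return None
```

Here protected_stem("pseŭdodemokrat") returns ("pseŭdo", "demokrat"), while "demokrit" and "aristokrat" return None, so a word that merely ends in the same letters as a protected stem is not affected.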

Do you have any insight into how often the h-system and x-system forms are used in written text and in searching? If lots of people can't type ĝ and so search for gh or gx, it's probably not something we should ignore. A potential problem is the treatment of foreign words, though it doesn't matter if "ghost", "though", and "laugh" are internally represented as "ĝost", "thouĝ", and "lauĝ", as long as they don't become ambiguous and collide with other words. I can try that out and see what impact it has on the words in my sample.
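
For reference, the conversion itself is simple to sketch. The Python below is a minimal, lower-case-only illustration; the names X_SYSTEM, H_SYSTEM and convert are hypothetical and this is not the code used in the stemmer. It treats every matching digraph as a stand-in for the accented letter, which is exactly why foreign words get caught up.

```python
import re

# Digraph tables for the two ASCII conventions (lower case only in this sketch).
X_SYSTEM = {"cx": "ĉ", "gx": "ĝ", "hx": "ĥ", "jx": "ĵ", "sx": "ŝ", "ux": "ŭ"}
# In the h-system ŭ is normally written as plain u, so it has no entry here.
H_SYSTEM = {"ch": "ĉ", "gh": "ĝ", "hh": "ĥ", "jh": "ĵ", "sh": "ŝ"}


def convert(text, table):
    """Replace every digraph in `table` with its accented letter, in one
    left-to-right pass so replacements never overlap."""
    pattern = re.compile("|".join(re.escape(digraph) for digraph in table))
    return pattern.sub(lambda match: table[match.group(0)], text)
```

With this, convert("chio", H_SYSTEM) and convert("cxio", X_SYSTEM) both give "ĉio", and "ghost" becomes "ĝost". The same h-system rule also turns the compound "flughaveno" into "fluĝaveno", so the false positives are not limited to foreign words.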

Help with the missing diacritical forms would be great, whether a pull request or a list here or elsewhere.

The to do list:

  • make sure all the exceptions have proper diacritics
  • find the exceptions that can be inflected, like ĉio and handle them properly (add to a general list of unbreakable stems, or explicitly map forms to stems)
  • remove h-system and x-system words from the exception list
  • test the impact of automatic h-system and x-system conversion on stemming collisions; if it's small enough, just do it
  • decide how to handle ambiguous stems like demokrat- (accept defective stems with some errors, do something clever to handle prefixes, or something else TBD)
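
The collision test in the fourth item could be approached roughly as follows. This Python sketch is an assumption about how one might measure it, not a description of what was actually run; merged_stems, h_convert and identity_stem are hypothetical stand-ins, and the sample words are made up for the example.

```python
from collections import defaultdict


def merged_stems(words, normalize, stem):
    """Map each post-conversion stem to the set of pre-conversion stems it
    absorbs, keeping only stems where conversion merged distinct groups."""
    groups = defaultdict(set)
    for word in words:
        groups[stem(normalize(word))].add(stem(word))
    return {new: old for new, old in groups.items() if len(old) > 1}


def h_convert(word):
    # Stand-in for a full h-system conversion; handles only two digraphs.
    return word.replace("ch", "ĉ").replace("gh", "ĝ")


def identity_stem(word):
    # Stand-in for the real stemmer so the sketch stays self-contained.
    return word


sample = ["ĉi", "chi", "domo", "ghost"]
report = merged_stems(sample, h_convert, identity_stem)
```

On this sample the report contains a single entry, "ĉi", absorbing both "ĉi" and "chi". The function cannot tell a wanted merge (an h-system spelling of the same word) from an unwanted one (an unrelated foreign word), so the output is a list for manual review rather than an error count.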

Thanks for all the help!

TJones (WMF) (talk | contribs)

It took a while to get back to this, both for related and unrelated reasons... whew! Updates:

  • The exception list has been updated to have proper diacritics and no x-system or h-system words; some unneeded exceptions were removed, and a few new ones were added. The updated list is on GitHub.
  • The stemmer also works on all the exceptions with regular and irregular inflections.
  • I looked into automatic h-system and x-system conversions, for both queries and on-wiki text. Details are here, but the summary is that too many non-Esperanto words get caught up by h-system conversion, and x-system conversion has very little impact. If someone thinks x-system conversion is worth it anyway, it's straightforward to implement.
  • Nothing extra has been done yet with the ambiguous stems, but it also isn't clear how big a problem it is.