I'm a native Malay speaker, and I noticed some wrong base words. For example, ohon is supposed to be pohon.
Topic on User talk:TJones (WMF)/Notes/Analysis of Applying Indonesian Analysis Chain to Malay
Thanks, @Tofeiku! The base forms don't have to be correct, since they are only an internal representation of the stem and the users will not see them. I show them because sometimes seeing them makes it easier to understand how the inflected forms ended up grouped together. And of course the closer they are to being correct, the less likely other errors are. In this case, as long as memohon, memohonkan, pemohon, and pemohonan are all forms of the same word, the exact form of the stem doesn't matter very much.
Any other thoughts on the groupings?
Oh ok. And "use" and "electric" are not Malay words.
I saw electric and dielectric in the list and suspected they weren't Malay. That's expected. A rules-based stemmer isn't very good at detecting foreign words. Sometimes you can rule them out based on a foreign letter or an impossible stem form, but that's subject to errors, too. I consider "stupid but understandable" errors to be tolerable; even though I don't speak Malay, I understand why electric and dielectric end up together, because di- is a common prefix, and stemmers are dumb. (I was less sure about "use" since almost any three-letter CVC or VCV sequence could be a word in many languages.)
Anyway, it's good to see how the stemmer treats foreign words and names and what kinds of errors it makes, because foreign words and names are unavoidable on our projects.
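To make the grouping behavior concrete, here is a minimal toy sketch in Python. It is not the actual Indonesian analysis chain: the prefix list, suffix list, minimum stem length, and the names `toy_stem` and `group_by_stem` are all assumptions made up for illustration. It only shows how blind affix stripping can group memohon and pemohon under an "incorrect" stem like ohon, and how it puts electric and dielectric together.

```python
from collections import defaultdict

# Hypothetical, heavily simplified affix lists (assumptions for illustration,
# not the rules of any real stemmer).
PREFIXES = ("mem", "pem", "di")
SUFFIXES = ("kan", "an")
MIN_STEM = 3  # assumed lower bound so short words are not stripped to nothing


def toy_stem(word):
    """Strip at most one listed prefix and one listed suffix, with no dictionary check."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            stem = stem[:-len(suffix)]
            break
    return stem


def group_by_stem(words):
    """Group words by their toy stem; the stem is only an internal key."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


words = ["memohon", "memohonkan", "pemohon", "pemohonan", "electric", "dielectric"]
for stem, members in group_by_stem(words).items():
    print(stem, "<-", ", ".join(members))
```

In this sketch the four Malay forms share the key ohon rather than pohon, which does no harm to the grouping, while dielectric loses its di- simply because nothing in the rules knows the word is foreign.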
I don't think there's a VCV with an "e" at the end in Malay. Sorry for the late replies. Busy celebrating Eid.
No worries about slow replies! I guess I meant that VCV is plausible in any language you don't know a lot about. It's (linguistically) interesting that it doesn't happen in Malay.