@Tofeiku, @Bennylin, and @Yosri: thanks for your questions and comments!
My goal is to decide whether to enable the Indonesian language analyzer for Malay wikis. The analyzer includes a stemmer and a stop word list.
The stemmer isn’t perfect, but it has been enabled on the Indonesian wikis for a long time, and I think it is a lot better than nothing. You can try it out on the Indonesian Wikipedia. For example, searching for mengajar, belajar, or pembelajaran shows how the ranking of results differs depending on the exact form of the word used.
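If it is easier to compare the result lists side by side, here is a minimal sketch that builds search URLs for the three example words. It assumes the standard MediaWiki Action API search module (`list=search`) on the Indonesian Wikipedia; `build_search_url` is a hypothetical helper name, and the code only constructs the URLs, it does not fetch or rank anything itself.

```python
from urllib.parse import urlencode

# Assumption: the standard MediaWiki Action API endpoint for Indonesian Wikipedia.
API_URL = "https://id.wikipedia.org/w/api.php"


def build_search_url(term, limit=5):
    """Return an Action API full-text search URL for one query term.

    Hypothetical helper for illustration; it only builds the URL.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,   # the word form to search for
        "srlimit": limit,   # number of results to request
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


# The three related word forms mentioned above.
terms = ["mengajar", "belajar", "pembelajaran"]
urls = {term: build_search_url(term) for term in terms}

if __name__ == "__main__":
    for term, url in urls.items():
        print(term, url)
```

Opening each URL in a browser (or fetching it with any HTTP client) returns the top results for that word form, which makes it easier to see how much the result lists overlap.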
Do we have enough information to decide what to do?
- (1) If using the Indonesian language analyzer is clearly better than nothing, we can deploy it. (Also note that it can be removed as easily as it is deployed. It takes one to two weeks, but it is easy to do.)
  - If we find something better in the future, we can replace it, too.
- (2) If the Indonesian language analyzer is terrible for Malay, then we can abandon the project.
- If the decision is not clear, we can (3) look for more people to bring into the discussion, or we can (4) set up a search demo that uses the Indonesian language analyzer on Malay Wikipedia data, or both.
What are your thoughts? Thanks!