User:TJones (WMF)/Notes/Language Analysis Morphological Libraries

October 2017 — See TJones_(WMF)/Notes for other projects. See also T171652. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Background

After recently testing and implementing several third-party open-source Elasticsearch language analyzers and seeing that some are just simple wrappers around other third-party open-source language analysis software, I decided to go looking for other language analysis software with the potential to be similarly wrapped into Elasticsearch language analyzers that could benefit our wiki communities.

Themes

A few recurring themes emerged:

Some code is proprietary or has no licensing information, so even though it might work well, it’s not legally/philosophically available to us.
Open-source code gets abandoned or effectively abandoned (i.e., no longer being developed in a form that is useful for us), because:
- the developer just moved on to other projects.
- the developer commercialized the project and stopped open-source development.
- the developer took the project in a direction not compatible with our needs (e.g., focusing on massively parallel cluster-based installations, or pulling in huge external libraries).
Lots of code is not well-documented in English; this isn’t a huge surprise, but there may be other awesome software that we could use but we just don’t have a good way to learn that it exists. If anyone knows of such awesome software, tell me about it!
Lots of code is not in Java.
- It’s not strictly required that the code be in Java, but it is the easiest to write a wrapper around. While some code has Java integration, through JNI for example, and some programming languages besides Java use the JVM or have JVM implementations, most non-Java options greatly increase complexity.
- On the other hand, some algorithms are sufficiently straightforward that re-implementing them in Java (or any programming language) wouldn’t be that hard; so that’s always something to keep in mind.
For some languages, I found a fair number of research papers, but not with accompanying software or sufficient algorithmic description to actually implement anything.

Selection Criteria

With all that in mind, my criteria for consideration for follow-up came down to the following (updated May 2018 after working with Serbian and Slovak):

Code that has a workable license.
Code that’s in Java, or uses a straightforward enough algorithm that it could be ported to Java.
Code that isn’t in a huge library and doesn’t have massive dependencies.
Code that looks to be reasonably mature (e.g., doesn’t have a huge TO DO list of basic features or other indications that implementation was not complete).

Other important criteria for actual development and deployment (which would be assessed in a follow-up task) include:

Accuracy of analysis—so the linguistic results need to be reasonable.
Ability to be integrated—it’s possible that the API of the software makes it ridiculously hard to do necessary integration with Elasticsearch.
Run-time performance—the code shouldn’t need a giant Spark cluster to run, or be twenty times slower than our current analyzers.

The first three languages I looked at—Japanese, Vietnamese, and Korean—had a lot of options and lots of complexity. The other four—Serbian, Malay, Estonian, and Slovak—had no more than a couple of plausible options, if any.

May 2018 update: I'm now less concerned about "abandoned" code. Forking or porting part or all of a repo and folding it into the search/extra or search/extra-analysis project is not a huge problem, and is worth doing for code that provides useful morphological analysis.

June 2018 update: I'm now thinking "just Java" or portable to Java. C or C++ with Java integration is probably a non-starter; JNI integration in particular has a lot of problems. Obviously, a straightforward C or C++ algorithm could be ported to Java.

Next Steps

Based on my review of these seven languages, I suggest testing some of the software packages. Fortunately, we don’t need to commit to full Elasticsearch integration to perform our standard testing. As long as we can run the analysis and map analyzed tokens back to their original text, we can do most of the language analysis analysis to determine whether the analyzer is worth pursuing for integration.

For Japanese, I want to look at MeCab, tinysegmenter, and possibly CaboCha in more detail.
For Vietnamese, I want to look at vnTokenizer. It is the same library that the previous Elasticsearch analyzer I looked at was based on; the problems it had were with integration with Elasticsearch, not with tokenization.
For Korean, I want to look at the newer module named mecab-ko-lucene-analyzer—there are two!
For Serbian, I want to test both available stemmers: SerbianStemmer and SCStemmers. SCStemmers, which implements four stemming algorithms, seems to include the algorithm used in SerbianStemmer, but it wouldn’t hurt to compare them. If SerbianStemmer were somehow superior, it would likely be possible to port the improvements to SCStemmers. [DONE]
For Malay, I was only able to find research papers—~~nothing implemented or implementable that I could find.~~ Update: I've decided to give the existing Elastic Indonesian stemmer a go! (See Malay Update (June 2018) below.) [DONE]
For Estonian, I want to look at Vabamorf. [DONE]
For Slovak, I want to try both of the available stemmers: stemm-sk and Stemmer-sk. The former is in Python but looks to be easily ported to Java if it is awesome, and the latter is already a Lucene analyzer. [DONE]
I ended up also working on Esperanto because it was convenient to do so. [IN PROGRESS—still need to re-index]
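
The kind of testing described above (run an analyzer, then map analyzed tokens back to their original text) can be roughly sketched as follows. This is only an illustration under stated assumptions: `toy_stem` is a hypothetical stand-in for a real analyzer, and `group_by_analyzed_form` and `new_collisions` are made-up helper names, not part of any existing tool.

```python
from collections import defaultdict


def toy_stem(token):
    """Hypothetical stand-in for a real analyzer: lowercase, strip a final 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def group_by_analyzed_form(tokens, analyze):
    """Map each analyzed form back to the set of original tokens that produced it."""
    groups = defaultdict(set)
    for original in tokens:
        groups[analyze(original)].add(original)
    return groups


def new_collisions(tokens, baseline, candidate):
    """Return candidate groups that merge tokens the baseline analyzer kept apart."""
    merged = {}
    for form, originals in group_by_analyzed_form(tokens, candidate).items():
        baseline_forms = {baseline(t) for t in originals}
        if len(baseline_forms) > 1:
            merged[form] = sorted(originals)
    return merged


if __name__ == "__main__":
    sample = ["Stemmers", "stemmer", "Wikis", "wiki", "bus"]
    # Baseline here is plain lowercasing; the candidate is the toy stemmer.
    for form, originals in new_collisions(sample, str.lower, toy_stem).items():
        print(form, "<-", originals)
```

Reviewing the merged groups by hand is what shows whether a candidate analyzer is conflating words sensibly, and none of it requires the analyzer to be wired into Elasticsearch first.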

I expect some failures. Two of the language analyzers maintained or suggested by Elasticsearch (Japanese and Vietnamese) did not perform as well as we needed them to. However, several others did: those for Polish, Hebrew, Ukrainian, and Chinese (which involved two plugins being melded together). Right now, six of the seven languages I investigated yielded something worth following up on. We’ll see how many of those turn into something usable; if it’s two or three, this is definitely a process worth repeating. If it is zero, then maybe we need to let the language analyzers mature on their own and come to us when they are ready.

Malay Update (June 2018)

After working on Serbian (T178926/T192395) and Slovak (T178929) and looking at the papers they were based on or translated from, I decided to reconsider what counts as "implementable" for Malay, review the papers on Malay stemming, and compare them to the existing Indonesian analysis.

My understanding of Indonesian and Malay was pretty simple: that they are "more distinct than American and British English, but less distinct than Spanish and Portuguese". Also, Malay and Indonesian didn't interact in my investigation into fallback languages, where each is used as a fallback language for other languages.

However, looking at the wiki page on the matter, and reviewing some other sources, it seems that a lot of the difference is in Dutch-influenced vs English-influenced spelling of certain sounds, Dutch vs English loanwords, other vocabulary differences, and some pronunciation differences—all of which can decrease mutual intelligibility—but the grammar of the two standard forms seems to be essentially the same.

I also compared the Malay stemmer papers with the Lucene Indonesian stemmer implementation, and verified that they are working on similar affixes. There are some discrepancies, but the core affixes are the same, and the differences seem to come down to what affixes to try to account for (some derivational vs inflectional).

While it's possible that spelling differences or vocabulary differences could increase the error rate for Malay vs Indonesian, it seems to be worth testing; if it is successful, all we need to do is configure it. Everything is not only already built, it's already installed, too!
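
As a rough idea of what "configure it" could look like, here is a minimal sketch of index settings that reuse Elasticsearch's built-in Indonesian stemmer and stop word list in a custom analyzer. The analyzer name `malay_text` and the helper `build_malay_settings` are hypothetical, and whether the Indonesian stop words are appropriate for Malay is an untested assumption; this is not the configuration actually deployed.

```python
import json


def build_malay_settings():
    """Build index settings that apply Indonesian analysis to Malay text (a sketch)."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Built-in Indonesian stop word list; assumed, not verified, to suit Malay.
                    "indonesian_stop": {"type": "stop", "stopwords": "_indonesian_"},
                    # Built-in stemmer token filter using the Lucene Indonesian stemmer.
                    "indonesian_stemmer": {"type": "stemmer", "language": "indonesian"},
                },
                "analyzer": {
                    # Hypothetical analyzer name, chosen for this example only.
                    "malay_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "indonesian_stop", "indonesian_stemmer"],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent when creating a test index.
    print(json.dumps(build_malay_settings(), indent=2))
```

Spelling the analyzer out as a custom chain, rather than just naming the built-in `indonesian` analyzer, leaves room to drop or swap individual filters if testing shows that Malay needs different handling.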

Raw Notes

Below is a table with my notes.

Language	Name	license	prog lang	age	notes
Japanese	Kytea	Apache 2	C++	7 years	segments words or morphemes, user dict for training, trainable, POS tagging
Japanese	Kagome	Apache 2	Go	0-2 years	normal/search/extended segmenting (a la Kuromoji); user dict
Japanese	MeCab	GNU GPL/LGPL	C++ / Java integration	5-6 years	docs in Japanese
Japanese	yc-nlplab	?	FOMA	3 years	based on mecab?
Japanese	Sen	?	?	?	defunct; see also http://lith.me/code/2015/02/05/Japanese-tokenization-with-Java-and-Lucene/
Japanese	tinysegmenter	BSD	Python / Javascript	<1 year	seems too small!
Japanese	tinysegmenter	BSD	Python	3 years	corpora https://github.com/SamuraiT/tinysegmenter/tree/master/tests
Japanese	tinysegmenter	BSD	Javascript	9 years
Japanese	JUMAN	?	C	20 years?
Japanese	JUMAN++	Apache 2	C++	2 years?	HUGE! / online testable
Japanese	CaboCha	LGPL/BSD	C++ / Java integration	<1 year	relies on MeCab? Docs in Japanese
Japanese	ChaSen legacy	?	C	5 years	abandoned?
Japanese	micter	?	C++	7 years	Not mature
Japanese	Notes
Japanese	Notes

Vietnamese	elasticsearch-analysis-vietnamese	Apache 2	Java	<1 year	the one we tested previously
Vietnamese	vnTokenizer	GNU GPL	Java	8 years	internal software for elasticsearch-analysis-vietnamese
Vietnamese	vn.vitk	GNU GPL	Java	1-2 years	v5 of vnTokenizer; requires Apache Spark, intended for use in a cluster
Vietnamese	Vietnamese morphological analyzer with using SVMs	MIT	Python	1-2 years	depends on YamCha
Vietnamese	PvnSeg (no URL)	?	Perl	?	mentioned in other sources; best performance in LREC 2008
Vietnamese	JVnSegmenter	copyrighted	Java	10 years
Vietnamese	JVnTextPro	GNU GPL	Java	7 years	follow-up to JVnSegmenter
Vietnamese	Roy_VnTokenizer	none?	Python	4 years	excellent overview and literature review, test corpora
Vietnamese	UETsegmenter	none?	Java	1-2 years	geared to academic use
Vietnamese	survey article
Vietnamese	notes
Korean	korean-morphological-analyzer	?	C++	9 years
Korean	komoran-2.0	Apache 2	Java	3 years	seems to have gone commercial
Korean	HanNanum	GNU GPL	Java	6 years
Korean	mecab-ko-lucene-analyzer	Apache 2	Java	4 years
Korean	mecab-ko-lucene-analyzer	Apache 2	Java	10 months	Elastic 5.1.1
Korean	KoNLPy	GNU GPL	Python	?
Korean	notes				lots of “Korean morpheme analyzer tools”

Serbian	SerbianStemmer	none	Python	5 years
Serbian	SCStemmers	GNU GPL	Java	1-2 years	collection of 4 stemming algorithms

Malay	(nothing!—just papers; nothing implemented or implementable that I could find.)

Estonian	Vabamorf	GNU LGPL	JNI/C++	2-3 years	Update: Java availability is through JNI
Estonian	PyVabamorf	GNU LGPL	Python	3 years	Python wrapper to Vabamorf
Estonian	Estmorf	proprietary		3 years
Estonian	Filosoft	proprietary	?	?

Slovak	stemm-sk	MIT	Python	2 years	fairly simple, could be easily translated to Java
Slovak	Stemmer-sk	GNU AGPL	Java	2 years	Lucene Analyzer