Below is the final command-line config I used for testing the Nori Korean analyzer:
* the monolithic Nori analyzer is unpacked into its component tokenizer and filters
* the icu_normalizer filter is added
* the tokenizer is configured for "mixed" compound processing, which keeps both the original compound and its parts (see the _analyze example below)
* "VCP", "VCN", and "VX" are added to the part-of-speech filter's stop tags
* the nori_readingform filter, which converts Hanja to their Hangul readings, is kept in the filter chain
* a minimum-length token filter is added to eliminate empty tokens
* a character-mapping filter is added to handle characters that cause problems in tokenization and to fix the regression on dotted I (İ) introduced by ICU normalization
* another character filter (a pattern replace) strips the most common problematic combining diacritics
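As a quick illustration of what "mixed" decompound mode does, the tokenizer can be exercised directly through the _analyze API without creating an index first. This is just a sketch, assuming the analysis-nori plugin is installed on a local instance; the sample compound 삼성전자 ("Samsung Electronics") is the one used in the Elasticsearch Nori docs, which splits into 삼성 + 전자:

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": {
    "type": "nori_tokenizer",
    "decompound_mode": "mixed"
  },
  "text": "삼성전자"
}
'

With "mixed", the token stream should contain all three tokens (삼성전자, 삼성, 전자), whereas "discard" would drop the original compound and "none" would keep only the compound.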
curl -X PUT "localhost:9200/nori_mixed_icu_custom_pos?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_tok": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed"
          }
        },
        "filter": {
          "nori_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [ "E", "IC", "J", "MAG", "MAJ", "MM", "SP",
                          "SSC", "SSO", "SC", "SE", "XPN", "XSA",
                          "XSN", "XSV", "UNA", "NA", "VSV",
                          "VCP", "VCN", "VX" ]
          },
          "nori_length": {
            "type": "length",
            "min": 1
          }
        },
        "char_filter": {
          "nori_charfilter": {
            "type": "mapping",
            "mappings": [
              "\\u0130=>I",
              "\\u00B7=>\\u0020",
              "\\u318D=>\\u0020",
              "\\u00AD=>",
              "\\u200C=>"
            ]
          },
          "nori_combo_filter": {
            "type": "pattern_replace",
            "pattern": "[\\u0300-\\u0331]",
            "replacement": ""
          }
        },
        "analyzer": {
          "text": {
            "type": "custom",
            "char_filter": [ "nori_charfilter", "nori_combo_filter" ],
            "tokenizer": "nori_tok",
            "filter": [ "nori_posfilter", "nori_readingform", "icu_normalizer", "nori_length" ]
          }
        }
      }
    }
  }
}
'
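Once the index exists, the full custom analyzer (char filters, tokenizer, and filter chain together) can be spot-checked the same way. This is a minimal sketch; the analyzer name "text" comes from the config above, and the sample sentence ("Wikipedia is a free encyclopedia") is an arbitrary choice of mine:

curl -X POST "localhost:9200/nori_mixed_icu_custom_pos/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "text",
  "text": "위키백과는 자유 백과사전입니다"
}
'

The pretty-printed output makes it easy to verify that the part-of-speech filter is dropping the intended tags and that the character mappings are applied before tokenization.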
I still need to convert this to the appropriate config in AnalysisConfigBuilder and test it there.