Topic on User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Revi C. (talk | contribs)

I didn't look at the whole group. I looked at your notes.

(snip) "智", are listed in Wiktionary as just 지/ji, while others, like "知", have multiple Hangeul versions (in this case, 알/al or 지/ji), and it looks like Nori picked this one. In several other cases, especially where the token ends with -진, the part of speech tagger is marking 지 as an auxiliary verb, which is maybe another category of parts of speech we should filter.

知 just has 지/ji; 알/al is the 'meaning' part. enwiktionary is wrong there, then.

-진 is -지+ㄴ.
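The split above can be checked with plain Unicode arithmetic on precomposed Hangul syllables. This is only a minimal sketch: the function name split_final and the table FINALS are made up for illustration, and it shows how the syllable block decomposes, not how Nori itself analyzes the token.

```python
# Precomposed Hangul syllables start at U+AC00 and are laid out as
# (lead * 21 + vowel) * 28 + final, where final 0 means "no final consonant".
HANGUL_BASE = 0xAC00
HANGUL_LAST = 0xD7A3
FINAL_COUNT = 28

# Compatibility jamo for the 28 final slots (index 0 = no final).
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ",
          "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ",
          "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def split_final(syllable: str) -> tuple:
    """Return (syllable without its final consonant, final consonant or "")."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    final_index = (code - HANGUL_BASE) % FINAL_COUNT
    # Subtracting the final index gives the same lead + vowel with no final.
    return chr(code - final_index), FINALS[final_index]


stem, final = split_final("진")
print(stem, final)
```

Calling split_final on a syllable that has no final consonant, such as 지, returns the syllable unchanged with an empty string for the final.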

-이- is often used as a suffix, so there are lots of examples of it being used that way there.

TJones (WMF) (talk | contribs)

> 知 just has 지/ji; 알/al is the 'meaning' part. enwiktionary is wrong there, then.

No, it was just me. I didn't understand the concept of "eumhun", so I interpreted "(eumhun 알 지 (al ji))" incorrectly. It's hard to find documentation on Wiktionary (there are no links), and the notation is confusing if you aren't familiar with it.

I'll update my notes.

TJones (WMF) (talk | contribs)

For -이-, I think we need to filter the "positive designator" parse. It shows up a lot, doesn't seem to carry a lot of valuable meaning, and links a lot of otherwise unlinked tokens.
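A rough sketch of what that filtering amounts to, assuming the analyzer output is available as (surface, tag) pairs and that the positive designator is tagged VCP. The sample tokens are hand-written for illustration and are not actual Nori output; drop_stop_tags is a made-up helper name. In Elasticsearch the equivalent would presumably be adding the tag to the stoptags list of the nori_part_of_speech token filter.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Assumption: "VCP" is the tag for the positive designator parse of -이-.
DEFAULT_STOP_TAGS = frozenset({"VCP"})


def drop_stop_tags(tokens: Iterable[Token],
                   stop_tags: frozenset = DEFAULT_STOP_TAGS) -> List[Token]:
    """Keep only the tokens whose part-of-speech tag is not in stop_tags."""
    return [(surface, tag) for surface, tag in tokens if tag not in stop_tags]


# Hand-written example parse (hypothetical, for illustration only).
sample = [("학생", "NNG"), ("이", "VCP"), ("다", "EF")]
kept = drop_stop_tags(sample)
print(kept)
```

Passing a larger set, for example frozenset({"VCP", "VX"}), would also drop tokens tagged as auxiliary verbs, if that category turns out to need filtering too.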
