Topic on User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Revi C. (talk | contribs)
가르다
  • Original: 가르다: [가르다] [가르다호]
  • Have no idea where 가르다호 is coming from.
갈라서
  • Fine
귄
  • Original: 귄: [귄] [르귄]
  • Have no idea where 르귄 is coming from.
끌어당기
  • LGTM
다스리
  • LGTM
달리
  • LGTM
덤벼들
  • LGTM
독하
  • LGTM
뒤흔들
  • LGTM
들뜨
  • LGTM
링
  • Original: 링: [링] [바이링]
  • Have no idea where 바이링 is coming from.
매달리
  • LGTM
매사추세츠
  • LGTM
멋지
  • LGTM
몸부림치
  • LGTM
무덥
  • LGTM
무르만스크
  • LGTM
바덴뷔르템베르크
  • LGTM
부러뜨리
  • LGTM
불러일으키
  • LGTM
빙
  • Original: 빙: [리빙] [빙]
  • 리빙 is a direct transliteration of "Living". Unrelated.
빠뜨리
  • LGTM
빠져나오
  • LGTM
사라
  • Original: 사라: [사라] [사라코너]
  • 사라코너 is a name: 사라 코너. Unrelated.
사우스다코타
  • LGTM
사우스캐롤라이나
  • LGTM
슐레스비히홀슈타인
  • LGTM
리아디
  • Original: 리아디: [리아디] [아디]
  • Looks unrelated.
아키타
  • LGTM
애쓰
  • LGTM
야단치
  • LGTM
열리
  • LGTM
오래되
  • LGTM
우르
  • Original: 우르: [우러] [우르]
  • If it meant to say 울어, it should've been 울으/울어.
웨스턴오스트레일리아
  • LGTM
위안장
  • LGTM
유프라테스
  • LGTM
잘츠부르크
  • LGTM
잠기
  • LGTM
지내
  • LGTM
쫓기
  • LGTM
추하
  • LGTM
테네시
  • LGTM
펜
  • Original: 펜: [비제이펜] [펜]
  • Does not look related.
후려치
  • LGTM
후쿠시마
  • LGTM
휴
  • Original: 휴: [손휴] [휴]
  • 손휴 looks like a name, unrelated.
TJones (WMF) (talk | contribs)

Sorry, the stemming list includes some compounds, which are divided into parts and will be searchable by any of the parts, though exact matches are best. So, the compound cases are fine, assuming the tokenization (breaking into words) is reasonable. Because there's a parser involved, context can change the way characters are treated, which adds to the complexity.

  • 르귄 / 르 / 귄: a compound, with 르 and 귄 tagged as proper nouns.
  • 빙 / 리빙: in isolation, 리빙 comes out as a single token. There are three instances of 리빙 in my Wikipedia corpus, and two of them are treated correctly. However, in "태양의 아이들 (2011, 웅진리빙하우스) ISBN 9788901136059", it gets indexed as a compound. Probably still a parsing error.
  • 사라 / 사라코너: yep, I see it. But for some reason the name 사라코너 is also being treated as a compound [사라코너 • 사라 • 코너].
  • 리아디 / 아디: again, 리아디 is treated as a compound, and the part 아디 is indexed under the whole.
  • 우러 is interpreted as 우르/VV(Verb)+어/E(Verbal endings), so it gets grouped with other instances of 우르.
  • 비제이펜: interpreted as a compound, all proper nouns: "비/NNP(Proper Noun)+제이/NNP(Proper Noun)+펜/NNP(Proper Noun)", and so grouped under each of the parts.
  • 손휴: again, proper nouns.
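
To make the grouping effect concrete, here is a rough Python sketch of how a compound ends up listed under each of its parts. It is a toy model, not Nori itself: the `ANALYSES` table is hypothetical and hand-written from the examples in this thread, and treating the verbal ending 어 as dropped from the index is an assumption for illustration.

```python
from collections import defaultdict

# Hand-written stand-in for analyzer output: surface form -> index terms.
# Hypothetical data based on the examples in this thread, not real Nori output.
ANALYSES = {
    "사라": ["사라"],
    "사라코너": ["사라코너", "사라", "코너"],      # compound kept, plus its parts
    "우르": ["우르"],
    "우러": ["우르"],                            # 우르/VV + 어/E, ending dropped (assumed)
    "펜": ["펜"],
    "비제이펜": ["비제이펜", "비", "제이", "펜"],  # three proper-noun parts
}

def stemming_groups(analyses):
    """Group surface forms under every index term they produce."""
    groups = defaultdict(set)
    for surface, terms in analyses.items():
        for term in terms:
            groups[term].add(surface)
    return {term: sorted(forms) for term, forms in groups.items()}

groups = stemming_groups(ANALYSES)
for term in ("사라", "우르", "펜"):
    print(term, groups[term])
```

Under these assumptions, 사라코너 lands in the 사라 group and 비제이펜 in the 펜 group simply because each emits that part as an index term, which is the pattern behind the "unrelated" entries in the stemming list.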

This brings up the possibility that we should not index compounds by their parts. The default setting throws away the original compound and only keeps the parts. I thought keeping the original would increase precision when you know exactly what you are looking for. Not keeping the parts would get rid of some of these errors, but also make it harder to match when you have part of a compound. For example, right now, many-part compounds can match a shorter compound that is part of it. So a four-part compound, ABCD, can match the three-part compound, ABC, because A, B, and C are all indexed separately.
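
The trade-off can be sketched with a toy model of the Nori tokenizer's `decompound_mode` values (`none`, `discard`, `mixed`). The functions `index_terms` and `matches` are hypothetical helpers, and the matching rule (a hit if the whole compound is shared, or if every part of the query is present in the document) is a simplifying assumption, not how the search engine actually scores or parses queries.

```python
def index_terms(parts, mode):
    """Terms emitted for one compound, loosely modeled on decompound_mode."""
    whole = "".join(parts)
    if len(parts) == 1 or mode == "none":
        return {whole}              # keep the compound, no decomposition
    if mode == "discard":
        return set(parts)           # parts only, original compound thrown away
    if mode == "mixed":
        return {whole, *parts}      # original compound plus its parts
    raise ValueError(f"unknown mode: {mode}")

def matches(query_parts, doc_parts, mode):
    """Assumed rule: exact compound match, or all query parts found in the doc."""
    doc_terms = index_terms(doc_parts, mode)
    query_terms = index_terms(query_parts, mode)
    whole = "".join(query_parts)
    if whole in query_terms and whole in doc_terms:
        return True                 # exact match on the whole compound
    parts = query_terms - {whole}
    return bool(parts) and parts <= doc_terms

# The four-part compound ABCD versus the three-part query ABC.
for mode in ("none", "discard", "mixed"):
    print(mode, matches(list("ABC"), list("ABCD"), mode))
```

In this sketch, ABC finds ABCD under `discard` and `mixed` because A, B, and C are indexed separately, while `none` only matches the exact compound. That mirrors the choice above: dropping the parts removes the spurious groupings but also loses partial-compound matches.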

Based on the general review of Tokenization and Compounds, though, I think we are okay, with more correct tokenizations than errors.

Thanks again, revi, for all the help! Any more comments on anything would be welcome!
