Topic on User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Speaker Review -> Tokenization and Compounds -> 7 sentences with many-part compounds

Bmansurov (WMF) (talk | contribs)
input 양재역 - 양재시민의숲역 - 양재 나들목 (제부여객으로 이관)
tokens [양재역 • 양재 • 역] — [양재시민의숲역 • 양재 • 시민 • 숲 • 역] — [양재] — [나들목 • 나들 • 목] — [제부여객 • 제부 • 여객] — [이관]
my tokens same as above
input 제41권 《비틀스를 위기에서 건진 노란 잠수함》
tokens [41] — [권] — [비틀스] — [위기] — [건진 • 것 • 이 • 지] — [노란 • 노랗] — [잠수함 • 잠수 • 함]
my tokens [41] — [권] — [비틀스] — [위기] — [건지] (건지다 - pull up, 건진 - pulled up) — [노란 • 노랗] — [잠수함 • 잠수 • 함]
input 17번트랙 <좋은날> 브라운아이드걸스 버전을 편곡
tokens [17] — [번] — [트랙] — [좋] — [날] — [브라운아이드걸스 • 브라운 • 아이드 • 걸스] — [버전] — [편곡]
my tokens same as above
input 사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
tokens [사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] — [전해져왔 • 전하 • 지 • 오]
my tokens [사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] (not sure if dividing 동남 further is a good idea. 동 - east, 남 - south, but 동남 - southeast) — [전해져왔 • 전하] (removed 지 • 오, as they are not informative: 전하다 + 아지다 + 오다 + 았다 => 전해져왔다)
input 미국군이 처음으로 라인강을 도하한다
tokens [미국] — [군] — [처음] — [라인강 • 라인 • 강] — [도하]
my tokens same as above
input 전 구간 야마구치현에 소재.
tokens [구간] — [야마구치현 • 야마구치 • 현] — [소재]
my tokens same as above
input 당신이 뭔데 여기서 큰소리를 치는거야.
tokens [당신] — [뭔데 • 뭐 • 이] — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] — [야 • 이]
my tokens [당신] — [뭔데 • 뭐] (removed 이: 뭐 + 인 + 데 => 뭔데) — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] (removed verb ending)
TJones (WMF) (talk | contribs)

Thanks again, Baha!

  • 제41권 《비틀스를 위기에서 건진 노란 잠수함》
    • 건진 / 건지: this was interpreted as "것/NNB(Dependent noun)+이/VCP(Positive designator)+ᆫ/E(Verbal endings)+지/NNB(Dependent noun)+ᆫ/J(Ending Particle)"; the verbal ending and particle were filtered. This looks like a parser error, since it doesn't seem like a verbal ending should attach to a noun.
  • 사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
    • 동남 — I agree that splitting them doesn't seem necessary for directions, but at least it makes sense why it happened.
    • 전해져왔 / 지 / 오: The parser seems to agree with you, because 지 and 오 are both marked as "Auxiliary Verb or Adjective", which sounds eminently ignorable.
  • 당신이 뭔데 여기서 큰소리를 치는거야.
    • 뭔데/이 — another "Positive designator", looking like it should be filtered.
    • 야 — oddly, to me, this gets split into "이/VCP(Positive designator)+야/E(Verbal endings)", where the whole thing, 야, is also an ending. So it originally would have been [[야 • 이] • 야], where the first "야" is a compound, and the second "야" is a verb ending (which did get filtered). Weird. Another vote for "positive designator" to get filtered.

Again, nothing that seems horrible, given the overall complexity of the task. More votes for filtering "positive designator" and "auxiliary verb or adjective", and possibly looking into filtering "negative designator", too.
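A rough sketch of what that extra filtering could look like, in Python rather than inside the analyzer itself. The tokens are copied by hand from this thread, not produced by running Nori. The VCP and VX tags are the ones quoted above; the "VV" and "NP" tags on the parts that are kept are assumptions, and `filter_parts` is a hypothetical helper name. In Elasticsearch, the equivalent setting would presumably be the `stoptags` list of the `nori_part_of_speech` token filter.

```python
# Sketch only: tokens are hand-copied from this thread, not produced by Nori.
# VCP = positive designator, VCN = negative designator,
# VX = auxiliary verb or adjective (tag labels as quoted above).
EXTRA_STOP_TAGS = {"VCP", "VCN", "VX"}


def filter_parts(parts, stop_tags=EXTRA_STOP_TAGS):
    """Keep the text of each (text, tag) pair whose tag is not a stop tag."""
    return [text for text, tag in parts if tag not in stop_tags]


# Decompounded parts of 전해져왔 and 뭔데. The VX and VCP tags come from the
# discussion above; "VV" and "NP" for the kept parts are assumed.
jeonhae = [("전하", "VV"), ("지", "VX"), ("오", "VX")]
mwonde = [("뭐", "NP"), ("이", "VCP")]

print(filter_parts(jeonhae))  # ['전하']
print(filter_parts(mwonde))   # ['뭐']
```

With these tags stopped, the two examples come out the same as Baha's suggested tokens: [전해져왔 • 전하] and [뭔데 • 뭐].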

Generally, though, I'm hopeful this will work out with only minor tweaks.

-revi (talk | contribs)

Baha's one LGTM.
