Topic on User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Speaker Review -> Tokenization and Compounds -> 7 sentences with many-part compounds

3 comments • 07:14, 9 October 2018

input	양재역 - 양재시민의숲역 - 양재 나들목 (제부여객으로 이관)
tokens	[양재역 • 양재 • 역] — [양재시민의숲역 • 양재 • 시민 • 숲 • 역] — [양재] — [나들목 • 나들 • 목] — [제부여객 • 제부 • 여객] — [이관]
my tokens	same as above

input	제41권 《비틀스를 위기에서 건진 노란 잠수함》
tokens	[41] — [권] — [비틀스] — [위기] — [건진 • 것 • 이 • 지] — [노란 • 노랗] — [잠수함 • 잠수 • 함]
my tokens	[41] — [권] — [비틀스] — [위기] — [건지] (건지다 - pull up, 건진 - pulled up) — [노란 • 노랗] — [잠수함 • 잠수 • 함]

input	17번트랙 <좋은날> 브라운아이드걸스 버전을 편곡
tokens	[17] — [번] — [트랙] — [좋] — [날] — [브라운아이드걸스 • 브라운 • 아이드 • 걸스] — [버전] — [편곡]
my tokens	same as above

input	사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
tokens	[사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] — [전해져왔 • 전하 • 지 • 오]
my tokens	[사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] (not sure if dividing 동남 further is a good idea. 동 - east, 남 - south, but 동남 - southeast) — [전해져왔 • 전하] (removed 지 • 오, as they are not informative: 전하다 + 아지다 + 오다 + 았다 => 전해져왔다)

input	미국군이 처음으로 라인강을 도하한다
tokens	[미국] — [군] — [처음] — [라인강 • 라인 • 강] — [도하]
my tokens	same as above

input	전 구간 야마구치현에 소재.
tokens	[구간] — [야마구치현 • 야마구치 • 현] — [소재]
my tokens	same as above

input	당신이 뭔데 여기서 큰소리를 치는거야.
tokens	[당신] — [뭔데 • 뭐 • 이] — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] — [야 • 이]
my tokens	[당신] — [뭔데 • 뭐] (removed 이: 뭐 + 인 + 데 => 뭔데) — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] (removed verb ending)

15:00, 5 October 2018

TJones (WMF)

Thanks again, Baha!

제41권 《비틀스를 위기에서 건진 노란 잠수함》
- 건진 / 건지 — this was interpreted as "것/NNB(Dependent noun)+이/VCP(Positive designator)+ᆫ/E(Verbal endings)+지/NNB(Dependent noun)+ᆫ/J(Ending Particle)", and the verbal ending and particle were filtered. This is likely a parser error, since it doesn't seem like a verbal ending should go on a noun.

사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
- 동남 — I agree that splitting them doesn't seem necessary for directions, but at least it makes sense why it happened.
- 전해져왔 / 지 / 오 — The parser seems to agree with you, because these are both marked as "Auxiliary Verb or Adjective", which sounds eminently ignorable.

당신이 뭔데 여기서 큰소리를 치는거야.
- 뭔데/이 — another "Positive designator", looking like it should be filtered.
- 야 — oddly, to me, this gets split into "이/VCP(Positive designator)+야/E(Verbal endings)", where the whole thing, 야, is also an ending. So it originally would have been [[야 • 이] • 야], where the first "야" is a compound, and the second "야" is a verb ending (which did get filtered). Weird. Another vote for "positive designator" to get filtered.

Again, nothing that seems horrible, given the overall complexity of the task. More votes for filtering "positive designator" and "auxiliary verb or adjective", and possibly looking into filtering "negative designator", too.
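To make the proposal concrete, here is a minimal sketch that replays the two analyses quoted above and compares what survives under the current filtering (verbal endings and particles) with what would survive if "positive designator" were filtered too. It is an illustration, not the analyzer itself: the function names are hypothetical, and the tag spellings `VX` (auxiliary verb or adjective) and `VCN` (negative designator) are assumptions, since only `VCP`, `E`, `J`, and `NNB` appear in the quoted output. In Elasticsearch this would presumably be configured through the `stoptags` list of the `nori_part_of_speech` token filter rather than in code like this.

```python
import re

# Tags the thread says are already dropped: E (verbal endings), J (particles).
ALREADY_FILTERED = frozenset({"E", "J"})
# Tags proposed for filtering. "VCP" appears in the quoted analyses;
# "VX" and "VCN" are assumed spellings for the other two categories.
PROPOSED = frozenset({"VCP", "VX", "VCN"})

# Matches one 'form/TAG(description)' unit of the quoted analysis strings.
MORPHEME = re.compile(r"([^/+]+)/([A-Z]+)\([^)]*\)")


def parse_analysis(analysis):
    """Turn 'form/TAG(description)+...' into a list of (form, tag) pairs."""
    return MORPHEME.findall(analysis)


def surviving_forms(analysis, stoptags):
    """Return the morpheme forms whose tag is not in stoptags."""
    return [form for form, tag in parse_analysis(analysis) if tag not in stoptags]


# Analyses copied from the comments above.
GEONJIN = (
    "것/NNB(Dependent noun)+이/VCP(Positive designator)+ᆫ/E(Verbal endings)"
    "+지/NNB(Dependent noun)+ᆫ/J(Ending Particle)"
)
YA = "이/VCP(Positive designator)+야/E(Verbal endings)"

for label, analysis in (("건진", GEONJIN), ("야", YA)):
    current = surviving_forms(analysis, ALREADY_FILTERED)
    proposed = surviving_forms(analysis, ALREADY_FILTERED | PROPOSED)
    print(label, "current:", current, "proposed:", proposed)
```

With only E and J filtered, 건진 keeps 것, 이, and 지, and 야 keeps 이, which matches the tokens listed at the top. Adding VCP drops 이 in both cases, but the 것 and 지 from the apparent misparse of 건진 remain, so tag filtering alone would not recover 건지.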

Generally, though, I'm hopeful this will work out with only minor tweaks.

18:34, 5 October 2018

Revi C.

Baha's version LGTM.

07:14, 9 October 2018
