User talk:TJones (WMF)/Notes/Nori Analyzer Analysis

Speaker Review -> Tokenization and Compounds -> 10 random sentences

9 comments • 17:26, 16 October 2018 6 years ago

I've checked the tokenization of 10 random sentences. The results look good.

input	김대중 대통령은 2003년까지 학급당 학생수를 35명 이하로 감축한다는내용의 '7.20 교육여건 개선계획' 을 발표했다.
tokens	[김대중] — [대통령] — [2003] — [년] — [학급] — [학생] — [수] — [35] — [명] — [이하] — [감축] — [내용] — [7] — [20] — [교육] — [여건] — [개선] — [계획] — [발표]
my tokens	[김대중 • 김 • 대중] (person's name which consists of the last name and the first name) — [대통령] — [2003] — [년] — [학급] — [학생] — [수] — [35] — [명] — [이하] — [감축] — [내용] — [7] — [20] — [교육] — [여건] — [개선] — [계획] — [발표]

input	모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원
tokens	[모델] — [mmx] — [sse] — [sse] — [2] — [sse] — [3] — [ssse] — [3] — [sse] — [4] — [1] — [sse] — [4] — [2] — [향상] — [인텔] — [스피드스텝 • 스피드 • 스텝] — [기술] — [eist] — [em] — [64] — [t] — [extended] — [memory] — [64] — [technology] — [xd] — [비트] — [가상] — [기술] — [스마트] — [캐시] — [인텔] — [터보] — [부스트] — [지원]
my tokens	[모든] (missing word) — [모델] — [mmx] — [sse] — [sse] — [2] — [sse] — [3] — [ssse] — [3] — [sse] — [4] — [1] — [sse] — [4] — [2] — [향상] — [인텔] — [스피드스텝 • 스피드 • 스텝] — [기술] — [eist] — [em] — [64] — [t] — [extended] — [memory] — [64] — [technology] — [xd] — [비트] — [가상화] (missing ending; with the ending the word means "virtualization", without it something different) — [기술] — [스마트] — [캐시] — [인텔] — [터보] — [부스트] — [지원]

input	다 자라면 몸길이는 61 cm, 몸무게는 1.4~2.7 kg 정도가 된다.
tokens	[자라] — [몸길이 • 몸 • 길이] — [61] — [cm] — [몸무게 • 몸 • 무게] — [1] — [4] — [2] — [7] — [kg] — [정도] — [된다 • 되]
my tokens	[다] (missing word) — [자라] — [몸길이 • 몸 • 길이] — [61] — [cm] — [몸무게 • 몸 • 무게] — [1] — [4] — [2] — [7] — [kg] — [정도] — [된다 • 되]

input	7월 14일에는 태항산에 있던 조선청년연합회 소속 병사들이 하북성에 도착하자, 당일 하북성 섭현에서 김두봉, 박효삼 등과 함께 조선의용군을 발족시키고 총사령관에 취임했다.
tokens	[7] — [월] — [14] — [일] — [태항] — [산] — [있] — [조선] — [청년] — [연합회 • 연합 • 회] — [소속] — [병사] — [하북성 • 하북 • 성] — [도착] — [당일] — [하북성 • 하북 • 성] — [섭] — [현] — [김두봉] — [박] — [효] — [삼] — [등] — [조선] — [용군] — [발족] — [총사령관 • 총 • 사령 • 관] — [취임]
my tokens	[7] — [월] — [14] — [일] — [태항] — [산] — [있] — [조선] — [청년] — [연합회 • 연합 • 회] — [소속] — [병사] — [하북성 • 하북 • 성] — [도착] — [당일] — [하북성 • 하북 • 성] — [섭현] (should be one word) — [김두봉 • 김 • 두봉] (person's last and first name) — [박효삼 • 박 • 효삼] (person's name) — [등] — [함께] (missing word) — [조선] — [용군] — [발족] — [시키] (missing word) — [총사령관 • 총 • 사령 • 관] — [취임]

input	연합감리교회의 조직은 미국 이외에도 캐나다와 유럽, 아프리카와 필리핀의 교회들을 포함한다.
tokens	[연합] — [감리] — [교회] — [조직] — [미국] — [이외] — [캐나다] — [유럽] — [아프리카] — [필리핀] — [교회] — [포함]
my tokens	same as above

input	2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.
tokens	[2006] — [년] — [중화] — [인민공화국 • 인민 • 공화국] — [단백질 • 단백 • 질] — [함량] — [속여서 • 속이] — [미국] — [수출] — [가축] — [사료] — [원료] — [인 • 이] — [밀] — [글루텐] — [등] — [조단] — [백] — [함량] — [높] — [사료] — [원료] — [단백질 • 단백 • 질] — [양] — [과장] — [부풀리] — [데] — [이용]
my tokens	[2006] — [년] — [중화] — [인민공화국 • 인민 • 공화국] — [단백질 • 단백 • 질] — [함량] — [속이] (first form is just 속이+어서) — [미국] — [수출] — [가축] — [사료] — [원료] — (removed [인 • 이] as it's a noun maker and doesn't have a meaning by itself) — [밀] — [글루텐] — [등] — [조단] — [백] — [함량] — [높] — [사료] — [원료] — [단백질 • 단백 • 질] — [양] — [과장] — [부풀] (removed ending) — [데] — [이용]

input	일본 요리는 쇼군 치하 동안에 엘리트주의를 없애려 했던 중세 시대가 출현하며 변화하였다.
tokens	[일본] — [요리] — [쇼군] — [치하] — [동안] — [엘리트주의 • 엘리트 • 주의] — [없애] — [했 • 하] — [중세] — [시대] — [출현] — [변화]
my tokens	same as above

input	『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.
tokens	[산릉도감 • 산릉 • 도감] — [궤] — [등] — [문헌] — [의하] — [세종] — [영릉] — [영릉] — [명종] — [강릉] — [강릉] — [인조] — [장릉] — [장릉] — [효종] — [영릉] — [寧] — [릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [이] — [후대] — [맞배지붕 • 맞배 • 지붕] — [교체] — [현재] — [숭릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [남] — [있]
my tokens	[산릉도감 • 산릉 • 도감] — [궤] — [등] — [문헌] — [의하] — [세종] — [영릉] — [영릉] — [명종] — [강릉] — [강릉] — [인조] — [장릉] — [장릉] — [효종] — [영릉] — [영릉](hanja should be correctly detected) — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — (removed [이]) — [후대] — [모두] (was missing) — [맞배지붕 • 맞배 • 지붕] — [교체] — [현재] — [숭릉] — [정자각 • 정자 • 각] — [팔작지붕 • 팔작 • 지붕] — [남] — [있]

input	1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.
tokens	[1934] — [년] — [파울] — [폰] — [힌덴부르크] — [대통령] — [사망] — [후] — [히틀러] — [수상] — [대통령] — [직] — [겸무] — [국방] — [국] — [최고] — [지휘] — [손] — [넣] — [되]
my tokens	[1934] — [년] — [파울] — [폰] — [힌덴부르크] — [대통령] — [사망] — [후] — [히틀러] — [수상] — [대통령] — [직] — [겸무] — [국방] — [국] — [최고] — [지휘권 • 지휘 • 권] (compound word) — [손] — [넣] — [되]

input	부산지방법원와 서울형사지방법원 등에서 부장판사를 하다가 부산지방법원, 제주지방법원, 춘천지방법원, 광주고등법원에서 법원장을 역임하였으며 이후 공직에서 물러나 변호사 활동을 했다.
tokens	[부산] — [지방] — [법원] — [서울] — [형사] — [지방] — [법원] — [등] — [부장] — [판사] — [하] — [부산] — [지방] — [법원] — [제주] — [지방] — [법원] — [춘천] — [지방] — [법원] — [광주] — [고등] — [법원] — [법원장 • 법원 • 장] — [역임] — [이후] — [공직] — [물러나 • 물러나] — [변호사 • 변호 • 사] — [활동] — [했 • 하]
my tokens	[부산] — [지방] — [법원] — [서울] — [형사] — [지방] — [법원] — [등] — [부장] — [판사] — [하] — [부산] — [지방] — [법원] — [제주] — [지방] — [법원] — [춘천] — [지방] — [법원] — [광주] — [고등] — [법원] — [법원장 • 법원 • 장] — [역임] — [이후] — [공직] — [물러나] (removed duplicate) — [변호사 • 변호 • 사] — [활동] — [했 • 하]

Reply Edited by TJones (WMF) 16:08, 4 October 2018 6 years ago

Revi C. (talkcontribs)

Input: 모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원

Mine: 모든/모델/MMX/SSE/SSE2/SSE3/SSSE3/SSE4.1/SSE4.2/향상/인텔/스피드/스텝/기술/EIST/EM64T/Extended/Memory/64/Technology/XD/비트/가상화/기술/스마트/캐시/인텔/터보/부스트/지원

Diff: I have 모든.

Input: 2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.

Mine: 2006/년/중화/인민/공화국/단백질/함량/속여서(속이)/미국/수출/가축/사료/원료/밀/글루텐/조단백/함량/높은/사료/원료/단백질/양/과장/부풀리는(부풀리)/이용

Diff: I did not split 조단백.

Input: 『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.

Mine: 산릉/도감/의궤/문헌/의하/세종/영릉/英陵(translates to 영릉)/명종/강릉/康陵(translates to 강릉)/인조/장릉/長陵(translates to 장릉)/효종/영릉/寧陵(translates to 영릉)/정자각/팔작지붕(can be split to 팔작/지붕)/이/후대/모두/맞배지붕(can be split to 맞배/지붕)/교체/현재/숭릉/정자각/팔작지붕/남아(남).

Diff: 의궤 (ko:의궤) is its own word. Should not omit 의 here.

Input: 1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.

Mine: 1934/년/파울/폰/힌덴부르크/대통령/사망/후/히틀러/수상/대통령/직/겸무/국방/국/최고/지휘/권/손/넣/되

Diff: 권 means right. Should not be omitted.

Otherwise LGTM.

Reply 07:06, 9 October 2018 6 years ago

Revi C. (talkcontribs)

Seems most of my stuff is also covered below, but 조단백 and 의궤 still stand. I don't know how 조단백 was coined (I am not good at biology or chemistry), but IMO it's obviously not 조단/백. Maybe 조/단백?

Reply Edited 07:13, 9 October 2018 6 years ago

TJones (WMF) (talkcontribs)

Re: 조단백—it looks like 백 was interpreted as a number (Wiktionary says 100) and 조단 was just kind of left over as a "general noun". Is it a rare or very technical term? It gets only 7 hits on Korean Wikipedia at the moment. It's not surprising if some rare scientific terms are processed oddly. Fortunately, splitting it up incorrectly won't keep it from being found (it may just increase irrelevant results—but scoring should bring the good ones, including exact matches, to the top).

Re: 의궤—yeah, that's an error. It's reading 의 as an "ending particle" which then gets filtered, and 궤 as a "general noun". (I'm starting to think "general noun" means "some leftover characters".) There's something about the phrase "산릉도감의궤" that is causing it, because 의궤 by itself comes out as one word.

Reply 20:54, 9 October 2018 6 years ago

Revi C. (talkcontribs)

I'm not a biology expert, but it does sound like a technical term. 단백 is the protein, so I guess 조 is either something to be omitted or a separate word.

Reply Edited 03:09, 10 October 2018 6 years ago

Garam (talkcontribs)

The word "조단백" originated from 조(crude) + 단백(protein). For this, see https://opendict.korean.go.kr/dictionary/view?sense_no=1316777&viewType=confirm. Thanks.

Reply Edited 17:26, 16 October 2018 6 years ago

TJones (WMF) (talkcontribs)

Thanks a lot Baha!

I forgot to mention that some words or endings may be intentionally missing from the tokenization. Nori also removes words/characters/jamo that it determines are in the categories verbal endings, interjections, ending particles, general adverbs, conjunctive adverbs, determiners, prefixes, adjective suffixes, noun suffixes, verb suffixes, and various kinds of punctuation.
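A minimal sketch of what this part-of-speech filtering amounts to, assuming the tokenizer has already produced (surface, tag) pairs. The tag codes are Nori's; the tagged tokens are hand-written from cases in this thread rather than captured analyzer output, the punctuation tags are left out for brevity, and `filter_tokens` is a hypothetical helper, not part of Nori.

```python
# Illustrative only: mimics dropping tokens whose part-of-speech tag is
# in a stop list. Tag codes follow Nori's tag set; the tagged tokens are
# hand-written from this discussion, not real analyzer output.

STOP_TAGS = {
    "E",    # verbal endings
    "IC",   # interjections
    "J",    # ending particles
    "MAG",  # general adverbs
    "MAJ",  # conjunctive adverbs
    "MM",   # determiners
    "XPN",  # prefixes
    "XSA",  # adjective suffixes
    "XSN",  # noun suffixes
    "XSV",  # verb suffixes
}


def filter_tokens(tagged_tokens):
    """Keep the surface forms whose tag is not in STOP_TAGS."""
    return [surface for surface, tag in tagged_tokens if tag not in STOP_TAGS]


# Hypothetical taggings that match the descriptions in this thread.
EXAMPLES = {
    "모든 모델": [("모든", "MM"), ("모델", "NNG")],  # determiner + general noun
    "가상화": [("가상", "NNG"), ("화", "XSN")],      # noun + noun suffix
    "지휘권": [("지휘", "NNG"), ("권", "XSN")],      # noun + noun suffix
    "함께": [("함께", "MAG")],                       # general adverb
}

if __name__ == "__main__":
    for text, tagged in EXAMPLES.items():
        print(text, "->", filter_tokens(tagged))
```

Under these assumptions 가상화 comes out as just 가상 and 지휘권 as just 지휘, which is the behaviour reported as "missing" in the review.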

I can re-do the tokenization without the part-of-speech filtering, if you think that would help.

For now, I'll just look into the specific ones that you mentioned are missing.

김대중 대통령은 2003년까지 학급당 학생수를 35명 이하로 감축한다는내용의 '7.20 교육여건 개선계획' 을 발표했다.
- I'm not terribly surprised it didn't split the name 김대중 correctly, though if it was going to know about any Korean surname, it seems like it would know 김. It did recognize it as a proper noun, though. Are there any other names that are split up like you propose?

모든 모델은 MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, 향상된 인텔 스피드스텝 기술(EIST), EM64T(Extended Memory 64 Technology), XD 비트, 가상화 기술, 스마트 캐시, 인텔 터보 부스트 지원.
- 모든 is filtered as a determiner; based on the English Wiktionary entry, that seems reasonable.
- 가상화: it is pulling off 화 as a noun suffix.

다 자라면 몸길이는 61 cm, 몸무게는 1.4~2.7 kg 정도가 된다.
- 다 is filtered as a general adverb.

7월 14일에는 태항산에 있던 조선청년연합회 소속 병사들이 하북성에 도착하자, 당일 하북성 섭현에서 김두봉, 박효삼 등과 함께 조선의용군을 발족시키고 총사령관에 취임했다.
- 섭현 is split as two "general nouns", so that's an error.
- 박효삼 is split with 박 as a proper noun, 효 as a general noun, and 삼 as a numeral, which Wiktionary agrees with. Recognizing ambiguous names is hard, but this is an error. It shouldn't prevent search matches, though it will allow potential false matches.
- 함께 is filtered as a "general adverb".
- 시키 is filtered as a verb suffix.

2006년 중화인민공화국에서는 단백질의 함량을 속여서, 미국으로 수출할 가축 사료의 원료인 밀글루텐 등 조단백 함량이 높은 사료 원료의 단백질양을 과장하여 부풀리는 데 이용하였다.
- 속여서 seems to be treated as a compound and is actually tokenized as [속여서 • 속이 • 어서], but 어서 is dropped. As long as 속이 is the correct stemmed form and is present, it's okay. Though I've noticed this happening elsewhere, and I think it may be a bug. If it was just [속이 • 어서], then 어서 would be dropped as a verbal ending and we'd get the desired result.
- 인 and 이 are tagged as "positive designators"; we could filter those if this comes up a lot.
- 부풀리, looks like a stemming error, as it is just tagged as a verb.

『산릉도감의궤』 등 문헌에 의하면 세종 영릉(英陵), 명종 강릉(康陵), 인조 장릉(長陵), 효종 영릉(寧陵)의 정자각이 팔작지붕이었으나, 후대에 모두 맞배지붕으로 교체되어 현재는 숭릉의 정자각만 팔작지붕으로 남아 있다.
- 寧陵/영릉 — looks like it detected 陵 as Hanja, but not both characters together. Weird.
- 팔작지붕/이 — another "positive designator".
- 모두 — another victim of the adverb filter.

1934년 파울 폰 힌덴부르크 대통령이 사망한 후 히틀러는 수상과 대통령직을 겸무해서 국방국 최고 지휘권을 손에 넣게 되었다.
- 지휘권 — looks like 권 was parsed as a noun suffix, and then filtered.

부산지방법원와 서울형사지방법원 등에서 부장판사를 하다가 부산지방법원, 제주지방법원, 춘천지방법원, 광주고등법원에서 법원장을 역임하였으며 이후 공직에서 물러나 변호사 활동을 했다.
- 물러나 somehow gets parsed as 물러나:"물러나/Verb+아/Verbal endings" • 물러나:Verb • 아:Verbal ending (the verbal ending gets dropped). It's weird, but okay in terms of search that it gets duplicated.

Thanks again for all the detail!

Sounds like I might need to ask upstream about verbs getting treated as compounds if that is a more widespread problem, and we might want to consider filtering the "positive designator" part of speech, but I'd have to look at other instances to make sure they are mostly as useless as these.

Does filtering out the adverbs make sense, by the way?

Reply 20:58, 4 October 2018 6 years ago

Bmansurov (WMF) (talkcontribs)

> I can re-do the tokenization without the part-of-speech filtering, if you think that would help.

Given your explanation above, I don't think we should re-do the tokenization.

> I'm not terribly surprised it didn't split the name 김대중 correctly, though if it was going to know about any Korean surname, it seems like it would know 김. It did recognize it as a proper noun, though. Are there any other names that are split up like you propose?

I think, in general, Korean names are written like 김대중, but sometimes a person's title may follow the last name. For example, 김 대통령 (President Kim). Sometimes the first name appears by itself (usually in colloquial speech). That's why any name may be split like above, in my view.

> 시키 is filtered as a verb suffix

My bad, you're correct.

> 부풀리, looks like a stemming error, as it is just tagged as a verb.

I may have made a mistake here. I thought we should take the stem from 부풀다 (become swollen) and not from 부풀리다 (make swollen).

> Does filtering out the adverbs make sense, by the way?

Yes, it does.

Reply Edited 21:45, 4 October 2018 6 years ago

TJones (WMF) (talkcontribs)

Okay, so everything is looking pretty good! A tolerable number of minor mistakes, and no absurd mistakes, so far.

I wish I had a better answer on the names and titles. I'll keep an eye out for problems related to that.

Reply 22:04, 4 October 2018 6 years ago

Stemming

2 comments • 21:36, 9 October 2018 6 years ago

Revi C. (talkcontribs)

가르다

original: 가르다: [가르다] [가르다호]
Have no idea where 가르다호 is coming from.

갈라서

Fine

귄

Original: 귄: [귄] [르귄]
Have no idea where 르귄 is coming from.

끌어당기

LGTM

다스리

LGTM

달리

LGTM

덤벼들

LGTM

독하

LGTM

뒤흔들

LGTM

들뜨

LGTM

링

Original: 링: [링] [바이링]
Have no idea where 바이링 is coming from.

매달리

LGTM

매사추세츠

LGTM

멋지

LGTM

몸부림치

LGTM

무덥

LGTM

무르만스크

LGTM

바덴뷔르템베르크

LGTM

부러뜨리

LGTM

불러일으키

LGTM

빙

Original: 빙: [리빙] [빙]
리빙 is a direct transliteration of "Living". Unrelated.

빠뜨리

LGTM

빠져나오

LGTM

사라

Original: 사라: [사라] [사라코너]
사라코너 is a name: 사라 코너. Unrelated.

사우스다코타

LGTM

사우스캐롤라이나

LGTM

슐레스비히홀슈타인

LGTM

리아디

Original: 리아디: [리아디] [아디]
Looks unrelated.

아키타

LGTM

애쓰

LGTM

야단치

LGTM

열리

LGTM

오래되

LGTM

우르

Original: 우르: [우러] [우르]
If it meant to say 울어, it should've been 울으/울어.

웨스턴오스트레일리아

LGTM

위안장

LGTM

유프라테스

LGTM

잘츠부르크

LGTM

잠기

LGTM

지내

LGTM

쫓기

LGTM

추하

LGTM

테네시

LGTM

펜

Original: 펜: [비제이펜] [펜]
Does not look related.

후려치

LGTM

후쿠시마

LGTM

휴

Original: 휴: [손휴] [휴]
손휴 looks like a name, unrelated.

Reply 07:38, 9 October 2018 6 years ago

TJones (WMF) (talkcontribs)

Sorry, the stemming list includes some compounds, which are divided into parts and will be searchable by any of the parts, though exact matches are best. So, the compound cases are fine, assuming the tokenization (breaking into words) is reasonable. Because there's a parser involved, context can change the way characters are treated, which adds to the complexity.

르귄 / 르 / 귄—a compound, with 르 and 귄 tagged as proper nouns.
빙 / 리빙—in isolation, 리빙 comes out as a single token. There are three instances of 리빙 in my Wikipedia corpus, and two of them are treated correctly. However, in "태양의 아이들 (2011, 웅진리빙하우스) ISBN 9788901136059", it gets indexed as a compound. Probably still a parsing error.
사라 / 사라코너—yep, I see it. But for some reason the name 사라코너 is also being treated as a compound [사라코너 • 사라 • 코너].
리아디 / 아디—again, 리아디 is treated as a compound, and the part 아디 is indexed under the whole.
우러 is interpreted as 우르/VV(Verb)+어/E(Verbal endings), so it gets grouped with other instances of 우르.
비제이펜—interpreted as a compound, all proper nouns: "비/NNP(Proper Noun)+제이/NNP(Proper Noun)+펜/NNP(Proper Noun)", and so grouped under each of the parts.
손휴—again, proper nouns.
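To make the grouping concrete, here is a rough sketch of how surface forms end up listed together once compounds are indexed under their parts. The `EMITTED` table is hand-copied from the cases above and simplified (for instance, 우러 is shown as emitting only the stem 우르), and `group_by_token` is a hypothetical helper, not the actual report script.

```python
from collections import defaultdict

# Surface form -> tokens it is indexed under. Hand-copied from the
# examples in this thread and simplified; not real analyzer output.
EMITTED = {
    "귄": ["귄"],
    "르귄": ["르귄", "르", "귄"],              # compound of two proper nouns
    "사라": ["사라"],
    "사라코너": ["사라코너", "사라", "코너"],  # name treated as a compound
    "펜": ["펜"],
    "비제이펜": ["비제이펜", "비", "제이", "펜"],
    "우르": ["우르"],
    "우러": ["우르"],  # parsed as 우르 (verb) + 어 (ending); the ending is filtered
}


def group_by_token(emitted):
    """Invert the table: token -> sorted list of surface forms that emit it."""
    groups = defaultdict(set)
    for surface, tokens in emitted.items():
        for token in tokens:
            groups[token].add(surface)
    return {token: sorted(surfaces) for token, surfaces in groups.items()}


if __name__ == "__main__":
    groups = group_by_token(EMITTED)
    for token in ("귄", "사라", "펜", "우르"):
        print(token, groups[token])
```

Every group with more than one member here comes from either a compound part or a shared stem, which is why the unrelated-looking pairs show up in the stemming list.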

This brings up the possibility that we should not index compounds by their parts. The default setting throws away the original compound and only keeps the parts. I thought keeping the original would increase precision when you know exactly what you are looking for. Not keeping the parts would get rid of some of these errors, but also make it harder to match when you have part of a compound. For example, right now, many-part compounds can match a shorter compound that is part of it. So a four-part compound, ABCD, can match the three-part compound, ABC, because A, B, and C are all indexed separately.
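The trade-off can be sketched with the abstract ABCD / ABC example. The mode names below (`none`, `discard`, `mixed`) are the values of the Nori tokenizer's `decompound_mode` setting; `index_terms` and `shared_terms` are hypothetical helpers, and the sketch only looks at which terms overlap, not at how the search engine actually combines or scores them.

```python
def index_terms(parts, mode="mixed"):
    """Return the set of terms indexed for a compound made of `parts`.

    none    -> only the whole compound
    discard -> only the parts (the default described above)
    mixed   -> the whole compound plus its parts
    """
    whole = "".join(parts)
    if mode == "none" or len(parts) == 1:
        return {whole}
    if mode == "discard":
        return set(parts)
    if mode == "mixed":
        return {whole, *parts}
    raise ValueError(f"unknown decompound mode: {mode}")


def shared_terms(query_parts, doc_parts, mode):
    """Terms that the query compound and the document compound have in common."""
    return index_terms(query_parts, mode) & index_terms(doc_parts, mode)


if __name__ == "__main__":
    abc = ["A", "B", "C"]
    abcd = ["A", "B", "C", "D"]
    for mode in ("none", "discard", "mixed"):
        print(mode, "ABC vs ABCD:", sorted(shared_terms(abc, abcd, mode)))
    print("mixed ABC vs ABC:", sorted(shared_terms(abc, abc, "mixed")))
```

With parts indexed, ABC and ABCD overlap on A, B, and C; with no decompounding they share nothing. In `mixed` mode an exact ABC document additionally shares the whole-compound term, which is the extra precision signal mentioned above.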

Based on the general review of Tokenization and Compounds, though, I think we are okay, with more correct tokenizations than errors.

Thanks again, revi, for all the help! Any more comments on anything would be welcome!

Reply 21:36, 9 October 2018 6 years ago

Large groups

3 comments • 16:41, 9 October 2018 6 years ago

Revi C. (talkcontribs)

I didn't look at the whole group. I looked at your notes.

(snip) "智", are listed in Wiktionary as just 지/ji, while others, like "知", have multiple Hangeul versions (in this case, 알/al or 지/ji), and it looks like Nori picked this one. In several other cases, especially where the token ends with -진, the part of speech tagger is marking 지 as an auxiliary verb, which is maybe another category of parts of speech we should filter.

知 just has 지/ji, 알/al is the 'meaning' part. enwiktionary is wrong there then.

-진 is -지+ㄴ.

-이- is often used as a suffix, so there are lots of examples of it being used as a suffix there.

Reply 07:52, 9 October 2018 6 years ago

TJones (WMF) (talkcontribs)

> 知 just has 지/ji, 알/al is the 'meaning' part. enwiktionary is wrong there then.

No, it was just me. I didn't understand the concept of "eumhun" so I interpreted "(eumhun 알 지 (al ji))" incorrectly, and it's hard to find documentation on Wiktionary (there are no links), and it is confusing if you aren't familiar with it.

I'll update my notes.

Reply 16:24, 9 October 2018 6 years ago

TJones (WMF) (talkcontribs)

For -이-, I think we need to filter the "positive designator" parse. It shows up a lot, doesn't seem to carry a lot of valuable meaning, and links a lot of otherwise unlinked tokens.

Reply 16:41, 9 October 2018 6 years ago

Hanja to Hangul

2 comments • 16:16, 9 October 2018 6 years ago

Revi C. (talkcontribs)

IMO hanja sucks except for differentiation, and Koreans these days use less and less hanja in daily language, but it's maybe worth doing. It all LGTM.

Reply 07:27, 9 October 2018 6 years ago

TJones (WMF) (talkcontribs)

I'm going to reply here first since this is the easiest one! The Hanja-to-Hangeul conversion is turned on by default, so I left it in. Also, because of the way our search is configured, exact matches still get a boost, so if you search for Hanja, you are more likely to get Hanja, and if you search for Hangeul you are more likely to get Hangeul, all other things being equal. But for rarer terms, the Hanja-to-Hangeul match could be the only thing that matches, which is good.
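As a toy illustration of that interaction, assuming a tiny hand-written reading table (the four tomb names from the review sentence) and arbitrary weights. `READINGS`, `normalize`, `score`, and the boost value are all hypothetical; this is not how the real scoring is computed, only the shape of the idea.

```python
# Hanja -> Hangeul readings, taken from the 릉 examples in this review.
# A real converter covers far more than this hand-written table.
READINGS = {
    "英陵": "영릉",
    "康陵": "강릉",
    "長陵": "장릉",
    "寧陵": "영릉",
}


def normalize(term):
    """Map a Hanja term to its Hangeul reading; leave other terms unchanged."""
    return READINGS.get(term, term)


def score(query, doc_term, exact_boost=2.0):
    """Toy score: 1.0 for a match after normalization, plus a boost if exact."""
    total = 0.0
    if normalize(query) == normalize(doc_term):
        total += 1.0
    if query == doc_term:
        total += exact_boost
    return total


if __name__ == "__main__":
    for query, doc_term in [("英陵", "英陵"), ("英陵", "영릉"),
                            ("英陵", "寧陵"), ("英陵", "강릉")]:
        print(query, doc_term, score(query, doc_term))
```

Note that 英陵 and 寧陵 share the reading 영릉, so they match each other after conversion; only the exact-match boost keeps them apart in the ranking.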

Reply 16:16, 9 October 2018 6 years ago

Speaker Review -> Tokenization and Compounds -> 7 sentences with many-part compounds

3 comments • 07:14, 9 October 2018 6 years ago

Bmansurov (WMF) (talkcontribs)

input	양재역 - 양재시민의숲역 - 양재 나들목 (제부여객으로 이관)
tokens	[양재역 • 양재 • 역] — [양재시민의숲역 • 양재 • 시민 • 숲 • 역] — [양재] — [나들목 • 나들 • 목] — [제부여객 • 제부 • 여객] — [이관]
my tokens	same as above

input	제41권 《비틀스를 위기에서 건진 노란 잠수함》
tokens	[41] — [권] — [비틀스] — [위기] — [건진 • 것 • 이 • 지] — [노란 • 노랗] — [잠수함 • 잠수 • 함]
my tokens	[41] — [권] — [비틀스] — [위기] — [건지] (건지다 - pull up, 건진 - pulled up) — [노란 • 노랗] — [잠수함 • 잠수 • 함]

input	17번트랙 <좋은날> 브라운아이드걸스 버전을 편곡
tokens	[17] — [번] — [트랙] — [좋] — [날] — [브라운아이드걸스 • 브라운 • 아이드 • 걸스] — [버전] — [편곡]
my tokens	same as above

input	사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
tokens	[사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] — [전해져왔 • 전하 • 지 • 오]
my tokens	[사탕수수 • 사탕 • 수수] — [열] — [대] — [남아시아 • 남 • 아시아] — [동남아시아 • 동남 • 아시아] (not sure if dividing 동남 further is a good idea. 동 - east, 남 - south, but 동남 - southeast) — [전해져왔 • 전하] (removed 지 • 오, as they are not informative: 전하다 + 아지다 + 오다 + 았다 => 전해져왔다)

input	미국군이 처음으로 라인강을 도하한다
tokens	[미국] — [군] — [처음] — [라인강 • 라인 • 강] — [도하]
my tokens	same as above

input	전 구간 야마구치현에 소재.
tokens	[구간] — [야마구치현 • 야마구치 • 현] — [소재]
my tokens	same as above

input	당신이 뭔데 여기서 큰소리를 치는거야.
tokens	[당신] — [뭔데 • 뭐 • 이] — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] — [야 • 이]
my tokens	[당신] — [뭔데 • 뭐] (removed 이: 뭐 + 인 + 데 => 뭔데) — [여기] — [큰소리 • 큰 • 소리] — [치] — [거 • 것] (removed verb ending)

Reply 15:00, 5 October 2018 6 years ago

TJones (WMF) (talkcontribs)

Thanks again, Baha!

제41권 《비틀스를 위기에서 건진 노란 잠수함》
- 건진 / 건지 — this was interpreted as "것/NNB(Dependent noun)+이/VCP(Positive designator)+ᆫ/E(Verbal endings)+지/NNB(Dependent noun)+ᆫ/J(Ending Particle)" — the verbal ending and particle were filtered. This looks like a likely error, since it doesn't seem like a verbal ending should go on a noun. Looks like the parser made a mistake.

사탕수수는 원래 열대 남아시아와 동남아시아에서 전해져왔다.
- 동남 — I agree that splitting them doesn't seem necessary for directions, but at least it makes sense why it happened.
- 전해져왔 / 지 / 오 — The parser seems to agree with you, because these are both marked as "Auxiliary Verb or Adjective", which sounds eminently ignorable.

당신이 뭔데 여기서 큰소리를 치는거야.
- 뭔데/이 — another "Positive designator", looking like it should be filtered.
- 야 — oddly, to me, this gets split into "이/VCP(Positive designator)+야/E(Verbal endings)", where the whole thing, 야, is also an ending. So it originally would have been [[야 • 이] • 야], where the first "야" is a compound, and the second "야" is a verb ending (which did get filtered). Weird. Another vote for "positive designator" to get filtered.

Again, nothing that seems horrible, given the overall complexity of the task. More votes for filtering "positive designator" and "auxiliary verb or adjective", and possibly looking into filtering "negative designator", too.
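A sketch of what that extra filtering might look like as Elasticsearch analysis settings, written as a Python dict. The filter and tokenizer types (`nori_part_of_speech`, `nori_tokenizer`, `nori_readingform`) and the `stoptags` and `decompound_mode` parameters come from the Nori plugin; the default tag list is reproduced from memory of the plugin documentation and should be checked against the installed version. The names `nori_mixed`, `nori_pos_extended`, and `korean_text` are made up for this example.

```python
import json

# Default stop tags of nori_part_of_speech, as recalled from the plugin
# documentation (an assumption; verify against your installed version).
DEFAULT_STOPTAGS = [
    "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV",
]

# Extra tags proposed in this thread.
EXTRA_STOPTAGS = [
    "VCP",  # positive designator
    "VX",   # auxiliary verb or adjective
    "VCN",  # negative designator (only "possibly", per the discussion)
]


def build_settings(extra_stoptags=EXTRA_STOPTAGS):
    """Build hypothetical index settings with an extended stop-tag list."""
    return {
        "analysis": {
            "tokenizer": {
                "nori_mixed": {
                    "type": "nori_tokenizer",
                    "decompound_mode": "mixed",  # keep compounds and their parts
                }
            },
            "filter": {
                "nori_pos_extended": {
                    "type": "nori_part_of_speech",
                    "stoptags": DEFAULT_STOPTAGS + list(extra_stoptags),
                }
            },
            "analyzer": {
                "korean_text": {
                    "type": "custom",
                    "tokenizer": "nori_mixed",
                    "filter": ["nori_pos_extended", "nori_readingform", "lowercase"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_settings(), indent=2))
```

Passing a shorter `extra_stoptags` list (for example, only `["VCP"]`) would allow the effect of each added tag to be compared separately before committing to all three.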

Generally, though, I'm hopeful this will work out with only minor tweaks.

Reply 18:34, 5 October 2018 6 years ago

Revi C. (talkcontribs)

Baha's one LGTM.

Reply 07:14, 9 October 2018 6 years ago
