User:TJones (WMF)/Notes/Overview of Phonetic Algorithm Performance

January 2018 — See TJones_(WMF)/Notes for other projects. See also T182708. For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary.

Intro

Phonetic algorithms allow you to index words based (sometimes only very loosely) on their pronunciation. This is a difficult task in English, in part because our spelling system is horrendous. Phonetic algorithms range from super simple to relatively complex, and try to balance recall and precision in different ways. For genealogical research, for example, a coarse-grained algorithm that casts a wide net would be useful because it would still exclude more than 99% of names as possible matches, and might allow for variances in the way English speakers say or spell names they aren't familiar with. (This is why Comedian Louis C.K. goes by "C.K."—English speakers have an easier time with "C.K." than "Székely".) More precise, finer-grained algorithms have been used for spellcheckers.

It makes sense to use phonetic searching to help searchers find things they don't know how to spell. An obvious use case is searching for something you've heard aloud but never seen written and which has a non-obvious spelling.

Because even the most targeted phonetic search is going to be pretty fuzzy, it makes sense to limit the scope of what can be matched. An obvious first restriction would be to limit matches to titles. Otherwise, there would likely be too many possible matches. Another likely sensible restriction would be to require a specific keyword to enable phonetic searching, such as "soundslike:" or something similar.

Data

For my data, I pulled 50,000 article/entry titles each from English Wikipedia and English Wiktionary.

Many of the Wiktionary titles (and some of the Wikipedia titles) are not in English, or even in the Latin alphabet. How the algorithms handle non-Latin characters and non-English accents is good to know, too.

Elasticsearch

Elasticsearch offers a number of phonetic matching algorithms, including Soundex, Metaphone and Double Metaphone, and a few others that have more-specific targets, such as Caverphone for New Zealand names, Beider-Morse for German and Yiddish names, and Kölner Phonetik for general German words.

Looking at the code (Elasticsearch / Apache—yay open source!) I discovered additional algorithms—Refined Soundex and NYSIIS—and additional options, including preserving the original token and setting the phonetic code length for Double Metaphone.

I'm going to look at algorithms designed for English (words and names), including Soundex, Refined Soundex, NYSIIS, Metaphone, and Double Metaphone, including different code lengths for Double Metaphone.

For my initial pass, I'm not enabling the option to keep the original token and I'm not enabling any additional normalization (like lowercasing). The phonetic algorithms aren't case sensitive. They also generate uppercased tokens, so there's no chance of collision with lowercased real words. Not generating an extra "original" token also makes token counts more comparable to the baseline analysis chain.

Also, I'm not doing any extra normalization to start, which keeps other groupings (e.g., baseball, Baseball, BASEBALL) from showing up in the first round of reports. (Edit from the future: it looks like diacritics and other non-ASCII Latin characters are treated inappropriately, so folding is definitely necessary for real-world use on Wikipedia and Wiktionary data.)
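
For reference, the encoders behind Elasticsearch's phonetic token filter come from Apache Commons Codec, so the same settings can be poked at directly in Java. A minimal sketch of instantiating the encoders tested here with their defaults (the class names and defaults are Commons Codec's, nothing here is specific to our setup):

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.commons.codec.language.Metaphone;
    import org.apache.commons.codec.language.Nysiis;
    import org.apache.commons.codec.language.RefinedSoundex;
    import org.apache.commons.codec.language.Soundex;

    public class EncoderSetup {
        public static void main(String[] args) {
            Soundex soundex = new Soundex();               // fixed 4-character codes
            RefinedSoundex refined = new RefinedSoundex(); // unlimited code length
            Metaphone metaphone = new Metaphone();         // default max code length of 4
            DoubleMetaphone dblMetaphone = new DoubleMetaphone();
            dblMetaphone.setMaxCodeLen(4);                 // configurable; the plugin's max_code_len setting
            Nysiis nysiis = new Nysiis();                  // strict mode: 6-character codes

            // The encoders ignore the case of their input and produce uppercased codes,
            // so no lowercasing is needed just to get consistent output.
            String word = "Baseball";
            System.out.println(soundex.encode(word));      // B214
            System.out.println(metaphone.encode(word));    // BSBL
        }
    }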

On Code Length

The longest word in my Wikipedia corpus is Galactofuranosylgalactofuranosylrhamnosyl, which is part of the name of an enzyme. Because of the limit of only 4 characters in the Soundex code, it is a Soundex match for galactic, Galactica, and Gloster as well as the more appropriate galactofuranosyltransferase.

It's hard to know where to draw the line. Galactica isn't a terrible match, since I doubt I could remember more than galactofurowhatsis without studying it intently. The Double Metaphone algorithm also defaults to a length of 4, but it is configurable. Some algorithms have longer (or unlimited) code lengths.

We'll have to try to figure out what makes the most sense. Humans are generally good at remembering how something begins—and sometimes how it ends—better than the middle bits. The code length is another element of the recall/precision balance.
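
To make the code-length trade-off concrete, here's a minimal sketch using the Commons Codec encoders; the Soundex codes in the comments follow from its fixed 4-character limit, and the length-10 Double Metaphone encoding is the one that shows up in the length-10 run later in these notes:

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.commons.codec.language.Soundex;

    public class CodeLengthDemo {
        public static void main(String[] args) {
            Soundex soundex = new Soundex();
            // Soundex codes are always 4 characters, so these all collide on G423.
            System.out.println(soundex.encode("Galactofuranosylgalactofuranosylrhamnosyl"));
            System.out.println(soundex.encode("galactic"));
            System.out.println(soundex.encode("Gloster"));

            // Double Metaphone's code length is configurable, so a longer code can
            // keep more distinguishing material from later in the word.
            DoubleMetaphone dm = new DoubleMetaphone();
            dm.setMaxCodeLen(10);
            // KLKTFRNSLK, which no longer collides with galactic's much shorter code.
            System.out.println(dm.doubleMetaphone("Galactofuranosylgalactofuranosylrhamnosyl"));
            System.out.println(dm.doubleMetaphone("galactic"));
        }
    }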

Baseline / Status Quo

For reference, I ran my title collections through the current English analysis chain.

Wikipedia:

  • Total new tokens: 137,421
  • pre-analysis types: 52,945
  • post-analysis types: 47,695

Wiktionary:

  • Total new tokens: 55,721
  • pre-analysis types: 52,666
  • post-analysis types: 51,904

Note: Tokens are instances of a word, types are unique words. Pre-analysis is before stemming and other analysis chain modifications, post-analysis is after. So, the string dog dog dogs has three tokens, because there are three words. Pre-analysis, there are two types, dog and dogs. Post-analysis, there is only one type, dog, since dogs is stemmed to dog.

Wiktionary titles have a much higher type-to-token ratio because so many Wiktionary titles are unique individual words. Wikipedia titles are often more than one word (2.7 words on average, it looks like), but a lot of those words get re-used between titles (no surprise when the shows up in many different Wikipedia titles, for example).

Soundex

Soundex is one of the earliest phonetic algorithms (patented in 1918), and it is very coarse-grained—it aggressively collapses related sounds, and generates a code that is only 4 characters long, no matter how long the original word is. It is largely intended for names, and originally not for use with computers!
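
A minimal sketch of the Commons Codec Soundex encoder, reproducing a couple of the groupings discussed below (the codes in the comments are just standard Soundex codes, not anything from the title data):

    import org.apache.commons.codec.language.Soundex;

    public class SoundexExamples {
        public static void main(String[] args) {
            Soundex soundex = new Soundex();

            // Non-initial vowels, h, w, and y are dropped, so these two very
            // different names share the code P523 (P-N-K-T).
            System.out.println(soundex.encode("Punkt"));     // P523
            System.out.println(soundex.encode("Panagiota")); // P523

            // The first letter is copied unchanged, so these plausible spelling
            // variants of the same name can never match.
            System.out.println(soundex.encode("Khloe"));     // K400
            System.out.println(soundex.encode("Chloe"));     // C400
        }
    }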

Wikipedia:

  • Total new tokens: 143,985
  • pre-analysis types: 54,001
  • post-analysis types: 8,357 (5.8-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 54,877
  • pre-analysis types: 52,848
  • post-analysis types: 19,560 (2.7-to-1 vs baseline)

The larger number of Wikipedia tokens reflects different tokenization breaks. The massively smaller number of post-analysis types shows the level of conflation (i.e., new collisions) the phonetic algorithm causes. Wiktionary is not impacted as much because of the higher number of unique/interesting words and non-Latin words.

Post-analysis type ratios are compared to pre-analysis type counts.

Random Observations:

  • With a code length of four, words match primarily on the beginning few letters. This might make for too much recall.
  • The first letter in the word is preserved, which can make for too little recall. Khloe and Chloe could not match.
  • Numbers are ignored, so 60A and A1010 are matches.
  • Codes are numeric. This isn't as much of a problem because the first character of the code is a letter.
  • Non-initial vowels, h, w, and y are ignored, which tends to hide how many syllables there are. Both Punkt and Panagiota have the same code (they are both reduced to P-N-G/K-T).
  • Soundex punts on strings with unknown characters, so Sławica is analyzed as Sławica. (That is, it doesn't generate a code—it just returns the whole original word, unchanged.) Folding could help.
  • The longest token in Wiktionary is 180 characters. It's actually a Gothic word that is only 15 letters long (𐌰𐌽𐌳𐍃𐍄𐌰𐌽𐌳𐌰𐌽𐌳𐌰𐌽𐌳𐍃), but each letter is two Unicode characters, each of which is rendered in Java \u format, which is six ASCII characters per Unicode character. For example, the initial 𐌰 is rendered as "\uD800\uDF30" in the token.

The largest group in the Wikipedia data has 166 words in it:

  • R200: [1 R1200GS][1 RC][1 RCA][1 RG][1 RJ103][1 RK][1 RKA][1 ROC][1 ROKS][1 ROSE][1 RQ25][2 RRS][1 RSS][1 Racah][18 Race][1 Raceway][1 Raesch][1 Rag][2 Rage][1 Raghu][1 Rags][1 Rahasia][1 Rahasya][1 Raise][1 Raizo][6 Raj][6 Raja][2 Rajah][1 Raji][1 Rajshahi][3 Raju][1 Rake][1 Raku][1 Rashkeh][1 Rauch][1 Rayess][1 Rays][1 Rayski][1 Raysse][1 Raz][2 Raza][1 Razu][2 Reach][1 Recchia][1 Recz][3 Rees][1 Reese][1 Reggae][1 Reggie][1 Rehs][1 Reig][1 Reijo][1 Rejayi][1 Reka][1 Rekhy][1 Res][1 Resco][9 Rescue][2 Reuss][8 Rex][1 Rexx][5 Reyes][2 Reza][1 Rezai][1 Rezi][1 Rhea's][2 Rhys][1 Riaz][2 Ric][3 Rica][1 Ricca][1 Ricci][14 Rice][5 Rich][1 Riche][1 Richeiu][1 Richey][3 Richie][10 Rick][1 Ricks][6 Ricky][18 Rico][1 Riez][2 Riga][1 Rigyu][1 Riichi][1 Rik][1 Rika][1 Riki][1 Rikkyo][1 Rioja][1 Rios][1 Ris][1 Risa][1 Risco][5 Rise][3 Risk][1 Riss][1 Riza][1 Roach][1 Roache][2 Roc][1 Rocca][1 Rocha][4 Roche][49 Rock][2 Rockaway][9 Rocks][1 Rogge][1 Rogowo][1 Roig][1 Rois][1 Roissy][2 Rojo][1 Rok][1 Roka][2 Rokeya][2 Rook][1 Rookie][1 Rooks][1 Roos][1 Roosa][1 Roque][8 Rosa][1 Rosco][2 Roscoe][25 Rose][1 Rosie][24 Ross][4 Rossi][1 Rossiya][2 Rosso][5 Rouge][1 Rough][2 Rouse][2 Roush][1 Roux][1 Rowhouse][1 Rowska][2 Roxy][3 Royce][1 Roz][1 Ruche][1 Ruecho][1 Rug][4 Ruiz][1 Rusch][1 Ruse][2 Rush][1 Rusi][1 Rusko][2 Russ][25 Russia][2 Russo][1 Ryogo][1 Ryoichi][1 Ryoji][1 Ryukyu][8 race][1 rack][1 raga][1 raise][1 res][1 rescue][1 rhyssa][2 rice][1 rich][1 rico][1 rise][5 rock][1 rocks][1 rook][2 rosea][1 roshcha][1 rosse][1 rosy][1 rouxae][1 rugs][1 russe]

Most of those actually aren't all that bad because they only have two sounds in them: r plus something in the c, g, j, k, q, s, x, z family. (Because c can sound like either s or k, everything that sounds remotely like s or k ends up in one big group. Also, since "sounds like k" includes g, and g sometimes sounds like j, j gets tossed in, too.)

The bigger problems come from ignoring h, w, and y in contexts where they are clearly consonants, effectively ignoring additional syllables in words like Rockaway.

The next three largest groups have 150 words each. Two are similar to R200: B200 and L200. The third group, T000, is a bit different in that it has words starting with t followed by vowels, h, w, and y, and also many ordinals like 189th, where the numbers and the h are ignored, leaving just t, which is coded as T000.

The next few biggest groups follow a similar pattern: X200 for some value of X, and S300 (which matches all the #st ordinals).

The largest group (I536) in Wiktionary, with 252 words, consists mostly of things that start with inter- and intra-, with a few intro- and intru(d)- items, and a few others. Obviously this is where we see the excess recall from limiting the length of the code.

The next largest group (T652), with 184 words, is mostly things starting with trans-. The next largest group (C536) contains words starting with cent(e)r-, contra-, and counter-. The next few groups are similar.

My impression: Soundex has too short of a code length for a dictionary, for sure; lack of first letter variation is probably too limiting in one dimension, while ignoring h, w, and y is not limiting enough in a different dimension. Folding is necessary to get any results on words with diacritics.

Metaphone

Metaphone is a 1990 improvement to Soundex. It's aware of a lot of the common silent consonants in English (knight, gnome, bomb, badge, etc.); it tries to split c into either the k or s groups and g into j or k groups based on the surrounding letters; it tries to account for common but non-standard spelling-to-sound mappings, like -tion; it holds on to h, w, and y before vowels. It is also intended to be used on general words, and not just names.
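
A minimal sketch of Commons Codec's Metaphone encoder on a few of the words from the groupings below; the default code length of 4 is what produces the long-word collisions:

    import org.apache.commons.codec.language.Metaphone;

    public class MetaphoneExamples {
        public static void main(String[] args) {
            Metaphone metaphone = new Metaphone(); // default max code length of 4

            // The first letter is encoded rather than copied, and c is split between
            // the k and s groups by context, so both spellings land on KLKT.
            System.out.println(metaphone.encode("Calcutta"));
            System.out.println(metaphone.encode("Kolkata"));

            // With only 4 characters of code, very long words still collide.
            System.out.println(metaphone.encode("Galactofuranosylgalactofuranosylrhamnosyl"));
            System.out.println(metaphone.encode("Colgate")); // also KLKT
        }
    }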

Wikipedia:

  • Total new tokens: 143,985
  • pre-analysis types: 54,001
  • post-analysis types: 14,822 (3.3-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 54,877
  • pre-analysis types: 52,848
  • post-analysis types: 21,421 (2.4-to-1 vs baseline)

We have the same number of tokens and pre-analysis types as Soundex because we are using the same tokenizer. The post-analysis type counts show that Metaphone is less aggressive about conflating, especially because it breaks up Soundex's c, g, j, k, q, s, x, z family.

Random Observations:

  • th is encoded as 0, so the actual number 0 is grouped with words starting with th—along with #th ordinals!
  • Numbers seem to be ignored in general.
  • Really long words still group somewhat oddly, since only the first 4 letters of the encoding are retained. Our friend Galactofuranosylgalactofuranosylrhamnosyl is grouped with galactic, Colgate, and Kolkata.
  • Not retaining the first letter of the word allows Calcutta and Kolkata to be in the same group.
  • The groups of letters that are coded the same are much smaller than in Soundex, decreasing recall, probably in a good way.
  • The 180-character token for a 15-letter Gothic word is back. That's probably common to all of these phonetic algorithms.

The largest group from the Wikipedia sample for Metaphone has 166 words in it, too. (The fact that the max is 166 again is coincidental, since they are unrelated groups):

  • K: [1 1K][1 23c][2 2C][1 38CK44][1 3C][1 68HC08][1 C.A.I][1 C1][1 C10H12O2][1 C10H18][1 C12.10][1 C15][1 C18H26O3][1 C19H18O6][1 C2][1 C22H22O11][1 C23H34O5][1 C27H30O14][1 C3H7][1 C5][2 C65][1 C8051][1 CA][1 CA39][1 CO][2 CW][2 Ca][1 Cae][1 Caeau][1 Cai][1 Caio][1 Cao][1 Cau][1 Cay][1 Caño][1 Cha][1 Chaa][1 Chaah][8 Chah][1 Chao][1 Chay][1 Che][1 Cheoah][1 Chew][3 Chi][1 Chih][2 Chiu][1 Cho][1 Choe][2 Choo][2 Chow][2 Chu][2 Chuy][17 Co][3 CoA][1 Coe][2 Cow][2 Cue][2 Cui][1 Cà][1 Céu][2 G.I][1 G0423][1 G1202][1 G50][1 G6][1 G98][1 G99][1 GA200][6 GAA][2 GO][2 Ga][1 Gah][1 Gahō][1 Gai][1 Gaia][2 Gao][5 Gay][30 Go][4 Goa][2 Gough][2 Gu][1 Guieu][1 Guié][2 Guo][9 Guy][1 Guľa][1 Gäu][1 Gée][1 H.Q][2 HC][1 HK][1 HKK][1 Hyōgo][1 Hücü][43 K][1 K.O][1 K5000][1 K6][1 KE800][2 KIA][1 KIU][1 KK][1 KYHW][1 KYYW][9 Ka][2 Kai][5 Kay][2 Kačić][1 Kaʻau][4 Ke][2 Kee][1 Kei][8 Key][6 Ki][1 Kii][1 Kićo][4 Ko][1 Koh][4 Koi][1 Kou][1 Koča][1 Kočí][2 Koło][1 Ku][2 Kuh][1 Kuyō][1 Kyō][1 Kyū][1 Kõue][1 Kłyżów][1 Kō][1 Q1][1 Q105][1 Qiao][1 Qiu][1 Qu][1 Quai][1 Quay][2 Que][1 Queah][3 Quo][1 Quy][1 Quşçu][1 W.A.K.O][1 WHYC][1 WQHY][1 YC][1 Yūki][1 cae][2 chi][1 chih][3 ga][1 gay][1 gu][1 kai][2 key][1 ki][1 ko][1 kyū][2 que][1 qui][1 Çayköy][1 Íñigo][1 Öga][1 Ýagşyýew][2 Đokić][1 Ļo̧k][1 Łąka][1 Łęck][1 Łęka][1 Łężki][1 Ōki][1 ŽKK][1 Žiga][1 Țicău]

Acronyms with h, w, and y are not treated well (KYHW, KYYW) because those letters are ignored when adjacent to other consonants, and ignoring numbers leads to the weird treatment of the chemical formulas (esp. since they tend to have H and O, which can get ignored). The worst here, I think, are words with diacritics; it looks like those letters are just ignored. Folding before Metaphone would improve that dramatically.

The next few largest groups are similar: K + some other letter. It looks like K (which encodes k, q, g, and sometimes c) is the biggest letter group, so this is expected. The next few are KT (133), KR (123), KN (122), KL (121), and then T (112).

Since KT was the largest 2-letter group, it's no surprise T is the second largest 1-letter group. It looks similar to K—short words starting with related letters (this time t and d), mixes of numbers and letters, acronyms with h, w, and y, and ignored letters with diacritics (e.g., Çatay and Łódź). Next: TN (110), TR (106) and then 0 (103).

0 is the encoding for TH, and it picks up all the ordinals ending in -th, like 189th. Next is ST (100) which has ordinals ending in -st.

Next: S (98), which is generally similar to K and T, but also picks up decades, like 1970s.

My impression: I'm still not liking the 4-letter code length for dictionaries. Splitting up the C-related group from Soundex is an improvement in limiting too much recall, and allowing the first character to be normalized is an improvement in too little recall. Of course, it was intended as an improvement over Soundex, so it's not surprising that it is! Folding is necessary to avoid bad results on words with diacritics.

Double Metaphone

Double Metaphone is a 2000 update to Metaphone. It's "double", I think, because it can return both a primary and secondary encoding for a given word. (From looking at samples, it looks like the primary assumes English spelling, while the secondary makes a guess at non-English spelling, if appropriate.) This helps handle some of the ambiguity of English spelling, without having to either arbitrarily split or join groups of letters together when there is a small overlap between the groups. It is supposed to be smarter about words that come from various sources (i.e., other languages) and considers a lot more context. The Wikipedia article on Double Metaphone says around 100 contexts are considered for the letter c. That sounds about right. (And let me take the opportunity to again apologize for English spelling to anyone who had to learn the language as an adult. It's ridiculous!)
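
Commons Codec's DoubleMetaphone exposes both encodings directly; a minimal sketch (Smith is the stock primary/secondary example from the algorithm's own documentation, not a word from the title data):

    import org.apache.commons.codec.language.DoubleMetaphone;

    public class DoubleMetaphoneExamples {
        public static void main(String[] args) {
            DoubleMetaphone dm = new DoubleMetaphone(); // default max code length of 4

            // Primary vs. alternate encoding; when they differ, the alternate usually
            // reflects a plausible non-English reading of the spelling.
            System.out.println(dm.doubleMetaphone("Smith"));        // SM0
            System.out.println(dm.doubleMetaphone("Smith", true));  // XMT

            // For many words the two encodings are identical.
            System.out.println(dm.doubleMetaphone("Kolkata"));
            System.out.println(dm.doubleMetaphone("Kolkata", true));
        }
    }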

Code Length 4

The default code length for Double Metaphone is 4, so I ran a test with that. I also ran a test with code length of 10 (see below) as a possibly extreme counterpoint to the shorter default.

Wikipedia:

  • Total new tokens: 166,896
  • pre-analysis types: 61,961
  • post-analysis types: 10,597 (5.3-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 60,407
  • pre-analysis types: 58,226
  • post-analysis types: 18,574 (3.1-to-1 vs baseline)

We have significantly more tokens because many words get two phonetic encodings. Post-analysis type ratios show that the Double Metaphone grouping is fairly aggressive (more so than Metaphone).

Random Observations:

  • It's possible to get an empty stem! Háj and WJ are both stemmed as J and as a space (so not an empty string, but an empty stem). (Metaphone only gives the J option for these.)

Other observations are similar to Metaphone:

  • th is still encoded as 0, so the actual number 0 is grouped with words starting with th—along with #th ordinals.
  • Numbers seem to be ignored in general.
  • Really long words still group somewhat oddly, since only the first 4 letters of the encoding are retained. Our friend Galactofuranosylgalactofuranosylrhamnosyl is grouped with galactic, Colgate, and Kolkata.
  • Not retaining the first letter of the word allows Calcutta and Kolkata to be in the same group.
  • The groups of letters that are coded the same are much smaller than in Soundex, decreasing recall, probably in a good way.
  • The 180-character token for a 15-letter Gothic word is still there.

The largest group on the Wikipedia data is T, with 223 words:

  • T: [5 10th][1 112th][2 113th][1 116th][1 117th][3 11th][1 120th][1 126th][1 127th][6 12th][1 134th][5 13th][1 145th][4 14th][1 155th][4 15th][1 168th][1 16DA][4 16th][3 17th][1 189th][3 18th][1 198th][4 19th][1 207th][8 20th][1 210th][1 214th][1 215th][5 24th][1 255th][1 257th][2 25th][3 26th][2 27th][3 28th][1 2d][2 30th][1 311th][1 312th][1 319th][1 349th][3 34th][1 35th][5 36th][1 374th][3 37th][1 387th][1 388th][2 38th][2 39th][7 3D][1 3DI][1 3d][1 400th][2 40th][2 44th][1 460th][1 47th][1 490th][2 49th][1 4D][18 4th][1 527th][4 54th][2 57th][1 58th][1 59th][16 5th][1 60th][1 639th][2 65th][1 67th][1 68th][2 69th][10 6th][1 745th][1 75th][1 76th][1 77th][1 789th][2 79th][8 7th][1 80th][1 834th][1 86th][1 878th][6 8th][1 922d][1 94th][1 98th][1 9HD][6 9th][63 D][2 D.I][1 D01][1 D5100][1 D68][1 D94][8 DD][1 DD35A][2 DE][1 DEI][1 DIII][1 DO][2 DT][9 Da][2 Dae][2 Dai][1 Daia][1 Daio][1 Dao][1 Dawe][1 Dawei][51 Day][43 De][3 Dee][12 Deh][1 Dei][2 Dey][1 Deyeh][5 Di][1 Dia][10 Die][1 Dieu][1 Dio][1 Diệu][29 Do][2 Doe][2 Doi][1 Doo][1 Douai][3 Dow][5 Du][2 Due][2 Duo][2 Duy][1 Duži][1 Dwa][1 DyE][2 Dye][1 Dâu][1 Día][10 HD][1 HTT][47 T][1 T1][1 T12][1 T20][1 T42][1 T54][1 T8][1 TD][1 TEU][1 THEY][1 TI][3 TT][1 TT23][1 TW][1 TWA][1 Ta][2 Tae][1 Tah][4 Tai][1 Taihō][1 Taiwu][1 Taiyo][3 Tao][1 Tau][1 Tawa][1 Tawia][2 Tay][2 Te][5 Tea][1 Teh][1 Tei][1 Tey][1 Th][5 Tha][4 Thai][1 Thawi][1 Thaya][1042 The][2 Thee][1 Theo][7 They][1 Thi][1 Thia][1 Thou][1 Thu][1 Thy][1 Thế][4 Thị][1 Thời][1 Thủy][2 Ti][2 Tie][2 Tier][1 Tigh][1 Tião][14 To][1 Toee][1 Toh][5 Too][1 Toy][1 Toya][1 Toše][1 Tu][1 Tui][1 Tuya][1 Tuyo][31 Two][4 Ty][1 Tàu][1 Tú][1 Tŷ][1 Tứ][1 Tự][1 W39DE][2 WTA][1 WTTO][30 da][3 day][446 de][1 dea][5 dei][33 di][1 die][28 do][44 du][1 dō][1 t][1 ta][1 ta'ai][1 tae][1 tau][2 te][4 tea][1 tei][1 th][1382 the][1 ti][2 tie][182 to][1 toe][1 too][1 tou][2 toy][1 tu][2 tui][1 two][1 út][10 Łódź][2 Ōita][2 Ōta]

Lots of ordinals because numbers are ignored, acronym problems (HTT, TWA), and problems with ignoring diacritics (Łódź, Ōita). Definitely need some ASCII folding.

The second largest Wikipedia group, ANTR, has 203 words. These are the inter- and intra- words, but also and(e)r-, ent(e)r-, unter- and a few words starting with w and j, which seem to be ignored in non-English mode.

The next largest, A, has 185 words. They are all strings of vowels, numbers, j, w, y and some h's. A few diacritical characters are improperly ignored (ḍ, ł, ż, ž, ř, ș) but generally this one is a messy grab-bag—which is sort of to be expected, since most phonetic algorithms mostly ignore vowels. It's a bigger group than in Metaphone because j, w, y are being ignored in the secondary encoding.

The next largest is K (185), which is similar to Metaphone's K group, with a few additional presumably secondary encodings (e.g., cc, CH) thrown in.

The next groups are AT (183), AK (172), AN (164), and KT (164). The A<CONS> groups have more members now because of the secondary encodings with j, w, y. Otherwise, the big groups are similar to Metaphone.

The biggest Wiktionary group is ANTR with 438 words. It's similar to the Wikipedia ANTR group, just with even more non-English words that start with cognate prefixes. Next is ANKR (155) which are all match the vowel-N-G/K-R pattern, then KNTR (154) with co(u)nter-, contra- and cognates, plus a few others. AKST (150) is the extr- group, though it picks up some words with k/g and s separated by vowels, such as augusta and icosidodecadodecahedrons. TRNS (150) is the trans- group.

My impression: Similar to Metaphone, but with additional encodings, which cast a wider net, but lead to some over-aggressive matches.

Code Length 10

Wikipedia:

  • Total new tokens: 167,985
  • pre-analysis types: 62,681
  • post-analysis types: 23,674 (2.4-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 62,293
  • pre-analysis types: 60,110
  • post-analysis types: 44,702 (1.3-to-1 vs baseline)

More pre-analysis types and many more post-analysis types since more of the latter part of longer words can be used to distinguish them.

Random Observations:

  • A lot is similar to the code length 4 case above, because short words are still short and that's where the interesting collisions happen.
  • Háj and WJ have empty stems.
  • th is still encoded as 0, so the actual number 0 is grouped with words starting with th—along with #th ordinals.
  • Numbers seem to be ignored in general.
  • Calcutta and Kolkata are in the same group, which is much smaller.
  • The 180-character token for a 15-letter Gothic word is still there.

Some things are different:

  • Galactofuranosylgalactofuranosylrhamnosyl doesn't group with anything, since its encoding, KLKTFRNSLK, is longer and now unique.
  • Groups are generally smaller, since more of the word needs to match.

The largest Wikipedia group is T (223), generally the same as before—all the short words that are -th ordinals or t/d plus vowels, h, j, w, y. The next groups, A (185) and K (185), are similar to the code length 4 groups from before, but mostly limited to the shorter words. AT (183), AK (172), AN (164), KT (164), and KN (150) are likewise similar to the ones from before, but limited to shorter words.

The largest Wiktionary groups, T (62), K (61), A (55), S (54), AN (53), P (49), and PL (48), are all made up of short words.

The larger groups, with inter-, counter-, extra-, etc., have been broken up into many smaller groups.

My impressions: This is much better than Double Metaphone code length 4 because you don't have groups based entirely on common prefixes (and for Wiktionary, cognates of those prefixes), though it is possible that code length 10 is too long for the "I can only remember the beginning of the word" case. It still has the same problems for short words. Finding a good code length (less than 10? more than 10?!) will be required if we use Double Metaphone.

Metaphone 3

There is yet another update to Metaphone—Metaphone 3 from 2009 and later—but it is only available commercially.

Languages other than English

There are Metaphone-like implementations for other languages—including Portuguese, Spanish, Russian, German and others—linked to on the Wikipedia page.

It might be plausible to convert any of those to Java as needed.

Refined Soundex

Refined Soundex doesn't have a Wikipedia page (just a redirect to Soundex) and seems to have possibly been created by Apache. It's hard to tell, and searching Google Scholar for early references isn't helping much. Various variants of Soundex have been proposed over the years, and this one breaks up b, p, f, v into b, p and f, v, and the big c, g, j, k, q, s, x, z group into c, k, s and g, j and q, x, z.
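
A minimal sketch of the Commons Codec RefinedSoundex encoder, showing the two properties that matter most below: unlimited code length with vowel positions marked, and the first letter copied unchanged:

    import org.apache.commons.codec.language.RefinedSoundex;

    public class RefinedSoundexExamples {
        public static void main(String[] args) {
            RefinedSoundex refined = new RefinedSoundex();

            // Codes aren't length-limited and vowel positions are coded as 0, so a
            // very long word gets a very long, effectively unique code.
            System.out.println(refined.encode("Galactofuranosylgalactofuranosylrhamnosyl"));

            // As with plain Soundex, the first letter is copied unchanged, so
            // Khloe and Chloe still can't match.
            System.out.println(refined.encode("Khloe"));
            System.out.println(refined.encode("Chloe"));
        }
    }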

Wikipedia:

  • Total new tokens: 143,985
  • pre-analysis types: 54,001
  • post-analysis types: 30,738 (1.6-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 54,877
  • pre-analysis types: 52,848
  • post-analysis types: 47,825 (1.1-to-1 vs baseline)

We have the same number of tokens and pre-analysis types as Soundex because we are using the same tokenizer. The post-analysis type counts indicate that it is much less aggressive than regular Soundex.

Random Observations:

  • There's no limit on the length of the encoding, and word-internal vowels are preserved (coded as "0"). Our friend Galactofuranosylgalactofuranosylrhamnosyl is coded as G407036020908030740703602090803079080307, and is unlikely to match anything else.
  • The first letter in the word is preserved, which can make for too little recall. Khloe and Chloe could not match.
  • Numbers are ignored, so 60A and A1010 are matches.
  • Codes are numeric. This isn't as much of a problem because the first character of the code is a letter.
  • Non-initial vowels, h, w, and y are all coded the same way, so long stretches of them collapse together, which again hides how many syllables there are. Final vowels are included in the encoding, which leads to lots of final silent -e being included. As a result, all of the following are effectively encoded as B-R-<vowels>-C/K/S-<vowels>: Bracci, Bruce, Brewhouse, Breakaway.
  • Refined Soundex also punts on strings with unknown characters, so Sławica is analyzed as Sławica. (That is, it doesn't generate a code—it just returns the whole original word, unchanged.) Folding could help.
  • The 180-character token for a 15-letter Gothic word is still there.
  • In general, groups are small, and other than short words and words with too many h, w, and y, grouped words are pretty similar.

The largest Wikipedia group is T60 with 139 words in it. It's our old friend T+vowels and ordinals.

  • T: [5 10th][1 112th][2 113th][1 116th][1 117th][3 11th][1 120th][1 126th][1 127th][6 12th][1 134th][5 13th][1 145th][4 14th][1 155th][4 15th][1 168th][4 16th][3 17th][1 189th][3 18th][1 198th][4 19th][1 207th][8 20th][1 210th][1 214th][1 215th][5 24th][1 255th][1 257th][2 25th][3 26th][2 27th][3 28th][2 30th][1 311th][1 312th][1 319th][1 349th][3 34th][1 35th][5 36th][1 374th][3 37th][1 387th][1 388th][2 38th][2 39th][1 400th][2 40th][2 44th][1 460th][1 47th][1 490th][2 49th][18 4th][1 527th][4 54th][2 57th][1 58th][1 59th][16 5th][1 60th][1 639th][2 65th][1 67th][1 68th][2 69th][10 6th][1 745th][1 75th][1 76th][1 77th][1 789th][2 79th][8 7th][1 80th][1 834th][1 86th][1 878th][6 8th][1 94th][1 98th][6 9th][1 TEU][1 THEY][1 TI][1 TW][1 TWA][1 Ta][2 Tae][1 Tah][1 Taha][1 Tahoe][4 Tai][1 Taiwu][1 Taiyo][3 Tao][1 Tau][1 Tawa][1 Tawia][2 Tay][2 Te][5 Tea][1 Teh][1 Tei][1 Tey][1 Th][5 Tha][4 Thai][1 Thawi][1 Thaya][1042 The][2 Thee][1 Theo][7 They][1 Thi][1 Thia][1 Thiha][1 Thou][1 Thu][1 Thy][2 Ti][1 Tia1][2 Tie][14 To][1 Toee][1 Toh][5 Too][1 Toy][1 Toya][1 Tu][1 Tui][1 Tuya][1 Tuyo][31 Two][4 Ty][1 ta][1 ta'ai][1 tae][1 tau][2 te][4 tea][1 tei][1 th][1382 the][1 ti][2 tie][182 to][1 toe][1 too][1 tou][2 toy][1 tu][2 tui][1 two]

The next largest group, R9030, only has 77 items, which are mostly short words starting with r with a c, k, s and optional h as the second consonant. Other than Rockaway they are all reasonable. Next largest is S3080 (75—S..M/N..), M8060 (68—M..T/D..), S30 (68—S..), and M8030 (67—M..C/K/S..).

The largest Wiktionary group C3030 only has 24 items and follows the pattern C/K/S..C/K/S.. Next is D603 (19), S30 (19), S3080 (19), B108 (17), R9030 (17), and S3060 (17), which are all really small groups.

My impressions: The groups are pretty focused except for the very short words (which is unavoidable) and those with too many h, w, y acting as consonants. I think this is a big swing toward precision away from recall, since g/k and s/z are split up. Grouping q, x, z is also a bit weird, and may be more of a "weird letter" group than a phonetic group.

NYSIIS

NYSIIS is from 1970 and is touted as being 2.7% more accurate than Soundex. It includes some contextual substitutions before the character-by-character encoding, such as KN → NN. Historical note: non-final substitutions use the same number of characters (e.g., KN → NN rather than KN → N, and SCH → SSS rather than SCH → S) because back in 1970 fixed-width fields were a common way of processing data more efficiently, and removing characters from the middle of a string was computationally too expensive. (See Moore's law—computers today are on the order of 10 million times faster than they were in 1970.) The Apache implementation still does the same-number-of-characters substitutions, even though NN and SSS end up being encoded the same as N and S, respectively.

The Wiki page says that the original code length is 6, though different implementations may use a different length. The Elastic/Lucene/Apache implementation does use 6.
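
A minimal sketch of the Commons Codec Nysiis encoder; the default constructor is the strict, 6-character version that Elasticsearch uses, and the GALACT collision in the comment is the one noted in the observations below:

    import org.apache.commons.codec.language.Nysiis;

    public class NysiisExamples {
        public static void main(String[] args) {
            Nysiis nysiis = new Nysiis(); // strict mode: codes are truncated to 6 characters

            // Medial vowels all become A and only 6 characters survive, so very
            // different galacto- words end up in the same GALACT group.
            System.out.println(nysiis.encode("Galactofuranosylgalactofuranosylrhamnosyl"));
            System.out.println(nysiis.encode("galactic"));

            // A non-strict encoder keeps the full-length code instead.
            Nysiis longCodes = new Nysiis(false);
            System.out.println(longCodes.encode("Galactofuranosylgalactofuranosylrhamnosyl"));
        }
    }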

Wikipedia:

  • Total new tokens: 143,985
  • pre-analysis types: 54,001
  • post-analysis types: 23,895 (2.0-to-1 vs baseline)

Wiktionary:

  • Total new tokens: 54,877
  • pre-analysis types: 52,848
  • post-analysis types: 31,039 (1.7-to-1 vs baseline)

NYSIIS is much less aggressive than Soundex in terms of number of terms grouped together.

Random Observations:

  • Galactofuranosylgalactofuranosylrhamnosyl is coded as GALACT, and so is grouped with galactic, Galactica, galactofuranosyltransferase and others.
  • The code length is 6, but vowels are preserved (medial vowels are all encoded as A). This distinguishes clusters (str) from separate syllables (s..t..r), but also means less of the word is encoded.
  • The first letter in the word is preserved, which can make for too little recall. Khloe and Chloe could not match.
  • Numbers are ignored, so 60A and A1010 are matches.
  • w is treated like a vowel (but h and y are not), so there aren't as many long stretches of letters that get ignored, though Hawaii is encoded simply as H.
  • Diacritical and non-Latin characters are preserved while everything else is encoded, and then everything is reasonably uppercased. So Sławica is encoded as SŁAC, háj as HÁJ, and αβγδε as ΑΒΓΔΕ. Yep, we need folding.
  • Most final vowels are dropped, so cam and camo group together. Final y, ee, ie and similar are encoded as Y.
  • The 180-character token for a 15-letter Gothic word is still there.

The largest Wikipedia group is the familiar T group, with 142 words.

  • T: [5 10th][1 112th][2 113th][1 116th][1 117th][3 11th][1 120th][1 126th][1 127th][6 12th][1 134th][5 13th][1 145th][4 14th][1 155th][4 15th][1 168th][4 16th][3 17th][1 189th][3 18th][1 198th][4 19th][1 207th][8 20th][1 210th][1 214th][1 215th][5 24th][1 255th][1 257th][2 25th][3 26th][2 27th][3 28th][2 30th][1 311th][1 312th][1 319th][1 349th][3 34th][1 35th][5 36th][1 374th][3 37th][1 387th][1 388th][2 38th][2 39th][1 400th][2 40th][2 44th][1 460th][1 47th][1 490th][2 49th][18 4th][1 527th][4 54th][2 57th][1 58th][1 59th][16 5th][1 60th][1 639th][2 65th][1 67th][1 68th][2 69th][10 6th][1 745th][1 75th][1 76th][1 77th][1 789th][2 79th][8 7th][1 80th][1 834th][1 86th][1 878th][6 8th][1 94th][1 98th][6 9th][47 T][1 T1][1 T12][1 T20][1 T42][1 T54][1 T8][1 TEU][1 TI][1 TSS][3 TT][1 TT23][1 TTT][1 Ta][2 Tae][1 Tah][4 Tai][1 Taiwu][3 Tao][2 Taos][1 Tau][1 Taus][1 Tawa][1 Tawia][2 Te][5 Tea][2 Tees][1 Teh][1 Tei][1 Th][5 Tha][4 Thai][1 Thais][1 Thawi][1042 The][1 Theeuwes][1 Theiss][1 Theo][1 Thi][1 Thia][15 This][1 Thou][1 Thu][1 Thus][2 Ti][1 Tia1][1 Ties][1 Tisch][1 Tish][14 To][1 Toh][5 Too][1 Toos][1 Tu][1 Tui][1 t][1 ta][1 ta'ai][1 tae][1 tau][2 te][4 tea][1 tei][1 th][1382 the][1 ti][182 to][1 toe][1 too][1 tou][1 tu][2 tui]

The next largest is CAN (127) with mostly one- and two-syllable words matching C/K..M/N.., followed by C (122) which includes chemical formulas. Next: SAN (100), S (93), DAN (88), and CAL (83).

The largest Wiktionary group is CANTRA (121), which includes lots of cent(e)r-, contr- and cognate prefixes. Next is CARCAN (104) with lots of circum- prefixes, CANPLA (61) with co(m/n)pl- prefixes, CANANA (57), CANTAN (57), SAPARA (47), and DASCAN (43).

It's interesting that the largest Wiktionary groups are all 6 letters.

Head-to-Head

In general, it's difficult to directly compare across phonetic algorithms because the groups they create are multiply overlapping, and the encodings they use are not comparable. (With stemmers in English, at least, you can expect that most "boring" nouns and verbs will be stemmed to their proper base forms, so you can compare across stemmers.)

Soundex and Refined Soundex, for example, can't easily be compared because too many things have changed at once. There's the change in code length (4 vs unlimited), the fact that Refined Soundex encodes the first letter of a word twice (as both the letter and its code), and the fact that Refined Soundex represents vowels and Soundex doesn't. There's no overlap in the codes.

The Metaphone family, though, does allow some comparison. Double Metaphone encodes some things differently than Metaphone, but many are the same—enough to get a sense of how the changes affect groupings. And for Double Metaphone we can look at different code lengths (4 vs 10) to see what effect that change has.
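
Since the Metaphone-family encoders all live in the same Commons Codec package, this kind of comparison is easy to reproduce by encoding the same words side by side; a minimal sketch (the sample words are ones discussed in the subsections below):

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.commons.codec.language.Metaphone;

    public class MetaphoneFamilyComparison {
        public static void main(String[] args) {
            Metaphone metaphone = new Metaphone();       // code length 4
            DoubleMetaphone dm4 = new DoubleMetaphone(); // code length 4
            DoubleMetaphone dm10 = new DoubleMetaphone();
            dm10.setMaxCodeLen(10);

            String[] samples = {"Hytham", "Them", "theologian", "theologians", "thunder", "thunderball"};
            for (String word : samples) {
                System.out.printf("%-12s  M: %-5s  DM4: %s / %s  DM10: %s / %s%n",
                        word,
                        metaphone.encode(word),
                        dm4.doubleMetaphone(word), dm4.doubleMetaphone(word, true),
                        dm10.doubleMetaphone(word), dm10.doubleMetaphone(word, true));
            }
        }
    }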

Metaphone vs Double Metaphone

Comparing Metaphone with Double Metaphone (with code length 4—the same as Metaphone) a few things stick out:

  • Much better treatment of initial hy- and wy-—the h and w are treated as consonants and the y as a vowel, so you don't have Hytham grouped with Them or Wythnos with Thins.
  • Some special coding for specific words or parts of words. For example, in archit- the ch is coded only as k, and pseudo- is coded without the p, just s.
  • Better handling of English orthographic stupidity, notably ough.
  • A lot of the splits seem to be improvements, especially for letter combinations that are characteristic and distinctive for certain languages, like German sch, Russian -ov, and Polish sz, cz.
  • A lot of mergers/collisions come from allowing the first letter of a word to be encoded, so words starting with vowels can group together, or b- and p- words, etc.
  • A lot more mergers/collisions come from treating initial j, w, y in a non-English way, and letting them group with vowel-initial words.

Of course, not all the splits and mergers are great, but more look good than not.

Double Metaphone: Code Length 4 vs 10

All of the changes are splits, since encoded tokens are either the same (for short words) or longer (for long words). There are some plurals and other inflections that show up as the fifth consonant in the encoding, so theologian is split from theologians and athletic splits from athlete. On the other hand, longer words can start the same but not be terribly related, such as all the inter- words and counter- words. Others are harder to tell: ecology/ecologista, Thailand/thailandensis, tuberculosis/tuberculata, thanks/thanksgiving, thunder/thunderball.

Of course, what is "good" in these cases depends on your use cases and user model. I don't think thunder/thunderball is a huge loss, but I can see thailandewhachamajig or tuberculothingie being all someone could remember for the unfamiliar scientific terms.

Overall Impressions

  • Different algorithms are trying to accomplish different things and have different purposes. Soundex, NYSIIS, and possibly Refined Soundex are designed specifically to match names, not general words.
  • When used for genealogical purposes, paying extra attention to the first letter and ignoring the second half of a long name makes sense, based on what is likely to be stable over generations. For words that are significantly shorter than Galactofuranosylgalactofuranosylrhamnosyl, though, expecting to remember the general form of the whole word is more reasonable. Shorter code lengths model the former, longer code lengths model the latter. The best code length is an open question, and an exact number also depends on other elements of the encoding (esp. whether vowels are represented in the code).
  • The earliest algorithms, especially Soundex, needed to be simple as well, so that people could perform them manually with reasonable accuracy. This leads to either over-aggressive, contextless grouping of letters (Soundex) or arbitrary splitting of groups (Refined Soundex).
  • Ignoring vowels entirely seems better for genealogy, where syllables can get reduced over generations. Vowel quality isn't constant, but there's a clear difference between str- and sitar.
  • None of the algorithms, strictly interpreted, knows what to do with diacritical letters. Folding before encoding makes sense, since people often ignore them and don't know when they have very divergent pronunciation. (For example, I expect English speakers to pronounce Polish ł as l, not w.)
  • The treatment of numbers is likely a problem.
  • Short words are always going to be a mess.
  • All of the algorithms are in theory "fixable," in that we could at the very least fork the code and add new features (adjustable code length, non-numeric codes, etc.). Less drastic measures include filtering tokens before or after encoding, or making pull requests for simple features—though the cycle for such requests to make it back to production Elasticsearch can be quite loooooong.

Compression Aggression

Below is a table summarizing the comparative aggressiveness of the grouping for each algorithm (vs the baseline level of grouping with English stemming).

  Algorithm               Wikipedia  Wiktionary
  Soundex                       5.8         2.7
  Metaphone                     3.3         2.4
  Double Metaphone (4)          5.3         3.1
  Double Metaphone (10)         2.4         1.3
  Refined Soundex               1.6         1.1
  NYSIIS                        2.0         1.7

The raw numbers aren't the only thing that matters, as some algorithms group too aggressively along one dimension while not being aggressive enough along another, but the comparison is useful to get a general idea of the overall level of compression in the encodings.

Improvements and Implementations

We can mitigate some of the precision problems with some additional steps in the analysis chain.

  • Obviously character folding and other normalization will help with the diacritical characters—and prevent either skipping the word, ignoring the letter, or leaving it unencoded in the output, depending on the algorithm.
  • We could also drop stop words before encoding, and skip other problematic cases, such as words with numbers.
  • We could also have a minimum word length or minimum encoded length. I lean toward minimum word length. Hawaii gets encoded as H by Double Metaphone, as do h, ha, he, hi, ho, Hu, haa, hah, hay, hee, hey, how, high and many other short words. But with a five-letter minimum word length, the density of quite reasonable candidates (Hawaa, Hawaai, Hawai, Heiau, Huawei, Huiyi) increases significantly.
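
A rough sketch of what those analysis-chain mitigations could look like in front of the encoder (the stop word list, the digit check, and the five-letter minimum are illustrative assumptions, not a worked-out design):

    import java.text.Normalizer;
    import java.util.Optional;
    import java.util.Set;
    import org.apache.commons.codec.language.DoubleMetaphone;

    public class PhoneticPreFilter {
        private static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "a", "in"); // illustrative only
        private static final int MIN_WORD_LENGTH = 5;
        private final DoubleMetaphone encoder = new DoubleMetaphone();

        public Optional<String> encode(String token) {
            // Fold combining diacritics; letters like Polish ł would need extra mappings.
            String folded = Normalizer.normalize(token, Normalizer.Form.NFD)
                                      .replaceAll("\\p{M}", "")
                                      .toLowerCase();
            if (STOP_WORDS.contains(folded)) {
                return Optional.empty(); // drop stop words before encoding
            }
            if (folded.matches(".*\\d.*")) {
                return Optional.empty(); // skip tokens with numbers in them (60A, 189th, ...)
            }
            if (folded.length() < MIN_WORD_LENGTH) {
                return Optional.empty(); // skip very short words that encode to almost nothing
            }
            return Optional.of(encoder.doubleMetaphone(folded));
        }

        public static void main(String[] args) {
            PhoneticPreFilter filter = new PhoneticPreFilter();
            System.out.println(filter.encode("189th"));  // Optional.empty
            System.out.println(filter.encode("Hawaii")); // Optional[H]
        }
    }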

We can also mitigate some problems with how we search:

  • Limiting matches to titles (and redirects) will eliminate a lot of spurious matches.
  • We could apply some additional constraints, such as requiring one non-phonetic match or a minimum score.
  • Ranking by some similarity metric—such as Levenshtein edit distance, a modified Hamming distance, or Jaccard similarity over bigrams—is appealing, but seems to be too complex to implement within Elasticsearch.

While I was thinking about implementing this as a new keyword, David has some additional ideas about how to integrate it with our normal search to increase recall—though I worry about the crazier matches being ranked too highly.

Another potentially crazy idea for English projects: transliterate other alphabets, such as Cyrillic, Greek, various Indic scripts, and maybe even Hiragana and Katakana to get fuzzy but possibly useful cross-language matches. Simple transliteration could be based on a character filter in the analysis chain config; more complex transliteration would require a proper plugin. Candidate transliteration schemes could be adapted from existing standards (e.g., Greek, Cyrillic, Devanagari, etc.) combined with knowledge of what distinctions the phonetic algorithm is likely to ignore anyway.
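
As a toy illustration of the character-filter idea, here's a sketch that maps a handful of Greek letters to Latin before handing the result to a phonetic encoder; the mapping and the sample word are made up for illustration, and a real filter would need full alphabet coverage plus digraph and accent handling:

    import java.util.Map;
    import org.apache.commons.codec.language.DoubleMetaphone;

    public class TransliterationSketch {
        // A deliberately tiny, illustrative mapping.
        private static final Map<Character, String> GREEK_TO_LATIN = Map.of(
                'α', "a", 'β', "b", 'γ', "g", 'δ', "d", 'ε', "e",
                'κ', "k", 'ρ', "r", 'σ', "s", 'ς', "s", 'τ', "t");

        public static void main(String[] args) {
            String greek = "σκρατ"; // made-up sample word
            StringBuilder latin = new StringBuilder();
            for (char c : greek.toCharArray()) {
                latin.append(GREEK_TO_LATIN.getOrDefault(c, String.valueOf(c)));
            }
            // The transliterated form then goes through the usual phonetic encoder,
            // so Greek and Latin spellings of similar-sounding words can collide.
            DoubleMetaphone dm = new DoubleMetaphone();
            System.out.println(latin + " -> " + dm.doubleMetaphone(latin.toString()));
        }
    }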

Conclusion and Next Steps

This finishes up the Q2 goal of investigating phonetic algorithms (T182708). The Elasticsearch page on phonetic algorithms notes that "phonetic algorithms are fairly crude, and very specific to the languages they were designed for," but I have some hope after this investigation that we could get something useful out of them, if only a keyword for the desperate.

Next steps (See T184771):

  • Pick an algorithm to test. DONE: Double Metaphone seems to give the best mix of reasonable encoding and code length flexibility.
  • Implement and test an analysis chain with the obvious precision mitigation steps: character folding, stop word filtering, filtering words with numbers, and word length filtering.
    • If time permits, add one or more straightforward transliteration options.
  • Set up a RelForge test bed for English Wikipedia and Wiktionary with a keyword or other implementation of phonetic title/redirect search, and invite feedback via the Village Pump, mailing lists, etc.
    • Depending on feedback, possibly adjust encoding, code length, etc., and test again.
  • If all goes well, make a deployment plan—using a keyword, or making it a component of standard search (which then has to interact with Learn-to-Rank), etc.