Topic on User talk:TJones (WMF)/Notes/Potential Applications of Natural Language Processing to On-Wiki Search

Justin Ormont (talkcontribs)

Word embedding models, like en:fastText and en:GloVe, can be used to generate synonyms and similar words. When trained on text with many misspellings, they can also be used to suggest spelling fixes.

Facebook's FAIR lab has published pre-trained models for 294 languages of Wikipedia (link). Since these are trained on Wikipedia article text, which is quite clean, they won't be very good at handling spelling mistakes. It would be very interesting to train a fastText model on your query logs and see what the nearest-neighbor search in the word embedding space produces.
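As a rough sketch (queries.txt is a hypothetical one-query-per-line export of the logs, and the subword settings shown are just the fastText defaults), training and exploring such a model from the command line might look like:

# train a skip-gram model with character n-grams on the raw query log (hypothetical file name)
./fasttext skipgram -input queries.txt -output query_model -dim 300 -minn 3 -maxn 6
# interactively ask for the 50 nearest neighbors in the resulting embedding space
./fasttext nn query_model.bin 50

The -minn/-maxn options control the character n-gram range, which is what lets the model place misspellings near their correctly spelled forms.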

You can explore the nearest neighbor search by grabbing one of the models and running fastText:

./fasttext nn en.wiki.bin 50

Searching for "imbedding", the closest 50 words in the 300-dimensional word embedding space are:

Query word? imbedding

  1. imbeddings 0.941056
  2. embedding 0.880808
  3. embeddings 0.875705
  4. compactification 0.732301
  5. diffeomorphism 0.729409
  6. compactifying 0.729186
  7. antihomomorphism 0.726086
  8. compactifications 0.724407
  9. geometrization 0.721966
  10. biholomorphism 0.721854
  11. isomorphism 0.721106
  12. homeomorphism 0.719762
  13. homotopic 0.717359
  14. parametrization 0.717293
  15. parametrizations 0.716476
  16. injective 0.715966
  17. diffeomorphisms 0.715271
  18. automorphism 0.714177
  19. biholomorphisms 0.71407
  20. submanifold 0.711693
  21. antiholomorphic 0.711509
  22. topological 0.711504
  23. geometrizable 0.710431
  24. automorphisms 0.708235
  25. homeomorphisms 0.708069
  26. codimension 0.706777
  27. projective 0.7067
  28. generalizes 0.706284
  29. endomorphism 0.705661
  30. simplicial 0.705504
  31. reparametrizations 0.7055
  32. hypersurface 0.705288
  33. parametrizing 0.704711
  34. codimensional 0.704644
  35. reparametrization 0.703381
  36. quasitopological 0.703158
  37. nullhomotopic 0.703086
  38. quasiconformal 0.703035
  39. hypersurfaces 0.700519
  40. biholomorphic 0.69997
  41. antiautomorphism 0.699786
  42. geometrizes 0.699575
  43. submanifolds 0.699203
  44. compactified 0.69918
  45. conformal 0.699034
  46. embeddability 0.69899
  47. pseudoholomorphic 0.698393
  48. complexification 0.698191
  49. holomorphicity 0.698155
  50. nonsingularity 0.697529
Smalyshev (WMF) (talkcontribs)

Sorry if I am asking something very obvious, but I understand that these models are based on word co-occurrence. For regular text this makes a lot of sense, but queries are usually very short and frequently omit words. Would we have enough data in the query corpus to learn good word relationships?

Also, fastText seems to split words into character n-grams, which should work OK with misspellings (at least ones that do not make the word completely unrecognizable).
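For illustration (assuming a model built with the default subword settings, minn=3 and maxn=6), "imbedding" is represented by pieces like "<im", "imb", "bed", and "ing>" in addition to the full word, so it shares many subwords with "embedding". You can inspect the n-grams fastText uses for a word with:

# print the subword n-grams (and their vectors) that make up a given word
./fasttext print-ngrams en.wiki.bin imbedding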
