User:TJones (WMF)/Notes/Project Wishlist

See TJones_(WMF)/Notes for other projects.

If I had infinite time, these are some of the other projects I'd like to work on. I do work on some of them as my 10%-time projects, and hope to get to these and others in the future. If you'd like to help in any way—comments, questions, and suggestions are welcome!—please contact me or leave a note on the talk page.

Future Hackathon Potential Project List

This is a list of mostly language-focused, not-necessarily-great ideas, in order of my current desire to work on them at the 2020 Hackathon.

  • Work with Albanian speakers to implement a basic Albanian language analyzer with appropriate folding for non-Albanian diacritics and a stop word list, test an Albanian stemmer or two (1a & 1b, 2), and begin porting it to Java if warranted.
  • plugin to do transliteration for languages where it is relatively easy (Serbian was on the list, but it’s already done!—and for very simple mappings this is just a character map). LanguageConverter docs have a list of what's implemented, but there are others.
    • “Bollywood detector”—identify and map Bollywood movie names into multiple scripts (these show up in zero-results searches)
  • work out the use cases and infrastructure for supporting a community-built thesaurus
    • "synonym tester": a user script to test the effects of making two words synonyms
  • expand the plugin to do automatic homoglyph corrections (T222669) to include Greek/Latin and Cyrillic/Greek (and handle those rare tri-script tokens)
  • look into ways of automatically generating a stemmer from Wiktionary conjugation/declension data (maybe start with Estonian?)
  • find a way to automatically determine low-information title/redirect prefixes like “List of …” and investigate indexing the Completion Suggester without them
  • extract “related results” from an article’s infoboxes, opening text, or elsewhere and display them on the search results page with the article
  • project WordNet or other thesaurus/ontology onto short strings (e.g., Commons descriptions, Wikipedia titles, etc.) to determine useful thesaurus terms and prune the rest
  • implement a phonetic search keyword for matching query to titles
  • develop a different statistical approach to detect wrong-keyboard typing and build a search-only filter to generate alternative tokens, for Russian/English (T138958), Hebrew/English (T155104), or one hand on the wrong home row key
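
To make the wrong-keyboard idea in the last item concrete, here is a minimal sketch of the "generate an alternative token" half for Russian/English. It is not the statistical detector the item calls for, and it is not taken from T138958: it maps keys by physical position between the standard QWERTY and ЙЦУКЕН layouts (lowercase only) and leaves the decision about whether the original token was actually mistyped to some other component. The function name `alternative_token` is hypothetical.

```python
# Keys in the same physical positions on the standard US QWERTY and
# Russian ЙЦУКЕН layouts (lowercase letter/punctuation keys only).
EN_KEYS = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
RU_KEYS = "йцукенгшщзхъфывапролджэячсмитьбю"

EN_TO_RU = str.maketrans(EN_KEYS, RU_KEYS)
RU_TO_EN = str.maketrans(RU_KEYS, EN_KEYS)


def alternative_token(token):
    """Return the token as it would read on the other layout, or None.

    Hypothetical helper: it only handles tokens typed entirely on one
    layout and does not judge whether the original was a mistake.
    """
    lower = token.lower()
    if not lower:
        return None
    if all(ch in EN_KEYS for ch in lower):
        return lower.translate(EN_TO_RU)
    if all(ch in RU_KEYS for ch in lower):
        return lower.translate(RU_TO_EN)
    return None  # mixed or unsupported characters


print(alternative_token("ghbdtn"))  # Latin keystrokes typed on the wrong layout
print(alternative_token("руддщ"))   # Cyrillic keystrokes typed on the wrong layout
```

A search-only filter would presumably emit the alternative token alongside the original rather than replacing it, so that correctly typed queries are unaffected.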

Potential Non-Hackathon 10% Projects

If anyone at a hackathon wanted to work on these, I'd be more than happy to, but these are more search-internals, tech-debt-type projects.

  • recheck differences in unpacked vs monolithic analyzers (eliminating our automatic upgrades, which are 98% likely to have caused the diffs)
  • compare the analyzers for the top 5-10 wiki languages by volume, and look for ways to increase consistency among them

Completed Projects!!

  • Mirandese (mwl) analysis plugin built from Portuguese and French parts, plus a stop list provided by an mwl editor (T194941) Done!
  • plugin to merge high surrogates and low surrogates that get split up by the Chinese analyzer (T168427) Done!
  • plugin to do automatic homoglyph corrections (T222669) Done!
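
For readers unfamiliar with homoglyph correction, the general idea can be sketched as follows. This is an illustration only and not the code of the plugin from T222669: it assumes a small hand-picked Cyrillic-to-Latin lookalike table, and the names `token_scripts` and `fix_homoglyphs` are hypothetical. A token that mixes Latin and Cyrillic letters is rewritten to all-Latin only when every Cyrillic letter in it has a Latin lookalike.

```python
import unicodedata

# Small, illustrative table of Cyrillic letters that look like Latin ones.
# Written with escapes so the two scripts cannot be confused in the source.
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # а
    "\u0435": "e",  # е
    "\u043e": "o",  # о
    "\u0440": "p",  # р
    "\u0441": "c",  # с
    "\u0443": "y",  # у
    "\u0445": "x",  # х
}


def token_scripts(token):
    """Return the set of script names (first word of the Unicode name) of the letters."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in token
        if ch.isalpha()
    }


def fix_homoglyphs(token):
    """Map a mixed Latin/Cyrillic token to all-Latin if every Cyrillic letter allows it."""
    if token_scripts(token) != {"LATIN", "CYRILLIC"}:
        return token  # single-script or other mixes are left alone
    fixed = []
    for ch in token:
        is_cyrillic = unicodedata.name(ch, "").startswith("CYRILLIC")
        if is_cyrillic and ch not in CYRILLIC_TO_LATIN:
            return token  # cannot be made consistently Latin
        fixed.append(CYRILLIC_TO_LATIN.get(ch, ch))
    return "".join(fixed)


# "chocolate" typed with a Cyrillic first letter
print(fix_homoglyphs("\u0441hocolate"))
```

Extending this to Greek/Latin and Cyrillic/Greek, as proposed in the hackathon list, would mean adding further lookalike tables and deciding which target script to prefer for a given token.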