Rainman

I work on lucene-search stuff and administrate the search cluster... My user page is sr:User:Rainman

Brief history of the internal search

Until 2005, WMF wikis used MySQL search, but as traffic increased it soon became clear that this did not scale well. In 2005 River installed Lucene and we started using it with out-of-the-box settings. As the number of articles increased, the results of out-of-the-box Lucene were not very relevant and often seemed like an almost random collection of articles. Furthermore, the en.wiki index was getting too big to fit onto one machine, and we needed a way to split it and distribute the search process.

In late 2006 I started thinking about customizing our Lucene configuration for better ranking and for robust distributed searching. At that time I had to pick a topic for my master's thesis, and I decided to do this. In mid-2007 the first version of the lucene-search extension was finished. It was essentially our wrapper around Lucene to support these tasks and expose them to MediaWiki. For ranking I decided to use a PageRank-like algorithm with reference-to-article counting, following the conventional wisdom of the time (of course we don't use the original PageRank due to patent restrictions). Still, most of the code dealt with configuration, with ways to maintain consistent copies of split indexes, with smooth updates from the indexer to the searchers, and so on. This version was deployed on the WMF cluster in mid-2007, and in late 2007 I got my master's degree.

However, this was only the beginning. It was quite obvious that the search results were still pretty bad and that the design of the search page sucked. Finally, one crucial feature was missing: "Did you mean...". During 2007 and 2008 I set out to solve these problems. I found that knowing what information is important or representative of an article is often more informative than its PageRank. So, I took care to give more weight to the beginning of articles, redirects, words used to refer to the article, section captions... Further, what disambiguates an article from related terms is its context, so I extracted frequently co-occurring article titles across all of Wikipedia to derive article associations.
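As a rough illustration of the co-occurrence idea, here is a minimal Python sketch. It is not the lucene-search code: it assumes that two titles "co-occur" when they are linked from the same article, and the function name `related_titles` and the `min_count` threshold are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def related_titles(links_per_article, min_count=2):
    """Map each title to the titles it frequently co-occurs with.

    links_per_article: one list of linked titles per article (assumed input shape).
    min_count: hypothetical threshold below which a pair is ignored.
    """
    pair_counts = Counter()
    for titles in links_per_article:
        # Count each unordered pair once per article.
        for a, b in combinations(sorted(set(titles)), 2):
            pair_counts[(a, b)] += 1

    related = defaultdict(list)
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            related[a].append((b, n))
            related[b].append((a, n))

    # Most frequent associations first, ties broken alphabetically.
    for title in related:
        related[title].sort(key=lambda item: (-item[1], item[0]))
    return dict(related)


links = [
    ["Physics", "Albert Einstein", "Relativity"],
    ["Albert Einstein", "Relativity"],
    ["Physics", "Chemistry"],
]
associations = related_titles(links)
```

In this toy input only "Albert Einstein" and "Relativity" appear together often enough to be associated; a real run over all of Wikipedia would need a much higher threshold and some normalization for very popular titles.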

To my surprise I didn't find a single open-source "Did you mean..." engine. There are programs like aspell, but all of them spell-check only single words. So I had to invent the algorithm. I started by making a 2-gram representation of the language, recording all possible two-word phrases in all of Wikipedia (above some threshold, and with non-stop anchor words). Obviously a higher-order approximation of the language would have been better, but due to limited resources we had to settle for this. I found that this model can fix some simple errors, but is not powerful enough. So, to make it more flexible, I had to build a scoring scheme by trying out different heuristics. First I added a special score boost for those 2-grams that appear in titles, then I added whole titles, and finally I added "fuzzy" 2-grams of words that might provide context for the title words, by taking all words from redirects and from links in the first paragraph of the article. Finally, I combined this with the search results to see whether the rare spelling the user entered is indeed something significant or not.
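To make the 2-gram idea more concrete, here is a small Python sketch. It is only an illustration under my own assumptions, not the actual lucene-search algorithm: candidate words come from the standard library's `difflib.get_close_matches`, the score is a plain sum of 2-gram counts, and the names `build_bigrams`, `suggest`, `min_count` and `title_boost` are hypothetical.

```python
from collections import Counter
from difflib import get_close_matches
from itertools import product


def build_bigrams(sentences, min_count=1):
    """Count two-word phrases and keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return {bigram: n for bigram, n in counts.items() if n >= min_count}


def suggest(query, bigrams, title_bigrams=frozenset(), title_boost=2.0):
    """Return a respelled query if it scores better than the original, else None."""
    vocab = sorted({word for bigram in bigrams for word in bigram})
    words = query.lower().split()

    # For every query word: the word itself plus a few similar known words.
    options = [
        [w] + [c for c in get_close_matches(w, vocab, n=3, cutoff=0.75) if c != w]
        for w in words
    ]

    def score(candidate):
        total = 0.0
        for bigram in zip(candidate, candidate[1:]):
            value = bigrams.get(bigram, 0)
            if bigram in title_bigrams:
                value *= title_boost  # 2-grams that occur in titles count more
            total += value
        return total

    best = max(product(*options), key=score)
    return " ".join(best) if score(best) > score(tuple(words)) else None


corpus = [
    "albert einstein was a physicist",
    "albert einstein developed relativity",
    "the theory of relativity",
]
bigrams = build_bigrams(corpus)
fixed = suggest("albert einstien", bigrams)
unchanged = suggest("albert einstein", bigrams)
```

Because the score only looks at word pairs, this sketch cannot correct a one-word query, and trying every combination of candidates gets expensive for long queries. That is the kind of weakness the extra heuristics described above (whole titles, "fuzzy" context 2-grams, checking against search results) are meant to cover.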

The final version of the lucene-search engine was enabled in late 2008 on en.wiki, and in 2009 on other wikis. In the summer of 2009, with the help of the usability people, we redesigned the Search page to give it its current look. This redesign was probably the first thing from the usability team that was enabled WMF-wide by default. All of this work has been done, and is still being done, as a volunteer, and these days I'm mainly concerned with server maintenance. No other major features are planned. However, there are still many things to do.
