If a change is made to the search algorithm (e.g., deciding to strip question marks from inputs or turn underscores into spaces), how is that change's impact on search result quality measured? Is there a test suite that says "Obama" on the English Wikipedia should return X in the top five results? Or a search for "kidney" should return greater than X results on the Spanish Wikibooks or whatever? Where is this test suite and can users contribute tests to it?
Topic on Talk:Wikimedia Discovery/So Many Search Options
It depends on what we're testing. We don't quite have the manually curated test suite you're describing, but we have some things that are close. There are a number of problems with such a test suite, which we've been working to overcome for more than a year now. The biggest is that it tests only what the curators think should be tested, which is not always sufficiently representative and can lead to unexpected side effects elsewhere. Some of the tools we use include (find links and other info on the bolded terms in the Glossary):
There's the Relevance Forge (formerly "Relevance Lab", but too many things have "lab" in their name). There are RelForge servers set up that allow us to host multiple copies of wiki indexes with different indexing schemes, or, more simply, we can query the same index with different configurations (weighting schemes, query types, etc.; anything that doesn't require re-indexing). RelForge can measure the zero results rate (ZRR), the "poorly performing" rate (fewer than 3 results), query re-ordering, and a few other automatically computable metrics. It is possible to manually review the changed results, too. We can run RelForge against targeted corpora (e.g., queries with question marks) or general regression test corpora. It's a bit of a blunt instrument, but it makes it very easy to quickly assess the maximum impact of a change by seeing what percentage of a regression test set has changes. If that percentage is unexpectedly high, we know something "interesting" is going on that warrants further investigation; when it's low, we can look at all the changed examples and try to figure out what's going on.
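To make the corpus-level metrics concrete, here's a minimal sketch of the kinds of numbers RelForge reports. The function names, the 3-result threshold default, and the input shapes are my own for illustration; the real tool measures more than this:

```python
def zero_results_rate(result_counts):
    """Fraction of queries in a corpus that returned no results at all."""
    return sum(1 for n in result_counts if n == 0) / len(result_counts)

def poorly_performing_rate(result_counts, threshold=3):
    """Fraction of queries returning fewer than `threshold` results."""
    return sum(1 for n in result_counts if n < threshold) / len(result_counts)

def changed_fraction(before, after):
    """Share of queries whose ranked result list differs between two configs.

    `before` and `after` map each query string to its ordered list of
    result IDs under the old and new configuration, respectively.
    """
    changed = sum(1 for q in before if before[q] != after.get(q, before[q]))
    return changed / len(before)
```

Comparing `changed_fraction` across a regression corpus is the "blunt instrument" part: it says nothing about whether changes are good, only how widespread they are.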
RelForge also has an optimizer that will take a number of numerical configuration parameters and a range of values for each and do a grid search over the parameter space to find the best config, based on PaulScore, which is derived from actual user click data. (See the Glossary for more on PaulScore, which is a name we gave an otherwise unnamed metric proposed by Paul Nelson.)
For English Wikipedia, we are trying to emulate Google, Bing, and others with Discernatron, which allows users to manually review and rank results. Unfortunately it's mind-numbingly tedious to do the ranking (I speak from experience) and we don't have a lot of data because we aren't paying people to do it. However, we have a moderately sized test set that we can use to test proposed changes and to calibrate other, less labor-intensive methods, which we hope to then generalize to other projects. We use the Discernatron scores to calculate discounted cumulative gain (DCG) values before and after changes. The DCG scores are based on the ordering of results compared to the manually scored results.
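For readers unfamiliar with DCG, here is the standard textbook formulation: each result's human-assigned relevance score is discounted by the log of its rank, so a good result demoted to position 5 hurts the score more than one demoted from 9 to 10. Whether the team uses this exact log2 discount (or the normalized variant) is an assumption on my part:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over results in ranked order.

    `relevances` holds the manually assigned relevance score of the result
    at each position, best-first discount: rel / log2(rank + 1).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (best possible) ordering, so 1.0 means
    the engine ranked results exactly as the human raters would."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

Comparing DCG before and after a change, over queries with Discernatron judgments, gives a single number for "did the ordering get closer to what humans prefer".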
We've also been experimenting with other clickstream metrics, in particular using dynamic Bayesian networks (DBN). The DBN model uses a lot more aggregate data, which makes it more precise, but it requires a certain number of people to have issued equivalent queries (ignoring case, extra spaces, etc.), which means it must ignore the long tail. We've been comparing it to the Discernatron results, and it compares favorably while requiring no additional manual rating work. We've also just started experimenting with learning to rank, but those are still early days.
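The "equivalent queries" grouping and the long-tail cutoff can be illustrated with a small sketch. The normalization rules (case folding, whitespace collapsing) come from the text above; the `min_sessions` threshold and function names are hypothetical:

```python
import re
from collections import Counter

def normalize_query(q):
    """Collapse trivially equivalent queries (case, extra spaces) so click
    data for the same intent can be pooled into one aggregate."""
    return re.sub(r"\s+", " ", q.strip()).lower()

def aggregatable_queries(query_log, min_sessions=10):
    """Return normalized queries seen often enough to model reliably.

    Anything below `min_sessions` is the long tail the click model must
    ignore: too few observations to estimate attractiveness/satisfaction.
    """
    counts = Counter(normalize_query(q) for q in query_log)
    return {q for q, n in counts.items() if n >= min_sessions}
```

This is also why click models complement rather than replace manual judgments: rare queries never accumulate enough sessions to be scored this way.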
We also use A/B tests when there's no other good way to test things, or when we aren't fully confident in other measures. The data is collected and analyzed by our data analysts, and the analysis write-ups are on GitHub and Commons (for example, this one). They use a lot of very cool, high-powered statistical techniques to tease out what's going on, looking at ZRR, PaulScore, and lots more.
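As a toy illustration of the simplest possible A/B comparison (the actual write-ups use far more sophisticated methods), here is a two-proportion z-test on a rate metric like ZRR between the control and test buckets, with made-up numbers:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions.

    x1/n1 and x2/n2 might be, e.g., zero-result searches over total
    searches in the control and test buckets of an A/B test.
    Returns (z, p_value); small p suggests a real difference.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))          # two-sided p-value
    return z, p_value
```

Even this toy version shows why sample size matters: a half-point shift in ZRR is invisible with hundreds of searches per bucket but detectable with tens of thousands.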
For individual search improvements, we'd use whichever of these methods seems most helpful at the time, depending on the scope and complexity of the improvements being evaluated. For the overall problem of showing results from so many second-try search options, I expect we'll use A/B tests to see what display method and ordering has the most user engagement and refine that over time, since there's no a priori way to determine the best ordering of additional results.
I've adapted this to its own wiki page, so we can point people to a more general answer in the future, and a more up-to-date answer in the more distant future.
Wow, thank you for the very thorough answer! There's quite a bit to process here.