Wikimedia Search Platform/Search/Testing Search

How Do We Test Search Changes?

Testing proposed changes to search can be a complicated business, and how we do it depends on what kinds of changes we need to test. Some of the tools we use include the following (also see the Search Glossary):

  • There's the Relevance Forge. RelForge servers are set up to let us host multiple copies of wiki indexes with different indexing schemes, or, more simply, to query the same index with different configurations (weighting schemes, query types, and anything else that doesn't require re-indexing). RelForge can measure the zero-results rate (ZRR), the "poorly performing" rate (fewer than 3 results), query re-ordering, and a few other automatically computable metrics. It is possible to manually review the changed results, too. We can run RelForge against targeted corpora (e.g., queries with question marks) or general regression test corpora. It's a bit of a blunt instrument, but it makes it very easy to quickly assess the maximum impact of a change by seeing what percentage of a regression test set shows changes. If that percentage is unexpectedly high, we know something "interesting" is going on that warrants further investigation; if it's unexpectedly low, we can look at examples and try to figure out what's going on.
    • RelForge also has an optimizer that will take a number of numerical configuration parameters and a range of values for each and do a grid search over the parameter space to find the best config, based on PaulScore, which is derived from actual user click data. (See the Glossary for more on PaulScore, which is a name we gave an otherwise unnamed metric proposed by Paul Nelson.)
  • We've also been experimenting with other clickstream metrics, in particular dynamic Bayesian networks (DBN). The DBN model uses a lot more aggregate data, which makes it more precise, but it requires a certain number of people to have run equivalent queries (ignoring case, extra spaces, etc.), which means it must ignore the long tail. The results compare favorably to manually ranked results from Discernatron (see the Glossary), and require no additional manual rating work.
  • We also use A/B tests when there's no other good way to test things, or when we aren't fully confident in other measures. Older analysis write-ups are on GitHub and Commons (for example, this one). They use a lot of very cool and high-powered statistical techniques to tease out what's going on, including ZRR, PaulScore, and much more. A/B tests can look at either general search performance or performance on specific kinds of queries, similar to how we use RelForge.
    • The noise inherent in A/B tests can make it difficult to detect the impact of relatively small changes. We've run simulations of "null" A/B tests, in which there is no difference between the A and B groups in the test. We still commonly see differences of 1% (and sometimes as high as 4%) in clickthrough rates and other measures. This happens because our users are not a particularly homogeneous group. We can greatly lessen this effect by using interleaved A/B tests, in which results for both the A and B test groups are shown together (interleaved, A-B-A-B or B-A-B-A). Since every user in the test is in both test groups, the randomness of the noise is greatly decreased. (For more, see Erik's presentation at Haystack.)
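The corpus-level metrics that RelForge reports, the zero-results rate and the "poorly performing" rate, are simple proportions over a query set. A minimal sketch, using hypothetical queries and result counts:

```python
# Hypothetical per-query result counts from a RelForge-style run.
result_counts = {
    "capital of france": 12,
    "qwertyasdf": 0,
    "rare misspeling": 2,
    "einstein": 250,
}

total = len(result_counts)
# Zero-results rate (ZRR): share of queries that return nothing.
zrr = sum(1 for n in result_counts.values() if n == 0) / total
# "Poorly performing" rate: share of queries with fewer than 3 results
# (this includes the zero-results queries).
poor = sum(1 for n in result_counts.values() if n < 3) / total

print(f"ZRR: {zrr:.0%}, poorly performing: {poor:.0%}")  # ZRR: 25%, poorly performing: 50%
```

Comparing these rates before and after a proposed change gives a quick first read on its impact.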
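The RelForge optimizer's grid search can be sketched as exhaustive enumeration of the parameter space, scoring each configuration. The PaulScore indexing and discount factor below are assumptions (see the Glossary for the real definition), and the toy `evaluate` function stands in for re-running the query set under a configuration and scoring the clicks:

```python
import itertools

def paulscore(click_positions_per_query, factor=0.8):
    """Sketch of PaulScore: average over queries of sum(factor**pos) for
    each clicked result position. Zero-indexed positions and the default
    factor are assumptions, not the canonical definition."""
    scores = [sum(factor ** pos for pos in positions)
              for positions in click_positions_per_query]
    return sum(scores) / len(scores)

def grid_search(param_ranges, evaluate):
    """Try every combination of the given parameter values and return
    the best-scoring configuration."""
    names = list(param_ranges)
    best_config, best_score = None, float("-inf")
    for values in itertools.product(*param_ranges.values()):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Toy stand-in for "re-run the queries with this config and compute
# PaulScore": a function that peaks at title_weight=2.0, slop=1.
toy = lambda c: -(c["title_weight"] - 2.0) ** 2 - (c["slop"] - 1) ** 2
best, score = grid_search({"title_weight": [0.5, 1.0, 2.0],
                           "slop": [0, 1, 2]}, toy)
print(best)  # {'title_weight': 2.0, 'slop': 1}
```

In practice each evaluation is expensive (it replays a whole query corpus), which is why the search is restricted to a small set of numerical parameters and discrete value ranges.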
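Grouping "equivalent" queries for the DBN model amounts to a normalization step. A minimal sketch, assuming the only normalizations are case folding and whitespace collapsing (the real pipeline may do more):

```python
import re

def normalize_query(query: str) -> str:
    # Lowercase and collapse runs of whitespace so that queries
    # differing only in case or spacing fall into the same group.
    return re.sub(r"\s+", " ", query.strip()).lower()

print(normalize_query("  Barack   OBAMA "))  # barack obama
```

Only groups with enough distinct sessions provide usable click data, which is why the long tail of one-off queries has to be ignored.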
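The interleaved display described above can be sketched as alternating between the two rankers' result lists, skipping duplicates, while recording which ranker contributed each shown result so that later clicks can be credited. This naive A-B-A-B scheme is a simplification; production interleaving is typically more careful (e.g., team-draft interleaving):

```python
def interleave(results_a, results_b, a_first=True):
    """Naive A-B-A-B (or B-A-B-A) interleaving sketch: take turns
    drawing the next not-yet-shown result from each ranker, and
    remember which ranker contributed each shown document."""
    order = [(results_a, "A"), (results_b, "B")]
    if not a_first:
        order.reverse()
    iters = [iter(r) for r, _ in order]
    labels = [label for _, label in order]
    shown, credit = [], {}
    exhausted = [False, False]
    turn = 0
    while not all(exhausted):
        for doc in iters[turn]:
            if doc not in credit:       # skip documents already shown
                shown.append(doc)
                credit[doc] = labels[turn]
                break
        else:                           # this ranker has nothing left
            exhausted[turn] = True
        turn = 1 - turn
    return shown, credit

shown, credit = interleave(["d1", "d2", "d3"], ["d2", "d4"])
print(shown)   # ['d1', 'd2', 'd3', 'd4']

# Credit each click to the ranker that contributed the clicked result.
clicks = ["d2", "d4"]
wins = {"A": 0, "B": 0}
for doc in clicks:
    wins[credit[doc]] += 1
print(wins)    # {'A': 0, 'B': 2}
```

Because every user sees results from both rankers, per-user variability cancels out of the comparison, which is why interleaved tests need far less traffic than a conventional A/B split to detect small differences.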

For individual search improvements, we use whichever of these methods seems most helpful at the time, depending on the scope and complexity of the improvements being evaluated. For the overall problem of showing results from so many second-try search options, we'll probably need to use A/B tests to see what display method and ordering has the most user engagement and refine that over time, since there doesn't seem to be any good way to determine the best ordering of additional results other than testing it live.