Wikimedia Search Platform/Search/Glossary
This is a place for us to provide definitions, context, and links for terms we use that other people may not be familiar with. If you come across a term on the mailing list, IRC, or elsewhere that should be defined here, add a note on the Talk page!
- Elasticsearch is a distributed, open source search and analytics engine which powers search on Wikipedia and other WMF wikis. Elasticsearch has a glossary defining terms used in the Elasticsearch world. Below are a few of the most used ones.
- Lucene is the indexing engine used by Elasticsearch.
- In the context of Elasticsearch, a node is an instance of Elasticsearch. In practice, this maps to a server, even if it is actually possible to run multiple nodes on the same server. However, this is not something that we do at the moment.
- An Elasticsearch index can be compared to a database.
- An index is split in multiple shards. Each shard is a Lucene instance. Each shard has at least one replica. (Multiple replicas are supported.)
Language analysis is part of the process of preparing input to Elasticsearch, like Wikipedia articles, to be indexed. It includes breaking text into tokens, normalizing, folding, and stemming the tokens, removing stopwords, etc.
- Another name for Elasticsearch language analyzers. This includes everything that's configured for one analyzer. It usually includes a tokenizer and filters, but may also include character filters.
- ASCII folding generally strips diacritics from ASCII characters. It's most useful for English because it only works on Latin characters and strips all diacritics. We can do ASCII folding with a "preserve" option, which allows us to index both the original and folded version of a token.
Bidi / BidirectionalEdit
- Bidirectional (or "bidi") characters are invisible (or "non-printing") Unicode characters that are used to annotate text that contains both left-to-right text (like English) and right-to-left text (like Arabic or Hebrew). These annotations help programs that understand them display the bi-directional text correctly. They are relevant to language analysis because they can be (invisibly) included in tokens that come out of an analysis chain, making those tokens effectively impossible to search for. While they are most commonly seen on Arabic or Hebrew tokens in wikis using left-to-right languages, they can end up on tokens in any script. They generally should be stripped during analysis.
- Character filters make changes to the text before tokenization. The most (in)famous character filter is "word_break_helper" which maps underscores, parens, and other characters to spaces, so that the tokenizer will break words on them. Other character filters include the mapping from hiragana to katakana used on English-language wikis, or mapping ё to е on Russian-language wikis.
- In the context of analyzer analysis, collisions occur when analysis chain changes cause two tokens (or groups of tokens) that were not previously analyzed the same to now be analyzed the same. For example, enabling diacritic stripping would cause resume and résumé to collide. Collisions and their opposite, Splits, aren't inherently good or bad, but they should be looked at to make sure that in a given case they are expected, or at least explainable.
- The default analyzer, used on most smaller wikis, consists of the Elasticsearch's "standard tokenizer," and the ICU Normalizer.
- Folding is a kind of normalization that converts a character to more "typical" or "standard" version of itself. Lowercasing is case folding. See Normalization, ASCII folding, ICU folding, and ICU Normalization. Folding sometimes includes a "preserve" option, which keeps both the original and folded version of a token.
- ICU folding is the Unicode-aware version of ASCII folding. It generally is very aggressive about stripping diacritics from characters, even when the diacritical version of a character is almost never considered the same as the plain version. We can specify language-specific exceptions to ICU folding. One useful heuristic is that is usually okay to fold everything except characters that are in the alphabet of a language. We can do ICU folding with a "preserve" option, which allows us to index both the original and folded version of a token. It's hard to get an exact list of everything covered by ICU folding.
- ICU normalization is a less aggressive version of ICU folding, that mostly converts fairly rare versions of characters to their more typical versions. But it also converts ß to ss and ς to σ, which is usually the right thing to do. It's hard to get an exact list of everything covered by ICU normalization—so we ran all the Unicode characters through it to see what happened.
- See Stemming.
- Most analyzers provided by Elasticsearch are available to just turn on—you can specify any of more than 30 languages (or "CJK")—and it will do tokenization, normalization, stemming, etc. Some of the analyzers support some configuration, others do not. None work with additional character or token filters (though that doesn't keep us from configuring "word_break_helper" for them, even though it does nothing), hence "monolithic". If we want to really customize the analyzer for a given language, we have to "unpack" it into its parts and then customize it. See Unpacked Analyzer.
- Normalization is a general name for converting characters into more appropriate characters, which often but not always means more "typical" or "standard" versions of themselves. This includes lowercasing, ASCII folding, ICU folding, ICU normalization, stripping or converting unwanted characters, etc. See Folding, ASCII folding, ICU folding, and ICU Normalization.
- The plain field indexes a version of the text with only minimal analysis. It generally only features tokenization and lowercasing, though there are a few cases, like Russian, where some additional analysis is done. When you search with quotes around a word or phrase, you are searching the plain field, which requires almost exact matches (except for case, usually). The plain field also allows us to match phrases that are made up entirely of stopwords, like "to be or not to be"!
- Analyzers and other optional Elasticsearch components are made available as plugins. We sometimes find stemmers or other analysis software and wrap them in a plugin, which then allows us to create an analysis chain for a language that didn't have one before—using built-in Elasticsearch tokenization and lowercasing with a custom stemmer, for example.
- Post-analysis types are the unique types after the analysis chain is complete. In English, these are more or less the root forms of the words from the text. There are usually more pre-analysis types than post-analysis types because some analysis steps merge tokens; though some analysis steps can generate more tokens.
- Pre-analysis types are the unique types present in the text after tokenization, but before the rest of the analysis chain has been run. In English, these are more or less the words as found in the text. There are usually more pre-analysis types than post-analysis types because some analysis steps merge tokens.
- See Tokenization.
- See Collision.
- Stemming tries to replace a word (token) with a "root" or "stem", so that words with the same stem can be indexed together (like English hope, hoped, hoping). The stems found by stemmers may not be the actual root words you would find in a dictionary. If the goal is to find the actual dictionary entry form, that's Lemmatization. Stemming is language-dependent and not available for most languages. Since we don't show the stemmed version to users, we don't usually care whether our stemmer is a lemmatizer, as long as the words grouped together are mostly right.
- Also "stop word". Stopwords are usually high-frequency function words that are important for the structure of a language, but don't carry a lot of meaning. Stop words are ignored in the text and not indexed. Determiners (a, an, the), prepositions (to, of, in), and others are usually stop words. Sometimes "regular" words that don't carry a lot of meaning, like make, do, want, be are also stop words. Ignoring stop words lets search focus more on the "content" of your search, but causes problems with searches like to be or not to be. See also Plain Field.
- The text field indexes a version of the text with the maximal language-specific analysis. In addition to tokenizing and lowercasing, it may include language-specific folding and other normalization, and stemming and other language-specific filters, if available.
- Token filters generally make changes to individual tokens after tokenization (though they can make multi-token changes). Because they work on tokens, they can generally ignore punctuation and other things handled by the tokenizer.
- Tokenization is more or less the same as breaking text into words for English, though there are special cases, like dealing with slashes/hyphens-underscores_andCamelCase. For languages that don't put spaces between words, like Chinese, Japanese, and Thai, it can be more complicated, and is often called Segmentation.
- Tokens are the individual instances of words (etc., see "tokenization") that occur in the text. Compare Types. "one one two" has three tokens, but only two types. See also type–token distinction on Wikipedia.
- Types are the unique words (etc., see "tokenization") that occur in the text. Compare Tokens. "one one two" has three tokens, but only two types. See also type–token distinction on Wikipedia.
- By default, Elasticsearch analyzers are available as a single "monolithic" option that does tokenization, normalization, stemming, etc. (See Monolithic Analyzer.) If we want to do serious customization, particularly adding new character filters or token filters, we have to "unpack" them. Conveniently, Elasticsearch provides a reference for the equivalent unpacked analysis chain for each of their tokenizers. Once unpacked, we can re-arrange parts (like moving ICU folding to come before stemming), upgrade parts (like replacing lowercasing with ICU Normalization), or add parts (like "word_break_helper"). Unpacked Elasticsearch analyzers should be identical to their monolithic counterparts, but we have a lot of automatic processing in our configuration set up that applies to unpacked analyzers—upgrading lowercasing to ICU Normalization and adding word_break_helper, for example—so results may vary. Third party analyzers are not always unpackable, as the authors may not have provided access to the individual components in the analyzer.
Scoring and Accuracy MeasuresEdit
Discounted Cumulative GainEdit
- Also known as DCG, nDCG, IDCG, DCG@5, etc.
- Discounted cumulative gain (DCG) is a measure of ranking quality. It requires human judgement to be made on rankings—hence the (frustrated) desire for the Discernatron to gather that data.
- Also known as F0.5, F1, and F2
- The F-Score is the harmonic mean (used to average rates) of recall and precision, and is used to provide a single number for comparison purposes. It makes ranking recall and precision numbers easier. Fβ is a weighted harmonic mean that is used to favor recall (β > 1) or precision (β < 1). The weighting isn't linear, so F0.5 and F2 are complementary.
- "PaulScore" is the name we've given to a metric proposed by Paul Nelson in a talk he gave at Elasticon (and has given in many other places). "NelsonScore" was also tossed around as an alternative and may occur in very early discussions.
- We use PaulScore to evaluate the quality of results provided by CirrusSearch or proposed modifications to CirrusSearch, based on historical click data.
- A big advantage of the PaulScore is that it relies on user click history to award points, so it is easy to compute. A disadvantage is that radical improvements may provide results in the lab (vs. in production) that are better, but which earn no points because they were never seen by an actual user. As such, the PaulScore is great for (and recommended by Nelson for) tracking live performance, though it is still useful for evaluating changes in the lab.
- A PaulScore can be calculated for the result set returned for a query. PaulScores can then be aggregated to a user session, and then across user sessions. Since PaulScore is not a common term of art (Paul's working on it) and there's little external documentation beyond videos of his presentations, extra detail is provided here.
- Users queries are grouped into sessions (in our case, queries from the same user that occur within ten minutes of the previous query are in the same session).
- All results (across queries within a session) are considered potentially relevant to all queries within that session; this is not entirely realistic, but it is a good heuristic.
- Imagine that a user searches for query Q and gets result R in 8th place, but doesn't click on it (possibly because they didn't look that far down the list). They then search for Q' and the same result R is now in 1st place and they click on it. Query Q' gets credit for result R at position 1, but query Q also gets credit for result R in position 8, even though that instance wasn't clicked on. The justification for the heuristic is that queries with overlapping results are probably on the same or related topics; queries on completely unrelated topics generally won't have overlapping results.
- For each result clicked on by a given user for a given query, accumulate a score such that query_score += Fk, where F is a scoring factor (see below) and k is the 0-based position of the result.
- The factor F is a value between 0 and 1 that gives more or less weight to resutls farther down the result list. Nelson suggests values between 0.8 ("low") and 0.99 ("high"), though we have investigated values between 0.1 and 0.9, and have used 0.7 0.8, and 0.9. The credit for results decreases exponentially as you go down the results list. Higher factors give more weight to lower-ranked results. We know from previous research that most users on English Wikipedia, for example, are much more likely to click on the top 3 results.
- As an aid to debugging, it is also good to note that the numerically maximum score for a query is 1/(1-F), though such a score is extremely unlikely. The minimum score would be zero (no clicks).
- For each user session, the scores are averaged across all queries into a user session score. Thus, hyperactive users (i.e., bots) that run 20,000 queries in a session don't skew the results, and a person who only searched once is as important as someone who searched many times.
- For the whole data set (e.g., a day's worth of data), all user session session scores are averaged to give an engine score.
- We had a dashboard that automatically approximated the daily PaulScore for full-text queries in CirrusSearch, using a range of factors (0.7-0.9), but it has been decommissioned. Nelson suggests that a score of 0.25 is "very good", though of course the specific F factor and other vagaries of search will affect the score.
- For auto-completion suggestions, we expect a much lower score since most queries get no clicks (i.e., many results are shown and ignored while typing and most users will only click on one result), whereas full-text searchers can more easily go back to the results page or open multiple results in other windows.
- When measuring accuracy (or classification, or search results, etc.), precision is the fraction of retrieved instances that are relevant. When you focus on precision, you care more about every answer you return being correct, and worry less about missing some correct answers. Improving precision decreases false positives.
- When measuring accuracy (or classification, or search results, etc.), recall is the fraction of relevant instances that are retrieved. When you focus on recall, you care more about returning every correct answer you can, and worry less about having some additional wrong answers. Improving recall decreases false negatives.
- "Recall" is also sometimes used more loosely to refer to the number of results made available. So someone might speak of "increasing recall" when they mean specifically that they want more results to be considered and scored, usually on the assumption that some of them will be good results. An example would be configuring your search engine to only require 70% of query terms to be in a returned document—that would certainly increase "recall" in the sense of there being more results, but wouldn't guarantee that there would be a larger number of desirable results.
- Discernatron is a tool that allows participants to judge the relevance of search results to help the Search team be able to test changes before making them available on-wiki. In particular, we tried using the Discernatron to gather data so that we could calculate DCG scores for CirrusSearch results. However, the task is extremely tedious and there was no large-scale contribution from volunteers, so while the Discernatron still exists, it has largely been abanadoned.
- Glent is a project/platform for improving "Did you mean" suggestions in search. It has three main approaches, called Method 0 (also M0, and Session Reformulation), Method 1 (M1, Similar Queries), and Method 2 (M1, Dictionary Suggester).
- Method 0, Session Reformulation: Mine search logs for query + correction pairs from within search sessions and make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages/projects with sufficient search traffic.
- Method 1, Similar Queries: Mine search logs for common queries across all search sessions to make suggestions for incoming queries based on similarity to the original query, number of results, and query frequency. Only applicable for languages with relatively small writing systems (alphabets, abjads, syllabaries, etc.).
- Method 2, Dictionary Suggester: Use resources external other than search logs (e.g., dictionaries with word frequencies) as a source for spelling corrections, using existing open source proximity/spell checking algorithms. Only applicable to languages with relevant linguistic resources.
- MjoLniR (note that MLR == Machine Learned Ranking) or Mjolnir is a library for handling the backend data processing for Machine Learned Ranking at Wikimedia. It is specialized to how click logs are stored at Wikimedia and provides functionality to transform the source click logs into ML models for ranking in elasticsearch.