Extension:CirrusSearch/Query Construction

This page describes how CirrusSearch transforms the user query into a structured Elasticsearch query.

This document describes CirrusSearch internals and may quickly become out of date, as it covers details of the current code base.

Overview

CirrusSearch interacts with MediaWiki core by extending SearchEngine. This class exposes three main ways to query the index and find pages (called SearchEngine entry points in CirrusSearch):

full text: the classic full text search provided by Special:Search or the search module of API:Query.
near match: responsible for the "go" feature. When the typed text nearly perfectly matches a page title, the user is sent directly to that page instead of Special:Search. It is also used when passing srwhat=near_match to the search module of API:Query.
completion: used by all autocomplete features (search as you type).

When the query string and its associated metadata^[1] enter Cirrus, they undergo several transformation steps:

Parsing
Profile selection
Elasticsearch query building
Elasticsearch responses transformation
Fallback methods evaluation
Cross-project searches

Parsing

Parsing is responsible for extracting features^[2] from the user query string. While parsing is particularly important for full text search queries, it is also present for the other search entry points. For instance, namespace prefix extraction happens in all searches and can be considered a parsing step.
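As an illustration of the namespace prefix extraction mentioned above, here is a minimal Python sketch. It is not the CirrusSearch implementation: the NAMESPACES table and the extract_namespace_prefix function are hypothetical and only show the idea of splitting a recognised prefix from the rest of the query.

```python
from typing import Dict, List, Tuple

# Hypothetical namespace table for illustration; a real wiki defines many more.
NAMESPACES: Dict[str, int] = {"talk": 1, "user": 2, "help": 12, "category": 14}
MAIN_NAMESPACE = 0


def extract_namespace_prefix(query: str) -> Tuple[List[int], str]:
    """Split 'Help:foo bar' into ([12], 'foo bar').

    Queries without a recognised prefix are returned unchanged (apart from
    surrounding whitespace) and default to the main namespace.
    """
    prefix, sep, rest = query.partition(":")
    ns_id = NAMESPACES.get(prefix.strip().lower())
    if sep and ns_id is not None:
        return [ns_id], rest.strip()
    return [MAIN_NAMESPACE], query.strip()
```

A colon that does not follow a known namespace name (as in "time: 12:30") is left in the query text, so it can still be matched as ordinary content.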

Parsing produces a SearchQuery instance that contains all the information known about the query and its context:

the search engine entry point
all its metadata (size, offset, ...)
contextual filters (e.g. the prefix option provided by Extension:InputBox)
the parsed query (AST)

The SearchQuery is immutable.
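To make the shape of such an object concrete, the following Python sketch models an immutable query holder with a frozen dataclass. The class name SearchQuerySketch, its field names, and the next_page helper are assumptions made for illustration; the real SearchQuery class is PHP and its API is not reproduced here.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class SearchQuerySketch:
    """Hypothetical stand-in for SearchQuery; fields mirror the list above."""

    entry_point: str                # "fulltext", "near_match" or "completion"
    parsed_query: Tuple[str, ...]   # stand-in for the parsed query (AST)
    namespaces: Tuple[int, ...] = (0,)
    limit: int = 20                 # metadata: size
    offset: int = 0                 # metadata: offset
    contextual_filters: Tuple[str, ...] = ()


def next_page(query: SearchQuerySketch) -> SearchQuerySketch:
    """Return a new query for the following page; the input is left untouched."""
    return replace(query, offset=query.offset + query.limit)
```

Because the instance is frozen, any component that needs a variation of the query has to derive a new object, which keeps the original safe to share between building steps.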

Profile selection

Profile selection is the process that decides which profiles are best suited to a given SearchQuery. This component is currently under discussion.

Elasticsearch query building

This is the process of building the Elasticsearch search request body.

As of this writing, the intent is to move the logic that builds the Elasticsearch query into a set of transformations whose input is the immutable SearchQuery and whose output is a part of the Elasticsearch search request body. The current technique uses a mutable context that all building components can modify.
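The intended design can be sketched as a list of pure functions, each mapping the query to one fragment of the request body. This Python sketch is only an illustration of that idea: the builder names, the "text" field, and the dictionary standing in for the SearchQuery are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Mapping

Query = Mapping[str, Any]        # stand-in for the immutable SearchQuery
Fragment = Dict[str, Any]        # one part of the search request body
Builder = Callable[[Query], Fragment]


def build_paging(query: Query) -> Fragment:
    """Produce the paging part of the request body."""
    return {"from": query["offset"], "size": query["limit"]}


def build_retrieval(query: Query) -> Fragment:
    """Produce the retrieval query; 'text' is a hypothetical field name."""
    return {"query": {"match": {"text": query["text"]}}}


def build_request(query: Query, builders: Iterable[Builder]) -> Fragment:
    """Merge the fragments; no builder may overwrite another builder's keys."""
    body: Fragment = {}
    for builder in builders:
        fragment = builder(query)
        overlap = body.keys() & fragment.keys()
        if overlap:
            raise ValueError(f"conflicting request keys: {sorted(overlap)}")
        body.update(fragment)
    return body
```

Unlike a shared mutable context, each builder here can be tested in isolation, and a conflict between two builders surfaces as an explicit error instead of a silent overwrite.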

Retrieval query

Meant to extract all the documents that match the user query. This Elasticsearch query is split into two parts.

Scoring part

Elements of the query that affect scoring. Changing something here should not change the set of hits found by the retrieval query; this section of the query must only affect the initial ranking of the results. The scoring part of a query is controlled by a FullTextQueryBuilder^[3], which is currently only supported by the full text SearchEngine entry point.

Filtering part

Elements of the query that do not affect ranking. Changing something here does not affect ranking but changes the set of hits found by the retrieval query. Filtering is also controlled by FullTextQueryBuilder, but this will similarly change to take a SearchQuery as input.

Full text search keywords can interact with the filters by implementing FilterQueryFeature.
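The split between the two parts can be illustrated with an Elasticsearch bool query, where filter clauses decide which documents match without contributing to the score, and should clauses (when a filter is present) only contribute to the score. The Python sketch below is one possible layout under that assumption, not the query CirrusSearch actually emits; the function name and the field names are illustrative.

```python
from typing import Any, Dict, List

Clause = Dict[str, Any]


def build_retrieval_query(filters: List[Clause], scoring: List[Clause]) -> Clause:
    """Combine a filtering part and a scoring part into one bool query."""
    return {
        "bool": {
            # Filtering part: selects the hits, never affects the score.
            "filter": list(filters),
            # Scoring part: optional clauses that only influence ranking.
            "should": list(scoring),
            # Explicitly keep should clauses optional so they cannot drop hits.
            "minimum_should_match": 0,
        }
    }


query = build_retrieval_query(
    filters=[{"match": {"text": "solar eclipse"}}, {"terms": {"namespace": [0]}}],
    scoring=[{"match_phrase": {"title": "solar eclipse"}}],
)
```

With this layout, adding or removing an entry in the scoring list reorders the results but leaves the set of matching documents unchanged, which is the property described for the scoring part.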

Rescore query

Fine-tuning of the ranking. Depending on the need, multiple rescore queries can be combined, and their scores can be combined as well. Some searches may prefer to combine the score from the scoring part of the retrieval query with some rescore components.

Full text search keywords can interact with the rescore functions by implementing BoostFunctionFeature.
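In Elasticsearch terms, this corresponds to the rescore section of a search request, which accepts a list of rescorers, each with its own window size, weights, and score mode. The Python sketch below builds such a list; the helper name build_rescore, the chosen window sizes and weights, and the field names in the example queries are assumptions for illustration only.

```python
from typing import Any, Dict

Clause = Dict[str, Any]


def build_rescore(
    window_size: int,
    rescore_query: Clause,
    query_weight: float = 1.0,
    rescore_query_weight: float = 1.0,
    score_mode: str = "total",
) -> Clause:
    """Build one entry of the 'rescore' list of a search request body."""
    return {
        "window_size": window_size,  # number of top hits per shard to rescore
        "query": {
            "rescore_query": rescore_query,
            "query_weight": query_weight,                  # weight of the retrieval score
            "rescore_query_weight": rescore_query_weight,  # weight of the rescore score
            "score_mode": score_mode,  # total, multiply, avg, max or min
        },
    }


body = {
    "query": {"match": {"text": "solar eclipse"}},
    "rescore": [
        # Hypothetical phrase boost that keeps the retrieval score in the total.
        build_rescore(512, {"match_phrase": {"text": "solar eclipse"}}, 1.0, 10.0),
        # Hypothetical second pass that multiplies the running score.
        build_rescore(64, {"match": {"title": "solar eclipse"}}, score_mode="multiply"),
    ],
}
```

The query_weight and rescore_query_weight pair is how a search can keep part of the score from the retrieval query while mixing in a rescore component.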

Fetch phase configuration

This is the part of the search request that instructs Elasticsearch what data to extract for every hit we display to the user. This phase of the query building process is not yet fully designed and the current way of doing things is not optimal. A ResultsType is chosen early in the process and is responsible for selecting the fields to extract and the fields to highlight.

This is tricky because it is directly connected to the way search hits are displayed in Special:Search. Some extensions may want to extract and display specific data that they stored using a custom mapping and a custom ContentHandler. Some keywords may want to tell the user that they matched a particular part of the document. Some extensions may want to completely transform the data and its appearance using hooks like Manual:Hooks/ShowSearchHit.

In general, ShowSearchHit is currently used in a dual capacity: as a hook for extensions to incrementally tweak search results (i.e. add some widget or formatting), and as a means to completely override the result display, as Wikibase does (both with and without CirrusSearch enabled). The challenge here is that some scenarios, like Wikibase without CirrusSearch, may call for a complete display override without actually involving a custom result type, so the only way to implement such customization now is the hook.

Currently:

Fields are extracted using ResultsType::getSourceFiltering() and ResultsType::getStoredFields()
Highlighting is set up by ResultsType::getHighlightingConfiguration()
Special:Search's look can be changed using Manual:Hooks/ShowSearchHit or Manual:Hooks/ShowSearchHitTitle
Extensions can register extra data into SearchResult using Manual:Hooks/SearchResultsAugment
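As a rough picture of how the three ResultsType methods listed above shape the request, the Python sketch below maps them onto the _source, stored_fields, and highlight keys of an Elasticsearch search body. The class TitleAndSnippetResultsType, the apply_fetch_phase helper, and the field names are hypothetical; only the method names echo the ones in the list.

```python
from typing import Any, Dict, List


class TitleAndSnippetResultsType:
    """Hypothetical results type that wants a title plus one text snippet."""

    def get_source_filtering(self) -> List[str]:
        return ["namespace", "title"]

    def get_stored_fields(self) -> List[str]:
        return []

    def get_highlighting_configuration(self) -> Dict[str, Any]:
        return {"fields": {"text": {"number_of_fragments": 1, "fragment_size": 150}}}


def apply_fetch_phase(body: Dict[str, Any], results_type: Any) -> Dict[str, Any]:
    """Return a copy of the request body with the fetch phase settings added."""
    out = dict(body)
    out["_source"] = results_type.get_source_filtering()
    stored = results_type.get_stored_fields()
    if stored:  # only request stored fields when some are needed
        out["stored_fields"] = stored
    out["highlight"] = results_type.get_highlighting_configuration()
    return out


request = apply_fetch_phase(
    {"query": {"match": {"text": "solar eclipse"}}}, TitleAndSnippetResultsType()
)
```

The sketch also shows why the pieces are interdependent: whatever later renders a hit must know which source fields and highlighted fields this object asked for.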

Drawbacks are:

None of these techniques can be strongly coupled but they are highly interdependent^[4]
ResultsType is not driven by profile and it's unclear when it should be constructed. Cirrus decides the ResultsType before anything else, but some FullTextQueryBuilder may override it
Manual:Hooks/ShowSearchHit being a hook gives no guarantee that it'll be executed in the right order (not have its values overridden) nor that it has all the required context to know what to do.
Keywords are unable to cleanly add new highlighting hints^[5]

Elasticsearch responses transformation

The process of reading the Elasticsearch response and returning a:

SearchResultSet for the full text search engine entry point
SearchSuggestionSet for the completion search engine entry point
Title for the near match search engine entry point

This transformation is performed by the CirrusSearch ResultsType.
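The following Python sketch illustrates the idea of turning the same raw Elasticsearch response into different return types depending on the entry point. The function names and the plain dictionaries and strings they return are simplified stand-ins, not the SearchResultSet or Title classes, and the "title" and "text" field names are assumptions.

```python
from typing import Any, Dict, List, Optional


def to_full_text_results(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Stand-in for building a result set: one entry per hit."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        snippets = hit.get("highlight", {}).get("text", [])
        results.append(
            {
                "title": source.get("title"),
                "score": hit.get("_score"),
                "snippet": snippets[0] if snippets else None,
            }
        )
    return results


def to_near_match_title(response: Dict[str, Any]) -> Optional[str]:
    """Stand-in for the near match case: a single title, or None."""
    hits = response.get("hits", {}).get("hits", [])
    return hits[0].get("_source", {}).get("title") if hits else None


sample = {
    "hits": {
        "hits": [
            {
                "_score": 3.2,
                "_source": {"title": "Solar eclipse"},
                "highlight": {"text": ["A <em>solar eclipse</em> occurs when"]},
            },
            {"_score": 1.1, "_source": {"title": "Eclipse"}},
        ]
    }
}
```

Both functions read the same response structure; only the shape of what they return differs, which mirrors the role of a results type.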

Wikidata's wbsearchentities uses a special ResultsType implementation to create TermSearchResult arrays instead of the SearchEngine types.

Fallback methods evaluation

Fallback methods are only used (for now) in full text searches. This process spans the entirety of the query construction, up to the evaluation of the results. It is meant to repair a query that may not produce desirable results (e.g., a query that returns fewer than 3 results to display).

See Wikimedia_Discovery/So_Many_Search_Options for more ideas.

Phrase suggester

Attaches an Elasticsearch suggest request to the main search query and displays the suggestion if a title is not highlighted. It may rewrite the entire result set, using the suggestion as the query, if the initial query did not produce any results. It is meant to detect typos and fix them.
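The Python sketch below shows one way to read this behaviour: a phrase suggester is attached to the request body, and the response is then inspected to decide whether to rewrite the query, show the suggestion, or do nothing. The suggester name "typo_fix", the "suggest" field, and the rule that any highlighted title suppresses the suggestion are assumptions, not the actual CirrusSearch logic.

```python
from typing import Any, Dict, Optional, Tuple

SUGGEST_NAME = "typo_fix"  # hypothetical name for the suggester entry


def attach_phrase_suggest(
    body: Dict[str, Any], query_text: str, field: str = "suggest"
) -> Dict[str, Any]:
    """Return a copy of the request body with a phrase suggester attached."""
    out = dict(body)
    out["suggest"] = {
        "text": query_text,
        SUGGEST_NAME: {"phrase": {"field": field, "size": 1}},
    }
    return out


def first_suggestion(response: Dict[str, Any]) -> Optional[str]:
    """Extract the first suggested text from the response, if any."""
    for entry in response.get("suggest", {}).get(SUGGEST_NAME, []):
        options = entry.get("options", [])
        if options:
            return options[0]["text"]
    return None


def choose_action(response: Dict[str, Any]) -> Tuple[str, Optional[str]]:
    """Decide between 'rewrite', 'suggest' and 'none' for a response."""
    suggestion = first_suggestion(response)
    if suggestion is None:
        return ("none", None)
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return ("rewrite", suggestion)  # no results: rerun with the suggestion
    title_highlighted = any("title" in h.get("highlight", {}) for h in hits)
    return ("none", None) if title_highlighted else ("suggest", suggestion)
```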

TextCat language detection

This process runs language detection on the user query and runs a second search on the corresponding wiki in the detected language. The results are appended to the first ones.
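A minimal Python sketch of this fallback follows. The search and detect_language callables are hypothetical placeholders for the real search backend and for TextCat, and the threshold of 3 results reuses the example given earlier for what counts as a desirable result set.

```python
from typing import Callable, List, Optional

SearchFn = Callable[[str, str], List[str]]      # (wiki language, query) -> titles
DetectFn = Callable[[str], Optional[str]]       # query -> language code or None


def search_with_language_fallback(
    query: str,
    search: SearchFn,
    detect_language: DetectFn,
    home_lang: str,
    min_results: int = 3,
) -> List[str]:
    """Search the home wiki, then append results from the detected language."""
    results = list(search(home_lang, query))
    if len(results) >= min_results:
        return results  # enough results, no fallback needed
    lang = detect_language(query)
    if lang is None or lang == home_lang:
        return results  # nothing useful detected
    # Second search on the wiki of the detected language; results are appended.
    return results + list(search(lang, query))
```

For example, a query typed in French on an English wiki would, under these assumptions, return the few English hits first and the French wiki hits after them.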

Cross-project searches

This process runs the SearchQuery against every sister project of the same language. These search requests are attached to the main one using the Elasticsearch multi search (msearch) feature.
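The msearch request body is newline-delimited JSON, with one header line and one body line per search. The Python sketch below builds such a body for a main search plus sister projects; the index names and the build_msearch_body helper are hypothetical and do not reflect the index naming used in production.

```python
import json
from typing import Any, Dict, Iterable, Tuple


def build_msearch_body(requests: Iterable[Tuple[str, Dict[str, Any]]]) -> str:
    """Serialize (index, search body) pairs into an msearch payload."""
    lines = []
    for index, body in requests:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # search body line
    # The payload must end with a newline.
    return "\n".join(lines) + "\n"


search_body = {"query": {"match": {"text": "solar eclipse"}}, "size": 3}
payload = build_msearch_body(
    [
        ("enwiki_content", search_body),        # main search (hypothetical index)
        ("enwiktionary_content", search_body),  # sister projects (hypothetical)
        ("enwikibooks_content", search_body),
    ]
)
```

Sending all searches in one payload lets the sister project results come back in the same round trip as the main results.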

Notes and references

[1] namespaces, size, offset...
[2] See Help:CirrusSearch for example features in the full text search query
[3] will be deprecated soon in favor of the SearchQuery class
[4] task T190130
[5] task T195881