Extension:CirrusSearch/Schema

CirrusSearch uses Elasticsearch as the underlying search engine. The schema used by CirrusSearch is defined through Elasticsearch index settings and mappings. Both the settings and mappings can be requested from any wiki running CirrusSearch to retrieve the current configuration. This documentation is kept up to date on a best-effort basis, but the API responses are the source of truth.


Analysis Chains Used

CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying elasticsearch. For example, the near_match analysis of the title field is typically exposed as title.near_match. There are no strict guarantees about the sub-property naming, but the convention is for the sub-property to share the name of the analyzer.
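
As a rough illustration of the naming convention, the sketch below builds the body of an Elasticsearch match query against an analyzer-specific sub-property. The helper name sub_property_query is hypothetical, and the dotted field name follows the convention described above rather than any guarantee.

```python
def sub_property_query(field: str, analyzer: str, text: str) -> dict:
    """Build a match query body against the `field.analyzer` sub-property."""
    # Convention only: the sub-property usually shares the analyzer's name.
    return {"query": {"match": {f"{field}.{analyzer}": text}}}


# Query the near_match analysis of the title field.
body = sub_property_query("title", "near_match", "Main Page")
```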

The results of using an analysis chain can be checked with the elasticsearch analyze API. This can be queried on the cloudelastic servers or by importing the settings provided by the cirrus-settings-dump api call into a local elasticsearch instance.
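
A minimal sketch of preparing such a check: the function below assembles the request path and JSON body for the analyze API. The index name mywiki_content is a placeholder and the helper analyze_request is hypothetical. Sending the request is left to whichever HTTP client is at hand.

```python
import json


def analyze_request(index: str, analyzer: str, text: str) -> tuple[str, str]:
    """Return the (path, JSON body) for an _analyze call on one index."""
    path = f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return path, body


# "mywiki_content" is a placeholder index name, not a real index.
path, body = analyze_request("mywiki_content", "near_match", "Example_Title")
```

The returned body would be sent as a POST request to the returned path on the cluster or local instance being inspected.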

keyword

Strict matching of the property text against the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, so nothing after the first 5000 characters is considered when matching.

lowercase_keyword

Identical to keyword, but with ICU normalization and folding applied.

near_match

Identical to keyword, but with additional flattening of various space-like tokens to spaces. This is used to power the "Go" functionality of CirrusSearch.

near_match_asciifolding

Identical to lowercase_keyword, but with additional flattening of various space-like tokens to spaces.

plain

Applied to textual content to represent the words in a form very close to the original words. Minimal transformations are applied. This only represents words; various special characters (quotes, commas, etc.) are removed in the tokenization step.

prefix

Generates all possible prefixes of a keyword. ICU normalization is applied along with flattening of various space-like tokens to spaces. Any matching against a prefix must start from the very first character of the field.
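
The prefix generation can be pictured with the following sketch. It only shows the prefix expansion itself; the ICU normalization and space flattening steps are omitted, any length limits the real analyzer may apply are not modelled, and the function name prefixes is hypothetical.

```python
def prefixes(keyword: str) -> list[str]:
    """Return every prefix of the keyword, shortest first."""
    # Each prefix starts at the very first character of the field.
    return [keyword[:end] for end in range(1, len(keyword) + 1)]


tokens = prefixes("wiki")  # ["w", "wi", "wik", "wiki"]
```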

prefix_asciifolding

Similar to prefix, but with ICU folding applied as well.

trigram

Generates trigrams, or three character sequences, of the textual content. This is primarily used to accelerate regex search. For example the string "example text" will yield the tokens: "exa", "xam", "amp", "mpl", "ple", "le ", "e t", " te", "tex", "ext"
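
The example above can be reproduced with a short sketch. This is a plain sliding window over the characters and makes no attempt to mirror any normalization the real analyzer might apply; the function name trigrams is hypothetical.

```python
def trigrams(text: str) -> list[str]:
    """Return every three-character window of the text, in order."""
    # Strings shorter than three characters yield no trigrams.
    return [text[i:i + 3] for i in range(len(text) - 2)]


tokens = trigrams("example text")
```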

text

Standard analyzer for text content. This is similar to the plain analyzer but with more aggressive normalization applied to the content. These normalizations may include stop word filtering, stemming, and other language specific handling.

short_text

Similar to the text analyzer, but specialized for short text strings such as headings and titles.

source_text_plain

Analyzer primarily used against wikitext to provide word-level queries. Uses only ICU normalization along with some special rules to help separate words seen in wikitext.

suggest

Shingled analyzer used to power search suggestions (aka "did you mean"). Shingles are similar to trigrams, but operate on the word level instead of the character level. This analyzer is configured to emit 1-, 2- and 3-grams. For example the string "cats with hats" will emit the tokens: "cats", "cats with", "cats with hats", "with", "with hats", "hats"
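
A minimal sketch of word-level shingling that reproduces the example above. It splits on whitespace only, which is an assumption; the real analyzer's tokenization is more involved. The function name shingles is hypothetical.

```python
def shingles(text: str, max_size: int = 3) -> list[str]:
    """Return word n-grams of size 1..max_size, grouped by start word."""
    words = text.split()  # assumption: simple whitespace tokenization
    tokens = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


tokens = shingles("cats with hats")
```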

token_count

Reports the number of tokens in a field, rather than the textual content.

Native Document Properties

These properties are calculated in the CirrusSearch extension and provided to elasticsearch when sending updates.

version

The revision id that was indexed

wiki

The dbname of the wiki this document belongs to

namespace

The integer namespace the document is in

namespace_text

The textual representation of the namespace the document is in. This is in the wiki's content language

title

The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.

timestamp

The timestamp the most recently indexed revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ

create_timestamp

The timestamp the first revision of this page was created at. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
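
Values in the YYYY-MM-DDTHH:MM:SSZ format used by timestamp and create_timestamp can be parsed as sketched below. Treating the trailing Z as UTC is an assumption based on the usual ISO 8601 reading, and the function name parse_timestamp is hypothetical.

```python
from datetime import datetime, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse a YYYY-MM-DDTHH:MM:SSZ string into an aware datetime."""
    # Assumption: the trailing "Z" denotes UTC.
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


# The sample value is made up for illustration.
parsed = parse_timestamp("2021-03-04T05:06:07Z")
```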

category

A list of categories the page belongs to. The categories use the text format, where spaces in the title are preserved.

A list of external URLs this page links to.

A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.
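
The difference between the text format and the dbkey format can be sketched as a simple character swap. This covers only the space and underscore handling described here; any other title normalization MediaWiki performs is out of scope, and both function names are hypothetical.

```python
def text_to_dbkey(title: str) -> str:
    """Convert a text-format title to dbkey format."""
    return title.replace(" ", "_")


def dbkey_to_text(dbkey: str) -> str:
    """Convert a dbkey-format title back to text format."""
    return dbkey.replace("_", " ")


dbkey = text_to_dbkey("Main Page")  # "Main_Page"
```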

template

A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.

text

The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate html, removing non-text content, and stripping all html. Content removed from this field, such as tables, captions, and hatnotes, is moved to the auxiliary_text field.

source_text

The source wikitext of the page.

text_bytes

The size of the content as reported by the associated mediawiki Content implementation. For wikitext this is the number of bytes in the wikitext.

content_model

String representing the name of the content model for this page.

wikibase_item

String containing the wikidata Q-item this page is associated with.

coordinates

List of coordinates associated with this page. Each coordinate has the following properties:

  • coord - elasticsearch geo_point. Represented as an object with two floating point properties: lat, in the range [-90, 90], and lon, in the range [-180, 180]
  • country - country code
  • dim - dimension. Integer radius, in meters, of the item being referenced
  • globe - The globe the coordinates are on. Typically "earth".
  • name - Name of the item referenced. Often null
  • primary - Boolean representing if this is the primary coordinate for the article. Only one coordinate can be primary.
  • region - Sub-region of the country this coordinate is within. For example, if the country code is US, region will be a two-letter US state code.
  • type - ???. Same value as gt_type field of GeoData table in mysql

language

The language code this page is in

heading

List of headings on this page

opening_text

Text content of the page prior to the first heading. The content is also available in the text property.

auxiliary_text

List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in mediawiki core.

display_title

Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.

file_bits

Contains the integer bit depth of the media represented by this page

file_height

Contains the integer height of the media represented by this page

file_media_type

Contains the media type of the media represented by this page.

file_mime

Contains the mime type of the media represented by this page.

file_resolution

Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
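
The formula can be written out as a short sketch. It assumes width and height are non-negative integers in pixels, and the function name file_resolution is hypothetical.

```python
import math


def file_resolution(file_width: int, file_height: int) -> int:
    """Compute floor(square_root(file_width * file_height))."""
    # math.isqrt returns the floor of the exact integer square root.
    return math.isqrt(file_width * file_height)


resolution = file_resolution(1920, 1080)  # 1440
```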

file_size

Contains the size of the media represented by this page in bytes.

file_text

Contains the text content of the media represented by this page for mime types that mediawiki knows how to extract the content of, such as PDF and DJVU. Length of text indexed is limited by $wgCirrusSearchMaxFileTextLength which is unlimited by default and 50kB on WMF wikis.

file_width

Contains the width of the media represented by this page in pixels

Contains an integer representing the number of pages on the same wiki that link to this page.

redirect

List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: The integer namespace in the namespace property, and the title of the redirect in

Properties only populated on commonswiki

local_sites_with_dupe

Only found on commonswiki in the File namespace. Contains list of wiki dbnames that have an uploaded file with the exact same name as this file.

Properties only populated on wikis running wikibase repo

descriptions.*

label_count

labels.*

lemma

lexeme_forms

representation

lexeme_language

lexical_category

statement_count

statement_keywords

External Document Properties

These properties are calculated external to CirrusSearch and populated within the production search clusters

popularity_score

A floating point number representing the percentage of page views on this wiki that are for this page. This is only available for content pages.

weighted_tags

Contains classification predictions about the page from various sources, including ORES models and link recommendations. While the name says articletopic, this will be renamed to something semantically appropriate, perhaps predicted_classes or even classifications, in the future.

Predictions are provided in the source documents in an array with per-model prefixes and a suffixed integer in [0,1000] representing the confidence. The analysis chain interprets this value as the term frequency. For legacy reasons unprefixed predictions (without a /) belong to the ORES articletopic model. For example:

   [
       "STEM.Computing|780",
       "drafttopic/STEM.STEM*|988",
       "link_recommend/exists|1",
   ]
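
A rough sketch of how such entries could be split into their parts. The function name parse_weighted_tag is hypothetical, labelling unprefixed entries with the string "articletopic" is an assumption made here for illustration, and every entry is assumed to end with a |integer suffix as in the example above.

```python
def parse_weighted_tag(raw: str) -> tuple[str, str, int]:
    """Split one entry into (prefix, tag, weight)."""
    # The weight is whatever follows the last "|".
    name, _, weight = raw.rpartition("|")
    prefix, slash, tag = name.partition("/")
    if not slash:
        # Legacy unprefixed entry: attribute it to the ORES articletopic model.
        prefix, tag = "articletopic", name
    return prefix, tag, int(weight)


parsed = [
    parse_weighted_tag(entry)
    for entry in ["STEM.Computing|780", "drafttopic/STEM.STEM*|988", "link_recommend/exists|1"]
]
```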

copy_to Document Properties

These properties are not provided directly by CirrusSearch, rather the elasticsearch mapping is instructed to create these fields by copying content from other fields.

Contains all text content copied to a single field. This consolidation into a single field is an optimization; semantically it shouldn't be important. The general idea is to use it as a first-pass filter that removes most irrelevant results, leaving the individual field queries to only affect scoring.

all_near_match

Contains both titles and redirects in a single field for filtering with the near_match analyzer.

suggest

The suggest field is populated by the copy_to section of the title and redirect fields. The suggest field uses shingles (word ngrams) which provides phrase matching in a way that doesn't have to be restricted to the rescore window for performance reasons.

labels_all

Only generated on wikis containing wikibase repo. Contains a copy of all labels in all languages.