Extension:CirrusSearch/Schema
CirrusSearch uses Elasticsearch as the underlying search engine. The schema used by CirrusSearch is defined through Elasticsearch index settings and mappings. Both the settings and mappings can be requested from any wiki running CirrusSearch to retrieve the current configuration. Attempts are made to keep the documentation here up to date, but the api responses contain the source of truth.
Analysis Chains Used
CirrusSearch defines a variety of analysis chains that are used throughout the schema to allow text fields to be searched in different ways. These are exposed as sub-properties when querying elasticsearch. For example, the near_match analysis of the title field is typically exposed as title.near_match. There are no strict guarantees about the sub-property naming, but the convention is for the property to share the name of the analyzer.
The results of using an analysis chain can be checked with the elasticsearch analyze API. This can be queried on the cloudelastic servers or by importing the settings provided by the cirrus-settings-dump api call into a local elasticsearch instance.
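As a rough illustration, a request to the analyze API could be assembled as sketched below. The host and index name are placeholders (assumptions, not values taken from this page); `analyzer` and `text` are the standard body parameters of the Elasticsearch `_analyze` endpoint. The sketch only builds the request and does not send it.

```python
import json

def build_analyze_request(base_url, index, analyzer, text):
    """Return (url, json_body) for an Elasticsearch _analyze call."""
    url = f"{base_url.rstrip('/')}/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return url, body

# Hypothetical host and index name, for illustration only.
url, body = build_analyze_request(
    "http://localhost:9200", "mywiki_content", "near_match", "Example Title"
)
print(url)
print(body)
```

The resulting URL and body can then be sent with any HTTP client against a cluster that has the CirrusSearch settings loaded.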
keyword
Strict matching of the property text with the queried text. The text is not split into words; the whole text must match from beginning to end. The property text is truncated to 5000 characters, and nothing after the first 5000 characters is taken into consideration when matching.
lowercase_keyword
Identical to keyword, but with ICU normalization and folding applied.
near_match
Identical to keyword, but with additional flattening of various space-like tokens to spaces. This is used to power the "Go" functionality of CirrusSearch.
near_match_asciifolding
Identical to lowercase_keyword, but with additional flattening of various space-like tokens to spaces.
plain
Applied to textual content to represent the words in a form very close to the original words. Minimal transformations are applied. This only represents words; various special characters (quotes, commas, etc.) are removed in the tokenization step.
prefix
Generates all possible prefixes of a keyword. ICU normalization is applied along with flattening of various space-like tokens to spaces. Any matching against a prefix must start from the very first character of the field.
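A minimal sketch of the prefix generation step is shown below. It covers only the "all possible prefixes" part; the ICU normalization and space flattening described above are deliberately left out, and the function name is made up for this example.

```python
def prefixes(keyword):
    """Return every prefix of keyword, shortest first."""
    return [keyword[:i] for i in range(1, len(keyword) + 1)]

print(prefixes("main"))  # ['m', 'ma', 'mai', 'main']
```

Because every token is anchored at the first character, a query for "ai" would not match "main" through this analyzer.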
prefix_asciifolding
Similar to prefix, but with ICU folding applied as well.
trigram
Generates trigrams, or three character sequences, of the textual content. This is primarily used to accelerate regex search. For example the string "example text" will yield the tokens: "exa", "xam", "amp", "mpl", "ple", "le ", "e t", " te", "tex", "ext"
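The trigram example can be reproduced with a short sketch. This is plain Python, not the Elasticsearch tokenizer itself, and it ignores any normalization the real analysis chain may apply.

```python
def trigrams(text):
    """Return all overlapping three-character sequences of text."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

print(trigrams("example text"))
# ['exa', 'xam', 'amp', 'mpl', 'ple', 'le ', 'e t', ' te', 'tex', 'ext']
```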
text
Standard analyzer for text content. This is similar to the plain analyzer but with more aggressive normalization applied to the content. These normalizations may include stop word filtering, stemming, and other language specific handling.
short_text
Similar to the text analyzer, but specialized for short text strings such as headings and titles.
source_text_plain
Analyzer primarily used against wikitext to provide word level queries. Uses only ICU normalization along with some special rules to help separate words seen in wikitext.
suggest
Shingled analyzer used to power search suggestions (aka "did you mean"). Shingles are similar to trigrams, but operate on the word level instead of the character level. This analyzer is configured to emit 1, 2 and 3-grams. For example the string "cats with hats" will emit the tokens: "cats", "cats with", "cats with hats", "with", "with hats", "hats"
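The shingle example can be reproduced with the sketch below. It splits on whitespace only, which is an assumption made for brevity; the real analyzer uses its own tokenizer and filters.

```python
def shingles(text, max_size=3):
    """Return word n-grams of size 1..max_size, grouped by start word."""
    words = text.split()
    out = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out

print(shingles("cats with hats"))
# ['cats', 'cats with', 'cats with hats', 'with', 'with hats', 'hats']
```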
token_count
Reports the number of tokens in a field, rather than the textual content.
Native Document Properties
These properties are calculated in the CirrusSearch extension and provided to elasticsearch when sending updates.
version
The revision id that was indexed.
wiki
The dbname of the wiki this document belongs to.
namespace
The integer namespace the document is in.
namespace_text
The textual representation of the namespace the document is in. This is in the wiki's content language.
title
The title of the page this document represents. The title uses the text format, where spaces in the title are preserved.
timestamp
The timestamp at which the most recently indexed revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
create_timestamp
The timestamp at which the first revision of this page was created. Timestamps are in the format YYYY-MM-DDTHH:MM:SSZ.
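A small sketch for reading and writing this timestamp format in Python follows. It assumes the trailing Z denotes UTC, as in ISO 8601; the helper names are made up for this example.

```python
from datetime import datetime, timezone

# strptime/strftime pattern for YYYY-MM-DDTHH:MM:SSZ
CIRRUS_TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def parse_timestamp(value):
    """Parse a timestamp string into a UTC-aware datetime."""
    parsed = datetime.strptime(value, CIRRUS_TS_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)

def format_timestamp(dt):
    """Format a timezone-aware datetime in the document format."""
    return dt.astimezone(timezone.utc).strftime(CIRRUS_TS_FORMAT)

print(format_timestamp(parse_timestamp("2020-01-02T03:04:05Z")))
```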
category
A list of categories the page belongs to. The categories use the text format, where spaces in the title are preserved.
external_link
A list of external URLs this page links to.
outgoing_link
A list of wiki pages that are linked from this page. The wiki pages are in dbkey format, where spaces are replaced with underscores.
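The difference between the text format and the dbkey format comes down to how spaces are written. The sketch below shows only that substitution; it is not MediaWiki's full title normalization, and the helper names are hypothetical.

```python
def to_dbkey(text_title):
    """Convert a text-format title to dbkey format (spaces -> underscores)."""
    return text_title.replace(" ", "_")

def to_text(dbkey):
    """Convert a dbkey-format title back to text format."""
    return dbkey.replace("_", " ")

print(to_dbkey("Main Page"))  # Main_Page
```

This matters when comparing values across fields, since outgoing_link uses dbkey format while title, category and template use text format.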
template
A list of templates that are used in this page, as reported by the MediaWiki wikitext parser. The template names are in text format, where spaces in the title are preserved.
text
The textual content of the page. This is roughly constructed by running the wikitext through the parser to generate html, removing non-text content, and stripping all html. Content removed from this field, such as tables, captions, and hatnotes, is moved to the auxiliary_text field.
source_text
The source wikitext of the page.
text_bytes
The size of the content as reported by the associated mediawiki Content implementation. For wikitext this is the number of bytes in the wikitext.
content_model
String representing the name of the content model for this page.
wikibase_item
String containing the wikidata Q-item this page is associated with.
coordinates
List of coordinates associated with this page. Each coordinate has the following structure:
Properties of each coordinate:
- coord - elasticsearch geo_point. Represented as an object with two properties: lat and lon. Both contain a floating point number; lat is in the range [-90, 90] and lon in the range [-180, 180]
- country - country code
- dim - dimension. Integer radius, in meters, of the item being referenced
- globe - The globe the coordinates are on. Typically "earth".
- name - Name of the item referenced. Often null
- primary - Boolean representing if this is the primary coordinate for the article. Only one coordinate can be primary.
- region - Sub-region of country this coordinate is within. For example if country code is US region will be a two letter US State code.
- type - ???. Same value as gt_type field of GeoData table in mysql
language
The language code this page is in.
heading
List of headings on this page.
opening_text
Text content of the page prior to the first heading. The content is also available in the text property.
auxiliary_text
List of strings removed from the text property. The content that is moved from the text property to this one is controlled by the WikiTextStructure::$auxiliaryElementSelectors property in mediawiki core.
display_title
Contains the display title of the page if it differs from the regular page title in ways other than casing. If the display title is prefixed with the translated namespace of the page in the page's language, the namespace name is stripped.
file_bits
Contains the integer bit depth of the media represented by this page.
file_height
Contains the integer height of the media represented by this page.
file_media_type
Contains the media type of the media represented by this page.
file_mime
Contains the mime type of the media represented by this page.
file_resolution
Contains an integer representation of the resolution of the media represented by this page. This is calculated as floor(square_root(file_width * file_height)).
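The formula can be checked with a short sketch. It uses Python's `math.isqrt`, which returns the floor of the square root for non-negative integers; the function name simply mirrors the field name and is not taken from the CirrusSearch code.

```python
import math

def file_resolution(file_width, file_height):
    """floor(sqrt(width * height)) for integer pixel dimensions."""
    return math.isqrt(file_width * file_height)

print(file_resolution(1920, 1080))  # 1440
print(file_resolution(800, 600))    # 692
```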
file_size
Contains the size of the media represented by this page in bytes.
file_text
Contains the text content of the media represented by this page for mime types that mediawiki knows how to extract the content of, such as PDF and DJVU. Length of text indexed is limited by $wgCirrusSearchMaxFileTextLength which is unlimited by default and 50kB on WMF wikis.
file_width
Contains the width of the media represented by this page in pixels.
incoming_links
Contains an integer representing the number of pages on the same wiki that link to this page.
redirect
List of redirects on the same wiki that redirect to this page. Each redirect is represented by an object with two properties: The integer namespace in the namespace property, and the title of the redirect in
Properties only populated on commonswiki
local_sites_with_dupe
Only found on commonswiki in the File namespace. Contains list of wiki dbnames that have an uploaded file with the exact same name as this file.
Properties only populated on wikis running wikibase repo
descriptions.*
label_count
labels.*
lemma
lexeme_forms
id
representation
lexeme_language
lexical_category
sitelink_count
statement_count
statement_keywords
External Document Properties
These properties are calculated external to CirrusSearch and populated within the production search clusters.
popularity_score
A floating point number representing the percentage of page views to this wiki that request this page. This is only available for content pages.
weighted_tags
Contains classification predictions about the page from various sources, including ORES models and link recommendations. While the name says articletopic, this will be renamed to something semantically appropriate, perhaps predicted_classes or even classifications, in the future.
Predictions are provided in the source documents in an array with per-model prefixes and a suffixed integer in [0,1000] representing the confidence. The analysis chain interprets this value as the term frequency. For legacy reasons unprefixed predictions (without a /) belong to the ORES articletopic model. For example:
[ "STEM.Computing|780", "drafttopic/STEM.STEM*|988", "link_recommend/exists|1", ]
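To make the encoding concrete, the sketch below splits one such string into its model prefix, tag name and confidence. The function name and the returned tuple layout are made up for this example, and it assumes every entry carries a `|`-separated integer suffix.

```python
def parse_weighted_tag(raw, default_model="articletopic"):
    """Split 'model/tag|weight' into (model, tag, weight).

    Unprefixed entries (no '/') are attributed to default_model,
    following the legacy rule for the ORES articletopic model.
    """
    name, _, weight = raw.rpartition("|")
    model, sep, tag = name.partition("/")
    if not sep:
        model, tag = default_model, name
    return model, tag, int(weight)

for raw in ["STEM.Computing|780", "drafttopic/STEM.STEM*|988",
            "link_recommend/exists|1"]:
    print(parse_weighted_tag(raw))
```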
copy_to Document Properties
These properties are not provided directly by CirrusSearch; rather, the elasticsearch mapping is instructed to create these fields by copying content from other fields.
all
Contains all text content copied to a single field. This consolidation into a single field is an optimization; semantically it shouldn't be important. The general idea is to use it as a first-pass filter that removes most irrelevant results, leaving the individual field queries to only affect scoring.
all_near_match
Contains both titles and redirects in a single field for filtering with the near_match analyzer.
suggest
The suggest field is populated by the copy_to section of the title and redirect fields. The suggest field uses shingles (word ngrams) which provides phrase matching in a way that doesn't have to be restricted to the rescore window for performance reasons.
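The general shape of a copy_to declaration is sketched below as a Python dict. The fragment is hypothetical and heavily simplified: it shows only the title field, and the field types and analyzer assignment are assumptions. The mapping returned by the wiki itself remains the source of truth. `copy_to` is the standard Elasticsearch mapping parameter.

```python
# Hypothetical, simplified mapping fragment for illustration only.
mapping_sketch = {
    "properties": {
        "title": {"type": "text", "copy_to": ["suggest"]},
        "suggest": {"type": "text", "analyzer": "suggest"},
    }
}

def fields_copied_to(mapping, target):
    """List top-level fields whose copy_to includes target."""
    return [
        name
        for name, spec in mapping["properties"].items()
        if target in spec.get("copy_to", [])
    ]

print(fields_copied_to(mapping_sketch, "suggest"))  # ['title']
```

A helper like this can be pointed at a real mapping dump to see which source fields feed a given copy_to field, though nested fields would need a recursive walk.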
labels_all
Only generated on wikis containing wikibase repo. Contains a copy of all labels in all languages.