Help talk:CirrusSearch

Summary by Speravir

Question about wildcards answered; at least the questioner was satisfied (thanked in the background).

Mystyc1 (talkcontribs)

Is the * treated like a wildcard or like a regex operator? If it's a wildcard, then does it represent one character, a group of characters, a word, or something else? Maybe this isn't the right article to answer such questions, but if it is, then it is remarkably bad.  

I suspect the latter because I have not been able to find a link in the article that would direct me to a more appropriate page.

Speravir (talkcontribs)

See help page, section Words, phrases, and modifiers:

The last words give a hint that regex searches are possible, in which the * has a different meaning. "Covered later" refers to the corresponding section.

This sentence from the section on insource searches should also be noted:

(You can find all of this by searching for wildcard on the help page.)

217.117.125.83 (talkcontribs)

How do I search for an exact string including greyspace characters?

Speravir (talkcontribs)
Reply to "greyspace characters"
2001:1711:FA4B:D10:B1BE:F13C:8327:704F (talkcontribs)

Hi,

Is there an example profile showing how we can use a synonym file with CirrusSearch and Elasticsearch?

Thanks

EBernhardson (WMF) (talkcontribs)

Unfortunately, synonyms aren't something CirrusSearch has any support for. It's been in the background as something to work on, but we need to come up with a solution that works in hundreds of languages and likely defers the actual synonym definition to wiki editors rather than system administrators.

While not exactly synonyms, on the WMF wikis we rely on redirects to pages to provide alternate names for them. In most cases where wiki search externally appears to have used synonyms what actually happened was there was a redirect to the page giving alternate titles (that are used as a fairly strong ranking signal).

Aparolini (talkcontribs)

Thanks for the feedback.

Because Elasticsearch does support synonyms as a filter, and Cirrus is really just a bridge to Elasticsearch, I was hoping we could work this out with profiles, such as:

'default' => [
    'builder_class' => Query\FullTextQueryStringQueryBuilder::class,
    'settings' => [
        'filter' => [
            'type' => 'synonym',
            'settings' => [
                'synonyms_path' => 'my_synonyms.txt',
                'updateable' => 'true',
            ],
        ],
    ],
],

Synonyms are important to us (medical wiki). For instance, if you search for "audition", you should find not only pages containing "audition", but also pages containing "hear" or "malleus" (a small bone inside the ear).

Editing the pages to add synonyms is not an option for us, as this would add a lot of work for page authors.

Reply to "Synonyms"
2001:1711:FA4B:D10:1163:390A:525:F58B (talkcontribs)

Hi. MediaWiki 1.38.2 and CirrusSearch generate Elasticsearch queries using "query_string". How do I make Cirrus use "match_phrase_prefix" instead?

This would allow me to find pages using partial keywords. For example, "Cirr" would return pages with "Cirrus" inside.

Any ideas? Thanks.

EBernhardson (WMF) (talkcontribs)

Within cirrus we don't have anything that directly supports match_phrase_prefix. We generally avoid this style of query as it provides queries that give unexpected outputs that can change depending on which replicas of the index it lands on. In particular there is no guarantee with match_phrase_prefix that "cirr" will return pages with "Cirrus" inside of them. Instead it will look at term dictionaries and select a number of words somewhat arbitrarily that start with cirr and then search for those words. Depending on the exact term statistics in the replica it lands on this can choose a different set of words to search for when repeating the same query.

While I would generally suggest avoiding it, the existing query_string queries do support this style of query. You can achieve the same functionality by appending a *, such as cirr*.

2001:1711:FA4B:D10:1029:E025:F952:194F (talkcontribs)

Thanks for your answer.

It's a Swiss-French medical wiki used by doctors, with a lot of long words. Our basic users don't know Elasticsearch tricks like "*" or "~".

For example, "Prostatectomie" should be found just by entering "Prostat".

So if we cannot use match_phrase_prefix, can we add the final "*" by default to all searches with Cirrus?

DCausse (WMF) (talkcontribs)

To customize the main full text search query you can implement your own \CirrusSearch\Query\FullTextQueryBuilder implementation and register it in the wgCirrusSearchFullTextQueryBuilderProfiles config var, see some examples for other builder profiles here. Then you can activate this new profile as the default by setting wgCirrusSearchFullTextQueryBuilderProfile to its name.

You have some examples of how to implement a FullTextQueryBuilder here.

Note that doing this is not trivial, but I think it is the only way to achieve what you want without teaching your users to use the search syntax.
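The rewriting that such a custom builder would have to perform can be illustrated outside of MediaWiki. The following is only a sketch of the logic in Python, not CirrusSearch code: the function name add_trailing_wildcard and the tokenisation rules are assumptions made for this example, and a real implementation would be a PHP class implementing FullTextQueryBuilder as described above.

```python
import re

# Boolean operators that must not be turned into prefix terms.
OPERATORS = {"AND", "OR", "NOT"}


def add_trailing_wildcard(query: str) -> str:
    """Append '*' to every plain word so that it is searched as a prefix.

    Hypothetical helper for illustration only; it is not part of
    CirrusSearch or Elasticsearch.
    """
    # Keep balanced quoted phrases together, split everything else on spaces.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    rewritten = []
    for token in tokens:
        # Only bare words are rewritten. Tokens that already carry search
        # syntax (wildcards, '~', quotes, keywords such as incategory:)
        # are passed through unchanged.
        if re.fullmatch(r"\w+", token) and token not in OPERATORS:
            rewritten.append(token + "*")
        else:
            rewritten.append(token)
    return " ".join(rewritten)


if __name__ == "__main__":
    print(add_trailing_wildcard("Prostat"))
    print(add_trailing_wildcard('incategory:Sql "exact phrase" cirr'))
```

The caveat given earlier in this thread still applies: prefix expansion depends on the term dictionary, so results for such rewritten queries may be less predictable than for plain word queries.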

Reply to "Using match_phrase_prefix"
46.193.3.148 (talkcontribs)

Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?

Ciencia Al Poder (talkcontribs)
Reply to "Elastic and Cirrus Search"
Jonteemil (talkcontribs)

Hello!

Some of the filenames of these files aren't displayed completely for some reason. Also, you would expect this search to say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)", but it doesn't. Does anyone here know why?

DCausse (WMF) (talkcontribs)

Hi,

I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac. Do you have a specific page that is missing?

The second query you pasted contains an error. After fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query:

intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/.

When searching for a redirect, the search engine will always display the target page. Sometimes you may see a hint that you matched a redirect, when the mention (redirect from: page_name) appears after the page title. See for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".

Speravir (talkcontribs)

DCausse, the filenames are only partially displayed. From the first search Jonteemil provided, it seems there is a maximum display length, some character limit, and the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.

With an altered search I do get the full display, though it links to the redirect target with only one file extension, as already pointed out by DCausse: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/. Note that I merged both regex searches, as a query with two of them is really bad in terms of server load (I also added the namespace as search domain; this should always be added if possible).

Reply to "2 questions"

cirrussearch vs database backup dumps

69.191.241.48 (talkcontribs)

Hi - there are two types of dumps available for enwiki pages: the monthly database dump, structured in XML, which you can subscribe to, and the weekly cirrussearch dumps, which are structured in JSON for bulk upload to Elasticsearch. We're trying to diff the two dumps to see if they're comparable, but we notice that some articles are in the monthly XML dump but not in the weekly cirrussearch dump. I'm having trouble finding an explanation on the main Wikimedia homepage that clearly states the difference between these two enwiki dumps. Any additional information would be much appreciated.

I would post links, but am getting an error when trying to post, so please navigate to dumps.wikimedia.org and look for the extensions

cirrussearch dump: /other/cirrussearch/

xml dump: /enwiki/latest/

Ciencia Al Poder (talkcontribs)

Check whether the articles "missing" from the cirrussearch dumps exist on the live wiki. If not, those articles were deleted after the monthly XML dump but before the weekly cirrussearch dump.
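To get the list of titles to check on the live wiki, the two dumps can be reduced to sets of titles and subtracted. The sketch below is in Python and rests on assumptions that should be verified against the actual files: that the cirrussearch dump is in Elasticsearch bulk format (an action line followed by a document line), that documents carry title and namespace fields, and that the XML titles have already been extracted into an iterable. The function names are made up for this example.

```python
import json
from typing import Iterable, Set


def titles_from_cirrus_dump(lines: Iterable[str], namespace: int = 0) -> Set[str]:
    """Collect page titles from the lines of a (decompressed) cirrussearch dump.

    Assumes bulk format: action lines such as {"index": {...}} alternate
    with document lines, and only document lines have a "title" field.
    """
    titles = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "title" in record and record.get("namespace") == namespace:
            titles.add(record["title"])
    return titles


def only_in_xml(xml_titles: Iterable[str], cirrus_titles: Set[str]) -> Set[str]:
    """Titles present in the XML dump but absent from the cirrussearch dump."""
    return set(xml_titles) - cirrus_titles


if __name__ == "__main__":
    sample = [
        '{"index": {"_id": "12"}}',
        '{"title": "Anarchism", "namespace": 0}',
        '{"index": {"_id": "25"}}',
        '{"title": "Autism", "namespace": 0}',
    ]
    cirrus = titles_from_cirrus_dump(sample)
    print(sorted(only_in_xml(["Anarchism", "Autism", "Deleted page"], cirrus)))
```

Each title in the resulting set can then be looked up on the live wiki, as suggested above, to see whether the page was deleted between the two dump dates.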

Reply to "cirrussearch vs database backup dumps"
Nicolas senechal (talkcontribs)

I tried -incategory:"Actif", but the results are those of an inclusion. I then tried ! and it doesn't work either (-! and !- behave the same as -). I also tried to include Actif with incategory:"Actif", and that doesn't work.

This isn't normal... how can I fix this?

Any help is appreciated.

DCausse (WMF) (talkcontribs)

The syntax you are using is correct and should work:

-incategory:"Cartographie"

is searching for pages that are not directly under the Cartographie category.


It might be that you are expecting incategory to find articles belonging to the tree of subcategories?

This is not the case, incategory is limited to direct relationships.

For finding articles in a category and the subcategories of this category you must use the deepcat keyword: deepcat:"Cartographie".


This keyword is available on WMF wikis.

Nicolas senechal (talkcontribs)

Thank you @DCausse (WMF), but it still doesn't work.

Actif is directly linked.

I tried again with deepcat, and it doesn't work with deepcat either.

DCausse (WMF) (talkcontribs)

If you are using a publicly accessible wiki could you provide the link to it so that we can try to have a closer look?

If you are using your own private wiki, could you double-check that CirrusSearch is properly installed? To do this, I generally append &cirrusDumpQuery to the URL of the search results page.

This is what it looks like on the French Wikipedia:

https://fr.wikipedia.org/w/index.php?search=-incategory%3ACartographie&title=Sp%C3%A9cial%3ARecherche&ns0=1&cirrusDumpQuery


This should display the JSON document sent to Elasticsearch, and we might be able to detect what is wrong in your setup, especially whether or not the match query on the category.lowercase_keyword field is wrapped inside a must_not block.

The deepcat keyword requires a graph engine to be installed and set up; this most probably explains why it does not work out of the box.
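The check described above (is the match on category.lowercase_keyword inside a must_not block?) can also be done mechanically on the saved cirrusDumpQuery output. This is a minimal Python sketch; the function names are invented for the example, and it only assumes that the dump is ordinary JSON that has been loaded into Python dicts and lists.

```python
from typing import Any

FIELD = "category.lowercase_keyword"


def mentions_field(node: Any, field: str = FIELD) -> bool:
    """Return True if a match clause on `field` occurs anywhere under node."""
    if isinstance(node, dict):
        match = node.get("match")
        if isinstance(match, dict) and field in match:
            return True
        return any(mentions_field(value, field) for value in node.values())
    if isinstance(node, list):
        return any(mentions_field(value, field) for value in node)
    return False


def category_is_negated(node: Any, field: str = FIELD) -> bool:
    """Return True if a match on `field` sits somewhere inside a must_not block."""
    if isinstance(node, dict):
        if "must_not" in node and mentions_field(node["must_not"], field):
            return True
        return any(category_is_negated(value, field) for value in node.values())
    if isinstance(node, list):
        return any(category_is_negated(value, field) for value in node)
    return False


if __name__ == "__main__":
    excerpt = {
        "bool": {
            "must_not": [
                {"bool": {"should": [{"match": {FIELD: {"query": "Sql"}}}]}}
            ]
        }
    }
    print(category_is_negated(excerpt))
```

If mentions_field is false for the whole dump, the incategory keyword was not recognised at all and the text was sent as an ordinary full-text query.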

Nicolas senechal (talkcontribs)

OK - after checking my install

I have one question:

Do I only need the Elastica extension for MediaWiki and the cURL library in order to run CirrusSearch?

If not, how can I download the Elastica library for PHP for XAMPP on Windows?

Nicolas senechal (talkcontribs)

After some time I tried downloading Elasticsearch again, and now it works, but incategory doesn't really work (for me, !incategory: and -incategory: both include, I have no way to exclude, and operators like OR, AND, NOT don't work).

To run Elasticsearch I have to run a command in cmd on Windows, but that requires a cmd window to stay open constantly. I searched on the internet and found the command START /B. So now I have Elasticsearch running in a hidden cmd. To stop it I just have to close my regular cmd window.


So here is the result with &cirrusDumpQuery for !incategory:"Actif" -incategory:"Sql"

{

    "__main__": {

        "description": "full_text search for '!incategory:\"Actif\" -incategory:\"Sql\"'",

        "path": "test\/page\/_search",

        "params": {

            "timeout": "20s",

            "search_type": "dfs_query_then_fetch"

        },

        "query": {

            "_source": [

                "namespace",

                "title",

                "namespace_text",

                "wiki",

                "redirect.*",

                "timestamp",

                "text_bytes"

            ],

            "stored_fields": [

                "text.word_count"

            ],

            "query": {

                "bool": {

                    "minimum_should_match": 1,

                    "should": [

                        {

                            "query_string": {

                                "query": "!incategory\\: (all.plain:\"Actif\"~0^1)",

                                "fields": [

                                    "all.plain^1",

                                    "all^0.5"

                                ],

                                "phrase_slop": 0,

                                "default_operator": "AND",

                                "allow_leading_wildcard": true,

                                "fuzzy_prefix_length": 2,

                                "rewrite": "top_terms_boost_1024"

                            }

                        },

                        {

                            "multi_match": {

                                "fields": [

                                    "all_near_match^2",

                                    "all_near_match.asciifolding^1.5"

                                ],

                                "query": "!incategory:"

                            }

                        }

                    ],

                    "filter": [

                        {

                            "bool": {

                                "must": [

                                    {

                                        "terms": {

                                            "namespace": [

                                                0,

                                                4,

                                                6,

                                                14

                                            ]

                                        }

                                    }

                                ],

                                "must_not": [

                                    {

                                        "bool": {

                                            "should": [

                                                {

                                                    "match": {

                                                        "category.lowercase_keyword": {

                                                            "query": "Sql"

                                                        }

                                                    }

                                                }

                                            ]

                                        }

                                    }

                                ]

                            }

                        }

                    ]

                }

            },

            "highlight": {

                "pre_tags": [

                    "\ue000"

                ],

                "post_tags": [

                    "\ue001"

                ],

                "fields": {

                    "title": {

                        "type": "fvh",

                        "number_of_fragments": 0,

                        "order": "score",

                        "matched_fields": [

                            "title",

                            "title.plain"

                        ]

                    },

                    "redirect.title": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 10000,

                        "matched_fields": [

                            "redirect.title",

                            "redirect.title.plain"

                        ]

                    },

                    "category": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 10000,

                        "matched_fields": [

                            "category",

                            "category.plain"

                        ]

                    },

                    "heading": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 10000,

                        "matched_fields": [

                            "heading",

                            "heading.plain"

                        ]

                    },

                    "text": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 150,

                        "no_match_size": 150,

                        "matched_fields": [

                            "text",

                            "text.plain"

                        ]

                    },

                    "auxiliary_text": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 150,

                        "matched_fields": [

                            "auxiliary_text",

                            "auxiliary_text.plain"

                        ]

                    },

                    "file_text": {

                        "type": "fvh",

                        "number_of_fragments": 1,

                        "order": "score",

                        "fragment_size": 150,

                        "matched_fields": [

                            "file_text",

                            "file_text.plain"

                        ]

                    }

                },

                "highlight_query": {

                    "query_string": {

                        "query": "!incategory\\: (title.plain:\"Actif\"~0^20 OR redirect.title.plain:\"Actif\"~0^15 OR category.plain:\"Actif\"~0^8 OR heading.plain:\"Actif\"~0^5 OR opening_text.plain:\"Actif\"~0^3 OR text.plain:\"Actif\"~0^1 OR auxiliary_text.plain:\"Actif\"~0^0.5 OR file_text.plain:\"Actif\"~0^0.5)",

                        "fields": [

                            "title.plain^20",

                            "redirect.title.plain^15",

                            "category.plain^8",

                            "heading.plain^5",

                            "opening_text.plain^3",

                            "text.plain^1",

                            "auxiliary_text.plain^0.5",

                            "file_text.plain^0.5",

                            "title^10",

                            "redirect.title^7.5",

                            "category^4",

                            "heading^2.5",

                            "opening_text^1.5",

                            "text^0.5",

                            "auxiliary_text^0.25",

                            "file_text^0.25"

                        ],

                        "phrase_slop": 1,

                        "default_operator": "AND",

                        "allow_leading_wildcard": true,

                        "fuzzy_prefix_length": 2,

                        "rewrite": "top_terms_boost_1024"

                    }

                }

            },

            "stats": [

                "full_text",

                "full_text_querystring",

                "complex_query",

                "incategory",

                "query_string"

            ],

            "rescore": [

                {

                    "window_size": 8192,

                    "query": {

                        "query_weight": 1,

                        "rescore_query_weight": 1,

                        "score_mode": "multiply",

                        "rescore_query": {

                            "function_score": {

                                "functions": [

                                    {

                                        "field_value_factor": {

                                            "field": "incoming_links",

                                            "modifier": "log2p",

                                            "missing": 0

                                        }

                                    },

                                    {

                                        "weight": 0.1,

                                        "filter": {

                                            "terms": {

                                                "namespace": [

                                                    4

                                                ]

                                            }

                                        }

                                    },

                                    {

                                        "weight": 0.2,

                                        "filter": {

                                            "terms": {

                                                "namespace": [

                                                    6,

                                                    14

                                                ]

                                            }

                                        }

                                    }

                                ]

                            }

                        }

                    }

                }

            ],

            "size": 21

        },

        "options": {

            "timeout": "20s",

            "search_type": "dfs_query_then_fetch"

        }

    }

}

DCausse (WMF) (talkcontribs)

Hi,

The proper syntax is -incategory:Sql, and the JSON output you pasted shows that it works, since it contains this section:

"must_not": [
    {
        "bool": {
            "should": [
                {
                    "match": {
                        "category.lowercase_keyword": {
                            "query": "Sql"
                        }
                    }
                }
            ]
        }
    }
]
Nicolas senechal (talkcontribs)

Thank you very much for your patience and your response!

But why does it work without quotes to exclude and work with quotes to include?

DCausse (WMF) (talkcontribs)

incategory:"sql" and incategory:sql should produce the same query.

Similarly, -incategory:"sql" and -incategory:sql should also produce the same query. If not, please try to identify what differs in the cirrusDumpQuery output.
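Comparing two cirrusDumpQuery outputs by eye is tedious given their size. As a sketch, the two JSON documents can be normalised and diffed with Python's standard library; the function name diff_dumps is an assumption of this example. Note that the description field contains the search text itself, so it will differ whenever the two searches are typed differently.

```python
import difflib
import json
from typing import List


def diff_dumps(dump_a: str, dump_b: str) -> List[str]:
    """Return the added and removed lines between two cirrusDumpQuery outputs.

    Both inputs are JSON text. Keys are sorted before comparing, so that
    differences in key order alone are not reported.
    """
    def normalise(text: str) -> List[str]:
        return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()

    diff = difflib.unified_diff(
        normalise(dump_a), normalise(dump_b), "a", "b", n=0, lineterm=""
    )
    # Drop the file headers and hunk markers; keep only changed lines.
    return [
        line
        for line in diff
        if line[:1] in "+-" and not line.startswith(("+++", "---"))
    ]


if __name__ == "__main__":
    quoted = '{"match": {"category.lowercase_keyword": {"query": "sql"}}}'
    unquoted = '{"match": {"category.lowercase_keyword": {"query": "sql"}}}'
    print(diff_dumps(quoted, unquoted))
```

An empty result means the two searches produced the same query.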

Reply to "-incategory don't work"

Code used to parse the text as in cirrus dump

80.12.85.103 (talkcontribs)

Could someone provide the code or the means to parse the wikicode the way it is done for the "text" attribute of the Elasticsearch pages, or tell me where I could find out more? I've been working on NLP dataset generation from Wikipedia dumps, but I can't get satisfactory results with most of the parsers I've tested (mwparserfromhell, wikitextparser, mediawiki-parser). I would need the same text as in the cirrus dump, but with the internal links kept. Thank you for any information!

EBernhardson (WMF) (talkcontribs)

The text used in the CirrusSearch dumps comes from the allText value created by WikiTextStructure::extractWikitextParts.

For the most part, the processing takes the HTML output from MediaWiki's wikitext parser, strips out elements matching a set of CSS selectors identifying some of the non-content and auxiliary parts of a page, and then strips all the tags out of the remaining content.

Unfortunately, I'm not aware of a way to get the bulk HTML content of the wiki. You may need to use the MediaWiki parser, and that may still have difficulties depending on template and Lua usage.
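The two stripping steps described above, adapted so that internal links survive, could look roughly like the following Python sketch. It is not the CirrusSearch implementation: the class names in EXCLUDED_CLASSES are placeholders rather than the real selector list, the /wiki/ link prefix is an assumption about the wiki's article path, and the input is assumed to be well-formed parser HTML.

```python
from html.parser import HTMLParser

# Placeholder exclusion list; the real set of selectors is not reproduced here.
EXCLUDED_CLASSES = {"navbox", "reference"}
# Elements that never have a closing tag and must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
ARTICLE_PREFIX = "/wiki/"  # assumed article path


class LinkKeepingExtractor(HTMLParser):
    """Drop excluded elements, strip tags, keep internal links as [[target|text]]."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0   # > 0 while inside an excluded element
        self.link_stack = []  # one bool per open <a>: is it an internal link?

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        if self.skip_depth or classes & EXCLUDED_CLASSES:
            self.skip_depth += 1
            return
        if tag == "a":
            href = attributes.get("href") or ""
            internal = href.startswith(ARTICLE_PREFIX)
            self.link_stack.append(internal)
            if internal:
                self.parts.append("[[" + href[len(ARTICLE_PREFIX):] + "|")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if tag == "a" and self.link_stack and self.link_stack.pop():
            self.parts.append("]]")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    """Return the visible text of `html` with whitespace collapsed."""
    parser = LinkKeepingExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Link targets are kept exactly as they appear in the href (URL-encoded, with underscores), so they may need decoding before being matched against page titles.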

Reply to "Code used to parse the text as in cirrus dump"

Need to upgrade elastic search Library log4j-1.2-api-2.11.1.jar

Summary by Ciencia Al Poder

Stop creating duplicate posts. Topic:Wm67mprhel2mv59q

Pooja2425 (talkcontribs)

Hi Team,

We are using the following:

MediaWiki 1.35.3
PHP 7.4.23 (apache2handler)
MySQL 8.0.26
Lua 5.1.5
Elasticsearch 6.5.4

/usr/share/elasticsearch/lib/log4j-1.2-api-2.11.1.jar

log4j-api-2.11.1.jar

log4j-core-2.11.1.jar

x-pack-security/log4j-slf4j-impl-2.11.1.jar

Please provide us with a patch that brings log4j to a version higher than 2.15.0.

TheDJ (talkcontribs)

We do not provide Elasticsearch. MediaWiki only uses it. Please contact elastic.co directly, or just restart Elasticsearch with the variable that disables the affected functionality. This is widely documented, but it looks something like -Dlog4j2.formatMsgNoLookups=true

Pooja2425 (talkcontribs)

Thanks @TheDJ for the help.

Please let me know whether I only need to set -Dlog4j2.formatMsgNoLookups=true in etc/elasticsearch/jvm.options, or whether anything else needs to be done.

We are using Elasticsearch v6.5.4 and Java 1.8.0.

Or please point me to a link where I can confirm all this.

Thanks

Pooja2425 (talkcontribs)
TheDJ (talkcontribs)

Please contact elastic.co
