Help talk:CirrusSearch

Justin C Lloyd (talkcontribs)

I'm currently working on testing CirrusSearch with AWS Elasticsearch in my dev environment, but first had to implement (AWS Elasticache) Redis for the job queue. However, I was recently told by someone on the Search team (I apologize, I forget who) that this may not be necessary unless there are perhaps hundreds of thousands to millions of jobs as WMF has. I've seen at most maybe 100-120k (aggregate for my five wikis) but mine are usually 100s, occasionally 1000s, and even more rarely 10s of thousands, which were otherwise handled fine in MySQL. So is it really necessary for CirrusSearch to have the job queues in Redis at that level?

Ciencia Al Poder (talkcontribs)

Using match_phrase_prefix

2001:1711:FA4B:D10:1163:390A:525:F58B (talkcontribs)

Hi. MediaWiki 1.38.2 and CirrusSearch generate Elasticsearch queries using "query_string". How do I make Cirrus use "match_phrase_prefix" instead?

This would let me find pages using partial keywords. For example, "Cirr" would return pages containing "Cirrus".

Any ideas? Thanks.

EBernhardson (WMF) (talkcontribs)

Within cirrus we don't have anything that directly supports match_phrase_prefix. We generally avoid this style of query as it provides queries that give unexpected outputs that can change depending on which replicas of the index it lands on. In particular there is no guarantee with match_phrase_prefix that "cirr" will return pages with "Cirrus" inside of them. Instead it will look at term dictionaries and select a number of words somewhat arbitrarily that start with cirr and then search for those words. Depending on the exact term statistics in the replica it lands on this can choose a different set of words to search for when repeating the same query.

While I would generally suggest avoiding it, the existing query_string queries do support this style of query. You can achieve the same functionality by appending a *, such as cirr*
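To make the two query styles discussed above concrete, here is a minimal sketch of the Elasticsearch request bodies, built as Python dicts. The field name "text" is a placeholder chosen for illustration only; it is not meant to reflect the fields or the query CirrusSearch actually generates.

```python
import json

# Assumption: "text" is a hypothetical field name used only for illustration.
FIELD = "text"


def match_phrase_prefix_query(prefix, max_expansions=50):
    """Build a match_phrase_prefix body.

    The last term is treated as a prefix and expanded to at most
    max_expansions terms taken from the term dictionary of the shard
    that serves the request.
    """
    return {
        "query": {
            "match_phrase_prefix": {
                FIELD: {"query": prefix, "max_expansions": max_expansions}
            }
        }
    }


def query_string_prefix_query(prefix):
    """Build a query_string body that uses a trailing * wildcard."""
    return {
        "query": {
            "query_string": {"query": prefix + "*", "default_field": FIELD}
        }
    }


if __name__ == "__main__":
    print(json.dumps(match_phrase_prefix_query("cirr"), indent=2))
    print(json.dumps(query_string_prefix_query("cirr"), indent=2))
```

The max_expansions cap is what makes the first form pick only a subset of the words starting with the prefix.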

2001:1711:FA4B:D10:1029:E025:F952:194F (talkcontribs)

Thanks for your answer.

It's a Swiss-French medical wiki used by doctors, with a lot of long words. Our basic users don't know Elastic tricks like "*" or "~".

For example, "Prostatectomie" should be found just by entering "Prostat".

So if we cannot use match_phrase_prefix, can we append the final "*" by default to all searches with Cirrus?

DCausse (WMF) (talkcontribs)

To customize the main full text search query you can write your own implementation of \CirrusSearch\Query\FullTextQueryBuilder and register it in the wgCirrusSearchFullTextQueryBuilderProfiles config var, see some examples for other builder profiles here. Then you can activate this new profile as the default by setting wgCirrusSearchFullTextQueryBuilderProfile to its name.

You have some examples of how to implement a FullTextQueryBuilder here.

Note that doing this is not trivial, but I think it is the only way to achieve what you want without teaching your users the search syntax.
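The "append * by default" idea from this thread can be sketched as a small query-rewriting function. This is only an illustration of the rewriting rule in Python, not CirrusSearch code: the function name and the list of syntax characters are assumptions, and a real solution would live inside a custom FullTextQueryBuilder written in PHP.

```python
# Characters that suggest the user is already using search syntax.
# Assumption: this list is illustrative, not the full CirrusSearch syntax.
SYNTAX_CHARS = '*?~":/'


def add_trailing_wildcard(user_query):
    """Append '*' to the last word of a plain query.

    Queries whose last word already contains search syntax
    (wildcards, fuzziness, quotes, keywords such as intitle:)
    are returned unchanged.
    """
    query = user_query.strip()
    if not query:
        return query
    last_word = query.split()[-1]
    if any(ch in last_word for ch in SYNTAX_CHARS):
        return query
    return query + "*"


if __name__ == "__main__":
    for q in ["Prostat", "cancer prostat", "cirr*", "intitle:foo"]:
        print(repr(q), "->", repr(add_trailing_wildcard(q)))
```

Only the last word is expanded here, which mirrors typing cirr* by hand.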

"Words in all content pages" on Special:Statistics not updated when words are added and php ./maintenance/runJobs.php yields "Job queue is empty."

Alberto56789 (talkcontribs)

When trying to post my question here I get the ⧼abusefilter-warning-linkspam⧽ error, so I posted my full question on Stack Overflow at questions/75269346 and will post only a summary here:

I have installed Cirrus, Elastica and Elasticsearch as per the instructions, but no matter what I do (for example php ./maintenance/runJobs.php or php maintenance/updateSpecialPages.php), the number of words on the statistics page never updates.

How can I get that to update? Thanks!

DCausse (WMF) (talkcontribs)
216.246.250.184 (talkcontribs)

Thanks! Wow that was driving me nuts!

UpdateSearchIndexConfig.php not working?

Summary by DCausse (WMF)
Gamebrew (talkcontribs)

I can't figure out why UpdateSearchIndexConfig.php isn't working on the latest MediaWiki 1.39. Could someone help me out?


Error Log:

php /extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php

Updating cluster ...
indexing namespaces...
mw_cirrus_metastore missing, creating new metastore index.
Creating metastore index... mw_cirrus_metastore_first   Scanning available plugins...none
Elastica\Exception\ResponseException from line 178 of /public/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php:
#0 /public/extensions/Elastica/vendor/ruflin/elastica/src/Request.php(178): Elastica\Transport\Http->exec()
#1 /public/extensions/Elastica/vendor/ruflin/elastica/src/Client.php(513): Elastica\Request->send()
#2 /public/extensions/Elastica/vendor/ruflin/elastica/src/Index.php(655): Elastica\Client->request()
#3 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(201): Elastica\Index->request()
#4 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(139): CirrusSearch\MetaStore\MetaStoreIndex->createNewIndex()
#5 /public/extensions/CirrusSearch/includes/Maintenance/Maintenance.php(227): CirrusSearch\MetaStore\MetaStoreIndex->createIfNecessary()
#6 /public/extensions/CirrusSearch/maintenance/IndexNamespaces.php(40): CirrusSearch\Maintenance\Maintenance->maybeCreateMetastore()
#7 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(72): CirrusSearch\Maintenance\IndexNamespaces->execute()
#8 /public/maintenance/includes/MaintenanceRunner.php(309): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
#9 /public/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#10 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(117): require_once('/home/nginx/dom...')
#11 {main}


My Mediawiki Info:

Product Version
Mediawiki 1.39.1
PHP 7.4.33 (fpm-fcgi)
MariaDB 10.3.37-MariaDB
ICU 62.2
Pygments 2.11.2
Elasticsearch 7.10.2
CirrusSearch 6.5.4 (e15ac38) 06:42, January 10, 2023 GPL-2.0-or-later
Elastica 6.2.0 (1baee3b) 06:13, December 4, 2022

Elastic is working fine at my end.

curl -XGET 'localhost:9200'

{
  "name" : "node",
  "cluster_name" : "nodecluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

46.193.3.148 (talkcontribs)

Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?

Ciencia Al Poder (talkcontribs)
Summary by Speravir

Question for wildcards answered, at least questioner was satisfied (thanked in background).

Mystyc1 (talkcontribs)

Is the * treated like a wildcard or like a regex operator? If it's a wildcard, then does it represent one character, a group of characters, a word, or something else? Maybe this isn't the right article to answer such questions, but if it is, then it is remarkably bad.  

I suspect the latter because I have not been able to find a link in the article that would direct me to a more appropriate page.

Speravir (talkcontribs)

See help page, section Words, phrases, and modifiers:

The last words hint that regex searches are possible, in which the * has a different meaning. “Covered later” refers to the corresponding section.

This sentence from section for insource searches should be noted, as well:

(You can find all of this by searching for wildcard on the help page.)

217.117.125.83 (talkcontribs)

How to search for an exact string including greyspace characters?

Speravir (talkcontribs)
Reply to "greyspace characters"
2001:1711:FA4B:D10:B1BE:F13C:8327:704F (talkcontribs)

Hi,

Is there any profile example showing how to use a synonym file with CirrusSearch and Elasticsearch?

Thanks

EBernhardson (WMF) (talkcontribs)

Unfortunately synonyms aren't something CirrusSearch has any support for. It's been in the background as something to work on, but we need to come up with a solution that works in hundreds of languages and likely defers the actual synonym definition to wiki editors rather than system administrators.

While not exactly synonyms, on the WMF wikis we rely on redirects to pages to provide alternate names for them. In most cases where wiki search externally appears to have used synonyms what actually happened was there was a redirect to the page giving alternate titles (that are used as a fairly strong ranking signal).

Aparolini (talkcontribs)

Thanks for the feedback.

Because Elasticsearch does support synonyms as a filter, and Cirrus is really just a bridge to Elasticsearch, I was hoping we could work this out with profiles, such as

'default' => [
	'builder_class' => Query\FullTextQueryStringQueryBuilder::class,
	'settings' => [
		'filter' => [
			'type' => 'synonym',
			'settings' => [
				'synonyms_path' => 'my_synonyms.txt',
				'updateable' => 'true',
			],
		],
	],
],

Synonyms are important to us (medical wiki). For instance, if you search for "audition", you should find not only pages with "audition" in them, but also pages with "hear" or "malleus" (a small bone inside the ear).

Editing the page to add synonyms is not an option for us, as this will add a lot of work for page producers.
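For reference, here is a hedged sketch of what a synonym filter looks like at the plain Elasticsearch level, outside CirrusSearch. As noted earlier in this thread, CirrusSearch has no support for synonyms, so nothing here is a Cirrus profile setting. The filter name, analyzer name and file path are all hypothetical, and the body is the kind of analysis configuration one would pass when creating a standalone index.

```python
import json

# Assumption: "my_synonyms", "my_search_analyzer" and the file path are
# hypothetical names used only for illustration.
index_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym_graph",
                    # Path is resolved relative to the Elasticsearch config directory.
                    "synonyms_path": "analysis/my_synonyms.txt",
                    # Updateable filters can only be used in search-time analyzers.
                    "updateable": True,
                }
            },
            "analyzer": {
                "my_search_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(index_settings, indent=2))
```

A line such as audition, hearing, malleus in the synonyms file would treat those words as equivalent at search time.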

Reply to "Synonyms"
Jonteemil (talkcontribs)

Hello!

Some of the filenames of these files aren't complete for some reason. Also you would expect that this search would say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)" however it isn't. Might anyone here know why?

DCausse (WMF) (talkcontribs)

Hi,

I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac. Is a specific page missing?

The second query you pasted contains an error. After fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query:

intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/.

When searching for a redirect, the search engine will always display the redirected page. Sometimes you may see a hint that you matched a redirect, when the mention (redirect from: page_name) appears after the page title; see for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".

Speravir (talkcontribs)

DCausse, the filenames are only partially displayed, from the first search Jonteemil provided it seems there is a maximum length for display, some limit for characters, and then the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.

With an altered search I do get the full display, although, as DCausse already pointed out, it links to the redirected file with only one file extension: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/. Note that I merged both regex searches, since a query with two of them is really bad in terms of server load. I also added the namespace as the search domain; this should always be added if possible.

Reply to "2 questions"

cirrussearch vs database backup dumps

69.191.241.48 (talkcontribs)

Hi - there are two types of dumps available for enwiki pages: the monthly database dump, structured in XML, which you can subscribe to, and the weekly cirrussearch dumps, which are structured in JSON for bulk upload to Elasticsearch. We're trying to diff the two dumps to see if they're comparable, but we notice that some articles are in the monthly XML dump and not in the weekly cirrussearch dump. I'm having trouble finding an explanation on the main Wikimedia homepage that clearly states the difference between these two enwiki dumps. Any additional information would be much appreciated.

I would post links, but am getting an error when trying to post, so please navigate to dumps.wikimedia.org and look for the extensions

cirrussearch dump: /other/cirrussearch/

xml dump: /enwiki/latest/

Ciencia Al Poder (talkcontribs)

Check whether the "missing" articles in the cirrussearch dumps exist on the live wiki. If not, those articles were deleted after the monthly XML dump but before the weekly cirrussearch dump.
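A minimal sketch of the comparison discussed in this thread, assuming the cirrussearch dump uses the Elasticsearch bulk layout (an action line such as {"index": {"_id": "12"}} followed by a document line) and that the action line's _id corresponds to the page ID in the XML dump. Both points are assumptions to verify against the real files; the function names are made up for this example.

```python
import json


def cirrus_page_ids(lines):
    """Collect page IDs from bulk-format lines.

    Assumption: each document is preceded by an action line of the form
    {"index": {"_id": "<page id>", ...}}; document lines are skipped.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        action = record.get("index")
        if isinstance(action, dict) and "_id" in action:
            ids.add(str(action["_id"]))
    return ids


def missing_from_cirrus(xml_page_ids, cirrus_lines):
    """Return the page IDs present in the XML dump but absent from the cirrussearch dump."""
    return {str(page_id) for page_id in xml_page_ids} - cirrus_page_ids(cirrus_lines)


if __name__ == "__main__":
    sample_bulk = [
        '{"index": {"_type": "page", "_id": "1"}}',
        '{"title": "Alpha", "namespace": 0}',
        '{"index": {"_type": "page", "_id": "2"}}',
        '{"title": "Beta", "namespace": 0}',
    ]
    print(missing_from_cirrus([1, 2, 3], sample_bulk))
```

The IDs this reports can then be looked up on the live wiki, as suggested above, to see whether the pages were deleted between the two dump dates.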
