I'm currently working on testing CirrusSearch with AWS Elasticsearch in my dev environment, but first had to implement (AWS Elasticache) Redis for the job queue. However, I was recently told by someone on the Search team (I apologize, I forget who) that this may not be necessary unless there are perhaps hundreds of thousands to millions of jobs as WMF has. I've seen at most maybe 100-120k (aggregate for my five wikis) but mine are usually 100s, occasionally 1000s, and even more rarely 10s of thousands, which were otherwise handled fine in MySQL. So is it really necessary for CirrusSearch to have the job queues in Redis at that level?
Help talk:CirrusSearch
Extension:CirrusSearch#Dependencies has a note about Redis. It's not required, though.
Hi. MediaWiki 1.38.2 and CirrusSearch generate Elasticsearch queries using "query_string". How do I make Cirrus use "match_phrase_prefix" instead?
This would let me find pages using partial keywords. For example, "Cirr" would return pages containing "Cirrus".
Any ideas? Thanks.
Within cirrus we don't have anything that directly supports match_phrase_prefix. We generally avoid this style of query as it provides queries that give unexpected outputs that can change depending on which replicas of the index it lands on. In particular there is no guarantee with match_phrase_prefix that "cirr" will return pages with "Cirrus" inside of them. Instead it will look at term dictionaries and select a number of words somewhat arbitrarily that start with cirr and then search for those words. Depending on the exact term statistics in the replica it lands on this can choose a different set of words to search for when repeating the same query.
While I would generally suggest avoiding it, the existing query_string queries do support this style of query. You can achieve the same functionality by appending a *, such as cirr*
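To make the difference concrete, the two styles of request body look roughly like this in the Elasticsearch query DSL. This is only a sketch and not what CirrusSearch itself emits: the field name `text` and both helper function names are assumptions made for illustration. `max_expansions` is the Elasticsearch option that caps how many prefix-matching terms get picked, shown here at its default of 50.

```python
import json


def match_phrase_prefix_body(field, text, max_expansions=50):
    """Request body for a match_phrase_prefix query (not used by CirrusSearch)."""
    return {
        "query": {
            "match_phrase_prefix": {
                # Only up to max_expansions terms starting with the prefix are searched.
                field: {"query": text, "max_expansions": max_expansions}
            }
        }
    }


def query_string_body(text):
    """Request body for a query_string query; a trailing * asks for a prefix match."""
    return {"query": {"query_string": {"query": text}}}


if __name__ == "__main__":
    # "text" is a hypothetical field name chosen for this example.
    print(json.dumps(match_phrase_prefix_body("text", "cirr"), indent=2))
    print(json.dumps(query_string_body("cirr*"), indent=2))
```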
Thanks for your answer.
It's a Swiss-French medical wiki used by doctors, with a lot of long words. Our basic users don't know Elasticsearch tricks like "*" or "~".
For example, "Prostatectomie" should be found just by entering "Prostat".
So if we cannot use match_phrase_prefix, can we append the final "*" by default to all searches with Cirrus?
To customize the main full text search query you can implement your own \CirrusSearch\Query\FullTextQueryBuilder implementation and register it in the wgCirrusSearchFullTextQueryBuilderProfiles config var, see some examples for other builder profiles here. Then you can activate this new profile as the default by setting wgCirrusSearchFullTextQueryBuilderProfile to its name.
You have some examples of how to implement a FullTextQueryBuilder here.
Note that doing this is not trivial, but I think it is the only way to achieve what you want without teaching your users the search syntax.
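As a rough illustration of the string rewriting such a custom builder would have to perform, here is a minimal Python sketch. It is not CirrusSearch code (a real builder would be a PHP class), and the function name, the `min_length` threshold, and the rule for which tokens count as "bare words" are all assumptions.

```python
# Boolean operators that should never receive a wildcard (assumed list).
OPERATORS = {"AND", "OR", "NOT"}


def add_trailing_wildcard(query, min_length=3):
    """Append * to each bare word so that "Prostat" also matches "Prostatectomie"."""
    rewritten = []
    for token in query.split():
        # Bare word: letters and digits only, so tokens that already carry
        # quotes, a keyword prefix (intitle:), * or ~ are left untouched.
        is_bare_word = token.isalnum() and token not in OPERATORS
        if is_bare_word and len(token) >= min_length:
            rewritten.append(token + "*")
        else:
            rewritten.append(token)
    return " ".join(rewritten)


if __name__ == "__main__":
    print(add_trailing_wildcard("Prostat"))
    print(add_trailing_wildcard('intitle:foo "bar baz"'))
```

Very short words are skipped here because a wildcard on a two-letter prefix expands to a large number of terms; whether that threshold suits a given wiki is a judgment call.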
When trying to post my question here I get the ⧼abusefilter-warning-linkspam⧽ error, so I posted my full question on stackoverflow at questions/75269346 and I will post only a summary here:
I have installed Cirrus, Elastica and Elasticsearch as per the instructions, but no matter what I do (for example php ./maintenance/runJobs.php, php maintenance/updateSpecialPages.php), the number of words on the statistics page never updates.
How can I get that to update? Thanks!
Hi,
This number is cached by CirrusSearch for one day: https://gerrit.wikimedia.org/g/mediawiki/extensions/CirrusSearch/+/be6fd75573ebabbae739823d0b53bac9727ead57/includes/Query/CountContentWordsBuilder.php#15
It might be the reason why it was not updated right after you edited the page.
Thanks! Wow that was driving me nuts!
I haven't been able to figure out why UpdateSearchIndexConfig.php isn't working on the latest MediaWiki 1.39. Could someone help me a bit?
Error Log:
php /extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php
Updating cluster ...
indexing namespaces...
mw_cirrus_metastore missing, creating new metastore index.
Creating metastore index... mw_cirrus_metastore_first Scanning available plugins...none
Elastica\Exception\ResponseException from line 178 of /public/extensions/Elastica/vendor/ruflin/elastica/src/Transport/Http.php:
#0 /public/extensions/Elastica/vendor/ruflin/elastica/src/Request.php(178): Elastica\Transport\Http->exec()
#1 /public/extensions/Elastica/vendor/ruflin/elastica/src/Client.php(513): Elastica\Request->send()
#2 /public/extensions/Elastica/vendor/ruflin/elastica/src/Index.php(655): Elastica\Client->request()
#3 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(201): Elastica\Index->request()
#4 /public/extensions/CirrusSearch/includes/MetaStore/MetaStoreIndex.php(139): CirrusSearch\MetaStore\MetaStoreIndex->createNewIndex()
#5 /public/extensions/CirrusSearch/includes/Maintenance/Maintenance.php(227): CirrusSearch\MetaStore\MetaStoreIndex->createIfNecessary()
#6 /public/extensions/CirrusSearch/maintenance/IndexNamespaces.php(40): CirrusSearch\Maintenance\Maintenance->maybeCreateMetastore()
#7 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(72): CirrusSearch\Maintenance\IndexNamespaces->execute()
#8 /public/maintenance/includes/MaintenanceRunner.php(309): CirrusSearch\Maintenance\UpdateSearchIndexConfig->execute()
#9 /public/maintenance/doMaintenance.php(85): MediaWiki\Maintenance\MaintenanceRunner->run()
#10 /public/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php(117): require_once('/home/nginx/dom...')
#11 {main}
My Mediawiki Info:
Product | Version |
---|---|
Mediawiki | 1.39.1 |
PHP | 7.4.33 (fpm-fcgi) |
MariaDB | 10.3.37-MariaDB |
ICU | 62.2 |
Pygments | 2.11.2 |
Elasticsearch | 7.10.2 |
CirrusSearch | 6.5.4 (e15ac38) 06:42, January 10, 2023 GPL-2.0-or-later |
Elastica | 6.2.0 (1baee3b) 06:13, December 4, 2022 |
Elastic is working fine at my end.
curl -XGET 'localhost:9200'
{
  "name" : "node",
  "cluster_name" : "nodecluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.10.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
    "build_date" : "2021-01-13T00:42:12.435326Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
Hello, quick question: once I have downloaded all the dependencies for the CirrusSearch extension, how do I link it with Elasticsearch?
Is the * treated like a wildcard or like a regex operator? If it's a wildcard, then does it represent one character, a group of characters, a word, or something else? Maybe this isn't the right article to answer such questions, but if it is, then it is remarkably bad.
I suspect the latter because I have not been able to find a link in the article that would direct me to a more appropriate page.
See help page, section Words, phrases, and modifiers:
“A wildcard character inside a word can be an (escaped) question mark \? for one character or an asterisk * character for zero or more characters.”
[…] “The two wildcard characters are the star and the (escaped) question mark, and both can come in the middle or end of a word. The escaped question mark \? stands for one character and the star * stands for any number of characters.” […] “The star * wildcard matches a string of letters and digits within a rendered word, but never the beginning character. One or more characters must precede the * character.
The \? wildcard represents one letter or number; *\? is also accepted, but \?* is not recognized. The wildcards are for basic word, phrase, and insource searches, and may also be an alternative to (some) advanced regex searches (covered later).”
The last words hint that regex searches are also possible, in which the * has a different meaning. “Covered later” refers to the corresponding section.
This sentence from the section on insource searches should be noted as well:
“But indexed searches all ignore greyspace; wildcards searches do not match greyspace, so regexes are the only way to find an exact string of any and all characters”
(You can find all of this by searching for wildcard on the help page.)
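The quoted rules can be modelled with a small Python sketch that translates a wildcard pattern into a regular expression over a single word. This only illustrates the rules as the help page states them; it is not how CirrusSearch or Elasticsearch implements wildcards, and the function names are made up for this example.

```python
import re

# One letter or digit (word characters minus the underscore).
WORD_CHAR = r"[^\W_]"


def wildcard_to_regex(pattern):
    """Translate * and \\? into a regex that must match one whole word."""
    if pattern.startswith("*"):
        # "One or more characters must precede the * character."
        raise ValueError("one or more characters must precede *")
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\?", i):
            parts.append(WORD_CHAR)        # escaped question mark: exactly one character
            i += 2
        elif pattern[i] == "*":
            parts.append(WORD_CHAR + "*")  # star: zero or more characters
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern, word):
    """True if the wildcard pattern matches the whole word."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None


if __name__ == "__main__":
    print(matches("cirr*", "cirrus"))
    print(matches(r"c\?t", "cat"))
    print(matches("cirr*", "cirrus search"))
```

Because the star only expands to letters and digits, a pattern never matches across greyspace, which is the reason the help page points to regex searches for exact strings.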
How to search for an exact string including greyspace characters?
Try "exact string including?" insource:/"exact string including?"/. The last part is found under Regular Expression searches.
Hi,
Is there a profile example showing how to use a synonym file with CirrusSearch and Elasticsearch?
Thanks
Unfortunately, synonyms aren't something CirrusSearch has any support for. It's been in the background as something to work on, but we need to come up with a solution that works in hundreds of languages and likely defers the actual synonym definition to wiki editors rather than system administrators.
While not exactly synonyms, on the WMF wikis we rely on redirects to pages to provide alternate names for them. In most cases where wiki search externally appears to have used synonyms what actually happened was there was a redirect to the page giving alternate titles (that are used as a fairly strong ranking signal).
Thanks for the feedback.
Because Elasticsearch does support synonyms as a filter and Cirrus is really just a bridge to Elasticsearch, I was hoping we could work this out with profiles, such as
'default' => [
    'builder_class' => Query\FullTextQueryStringQueryBuilder::class,
    'settings' => [
        'filter' => [
            'type' => 'synonym',
            'settings' => [
                'synonyms_path' => 'my_synonyms.txt',
                'updateable' => 'true'
            ]
        ]
    ],
],
Synonyms are important to us (medical wiki). For instance, if you search for "audition", you should find not only pages containing "audition" but also pages containing "hear" or "malleus" (a small bone inside the ear).
Editing the page to add synonyms is not an option for us, as this will add a lot of work for page producers.
Hello!
Some of the filenames of these files aren't complete for some reason. Also you would expect that this search would say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)" however it isn't. Might anyone here know why?
Hi,
I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac. Is a specific page missing?
The second query you pasted contains an error. After fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query: intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/
When searching for a redirect, the search engine will always display the redirected page. Sometimes you may see a hint that you matched a redirect, when the mention (redirect from: page_name) appears after the page title; see for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".
DCausse, the filenames are only partially displayed. From the first search Jonteemil provided, it seems there is a maximum display length (some character limit), and the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.
But with an altered search I get the full display, though it links to the redirected file with only one file extension, as already pointed out by DCausse: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/. Note that I merged both regex searches, as a query with two of them is really bad in terms of server load (I also added the namespace as search domain; this should always be added if possible).
Hi - there are two types of dumps available for enwiki pages: a monthly database dump structured in XML, which you can subscribe to, and weekly cirrussearch dumps, which are structured in JSON for bulk upload to Elasticsearch. We're trying to diff the two dumps to see if they're comparable, but notice that some articles are in the monthly XML dump but not in the weekly cirrussearch dump. I'm having trouble finding an explanation on the main Wikimedia homepage that clearly states the difference between these two enwiki dumps. Any additional information would be much appreciated.
I would post links, but am getting an error when trying to post, so please navigate to dumps.wikimedia.org and look for the extensions
cirrussearch dump: /other/cirrussearch/
xml dump: /enwiki/latest/
Check whether the "missing" articles in the CirrusSearch dumps exist on the live wiki. If not, those articles were deleted after the monthly XML dump but before the weekly CirrusSearch dump.
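To get the list of titles worth checking, a diff along these lines may help. It is a minimal Python sketch run on tiny in-memory samples, under several assumptions: that the CirrusSearch dump alternates bulk action lines (containing an "index" key) with document lines carrying "title" and "namespace" fields, that only the article namespace (0) is compared, and that the sample data below stands in for the real files. Real dumps are compressed and far too large to load in one piece, so a streaming parser would be needed there.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for the real dump files.
SAMPLE_CIRRUS = "\n".join([
    json.dumps({"index": {"_type": "page", "_id": "1"}}),
    json.dumps({"namespace": 0, "title": "Kept article"}),
])

SAMPLE_XML = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Kept article</title><ns>0</ns></page>
  <page><title>Deleted later</title><ns>0</ns></page>
  <page><title>Talk:Kept article</title><ns>1</ns></page>
</mediawiki>"""


def cirrus_titles(lines, namespace=0):
    """Collect titles from bulk-format lines, skipping the action lines."""
    titles = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "index" in record:  # bulk action line, not a document
            continue
        if record.get("namespace") == namespace:
            titles.add(record["title"])
    return titles


def local_name(tag):
    """Strip the XML namespace prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def xml_titles(xml_text, namespace=0):
    """Collect page titles of one namespace from an XML export."""
    titles = set()
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child.text for child in elem}
        if fields.get("ns") == str(namespace):
            titles.add(fields["title"])
    return titles


if __name__ == "__main__":
    missing = xml_titles(SAMPLE_XML) - cirrus_titles(SAMPLE_CIRRUS.splitlines())
    print(sorted(missing))
```

The titles left in `missing` are the ones to look up on the live wiki.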