Help talk:CirrusSearch


Safe sandbox

Wakelamp (talkcontribs)

Is there a sandbox to test in, so I don't affect performance?

Abuluntu (talkcontribs)

What are you planning to test? Abuluntu (talk) 12:46, 1 November 2021 (UTC)

Wakelamp (talkcontribs)

No, I am not doing a performance load test :-) I just saw the performance issue affecting others. Have many people screwed this up? I assume you have some sort of timeout - is that a switch I can set lower? Anyway, I am trying to work out ways of getting what I am after some other way.

Abuluntu (talkcontribs)

I’m not sure if I’m the right one to ask. If there is a performance issue it’s probably discussed at phabricator. Wish you the best of luck!

EBernhardson (WMF) (talkcontribs)

If you are referring to the WMF wikis, when the search system cuts off a query for performance reasons (typically it took too long to execute), that's normal and expected behaviour that shouldn't negatively impact others (assuming you aren't a bot making many parallel queries). Overall you shouldn't need to worry about it, beyond pondering how to construct a query that doesn't time out.

If you are trying to construct a query and it keeps timing out, could you post some details of the information you are trying to retrieve? Someone might be able to point out a more efficient way to get the same information.

Reply to "Safe sandbox"

Problem indexing PDF documents that include Cyrillic characters

LveFunc (talkcontribs)

Hello guys,


Lately I've been working on deploying a local MediaWiki. Everything went smoothly until it came to indexing the contents of PDF files that contain characters other than US ASCII. Doing '?action=cirrusDump' and looking at the 'file_text' field shows that all Cyrillic characters are getting dropped while Latin characters are preserved. The folks at ru.wikipedia.org somehow managed to do it, but I couldn't find a solution online. I would be very thankful if somebody could point out why this happens and how I could potentially solve the problem.

My configuration is:

MediaWiki - 1.36.1

PHP - 7.4.22 (apache2handler)

PostgreSQL - 13.3

ICU - 66.1

Elasticsearch - 6.5.4

PDF Handler - c9705a8

AdvancedSearch - c8a42b8

CirrusSearch - 6.5.4 (ab802b7)

Elastica - 6.1.3 (9f6e66a)

My elasticsearch configuration:

analysis-icu

extra MediaWiki plugin

ingest-attachment

DCausse (WMF) (talkcontribs)

Hi,

CirrusSearch does not manipulate the text it receives from Extension:PdfHandler. I would check whether this extension is working properly, especially whether the tooling it depends on (set via $wgPdftoText, likely to be pdftotext) is properly extracting the text you expect.
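For example, one way to check that by hand (assuming the configured tool is poppler's pdftotext; the file path below is just a placeholder) is something like:

pdftotext -enc UTF-8 /path/to/Example.pdf - | head
# If the Cyrillic text is already missing or garbled here, the problem is in the
# extraction step (or the PDF has no text layer), not in CirrusSearch itself.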

Exclude edits by users

DonSimon (talkcontribs)

How can I exclude edits by certain users (say, a given set of users) from search results via Cirrus?

It would be useful for patrolling recent changes when you don't want to see the 2 most active users right now, for example.

PerfektesChaos (talkcontribs)

Unsatisfying answer: Search can analyze content only, but not metadata.

  • The adjoined metadata of a page would be timestamp, user, summary etc.
  • The content is the visible text of the rendered page, or source text of the page itself without transclusions.

For recent changes a search within all articles or pages within the wiki is the wrong tool.

  • There might already be gadgets which filter the recent changes list by a set of trusted users, to focus on less well known people. At least a gadget programmer could easily filter those entries out by screen scraping. This one could do that, among many other things, but it is not focused on such a list of trusted users.
  • The MediaWiki software itself offers options on the watchlist and recent changes (which are based on the same code) to suppress bots, registered users, or minor edits (unsafe).
Reply to "Exclude edits by users"

"Couldn't connect to host, Elasticsearch down?", Elastica\Exception\Connection\HttpException

Pooja2425 (talkcontribs)

Hi, I am using the setup below. I installed the Elasticsearch server, but search is not working, as shown below.

Product Version
MediaWiki 1.35.3
PHP 7.4.24 (apache2handler)
MySQL 8.0.26
ICU 65.1
Lua 5.1.5

I also installed the required extensions and the "elasticsearch/elasticsearch": "6.5" client.

Elastica 6.1.3 (f3c9459) 01:29, 3 September 2021
CirrusSearch 6.5.4 (95b958b) 19:07, 20 August 2021

wfLoadExtension( 'Elastica' );

wfLoadExtension( 'CirrusSearch' );

$wgDisableSearchUpdate = true;

$wgCirrusSearchIndexBaseName =  ''; //DataBase Name

php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php

Now remove $wgDisableSearchUpdate = true;

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse

$wgSearchType = 'CirrusSearch';

# php /data/www/html/wiki/maintenance/runJobs.php


1) When trying to search anything in the search engine:

An error has occurred while searching: We could not complete your search due to a temporary problem. Please try again later.


2) http://wiki/api.php?action=cirrus-settings-dump


"code": "internal_api_error_Elastica\\Exception\\Connection\\HttpException", "info": "[YUsbMuAOtsHM4e8mbGG6NwAAAAs] Exception caught: Couldn't connect to host, Elasticsearch down?", "errorclass": "Elastica\\Exception\\Connection\\HttpException", "*": "Elastica\\Exception\\Connection\\HttpException at /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Transport/Http.php(190)\n#0 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Request.php(194): Elastica\\Transport\\Http->exec()\n#1 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Client.php(689): Elastica\\Request->send()\n#2 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index.php(571): Elastica\\Client->request()\n#3 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(383): Elastica\\Index->request()\n#4 /data/www/html/wiki/extensions/Elastica/vendor/ruflin/elastica/lib/Elastica/Index/Settings.php(75): Elastica\\Index\\Settings->request()\n#5 /data/www/html/wiki/extensions/CirrusSearch/includes/Api/SettingsDump.php(36): Elastica\\Index\\Settings->get()\n#6 /data/www/html/wiki/includes/api/ApiMain.php(1593): CirrusSearch\\Api\\SettingsDump->execute()\n#7 /data/www/html/wiki/includes/api/ApiMain.php(529): ApiMain->executeAction()\n#8 /data/www/html/wiki/includes/api/ApiMain.php(500): ApiMain->executeActionWithErrorHandling()\n#9 /data/www/html/wiki/api.php(90): ApiMain->execute()\n#10 /data/www/html/wiki/api.php(45): wfApiMain()\n#11 {main}" }


Please suggest which step I am missing.

Reply to "Couldn't connect to host, Elasticsearch down?", Elastica\\Exception\\Connection\\HttpException"
Set up CirrusSearch

Bwwiseman (talkcontribs)

Hi,


I believe the connection is working: if I change the config to an invalid server it fails, and when I correct it, it works again.

$wgCirrusSearchServers = [ [ 'host' => "192.168.1.blahblah", 'port' => 31130 ]  ];


My expectation, which may be incorrect, was that I could type in a subheading, e.g. "Cors", and the page containing ==Cors== would be returned in the search results.


Here is my set up


Thanks


LocalSettings

# This is the Elasticsearch config

# Don't forget Kubernetes

wfLoadExtension( 'Elastica' );

wfLoadExtension('CirrusSearch');

#$wgDisableSearchUpdate = true;

$wgCirrusSearchServers = [ [ 'host' => "192.168.1.70", 'port' => 31130 ]  ];

$wgSearchType = 'CirrusSearch';

Add this to LocalSettings.php:

wfLoadExtension( 'Elastica' );

wfLoadExtension( 'CirrusSearch' );

$wgDisableSearchUpdate = true;

Add the Elasticsearch endpoint or endpoints:

$wgCirrusSearchServers = [ 'elasticsearch0', 'elasticsearch1' ];

Now run this script to generate your elasticsearch index:

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php

Now remove $wgDisableSearchUpdate = true from LocalSettings.php.  Updates should start heading to Elasticsearch.

Next bootstrap the search index by running:

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip

php $MW_INSTALL_PATH/extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse


I am using elasticsearch 6.5.4

DCausse (WMF) (talkcontribs)

Hi,

It seems to me that the job queue (Manual:Job_queue) might not be properly configured. CirrusSearch relies on it to ship documents to Elasticsearch, and if it is not working properly the indices may remain empty.
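If you want to check, something along these lines (run from the MediaWiki root; exact job type names may vary by version) shows what is queued and processes it by hand:

php maintenance/showJobs.php --group        # per-type counts, e.g. cirrusSearchLinksUpdate
php maintenance/runJobs.php --maxjobs 1000  # work through part of the queue manually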

Reply to "Set up CirrusSearch"
2 questions

Jonteemil (talkcontribs)

Hello!

Some of the filenames of these files aren't complete for some reason. Also, you would expect that this search would say "File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac (redirect from File:PDP-CH - Philadelphia Orchestra - -Wikipedia-Leopold Stokowski-Leopold Stokowski - Brandenburg Concerto No. 2 in F major, BWV 1047 - 1st Movement- Allegro - Johann Sebastian Bach - Hmv-d1708-42-606.flac.flac)", however it doesn't. Might anyone here know why?

DCausse (WMF) (talkcontribs)

Hi,

I'm not sure I understand what is incomplete in the query results for intitle:/Philadelphia Orchestra/ filemime:audio/x-flac. Is there a specific page missing?

The second query you pasted contains an error; after fixing it (I assumed you wanted to search for the redirect with the doubled .flac extension), it finds the page you mention. Here is how I fixed the query:

intitle:/Philadelphia Orchestra/ filemime:audio/x-flac intitle:/\.flac\.flac/.

When searching for a redirect, the search engine will always display the redirect's target page; sometimes you may see a hint that you matched a redirect when the note (redirect from: page_name) appears after the page title. See for instance the results for: intitle:/\.flac\.flac/ filemime:audio/x-flac incategory:"Swiss Foundation Public Domain".

Speravir (talkcontribs)

DCausse, the filenames are only partially displayed. From the first search Jonteemil provided, it seems there is a maximum display length (some limit on the number of characters), and the second search condition intitle:/\.flac\.flac/ only narrows down the result(s) without adjusting the displayed lines.

But with an altered search I get the full display; it links to the redirect's target file with only one file extension, though, as already pointed out by DCausse: file: filemime:audio/x-flac intitle:/Philadelphia Orchestra.+\.flac\.flac/. Note that I merged both regex searches, as the version with two of them is really bad in terms of server load (I also added the namespace as search domain; this should always be added if possible).

Reply to "2 questions"

Can't find words that include ʿ or ʾ

Pooya72 (talkcontribs)

Hello everybody,


In our MediaWiki we have titles that use diacritics for transliterations. Since CirrusSearch does accent/diacritic folding out of the box, we are able to search for pages without using diacritics. This is important especially for mobile users. However, if a word includes either ʿ or ʾ then it is not possible to find a page by searching for that word without including either ʿ,ʾ, or a \? (wildcard). I was wondering if it was possible to add ʿ and ʾ to the filter to have words be searchable without including them. For example:

Search for ʿilm by typing ilm, or 'ilm (normal quote).


This is our setup:

MediaWiki 1.35.3
PHP 7.4.22 (apache2handler)
MariaDB 10.4.20-MariaDB
ICU 66.1
Semantic MediaWiki 3.2.3
Elasticsearch 6.5.4
CirrusSearch 6.5.4 (ad4210f)
Elastica 6.1.1


Regards,


Pooya

TJones (WMF) (talkcontribs)

Different languages have different analysis chains configured, beyond just the language-specific components. We have a long-term plan (see T219550) to make them more consistent. I mention that because your description of the situation doesn't match the behavior I'm seeing on English Wiktionary. (If I search for ʿayin on English Wiktionary, then ʿayin is the first result and ayin is second. If I search for ayin, then ayin is first and ʿayin is third. If I search for 'ayin, then ayin is second and ʿayin is third.) What language config are you using?

You should be able to add something to your analysis chain to deal with this. I looked in detail at the English analysis and ʿ & ʾ are both removed by icu_folding. Ahh... that could be it! If you don't have analysis-icu installed, then you get ascii_folding, not icu_folding from the default English config, and ascii_folding does not remove ʿ & ʾ. That could be it!

If you don't want to or can't install analysis-icu or don't want all the aggressive folding of icu_folding, then you could have a much more targeted solution by adding a character filter to remove ʿ & ʾ. (You could map them to ' if ' gets used elsewhere in similar contexts, but it may lead to unwanted behavior. The standard tokenizer strips ' at the edges of words, and aggressive_splitting splits on ', so a'ilm'b gets tokenized as a, ilm, b, while aʿilmʾb gets tokenized as ailmb—at least when icu_folding is enabled.)
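For illustration, such a targeted character filter would look roughly like this in the Elasticsearch index settings (the filter name here is made up, and wiring it into the analysis config that CirrusSearch generates is a separate step):

"char_filter": {
  "strip_half_rings": {
    "type": "mapping",
    "mappings": [ "\u02BF=>", "\u02BE=>" ]
  }
}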

I hope that helps. If that doesn't address all your problems, let us know what plugins you have installed, please!

—Trey

Pooya72 (talkcontribs)

Thanks @TJones (WMF)! Looks like installing analysis-icu is the way to go. Tokenizing aʿilmʾb as ailmb is what we want. If you could point me towards the relevant documentation that would be great.


Edit: I installed the plugin as per the instructions here, and saw the plug-in come up when I restarted elasticsearch. I also added $wgCirrusSearchUseIcuFolding = 'yes'; to LocalSettings.php but the search functionality is still the same. My current language code is:

$wgLanguageCode = "en-gb";

TJones (WMF) (talkcontribs)

Glad to help. Looks like you found the documentation on your own!

Did you also reindex with UpdateSearchIndexConfig.php? It will rebuild the analysis chain with the ICU upgrades and reindex.

You can search for "text" and "plain" (with quotes) on the English Wikipedia config to see what the analysis config should look like, and see where yours differs (if it does).

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

Sounds good, @Pooya72! Monday is a holiday for us, so we'll check in on Tuesday.

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

@Pooya72, check out the docs on in-place reindexing for UpdateSearchIndexConfig.php usage. You need to call it with mwscript. Internally, we use the reindex() function, or some variation of it, to reindex our wikis. In addition to calling UpdateSearchIndexConfig.php, it keeps track of the time the reindex started (REINDEX_START) and calls ForceSearchIndex.php to catch up on activity that happened while the reindex was running—reindexing English Wikipedia, for example, can take hours, so a lot can happen in the meantime.
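Roughly, that sequence looks like this (a sketch only; the exact --from syntax may differ between versions):

REINDEX_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --from "$REINDEX_START"   # catch up on edits made during the reindex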

TJones (WMF) (talkcontribs)

Just wanted to call out that the params --reindexAndRemoveOk --indexIdentifier now are really the ones necessary to make the reindex happen. You also need to specify the wiki, and the cluster if you have multiple clusters (and then you need to reindex each cluster, too). If things don't seem to work, please share the command you used and its output.

Pooya72 (talkcontribs)

Thanks again @TJones (WMF). I ran php ./extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier=now and this was the output: pastebin. We only have one instance, and it's at the beginning of the project so there are only a handful of entries, and no users.

TJones (WMF) (talkcontribs)

That looks like a successful run, but I see that you only have the analysis-icu plugin. You also need the Wikimedia extra plugin to enable icu_folding. I'm sorry that I'm not familiar with the base MediaWiki install, so I didn't realize this sooner; I had assumed extra and experimental-highlighter and maybe ltr would be installed by default. You can install extra like so: elasticsearch-plugin install org.wikimedia.search:extra:6.5.4

After that, reindex again and let's see where we are.
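To double-check which plugins the cluster actually has loaded (assuming Elasticsearch is on localhost:9200), you can run:

elasticsearch-plugin list
curl 'http://localhost:9200/_cat/plugins?v'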

Pooya72 (talkcontribs)
TJones (WMF) (talkcontribs)

Woo-hoo! Glad to help.

Reply to "Can't find words that include ʿ or ʾ"

Mediawiki fulltext search for czech language

Svrl (talkcontribs)

Dear Mediawiki CirrusSearch community,

I would like to ask how one has to set up a MediaWiki Elasticsearch (ES) index to allow Czech fulltext search, specifically with the following: icu_folding, a Czech stemmer, and lowercasing. I have installed CirrusSearch, Elastica, and Elasticsearch in and around my MediaWiki (MW) installation. I am currently on MW 1.31 with ES 5.6.16. I have these versions available because of internal requirements, but there is the possibility of upgrading to MW 1.35 and ES 6.5.4.


I think I have installed (LocalSettings.php reference, run php maintenance/update.php and so on) & configured everything properly according to these steps:

1) Add to LocalSettings.php: $wgDisableSearchUpdate = true;

2) Generate the ES index: php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php

* if the index has already been created with my settings, this requirement pops up:

--startOver or --reindexAndRemoveOk

I went with --startOver, because --reindexAndRemoveOk did nothing but show the same message.

3) Remove from LocalSettings.php: $wgDisableSearchUpdate = true;

4) Bootstrap index: php extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipLinks --indexOnSkip

5) Bootstrap index: php extensions/CirrusSearch/maintenance/forceSearchIndex.php --skipParse

6) Add $wgSearchType = 'CirrusSearch';



Intended index settings (one of my examples, without a dictionary), created via the curl CLI (I know the MW index is more detailed):

curl -X PUT localhost:9200/omkmediawikitest_general_first/ -d '
{
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0",
      "analysis": {
        "analyzer": {
          "czech": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "czech_stemmer", "icu_folding"]
          }
        },
        "filter": {
          "czech_stemmer": {
            "type": "stemmer",
            "name": "czech"
          }
        }
      }
    }
  }
}'

curl -X PUT localhost:9200/omkmediawikitest_content_first/ -d '
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "czech": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "czech_stemmer", "icu_folding"]
          }
        },
        "filter": {
          "czech_stemmer": {
            "type": "stemmer",
            "name": "czech"
          }
        }
      }
    }
  }
}'




Questions:

  1. Is it even possible to set up the MediaWiki index according to my needs? I think the Czech Wikipedia has already solved this issue, so there should be a solution: https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:Verze. Is this case solvable by upgrading to MW 1.35, where the corresponding ES version allows these settings automatically?
  2. Should I somehow edit how the MW index is created to include my own settings? Or should I somehow make the index only accept MW settings that add to mine instead of overwriting them? Or should I add my own settings to the MW-created index and reindex it again with the proper MW settings plus mine?
  3. How can I solve this?
DCausse (WMF) (talkcontribs)

You can definitely run what is running on cs.wikipedia.org. You can even have a look at the analysis config that is there.

For this, you first need to install these Elasticsearch plugins: analysis-icu and the Wikimedia extra plugin.

Then you will have to set your wiki configuration as follow:

$wgLanguageCode = 'cs';
$wgCirrusSearchUseIcuFolding = 'yes';

Note that cs.wikipedia.org does not use ICU folding by default yet, since we haven't investigated what set of chars should not be folded. If you want some chars to be skipped by ICU folding then you can use $wgCirrusSearchICUFoldingUnicodeSetFilter:

// e.g. do not fold åäöÅÄÖ into a or o
$wgCirrusSearchICUFoldingUnicodeSetFilter = "[^åäöÅÄÖ]";

And re-create your index using: php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now

Svrl (talkcontribs)

Hello, thank you for the answer! :)


I would like to mention that there is no word in the README documentation about the Wikimedia extra plugin, and maybe I missed it in the official web documentation. Only after your reply did I find a mention in /docs/settings.txt.

When I go to the analysis config, this error is given:

{ "error": { "code": "badvalue", "info": "Unrecognized value for parameter \"action\": cirrus-settings-dump|running.", "*": "See https://cs.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes." }, "servedby": "mw1388" }


Is the Wikimedia extra plugin available for version 5.6.16 or should I upgrade to ES 6.5.4 = MW 1.35?

These settings for LocalSettings.php - is there any specific order in which to add them to the configuration?

$wgLanguageCode = 'cs';
$wgCirrusSearchUseIcuFolding = 'yes';
$wgCirrusSearchICUFoldingUnicodeSetFilter = "[^åäöÅÄÖ]";

So after I upgrade and install, or just install the missing Wikimedia extra plugin, I will:

Do the steps from my question (1-6) and continue with your suggestions:

7. Add the configuration you suggested above: *LangCode*, *SearchIcu*, *SetFilter*

8. And perform this command: php extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now


Is this order correct? Can you please correct me if not?

DCausse (WMF) (talkcontribs)

Sorry for the broken link, it should be: https://cs.wikipedia.org/w/api.php?action=cirrus-settings-dump

I'm also sorry the documentation is very lacking... the CirrusSearch.php has some documentation about the various config options (docs/settings.txt as well).

For earlier versions of the extra plugin, the latest of the 5.6 series we built is: https://repo1.maven.org/maven2/org/wikimedia/search/extra/5.6.14/extra-5.6.14.zip (if you want to make it compatible with 5.6.16 you might want to try unzipping it and changing the plugin descriptor file to force the elastic version to 5.6.16; it might just work. Otherwise you will have to build it from source from the 5.6 branch: https://gerrit.wikimedia.org/r/plugins/gitiles/search/extra/+/refs/heads/5.6).
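A rough, untested sketch of that descriptor tweak (the zip layout and the exact property name are assumptions, so double-check them against the file you download):

unzip extra-5.6.14.zip -d extra-plugin
find extra-plugin -name plugin-descriptor.properties \
  -exec sed -i 's/^elasticsearch.version=.*/elasticsearch.version=5.6.16/' {} +
# Re-zip with the same internal layout as the original archive before installing it.
(cd extra-plugin && zip -r ../extra-5.6.14-patched.zip .)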

About your steps: they look correct to me, but if you do start from scratch, do step 7 first so that you don't need step 8.

Svrl (talkcontribs)

I will probably have to upgrade the whole implementation, including PHP and MW along with ES. The installer doesn't accept the customized file you suggested creating. Maybe updating ES to 6.5.4 and all other software to the equivalent versions is the only way forward. But I'm not sure that is viable for my needs, as @TJones (WMF) said. Might I ask for your thoughts here?

Svrl (talkcontribs)

Thank you for the information provided. I will try it and give proper feedback.


Have a nice day :).

TJones (WMF) (talkcontribs)

@DCausse (WMF), I'm not sure that you can use $wgLanguageCode = 'cs'; and $wgCirrusSearchUseIcuFolding = 'yes'; together. The Czech analyzer from Elastic is "monolithic" so it isn't customizable, and adding the ICU folding filter won't do anything.

If you unpack it or use parts of it, like @Svrl did above, then you can customize it. (Note that the Czech analyzer also includes Czech stop words, which you may or may not want, @Svrl.)

Customization is always complicated.

Svrl (talkcontribs)

Thank you for answer.

But this has actually confused me a little. What should I do now? Is the use of the extra plugin still viable, but with only one of the two settings you mentioned, i.e. just $wgLanguageCode = 'cs';?

I have a feeling that the ICU folding actually did something, because it was able to find some words that it couldn't before, AFAIK.

TJones (WMF) (talkcontribs)

I'm not sure what the right way is to define your own custom analyzer within MediaWiki. @DCausse (WMF), do you know how to do that?

The custom analyzer you defined above does use ICU folding, so if you can get that working, you can do whatever you want/need.

Svrl (talkcontribs)

Addition:

How can I even set up the index if indexing from my MediaWiki overwrites all settings of the predefined index? The index cannot be changed after MediaWiki indexing either. Should I somehow edit how MediaWiki & CirrusSearch create the index? I thought the solution suggested by @DCausse (WMF) was about not setting up the index myself, but just letting it be created by MediaWiki indexing, with Elasticsearch having the ICU and extra plugins installed and those LocalSettings.php settings in place.

Svrl (talkcontribs)

According to some Czech guides which I find viable, I think it's one of the possible right approaches. I will do the upgrade, check all dependencies, and test multiple settings.

Have a nice day :).

DCausse (WMF) (talkcontribs)

Indeed, thanks for pointing out this limitation, but I think it might still be somewhat effective for the plain field: stems won't be ICU folded due to the limitation you mention, but "plain" words should be. Not ideal and a bit misleading, but perhaps better than nothing?

DCausse (WMF) (talkcontribs)

Please scratch my comment above, I responded too quickly. ICU folding won't be effective even for plain words; the only place where it will be effective for cs is during completion search of titles from the search box at the top right. Sorry!

TJones (WMF) (talkcontribs)

@Svrl: @DCausse (WMF) and I talked about this some more today, and I opened a ticket to look into making this easier. We think it should be possible to do something like this:

$wgLanguageCode = 'custom';
$wgCirrusTextAnalyzer = 'CzechIcuText';
$wgCirrusPlainAnalyzer = 'CzechPlain';

But we need to look into the code more carefully and make sure it's as feasible as it seems. There are also, as always, issues of prioritization and planning, so I don't know when we'll get to it, but you can track progress and make further comments on the Phab ticket.

Svrl (talkcontribs)

Thank you for your last reply.

Let me ask you, please: is the goal I mentioned above possible at all, i.e. setting up MW, CirrusSearch, Elastica, and ES-indexed MW data so that Czech can be searched the way it is expected: word stemming, case-insensitive matching of uppercase and lowercase characters, and diacritic-insensitive matching (c-č, z-ž, a-á), all combined?

We were able to prepare and set up the index the way we think is right, but we are blocked by the way MediaWiki, during indexing, overwrites the defined structure of the pre-created index. It is not even possible to change the settings afterwards. So MediaWiki indexes things the way it wants to, only using the installed ES icu_folding instead of ascii_folding, which is the only difference that also affects search, and only in an unimportant minority of cases.

If it is not possible at this moment, please let me know, but we know that the Czech Wikipedia has already solved this issue, so there must be a way; that is why I turned to you, the source :).

Svrl (talkcontribs)

Dear community and developers,

Allow me to share with you my recent experience of upgrading and setting up Elasticsearch and Mediawiki in order to reach some search criteria of mine.

I have tried updating PHP to 7.3.x, Elasticsearch to 6.5.4, and MediaWiki to 1.35.1, installing the extra and analysis-icu plugins into ES, upgrading composer dependencies in Elastica, and going through indexing the MW, with some expectation of further or better configuration options, or the advantage of new-version compatibility, but all with no major improvement at all.


Steps I followed:

After installing MW, ES, Elastica, and CirrusSearch, installing the analysis-icu (icu_folding) and extra plugins for ES, adding the CirrusSearch and Elastica definitions to LocalSettings.php, running update.php, and updating composer for Elastica, I did the expected steps for indexing:


Add this to LocalSettings.php:

1. wfLoadExtension( 'Elastica' );

wfLoadExtension( 'CirrusSearch' );

$wgDisableSearchUpdate = true;


2. Now run this script to generate your elasticsearch index:

php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php


3. Now remove $wgDisableSearchUpdate = true from LocalSettings.php.  Updates should start heading to Elasticsearch.


4. Next bootstrap the search index by running:

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip

php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse


5. Note that this can take some time.  For large wikis read "Bootstrapping large wikis" below.

Once that is complete add this to LocalSettings.php to funnel queries to ElasticSearch:

$wgSearchType = 'CirrusSearch';


I have to admit I am very lost and feel like I do not have enough of the information pieces correctly puzzled together. Am I missing some step or point? After some testing and research I have ended up with the findings below:

Comparison of my index vs the cs.wikipedia index:

I will mention two examples of search/index settings here. One is the cs.wikipedia one you provided above; the second is from my wiki. My question here is: which steps do I have to follow to make my index/search settings the same as, or similar to, cs.wikipedia's?

From my own wiki - from Google Docs: https://docs.google.com/document/d/e/2PACX-1vRMnWjIrTsN9Y_V84Cxq4Ys_V899Qup9hfOx0MCYxhYX9-CKGuQ6eyhoN6eqsXy9j7OMFPHfon0-Fzq/pub


Partial indexing?

Is it possible that not all pages and categories are indexed correctly or fully? I have just seen a case where a word with an "á" character was not found on the first search and only appeared after the second and third search (refreshing the search page for this keyword).

Another time I saw a similar case where a page containing some of these characters: "í, é, á, ý" could not be found unless I visited the particular page containing the character in its title. Did I do something horribly wrong?

Or is this the expected behaviour? All apologies for my possible knowledge limitations here, but shouldn't this all be explained in some structured documentation for this particular use case, i.e. the combination of Elastica, CirrusSearch, Elasticsearch and MediaWiki, to make it all connected and working together?

Could this be the problem described in Topic:Ud6sblxvbtlzlm16? Or this one, Topic:V5iwq5ev1fmwnkq5?

Reindexing a customized index?

Might I ask which steps I have to follow to reach the same or similar settings as on the Czech Wikipedia? My point here is: how can I customize the way the index is created and filled? Is it correct to let MediaWiki index first, and afterwards change the index settings and reindex it again?


Standalone ICU server lib/sw?

If I understand correctly, if I do not want to use icu_folding in Elasticsearch as a plugin, I can use the ICU library as server software, as listed under "ICU" on the cs.wikipedia version page.


Additional post-install settings of CirrusSearch?

What I have missed most is the post-install configuration for CirrusSearch, because there is no clear explanation of which settings are crucial and which are optional. Example of settings in this ticket: Topic:Ud6sblxvbtlzlm16


All of these questions have a common basis, which in my humble opinion is that I am simply not able to find proper documentation explaining what is optional and what is crucial, given the very terse README files and the even terser official MediaWiki documentation for the extensions.

I am sorry for this long post and I thank you for your time and effort,

Svrl

TJones (WMF) (talkcontribs)

Hi @Svrl—We've been talking about this, and @DCausse (WMF) thinks he has an approach that may help, and I'll try to add some information specific to Czech. It'll take a few days to test and write it up, but we're hoping to have a real reply on Friday.

TJones (WMF) (talkcontribs)

Hi @Svrl, @DCausse (WMF) figured out a way you can insert your own custom analysis config into CirrusSearch. There is a paste on Phab with the code to add to your LocalSettings.php file. After I updated my LocalSettings.php file, I reindexed with mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --reindexAndRemoveOk --indexIdentifier now and the expected configuration was created.

Note that CirrusSearch has a final round of analysis customization code that still runs on your config. It checks $wgCirrusSearchUseIcuFolding to control whether ASCII folding gets changed to ICU folding, for example.

The customization also enables the homoglyph_norm filter if the extra plugin is available, so you will see that in your config. (The homoglyph_norm filter tries to convert mixed-script tokens to single-script tokens when it can; it keeps the original mixed-script token, too, which means it is not compatible with some other filters... only aggressive_splitting that we know of, see Phab T268730.) If having homoglyph_norm is a problem, we can look at ways to disable it (currently the config to do so is private within the relevant code).

I'm not 100% sure about whether lowercase_keyword is used correctly in the paste example, but I'll talk to DCausse about it on Monday and we will get back to you here if there is a problem.

TJones (WMF) (talkcontribs)

DCausse updated the paste to include both lowercase_keyword and plain, and added comments explaining why. It should be good to go.

Svrl (talkcontribs)

Hello, I would like to say that I very much appreciate the work and effort you have put into this issue. Thank you.


I will definitely try out the configuration you have provided on Phabricator here.

I have only one question: what software, plugins, and versions are needed for this implementation?

May I suppose:

Mediawiki 1.35,

CirrusSearch and Elastica in relevant version,

Elasticsearch 6.5.4,

Extra plugin in relevant version,

NOT icu folding (according to the configuration definition),

Is that all? May I ask you to complete the list, please? It would be really helpful for me to understand the whole scope of what the configuration is referring to.


Thank you!

Best regards,

Svrl

DCausse (WMF) (talkcontribs)

Hi,

you need:

  • MediaWiki and its extensions: CirrusSearch and Elastica
  • elasticsearch 6.5.4
  • the extra plugin (latest version should be: 6.5.4-wmf-11)
  • the analysis-icu plugin is required

Hope it helps,

David.

Reply to "Mediawiki fulltext search for czech language"
Is Redis really required?

Justin C Lloyd (talkcontribs)

I'm currently working on testing CirrusSearch with AWS Elasticsearch in my dev environment, but first I had to implement (AWS ElastiCache) Redis for the job queue. However, I was recently told by someone on the Search team (I apologize, I forget who) that this may not be necessary unless there are hundreds of thousands to millions of jobs, as WMF has. I've seen at most maybe 100-120k jobs (aggregate across my five wikis), but mine are usually in the hundreds, occasionally the thousands, and even more rarely the tens of thousands, which were otherwise handled fine in MySQL. So is it really necessary for CirrusSearch to have the job queues in Redis at that level?
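For context, the Redis job queue setup being discussed is configured in LocalSettings.php along these lines (a sketch only; the endpoint is a placeholder and the separate job runner service is not shown):

$wgJobTypeConf['default'] = [
    'class'       => 'JobQueueRedis',
    'redisServer' => 'my-elasticache-endpoint:6379', // placeholder
    'redisConfig' => [],
    'daemonized'  => true, // JobQueueRedis expects a separate job runner service
];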

Reply to "Is Redis really required?"

How can I search for all the work of a particular artist? Search box is useless.

2601:643:8880:160:9181:5380:4E4D:62BB (talkcontribs)

How can I search for all the work of a particular artist?  Search box is useless.

Speravir (talkcontribs)

Do you mean media on Commons? If so:

  • First you could search for a template {{Creator}} for the artist you search for. Type in search: creator: artist name or, if this lists too many results, creator: "artist name".
  • Then with the known creator template search for file: hastemplate:"creator:artist name" (strangely, this is apparently case sensitive for "artist name", which differs from the default behaviour of this filter).
  • Also, for every artist there should be a category which could be explored.

If you want to search for media that is from an artist but does not have the Creator template, search for file: insource:"artist name" -hastemplate:"creator:artist name". If an artist does not have a creator template at all, then leave out the last part.

Reply to "How can I search for all the work of a particular artist? Search box is useless."
Return to "CirrusSearch" page.