Help talk:CirrusSearch

About this board

The MediaWiki API srsort parameter

3
WhitWye (talkcontribs)

When the result is "A warning has occurred while searching: Sort order of last_edit_desc is unrecognized, default sorting will be applied. Valid sort orders are: relevance", does this imply something additional must be configured or added at the MediaWiki API level for date-order searches to be enabled? MediaWiki 1.34 is installed with CirrusSearch, AdvancedSearch, Elastica and ElasticSearch all otherwise apparently happy, and recognized under Version. I'm so far unable to find documentation on what the missing piece might be.

DCausse (WMF) (talkcontribs)

If only "relevance" is shown I suspect that CirrusSearch is not properly activated did you set $wgSearchType = 'CirrusSearch'; to activate CirrusSearch?

WhitWye (talkcontribs)

Thanks. I'd indeed managed to miss that. The web page on installation is partly redundant with the README about setting up CirrusSearch, so on seeing the redundancy, I managed to overlook that only the README mentions adding the $wgSearchType line. Since the requirement for that appears to be a constant across CirrusSearch generations, it might be logical to put it in the web version of the instructions, along with the other LocalSettings requirements which are listed (redundantly) there.

Reply to "The MediaWiki API srsort parameter"

Index of local files without import?

4
188.111.50.2 (talkcontribs)

The underlying Elasticsearch seems to be a very powerful search engine. Is it possible to use CirrusSearch for recursive indexing of an archive on a local disk, without importing every file into the wiki?

In my case there is a huge collection of PDFs that I want to be findable by keyword search, maybe via automatically generated UNC links or something like that.

DCausse (WMF) (talkcontribs)

Unfortunately, CirrusSearch is designed to index MediaWiki content; it cannot be used to index documents present on the filesystem. I'd suggest looking at other tools that are designed for this task, e.g. fscrawler.

188.111.50.2 (talkcontribs)

Thank you. Is it possible to run CirrusSearch and FSCrawler on the same instance of Elasticsearch, sharing results?

DCausse (WMF) (talkcontribs)

Re-using the same Elasticsearch cluster is certainly possible, but using the FSCrawler-generated index to populate MediaWiki search results won't be possible out of the box. The only way to use MediaWiki to search these files will be to import them, I'm afraid.

Reply to "Index of local files without import?"

search results in rendered form

2
2001:638:607:205:0:0:0:30 (talkcontribs)

For my MediaWiki project I'm looking for a way to convert the search results from the raw (wikitext) form into the rendered form, to make them look better. Is CirrusSearch able to do that? Or do you have any other idea how I can achieve this?


DCausse (WMF) (talkcontribs)

CirrusSearch is not able to do this. The snippets presented are derived from a text version obtained from \WikiTextStructure::getMainText(). To highlight, we insert HTML tags at precise offsets returned by the highlighter run inside Elastic; if the text indexed by Elastic and the text displayed are different, then you'll have to track where the offsets land, basically knowing that offset 123 in the text version is at offset 342 in the rendered output. Add to this that, since you can't display the whole content (it's too big), you need to select a consistent chunk of the rendered output to display. This is very challenging in my opinion. Perhaps limiting it to a set of known text formatting options might make this a bit easier, but handling everything, including tables, sounds particularly complex.

Reply to "search results in rendered form"

Easy way to identify articles with/without images?

6
Astinson (WMF) (talkcontribs)

So one of the larger theories about reader experience is that illustrated content is more catchy and engaging -- and that we will want to connect content from Commons with those potential articles.

Moreover, when community groups want to organize events like meta:VisibleWikiWomen or meta:Wikipedia_Pages_Wanting_Photos, it would be super useful to be able to identify which articles don't yet have Commons media on them.

Is there a way to surface whether or not articles have an image in search? Right now the closest thing I can find is whether or not a page has a pageimage indexed in https://en.wikipedia.org/wiki/Mother_Teresa?action=info . Magnus's PetScan surfaces that element, but it's not reliable: sometimes a page will have an image, but not a high-quality one, or it will be a logo or something that doesn't meet whatever criteria are being used for that filter.

@DTankersley (WMF) @DCausse (WMF) & @EBernhardson (WMF) -- would love your thoughts.

TJones (WMF) (talkcontribs)

I don't think there is currently a good tool for this. You can do something with insource: regular expressions, but regexes can be very expensive queries and they aren't necessarily scalable. (We only allow so many regex queries at once, and if you have no other search terms to narrow the scope of the regex, it will always return incomplete results on large wikis because it times out.)

Here's a fairly generic regex that finds File: links with image suffixes (you may want to add other suffixes):

insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/

So, this query on enwiki currently returns about 100K results, but it times out, so the list is not complete.

The negation ( -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ ) returns 440K documents, and also times out.

However, if you can limit your search to a particular category or title match or even fairly rare keyword, it should complete. For example: deepcat:"Film stubs" -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ finishes and gives 632 results (deepcat:"Film stubs" only gets 641 results, so it is easy for the regular expression to run over that limited set).

Note that insource: looks at the actual source of the page, so images included by templates, transclusion, etc, will not be detected.

So, as a once-in-a-while query or set of queries to generate lists for an editathon or other event, this would work. As a widely deployed user-facing tool, it probably would not—though maybe if there are always focused additional search terms.

If you are open to non-search approaches, you could also look at the dumps and write a tool to scan the latest dump for articles without images. It wouldn't be up-to-the-minute, but you could process 100% of a wiki if you wanted to, which would never be possible with insource: searches on larger wikis.
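As a sketch of that dump-scanning approach, the same regex from the queries above can be applied per page in Python. The suffix list is illustrative, and `has_image` is a made-up helper name, not part of any MediaWiki tooling:

```python
import re

# Same pattern as the insource: query above: a [[File...]] link whose
# target ends in a common image suffix, followed by '|' or ']'.
IMAGE_LINK = re.compile(r'\[\[File[^|\]]+\.(?:jpg|png|gif|svg)[|\]]')

def has_image(wikitext: str) -> bool:
    """True if the raw wikitext contains a direct image link.

    Like insource:, this misses images added via templates or transclusion.
    """
    return IMAGE_LINK.search(wikitext) is not None

print(has_image("Intro [[File:Cat.jpg|thumb]] text"))     # True
print(has_image("No images, just a [[Wiki link]] here"))  # False
```

Fed with page texts pulled from an XML dump, this can cover 100% of a wiki without running into the insource: timeout problem.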

TJones (WMF) (talkcontribs)

As @DCausse (WMF) pointed out, not every wiki uses File: so another regex may work, or may not. Infoboxes and templates may have other syntaxes. I suppose just looking for things that look like image file names might work, with a few false positives where an article discusses images without actually having one—which seems rare. Parsing dumps sounds better and better.

Astinson (WMF) (talkcontribs)

@TJones (WMF) that is a very interesting solution, but regex would not be what I would want to provide to organizers for regular use.

Also, I just tried your query with a small set and got one image almost immediately (https://en.wikipedia.org/wiki/Lyra_McKee). I have tried multiple examples, and it seems to be retrieving a not-insignificant number of false positives. I tried something else without regex and it seems to produce a better result. That solves my short-term question.

In the long term, I would think a filter like this would be super useful in the search interface itself. I think the challenge with dumps is that you create a huge barrier to novel use cases by folks who are wiki-literate, but not necessarily technically literate. There are some tools that kind of do this kind of search live (i.e. FIST: https://tools.wmflabs.org/fist ), but that tool is kind of overwhelming, and it breaks from the typical workflows (i.e. leveraging PetScan for categories, because deepcat seems to break every time I use it, with too many categories). But search makes a lot more sense in a tool like PetScan (or any other end-user tool). Ideally you would want to be able to generate a tool link containing a query and share it around with others, and then the updates would come consistently from search.

197.218.85.218 (talkcontribs)

It seems like the most sensible place to add this functionality would be as a way to fetch page properties (https://phabricator.wikimedia.org/T200860). Of course, there would need to be a generic property added to any page that contains an image at all, e.g. a "hasimages" property; currently a property only gets added when an image fulfills certain criteria. It would then work in a similar manner to Special:PagesWithProp.

Anyway, just knowing whether a page has an image can probably be done with the Wikidata Query Service, for instance in a SPARQL query (https://w.wiki/MrK). That example simply searches for cats within the category:cats on English Wikipedia.

I just cobbled that together from a few SPARQL examples. People more proficient with it would probably be able to make an image appear there, and make it possible to filter out pages without images. It would also be possible to generate these queries using templates for various use cases.


Reply to "Easy way to identify articles with/without images?"
Be..anyone (talkcontribs)

On the Phabricator pages folks discuss some obscure feature related to file uploads. I vaguely recall that I added links to two images on phab as "other versions" on a Commons file. So where was this; how can I find it again? Maybe Special:Contributions should offer a search limited to all pages edited by the given user.

Nemo bis (talkcontribs)

I don't know if it's worth it, but this could be feasible by "simply" dumping the history into Elasticsearch. Even just usernames would end up being huge, though.

Reply to "Feature suggestion"

What is the recipe for properly re-initializing Elastic/CirrusSearch?

5
WhitWye (talkcontribs)

Somehow I've ended up with CirrusSearch mostly working, but failing entirely to find some terms known to be in the imported wiki. Also, as we keep a live backup of our wiki to which we nightly import the whole of the main one, we should have the search DB there thoroughly refreshed each night. What is the proper formula to purge and rebuild the search DB? Apologies if it's someplace that should be obvious, which I've so far missed.

EBernhardson (WMF) (talkcontribs)

CirrusSearch contains a maintenance script called forceSearchIndex.php for this purpose. It can be invoked something like the following. This will essentially queue up to 10k indexing jobs, wait for the queue to drain to ~1k jobs (to avoid dominating the job queue and forcing other jobs to wait for the entire process to complete), and then top it back up to 10k, repeatedly.


php extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 10000 --pauseForJobs 1000
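
For a full purge and rebuild rather than a re-queue, the CirrusSearch README describes a sequence along these lines; treat the exact script names and flags as version-dependent and check the README shipped with your copy:

```shell
# Recreate the search indices from scratch (--startOver deletes them first).
php extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --startOver

# First pass: index page content, skipping link counts.
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipLinks --indexOnSkip

# Second pass: fill in link counts without re-parsing every page.
php extensions/CirrusSearch/maintenance/ForceSearchIndex.php --skipParse
```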

WhitWye (talkcontribs)

Running that script on an otherwise idle system, after a string of "Queued 100 pages" messages there's a seemingly endless repeat of "[              wikidb] 179 jobs left on the queue." After many minutes of that, htop shows a load between 0.00 and 0.01. Is there a prerequisite to running this maintenance script successfully? Running it without the flags, I see it runs into a parsing error:


MWException from line 348 of /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php: Bad parser output text. ....


Obviously I should report a bug: https://phabricator.wikimedia.org/T244603

EBernhardson (WMF) (talkcontribs)

I see in the phab ticket you came up with a temporary solution to the parser failure. With that somewhat resolved, does the reindexing complete?

Legaulph (talkcontribs)

wrong

Reply to "What is the recipe for properly re-initializing Elastic/CirrusSearch?"

Add articletopic to Draft space

3
Summary by DCausse (WMF)
Sadads (talkcontribs)

I was working on topics on English Wikipedia and realized that it would be super handy to have the ORES topic models applied to draft space to make it easier to search.

Speravir (talkcontribs)
DCausse (WMF) (talkcontribs)

About the copyrights of the pictures

2
141.237.124.41 (talkcontribs)

Hello, I am new to your community and I just wanted to ask you one thing. I have written some e-books and I need to add some pictures to them. Is it OK to use pictures from your community? Of course I will state that these pictures are not mine and I will give you full credit for them.

Speravir (talkcontribs)
Reply to "About the copyrights of the pictures"

How to list more than 1 result from a wiki page

2
Chachacha2020 (talkcontribs)

Hi, I'm using


MediaWiki 1.27.1
PHP 5.5.9-1ubuntu4.22 (apache2handler)
MySQL 5.5.53-0ubuntu0.14.04.1
ICU 52.1
Elasticsearch 1.7.5

and I'm quite pleased with the search results. However, I have a problem. My wiki has a page "Windows tip" with two headings, "Windows can't sleep" and "Windows wake from sleep". A search for "windows sleep" only brings up "Windows wake from sleep", and then the results come from other pages. How can I list more than one result from a single wiki page?

PS: I can code a bit, so if this feature is not available I can contribute.

DCausse (WMF) (talkcontribs)

Sadly, the smallest unit in CirrusSearch is the wiki page; diverging from that might require significant changes to CirrusSearch's internal data model. Another issue is that CirrusSearch does not know how to attach sections and their text to each other. Imagine the search query matches a section name: the text displayed below it won't necessarily be extracted from that same section, causing some confusion (see phab:T131950).

Perhaps changing the structure of some of your wiki pages (subpages instead of sections) is an option for you?

If not I'm sorry for not being able to point at a reasonable solution for adapting the CirrusSearch code.

Reply to "How to list more than 1 result from a wiki page"
Jonteemil (talkcontribs)

Hello!

Is there a feature which you can use to search the beginnings of pages? For example, if you want to find every page on Commons that begins with {{Information, but exclude every page that begins with something else and has {{Information on the second line?

DCausse (WMF) (talkcontribs)

Hi,

Cirrus does not allow searching for anchors (start or end of document) but I believe you can search for what you want by combining two regular expressions:

insource:/\{\{Information/ -insource:/.\{\{Information/


The first insource:/\{\{Information/ will search for all pages containing the wikitext {{Information. The second -insource:/.\{\{Information/ will exclude all pages that contain a character followed by {{Information (these are all the pages where the Information template is not used at the beginning of the wikitext).

Note that this regular expression is a bit slow to process, as it has to scan a lot of pages, so you may end up only seeing partial results.

Jonteemil (talkcontribs)

I see, thanks! Why doesn't cirrus allow searching for anchors?

DCausse (WMF) (talkcontribs)

Simply because the underlying regular expression engine that we use does not support such a feature :)

Jonteemil (talkcontribs)

Just to be sure: would "beginswith:" and your insource regex give exactly the same results, just via different methods? "beginswith:" is what I call the non-existent feature that would serve my need.

Jonteemil (talkcontribs)
DCausse (WMF) (talkcontribs)

@Jonteemil: no, the solution I provided only works if the characters you search for are used only at the beginning of the wikitext content and not repeated elsewhere.


Assuming that we want to search for "xyz" appearing only at the beginning of the wikitext, insource:/xyz/ -insource:/.xyz/ will discard valid results where "xyz" appears at the beginning but also somewhere else in the text.

In other words, the query I provided is only 100% accurate for pages that include the Information template only once.
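
That failure mode is easy to reproduce with a small simulation of the two filters in Python; the page texts here are made up for illustration:

```python
import re

# The two insource: filters as plain regexes over raw wikitext.
starts = re.compile(r'\{\{Information')     # insource:/\{\{Information/
preceded = re.compile(r'.\{\{Information')  # -insource:/.\{\{Information/

pages = {
    "A": "{{Information}} a normal file description",
    "B": "Some intro text, then {{Information}} later",
    "C": "{{Information}} at the start and {{Information}} again later",
}

# Keep pages that match the first filter and not the second.
kept = [name for name, text in pages.items()
        if starts.search(text) and not preceded.search(text)]

print(kept)  # ['A'] -- page C starts with the template but is discarded
```

Page C begins with the template, yet its second occurrence has a character before it, so the negated filter discards it: a false negative of exactly the kind described above.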


Allowing the search string to be anchored to the start or the end of the text has been brought up in https://meta.wikimedia.org/wiki/Community_Wishlist_Survey_2019/Search#Search_by_suffix

I think it would make more sense to add support for ^ and $ in the insource:// and intitle:// keywords rather than adding a new keyword.

Jonteemil (talkcontribs)

Okay, thanks!

Jonteemil (talkcontribs)

Aha, thanks for the knowledge!

Speravir (talkcontribs)

In addition to @DCausse (WMF): citing the help, "when possible, please avoid running a bare regexp search". But you also have to take care of the different possible cases. Note that all of these are allowed: {{Information, {{information, {{ Information, {{ information, and in fact an almost endless number of spaces between the opening braces and the template name.

Even though I narrowed down the search, I got a warning with this query because of the heavy template use: file: hastemplate:information insource:"information" insource:/\{\{ *[Ii]nformation/

And, of course, I was warned for this one, too: file: hastemplate:information insource:"information" insource:/\{\{ *[Ii]nformation/ -insource:/.\{\{ *[Ii]nformation/

Jonteemil, this is the wrong place here (it should be discussed at Commons’ Village pump, I guess), but why do you want to know this? Do you want to add == {{int:filedesc}} ==? If so: This is not mandatory!

Jonteemil (talkcontribs)

Adding == {{int:filedesc}} == was indeed my intention. Even though it might not be mandatory, I think the goal should be for all files to have it; but as you say, this is a Commons matter rather than a MediaWiki one. I asked the question here since the question itself could be of use for every Wikimedia project, even if I intended to use the answer on Commons.

Reply to "beginswith:?"
Return to "CirrusSearch" page.