Help talk:CirrusSearch

search results in rendered form

2001:638:607:205:0:0:0:30 (talkcontribs)

For my MediaWiki project, I'm looking for a way to convert the search results from the raw (wikitext) form into the rendered form so that they look better. Is CirrusSearch able to do that? Or do you guys have any other idea how I can achieve this?


DCausse (WMF) (talkcontribs)

CirrusSearch is not able to do this. The snippets presented come from a text version obtained from \WikiTextStructure::getMainText(). To highlight, we insert HTML tags at precise offsets returned by the highlighter that runs inside Elasticsearch. If the text indexed by Elasticsearch and the text displayed are different, you'll have to track where the offsets are, basically knowing that offset 123 in the text version is at offset 342 in the rendered output. Add to this the fact that, since you can't display the whole content (too big), you need to select a consistent chunk of the rendered output to display. This is very challenging in my opinion. Perhaps limiting to a set of known text formatting options might make this a bit easier, but handling everything including tables sounds particularly complex.
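
To make the offset problem concrete, here is a minimal Python sketch of inserting highlight tags at offsets into a plain-text snippet. The function name insert_highlights, the <b> tags, and the sample offsets are illustrative assumptions, not CirrusSearch's actual code.

```python
from html import escape


def insert_highlights(text, spans, open_tag="<b>", close_tag="</b>"):
    """Wrap each (start, end) span of `text` in highlight tags.

    The offsets refer to the plain text only, so they are valid for that
    exact string and would not line up with a rendered HTML version.
    Spans are assumed not to overlap (an assumption of this sketch).
    """
    out = []
    pos = 0
    for start, end in sorted(spans):
        out.append(escape(text[pos:start]))          # text before the match
        out.append(open_tag + escape(text[start:end]) + close_tag)
        pos = end
    out.append(escape(text[pos:]))                   # text after the last match
    return "".join(out)


snippet = "CirrusSearch indexes plain text"
print(insert_highlights(snippet, [(0, 12), (21, 26)]))
# <b>CirrusSearch</b> indexes <b>plain</b> text
```

If the same offsets were applied to rendered HTML, the tags would land in the wrong places, which is why an offset mapping between the two versions would be needed.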

Colin M (talkcontribs)

The filters section says: "A namespace or a prefix term is not a filter because a namespace will not run standalone, and a prefix will not negate." This seems empirically untrue. On EnWP I get the following number of results for each of these queries, as expected:

  • incategory:"LGBT-related musical films": 58
  • incategory:"LGBT-related musical films" prefix:"Hello": 2
  • incategory:"LGBT-related musical films" -prefix:"Hello": 56

So it seems like negating a prefix does work. Am I misunderstanding what this is trying to say? For now I added a {{dubious}} tag.
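
For anyone who wants to repeat this check, here is a rough Python sketch that builds MediaWiki API search URLs for the three queries. The helper name count_url is made up for this example. The hit count should appear under query.searchinfo.totalhits in the JSON response, and the numbers returned today may differ from those quoted above.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def count_url(query):
    """Return an API URL whose JSON response includes the total hit count."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srinfo": "totalhits",   # ask only for the total number of hits
        "srlimit": 1,            # the result rows themselves are not needed
        "format": "json",
    }
    return API + "?" + urlencode(params)


queries = [
    'incategory:"LGBT-related musical films"',
    'incategory:"LGBT-related musical films" prefix:"Hello"',
    'incategory:"LGBT-related musical films" -prefix:"Hello"',
]
for q in queries:
    print(count_url(q))
```

If the first count equals the sum of the other two, the negated prefix is behaving as a filter.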

TJones (WMF) (talkcontribs)

I took a long, hard look at this and I'm confused, too. I'm not sure I understand the definition of "filter" being used. I don't know if the documentation is out of date or using some model that I'm not able to wrap my head around. Similarly, I don't get this: "Insource ... is also a filter, but insource:/regexp/ is not a filter." insource:word and insource:/regex/ behave pretty much the same, other than the regex being much slower. Sounds like the documentation could use a thorough review to make sure all the advanced features and special cases are still described correctly.

Cpiral (talkcontribs)

A "filter" can reduce unwanted matches, providing refinement. Regexes are special, catered-for terms that use filters. I made educated guesses at numerous terms, and even invented "greyspace". Just trying to help.

When I rewrote the help page to its current form (years ago), there was neither documentation nor discussion. So I would say prefixes can be negated now, but could not be then.

TJones (WMF) (talkcontribs)

@Cpiral, thanks for the explanation. Also, I very much appreciate all the work you've put into these help pages! They definitely keep getting better.

Pols12 (talkcontribs)

So, I’m editing the page to indicate we can actually negate prefixes.

EDIT: since I’m not really sure what is true, I’ve only removed the concerned sentence. Feel free to explain what is possible and what is not.

Cpiral (talkcontribs)

It tested well. It is true that prefix is a filter. I added the sentence back, modified: "A namespace is a specified search domain but not a filter because a namespace will not run standalone. A ''prefix'' will negate so it is a filter."

Reply to "Prefixes don't negate?"

What is the recipe for properly re-initializing Elastic/CirrusSearch?

WhitWye (talkcontribs)

Somehow I've ended up with CirrusSearch mostly working, but failing entirely to find some terms known to be in the imported wiki. Also, as we keep a live backup of our wiki to which we nightly import the whole of the main one, we should have the search DB there thoroughly refreshed each night. What is the proper formula to purge and rebuild the search DB? Apologies if it's someplace that should be obvious, which I've so far missed.

EBernhardson (WMF) (talkcontribs)

CirrusSearch contains a maintenance script called forceSearchIndex.php for this purpose. It can be invoked something like the following. This will essentially queue up to 10k indexing jobs, wait for that to go down to ~1k jobs (to prevent dominating the job queue and forcing other jobs to wait for the entire process to complete), and then push more jobs up to 10k in a repeated fashion.


php extensions/CirrusSearch/maintenance/forceSearchIndex.php --queue --maxJobs 10000 --pauseForJobs 1000
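
The batching behaviour described above can be pictured with a toy Python model. This is only a simulation of the idea (fill the queue to a ceiling, wait until it drains to a floor, then fill again), not the maintenance script's code. The function name and the drain_per_tick parameter are invented for illustration.

```python
def simulate_queueing(total_pages, max_jobs=10000, pause_for_jobs=1000,
                      drain_per_tick=500):
    """Toy model: returns (pages queued, number of pauses taken)."""
    assert pause_for_jobs < max_jobs, "floor must be below the ceiling"
    pending = 0   # jobs currently sitting on the queue
    queued = 0    # pages pushed so far
    pauses = 0
    while queued < total_pages:
        # Push jobs until the queue holds max_jobs or no pages are left.
        batch = min(max_jobs - pending, total_pages - queued)
        pending += batch
        queued += batch
        if queued == total_pages:
            break
        # Pause until job runners bring the queue down to the floor.
        pauses += 1
        while pending > pause_for_jobs:
            pending -= min(drain_per_tick, pending)
    return queued, pauses


print(simulate_queueing(25000))  # (25000, 2)
```

In this model the queue can only shrink while something is draining it, so a real queue that stays at a constant size suggests that no job runner is processing it.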

WhitWye (talkcontribs)

Running that script on an otherwise idle system, after a string of "Queued 100 pages" messages there's a seemingly endless repeat of "[              wikidb] 179 jobs left on the queue." After many minutes of that, htop is showing a load between 0.00 and 0.01. Is there a prerequisite to running this maintenance script successfully? Running it without the flags, I see it runs into a parsing error:


MWException from line 348 of /var/www/mediawiki-1.34.0/includes/parser/ParserOutput.php: Bad parser output text. ....


Obviously I should report a bug: https://phabricator.wikimedia.org/T244603

EBernhardson (WMF) (talkcontribs)

I see in the phab ticket you came up with a temporary solution to the parser failure. With that somewhat resolved, does the reindexing complete?


The MediaWiki API srsort parameter

WhitWye (talkcontribs)

When the result is "A warning has occurred while searching: Sort order of last_edit_desc is unrecognized, default sorting will be applied. Valid sort orders are: relevance", does this imply something additional must be configured or added at the MediaWiki API level for date-order searches to be enabled? MediaWiki 1.34 is installed with CirrusSearch, AdvancedSearch, Elastica and ElasticSearch all otherwise apparently happy, and recognized under Version. I'm so far unable to find documentation on what the missing piece might be.

DCausse (WMF) (talkcontribs)

If only "relevance" is shown, I suspect that CirrusSearch is not properly activated. Did you set $wgSearchType = 'CirrusSearch'; to activate CirrusSearch?
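
Once CirrusSearch is active, a date-ordered API search could be requested roughly as in the following Python sketch. The host wiki.example.org, the /w/api.php path, and the helper name search_params are placeholders, not part of any real setup.

```python
from urllib.parse import urlencode


def search_params(query, sort="last_edit_desc"):
    """Build query parameters for a search sorted by last edit, newest first."""
    return {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srsort": sort,      # only "relevance" is accepted without CirrusSearch
        "format": "json",
    }


# Placeholder endpoint: substitute the api.php URL of your own wiki.
print("https://wiki.example.org/w/api.php?" + urlencode(search_params("elastic")))
```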

WhitWye (talkcontribs)

Thanks. I'd indeed managed to miss that. The web page on installation is partly redundant with the README about setting up CirrusSearch, so on seeing the redundancy, I managed to overlook that only the README mentions adding the $wgSearchType line. Since the requirement for that appears to be a constant across CirrusSearch generations, it might be logical to put it in the web version of the instructions, along with the other LocalSettings requirements which are listed (redundantly) there.


Index of local files without import?

188.111.50.2 (talkcontribs)

The underlying ElasticSearch seems to be a very mighty search engine. Is it possible to use CirrusSearch for recursive indexing of an archive on a local disk without importing every file in the wiki?

In my case there is a huge collection of PDFs that I want to be findable by keyword search, maybe via automatically generated UNC links or something like that.

DCausse (WMF) (talkcontribs)

Unfortunately, CirrusSearch is designed to index MediaWiki; it cannot be used to index documents present on the filesystem. I'd suggest looking at other tools that are designed for this task, e.g. fscrawler.

188.111.50.2 (talkcontribs)

Thank you. Is it possible to run CirrusSearch and FSCrawler on the same instance of ElasticSearch sharing results?

DCausse (WMF) (talkcontribs)

Re-using the same Elasticsearch cluster is certainly possible, but using an FSCrawler-generated index to populate MediaWiki search results won't be possible out of the box. The only way to use MediaWiki to search these files will be to import them, I'm afraid.


Automatic reset to 20 search results - why?

Gerbis (talkcontribs)

Why does the search function reset to 20 search results every time search terms are rephrased? It is very cumbersome and scroll intensive, particularly because the option to display more results is only at the bottom of the page.

Example: I tried to search for a picture of a BEST showroom by the architecture firm SITE. As both "best" and "site" are very common words, I didn't expect to find what I'm looking for at the top of the results. But every time I refined the search terms (e.g. adding a city), I had to scroll down to the bottom of the page to click on 500 results (and, of course, have to look at the first 20 results twice as well). What a waste of time!

And what's worse: the search results page is not automatically active. Therefore, to get to the bottom of the page (or, later, back to the top of the page) you can't use the End or Home button on your keyboard without clicking somewhere in the page first. Very, very annoying.

Could this not be made more user friendly?

Speravir (talkcontribs)

This annoys me, too, most of the time. I think, though, that this is the wrong place for it; it should rather be one or two Phabricator tickets.

Cpiral (talkcontribs)

It is annoying. I use my browser shortcut key to go to the bottom of the search results page.

Probably not a bug or feature request; rather, 20 results per page is a characteristic of processing. The help page tries to say that, while indexing and weighting are unavoidable pre-processing of a query, the snippets and bolding are avoidable post-processing requiring heavy networking and text processing.

Jonteemil (talkcontribs)

Hello!


On en.wikt. I want to find all pages with this syntax:


===Adjective===

{{head|de|adjective form}}


# {{inflection of|de|/positive form/||str|gen|m//n|s|supd|;|wk//mix|gen//dat|all-gender|s|supd|;|str//wk//mix|acc|m|s|supd|;|str|dat|p|supd|;|wk//mix|all-case|p|supd}}


The /positive form/ varies from page to page, it can be ”rot”, ”dumm”, ”froh” etc., so is there any way to make an insource search for the entire syntax? I can do it for everything after /positive form/ and everything before /positive form/ but not everything including the varying /positive form/? Just to clarify, /positive form/ is never written on any page it’s just what I use as a variable for the words that are written in that place.

Speravir (talkcontribs)

I do not get it fully. Some examples for possible variations would be nice, the examples you give do not have this syntax. Also your search query as you have it now would be good.

What is your actual interest for finding: the doubled empty lines, the doubled slashes?

Jonteemil (talkcontribs)
Speravir (talkcontribs)

Thanks.

And do you need exactly this string from the beginning with the third order section until the end where only the actual adjective in positive form varies?

Just as a start: I would first narrow down the amount for the search query, hence the query should begin with (do not overlook the first colon):

: hastemplate:head hastemplate:"inflection of" insource:adjective

After this would come the regex insource, depending on what you expect. For the positive form part I would use this regex: [^|}]+.

Speravir (talkcontribs)

@Jonteemil, what’s up? I know you have been active in the meantime. – Speravir (talk) 01:39, 19 December 2019 (UTC)

Jonteemil (talkcontribs)

Sorry for not replying. I realized what I wanted wasn’t possible to achieve in the way I thought, and that made me leave it and also forget this talk page. I appreciate your answer, thanks! Just out of curiosity, btw, what do you mean by [^|}]+? Jonteemil (talk) 02:09, 19 December 2019 (UTC)

Speravir (talkcontribs)

Well, searching for what you presented above is possible, though the search query gets quite long. Hence I asked what exactly you are interested in.

[^|}]+ is the regex for “everything, but not a pipe or closing brace character, at least one occurrence”. This is for the variable adjective string. BTW: For German adjectives we could change this to a narrower character search [a-zäöüß]+ or, if irregular upper case letters have to be expected, [A-Za-zÄäÖöÜüß]+.
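
As a small illustration of how this character class behaves, here is a Python sketch run against made-up sample lines. Python's re module is used only to demonstrate the idea; the insource:/.../ search uses its own regex dialect, so escaping rules may differ there. The sample lines and the helper name positive_form are assumptions for this example.

```python
import re

# The group captures the varying positive form between "de|" and "||str|gen".
pattern = re.compile(r"\{\{inflection of\|de\|([^|}]+)\|\|str\|gen")


def positive_form(line):
    """Return the captured adjective, or None if the line does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


samples = [
    "# {{inflection of|de|rot||str|gen|m//n|s|supd}}",
    "# {{inflection of|de|dumm||str|gen|m//n|s|supd}}",
    "# {{inflection of|de|||str|gen|m//n|s|supd}}",  # empty slot: no match
]
for line in samples:
    print(positive_form(line))

# The narrower class accepts lower-case German letters only.
print(bool(re.fullmatch(r"[a-zäöüß]+", "froh")))   # True
print(bool(re.fullmatch(r"[a-zäöüß]+", "Froh")))   # False
```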

Jonteemil (talkcontribs)

I see, thanks for the knowledge!

185.66.254.155 (talkcontribs)

I want to upload content to the Wikipedia page as a user. I want to contribute as a content partner, but even though I am a registered member, my request is not going through.

What do I need to do?

TJones (WMF) (talkcontribs)

EN: Sorry, this is not related to CirrusSearch, so I don't think you will get an answer here. Try asking on the Village Pump or Köy çeşmesi.

TR: Üzgünüz, bu CirrusSearch ile ilişkili değil, bu yüzden burada bir cevap alacağınızı sanmıyorum. Burada sormayı deneyin: Village Pump / Köy çeşmesi.

Reply to "Content upload blocked!!!!"

incategory parameter and white space

Summary by Tacsipacsi

Use quotation marks ("New cars") to ignore whitespace in parameter.

Loman87 (talkcontribs)

Hello everybody,

I am noticing an issue when using the incategory parameter: it doesn't work if there is white space in the query, e.g. incategory:New cars doesn't work, whereas incategory:New_cars works fine. This was always OK for me, but now I need to use this parameter with Extension:InputBox to limit the search to specific categories. In the call to these categories I also need to use variables, which output values with white space, e.g. {{FULLPAGENAME}} gives something like Category:New cars. Is there any way to make CirrusSearch also work with white space in category names? Or to "force" variables to give values with underscores instead of white space?

I am not sure this is the right place to post this question; anyway, any help is really appreciated.

Thanks,

Lorenzo

Tacsipacsi (talkcontribs)

Space is the separator between search terms, so incategory:New cars searches for pages mentioning cars in Category:New. You can explicitly mark search term boundaries with quotation marks, i.e. incategory:"New cars".

Loman87 (talkcontribs)

This is a wonderful workaround, thanks very much!

PerfektesChaos (talkcontribs)

Or incategory:New_cars since _ is regular replacement for spaces in page names.

Tacsipacsi (talkcontribs)

But it’s not so easy to convert the output of {{PAGENAME}} (not {{PAGENAMEE}}) to use underscores, and that’s what the question is about.
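
For anyone building such queries in a script rather than in wikitext, the two forms discussed in this thread can be sketched in Python as follows. The helper names are invented, and stripping a leading "Category:" prefix is an assumption made for this example.

```python
def category_name(title):
    """Drop a leading 'Category:' namespace prefix, if present."""
    name = title.strip()
    prefix = "category:"
    if name.lower().startswith(prefix):
        name = name[len(prefix):]
    return name


def incategory_quoted(title):
    """Quoted form: the quotation marks keep the space inside the term."""
    return 'incategory:"{}"'.format(category_name(title))


def incategory_underscored(title):
    """Underscore form: underscores stand in for spaces in page names."""
    return "incategory:" + category_name(title).replace(" ", "_")


print(incategory_quoted("Category:New cars"))       # incategory:"New cars"
print(incategory_underscored("Category:New cars"))  # incategory:New_cars
```

Category names that themselves contain a quotation mark are not handled by this sketch.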

T506 not clear about tilde position for easy translation - reformulate please

Wladek92 (talkcontribs)

In

<!--T:506--> 
A fuzzy-word or fuzzy-phrase search can suffix a tilde ~ character (and a number telling the degree).

We can understand that the fuzzy elements come AFTER the tilde, or we may also guess that the fuzzy element has a tilde as its suffix (...???). Moreover, the next sentence, "A tilde ~ character prefixed to the first term of a query guarantees search results instead of any possible navigation.", reads the same as the first statement and seems repetitive. Can somebody reformulate, please? Thanks.

Christian FR (talk) 12:54, 1 November 2019 (UTC)

Ciencia Al Poder (talkcontribs)

I think T:506 refers to "phrase~".

About "A tilde ~ character prefixed to the first term of a query guarantees search results instead of any possible navigation.", I think it means, if you search for "MediaWiki", since a page with that name exists, it will redirect you straight to the page called MediaWiki. Searching for "~MediaWiki" will give you the search results page, even if a page with that name exists.
