Help talk:CirrusSearch


113.210.117.21 (talkcontribs)

May I use Malay? I don't understand English. Thank you.

TJones (WMF) (talkcontribs)

EN: We can try to communicate with Google Translate. We can ask others to help check our translations.

MS: Kami boleh cuba berkomunikasi dengan Terjemahan Google. Kami boleh meminta orang lain untuk membantu memeriksa terjemahan kami.

TJones (WMF) (talkcontribs)

@Tofeiku, @Bennylin, and @Yosri: can you help translate and/or check Google's translations? Thanks!!

Tofeiku (talkcontribs)

@TJones (WMF) Your Google translation is fine and understandable.

Reply to "May I use Malay?"

Wildcard does not work at all for me

73.190.170.210 (talkcontribs)

Neither \? nor * works. Google got rid of their wildcard feature too so I can't use that hack. All I want is a simple wildcard search damn it

DCausse (WMF) (talkcontribs)

Could you explain what is not working, with a specific query and what you expect? Also, is the problem on WMF wikis or on a custom MediaWiki installation? CirrusSearch supports wildcard queries, but it has limitations; without more details from your side it is hard to help you.

73.190.170.210 (talkcontribs)

I'm going to use Wiktionary for my examples, since that's where I intend to use these parameters, and the search behavior should be the same as Wikipedia regardless.

My problem is that everything I can find on the Help page, including intitle:, \?, and *, only works in the middle or at the end of a word. For example:

"Which English words begin with 'zr-'?"

incategory:"English lemmas" -zr intitle:zr* (returns a bunch of Slavic loans, but English nonetheless)

But how do I place a wildcard character at the beginning of a word, for instance to search Wiktionary for questions like:

"Which English words contain the consonant cluster '-zr-'?"

The query:

incategory:"English lemmas" -zr intitle:*zr*

returns an error.

Thanks in advance for your help.

73.190.170.210 (talkcontribs)

It seems like the regex method:

incategory:"English lemmas" -zr intitle:/zr/

is one possible solution. Is there a way to make this query more efficient?

DCausse (WMF) (talkcontribs)

This is correct. One limitation of wildcard queries on WMF wikis is that you can't use them at the beginning of a word; the reason is performance (to avoid scanning the full lexicon).

Precise "in-word" matching requires switching to regular expression searches, which I see you have already tried. Sadly, with a two-letter search (three contiguous letters would enable some optimizations) there is very little we can actually do other than find other metadata to search on to limit the search space (e.g. incategory:"English lemmas").


@TJones (WMF) might have some tricks to share here.

TJones (WMF) (talkcontribs)

I don't have too much to add; @DCausse (WMF) hit all the highlights: 3 contiguous letters help a lot, and limiting the search space with other terms can help even more. incategory:"English lemmas" helps some, but not much, because it still leaves almost 600K entries to grep through.

If you are looking for not only zr, but also Zr, ZR, and zR, you will want to add an i flag to your query: incategory:"English lemmas" -zr intitle:/zr/i Unfortunately it will make your query a little slower, since it is more expensive to grep while ignoring case.

My only other thought is that if it is very important that you find every instance of zr, or if you are going to do a lot of searching like this, consider looking into downloading the wiktionary dumps and searching them offline. (https://dumps.wikimedia.org/) It would be a fair amount of work to parse the dumps, but it would get you 100% complete results.
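For the offline route, here is a minimal Python sketch of the title filtering itself. It assumes you have already extracted a list of page titles from a dump; the function name `titles_containing` and the sample titles are made up for illustration. Unlike the on-wiki wildcard, a local regex has no restriction on matching at the start of a word.

```python
import re
from typing import Iterable, List


def titles_containing(titles: Iterable[str], fragment: str,
                      ignore_case: bool = False) -> List[str]:
    """Return the titles that contain `fragment` anywhere in the string."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps the fragment literal, like a plain substring search.
    pattern = re.compile(re.escape(fragment), flags)
    return [title for title in titles if pattern.search(title)]


# Hypothetical sample; a real run would read titles parsed from a dump.
sample = ["Azrael", "ZR", "zebra", "Ezra", "czar"]
print(titles_containing(sample, "zr"))                    # case-sensitive
print(titles_containing(sample, "zr", ignore_case=True))  # like the /zr/i query
```

The `ignore_case` switch plays the same role as the i flag in intitle:/zr/i.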

Sorry there's not a better answer.

Reply to "Wildcard does not work at all for me"
Summary by DCausse (WMF)

The fix has been deployed, this problem should no longer occur.

Jonteemil (talkcontribs)

Hello!

Why is c:Special:Search/all: hastemplate:PD-textlogo returning 0 pages when c:Special:WhatLinksHere/Template:PD-textlogo isn't empty? This is very weird. When I preview this comment and click on the first link, the search actually works: I get 100,000+ results. However, when I then click "search" without having changed anything in the search bar, it gives me 0 pages again. If I just refresh, it still works, but when I click search it doesn't. Is this some kind of bug?

Speravir (talkcontribs)

I confirm the odd behaviour when executing the search a second time. But I do not get what the relation is to another search tool with another template.

(It is clear from the search query, but it would have been better to mention up front that you are talking about templates on Wikimedia Commons.)

Jonteemil (talkcontribs)

Oh, I meant to use the same template. My bad.

Jonteemil (talkcontribs)

Okay, now it works again. Very odd.

Speravir (talkcontribs)

I still observe the strange behaviour.

Jonteemil (talkcontribs)

I do it now again as well. But it worked for a time yesterday. This is very strange.

DCausse (WMF) (talkcontribs)

This is a known issue caused by a test running on Commons. It should be fixed today.

Speravir (talkcontribs)
DCausse (WMF) (talkcontribs)
Speravir (talkcontribs)

Thank you. I will close this thread, again.

hastemplate:[sometemplate] none

Summary by Bouzinac

"-hastemplate:[sometemplate]"

Bouzinac (talkcontribs)

How can I find articles that don't have sometemplate?

search results in rendered form

2001:638:607:205:0:0:0:30 (talkcontribs)

For my MediaWiki project I'm looking for a way to convert the search results from the raw (wikitext) form into the rendered form to make them look better. Is CirrusSearch able to do that? Or do you have any other idea how I can achieve this?


DCausse (WMF) (talkcontribs)

CirrusSearch is not able to do this. The snippets presented come from a text version obtained from \WikiTextStructure::getMainText(). To highlight, we insert HTML tags at precise offsets returned by the highlighter that runs inside Elasticsearch. If the text indexed by Elasticsearch and the text displayed are different, you have to track where the offsets are, basically knowing that offset 123 in the text version is at offset 342 in the rendered output. On top of that, since you can't display the whole content (too big), you need to select a consistent chunk of the rendered output to display. This is very challenging in my opinion. Limiting to a set of known text formatting options might make this a bit easier, but handling everything, including tables, sounds particularly complex.
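To make the offset problem concrete, here is a small Python sketch of offset-based highlighting on plain text. It is not CirrusSearch's code; the function name `highlight` and the `searchmatch` wrapper class are illustrative assumptions, and the spans are assumed not to overlap.

```python
from html import escape
from typing import List, Tuple


def highlight(text: str, spans: List[Tuple[int, int]]) -> str:
    """Wrap each (start, end) span of the plain text in a highlight tag.

    The offsets are only valid for this exact string. Applied to a rendered
    HTML version of the same page they would land in the wrong places.
    """
    parts = []
    position = 0
    for start, end in sorted(spans):
        parts.append(escape(text[position:start]))
        parts.append('<span class="searchmatch">'
                     + escape(text[start:end]) + "</span>")
        position = end
    parts.append(escape(text[position:]))
    return "".join(parts)


print(highlight("a wildcard search", [(2, 10)]))
```

Doing the same on rendered output would need a mapping from each plain-text offset to its position in the HTML, which is the hard part described above.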

Reply to "search results in rendered form"

Not all articles present in dumps?

Polm23 (talkcontribs)

I downloaded the cirrussearch-content dump and was filtering articles based on categories and noticed some articles I expected to find weren't present.

This is the file I downloaded:

enwiki-20200727-cirrussearch-content.json.gz

Some articles I expected to find, but didn't: "Paper Mario", "Sakura Wars (1996 video game)". I thought maybe it was an issue with my category filtering, but I also tried just printing every title and checking that list and those articles don't seem to be present.

Polm23 (talkcontribs)

On further examination, the JSON I have has a little over 5M entries with a "title" property. But English Wikipedia passed 6M entries in January, so that number seems off...
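For anyone repeating this check, here is a rough Python sketch of listing titles from such a file. It assumes the dump is gzip-compressed, line-delimited JSON in which some lines are metadata without a "title" key; the helper names `iter_titles` and `titles_in_dump` are made up for this example.

```python
import gzip
import json
from typing import Iterable, Iterator


def iter_titles(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "title" of every JSON line that has one."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Lines without a "title" key (assumed to be metadata) are skipped.
        if "title" in record:
            yield record["title"]


def titles_in_dump(path: str) -> Iterator[str]:
    """Stream titles from a .json.gz dump without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_titles(handle)


# Usage (path taken from the post above):
# titles = set(titles_in_dump("enwiki-20200727-cirrussearch-content.json.gz"))
# print(len(titles), "Paper Mario" in titles)
```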

TheDJ (talkcontribs)
Reply to "Not all articles present in dumps?"
Tufor (talkcontribs)

Hello. I was thinking about the possibility of combining the functionality of PetScan and CirrusSearch: first run a query on PetScan, get a list of pages, and then run a CirrusSearch query on those pages. That obviously means I need to store that first list somewhere. Since you cannot put a thousand pages in intitle: ;) and there is no other way to pass an array of titles to run a query on, the only sensible choice would be to save the list as links somewhere in a sandbox and then use it later. Unfortunately, such functionality doesn't exist either. That's why I would like to ask whether it would be possible to introduce something like this. We have "linksto:", so why not "linksfrom:"? It shouldn't be too hard. I'm not too well-versed in MediaWiki database architecture, so excuse me if I'm just talking nonsense, but aren't the links on a page stored somewhere in one table?

EBernhardson (WMF) (talkcontribs)

Not sure it would be possible, at least not easily, from the search engine. From this perspective, the difference is that `linksto` is a property of the page content: it comes out of the MediaWiki parser as a property of the rendered wikitext. `linksfrom` is the opposite: it would have to be propagated on each edit to the set of linked pages.

There are links tables in SQL that can do joins to resolve the set of pages linking to a page, but within the Search engine there is nothing conceptually similar, the underlying technology doesn't have the ability to join multiple datasources at query time.

It's far from impossible, but the conceptual model around updates is reversed from the current model making both updating the data and ensuring its correctness difficult.

Tufor (talkcontribs)

Thanks for your answer. I thought it would not be so difficult, since we have the special page Special:RecentChangesLinked, which shows changes in pages that not only link to a given page but also those that are linked from it; it seems to me that there should be a way. But since you say it would be difficult, I'm not going to push for it (though such a feature would be of use). Sorry for my bad English.

Reply to "Linksfrom:?"

Easy way to identify articles with/without images?

Astinson (WMF) (talkcontribs)

So one of the larger theories about reader experience is that illustrated content is more catchy and engaging -- and that we will want to connect content from Commons with those potential articles.

Moreover, when community groups want to organize events like meta:VisibleWikiWomen or meta:Wikipedia_Pages_Wanting_Photos, it would be super useful to be able to identify which articles don't yet have Commons media on them.

Is there a way to surface in search whether or not articles have an image? Right now the closest thing I can find is whether or not a page has a pageimage indexed in https://en.wikipedia.org/wiki/Mother_Teresa?action=info . Magnus's PetScan surfaces that element, but it's not reliable: sometimes a page will have an image, but not a high-quality one, or it will be a logo or something else that doesn't meet whatever criteria are being used for that filter.

@DTankersley (WMF) @DCausse (WMF) & @EBernhardson (WMF) -- would love your thoughts.

TJones (WMF) (talkcontribs)

I don't think there is currently a good tool for this. You can do something with insource: regular expressions, but regexes can be very expensive queries and they aren't necessarily scalable. (We only allow so many regex queries at once, and if you have no other search terms to narrow the scope of the regex, it will always return incomplete results on large wikis because it times out.)

Here's a fairly generic regex that finds File: links with image suffixes (you may want to add other suffixes):

insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/

So, this query on enwiki currently returns about 100K results, but it times out, so the list is not complete.

The negation ( -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ ) returns 440K documents, and also timed out.

However, if you can limit your search to a particular category or title match or even fairly rare keyword, it should complete. For example: deepcat:"Film stubs" -insource:/\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]/ finishes and gives 632 results (deepcat:"Film stubs" only gets 641 results, so it is easy for the regular expression to run over that limited set).

Note that insource: looks at the actual source of the page, so images included by templates, transclusion, etc, will not be detected.

So, as a once-in-a-while query or set of queries to generate lists for an editathon or other event, this would work. As a widely deployed user-facing tool, it probably would not—though maybe if there are always focused additional search terms.

If you are open to non-search approaches, you could also look at the dumps and write a tool to scan the latest dump for articles without images. It wouldn't be up-to-the-minute, but you could process 100% of a wiki if you wanted to, which would never be possible with insource: searches on larger wikis.
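If you go the offline route, the same regex can be tried directly on page source. The following is a minimal Python sketch; the helper name `has_file_link` and the sample wikitext strings are made up for illustration. It also shows two of the blind spots: images passed as template parameters and upper-case file extensions are not matched.

```python
import re

# Same pattern as the insource: query above, written as a Python regex.
FILE_LINK = re.compile(r"\[\[File[^|\]]+\.(jpg|png|gif|svg)[|\]]")


def has_file_link(wikitext: str) -> bool:
    """True if the raw wikitext contains a [[File:...]] link with a listed suffix."""
    return FILE_LINK.search(wikitext) is not None


print(has_file_link("[[File:Example.jpg|thumb|A caption]]"))    # direct link
print(has_file_link("{{Infobox film | image = Poster.jpg }}"))  # template parameter
print(has_file_link("[[File:Example.JPG|thumb]]"))              # upper-case suffix
```

Only the first sample matches, so a page whose only image sits in an infobox would be reported as having no image.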

TJones (WMF) (talkcontribs)

As @DCausse (WMF) pointed out, not every wiki uses File: so another regex may work, or may not. Infoboxes and templates may have other syntaxes. I suppose just looking for things that look like image file names might work, with a few false positives where an article discusses images without actually having one—which seems rare. Parsing dumps sounds better and better.

Astinson (WMF) (talkcontribs)

@TJones (WMF) that is a very interesting solution, but regex is not what I would want to provide to organizers for regular use.

Also, I just tried your query with a small set and got an article with an image almost immediately: https://en.wikipedia.org/wiki/Lyra_McKee . I have tried multiple examples, and it seems to be retrieving a not-insignificant number of false positives. I tried something else without regex and it seems to produce a better result. That solves my short-term question.

In the long term, I would think a filter like this would be super useful in the search interface itself. I think the challenge with dumps is that you create a huge barrier to novel use cases for folks who are wiki literate but not necessarily technically literate. There are some tools that kind of do this kind of search live (e.g. FIST: https://tools.wmflabs.org/fist), but that tool is kind of overwhelming and breaks from the typical workflows (e.g. leveraging PetScan for categories, because deepcat seems to break every time I use it, with too many categories). But search makes a lot more sense in a tool like PetScan (or any other end-user tool). Ideally you would want to be able to generate a link to a query and share it around with others, and then the updates are going to be consistent from search.

Astinson (WMF) (talkcontribs)
197.218.85.218 (talkcontribs)

It seems like the most sensible place to add this functionality would be as a way to fetch page properties (https://phabricator.wikimedia.org/T200860). Of course there would need to be a generic property added to a page that contains any image at all, e.g. "hasimages" property. Currently it only gets added when an image fulfills certain criteria. Then it would work in a similar manner to Special:PagesWithProp .

Anyway, just knowing whether a page has an image can probably be done by Wikidata Query Service. For instance in a sparql query (https://w.wiki/MrK). That example simply searches for cats within the category:cats on English wikipedia.

I just cobbled that together using a few sparql examples. People more proficient with this would probably be able to make an image appear there, and make it possible to filter out pages without images. It would also be possible to generate those using templates for various use cases.


Reply to "Easy way to identify articles with/without images?"
85.229.20.147 (talkcontribs)

How do I search for a word written in all lower case, omitting the same word when it is a name or otherwise spelled with a capital letter?

DCausse (WMF) (talkcontribs)

The only way to do case-sensitive searches is to use regex through the insource search keyword.

TJones (WMF) (talkcontribs)

Regex queries are very expensive, and on big wikis, like any of the bigger Wikipedias, regex queries will time out. So instead of searching for insource:/mark/, you should search for mark insource:/mark/, which will find all articles with mark (upper or lowercase), and then only do the expensive regex search on the subset of articles that have some form of mark.

Unfortunately, our regex engine doesn't support word boundary syntax (like \b), so using mark in the query will also filter out completely unrelated results (like remarkable). You may still get related terms like marks.

Even with this optimization, the query might still time out (it did for the mark example, because there are 500,000+ articles with mark in them). If you have that problem, add any extra search terms that would decrease the total number of articles the regex needs to scan.

Also, if you want to omit articles that have the capitalized version as well as the uncapitalized version, you can negate an insource query, too: mark insource:/mark/ -insource:/Mark/ -insource:/MARK/. This won't catch every version (mArK, if it exists, could still match), but it should filter the results some more.
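The logic of that combined query can be sketched offline in Python, for example when working on text taken from a dump. This is only an illustration of the filtering idea, not how the search engine evaluates it; the function names `matches_query` and `has_whole_word` are hypothetical. The second helper uses \b, which is available in Python's `re` even though it is not supported on-wiki.

```python
import re


def matches_query(text: str, word: str) -> bool:
    """Mimic: insource:/mark/ -insource:/Mark/ -insource:/MARK/ (for word='mark')."""
    lower = word.lower()
    has_lower = lower in text
    has_other_case = lower.capitalize() in text or lower.upper() in text
    return has_lower and not has_other_case


def has_whole_word(text: str, word: str) -> bool:
    """Case-sensitive whole-word match using a word boundary."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None


print(matches_query("a remarkable mark", "mark"))  # lower case only
print(matches_query("Mark left a mark", "mark"))   # capitalized form present
print(matches_query("mArK and mark", "mark"))      # mixed case slips through
print(has_whole_word("remarkable", "mark"))        # partial word only
```

As in the on-wiki query, mixed-case variants such as mArK are not excluded by `matches_query`.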

Speravir (talkcontribs)

In addition: if you are searching for such a word in the title only, use intitle. It is used the same way as insource, as shown by TJones, and it carries the same risk of timing out when using regex search. And a tip: narrowing down the search by limiting it to one namespace helps in a lot of cases (but not every time), so for the main namespace (articles in Wikipedias) start your search with a single isolated colon. Check the section “Prefix and namespace” of the CirrusSearch help.

TJones (WMF) (talkcontribs)

@Speravir: Good call on intitle! It can time out, but obviously the title index is a lot smaller than the full article index, so it is much less likely to time out if you have additional keywords in your query.

85.229.20.147 (talkcontribs)

I don't understand how to use this. I type /mark/ in the search window at Wikipedia and still get all the results for Mark too. When I type mark insource:/mark/ , I get the result

Create the page "Mark insource:/mark/" on this wiki!

No results matched your query.

TJones (WMF) (talkcontribs)

Sorry not to respond sooner. I was on vacation last week and not checking messages.

You can't do a regex search for /mark/ by itself, you have to use the insource: keyword. When I search for mark insource:/mark/ on English Wikipedia, I get tens of thousands of results. (Since there are over 500,000 articles with mark in them, it still times out, so the number of results you get is not exact.) You will still get matches to Mark when both Mark and mark occur in the article. For example, the article on Mark Knopfler includes the URL http://www.mark-knopfler-news.co.uk/biogs/mark.html, so the article contains both Mark and mark. If you don't want any instance of Mark or MARK, try mark insource:/mark/ -insource:/Mark/ -insource:/MARK/. You will still get partial-word matches like remarkable.

Regexes that match exact words are hard on-wiki because we don't support all regex features.

Maybe if you describe the larger goal that you are trying to accomplish, we can suggest some other ways of doing it. If this is something you need to do regularly for the foreseeable future, maybe downloading a wiki dump and using better tools on the dump would be more effective, for example.

Reply to "Shift sensitivity"
Jonteemil (talkcontribs)

Hello!


How can you run a search for pages whose results you can then copy and paste into AWB? When searching, there is always some content from each page shown below the result. I just want the list of pages, not the content of each page. Is this possible? I know you can write a query of some kind, but I don't really know SQL; I only know the insource regexes. Can someone help me?

Tacsipacsi (talkcontribs)

The AWB “Wiki search” (or something similar, I don’t have it open right now) generate option uses CirrusSearch on Wikimedia wikis. You could call that API endpoint yourself and transform the result to a copy-pasteable format, but why would you do that if you can simply paste the search query in AWB’s search box?
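For reference, calling the search API yourself could look roughly like the Python sketch below. It uses the standard MediaWiki parameters action=query and list=search with srsearch and srlimit; the helper names and the example API URL are illustrative, and the network call is left out so the snippet only builds the request URL and extracts titles from a response body.

```python
import json
from typing import List
from urllib.parse import urlencode


def search_url(api: str, query: str, limit: int = 50) -> str:
    """Build a search API URL for the given query string."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return api + "?" + urlencode(params)


def titles_from_response(body: str) -> List[str]:
    """Extract just the page titles from a JSON search response."""
    data = json.loads(body)
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]


# Example API location; adjust to the wiki you are searching.
print(search_url("https://en.wikipedia.org/w/api.php", "insource:/foo/", 10))
```

Printing one title per line from `titles_from_response` gives a list that can be pasted into a page-list box.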

Jonteemil (talkcontribs)

I use w:Wp:JWB and I don't think there's a search box there. I can't find one anyway.

Tacsipacsi (talkcontribs)

Oh, I thought that if you write AWB, you mean AWB… Actually search is on its way to JWB, although it seems to be a bit slow on its way. Probably you should urge the author to incorporate this feature.

Jonteemil (talkcontribs)

Is there a way I can do it at the moment? Or will I have to wait?

Tacsipacsi (talkcontribs)

Expressing your wish in the talk page section might greatly speed up things. For the time being, you can use w:User:Colin M/scripts/JWB annotated.js (mw.loader.load('https://en.wikipedia.org/w/index.php?title=User:Colin_M/scripts/JWB_annotated.js&action=raw&ctype=text/javascript');) on Project:AutoWikiBrowser/Script Beta. Please note that this is at your own risk, as Colin M can put any, even malicious JavaScript code on this page, although that’s quite unlikely.

Jonteemil (talkcontribs)

Thanks so much for showing me that fork of JWB. It was very helpful.

Reply to "Generate a list"