Topic on Help talk:CirrusSearch/Flow

Shift sensitivity

7 comments • 14:33, 10 August 2020 4 years ago

85.229.20.147 (talkcontribs)

How do I search for a word written in all lower case, omitting the same word when it is a name or otherwise spelled with a capital letter?

Reply 04:31, 30 July 2020 4 years ago

DCausse (WMF) (talkcontribs)

The only way to do case-sensitive searches is to use regex through the insource search keyword.

Reply 07:44, 30 July 2020 4 years ago

TJones (WMF) (talkcontribs)

Regex queries are very expensive, and on big wikis, like any of the bigger Wikipedias, regex queries will time out. So instead of searching for insource:/mark/, you should search for mark insource:/mark/, which will find all articles with mark (upper or lowercase), and then only do the expensive regex search on the subset of articles that have some form of mark.

Unfortunately, our regex engine doesn't support word boundary syntax (like \b), so the regex itself can't be limited to whole words. Including mark as a plain search term does, however, filter out articles that only contain it inside unrelated words (like remarkable). You may still get related terms like marks.

Even with this optimization, the query might still time out (it did for the mark example, because there are 500,000+ articles with mark in them). If you have that problem, add any extra search terms that would decrease the total number of articles the regex needs to scan.

Also, if you want to omit articles that have the capitalized version as well as the uncapitalized version, you can negate an insource query, too: mark insource:/mark/ -insource:/Mark/ -insource:/MARK/. This won't catch every version (mArK, if it exists, could still match), but it should filter the results some more.
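
To make the pattern above concrete, here is a minimal Python sketch that assembles the same query string for an arbitrary word. The function names are made up for illustration, and the URL helper assumes the standard MediaWiki search API (action=query, list=search, srsearch) on English Wikipedia. It only builds strings and does not send any request.

```python
from urllib.parse import urlencode


def build_case_sensitive_query(word: str, extra_terms: str = "") -> str:
    """Build a CirrusSearch query for the all-lowercase form of `word`.

    Hypothetical helper: it only handles plain alphabetic words, so no
    regex escaping is attempted.
    """
    if not word.isalpha() or not word.islower():
        raise ValueError("expected a plain lowercase word")
    parts = [
        word,                                 # cheap keyword filter, runs first
        f"insource:/{word}/",                 # case-sensitive regex on the subset
        f"-insource:/{word.capitalize()}/",   # exclude pages containing "Mark"
        f"-insource:/{word.upper()}/",        # exclude pages containing "MARK"
    ]
    if extra_terms:
        # Extra keywords shrink the set of pages the regex has to scan.
        parts.append(extra_terms)
    return " ".join(parts)


def build_search_url(query: str, api: str = "https://en.wikipedia.org/w/api.php") -> str:
    """Return a search API URL for `query` (assumed endpoint, not fetched here)."""
    params = {"action": "query", "list": "search", "srsearch": query, "format": "json"}
    return f"{api}?{urlencode(params)}"


if __name__ == "__main__":
    q = build_case_sensitive_query("mark")
    print(q)
    print(build_search_url(q))
```

As noted above, this does not exclude mixed-case variants such as mArK, and the regex part can still time out on large wikis.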

Reply 16:17, 30 July 2020 4 years ago

Speravir (talkcontribs)

In addition: if you are searching for such a word in titles only, use intitle. It is used the same way as insource, as shown by TJones, and it also risks timing out when used with a regex. And a tip: narrowing the search to a single namespace helps in a lot of cases (but not every time), so for the main namespace (articles in Wikipedias) start your search with a single isolated colon. Check the section “Prefix and namespace” of the CirrusSearch help.

Reply 17:46, 30 July 2020 4 years ago

TJones (WMF) (talkcontribs)

@Speravir: Good call on intitle! It can time out, but obviously the title index is a lot smaller than the full article index, so it is much less likely to time out if you have additional keywords in your query.

Reply 17:59, 30 July 2020 4 years ago

85.229.20.147 (talkcontribs)

I don't understand how to use this. I type /mark/ in the search window at Wikipedia and still get all the results for Mark too. When I type mark insource:/mark/ , I get the result

Create the page "Mark insource:/mark/" on this wiki!

No results matched your query.

Reply 14:00, 2 August 2020 4 years ago

TJones (WMF) (talkcontribs)

Sorry not to respond sooner. I was on vacation last week and not checking messages.

You can't do a regex search for /mark/ by itself; you have to use the insource: keyword. When I search for mark insource:/mark/ on English Wikipedia, I get tens of thousands of results. (Since there are over 500,000 articles with mark in them, it still times out, so the number of results you get is not exact.) You will still get matches to Mark when both Mark and mark occur in the article. For example, the article on Mark Knopfler includes the URL http://www.mark-knopfler-news.co.uk/biogs/mark.html, so the article contains both Mark and mark. If you don't want any instance of Mark or MARK, try mark insource:/mark/ -insource:/Mark/ -insource:/MARK/. You will still get partial-word matches like remarkable.

Regexes that match exact words are hard on-wiki because we don't support all regex features.

Maybe if you describe the larger goal that you are trying to accomplish, we can suggest some other ways of doing it. If this is something you need to do regularly for the foreseeable future, maybe downloading a wiki dump and using better tools on the dump would be more effective, for example.
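
As a rough illustration of the "better tools on a dump" idea, the Python sketch below checks page text offline with a regex engine that does support \b. The function name and the sample pages are invented for the example, and reading the actual dump file is left out.

```python
import re


def only_lowercase_word(text: str, word: str) -> bool:
    """True if `word` occurs as a whole word and every occurrence is lowercase."""
    # \b gives the whole-word matching that the on-wiki regex engine lacks.
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    # Collect every casing of the word that actually appears in the text.
    forms = {match.group(0) for match in pattern.finditer(text)}
    return forms == {word}


if __name__ == "__main__":
    # Hypothetical sample data standing in for page texts read from a dump.
    pages = {
        "Page A": "He left a mark on the wall.",
        "Page B": "Mark left a mark on the wall.",
        "Page C": "A remarkable result.",
    }
    for title, text in pages.items():
        if only_lowercase_word(text, "mark"):
            print(title)
```

Unlike the on-wiki query, this rejects any capitalized variant (Mark, MARK, mArK) and ignores partial-word matches such as remarkable or marks.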

Reply 14:33, 10 August 2020 4 years ago
