Talk:Search/Old/status

About this board

New "insource:" syntax, etc.

3 comments • 03:10, 22 November 2014 10 years ago

3

I'm happy about the new "insource:" syntax, because every once in a while I find myself wishing for just that kind of low-level inspection (finding certain kinds of malformed information on commons file pages, for instance). Can I assume that the regex flavor is implemented in a way that's smart enough to only run it on files that would be the results returned by the rest of the query? I don't want to hammer the servers playing with it, but it did seem to be quite fast in my first use, which had a "prefix:" term that narrowed the field to ~600 hits by itself. Is this a reasonable usage?

Also, does the non-regex version basically just ignore non-word characters much like the other search functionallity? It seemed so from a few quick tests.

And, just to completely overload this post, I'm also wondering about what the "first paragraph" weighting would apply to in the context of a typical commons File: page... basically only up to the first heading? This could be significant in terms of best practices for adding information to the pages, I think. Current upload methods tend to slap a ==Summary== header at the very top of the page, while many older uploads are lacking this. Could this cause some wonky weighting of older vs. newer uploads?

Reply 02:03, 29 June 2014 10 years ago

😂 (talkcontribs)

You can't hammer the servers too much, pool counter is pretty small for regex searches ;-) I'll leave it to Nik to say how late/early in result processing it handles regular expressions.

Yes, most punctuation and so forth is ignored for non-regex searches. insource: searches a different field `source_text` instead of `text`, the latter of which is configured with all kinds of language-specific bells and whistles to make it better at finding content for the majority of readers.

The first paragraph weighting isn't as nice on the PHP side as I'd like. It uses the pretty naïve approach you outline there, where it just uses stuff before the first heading which isn't necessarily the best.

Reply Edited by Shirayuki 03:10, 22 November 2014 10 years ago

NEverett (WMF) (talkcontribs)

Sorry it took me so long to get to this.... Vacation and performance work have been squeezing me dry.

As far as order of operations - Elasticsearch _should_ do the right thing and execute the expensive filter last. On Thursday (I think) we're pushing a change to Cirrus that gives Elasticsearch a big hint that the regexes need to come last. If it isn't fast now it should be then.

The non-regex flavor of insource uses the standard analyzer used for the rest of the text. So its exactly how intitle works except against source. Its not prefect but its at least somewhat intuitive.

The first paragraph weighting thing is more something we should change to work around on wiki habits rather then the other way around. I built it so you could plug multiple implementations into it but only implemented the naive, until the first heading approach. It'd be simple enough modify or create a new one that skips the first heading if it is the very first thing.

Reply 14:56, 29 July 2014 10 years ago

Reply to "New "insource:" syntax, etc."

Works for Japanese

8 comments • 14:57, 29 July 2014 10 years ago

8

Whym (talkcontribs)

I don't see much feedback from Japanese wikis so I'd like to give one. The new search works pretty well for me on Japanese Wikipedia and Wiktionary. I especially like the section title highlighting and the improved word count in each search result. For example, against this query the old search gives a search result with a line saying "6 kb (24 words)" which is unreasonable, while the new search gives "6 kb (1,836 words)" which is reasonable.

Reply 08:31, 16 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Yay! Thanks! The old search used spaces for word count (I believe) but the Cirrus delegates to the text analyzer which has some knowledge about Japanese.

Questions:

There is an Elasticsearch plugin that is supposed to make Japanese analysis better. Would you be willing to try it out if I expose it in jawiki in beta and tell me if it is better/worse/the same?
I'd like to start enabling cirrus as the default on more wikipedias. We're almost everywhere but wikipedias. Anyway, would you be willing to talk about it on jawiki's village pump? I'd love to do it with community support rather then force it on folks.

Reply 16:47, 16 May 2014 10 years ago

Nemo bis (talkcontribs)

+1 on Nik: whym, it would be wonderful if you could help with that. :)

Reply 19:02, 16 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

I deployed the plugin in beta this afternoon and loaded a few pages. You can try it and compare: http://ja.wikipedia.beta.wmflabs.org/w/index.php?title=%E7%89%B9%E5%88%A5%3A%E6%A4%9C%E7%B4%A2&profile=default&search=%E4%B8%89&fulltext=Search

Reply 21:31, 16 May 2014 10 years ago

Whym (talkcontribs)

NEverett, I'd like to try both. Do you know what exactly is the difference between the kuromoji plugin and the one you currently use? Is the current one inherited from lsearchd? This will also allow the community to know how the difference will be (and maybe how they can help debug).

The version on the beta.wmflabs.org looks not bad, but it is hard to say whether it is "better" unless we test these analyzers against the same document set. In general, I believe the difference of those Japanese analysis engines will be very subtle when looking at the search result quality, as long as they use the same or a similar morphological dictionary.

Reply Edited 02:17, 17 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

The Kuromoji plugin looks to be an effort to integrate this which claims support for lemmatization and readings for kanji. I'm playing with the default setup for it and I don't see any kanji normalization, but it does a much better job with word segmentation then the one that is deployed on jawiki now. The one deployed on jawiki now is Lucene's StandardAnalyzer which implements unicode word segmentation. I haven't dove into that deeply enough to explain it, but some examples.

日本国 becomes
日本 and 国 in kuromoji
日 and 本 and 国 in standard

にっぽんこく becomes
にっぽん and こい in kuromoji
に and っ and ぽ and ん and こ and い in standard

From that it looks like kuromoji should be better but standard is saved by executing the search for all the characters as a phrase search which makes everything line up _reasonably_ well. It won't perform as well, but that should be ok too.

And it looks like my fancy highlighter chokes on kuromoji, which isn't cool. Look here. There are results without any highlighted anything which isn't good.

With regards to lsearchd: I'm not sure what it uses. It doesn't have the api that lets me see how text is analyzed so I have to guess from reading the code and there is a lot of it.

Reply Edited 19:17, 19 May 2014 10 years ago

Whym (talkcontribs)

Do you want to continue working on kuromoji plugin until it is ready regarding highlighting? Or do you want to officialize the current beta feature as it is? I support your observations in that kuromoji's segmentation is more linguistically meaningful, which could improve search. However, failing to highlight is a major issue, and so far I personally cannot see how much search/snippet would be improved by kuromoji.

Is it easy for you to import all pages from jawiki to the test instance, or create another test instance using the same reduced document set, but processed by StandardAnalyzer? I'd be interested in testing various queries to check differences in search results, looking at what are retrieved and what are not by which.

Reply 14:16, 23 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

My I'm bad at replying to these. Sorry. So, yeah, my plan right now is to go ahead with the standard analyzer and do more work on getting the Japanese one better later. Sorry for the late reply.

Reply 14:57, 29 July 2014 10 years ago

Reply to "Works for Japanese"

Zero Width Joiner and Zero Width Non Joiner

5 comments • 15:25, 24 July 2014 10 years ago

5

Siddhartha Ghai (talkcontribs)

Hi.

I'm curious as to what behaviour search has when an input string has a ZWJ or ZWNJ unicode character. Are results without the ZWJ / ZWNJ searched for? And what if a search doesn't contain ZWJ/ZWNJ but a page with the exact same spelling but including one of these characters in between exists?

As far as I know, search on the WMF cluster as of now doesn't treat words including ZWJ/ZWNJ the same as those not including these. I don't think this behaviour is correct, and the matter probably needs to be investigated since I think some indic language IMEs provide options for the input of these characters (to force the rendering of a particular glyph) and pages with titles containing these characters may be created.

Reply 13:18, 1 March 2014 10 years ago

NEverett (WMF) (talkcontribs)

So we've been holding off on these kinds of issues until we're able to get the unicode plugin for Elasticsearch deployed on the cluster. The plan is for CirrusSearch to use it (if it is installed on Elasticsearch) to take a first crack at the problem and then go from there. We're willing and able to go beyond that but we'd like to start there. The holdup is just that Elasticsearch plugins are deployed differently then most other things at WMF so we have to work up a special mechanism for them. We're moving along on that project so we should be able to start really improving things "soon".

Still, I'd love some good test cases to make sure that we're going in the right direction. I'd be thrilled if you filed a bug with some examples of things that don't work but should.

Reply 17:50, 4 March 2014 10 years ago

Siddhartha Ghai (talkcontribs)

TL;DR version:

I would file a bug except that I'm not sure what the behaviour should be. I think the issue needs some discussion before an actual bug is filed, since as I see the issue, it is complicated, and there are several potential methods to resolve it.

Full comment:

My interest in these chars is in indic languages, specifically hindi.

Per the Unicode Indic joining behaviour model, there are 4 different ways in which ZWJ/ZWNJ can be used, with the resulting renderings differing.

An example case is the following four pages (the page content has the unicode sequence used):

(Note: The last two were created today and may not show up in search till tomorrow)

It should be noted that the rendering would differ depending on what glyphs the actual font has. So, a font designed for, say, Sanskrit may have a full conjunct glyph, whereas one for hindi may not (since sanskrit used many more conjunct forms than hindi IIRC). As for the current situation, the proprietary Mangal font that ships with Windows by default shows the above four in the same way, in the fully expanded form with explicit viram, since it doesn't contain any glyph. However, changing the font family to Lohit (the font used for hindi in ULS), the rendering for the first page differs from the other three, the first showing a conjunct glyph with the others still showing the fully expanded form. There may be cases where all four renderings differ, but I'm not aware if the behaviour model is implemented by any fonts yet or not.

Now, as far as language is concerned, the subpagename in all four is essentially the same word. The fact that the glyph may be rendered differently doesn't change how it's read (pronounced), or what it means.

So what we have effectively is four different ways to write the same word, possibly with four different renderings or one rendering depending on the font the user has.

This means that as of now, depending on the IME a particular user is using, he/she may not find in search what they were looking for and end up creating duplicate pages on the same topic. And the two titles may be rendered exactly the same for another user. Needless to say, this will leave the average user perplexed.

(Note: IIRC, I have come across one such case where a dupe was created by a newbie when he couldn't find the article that he created)

I find this to be complicated, similar to the unicode normalization issue, with various possible solutions.

Solution 1

Strip all ZWJ/ZWNJ from all text and pagenames and search queries

Pros:

No chances of page duplication
No search issues

Cons:

No ability to force particular glyphs
Probably problematic for sanskrit wikisource (where ZWJ/ZWNJ may be really needed)

Solution 2

Strip all ZWJ/ZWNJ from pagenames and search queries

Pros:

No chances of page duplication
No search issues

Cons:

No ability to force particular glyphs

Solution 3

Treat all four cases as one for search

Pros:

Probably easiest to implement

Cons:

Duplicate page creation remains possible
Even if the search functionality works, the text find and replace in the editbar, and the inbuilt find/replace feature of browsers may not work correctly.

Reply 05:10, 7 March 2014 10 years ago

NEverett (WMF) (talkcontribs)

Sorry for the super duper late reply, but, here goes:

I can use case folding to flatten all four of these examples into "the same" word from search's perspective. That is, NFKC with case folding tacked on the end.

Now some choices: 1. Do this on both the analyzers that we use for text or just the less exact one. If I just do the less exact one then the words that match without normalization will bubble above those that match with normalization. And, by default, "quoting" a word will not find it normalized. I'm leaning towards adding the normalization to both analyzers for this reason. 2. Should I add this to all languages, most languages, just languages for which I don't have a good default, or just languages that ask for it? Note that I'm actually waiting on a change upstream to enable me to add things to "all" or "most" languages. 3. Other stuff?

Reply 17:50, 8 May 2014 10 years ago

Siddhartha Ghai (talkcontribs)

Sorry for the super duper late reply (went on a wikibreak):

I don't think applying case folding to search queries will have a major effect on projects in languages that don't have case. AFAIK, none of the indic family scripts have case. Do note though that just because the project is in an indic language doesn't necessarily mean that there won't be any content in other case-sensitive languages. There can always be discussions, Help pages and Mediawiki: namespace stuff in english. So searches related to such stuff will be affected.

The decision about whether or not to apply case folding by default could be decided on the basis of how much content on a particular project seems to be in a case-sensitive language. Finding this out, will, ofcourse, require some database queries to analyze how much content is in which script on the project.

So:

I also think applying it to both analyzers would be better
The change should be applied on a case-by-case basis to language projects that ask for it (Although if the change is found useful on a few language projects of the indic script family, I think it can be extended to all indic scripts).
Other stuff: This resolves the search part, but not the title part. Ideally, it shouldn't be possible to create four different pages for the same title, and, if needed, the glyph to be used in the title should be controlled by a magic word or something. Not sure where to raise this point for a proper discussion. Ideas?

Reply 15:25, 24 July 2014 10 years ago

Reply to "Zero Width Joiner and Zero Width Non Joiner"

'.' considered a word character?

7 comments • 07:57, 18 July 2014 10 years ago

7

Junkyardsparkle (talkcontribs)

When using the following query on commons:

incategory:California_Historical_Society_Collection,_1860-1960 intitle:restoration.jpg

One result is found, but if "restoration.jpg" is truncated to "restoration" (as would normally be the case when searching for that term) no results are returned. This is highly problematic for title searches on commons, where most page titles include file extensions. Possibly related to this this bug?

Reply 01:08, 7 April 2014 10 years ago

NEverett (WMF) (talkcontribs)

I believe this will be fixed by the fix for bugzilla:63861. That should hit commons tomorrow and I'll rebuild the index and see if that fixes it.

Reply 21:55, 21 April 2014 10 years ago

Junkyardsparkle (talkcontribs)

Great. In general, the new search works so well that I forget that it's an opt-in beta... in particular, it's nice for creating fairly tight heuristically-defined lists of files for acting on with cat-a-lot on commons. Most of the outliers are conveniently sorted to the end of the listing for easy unselecting. I have bumped my head on (apparently) some query complexity limits while doing crazy things, but otherwise it's a very powerful tool for more than just landing a casual user on the right article page... cheers to everybody working on it. :)

Reply 23:11, 21 April 2014 10 years ago

NEverett (WMF) (talkcontribs)

Thanks! Its just User:Demon and I working on it but we're leaning on Elasticsearch and Lucene which are pretty powerful. Would you mind posting examples of some of the neat queries that work for you? I'll add them to the regression test suite.

I did try to rebuild commons yesterday but bumped up against a timeout error during one of the rebuild steps an hour and a half into the process. I built a fix this morning and I'll try to squeeze it out to production today and try again.

Reply 13:02, 23 April 2014 10 years ago

NEverett (WMF) (talkcontribs)

@Junkyardsparkle: That seems to have done the trick. Give it a shot now.

Reply 19:02, 23 April 2014 10 years ago

Junkyardsparkle (talkcontribs)

Looks good, I'll get back to categorizing now... don't know if my actually used queries would be useful for regression testing, because they tend to obsolete themselves after I act on the results, at least as far as the (possibly negated) "incategory:" terms go, which is what I tend to end up with a lot of... but if I can abstract a good test that reflects my use case, I'll mention it here. Thanks again.

Reply 06:08, 24 April 2014 10 years ago

Junkyardsparkle (talkcontribs)

As of right now, things have regressed to this being a problem again. Example case, this query no longer finds this file. Hope those new servers help enough to put the fix back in place. :)

Reply 07:57, 18 July 2014 10 years ago

Reply to "'.' considered a word character?"

Weight hits that are early in the article more highly then results at the end

23 comments • 16:09, 7 June 2014 10 years ago

23

Summary by Nemo bis

Tracked in Bugzilla
Bug 66045

Tracked in Bugzilla
Bug 61669

Shawn à Montréal (talkcontribs)

I couldn't figure out why my search results had suddenly gone to pot until I figured out that "new search" had been auto-enabled. Sorry, but in my experience it's complete junk. I use search to find documentaries in related fields quite a lot, and suddenly it had seemed as if the search function was returning near-random results. Now, with old search back, a search for say, "Algerian War" and "documentary" gets me what I'm looking for: articles related to those terms. With the "new" function on, such a search is virtually useless.

Reply 16:07, 23 May 2014 10 years ago

Deskana (WMF) (talkcontribs)

Thanks for the report. Can you tell us what wiki you're searching on?

Reply 21:17, 23 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

English Wikipedia. I'm surprised no one else has mentioned it. I'm no SEO guy, but it's as if the "new" search has lost the ability to weight results depending on where they occur within an article. Specifically, I'd been surprised to see that if there was a mention of one my search terms -- say, the word "documentary" -- even in a reference or an external link, as opposed to the actual body text of the article. the new search would return those results near the top. "Old" search seems to be to be much closer to what I get in a Google search, which is to say, the search function would have the intelligence to distinguish been non-trivial and trivial mentions, somehow.

Reply Edited 22:30, 23 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

Part of the issue may be related to the introduction of slop into phrase searches, which confused me initially... see "Double quotes no longer result in phrase search" thread below for gory details. TLDR:

"algerian war"~0 documentary

in the new search will gives results fairly similar to

"algerian war" documentary

in the old search, where not using the "~0" gives looser results.

Reply 23:07, 23 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

I don't think it's (just) that. For example, a search for the words Mugabe and documentary in old mode returns the two articles on documentaries about the leader, first and second, flawlessly. But switch to new mode, and the two docs are in first and sixth place -- clearly not as good. I don't know what you folks have done, but it's a net loss not a gain, from what I can see.

Reply Edited 23:56, 23 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

No, I didn't mean to imply that the problem was entirely (or even mostly) that... the new search does seem to be less magical with respect to your examples. The weighting voodoo is beyond my monkey comprehension, I'm just happy that I can create an explicit query when I want to, and there is some nice syntax available now... for instance, for your purposes, this seems to work pretty well:

mugabe documentary boost-templates:"Template:infobox film|300%"

Again, I'm not trying to say you don't have a valid complaint, just presenting what might be a useful workaround (or potentially even an improvement on hoping the search will weight things the way you want). :)

Reply 00:22, 24 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

I'm sorry I don't know what that means or what to do with it. But thanks for trying to help.

Anyway, so long as we still have access to the old search function, the one that worked, it's fine.

Reply 01:31, 24 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

It boosts the weighting of results that have the "film" infobox on the page. I don't think they plan to maintain the old search indefinitely, so forgive me if I hijack the thread with some ideas about how to make the new one work better for your purposes.

I'm wondering if it would be possible to implement a weighting method that uses boost-templates under the hood, by mapping certain high-confidence templates to the occurrence of certain associated terms (when not used in a phrase). For instance, if "documentary", "movie", etc implied a boost to the "Infobox film" template. Sorry if this isn't feasible or is already implemented in some way, I'm pretty ignorant of the weighting magic, like I said...

Reply 02:21, 24 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

Oh, I see, you're actually talking about improving how the dingus works?

But why don't get is why folks messed with it in the first place. It worked just fine.

Reply 02:52, 24 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

I'm talking about that now, but I was also pointing out that you can use the boost-template syntax in your own searches; using the example given should help articles about films bubble up towards the top of the list. From what I understand, the old search was difficult to maintain on the back end, and the new one will be better in that regard.

Reply 03:04, 24 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

I see. I'm sorry but I have no idea how to modify the syntax or anything of that nature. Like most, I guess. I just type words in the window and hit the button.

Reply 03:12, 24 May 2014 10 years ago

Shawn à Montréal (talkcontribs)

I'm confident they'll fix the new search before they remove the old one - but one other thing I realize one can do is use Google Advanced Search to search Wikipedia. Tried it and it works pretty well.

Reply 11:14, 24 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Yeah, that boost-templates thing is more to test the default template boosting. You see, there is a configuration parameter on wiki that can be set to make everyone's searches silently contain some boost. The idea was to allow community curation of the results. Commmons uses it but not that extensively. You'd use boost-templates in your query either when you want to disable the defaults or when you want to test new ones. So its really a "super expert" kind of thing. In addition to that, its a convenient hook for my regression tests to check the feature.

Reply 20:50, 4 June 2014 10 years ago

NEverett (WMF) (talkcontribs)

Thanks for coming here to complain about these results. We'll figure some way out to make it at least as good for this class of search.

As to why we're replacing the old search when it is so good at finding results, here is the short list:

Old search crashes/rans out of resources from time to time and no one knows how to fix it. Its a pretty large code base based on really old libraries. New search is based off of relatively standard services under active development.
Old search updates every few days and often misses things. New one updates pretty near real time. Page edits are usually in the index in under a minute. Template edits are can take longer to be reflected in the pages that contain those templates.
Old search doesn't do anything with templates. New search fully resolves templates. Its *righter* but its more trouble.

The truth is that the replacement project was driven internally by ops folks raising a ruckus because the old one had no maintainer and wasn't super stable. There is also a significant backlog of bugs and feature requests for search that we've had to ignore because the old one was so hard to work on. So that's how you get where we are.

As far as why the new search doesn't spit out results exactly like the old one, one of the reasons is that the old one is super customized for English Wikipedia. Its difficult to navigate and many of the customizations were speculative: they didn't really provide better results, they just were there. So we implemented the ones that were obviously better and deployed the new search as a BetaFeature so folks could try it. When we tried it we found the results were usually similar but not better or worse. You've hit on one of the customizations that we didn't reimplement: the old search weights hits that are early in the article more highly then results at the end. We didn't do this because our tests didn't show it made much difference. But for you searches it makes a pretty huge difference.

Long story short, we'll implement that.

Also, if you are curious on how scoring works you can read the first half of this presentation. The other half won't be all the interesting.

Reply 17:53, 2 June 2014 10 years ago

Shawn à Montréal (talkcontribs)

Thank you very much. Frankly, I didn't think people would much care what I had to say.

Reply Edited 01:00, 3 June 2014 10 years ago

Junkyardsparkle (talkcontribs)

Interesting, I wouldn't have guessed that was the optimization involved, but now that you mention it, that weighting does make a huge amount of sense in the context of wikipedia articles, being summarized in the lead section...

Reply 21:58, 3 June 2014 10 years ago

Shawn à Montréal (talkcontribs)

I'm surprised it was not judged to be worth retaining, initially. Google has made much great strides in making its search more intelligent, in distinguishing between relevant and trivial mentions of search terms.

In Wikipedia, we have guidelines that explain the importance of summarizing key concepts in the article lead. To design a search engine to intentionally disregard that very structure is puzzling to me.

Reply 02:31, 4 June 2014 10 years ago

Nemo bis (talkcontribs)

They did not "intentionally disregard" the feature, they just have not spent time developing it from scratch; but you were told they will now. Also consider this weighing is not a search backend standard, it's not even valid for many MediaWikis including Wikimedia wikis (specifically, in order of traffic: Commons, Wiktionary, Wikiquote, Wikisource and Wikibooks).

This system of prioritisation makes sense to me: it would have been worse if they had tried to reimplement every single feature and customisation of the old custom search moloch, even unrequested. We would have wasted lots of developer time and ended up with another unmaintainable system which would receive no love for the next 5 years.

Reply 07:21, 4 June 2014 10 years ago

NEverett (WMF) (talkcontribs)

Thanks for the defence, but Shawn's right; its a relatively obvious optimization. Its something that's "been on the list" for a long time but it kept getting lower and lower under as we'd been in beta and no complained about quality in a way that this would have caught. I frankly forgot about it.

As far as intentionally disregard, if anyone did any disregarding, it was me. I'd prefer to characterize what I did in this case as getting snowblinded by all the (probably) speculative features to improve search quality that I didn't give this one as much weight as it deserves. But there isn't a clear line between that and intentionally disregard. It did, after all, make it onto my list, just too low.

I will admit to getting mired in a pet issue of mine, highlighting. The highlighter wasn't going to support it so I spent quite a bit of time on it. In fact, the highlighter used on enwiki and commons right now does prefer snippets from the beginning of the article. But I got distracted by the snippet issue and didn't cover the scoring issue.

Anyway, I'm going to go fiddle with positional boosts now. Depending on how that goes you'll get a solution soon.

Reply 20:02, 4 June 2014 10 years ago

Shawn à Montréal (talkcontribs)

I certainly didn't mean to offend anyone, sorry. I think it's kinda neat that this one lone comment from me has been helpful to the cause, and thanks.

Reply 20:21, 4 June 2014 10 years ago

NEverett (WMF) (talkcontribs)

Complaining is how I know more has to be done!

I implemented weighing terms early in the article more highly then later (locally, not deployed) but I'm not happy that'd be enough for your case. Mugabe's Zimbabwe doesn't have the word "documentary" in the opening. It calls it a "factual film". I'm sure there is a distinction but I imagine its small enough people still think of it as a documentary. I mean, it is in the "Documentary films about politicians" category. I think I'll add a search in the category with a decent weight as well. That seems like it'd help.

Reply 11:41, 5 June 2014 10 years ago

Shawn à Montréal (talkcontribs)

Oh yes, the "factual film" thing is a real outlier. Don't worry about that. But yes if you could weight the categories a bit more, then that might indeed help search results. Good idea.

Reply 13:54, 5 June 2014 10 years ago

NEverett (WMF) (talkcontribs)

Both of those changes are ready for review. I imagine the category thing will catch the factual film outlier. My best guess is we'll deploy them to the test wikis next Thursday and to wikipedias the Thursday after that. Both changes, though, will require some time to take effect because the index will have to be rebuilt. That'll take a few days. That is one of the problems with Cirrus: the old search could rebuild the entire index more quickly because it didn't bother with stuff like templates. We can't. We react more quickly because we're able to hook more tightly into the infrastructure and we can throw more cpu at the problem. But when you have to change the index it takes some time. OTOH its like 100 times easier to debug then the old stuff, so tradeoffs.....

Reply 14:45, 5 June 2014 10 years ago

Reply to "Weight hits that are early in the article more highly then results at the end"

Search result not including articles created in recent months in zh-yue

6 comments • 15:57, 19 May 2014 10 years ago

6

Yaukasin (talkcontribs)

Hello. I am a user of Cantonese Wikipedia (zh-yue). I am seeking for help because there is a major bug in our search function - only articles created on or before 2012 are included in the search result. Admins in Cantonese Wikipedia said that they don't know how to fix it. Please help us. Thank you very much. (Related discussion in Cantonese)

Reply 02:43, 10 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Can you check how the "New Search" BetaFeature works for you? If it does a decent job of finding stuff then I can switch the whole wiki over to using it as the default. This (not being updated) is exactly the kind of thing that is difficult to fix in the old search and really simple to work on in the new one.

Reply 18:39, 12 May 2014 10 years ago

Yaukasin (talkcontribs)

Thanks for your advice. I turn on "New Search" BetaFeature and discover that there seems to be no problem for searching recent articles. Take searching a Han character "辣" (meaning hot & spicy) as an example: In the old search, only 8 results were found dated from Aug 2011 to Apr 2012. In the new search, 163 results are found and the top 50 results contains pretty many articles created during 2013 to 2014.

Reply 14:43, 15 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

I'll schedule zh-yue to switch to "New Search" as the primary search backend sometime next week then.

Reply 16:48, 16 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Switched. Let me know (here or in bugzilla) if anyone has any trouble with it.

Reply 14:45, 19 May 2014 10 years ago

Yaukasin (talkcontribs)

Thank you. I have just notified the zh-yue community about this improvement.

Reply 15:57, 19 May 2014 10 years ago

Reply to "Search result not including articles created in recent months in zh-yue"

Filters not working

8 comments • 19:27, 12 May 2014 10 years ago

8

Spinningspark (talkcontribs)

Search filters like prefix: and intitle: are not working from the search box.

Reply 00:21, 7 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Which wiki? Can you provide an example?

Reply 17:15, 8 May 2014 10 years ago

Spinningspark (talkcontribs)

This is en.wiki with the new search engine checked under my preferences in beta. Here are some results using the filter "prefix:Mechanical"

As you can see, the Cirrus results do not have the search term in the title of many results at all, let alone as a prefix.

Reply 17:53, 8 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

Seems to be following redirects (ie. "Mechanical and Aeronautical Engineering" > "Engineering").

Reply 18:18, 8 May 2014 10 years ago

Spinningspark (talkcontribs)

Well that's still a problem. First of all, it is not transparent that redirects are being followed, I am being presented with the top result of "Engineering" which patently does not match my search specification and no indication of why it was included. Secondly, while I might possibly (or possibly not) have wanted to know about redirects, I probably don't want them at the top of the list. If redirects are going to be included, the name of the redirect should be in the results, not its target, it should be marked as a redirect, and it would be rather useful if it could be optionally suppressed.

The whole reason a user would use the prefix filter is they want exactly those pages that match. Not pages that are associated in some way.

Reply Edited 21:35, 8 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

Filed at bugzilla:65232 and I'll look at it now.

Reply 18:19, 12 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

And I've submitted a patch to fix it (https://gerrit.wikimedia.org/r/#/c/132973/). If all goes well we'll merge the patch in a few hours and this'll be fixed on enwiki a week from Thursday. We can get it faster if it is really killing you.

Reply 18:54, 12 May 2014 10 years ago

Spinningspark (talkcontribs)

It's not urgent as far as I'm concerned. While it's still a beta function it can be turned off. You might want to look at bug 65237 which I just raised before you patch anything though.

Reply 19:27, 12 May 2014 10 years ago

Reply to "Filters not working"

wikidata multi-lingual search.

2 comments • 17:22, 8 May 2014 10 years ago

2

Jaredzimmerman (WMF) (talkcontribs)

Are we using wikidata translations for cirrussearch? to sting matching to show wikidata items or even just to have multi-lingual searching working

Reply 16:49, 8 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

We haven't done any special integration with wikidata beyond being their primary search backend. We don't have a proper strategy for multilingual wikis with Cirrus either. The plan is to get there after we've got cirrus everywhere. At least we'll be better off then we were. I'll admit that it isn't a great argument but it is where we are.

Reply 17:22, 8 May 2014 10 years ago

Reply to "wikidata multi-lingual search."

Double quotes no longer result in phrase search.

8 comments • 05:50, 8 May 2014 10 years ago

8

Junkyardsparkle (talkcontribs)

Not sure when this happened, but just noticing that "find this phrase" now returns results with "find", "this", and "phrase" anywhere instead of the expected behavior. I'm assuming this isn't intentional...

Reply 02:02, 27 April 2014 10 years ago

Junkyardsparkle (talkcontribs)

~~Doesn't seem to be happening anymore.~~ Ok, it doesn't seem to happen in a consistent way, which is driving me nuts, so I'm switching back to old search for now. Example case: "pacific electric" search on commons is sneaking in results with "pacific gas & electric", etc... probably not a big deal for some purposes, but for rounding up files to batch process, not so great. :/

Reply Edited 01:35, 4 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

I don't _believe_ I've changed the behavior of cirrus with regards to double quotes on commons. I'm going to be swapping out the component that does the highlighting with another one that is faster. It is currently deployed on test, test2, wikidatatest, and mediawiki.org. It doesn't support limiting the results to matching phrases at the moment but I'm fixing it. What I believe your seeing is that Cirrus's default phrase slop is 1 rather then 0. In other words, one word in between is OK. I'll switch it to 0 this week. You can actually control the slop by putting ~0 after the phrase. So "pacific electric"~0 will get you what you want. It just isn't intuitive and I'll fix that.

Reply 15:35, 5 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

Yes, I'm pretty sure now that I just wasn't tripping over the slop feature enough to notice it at first. Thanks very much for the ~0 workaround, that's useful information to have. If it's documented somewhere, I missed it, but if it's made clear in the basic search syntax guide, then the slop may not be such a terrible default setting. I don't really know what the "normal" user expectation about this behavior is. :)

EDIT: I found this, but it states that "closer to 1 is less fuzzy"... ~~which is backwards, isn't it?~~ Ok, it works differently for phrases and single terms. Should have read a little bit more, sorry. I think I'm gonna need a cheat sheet.

Reply Edited 02:58, 6 May 2014 10 years ago

NEverett (WMF) (talkcontribs)

I updated it to make it more clear, I hope. If you have ideas for what cheat sheet should look like please start one and I'll work on it!

Reply Edited 12:41, 6 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

Well, I just meant something like this... of course, having made it, I feel like I won't need it, but I'm always wrong when I assume that. ;)

Reply 23:30, 6 May 2014 10 years ago

Nemo bis (talkcontribs)

If you move it to Help namespace here on mediawiki.org we can later also translate it.

Reply 05:51, 7 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

If you think it's worth translating, you're more than welcome to copy it to any appropriate place, but please look it over first to make sure it actually makes sense to someone other than me...

Reply 05:50, 8 May 2014 10 years ago

Reply to "Double quotes no longer result in phrase search."

search by mimetype/filetype?

5 comments • 22:18, 4 May 2014 10 years ago

5

LuisVilla (talkcontribs)

I suspect this isn't quite the right place to ask it; feel free to point me at the right component in Bugzilla if not. But here goes: it should be possible to do something equivalent to Google's filetype: operator, at least on Commons. There are times when you want video; times when you want audio; etc., etc. So it'd be really nice to provide that :)

Reply 23:37, 3 May 2014 10 years ago

Junkyardsparkle (talkcontribs)

I've found that using something like "intitle:ogg" works reasonably well, since the number of allowed file types/extensions on commons is fairly limited...

Reply 23:55, 3 May 2014 10 years ago

LuisVilla (talkcontribs)

That's helpful, thanks! I still think a proper/accurate solution would be really helpful if we want people to treat us as a serious source of non-photo materials.