Topic on Extension talk:CirrusSearch

CirrusSearch Only Partially Indexing

199.16.64.3

I posted this on the discussion for Help:CirrusSearch but am doing it here as well to see if I might find a solution.

I have a wiki running on a dev server with the following:

MediaWiki 1.27.4

PHP 5.6.25 (apache2handler)

MariaDB 5.5.56-MariaDB

Elasticsearch 1.7.6

I recently installed CirrusSearch, and it works as expected except for one issue: search results return only a fraction of the pages. For example, there are about 200 pages (yeah, it's not big) in the main namespace, but only 20 are returned. Likewise, there are about 1800 images, but only 160 are returned. I increased the memory for Elasticsearch, but that had no discernible effect. Elastica is up and running. Null edits force the changes through, but I'd rather not do this 1K+ times.

Any ideas or suggestions as to how to fix this? Thanks in advance.

DCausse (WMF)

I think the first step would be to determine whether the problem occurs at index time or at search time.

Could you tell us whether the output of the forceSearchIndex.php maintenance script looks sane compared to the number of docs you have? (It outputs: Indexed a total of XYZ pages at Y/s.)

To troubleshoot the issue I'd suggest that you paste the output of these commands:

  • To find out how many docs have been indexed, you can ask Elasticsearch with: curl localhost:9200/wiki_name/_count?pretty
  • The list of indices in Elasticsearch might also help to troubleshoot the issue: curl localhost:9200/_cat/indices
  • An example search query sent by Cirrus to Elasticsearch: you can obtain it by appending &cirrusDumpQuery to the URL of the search results page.
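
The first two checks can be wrapped in a small shell sketch. It assumes Elasticsearch listens on localhost:9200 and that wiki_name is replaced by the index name prefix that _cat/indices reports for the wiki. The ES_HOST variable and the function names es_count_url and es_check are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: ES_HOST and both function names are assumptions.
ES_HOST="${ES_HOST:-localhost:9200}"

# Build the _count URL for one index or alias.
es_count_url() {
    printf '%s/%s/_count?pretty\n' "$ES_HOST" "$1"
}

# Print the document count for one index, then the list of all indices.
es_check() {
    curl -s "$(es_count_url "$1")"
    curl -s "$ES_HOST/_cat/indices"
}

# Usage: es_check wiki_name
```

Nothing is sent to Elasticsearch until es_check is called, so the file can be sourced safely first.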

Thanks!

199.16.64.3

Thanks for the response! I'll get on this soon and respond in the next day or so.

104.162.109.170

Ok this is what I got running the commands.

After running forceSearchIndex.php --skipLinks --indexOnSkip:

Skipping page with no content: 896
[wikidatabase] Indexed 9 pages ending at 900 at 18/second

After running forceSearchIndex.php --skipParse:

Indexed a total of 3716 pages at 197/second

After running curl localhost:9200/wiki_name/_count?pretty:

{
  "error" : "IndexMissingException[[mediawiki] missing]",
  "status" : 404
}

But when running the command as curl localhost:9200/_count?pretty:

{
  "count" : 938,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  }
}

When running the command curl localhost:9200/_cat/indices:

green open mediawiki_cirrussearch_frozen_indexes 1 0   0   0   144b   144b
green open mw_cirrus_versions                    1 0   2   2  3.5kb  3.5kb
green open wikidatabase_general_first            4 0 800 574  8.1mb  8.1mb
green open wikidatabase_content_first            4 0 136  18 41.7mb 41.7mb

And this is the object returned when appending &cirrusDumpQuery to search for example term "rock":

{"description":"full_text search for 'rock'","path":"wikidatabase\/page\/_search","params":{"search_type":"dfs_query_then_fetch","timeout":"20s"},"query":{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"filtered":{"query":{"bool":{"minimum_number_should_match":1,"should":[{"query_string":{"query":"rock","fields":["all.plain^1","all^0.5"],"auto_generate_phrase_queries":true,"phrase_slop":0,"default_operator":"AND","allow_leading_wildcard":true,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}},{"multi_match":{"fields":["all_near_match^2"],"query":"rock"}}]}},"filter":{"terms":{"namespace":[0,102,108]}}}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"number_of_fragments":0,"type":"fvh","order":"score","matched_fields":["title","title.plain"]},"redirect.title":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["redirect.title","redirect.title.plain"]},"category":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["category","category.plain"]},"heading":{"number_of_fragments":1,"fragment_size":10000,"type":"fvh","order":"score","matched_fields":["heading","heading.plain"]},"text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"number_of_fragments":1,"fragment_size":150,"type":"fvh","order":"score","matched_fields":["auxiliary_text","auxiliary_text.plain"]}},"highlight_query":{"query_string":{"query":"rock","fields":["title.plain^20","redirect.title.plain^15","category.plain^8","heading.plain^5","opening_text.plain^3","text.plain^1","auxiliary_text.plain^0.5","title^10","redirect.title^7.5","category^4","heading^2.5","opening_text^1.5","text^0.5","auxiliary_text^0.25"],"auto_generate_phrase_queries":true,"phrase_slop":1,"default_operator":"AND","allow_leading_wildcard":true,"fuzzy_prefix_length":2,"rewrite":"top_terms_boost_1024"}}},"suggest":{"text":"rock","suggest":{"phrase":{"field":"suggest","size":1,"max_errors":2,"confidence":2,"real_word_error_likelihood":0.95,"direct_generator":[{"field":"suggest","suggest_mode":"always","max_term_freq":0.5,"min_doc_freq":0,"prefix_length":2}],"highlight":{"pre_tag":"<em>","post_tag":"<\/em>"},"smoothing":{"stupid_backoff":{"discount":0.4}}}}},"stats":["suggest","full_text"],"size":20,"rescore":[{"window_size":8192,"query":{"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply","rescore_query":{"function_score":{"functions":[{"field_value_factor":{"field":"incoming_links","modifier":"log2p","missing":0}},{"weight":"0.2","filter":{"terms":{"namespace":[102,108]}}}]}}}}]},"options":{"search_type":"dfs_query_then_fetch","timeout":"20s"}}

Notice: Uncommitted DB writes (transaction from DatabaseBase::query (User::loadFromDatabase)). in /opt/rh/httpd24/root/var/www/html/mediawiki/includes/db/Database.php on line 3306

Thanks!

DCausse (WMF)

I don't see anything obviously wrong in the outputs you've pasted.

You mentioned that your wiki has 200 pages and about 1800 images, but _count reports only 938 docs indexed in total (including some non-page data such as namespace names and other metadata).
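
As a cross-check, the per-index document counts in the _cat/indices listing can be summed and compared with _count. Below is a minimal sketch, assuming the Elasticsearch 1.x column layout seen in the pasted listing, where the document count is the sixth field. The helper name sum_docs is hypothetical.

```shell
#!/bin/sh
# Sum the document-count column (6th field) of `_cat/indices` output read from stdin.
# Lines with fewer than six fields (e.g. blank lines) are ignored.
sum_docs() {
    awk 'NF >= 6 { total += $6 } END { print total + 0 }'
}

# Usage: curl -s localhost:9200/_cat/indices | sum_docs
```

For the four indices pasted earlier, this gives 0 + 2 + 800 + 136 = 938, the same figure that _count reported.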

I would suggest finding a page or image that does not show up in search and narrowing the investigation down to it, to understand why it is not indexed. To do this, pick a random image or page and search for a few words from its title. If you cannot find it using Special:Search (be careful to select the proper namespaces), then you have found a problematic page.

Then identify its page ID by appending the ?action=info URI parameter to the page URL.

Using this page id try to run:

forceSearchIndex.php --fromId ID --toId ID+1

to see if the maint script is able to repopulate this particular page.
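
The ID arithmetic can be scripted. This is a minimal sketch that only builds the --fromId/--toId arguments following the pattern above. The helper name single_page_args is hypothetical, the script path in the usage comment is an assumption about the install layout, and 896 is just an example ID.

```shell
#!/bin/sh
# Build the ID range arguments for reindexing a single page,
# following the "--fromId ID --toId ID+1" pattern.
single_page_args() {
    printf '%s %d %s %d\n' --fromId "$1" --toId "$(( $1 + 1 ))"
}

# Usage (path is an assumption; adjust to your installation):
#   php extensions/CirrusSearch/maintenance/forceSearchIndex.php $(single_page_args 896)
```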

You may also want to run the sanitizer, which tries to identify and fix inconsistencies in the index:

saneitize.php

So in the end it's unclear to me what is causing this behavior. I don't see any errors except the Notice: Uncommitted DB writes that you pasted at the end of the message. Do you remember which command generated this notice?

Good luck!
