Bugs Fix Plan for Search

Bugzilla Links

bugs
patches

32655 Improving search for templates

Bug Id & Name	Testing	Classification	Comments
32655 Improving search for templates	search for {{Authority control}}, {{Authority	indexing,ui,ranking	see options below

Specifying This Behaviour

use case 1: readers who do not want to see templates in their search results.
use case 2: editors who want to find a template to use (knowing its name).
use case 3: editors who want to find a suitable template in a category.
use case 4: template developers who want to find all the pages where a template is used.
use case 5: template developers who want to find all templates that use a template.
use case 6: template developers who want to find all templates in a template category.
use case 7: admins who want to find all pages using a template.
use case 8: admins who want to find all pages using a template with certain parameter values.
use case 9: admins who want to find all pages using non-existent templates.
use case 10: users who want to find all pages containing arbitrary code.

Open Questions

are there more use cases?
how common are these situations?
what is the current practice for the above use cases?
1. use case 2: Special:WhatLinksHere.
2. use case 3: look at the templates category.
should the search results differentiate between templates that exist and templates that don't?
what about transclusion from outside the template namespace?
- when templates do not contain template syntax, should they be shown?
- when a template is not in the template namespace (say, in a user's namespace), how can we know it is a template?

Analysis

Here are some possible approaches to implementing this feature.

Option 1: Quick and Dirty
1. storing the page's raw, unexpanded source in a field named source.
2. querying with a litralStringQuery and litralStringPrefixQuery.
3. it will double the index size, adding one WFTU^[1] per wiki.
4. it requires no UI change - just extra syntax + documentation.
  1. source:text → to search for wiki source text
  2. source:"text" → to search for exact wiki source text
  3. source:text* → to search for wiki source text starting with a prefix
  4. source:{{text}} → to search for a template call in the wiki source
  5. source:{{text*}} → to search for template calls whose name starts with a prefix
5. it may require its own ranking.
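
The extra syntax in item 4 can be illustrated with a small classifier. This is only a sketch under assumptions: the class name `SourceQuerySyntax`, the `Kind` values, and the reading of each form (term, exact, prefix, template, template prefix) are hypothetical and not part of lsearch.

```java
/** Hypothetical classifier for the proposed "source:" search syntax. */
final class SourceQuerySyntax {

    /** The five query forms listed above (names are assumptions). */
    enum Kind { TERM, EXACT, PREFIX, TEMPLATE, TEMPLATE_PREFIX }

    /** A classified query: its kind plus the text with the syntax markers stripped. */
    static final class Parsed {
        final Kind kind;
        final String text;

        Parsed(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    static Parsed parse(String query) {
        final String prefix = "source:";
        if (!query.startsWith(prefix)) {
            throw new IllegalArgumentException("not a source: query: " + query);
        }
        String body = query.substring(prefix.length());

        // source:"text" -> exact match on the raw wiki source
        if (body.length() >= 2 && body.startsWith("\"") && body.endsWith("\"")) {
            return new Parsed(Kind.EXACT, body.substring(1, body.length() - 1));
        }
        // source:{{text}} or source:{{text*}} -> template call, optionally by prefix
        if (body.length() >= 4 && body.startsWith("{{") && body.endsWith("}}")) {
            String name = body.substring(2, body.length() - 2);
            if (name.endsWith("*")) {
                return new Parsed(Kind.TEMPLATE_PREFIX, name.substring(0, name.length() - 1));
            }
            return new Parsed(Kind.TEMPLATE, name);
        }
        // source:text* -> prefix match
        if (body.endsWith("*")) {
            return new Parsed(Kind.PREFIX, body.substring(0, body.length() - 1));
        }
        // source:text -> plain term
        return new Parsed(Kind.TERM, body);
    }
}
```

For example, `SourceQuerySyntax.parse("source:{{Authority*}}")` yields kind `TEMPLATE_PREFIX` with text `Authority`. Each kind would then be mapped to whichever query class the chosen option provides.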
Option 2: Elegant
1. indexing and storing the page's parsed source in a parsedSourceTreeField
2. and querying with a sourceSearchQuery to search the source
3. it would increase the index by a factor of a WFTU.
4. it could require a UI change.
5. it could require its own ranking.
Option 3: Efficient
1. indexing the page's parsed source in a flat parsedSourceField
2. querying using a sourceSearchQuery which would provide markup search capability.
3. it would increase the index by about log(WFTU) (this is a guess).
4. it could require UI change
5. it could require its own ranking.

Option 1 will likely be inefficient. To index wiki code effectively, a (Java) parser for wiki code would be required. The parser must be able to process and tag:

templates
template parameters
magic words
parser functions
extensions
comments
nowiki
includeonly
noinclude

1. I have been doing some work on writing a preprocessor, but the work is far from over; it could be completed to do this task.
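
To give a feel for the tagging step, here is a rough regex-based sketch that pulls template names out of wikitext while skipping comments, nowiki sections, template parameters, and parser functions. It is not the preprocessor mentioned above and is no substitute for a real parser: `TemplateNameExtractor` is a hypothetical name, magic words such as {{PAGENAME}} would be reported as templates, and includeonly/noinclude and extension tags are not handled.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, simplified template-name extractor (not a full wikitext parser). */
final class TemplateNameExtractor {

    // HTML comments and <nowiki> sections are dropped before scanning.
    private static final Pattern IGNORED =
            Pattern.compile("(?s)<!--.*?-->|<nowiki>.*?</nowiki>");

    // Exactly two opening braces (so {{{param}}} is skipped), then the name
    // up to the first "|" or the closing "}}".
    private static final Pattern TEMPLATE_OPEN =
            Pattern.compile("(?<!\\{)\\{\\{(?!\\{)\\s*([^{}|]+?)\\s*(?:\\||\\}\\})");

    /** Returns template names in order of first appearance, without duplicates. */
    static Set<String> extract(String wikitext) {
        String visible = IGNORED.matcher(wikitext).replaceAll("");
        Set<String> names = new LinkedHashSet<>();
        Matcher m = TEMPLATE_OPEN.matcher(visible);
        while (m.find()) {
            String name = m.group(1);
            // Names starting with "#" are parser functions such as {{#if:...}}.
            if (!name.startsWith("#")) {
                names.add(name);
            }
        }
        return names;
    }
}
```

The extracted names could feed a flat field as in option 3, so that "all pages using template X" becomes an ordinary term lookup.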

Ranking & User Interface

it is possible to avoid a UI change by adding a new search syntax.
if the source search feature functions as a stand-alone application, its ranking will need just a little tweaking.
if it is necessary to integrate it with general search, it will require a more significant effort involving:
- specification.
- design.
- implementation.

23629 incorrect UTF-8 processing on output of page and section titles

Bug Id & Name	Testing	Classification	Comments
23629 incorrect UTF-8 processing on output of page and section titles	search for א	Render Results

Specifying This Behaviour

highlighted text in search results is sometimes corrupt when showing multibyte characters

Open Questions

where is this behaviour taking place?
- (analyzer) during indexing
- (analyzer) during retrieval
- (highlighter) during result rendering
- later in php

Analysis

investigate by unit testing
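
A unit test for any of the stages above needs a way to produce and detect corrupt output. The sketch below shows two ways a multibyte character such as א (two bytes in UTF-8) can be damaged: cutting a snippet at a byte offset that falls inside a character, and decoding UTF-8 bytes with the wrong charset. These are only candidate causes, not a diagnosis of the bug, and `Utf8SnippetCheck` is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helpers for unit tests around multibyte snippet handling. */
final class Utf8SnippetCheck {

    /** Cuts a snippet by UTF-8 byte offsets; only safe on character boundaries. */
    static String cutByBytes(String text, int fromByte, int toByte) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] slice = Arrays.copyOfRange(utf8, fromByte, toByte);
        // Malformed sequences are decoded as U+FFFD (the replacement character).
        return new String(slice, StandardCharsets.UTF_8);
    }

    /** True if decoding produced at least one replacement character. */
    static boolean isCorrupt(String s) {
        return s.indexOf('\uFFFD') >= 0;
    }

    /** Simulates a stage that wrongly treats UTF-8 bytes as ISO-8859-1. */
    static String decodeAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }
}
```

A test can then feed the same input through the analyzer, the highlighter, and the output path in turn and assert `!isCorrupt(...)` after each stage to narrow down where the damage happens.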

Bug 20173 - Lucene Search update script fails while downloading DTD

Bug Id & Name	Testing	Classification	Comments
Bug 20173 - Lucene Search update script fails while downloading DTD	search for א	Render Results

Specifying This Behaviour

when running the update script the DTD download fails with "Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

This is the explanation given by w3.org for the 503 response code:

10.5.4 503 Service Unavailable

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

Note: The existence of the 503 status code does not imply that a server must use it when becoming overloaded. Some servers may wish to simply refuse the connection.
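
If the update script were to retry instead of failing outright, the wait time could follow the passage quoted above. A minimal sketch, assuming a retry loop exists around the download: `RetryDelay`, `delaySeconds`, and the cap parameter are hypothetical names, and only the delta-seconds form of Retry-After is handled (an HTTP-date value falls back to exponential backoff).

```java
/** Hypothetical helper: how long to wait before retrying after a 503. */
final class RetryDelay {

    /**
     * @param retryAfter value of the Retry-After header, or null if absent
     * @param attempt    zero-based retry attempt, used for the fallback backoff
     * @param maxSeconds upper bound on the returned delay
     */
    static long delaySeconds(String retryAfter, int attempt, long maxSeconds) {
        if (retryAfter != null) {
            try {
                long seconds = Long.parseLong(retryAfter.trim());
                if (seconds >= 0) {
                    return Math.min(seconds, maxSeconds);
                }
            } catch (NumberFormatException e) {
                // Retry-After may also be an HTTP-date; not handled in this sketch.
            }
        }
        // Fallback: exponential backoff of 1, 2, 4, ... seconds, capped.
        int shift = Math.max(0, Math.min(attempt, 30));
        return Math.min(1L << shift, maxSeconds);
    }
}
```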

Open Questions

how to reproduce the error?

Analysis

looking at the stack trace, the error occurs at:

org.wikimedia.lsearch.oai.OAIParser.parse(OAIParser.java:64) called by
- org.wikimedia.lsearch.oai.OAIHarvester.read(OAIHarvester.java:64) called by
  - org.wikimedia.lsearch.oai.IncrementalUpdater line:191
workarounds
1. use commons-httpclient instead of HttpURLConnection (open question: how to tell Xerces to use it)
2. try to clear the proxy

  // set the proxy only for the duration of the request
  System.setProperty("http.proxyHost", proxyHost);
  System.setProperty("http.proxyPort", proxyPort);
  // ... some code ...
  // clear the proxy settings afterwards
  System.clearProperty("http.proxyHost");
  System.clearProperty("http.proxyPort");
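
Another approach, not one of the workarounds listed above, is to stop the XML parser from fetching the DTD at all by installing an `EntityResolver` that answers every external entity request with an empty stream. The sketch below uses the standard JAXP and SAX APIs; `OfflineDtdParse` is a hypothetical class, and whether OAIParser can be configured this way has not been checked. One known limitation: documents that rely on entities declared in the DTD (such as `&nbsp;`) would no longer parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical example: parse XML without downloading its external DTD. */
final class OfflineDtdParse {

    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Answer every external entity (including the DTD) with empty
            // content, so no HTTP request to www.w3.org is made.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new RuntimeException("XML parsing failed", e);
        }
    }
}
```

A more complete variant would return a locally cached copy of xhtml1-transitional.dtd for that system id instead of an empty stream, which keeps the entity declarations available.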

testing

Extension:DumpHTML extension

multithreading

http://phplens.com/phpeverywhere/?q=node/view/254

missing pages

debugging page id of a missing main page

-- namespace 4 is the Project namespace
SELECT page_id
  FROM `page`
 WHERE page_namespace = 4
   AND page_title = 'Main_Page'
 LIMIT 0, 30

debugging page id of a missing category page

-- namespace 14 is the Category namespace
SELECT page_id
  FROM `page`
 WHERE page_namespace = 14
   AND page_title = 'Latin_nouns'
 LIMIT 0, 30

SQL schema

https://secure.wikimedia.org/wikipedia/mediawiki/wiki/File:MediaWiki_database_schema_1-17_%28r82044%29.png

References

[1] WFTU is a wiki full text unit = the size of all the text in a wiki.


User:OrenBochman/Bugs
