User:OrenBochman/Search/Porting

Search 2 Code Review


Why the long review


I had to wait months for access to a search environment, which is still offline. Since I could not run tests, I have written my comments down in this review instead of trying to resolve them, which has not been possible.

Location of source code


Package org.apache.lucene.search/ArticleInfo.java

  • a (limited) interface for article metadata
    • isSubpage() - whether the article is a subpage
    • daysOld() - the article's age in the index
    • namespace() - the article's namespace
  • interface implementation is in org.wikimedia.lsearch.search/ArticleInfoImpl.java

Note: the only implementation merely wraps methods of ArticleMetaSource, so it could be refactored away.

org.apache.lucene.search/ArticleNamespaceScaling.java

  • boosts an article's score based on its namespace.
  • is used in:
    • ArticleQueryWrap.customExplain(),
    • ArticleQueryWrap.customScore()
    • SearchEngine.PrefixMatch()
  • tested in
    • testComplex
    • testDefalut

org.apache.lucene.search/ConstMinScore

  • provides a boost query with a minimum score
  • used by
    • CustomScorer()

org.apache.lucene.search/CustomBoostQuery

  • a Query that sets the document score as a programmatic function of (up to) two (sub)scores.

Package org.apache.lucene.analysis


Package org.apache.lucene.search


Filters

Type | Purpose | Tests | Lucene version | Comments
AcronymFilter | N.S.A. → NSA | none | 4.0.x |
CJKFilter | C1C2C3C4 → C1C2 C2C3 C3C4 | none | 2.4.x | alternative[1]
EnglishSingularFilter | adds English singular forms of words as aliases of type "singular" | none | 2.4.x |
EsperantoStemFilter | stemming of Esperanto | none | 2.4.x |
HyphenFilter | CAT-\nFISH → CATFISH | none | 2.4.x |
LimitTokenFilter | returns tokens whose positional increment is less than some limit | none | 2.4.x |
PhraseFilter | adds word bigrams to the token stream; stop words can force trigrams | none | 2.4.x | implementation[2], obsolescence[3]
RussianStemFilter | wrapper for Lucene's implementation of the Russian stem filter | none | 2.4.x |
SerbianFilter | filter for Serbian dual-script text | none | 2.4.x |
VietnameseFilter | transliterates Vietnamese to standard ASCII | none | 2.4.x |

Some missing filters:

  • WikiArticleFilter - indexes all Wikipedia article titles and redirects referenced in article text, using a DB dump of the title tables.
  • NamedEntity filter - uses a database to index named entities mined from various wikis (Semantic MediaWiki could help here).
  • WikiParserAnalyser - uses the latest parser implementation to parse and index wiki source.

Analysers

Type | Purpose | Tests | Lucene version
AcronymFilter | N.A.S.A. -> N.A.S.A., NASA | I've added some | 4.0.x

Others

  • WikiQueryParser
  • Aggregate - a bean that captures information about one item going into an index aggregate field.

Problems in Aggregate (replacements sketched below):

  1. Token.termText() - undefined
  2. Analyzer.tokenStream(String,String) - undefined
  3. TokenStream.next() - undefined
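
The replacements for these calls on the 2.9.x API look roughly like the sketch below. This is not the actual Aggregate code: the class name AggregateTokenLoop and the field/text parameters are placeholders standing in for whatever Aggregate really passes around.

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AggregateTokenLoop {
    /** Collects terms the way Aggregate's old Token/next() loop did. */
    public static void collectTerms(Analyzer analyzer, String field, String text)
            throws IOException {
        // Analyzer.tokenStream(String, String) no longer exists; wrap the text in a Reader.
        TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
        // TokenStream.next() and Token.termText() are gone; read terms through attributes.
        TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            String term = termAtt.term(); // replaces Token.termText()
            System.out.println(term);
        }
        stream.close();
    }
}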

Porting Filters


Filters should be ported from the Lucene 2.4.x API to the 2.9.x API. This involves:

  1. writing unit tests for the old filter and checking that it still works with new input (a minimal test sketch is given after the code below).
  2. Token next() and Token next(Token) have been deprecated.
    1. incrementToken() needs to be called on the input token stream, not on the filter itself (which would cause a stack overflow).
    2. to process the token, add the following fields to the filter:
private TermAttribute termAtt;  // gives access to the token's term buffer (becomes CharTermAttribute in 3.1+/4.x)
private TypeAttribute typeAtt;  // gives access to the token's type

which should be initialized in the constructor via:

public MyFilter(TokenStream input)  // the filter's constructor
{
   super(input);
   termAtt = (TermAttribute) addAttribute(TermAttribute.class);
   typeAtt = (TypeAttribute) addAttribute(TypeAttribute.class);
}
  3. public boolean incrementToken() is now required.
    1. it moves the token stream one step forward.
    2. it returns true if there are more tokens, false otherwise.
public boolean incrementToken() throws IOException
{
   if (!input.incrementToken())
     return false;

   // process the current token via termAtt.term() / termAtt.termBuffer(),
   // then update the buffers: either replace the term outright with
   // termAtt.setTermBuffer(modifiedToken), or rewrite the buffer in place
   // and set its new length:
   termAtt.setTermLength(this.parseBuffer(termAtt.termBuffer(), termAtt.termLength()));
   typeAtt.setType(TOKEN_TYPE_NAME);
   return true;
}
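
For step 1 above, a minimal JUnit 3 test (matching the test framework already used in the project) could look like the following sketch. The AcronymFilter constructor taking a single TokenStream, and its package (so the corresponding import), are assumptions; adjust them to the real class.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AcronymFilterTest extends TestCase {

    /** Runs text through the filter chain and collects the emitted terms. */
    private List<String> analyze(String text) throws IOException {
        // The AcronymFilter constructor signature is assumed here.
        TokenStream stream = new AcronymFilter(new WhitespaceTokenizer(new StringReader(text)));
        TermAttribute termAtt = (TermAttribute) stream.addAttribute(TermAttribute.class);
        List<String> terms = new ArrayList<String>();
        while (stream.incrementToken()) {
            terms.add(termAtt.term());
        }
        stream.close();
        return terms;
    }

    public void testAcronymIsCollapsed() throws IOException {
        // Expected behaviour taken from the filter table above: N.S.A. -> NSA
        assertTrue(analyze("N.S.A. files").contains("NSA"));
    }
}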


  • Porting SOLR Token Filter from Lucene 2.4.x to Lucene 2.9.x [4]
  • Porting to Lucene 4.0.x [5]

Package org.apache.lucene.search


Package org.apache.xmlrpc.webserver

  • no issues

Package org.wikimedia.lsearch.analyzers


Package org.wikimedia.lsearch.beans

  • no issues

Package org.wikimedia.lsearch.benchmark

  • no issues

Package org.wikimedia.lsearch.config

  • no issues

Package org.wikimedia.lsearch.frontend

  • no issues

Package org.wikimedia.lsearch.highlight

  • review
  • migrate to FastVectorHighlighter

Package org.wikimedia.lsearch.importer


org.wikimedia.lsearch.importer.SimpleIndexWriter


Breaking API Changes

IndexWriter writer = new IndexWriter(String path,null,boolean newIndex)

This constructor has been deprecated and removed from the API. Indexing has advanced considerably since 2.3, and these changes should be integrated into the indexer code (see the sketch below):

  • it is possible to use an index while it is being updated.
  • it is possible to update documents in the index.
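
A sketch of what this could look like, assuming Lucene 3.x (where IndexWriterConfig is available, as used in the porting notes below). The index path, the Version constant, and the "key" field are placeholders, not the project's actual schema.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexerSketch {
    public static void main(String[] args) throws IOException {
        Directory dir = FSDirectory.open(new File("path/to/index"));  // placeholder path
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);  // stand-in for the Wiki analyzer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_36, analyzer);
        IndexWriter writer = new IndexWriter(dir, conf);

        Document doc = new Document();
        doc.add(new Field("key", "Main_Page", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("contents", "article text", Field.Store.NO, Field.Index.ANALYZED));

        // Updating a document in place: deletes any document with the same key term
        // and adds the new version in one call.
        writer.updateDocument(new Term("key", "Main_Page"), doc);

        // Searching while the index is being updated: a near-real-time reader
        // sees the writer's buffered changes without a full close/reopen cycle.
        IndexReader reader = IndexReader.open(writer, true);
        reader.close();
        writer.close();
    }
}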

Non-Breaking API Changes

writer.setSimilarity(new WikiSimilarity())
IndexWriterConfig.setMaxBufferedDocs(int)
writer.setMergeFactor(mergeFactor)
writer.setUseCompoundFile(true);
writer.setMaxBufferedDocs(maxBufDocs)
writer.setMaxFieldLength(WikiIndexModifier.MAX_FIELD_LENGTH);

have been deprecated from the API and should be replaced with:

IndexWriterConfig.setSimilarity(Similarity)
IndexWriterConfig.setMaxBufferedDocs(int)
LogMergePolicy.setMergeFactor(int)
LogMergePolicy.setUseCompoundFile(boolean)
LimitTokenCountAnalyzer

Porting notes: IndexWriterConfig needs to be created and then passed to the constructor.

IndexWriterConfig conf = new IndexWriterConfig(matchVersion, analyzer);  // matchVersion is e.g. Version.LUCENE_36
conf.setSimilarity(new WikiSimilarity()).setMaxBufferedDocs(maxBufDocs); // the setters allow chaining
IndexWriter writer = new IndexWriter(dir, conf);  // constructs a new IndexWriter with the settings given in conf

All but the last are fairly trivial changes. However, using LimitTokenCountAnalyzer instead of writer.setMaxFieldLength() means that another step needs to be added to the analysis chain. Also, since some of the analysis uses multivalued fields (where?), LimitTokenCountAnalyzer will behave differently from before: the analyzer limits the number of tokens per token stream it creates, while the old setting limited the total number of tokens to index, so more tokens would get indexed. A wrapping sketch follows.
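
A sketch of the extra analysis step, assuming Lucene 3.1 or later, where LimitTokenCountAnalyzer lives in org.apache.lucene.analysis (it moves to the analyzers module in 4.x). StandardAnalyzer stands in for the project's own analyzer.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LimitTokenCountAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class LimitedAnalyzerSketch {
    /** Caps the number of tokens per token stream instead of calling writer.setMaxFieldLength(). */
    static IndexWriterConfig buildConfig(int maxTokensPerField) {
        Analyzer base = new StandardAnalyzer(Version.LUCENE_36); // stand-in for the Wiki analyzer
        Analyzer limited = new LimitTokenCountAnalyzer(base, maxTokensPerField);
        return new IndexWriterConfig(Version.LUCENE_36, limited);
    }
}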

Package org.wikimedia.lsearch.index

  • Indexer and indexing-related classes.

org.wikimedia.lsearch.index.WikiSimilarity

public float lengthNorm(String fieldName, int numTokens)

has been deprecated and needs to be replaced by

public float computeNorm(String field, FieldInvertState state)
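
A sketch of the shape of this change, assuming Lucene 3.x. Extending DefaultSimilarity here is only to keep the sketch compilable; the per-field logic of the existing WikiSimilarity.lengthNorm() would go into the helper method.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

public class WikiSimilaritySketch extends DefaultSimilarity {

    @Override
    public float computeNorm(String field, FieldInvertState state) {
        // state.getLength() is the token count that lengthNorm() used to receive directly.
        float norm = lengthNormFor(field, state.getLength());
        // The default implementation returns state.getBoost() * lengthNorm(...), so the
        // accumulated field boost has to be applied here as well.
        return state.getBoost() * norm;
    }

    // Move the body of the old lengthNorm(String, int) here unchanged.
    private float lengthNormFor(String fieldName, int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens)); // placeholder: DefaultSimilarity-like behaviour
    }
}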

Package org.wikimedia.lsearch.index

  • three files don't compile
  • review

Package org.wikimedia.lsearch.interoperability

  • no issues

Package org.wikimedia.lsearch.oai

  • no open issues - may be obsolete
  • review later

Package org.wikimedia.lsearch.prefix

  • one file needs review
  • generates title prefixes for search. Is this still needed if the title field is indexed with term position vectors?

Package org.wikimedia.lsearch.ranks

  • another indexer, which counts links for page-rank purposes
  • the analyzer will need to be ported along the lines of the main analyzer in the importer package.

Package org.wikimedia.lsearch.related

  • no open issues

Package org.wikimedia.lsearch.search

  • many issues
  • needs serious review.
  • lots of filters and filter wrappers - may be generally useful code.
  • has distributed-search-related classes, which should be made obsolete by moving to Solr.

Package org.wikimedia.lsearch.spell

  • implements the "did you mean" feature
  • two broken files

Package org.wikimedia.lsearch.spell.api

  • more of the "did you mean" feature
  • 3 indexers are broken.

Package org.wikimedia.lsearch.spell.dist


no issues

Package org.wikimedia.lsearch.statistics


no issues

Package org.wikimedia.lsearch.storage

  • Storage of data, mainly in a database, e.g. page ranks, text for highlighting, etc.
  • three files are broken
  • the problems are the indexing ENUM and the change from String path to FSDirectory dir (see the sketch below)
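
A sketch of the String-path to Directory change, assuming Lucene 2.9/3.x; the path and the read-only flag are placeholders.

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class StoragePathSketch {
    static IndexReader openReader(String path) throws IOException {
        Directory dir = FSDirectory.open(new File(path)); // replaces APIs that took a String path
        return IndexReader.open(dir, true);               // read-only reader
    }
}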

Package org.wikimedia.lsearch.util


Search 2 Code Review


General

  • uses JUnit 3
  • test methods are poorly written and badly named
    • tests are not atomic
    • tests do not separate behaviours
    • tests may have side effects (are they safe to run in parallel?)
    • tests should use testing patterns
    • no mocks of interfaces for complex behaviour (may not be an issue)
  • most, if not all, extend WikiTestCase

WikiTestCase

    • WikiTestCase is OS-dependent.
    • WikiTestCase hardcodes file locations.
    • WikiTestCase reads a global configuration.
    • WikiTestCase sets WikiQueryParser boosts:
      • TITLE_BOOST
      • ALT_TITLE_BOOST
      • CONTENTS_BOOST
    • When WikiTestCase has issues, one cannot run tests.
    • WikiTestCase imports org.wikimedia.lsearch.config.GlobalConfiguration.
    • It is not possible to run tests if either WikiQueryParser or GlobalConfiguration fails to compile. This is bad, since between them they pull in most parts of the project.

Package org.wikimedia.lsearch.analyzers


Package org.wikimedia.lsearch.beans


Package org.wikimedia.lsearch.config


Package org.wikimedia.lsearch.highlight


Package org.wikimedia.lsearch.index


Package org.wikimedia.lsearch.ranks


Package org.wikimedia.lsearch.search


Package org.wikimedia.lsearch.spell


Package org.wikimedia.lsearch.spell.api


Package org.wikimedia.lsearch.storage


Package org.wikimedia.lsearch.test


Package org.wikimedia.lsearch.util


Search 2 Code Porting Notes


Questions

  • Is it possible to use the old filter code with the new Lucene libraries?
  • Is it possible to make a non-invasive modification which would allow using the old filter code almost as-is?
  • What do the various filters do?
  • Which filters are in use in the various setups?

Contributing To Solr


It would be strategically wise to contribute back to Apache Lucene and/or Apache Solr. This is because:

  • Once the code is integrated into Solr, that project has to keep it up to date with the changes in Lucene's API which invariably occur over time.
  • They have extensive tests.
  • Their user base will find and fix bugs much faster than can be done in the Wikipedia ecosystem.

Additional Porting notes

  1. The method tokenStream(String, Reader) in the type Analyzer is not applicable for the arguments (String, String)
    1. analyzer.tokenStream(field,text) → analyzer.tokenStream(field,new StringReader(text))
    2. analyzer.tokenStream(field,"") → analyzer.tokenStream(field,(Reader)null)
  2. The constructor HitQueue(int) is undefined
    1. HitQueue hq = new HitQueue(nDocs) → HitQueue hq = new HitQueue(nDocs, true)

References

  1. A Japanese stemmer now exists and may do a better job than this filter.
  2. The implementation is problematic because it could conflict with n-gram tokens in the text. It would be better to prefix the n-grams with a standard prefix, or better yet to store them in payloads.
  3. This filter may be obsolete, since indexing with term position vectors would allow querying for these token sequences using a standard index and would also be great for highlighting.
  4. http://e-mats.org/2009/09/porting-solr-token-filter-from-lucene-2-4-x-to-lucene-2-9-x/
  5. https://issues.apache.org/jira/browse/SOLR-1876