Lucene 4.0 adds a lot of power but, once again, also breaks a lot of things. Some of the changes appear quite esoteric, but they are explained in recent blog posts from the community.
- http://www.searchworkings.org/blog/-/blogs/lucene-s-tokenstreams-are-actually-graphs! A discussion of token streams as graphs. It details:
- how they currently work (serially)
- how they should work (as graphs), with each token described by <position, PositionIncrementAttribute, PositionLengthAttribute>
- and why they don't work: PositionLengthAttribute is new and unsupported by many analysis components (WordDelimiterFilter, DictionaryCompoundWordTokenFilter, HyphenationCompoundWordTokenFilter, NGramTokenFilter, EdgeNGramTokenFilter)
- However, it may be possible to write a wikiIndexer that indexes both source and output together, i.e. one branch for the source and one branch for its expanded versions.
- The advantage of this approach is that it is mostly encapsulated within the iterator.
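To make the graph idea concrete, here is a minimal sketch in plain Python (not the Lucene API). The `Token` tuple, the `to_arcs` and `paths` helpers, and the "wifi" / "wi fi" example are all hypothetical, chosen only for illustration. It assumes each token starts at the previous start position plus its position increment and spans its position length, and it shows what is lost when a component ignores the position length.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    term: str
    pos_inc: int      # stands in for PositionIncrementAttribute
    pos_len: int = 1  # stands in for PositionLengthAttribute


def to_arcs(tokens):
    """Turn a serial token stream into (from_position, to_position, term) arcs."""
    arcs = []
    pos = -1  # the first token's increment of 1 moves this to position 0
    for tok in tokens:
        pos += tok.pos_inc
        arcs.append((pos, pos + tok.pos_len, tok.term))
    return arcs


def paths(arcs):
    """List every term sequence that leads from the first position to the last."""
    outgoing = defaultdict(list)
    for start, end, term in arcs:
        outgoing[start].append((end, term))
    first = min(start for start, _, _ in arcs)
    last = max(end for _, end, _ in arcs)

    def walk(node):
        if node == last:
            yield []
            return
        for end, term in outgoing[node]:
            for rest in walk(end):
                yield [term] + rest

    return [" ".join(p) for p in walk(first)]


# Hypothetical stream: "wifi" is kept and also expanded to "wi fi".
stream = [
    Token("fast", 1),
    Token("wifi", 1, 2),  # spans two positions
    Token("wi", 0),       # increment 0: starts at the same position as "wifi"
    Token("fi", 1),
    Token("network", 1),
]
arcs = to_arcs(stream)
graph_paths = paths(arcs)

# The same stream after a component that drops the position length.
flattened = [tok._replace(pos_len=1) for tok in stream]
serial_paths = paths(to_arcs(flattened))
```

With the position length intact, `graph_paths` holds the two intended readings, "fast wifi network" and "fast wi fi network". Once it is dropped, `serial_paths` contains the spurious "fast wifi fi network", which is the kind of breakage the unsupported components cause.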
- https://issues.apache.org/jira/browse/LUCENE-2858 - Separation of atomic readers and reader collections.
- https://issues.apache.org/jira/browse/LUCENE-2831 - IndexReaderContext + Atomic-/CompositeReaderContext
- http://blog.mikemccandless.com/2010/08/lucene-performance-with-pfordelta-codec.html Lucene uses ints to encode postings (docs, freqs, positions) in the index. These are currently compressed in a form optimised for write access but not for read access. It is now possible to change this part of the index, and some alternative implementations are being developed. These implementations are interesting in that they permit trading index size for speed, which is always something we like to look into. (cf. http://www2008.org/papers/pdf/p387-zhangA.pdf; http://www.ir.uwaterloo.ca/book/)
- VBE - Variable Byte Encoding, Lucene's standard format, where each integer is individually encoded as 1-5 bytes.
- FOR - Frame-of-reference encoding (one bit width, that of the largest value, fits all)
- PFOR - Patched frame-of-reference encoding (uses a smaller bit width, with a marker for the larger ints)
- PFORDELTA - works on batches of 32 integers in two passes
- SIMPLE9 - uses 9 cases to store the ints, selected by 4 status bits
- SIMPLE16 - uses 16 cases to store the ints, selected by 4 status bits
- RICECODING - a variant of Golomb coding
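The size trade-off between the first two schemes can be sketched in a few lines of plain Python. This is an illustration only, not Lucene's codec code. The function names and sample gap lists are hypothetical, inputs are assumed to be non-empty lists of non-negative ints, and the byte layout (7 data bits per byte, low-order bits first, high bit set when more bytes follow) is an assumption of the sketch.

```python
def vbyte_encode(values):
    """Variable byte encoding: each int takes as many 7-bit groups as it needs."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # high bit set: more bytes follow
            v >>= 7
        out.append(v)
    return bytes(out)


def vbyte_decode(data):
    values, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def for_encode(values):
    """Frame of reference: pack every value with the bit width of the largest."""
    width = max(1, max(values).bit_length())
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    n_bytes = (width * len(values) + 7) // 8
    return width, packed.to_bytes(n_bytes, "little")


def for_decode(width, data, count):
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(count)]


# Hypothetical doc-id gaps: small values only, then the same list with one outlier.
gaps = [3, 1, 7, 2, 6, 5, 1, 4]
gaps_with_outlier = [3, 1, 7, 2, 130, 5, 1, 4]

vbyte_size = len(vbyte_encode(gaps))                       # one byte per value
for_width, for_bytes = for_encode(gaps)                    # 3 bits per value
outlier_width, outlier_bytes = for_encode(gaps_with_outlier)
```

For the small gaps, variable byte encoding needs 8 bytes while frame of reference needs 3. A single outlier of 130 forces the width up to 8 bits for every value, which is the case PFOR addresses by keeping the small width and handling the large ints separately.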
- http://www.flax.co.uk/blog/2012/06/22/updating-individual-fields-in-lucene-with-a-redis-backed-codec/ one of the first alternative codec implementations for the index
- Automaton Invasion (Lucene Revolution 2012): the slides are at https://docs.google.com/presentation/d/1Z7OYvKc5dHAXiVdMpk69uulpIT6A7FGfohjHx8fmHBU/edit?pli=1#slide=id.g5768afb_0_110