Extension talk:Lucene-search/archive/2009

2009

./configure for v. 2.1 does not seem to work

Running Ubuntu 8.04, Ant 1.7, Java 1.6.0_07, using the Binary install package:

user@host: ./configure /path/to/mw/install

0 [main] WARN org.wikimedia.lsearch.util.Command  - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command: 
	at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
	at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
	at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo ./configure /path/to/mw/install
0 [main] WARN org.wikimedia.lsearch.util.Command  - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command: 
	at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
	at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
	at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo su
root@host: ./configure /path/to/mw/install
0 [main] WARN org.wikimedia.lsearch.util.Command  - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command: 
	at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
	at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
	at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

Seems to me that this is highly unlikely to be a permissions issue. My MW installation is working just fine otherwise.

I can't even get past the first step of the instructions, which does not bode well. Will try building from source, but doubt that will make any difference.... Any ideas? --Fungiblename 20:38, 18 March 2009 (UTC)

You need to replace /path/to/mw/install with the actual path to your mediawiki installation (e.g. something like /var/www/mediawiki/). --Rainman 21:07, 18 March 2009 (UTC)
Thanks, I was using my actual path but did not want to reproduce it here in full. I was able to compile the SVN version, however, and even after changing the "hostname" variable to my actual hostname as recognized by Apache, I get the following:
./configure /var/www/mw
0 [main] WARN org.wikimedia.lsearch.util.Command  - Got exit value 1 while executing [/bin/bash, -c, cd /var/www/mw && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command: 
	at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
	at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
	at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)
--Fungiblename 21:14, 18 March 2009 (UTC)

If you go into your MediaWiki installation directory (i.e. the one you supplied) and run echo "return \$wgDBname" | php maintenance/eval.php, what do you get? Do you get the name of your database? --Rainman 21:28, 18 March 2009 (UTC)
Thanks for the troubleshooting advice! It seems this was a major oversight on my part. I get the same error as above because I'm running a small wiki farm with shared code (symlinks from the install directory to the shared MediaWiki code). Once I ran "export MW_INSTALL_PATH=/var/www/mw && ./configure /var/www/mw", it wrote all the config files. You may want to add a note on the main page about configuring for installations with shared code (at least this very basic step). I'll play around on my own to try to find a way to have multiple separate indexes (my plan is to set up multiple directories with separate config files, index directories, and a symlink to the main jar). I'll try to get it working with just one first, though. Thanks again for your help and all your hard work on this! --Fungiblename 07:39, 19 March 2009 (UTC)
For me, configure writes the wrong value of dbname to config.inc, which causes problems. There I see "dbname=> DatabaseName>". Note the stray ">" signs. Calling echo "return \$wgDBname" | php maintenance/eval.php returns
> DatabaseName
>
On some servers eval.php prints its prompt to stdout. I found that this happens when the PHP function posix_isatty exists; sometimes it does not.
configure also expects php to be in PATH, which is not always the case either. --Roma7
I had the same problem and solved it. It seems that ./configure didn't "recognize" the PHP in the LAMP package, so I simply installed PHP CLI and it worked. --Gregra 21:55, 5 December 2009 (UTC)
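
For anyone hitting the stray ">" in dbname, here is a minimal sketch of cleaning the eval.php output before it is used. The helper name strip_eval_prompt is hypothetical and not part of lucene-search; it assumes the only noise is a leading ">" prompt on each line, as in the output quoted above.

```python
def strip_eval_prompt(raw: str) -> str:
    """Return the first non-empty value in eval.php output, minus ">" prompts.

    Hypothetical helper (not part of lucene-search). Assumes the only noise
    is a leading ">" prompt on each line, as in the output quoted above.
    """
    for line in raw.splitlines():
        value = line.strip()
        # Drop any leading prompt markers such as "> " or a bare ">".
        while value.startswith(">"):
            value = value[1:].lstrip()
        if value:
            return value
    return ""
```

Fed the two-line output shown above ("> DatabaseName" followed by ">"), this returns just DatabaseName, which is the value dbname should have had.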

Here's just a taste of my output from trying to build from source of the STABLE version

user@host:~/common/elements/lucene-SVN-stable-2009-03-18$ ant
Buildfile: build.xml

build:

   [mkdir] Created dir: /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin
   [javac] Compiling 101 source files to /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin
   [javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/analyzers/WikiQueryParser.java:24: package org.mediawiki.importer does not exist
   [javac] import org.mediawiki.importer.ExactListFilter;
   [javac]                              ^
   [javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/importer/DumpImporter.java:13: package org.mediawiki.importer does not exist...

.... rTest.java uses or overrides a deprecated API.

   [javac] Note: Recompile with -Xlint:deprecation for details.
   [javac] Note: Some input files use unchecked or unsafe operations.
   [javac] Note: Recompile with -Xlint:unchecked for details.
   [javac] 70 errors

BUILD FAILED
/home/username/common/elements/lucene-SVN-stable-2009-03-18/build.xml:68: Compile failed; see the compiler error output for details.

Total time: 2 seconds

ant -Xlint:deprecation -f build.xml
Unknown argument: -Xlint:deprecation

Does anyone have any instructions about how to even get this thing running? Are there some hidden instructions/prerequisites that I'm missing? Seems to me this should be pretty easy to run on Linux.... --Fungiblename 20:53, 18 March 2009 (UTC)

You must place "mwdumper.jar" in the "lib" directory of the tree downloaded from SVN. --Fungiblename 21:12, 18 March 2009 (UTC)

Unable to build

When building from the binary I get this error. I am on Ubuntu:

root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# ./build
Dumping wikidb...
2009-03-19 20:14:42: wikidb 99 pages (143.215/sec), 100 revs (144.661/sec), ETA 2009-03-19 20:14:45 [max 513]
2009-03-19 20:14:42: wikidb 199 pages (192.676/sec), 200 revs (193.645/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 299 pages (222.928/sec), 300 revs (223.674/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 399 pages (230.430/sec), 400 revs (231.008/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 458 pages (243.707/sec), 458 revs (243.707/sec), ETA 2009-03-19 20:14:44 [max 513]
mkdir: cannot create directory `/var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status': No such file or directory
./build: line 19: /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status/wikidb: No such file or directory
MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /root/.lsearch.conf
Trying config file at path /var/lib/mediawiki/extensions/lucene-search-2.1/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.

1    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
2799 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.links
3208 [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links...
458 pages (26.889/sec), 458 revs (26.889/sec)
21058 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.links
21291 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.links/20090319161516
21405 [main] INFO  org.wikimedia.lsearch.search.UpdateThread  - Syncing wikidb.links
21963 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links
21973 [main] INFO  org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links
34467 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for wikidb.related
34649 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.related/20090319161529
34661 [main] INFO  org.wikimedia.lsearch.importer.Importer  - Indexing articles (index+highlight+titles)...
34663 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links
35075 [main] INFO  org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 329 ms
35077 [main] INFO  org.wikimedia.lsearch.importer.SimpleIndexWriter  - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb
35087 [main] INFO  org.wikimedia.lsearch.importer.SimpleIndexWriter  - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.hl
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(libgcj.so.81)
   at java.io.ByteArrayOutputStream.write(libgcj.so.81)
   at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:514)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:317)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:166)
   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:525)
   at org.wikimedia.lsearch.storage.RelatedStorage.getRelated(RelatedStorage.java:56)
   at org.wikimedia.lsearch.importer.DumpImporter.writeEndPage(DumpImporter.java:109)
   at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source)
   at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source)
   at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
   at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
   at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
   at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
   at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
   at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
   at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
   at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
   at org.wikimedia.lsearch.importer.Importer.main(Importer.java:186)
   at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:109)
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# 

root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# java  -version
java version "1.5.0"
gij (GNU libgcj) version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)

Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# javac
Eclipse Java Compiler v_774_R33x, 3.3.1
Copyright IBM Corp 2000, 2007. All rights reserved.
 
 Usage: <options> <source files | directories>
 If directories are specified, then their source contents are compiled.
 Possible options are listed below. Options enabled by default are prefixed
 with '+'.
 
 Classpath options:
    -cp -classpath <directories and zip/jar files 

What is wrong?

It won't work on GNU Java. You can use openjdk6, which is also open-source Java and is available as a package for Ubuntu. --Rainman 21:14, 19 March 2009 (UTC)

Thanks, I will give that a shot. 166.50.205.143 11:02, 20 March 2009 (UTC)
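
For anyone unsure which JVM the java command resolves to, here is a minimal sketch that flags GNU gij from the java -version text shown above. Both function names are hypothetical, and the check only looks for the "gij" and "libgcj" markers visible in that output; other GNU builds may print something different.

```python
import subprocess


def java_version_output(java: str = "java") -> str:
    """Run `java -version` and return everything it printed.

    The version banner usually goes to stderr, so both streams are joined.
    """
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return result.stdout + result.stderr


def is_gnu_java(version_output: str) -> bool:
    """Guess whether the banner comes from GNU gij/libgcj.

    Hypothetical check: it only looks for the markers seen in the
    output pasted in this thread.
    """
    text = version_output.lower()
    return "gij" in text or "libgcj" in text
```

Calling is_gnu_java(java_version_output()) before running ./build would catch the situation above without waiting for the ArrayIndexOutOfBoundsException.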

Newest binary (2.1.1) does not appear to run on Mac OS 10.5.6 - 2.1 did not run either.

$ java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

$ export MW_INSTALL_PATH=/Sites/mw/ && ./configure /Sites/mw/
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
	at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)

Any thoughts? I have been using Sphinx in the meantime (which uses about 90-95% less memory), but it does not provide a lot of the features that Lucene does; I would really like to get Lucene running. --Fungiblename 11:16, 26 March 2009 (UTC)

Solution

Use the Java Preferences app to make sure that Java SE 6 is the top preference; then it runs. Also, this appears to be hard-coded to look for the MySQL socket at /var/mysql/mysql.sock (I grepped for it in the ls2.1 directory). I have no desire to recompile to attempt to tweak it for my system, though. I run from a non-standard location, so I just made a symbolic link to that location from my actual install. YMMV. --Fungiblename 16:16, 31 March 2009 (UTC)

For details, see (in the meantime) Manual:Running MediaWiki on Mac OS X. --Achimbode 20:33, 9 August 2009 (UTC)

Hardcoded search port? 8123

Thank you for keeping this up to date.

I recently upgraded to the latest version. This time the configuration was way better; I loved the configuration generator. There's just one thing, though: I cannot use search port 8123. So I changed it in lsearch.conf and "LocalSettings.php". However, it didn't like that at all; it is still listening on 8123. Now the "noddy" question: am I missing something? Thanks --Cartoro 16:00, 31 March 2009 (UTC)


I have the same problem. In lsearch.conf, I set Search.port=8000, but when I start lsearchd the result is:
646 [Thread-2] INFO org.wikimedia.lsearch.frontend.SearchServer - Searcher started on port 8123
Sébas

This has been fixed in the latest binary (available for download from SourceForge) and in the svn version. --Rainman 13:52, 15 April 2009 (UTC)

I'm afraid the source is still showing the hardcoded "8123" (May 28, 2009).

Which file, where? Does changing the default port to some other value not work for you? --Rainman 21:31, 27 May 2009 (UTC)
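
When debugging this, it helps to first confirm what the config file actually requests, since a Search.port line that is still commented out leaves the default in place. Here is a minimal sketch, assuming lsearch.conf is a plain key=value file with "#" comments (as in the copy quoted later on this page) and that the last assignment wins; the function name is hypothetical and not part of lucene-search.

```python
DEFAULT_SEARCH_PORT = 8123  # default named in the lsearch.conf comments


def effective_search_port(conf_text: str) -> int:
    """Return the Search.port that the given lsearch.conf text requests.

    Hypothetical helper. Assumes key=value lines, "#" comments, and that
    the last uncommented assignment wins.
    """
    port = DEFAULT_SEARCH_PORT
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and commented-out settings.
        if not stripped or stripped.startswith("#"):
            continue
        key, sep, value = stripped.partition("=")
        if sep and key.strip() == "Search.port":
            port = int(value.strip())
    return port
```

This only reports what the file asks for. If it returns 8000 and lsearchd still logs "Searcher started on port 8123", the daemon itself is ignoring the setting, which is the behaviour described above for the older binary.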

XML-RPC server incomplete

I've installed the latest Lucene-search and MWSearch on MW 1.13 and found that the updatePage and deletePage actions don't go through.

Looking at the source code, I found that these handlers were removed in rev 32681 of lucene-search's RPCIndexDeamon.java. As far as I understand, there is now a new HTTP daemon available, but MWSearch isn't aware of it.

Am I missing something?

--Eugenem 07:39, 15 April 2009 (UTC)Reply

Using HTTP to post articles (either via xml-rpc or as a raw http attachment) is an old and deprecated way of updating the index, and thus the methods have been removed. To keep the index up to date, please use either complete rebuilds (via "./build") or Extension:OAIRepository (via "./update"). --Rainman 09:47, 15 April 2009 (UTC)
I see. Actually, I was interested in these features for making custom updates, such as the output of special pages. On our site we use a lot of special pages to show profiles, so we'd like to index the special page output instead of the template. Is there any way to do that? I mean some interface for adding a bunch of pages to the index using PHP, without writing a custom Java parser.
You could include those pages into the xml dump of your database (produced by maintenance/dumpBackup.php) and then index everything. The other way would be to include it into the OAI table, although that could be tricky since you would need to have consistent page_ids for those special pages in order for incremental update to work properly. There might be other ways, but they are bound to break something, so my advice is to stick with these two standard ways. --Rainman 11:03, 15 April 2009 (UTC)

Finally it works (for me)

Nothing but the following worked for my install. Here's what I did:

svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch
mv MWSearch extensions

svn co http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ lucene-search-2
cd lucene-search-2
ant
./configure
./build

Now, add the following to LocalSettings.php:

 # lsearch
 require_once("extensions/MWSearch/MWSearch.php");
 $wgSearchType = 'LuceneSearch';
 $wgLuceneHost = 'YourHostName';  # <-- change this!
 $wgLucenePort = 8123;
 # uncomment this if you use lucene-search 2.1
 # (MUST be AFTER the require_once!)
 $wgLuceneSearchVersion = 2.1;

Here YourHostName is the output of the 'hostname' command. The search doesn't work on my machine if I use the default, "192.168.0.1".

# test lucene, now
./lsearchd

How to customize synonyms and stop words?

See also http://www.gossamer-threads.com/lists/wiki/mediawiki/99213#99213

How can I edit the synonyms and stop words in order to bring the engine more in line with our needs?

You need to check out the source from svn. Then edit resources/dist/wordnet-en.txt (for synonyms) and stopwords-en.txt (for stop words). If this does not work, you could also try making your own Filter class and plugging it into the FilterFactory class. --Rainman 18:31, 1 May 2009 (UTC)

Thanks. I have done as you suggested. However, I do not see any indication that the system is ignoring stop words (e.g. if I search for the word "me", I get results). I also do not know how to confirm that the synonyms are working. Are there some good tests I could run to verify? --Marc 14:31, 6 May 2009 (MDT)

Searching Attachments

I am running

MediaWiki 1.13.1
PHP 5.2.4-2ubuntu5.6(apache2handler)
MySQL 5.0.51a-3ubuntu5.4

I have Extension:FileIndexer and Extension:MWSearch now installed and running.

The Lucene search capability seems to work far better than the default search capability, except that it no longer generates search results from attachments that were turned into text and then inserted in the image field.

Is this a limitation of the present software? I had hoped the Lucene Search would index the attachments, especially given the use of the FileIndexer.

Is it significant that the FQDN is http://wiki.tesla.local/ (on a local LAN) but that the hostname is wiki?

Attached are the configuration files.

lsearch.conf

# By default, will check /etc/lsearch.conf

################################################
# Global configuration
################################################

# URL to global configuration, this is the shared main config file, it can 
# be on a NFS partition or available somewhere on the network
MWConfig.global=file:///home/chris/lucene-search-2.1/lsearch-global.conf

# Local path to root directory of indexes
Indexes.path=/home/chris/lucene-search-2.1/indexes

# Path to rsync
Rsync.path=/usr/bin/rsync

# Extra params for rsync
# Rsync.params=--bwlimit=8192

################################################
# Search node related configuration
################################################

# Port of http daemon, if different from default 8123
# Search.port=8000

# In minutes, how frequently will the index host be checked for updates
Search.updateinterval=0.1

# In seconds, delay after which the update will be fetched
# used to scatter the updates around the hour
Search.updatedelay=0  

# In seconds, how frequently the dead search nodes should be checked
Search.checkinterval=10

# In milliseconds, for how long should the query be executed
# Search.timelimit=1000 

# if to wait for aggregates to warm up before deploying the searcher
Search.warmupaggregate=true

# cache *whole* index in RAM
Search.ramdirectory=false

# Disable wordnet aliases
Search.disablewordnet=true

# If this host runs on multiple CPUs maintain a pool of index searchers
# It's good idea to make it number of CPUs+1, or some larger odd number
SearcherPool.size=1

################################################
# Indexer related configuration
################################################

# In minutes, how frequently is a clean snapshot of index created
Index.snapshotinterval=2880

# Daemon type (http is started by default)
#Index.daemon=xmlrpc

# Port of daemon (default is 8321)
#Index.port=8080

# Maximal queue size after which index is being updated
Index.maxqueuecount=5000

# Maximal time an update can remain in queue before being processed (in seconds)
Index.maxqueuetimeout=12

# If to delete all old snapshots always (default to false - leaves the last good snapshot)
# Index.delsnapshots=true

################################################
# Log, ganglia, localization
################################################

# URL to MediaWiki message files
Localization.url=file:///home/chris/public_html_3/wiki/languages/messages 

# Username/password for password authenticated OAI repo
# OAI.username=user
# OAI.password=pass

# Max queue size on remote indexer after which we wait a bit
OAI.maxqueue=5000

# Number of docs to buffer before sending to inc updater
OAI.bufferdocs=500

# Log configuration
Logging.logconfig=/home/chris/lucene-search-2.1/lsearch.log4j

# Set debug to true to diagnose problems with log4j configuration
Logging.debug=false

# Turn this on to broadcast status to a Ganglia reporting system.
# Requires that 'gmetric' be in the PATH and runnable. You can
# override the default UDP broadcast port and interface if required.
#Ganglia.report=true
#Ganglia.port=8649
#Ganglia.interface=eth0

lsearch-global.conf

################################################
# Global search cluster layout configuration
################################################

[Database]
MediaWiki : (single) (spell,4,2) (language,en)

[Search-Group]
wiki : *

[Index]
wiki : *

[Index-Path]
<default> : /search

[OAI]
<default> : http://localhost/index.php

[Namespace-Boost]
<default> : (0,2) (1,0.5)

[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15

config.inc

dbname=MediaWiki
wgScriptPath=
hostname=wiki
indexes=/home/chris/lucene-search-2.1/indexes
mediawiki=/home/chris/public_html_3/wiki
base=/home/chris/lucene-search-2.1
wgServer=http://localhost
Unfortunately, lucene-search won't search attachments no matter what kind of extra extension you use. You could, however, try Extension:EzMwLucene, which is also Lucene-based but has a different set of features: it lacks some lucene-search functionality but does have attachment search. --Rainman 09:31, 26 May 2009 (UTC)
Thank you so much for the prompt response. I will try Extension:EzMwLucene, as attachment searching is a key feature I would like in our company wiki.
Thanks a bunch, Rainman. Do you know offhand what the major differences are between the two Lucene extensions? We have Lucene-search installed and would like to enable EzMwLucene, but it would be good to know what the feature differences are. --Gkullberg 13:59, 3 July 2009 (UTC)

Search within files?

Is it possible to use Lucene to search within files uploaded to MediaWiki?

On the Lucene page on Wikipedia it says:

"At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others can all be indexed so long as their textual information can be extracted."

It would be great if I could search within PDFs and Docs and whatever else I upload to my MediaWiki instance. --Gkullberg 19:55, 2 July 2009 (UTC)

See answer to previous question.... --Rainman 10:08, 3 July 2009 (UTC)

How to use CJKAnalyzer

Is it possible to use CJKAnalyzer for indexing pages written in Japanese?

Yes, just change (language,en) to (language,ja) in your config file (and re-run the build process). --Rainman 08:27, 10 July 2009 (UTC)

Periodic fatal errors while rebuilding index - "no segments* file"

I'm running Lucene-search on our local wiki. The build script runs correctly and produces a valid index, which is picked up by the daemon, and everything works fine...for a bit. I've created a cron job that runs the build script hourly, with the output of the script being emailed to me. The cron job runs happily for a spell and then I receive this in the output:

MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /home/system/mymintel-svc/.lsearch.conf
Trying config file at path /data/mymintel/mediawiki/lucene_search/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.
0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
582  [main] INFO  org.wikimedia.lsearch.ranks.Links  - Making index at /data/mymintel/mediawiki/lucene_search/indexes/import/it_wiki.links
924  [main] INFO  org.wikimedia.lsearch.ranks.LinksBuilder  - Calculating article links...
3,759 pages (338.679/sec), 3,759 revs (338.679/sec)
14271 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Making snapshot for it_wiki.links
14645 [main] INFO  org.wikimedia.lsearch.index.IndexThread  - Made snapshot /data/mymintel/mediawiki/lucene_search/indexes/snapshot/it_wiki.links/20090731050111
14696 [main] INFO  org.wikimedia.lsearch.search.UpdateThread  - Syncing it_wiki.links
15632 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links
15637 [main] INFO  org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links
15640 [main] FATAL org.wikimedia.lsearch.importer.Importer  - Cannot make related mapping: no segments* file found in  org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files:
MediaWiki lucene-search indexer - build spelling suggestion index.
16802 [main] INFO  org.wikimedia.lsearch.spell.SuggestBuilder  - Building spell-check for it_wiki
16802 [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
16931 [main] INFO  org.wikimedia.lsearch.spell.SuggestBuilder  - Rebuilding precursor index...
17037 [main] INFO  org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 68 ms
17039 [main] INFO  org.wikimedia.lsearch.spell.CleanIndexWriter  - Using phrase stopwords: [only, theirs, some, where, being, after, doing, did, they, herself, as, so, our, than, your, for, down, the, other, of, does, no, ours, with, from, them, by, also, you, hers, until, yourself, has, she, it, up, why, have, this, those, about, between, which, under, these, i, yours, but, his, myself, yourselves, having, more, be, her, into, its, an, he, on, over, was, here, to, such, above, because, nor, had, him, below, and, whoever, during, their, itself, been, most, that, out, each, or, a, own, all, what, in, ourselves, were, themselves, both, not, same, do, am, too, once, any, when, then, who, how, whom, my, through, there, before, very, we, against, few, while, again, me, at, if, himself, are, is, off, further]
17129 [main] INFO  org.wikimedia.lsearch.ranks.Links  - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links
java.io.IOException: no segments* file found in org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files:
 at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)

From this point onwards, the job will not run correctly until I have deleted the indexes directory and started from scratch.
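
The FATAL line above complains that no segments* file exists under indexes/search/it_wiki.links. Here is a minimal sketch of a pre-flight check that a cron wrapper could run before the build script, so that a broken state is reported early. The function name is hypothetical and not part of lucene-search, and the check only looks for file names starting with "segments"; it does not validate the index contents.

```python
import os


def indexes_missing_segments(search_dir: str) -> list:
    """List entries under search_dir that lack a Lucene segments* file.

    Hypothetical helper. An entry counts as healthy only if it resolves
    (following symlinks) to a directory that directly contains a file
    whose name starts with "segments".
    """
    broken = []
    for name in sorted(os.listdir(search_dir)):
        path = os.path.join(search_dir, name)
        # os.path.isdir follows symlinks, so a dangling link counts as broken.
        if not os.path.isdir(path):
            broken.append(name)
            continue
        has_segments = any(
            entry.startswith("segments")
            and os.path.isfile(os.path.join(path, entry))
            for entry in os.listdir(path)
        )
        if not has_segments:
            broken.append(name)
    return broken
```

Run against the indexes/search directory, an empty list means every index the daemon reads has a segments file; any names returned are the ones to inspect before deleting the whole indexes directory.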

I've dumped the directory structure of the filesystem when the index is working correctly, and when it's broken; the output is below.

Working config

indexes/
|-- import
|   |-- it_wiki
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.hl
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.links
|   |   |-- _8.cfs
|   |   |-- segments.gen
|   |   `-- segments_j
|   |-- it_wiki.related
|   |   |-- _d.cfs
|   |   |-- segments.gen
|   |   `-- segments_t
|   |-- it_wiki.spell
|   |   |-- _1v.cfs
|   |   |-- segments.gen
|   |   `-- segments_3t
|   `-- it_wiki.spell.pre
|       |-- _8.cfs
|       |-- segments.gen
|       `-- segments_j
|-- index
|   |-- it_wiki
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.hl
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.links
|   |   |-- _8.cfs
|   |   |-- segments.gen
|   |   `-- segments_j
|   `-- it_wiki.spell.pre
|       |-- _8.cfs
|       |-- segments.gen
|       `-- segments_j
|-- search
|   |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090730163156
|   |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090730163156
|   |-- it_wiki.links -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090730163123
|   |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090730163127
|   `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090730163230
|-- snapshot
|   |-- it_wiki
|   |   `-- 20090730163156
|   |       |-- _7.cfs
|   |       |-- segments.gen
|   |       `-- segments_h
|   |-- it_wiki.hl
|   |   `-- 20090730163156
|   |       |-- _7.cfs
|   |       |-- segments.gen
|   |       `-- segments_h
|   |-- it_wiki.links
|   |   `-- 20090730163123
|   |       |-- _8.cfs
|   |       |-- segments.gen
|   |       `-- segments_j
|   |-- it_wiki.related
|   |   `-- 20090730163127
|   |       |-- _d.cfs
|   |       |-- segments.gen
|   |       `-- segments_t
|   |-- it_wiki.spell
|   |   `-- 20090730163230
|   |       |-- _1v.cfs
|   |       |-- segments.gen
|   |       `-- segments_3t
|   `-- it_wiki.spell.pre
|       `-- 20090730163210
|           |-- _8.cfs
|           |-- segments.gen
|           `-- segments_j
|-- status
|   `-- it_wiki
`-- update
    |-- it_wiki
    |   `-- 20090730163156
    |       |-- _7.cfs
    |       |-- segments.gen
    |       `-- segments_h
    |-- it_wiki.hl
    |   `-- 20090730163156
    |       |-- _7.cfs
    |       |-- segments.gen
    |       `-- segments_h
    |-- it_wiki.links
    |   `-- 20090730163123
    |       |-- _8.cfs
    |       |-- segments.gen
    |       `-- segments_j
    |-- it_wiki.related
    |   `-- 20090730163127
    |       |-- _d.cfs
    |       |-- segments.gen
    |       `-- segments_t
    `-- it_wiki.spell
        `-- 20090730163230
            |-- _1v.cfs
            |-- segments.gen
            `-- segments_3t

Broken Config

indexes/
|-- import
|   |-- it_wiki
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.hl
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.links
|   |   |-- _5h.cfs
|   |   |-- segments.gen
|   |   `-- segments_bm
|   |-- it_wiki.related
|   |   |-- _4n.cfs
|   |   |-- segments.gen
|   |   `-- segments_9o
|   |-- it_wiki.spell
|   |   |-- _oj.cfs
|   |   |-- segments.gen
|   |   `-- segments_1dh
|   `-- it_wiki.spell.pre
|       |-- _39.fdt
|       |-- _39.fdx
|       |-- segments.gen
|       |-- segments_74
|       `-- write.lock
|-- index
|   |-- it_wiki
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.hl
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.links
|   |   |-- _5h.cfs
|   |   |-- segments.gen
|   |   `-- segments_bm
|   `-- it_wiki.spell.pre
|       |-- _39.fdt
|       |-- _39.fdx
|       |-- segments.gen
|       |-- segments_74
|       `-- write.lock
|-- search
|   |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090731040228
|   |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090731040228
|   |-- it_wiki.links
|   |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111
|   |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116
|   |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104
|   |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121
|   |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112
|   |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113
|   |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108
|   |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051
|   |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055
|   |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090731040125
|   `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090731040320
|-- snapshot
|   |-- it_wiki
|   |   |-- 20090731030246
|   |   |   |-- _27.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_4r
|   |   `-- 20090731040228
|   |       |-- _2f.cfs
|   |       |-- segments.gen
|   |       `-- segments_58
|   |-- it_wiki.hl
|   |   |-- 20090731030247
|   |   |   |-- _27.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_4r
|   |   `-- 20090731040228
|   |       |-- _2f.cfs
|   |       |-- segments.gen
|   |       `-- segments_58
|   |-- it_wiki.links
|   |   |-- 20090731120051
|   |   |   |-- _58.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_b3
|   |   `-- 20090731130055
|   |       |-- _5h.cfs
|   |       |-- segments.gen
|   |       `-- segments_bm
|   |-- it_wiki.related
|   |   |-- 20090731030132
|   |   |   |-- _49.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_8v
|   |   `-- 20090731040125
|   |       |-- _4n.cfs
|   |       |-- segments.gen
|   |       `-- segments_9o
|   |-- it_wiki.spell
|   |   |-- 20090731030355
|   |   |   |-- _mn.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_19o
|   |   `-- 20090731040320
|   |       |-- _oj.cfs
|   |       |-- segments.gen
|   |       `-- segments_1dh
|   `-- it_wiki.spell.pre
|       |-- 20090731030320
|       |   |-- _2z.cfs
|       |   |-- segments.gen
|       |   `-- segments_6c
|       `-- 20090731040253
|           |-- _38.cfs
|           |-- segments.gen
|           `-- segments_6v
|-- status
|   `-- it_wiki
`-- update
    |-- it_wiki
    |   |-- 20090731030246
    |   |   |-- _27.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_4r
    |   `-- 20090731040228
    |       |-- _2f.cfs
    |       |-- segments.gen
    |       `-- segments_58
    |-- it_wiki.hl
    |   |-- 20090731030247
    |   |   |-- _27.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_4r
    |   `-- 20090731040228
    |       |-- _2f.cfs
    |       |-- segments.gen
    |       `-- segments_58
    |-- it_wiki.links
    |   |-- 20090731120051
    |   |   |-- _58.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_b3
    |   `-- 20090731130055
    |       |-- _5h.cfs
    |       |-- segments.gen
    |       `-- segments_bm
    |-- it_wiki.related
    |   |-- 20090731030132
    |   |   |-- _49.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_8v
    |   `-- 20090731040125
    |       |-- _4n.cfs
    |       |-- segments.gen
    |       `-- segments_9o
    `-- it_wiki.spell
        |-- 20090731030355
        |   |-- _mn.cfs
        |   |-- segments.gen
        |   `-- segments_19o
        `-- 20090731040320
            |-- _oj.cfs
            |-- segments.gen
            `-- segments_1dh

As you can see, the contents of indexes/search/it_wiki.links are completely different. I suspect that this is what's causing the problem, but I don't know enough about what's going on to diagnose it. Java version is:

java version "1.5.0_14-p8"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-p8-root_04_sep_2008_18_49)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-p8-root_04_sep_2008_18_49, mixed mode)

...and I'm running on FreeBSD 7.0, if that makes a difference. Any ideas what's going on? It'd be nice not to have to delete and rebuild the indexes by hand every day!

So this part

|-- search
|   |-- it_wiki.links
|   |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111
|   |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116
|   |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104
|   |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121
|   |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112
|   |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113
|   |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108
|   |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051
|   |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055

This looks quite wrong: everything in search/ should be a symlink, and there should be no subdirectories. I'm not sure how these were created. Are you sure that the whole build process takes less than an hour? If overlapping jobs try to do the same thing, they might lock each other's indexes. --Rainman 16:43, 3 August 2009 (UTC)Reply
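
The rule described above (every entry under search/ is a symlink, with no real subdirectories) can be checked from a cron wrapper. A minimal sketch, assuming a POSIX shell; the function name `find_non_symlinks` is made up for illustration and is not part of lucene-search:

```shell
#!/bin/sh
# Print every entry directly under the given search/ directory that is
# NOT a symbolic link. Returns 0 when all entries are symlinks, 1 otherwise.
# The directory argument is a placeholder for your own indexes/search path.
find_non_symlinks() {
    search_dir=$1
    bad=0
    for entry in "$search_dir"/*; do
        # Skip the unexpanded glob pattern when the directory is empty.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        if [ ! -L "$entry" ]; then
            printf '%s\n' "$entry"
            bad=1
        fi
    done
    return "$bad"
}
```

Run against the search/ directory of the broken listing, this should report it_wiki.links, since that is the only entry there that is a real directory.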

When I run the process by hand, it never takes more than 5 mins to complete, so I'd be very surprised if jobs are overlapping. Would probably be a good idea to make certain though, so I'll change the cronjob to time the process and send an update next time it fails. -- Mrgroucho 16:48, 3 August 2009 (UTC)Reply
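
One way to make certain that scheduled runs never overlap is to guard the job with a lock. A sketch under stated assumptions: `run_exclusive`, the lock path, and the exit code 75 are all invented here, and the approach relies only on `mkdir` being atomic.

```shell
#!/bin/sh
# Run a command only if no other run holds the lock directory.
# mkdir either creates the directory or fails, so two cron jobs
# cannot both acquire the lock.
# Usage: run_exclusive LOCKDIR COMMAND [ARGS...]
# Returns 75 (an arbitrary code chosen for this sketch) when a run is active.
run_exclusive() {
    lock_dir=$1
    shift
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active: $lock_dir" >&2
        return 75
    fi
    status=0
    "$@" || status=$?
    # Release the lock whether or not the command succeeded.
    rmdir "$lock_dir"
    return "$status"
}
```

A cron entry could then call something like `run_exclusive /var/tmp/lsearch-build.lock ./build`. If the machine goes down in the middle of a run, the lock directory is left behind and has to be removed by hand.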

OK, I think we can rule out overlapping jobs. The indexer ran successfully as scheduled for over 12 hours yesterday, and then failed early this morning. The time stats for the job, and the one preceding it, are as follows:

Preceding Job

real		 3m24.606s
user		 2m15.323s
sys		 0m16.361s

Failed Job

real		 0m59.046s
user		 0m20.773s
sys		 0m2.410s

The failed job takes less time, but you'd expect that: it failed. I've noted that the output mentions various Threads - is there any way that this could be some sort of race condition/locking problem between those threads? -- Mrgroucho 13:44, 4 August 2009 (UTC)Reply

Any idea at all on how to fix this? I've put a workaround in place that deletes all of the indexes every midnight and then runs ./build, which means that if it breaks during the day the indexes will never be massively out of date, but it's hardly a pretty fix. -- Mrgroucho 13:33, 7 August 2009 (UTC)Reply


I've had the same problem. The index building job (cronned for every hour) started to fail every couple of weeks, and then every week, and then every few days etc., until I was manually clearing out and rebuilding a couple of times a day! I'd already written a wrapper script around the 'build' script, which was just for making the process a bit more cron-friendly, checking the search daemon was running, and for aborting it altogether when my server-backup routines are in operation etc. I've now decided to have it build the indexes in a new location every hour, and then 'cut-over' if no error is encountered. So far, this seems to be effective.
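
The build-then-cut-over idea can be sketched roughly as follows. This is not the poster's script: `build_and_cut_over`, the `build-XXXXXX` directory names, and the `current` symlink are hypothetical, and the build command is passed in because this thread does not say how to point the indexer at another location.

```shell
#!/bin/sh
# Build into a fresh directory, then repoint a "current" symlink at it
# only if the build command succeeded. All names here are hypothetical.
# Usage: build_and_cut_over BASE_DIR BUILD_COMMAND [ARGS...]
# The build command receives the new directory as its last argument.
build_and_cut_over() {
    base_dir=$1
    shift
    new_dir=$(mktemp -d "$base_dir/build-XXXXXX") || return 1
    if "$@" "$new_dir"; then
        # -n stops ln from descending into the directory the old symlink
        # points at (supported by GNU and FreeBSD ln, not required by POSIX).
        ln -sfn "$new_dir" "$base_dir/current"
    else
        echo "build failed, keeping previous index" >&2
        rm -rf "$new_dir"
        return 1
    fi
}
```

Note that `mktemp -d` creates the directory readable only by its owner, so a search daemon running as another user would need the permissions widened. Old build directories are not cleaned up in this sketch.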

The indexes folder I was using seemed to be growing exponentially, and I think this may have been related to the problem. However, not being a Java or Lucene expert, I think this is a problem I'm gonna have to continue to work around instead of solving. --140.131.255.2 05:38, 7 September 2009 (UTC)Reply


Yeah, I have this problem too. If anyone can post a solution - I'd be grateful. Even a workaround script. Thanks. --Robinson Weijman 09:08, 5 March 2010 (UTC)Reply
We too are experiencing this problem, and I am surprised that no solution has been posted even though many people have reported this exact problem on the net. The last solution I tried was to do an "rm -r" of the whole "indexes" directory before each call to "build" the index. The problem still arises, but at least when it fails once, it usually succeeds afterwards. I think I have given LuceneSearch 2.1.3 a fair chance, and I am going to 2.0, hoping this problem will disappear. Phil Reid 30 Nov 2010

I also have this problem, and do indeed believe it to be some sort of locking issue. The way I (think) I've "solved" it is a bit of a kluge -- I am killing the search daemon (lsearchd) before the update and restarting it after the update. At the moment there is an interruption of 30 seconds every hour. This is not great, but I figure this is better than the index not being updated. Looking forward to a more elegant fix [Thu Dec 16 21:10:30 GMT 2010]

So much for that. This kluge worked until the symlink lucene-search-2.1.3/indexes/search/wikidb.links (pointing to lucene-search-2.1.3/indexes/update/wikidb.links/<datestamp>) was replaced by lucene-search-2.1.3/indexes/search/wikidb.links as a directory in its own right, containing symlinks to the datestamped dirs. I've now modified the update script to check if indexes/search/wikidb.links is a symlink and recreate it if not, making the kluge even worse. [Thu Dec 16 23:55:20 GMT 2010]
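
A check-and-recreate step of the kind described above might look like this. It is a guess at the shape of such a script, not the poster's actual code; `repair_search_link` is a made-up name, and picking the newest datestamped directory as the link target is an assumption.

```shell
#!/bin/sh
# If SEARCH_LINK is not a symlink, replace it with a symlink to the newest
# datestamped directory under UPDATE_DIR. Datestamps such as 20090731130055
# sort chronologically as plain strings, so the last sorted name is the newest.
# Usage: repair_search_link SEARCH_LINK UPDATE_DIR
repair_search_link() {
    search_link=$1
    update_dir=$2
    [ -n "$search_link" ] || return 1
    # Nothing to do when the link is already a symlink.
    if [ -L "$search_link" ]; then
        return 0
    fi
    newest=$(ls "$update_dir" 2>/dev/null | sort | tail -n 1)
    [ -n "$newest" ] || return 1
    # Remove the stray directory; rm -rf does not follow the symlinks in it.
    rm -rf "$search_link"
    ln -s "$update_dir/$newest" "$search_link"
}
```

For the layout in this thread the call would be along the lines of `repair_search_link indexes/search/wikidb.links "$PWD/indexes/update/wikidb.links"`, using an absolute update path so the link resolves regardless of the working directory.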

We also have this stupid problem. I'm planning to change the cron script so that it removes the index when there is a failure. It would be nice if you posted your dirty cron script workarounds. At the moment, my cron bash script checks the logfile for errors after the update script runs, e.g.:

CHECK=$(grep -i "Rebuild I/O error:" $LOGFILE | head -1 | cut -c10)

If there is a failure, I send a mail... --BSG2000 15:46, 20 December 2010 (UTC)Reply
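
The same log check can be wrapped in a function whose exit status a cron script can branch on. A sketch only: `log_has_rebuild_error` is an invented name, and the message text is the "Rebuild I/O error:" line that the indexer logs when it fails in this way.

```shell
#!/bin/sh
# Print the first "Rebuild I/O error:" line found in the log file and
# return 0; return 1 when the log contains no such line.
# Usage: log_has_rebuild_error LOGFILE
log_has_rebuild_error() {
    log_file=$1
    match=$(grep -i "Rebuild I/O error:" "$log_file" | head -n 1)
    [ -n "$match" ] || return 1
    printf '%s\n' "$match"
}
```

A wrapper could then run `if log_has_rebuild_error "$LOGFILE"; then ...; fi` and send the mail, or trigger whatever cleanup is wanted, inside the `if`.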


how to fix this when I run ./lsearchd

0rz </usr/local/search/ls2-bin> # ./lsearchd 
RMI registry started.
Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/search/ls2-bin/lsearch.conf
Exception in thread "main" java.lang.NullPointerException
        at org.wikimedia.lsearch.config.GlobalConfiguration.makeIndexIdPool(GlobalConfiguration.java:531)
        at org.wikimedia.lsearch.config.GlobalConfiguration.read(GlobalConfiguration.java:413)
        at org.wikimedia.lsearch.config.GlobalConfiguration.readFromURL(GlobalConfiguration.java:247)
        at org.wikimedia.lsearch.config.Configuration.<init>(Configuration.java:116)
        at org.wikimedia.lsearch.config.Configuration.open(Configuration.java:68)
        at org.wikimedia.lsearch.config.StartupManager.main(StartupManager.java:39)
0rz </usr/local/search/ls2-bin> #

my environment is:

0rz </usr/local/search/ls2-bin> # java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
0rz </usr/local/search/ls2-bin> # ant -version
Apache Ant version 1.7.1 compiled on June 27 2008
0rz </usr/local/search/ls2-bin> # 

LSearch Daemon Init Script for Ubuntu

  • This is just a sample. You will need to adjust this based on where you put the lucene-search directory.
#!/bin/sh -e
### BEGIN INIT INFO
# Provides:             lsearchd
# Required-Start:       $syslog
# Required-Stop:        $syslog
# Default-Start:        2 3 4 5
# Default-Stop:         1
# Short-Description:    Start the Lucene Search daemon
# Description:          Provide a Lucene Search backend for MediaWiki
### END INIT INFO

test -x /usr/local/lucene-search-2.1/lsearchd || exit 0

OPTIONS=""
if [ -f "/etc/default/lsearchd" ] ; then
        . /etc/default/lsearchd
fi

. /lib/lsb/init-functions

case "$1" in
  start)
    cd /usr/local/lucene-search-2.1
    log_begin_msg "Starting Lucene Search Daemon..."
    start-stop-daemon --start --quiet --oknodo --chdir /usr/local/lucene-search-2.1 --background --exec /usr/local/lucene-search-2.1/lsearchd -- $OPTIONS
    log_end_msg $?
    ;;
  stop)
    log_begin_msg "Stopping Lucene Search Daemon..."
    start-stop-daemon --stop --quiet --oknodo --retry 2 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd
    log_end_msg $?
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  reload|force-reload)
    log_begin_msg "Reloading Lucene Search Daemon..."
    start-stop-daemon --stop --signal 1 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd
    log_end_msg $?
    ;;
  status)
    status_of_proc /usr/local/lucene-search-2.1/lsearchd lsearchd && exit 0 || exit $?
    ;;
  *)
    log_success_msg "Usage: /etc/init.d/lsearchd {start|stop|restart|reload|force-reload|status}"
    exit 1
esac

exit 0


Error here when use configure

Hi,
After running "ant" to build the jar, I get the following error when generating the configuration files. What's wrong?

MediaWiki: 1.15.1; Lucene: 2.1; OS: CentOS

[root@xxx lucene-search-2.1]# ./configure /var/wk/
Exception in thread "main" java.net.UnknownHostException: 00:16:3h:2d:6c:b0-hk0.localdomain: 00:16:3h:2d:6c:b0-hk0.localdomain
        at java.net.InetAddress.getLocalHost(InetAddress.java:1425)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:52)

--Alpha3 11:26, 27 August 2009 (UTC)Reply

In which context config.inc will be used?

--Ans 08:22, 4 September 2009 (UTC)Reply

It is used in build process "./build" --Ans 09:40, 4 September 2009 (UTC)Reply

Install from SVN or Binary.

I would recommend SVN any day. I've been through several installations of Lucene Search this morning, and the most rapid and problem-free method was the SVN approach. It was also the *easiest* method to get Lucene to work - it just works. It also reads your MW config and produces its own *correct* configuration files.

MWSearch works just fine and dandy on top of this Lucene instance.

The 2.02 installation went badly, several times for me - and it chews a LOT more resources. I did get it working, but when it came to rebuilding indexes, it spewed up on the whole Computer - chewed 100% Chip, chewed more than 100% RAM, which caused Kernel Panic and failures. Had to reboot forcibly with hardware. Re-attempted several times and re-configured settings to test against "should be working" configuration - got the same results with the machine crawling. So I gave up, went to SVN, and instead of choking on the indexes, it rebuilt them in under 20 seconds. I understand there are Java issues around this. Forget them, it's not worth breaking Java on the System, or putting up with some strange configuration, just to get Lucene working. Gooooo SVN!! :-)

BTW: It should be made apparent on the Lucene-search Extension page that the SVN installation DOES work, and works VERY well. I had previously avoided this method as I am SVN-wary - where with a bit of prompting, that would have been my first choice.

Cheers, Mike

Brief period with zero search results (using update script)

Our wiki runs the "update" script every 15 minutes, and the update takes about 2 minutes. Updates are done locally on the single wiki server.

Unfortunately, for a brief time while this script runs, searches return zero results. This problem lasts just a few seconds, but our users do encounter it and become confused.

Any advice on eliminating this "zero results" period? We thought about running two different reindexing processes, each writing to a different directory, and switching between them with a symbolic link. Something like:

  1. At 6:00, update index #1 in /usr/local/lucene1.
  2. Point symbolic link /usr/local/lucene at /usr/local/lucene1.
  3. At 6:15, update index #2 in /usr/local/lucene2.
  4. Point symbolic link /usr/local/lucene at /usr/local/lucene2.
  5. At 6:30, update index #1 in /usr/local/lucene1.
  6. Point symbolic link /usr/local/lucene at /usr/local/lucene1.
  7. ...

But I don't see a way to make Lucene reindex one directory while serving out of another. Any better suggestions? Maiden taiwan 15:03, 23 September 2009 (UTC)Reply
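
The alternation in the list above comes down to picking whichever directory the symlink does not currently point at. The sketch below covers only that selection step; `next_slot` is a hypothetical helper, and it does not address the open question of how to make the indexer write into a chosen directory.

```shell
#!/bin/sh
# Print the slot that is NOT currently live.
# Usage: next_slot LINK SLOT_A SLOT_B
# If LINK does not exist yet or points elsewhere, SLOT_A is chosen.
next_slot() {
    link=$1
    slot_a=$2
    slot_b=$3
    current=$(readlink "$link" 2>/dev/null) || current=
    if [ "$current" = "$slot_a" ]; then
        printf '%s\n' "$slot_b"
    else
        printf '%s\n' "$slot_a"
    fi
}
```

With the paths from the list, `next_slot /usr/local/lucene /usr/local/lucene1 /usr/local/lucene2` would name the directory to update next, after which the symlink is repointed at it.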

This should not happen. Are there any errors in logs during this period? --Rainman 17:20, 23 September 2009 (UTC)Reply
Yes: the update script outputs:
MediaWiki lucene-search indexer - build a map of related articles.
  ...
413  [main] INFO  org.wikimedia.lsearch.related.RelatedBuilder  - Rebuilding related mapping from links
416  [main] FATAL org.wikimedia.lsearch.related.RelatedBuilder  - Rebuild I/O error:
no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files:
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files:
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:587)
        at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
        at org.wikimedia.lsearch.ranks.Links.flushForRead(Links.java:213)
        at org.wikimedia.lsearch.ranks.Links.ensureRead(Links.java:239)
        at org.wikimedia.lsearch.ranks.Links.getKeys(Links.java:773)
        at org.wikimedia.lsearch.related.RelatedBuilder.rebuildFromLinks(RelatedBuilder.java:91)
        at org.wikimedia.lsearch.related.RelatedBuilder.main(RelatedBuilder.java:72)
The named folder (wikidb.links) contains only symbolic links named after timestamps: 20090924121512, etc., linking to folders. Inside the folders (that exist) are files:
-rw-r--r-- 2 root root 4952161 Sep 24 12:00 _hq.cfs
-rw-r--r-- 2 root root      46 Sep 24 12:00 segments_172
-rw-r--r-- 2 root root      20 Sep 24 12:00 segments.gen
--Maiden taiwan 16:29, 24 September 2009 (UTC)Reply

No, this appears to be a separate issue. In any case, the extension does use symbolic links to quickly switch between the new and the old index, and it also allows for the new and the old index to co-exist for a while until all the old searches finish or time out. So, that shouldn't be a problem. What could be a problem is that if you have the indexer and searcher on the same machine with insufficient RAM, then the indexer bogs down the machine causing high I/O which then slows down the searchers to the point of searches timing out. --Rainman 08:56, 25 September 2009 (UTC)Reply

Thanks. We have plenty of RAM (4 GB I believe) on a virtual machine, and while the load average does go up to about 3.0 - 4.0 during indexing, users don't perceive any slowness. That is, the search query returns quickly with zero results. How long is the timeout? Maiden taiwan 11:49, 25 September 2009 (UTC)Reply

svn revision # for 2.1.2

What is the revision number for the 2.1.2 binary hosted at SourceForge?

I am having trouble getting any search results when building from the HEAD of http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ The index builds fine, but when I query I get no results. Returns exception:

java.lang.IllegalArgumentException: nDocs must be > 0

Querying directly to http://localhost:8123/search/wikidb/help gives:

267
#info search=[gziebold-15624s.local], highlight=[], suggest=[gziebold-15624s.local] in 46 ms
#no suggestion
#interwiki 0 0
#results 0


However, indexing and running based on 2.1.2 binary works fine.
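
When comparing builds this way, it can help to pull the hit count out of the daemon's raw response. A small sketch, assuming the response always carries a `#results N` line as in the zero-hit output above (the format of a response with hits is not shown in this thread); `results_count` is a made-up helper name.

```shell
#!/bin/sh
# Read an lsearchd HTTP response on stdin and print the number that
# follows "#results". Prints nothing if no such line is present.
results_count() {
    sed -n 's/^#results \([0-9][0-9]*\).*$/\1/p' | head -n 1
}
```

It could be fed from the query shown above, for example `curl -s http://localhost:8123/search/wikidb/help | results_count`.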

It's the svn revision of the date of release, don't know offhand, you'll have to look it up. I've built some indexes but didn't have problems with latest svn, can you provide a full stack trace? --Rainman 11:09, 10 November 2009 (UTC)Reply
RMI registry started.
Trying config file at path /Users/gziebold/.lsearch.conf
Trying config file at path /Users/gziebold/Projects/mediawiki/lucene-search-2.1.built/lsearch.conf
0    [main] INFO  org.wikimedia.lsearch.util.Localization  - Reading localization for En
733  [main] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RMIMessenger bound
737  [Thread-1] INFO  org.wikimedia.lsearch.frontend.HTTPIndexServer  - Indexer started on port 8321
739  [Thread-2] INFO  org.wikimedia.lsearch.frontend.SearchServer  - Searcher started on port 8123
746  [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Starting initial deployer for [wikidb, wikidb.hl, wikidb.links, wikidb.related, wikidb.spell]
818  [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Caching meta fields for wikidb ... 
2522 [Thread-5] INFO  org.wikimedia.lsearch.search.SearcherCache  - Finished caching wikidb in 1705 ms
2554 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb>$0 bound
2562 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.hl>$0 bound
2567 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.links>$0 bound
2575 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.related>$0 bound
2582 [Thread-5] INFO  org.wikimedia.lsearch.interoperability.RMIServer  - RemoteSearchable<wikidb.spell>$0 bound
6879 [Thread-8] INFO  org.wikimedia.lsearch.frontend.HttpMonitor  - HttpMonitor thread started
6881 [pool-2-thread-1] INFO  org.wikimedia.lsearch.frontend.HttpHandler  - query:/search/wikidb/wind?namespaces=0%2C500&offset=0&limit=20&version=2.1&iwlimit=10 what:search dbname:wikidb term:wind
6919 [pool-2-thread-1] INFO  org.wikimedia.lsearch.analyzers.StopWords  - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 21 ms
7052 [pool-2-thread-1] INFO  org.wikimedia.lsearch.search.SearchEngine  - Using FilterWrapper wrap: {0, 500} []
java.lang.IllegalArgumentException: nDocs must be > 0
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
	at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184)
	at org.apache.lucene.search.Searcher.search(Searcher.java:132)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129)
	at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101)
	at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
	at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:637)
7076 [pool-2-thread-1] WARN  org.wikimedia.lsearch.search.SearchEngine  - Retry, temporal error for query: [wind] on wikidb : nDocs must be > 0
java.lang.IllegalArgumentException: nDocs must be > 0
	at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
	at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184)
	at org.apache.lucene.search.Searcher.search(Searcher.java:132)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722)
	at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129)
	at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101)
	at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
	at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:637)

I also confirmed that the index built from the HEAD is fine. If I swap the LuceneSearch.jar from 2.1.2 binary and run it against indexes built from HEAD, it works. Processing query with LuceneSearch.jar from HEAD fails.

Other details: MediaWiki 1.15.1 (r50) MWSearch (Version r45173)

-- Looks like r48153 == 2.1.2 Or at least I was able to get that to successfully work with MediaWiki 1.15.1 --GregZ 04:06, 11 November 2009 (UTC)Reply

Ah I see.. fixed in svn. Thanks for the report. --Rainman 11:23, 11 November 2009 (UTC)Reply

Can anyone elaborate on the following from OVERVIEW.txt?

 searching categories. Syntax is: query incategory:"exact category
 name". It is important to note that category names are themselves
 not tokenized. Using logical operators, intersection, union and
 difference of categories can be searched. Since exact category is
 needed (only case is not important), it is maybe best to incorporate
 this somewhere on category page, and have category name put into
 query by MediaWiki instead manually by user.

The incategory: syntax does not appear to work as described (in 2.1.2)

Also, what is meant by ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?

It does work, but only for categories that are not added via templates, but in main article text. E.g. [1]. --Rainman 11:02, 10 November 2009 (UTC)Reply

Ah ha. That explains why my incategory: query was not working. The categories were added via templates. Is this because lucene-search indexes the wikitext for articles and does not resolve templates? Has there been discussion of a different index of the rendered html content instead of only indexing wikitext?

Yes, however, the current mediawiki architecture makes it difficult to do... in fact, what we would want is article not in html, but in wikitext with expanded templates... In any case, you won't find a search extension that does it. --Rainman 15:51, 10 November 2009 (UTC)Reply
How about this: keep indexing wikitext, but access the MediaWiki database to determine category relationships. Maiden taiwan 18:30, 16 December 2009 (UTC)Reply
On default mysql install that won't scale very well, otherwise it would be implemented a long time ago. --Rainman 22:31, 16 December 2009 (UTC)Reply
Thanks. Can you explain why it wouldn't scale? Couldn't it be done while Lucene builds the index -- just read the Mediawiki database tables once and you're done? Or maybe it could be dynamic like DPL does: it checks category membership dynamically just fine. Finally, even if it's slow, could you make this behavior an option and let the sysadmin decide whether it works on his/her site? Right now "incategory" produces different results than Mediawiki Category pages do. That seems like a bug.... Thank you. --Maiden taiwan 01:06, 17 December 2009 (UTC)Reply

Also, can you comment on this ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?

Well yes... if someone would have done it that would be nice.. You need to take into consideration that the file has probably been written at 2am on some sunday, and not take everything in it very seriously ;) --Rainman 15:51, 10 November 2009 (UTC)Reply

I simply needed the "late-night fog" translation.  :) --GregZ 04:03, 11 November 2009 (UTC)Reply

Is incategory going to be fixed?

Is the problem of incategory and transcluded category tags planned to be fixed? If incategory is returning wrong results (missing all articles with transcluded category tags), people are going to see incomplete search results and make wrong decisions ("Well, I guess there are no articles that match my query..."). This just happened on our wiki. Can't the extension just look at the wiki database to discover category relationships? Thanks. Maiden taiwan 18:02, 16 December 2009 (UTC)Reply

No, there are no plans to fix it. Lucene-search is primarily developed towards needs of WMF, and because this would never get enabled on WMF projects due to inefficiency it is not planned to be implemented. However, this is an open-source project and if you need this functionality you can make it yourself or pay someone to do it. --Rainman 12:52, 19 December 2009 (UTC)Reply