Extension talk:Lucene-search/archive/2009
2009
./configure for v. 2.1 does not seem to work
Running Ubuntu 8.04, Ant 1.7, Java 1.6.0_07, using the Binary install package:
user@host: ./configure /path/to/mw/install
0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command:
        at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
        at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo ./configure /path/to/mw/install
0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/install && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command:
        at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
        at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)

user@host: sudo su
root@host: ./configure /path/to/mw/install
0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /path/to/mw/instal && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command:
        at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
        at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)
Seems to me that this is highly unlikely to be a permissions issue. My MW installation is working just fine otherwise.
I can't even get past the first step of the instructions, which does not bode well. Will try building from source, but doubt that will make any difference.... Any ideas? --Fungiblename 20:38, 18 March 2009 (UTC)
- You need to replace /path/to/mw/install with the actual path to your mediawiki installation (e.g. something like /var/www/mediawiki/). --Rainman 21:07, 18 March 2009 (UTC)
- Thanks, I was using my actual path but did not want to reproduce it here in full. I was able to compile the SVN version, however, and even after changing the "hostname" variable to my actual hostname as recognized by Apache, I get the following:
./configure /var/www/mw
0 [main] WARN org.wikimedia.lsearch.util.Command - Got exit value 1 while executing [/bin/bash, -c, cd /var/www/mw && (echo "return \$wgDBname" | php maintenance/eval.php)]
Exception in thread "main" java.io.IOException: Error executing command:
        at org.wikimedia.lsearch.util.Command.exec(Command.java:45)
        at org.wikimedia.lsearch.util.Configure.getVariable(Configure.java:77)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:42)
--Fungiblename 21:14, 18 March 2009 (UTC)
- If you go into your mw installation dir (i.e. one you supplied) and run
echo "return \$wgDBname" | php maintenance/eval.php
what do you get? Do you get the name of your database? --Rainman 21:28, 18 March 2009 (UTC)
- Thanks for the troubleshooting advice! It seems like this was a major oversight on my part. I get the same error as above because I'm running a small wiki farm with shared code (symlinks from the install directory to the shared MediaWiki code). Once I wrote "export MW_INSTALL_PATH=/var/www/mw && ./configure /var/www/mw" it wrote all the config files. You may want to add a note on the main page about configuring for installations with shared code (at least this very basic step). I'll play around on my own to try to find a way to have multiple separate indexes (my plan is to set up multiple directories with separate config files, index directories, and a symlink to the main jar). I'll try to get it working with just one first, though. Thanks again for your help and all your hard work on this! --Fungiblename 07:39, 19 March 2009 (UTC)
- For me, configure sets a wrong value of dbname in config.ini, and that causes failures. Here I see "dbname=> DatabaseName>". Note the wrong ">" signs. Calling
echo "return \$wgDBname" | php maintenance/eval.php
returns
> DatabaseName >
- eval.php on some servers prints a prompt to stdout. I found that it happens when the PHP function posix_isatty exists; sometimes it does not.
- Also, configure wants php to be in the PATH, which is not always the case either. --Roma7
- I had the same problem and solved it. It seems that the ./configure didn't "recognize" the PHP in the LAMP package and so I simply installed PHP CLI and it worked...--Gregra 21:55, 5 December 2009 (UTC)
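Building on the fix above, a minimal check for whether ./configure will find a usable PHP CLI (a sketch; it only inspects the PATH, and the package name is the Ubuntu-era php5-cli):

```shell
# ./configure shells out to "php maintenance/eval.php", so a CLI `php`
# binary must be on the PATH; this reports what it would find.
if command -v php >/dev/null 2>&1; then
    PHP_STATUS="found at $(command -v php)"
else
    PHP_STATUS="missing - install a CLI package, e.g. sudo apt-get install php5-cli"
fi
echo "php CLI: $PHP_STATUS"
```

If this reports "missing", install the CLI package (not just the Apache module) before re-running ./configure.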
Here's just a taste of my output from trying to build the STABLE version from source
user@host:~/common/elements/lucene-SVN-stable-2009-03-18$ ant
Buildfile: build.xml
build:
[mkdir] Created dir: /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin
[javac] Compiling 101 source files to /home/username/common/elements/lucene-SVN-stable-2009-03-18/bin
[javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/analyzers/WikiQueryParser.java:24: package org.mediawiki.importer does not exist
[javac] import org.mediawiki.importer.ExactListFilter;
[javac] ^
[javac] /home/username/common/elements/lucene-SVN-stable-2009-03-18/src/org/wikimedia/lsearch/importer/DumpImporter.java:13: package org.mediawiki.importer does not exist...
.... rTest.java uses or overrides a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 70 errors

BUILD FAILED
/home/username/common/elements/lucene-SVN-stable-2009-03-18/build.xml:68: Compile failed; see the compiler error output for details.
Total time: 2 seconds
$ ant -Xlint:deprecation -f build.xml
Unknown argument: -Xlint:deprecation
Does anyone have any instructions about how to even get this thing running? Are there some hidden instructions/prerequisites that I'm missing? Seems to me this should be pretty easy to run on Linux.... --Fungiblename 20:53, 18 March 2009 (UTC)
- You must place "mwdumper.jar" in the "lib" directory of the directory downloaded from SVN. --Fungiblename 21:12, 18 March 2009 (UTC)
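The fix above can be sketched as follows; SRC and JAR here are scratch stand-ins (hypothetical placeholders) so the sequence can be dry-run, and should be replaced with your real SVN checkout and the downloaded mwdumper.jar:

```shell
# SRC and JAR are stand-ins for demonstration; point them at your
# real checkout and jar instead.
SRC="${SRC:-$(mktemp -d)}"   # e.g. ~/lucene-SVN-stable-2009-03-18
JAR="${JAR:-$(mktemp)}"      # e.g. a downloaded mwdumper.jar
mkdir -p "$SRC/lib"
cp "$JAR" "$SRC/lib/mwdumper.jar"   # the org.mediawiki.importer classes come from here
ls "$SRC/lib"
```

Then re-run ant in the checkout; the "package org.mediawiki.importer does not exist" errors should go away.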
Unable to build
When building from the binary I get this error. I am on Ubuntu:
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# ./build
Dumping wikidb...
2009-03-19 20:14:42: wikidb 99 pages (143.215/sec), 100 revs (144.661/sec), ETA 2009-03-19 20:14:45 [max 513]
2009-03-19 20:14:42: wikidb 199 pages (192.676/sec), 200 revs (193.645/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 299 pages (222.928/sec), 300 revs (223.674/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 399 pages (230.430/sec), 400 revs (231.008/sec), ETA 2009-03-19 20:14:44 [max 513]
2009-03-19 20:14:43: wikidb 458 pages (243.707/sec), 458 revs (243.707/sec), ETA 2009-03-19 20:14:44 [max 513]
mkdir: cannot create directory `/var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status': No such file or directory
./build: line 19: /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/status/wikidb: No such file or directory
MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /root/.lsearch.conf
Trying config file at path /var/lib/mediawiki/extensions/lucene-search-2.1/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.
1 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
2799 [main] INFO org.wikimedia.lsearch.ranks.Links - Making index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.links
3208 [main] INFO org.wikimedia.lsearch.ranks.LinksBuilder - Calculating article links...
458 pages (26.889/sec), 458 revs (26.889/sec)
21058 [main] INFO org.wikimedia.lsearch.index.IndexThread - Making snapshot for wikidb.links
21291 [main] INFO org.wikimedia.lsearch.index.IndexThread - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.links/20090319161516
21405 [main] INFO org.wikimedia.lsearch.search.UpdateThread - Syncing wikidb.links
21963 [main] INFO org.wikimedia.lsearch.ranks.Links - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links
21973 [main] INFO org.wikimedia.lsearch.related.RelatedBuilder - Rebuilding related mapping from links
34467 [main] INFO org.wikimedia.lsearch.index.IndexThread - Making snapshot for wikidb.related
34649 [main] INFO org.wikimedia.lsearch.index.IndexThread - Made snapshot /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/snapshot/wikidb.related/20090319161529
34661 [main] INFO org.wikimedia.lsearch.importer.Importer - Indexing articles (index+highlight+titles)...
34663 [main] INFO org.wikimedia.lsearch.ranks.Links - Opening for read /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/search/wikidb.links
35075 [main] INFO org.wikimedia.lsearch.analyzers.StopWords - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 329 ms
35077 [main] INFO org.wikimedia.lsearch.importer.SimpleIndexWriter - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb
35087 [main] INFO org.wikimedia.lsearch.importer.SimpleIndexWriter - Making new index at /var/lib/mediawiki/extensions/lucene-search-2.1/indexes/import/wikidb.hl
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
        at java.lang.System.arraycopy(libgcj.so.81)
        at java.io.ByteArrayOutputStream.write(libgcj.so.81)
        at org.apache.lucene.index.FieldsReader.uncompress(FieldsReader.java:514)
        at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:317)
        at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:166)
        at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
        at org.apache.lucene.index.IndexReader.document(IndexReader.java:525)
        at org.wikimedia.lsearch.storage.RelatedStorage.getRelated(RelatedStorage.java:56)
        at org.wikimedia.lsearch.importer.DumpImporter.writeEndPage(DumpImporter.java:109)
        at org.mediawiki.importer.PageFilter.writeEndPage(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.closePage(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(libgcj.so.81)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.wikimedia.lsearch.importer.Importer.main(Importer.java:186)
        at org.wikimedia.lsearch.importer.BuildAll.main(BuildAll.java:109)
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1#
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# java -version
java version "1.5.0"
gij (GNU libgcj) version 4.2.4 (Ubuntu 4.2.4-1ubuntu3)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
root@testwiki:/usr/share/mediawiki/extensions/lucene-search-2.1# javac
Eclipse Java Compiler v_774_R33x, 3.3.1
Copyright IBM Corp 2000, 2007. All rights reserved.
Usage: <options> <source files | directories>
If directories are specified, then their source contents are compiled.
Possible options are listed below. Options enabled by default are prefixed with '+'.
Classpath options:
-cp -classpath <directories and zip/jar files
What is wrong?
- It won't work on GNU java. You can use openjdk6 which is also opensource java and is available as a package for ubuntu. --Rainman 21:14, 19 March 2009 (UTC)
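A quick way to tell whether the installed java is the GNU runtime that causes the crash above (a sketch; it assumes `java` is on the PATH, and the package name is the Ubuntu-era openjdk-6-jre):

```shell
# gij/libgcj identifies itself in `java -version` output; lucene-search
# needs a Sun or OpenJDK VM instead.
JAVA_VER=$(java -version 2>&1 || true)
if printf '%s' "$JAVA_VER" | grep -qiE 'gij|libgcj'; then
    JAVA_STATUS="GNU java - switch VMs, e.g. sudo apt-get install openjdk-6-jre"
else
    JAVA_STATUS="no GNU java detected (or no java at all)"
fi
echo "$JAVA_STATUS"
```

After installing OpenJDK, confirm that `java -version` no longer mentions gij before re-running ./build.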
Thanks, I will give that a shot. 166.50.205.143 11:02, 20 March 2009 (UTC)
Newest binary (2.1.1) does not appear to run on Mac OS 10.5.6 - 2.1 did not run either.
$ java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)
$ export MW_INSTALL_PATH=/Sites/mw/ && ./configure /Sites/mw/
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
Any thoughts? I have been using Sphinx in the meantime (which uses about 90-95% less memory), but it does not provide a lot of the features that Lucene does; I would really like to get Lucene running. --Fungiblename 11:16, 26 March 2009 (UTC)
Solution
Change the Java preference using the Java Preferences app to make sure that Java SE 6 is the top preference; then it runs. Also, this appears to be hard-coded to look for mysql.sock in /var/mysql/mysql.sock (I grepped for it in the ls2.1 directory). I have no desire to recompile to attempt to tweak it for my system, though. I run from a non-standard location, so I just made a symbolic link to that location from my actual install. YMMV. --Fungiblename 16:16, 31 March 2009 (UTC)
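The symlink workaround can be sketched like this; PREFIX is a scratch stand-in so it can be dry-run safely, and the "expected" socket location is taken from the grep result above (for the real fix, drop PREFIX and point at your actual socket, likely with sudo):

```shell
# lucene-search 2.1 appears to expect the MySQL socket at /var/mysql/mysql.sock.
PREFIX="${PREFIX:-$(mktemp -d)}"        # scratch root; use the real / in practice
REAL_SOCK="$PREFIX/tmp/mysql.sock"      # stand-in for where your socket really lives
mkdir -p "$PREFIX/tmp" "$PREFIX/var/mysql"
touch "$REAL_SOCK"                      # in real life the socket already exists
ln -sf "$REAL_SOCK" "$PREFIX/var/mysql/mysql.sock"
ls -l "$PREFIX/var/mysql/mysql.sock"
```

This avoids recompiling: the hard-coded path resolves through the symlink to wherever MySQL actually listens.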
- For details see (meanwhile) Manual:Running MediaWiki on Mac OS X --Achimbode 20:33, 9 August 2009 (UTC)
Hardcoded search port? 8123
Thank you for keeping this up to date.
I recently upgraded to the latest. This time the configuration was way better; I loved that configuration generator. There's but one thing though: I cannot use the search port 8123. So I went off and changed it in lsearch.conf and "LocalSettings.php". However, it didn't like it at all. It is still listening on 8123. Now the "noddy" question: am I missing something? Thanks --Cartoro 16:00, 31 March 2009 (UTC)
I have the same problem.
In lsearch.conf, I have edited Search.port=8000.
But when I start lsearchd, the result is:
646 [Thread-2] INFO org.wikimedia.lsearch.frontend.SearchServer - Searcher started on port 8123
Sébas
- This has been fixed in latest binary (available for download from sourceforge) and svn version. --Rainman 13:52, 15 April 2009 (UTC)
I'm afraid the source is still showing the hardcoded "8123" (May 28, 2009).
- Which file, where? Does changing the default port to some other value not work for you? --Rainman 21:31, 27 May 2009 (UTC)
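For later readers: the port has to agree in two places. A sketch using the 8000 value from this thread (8123 is the default):

```
# lsearch.conf - port of the search daemon (default 8123)
Search.port=8000
```

and in LocalSettings.php set $wgLucenePort = 8000; so MWSearch queries the same port. Restart lsearchd after changing either file.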
XML-RPC server incomplete
I've installed the latest Lucene-search and MWSearch on MW 1.13 and found that the updatePage and deletePage actions don't pass through.
Looking at the source code, I found that these handlers were removed in rev 32681 of lucene-search RPCIndexDeamon.java. As far as I understand, there is now a new HTTP daemon available, but MWSearch isn't aware of it.
Am I missing something?
--Eugenem 07:39, 15 April 2009 (UTC)
- Using HTTP to post articles (either via xml-rpc or as a raw http attachment) is an old and deprecated way of updating the index, and thus the methods have been removed. To keep the index up-to-date, please use either complete rebuilds (via "./build") or Extension:OAIRepository (via "./update"). --Rainman 09:47, 15 April 2009 (UTC)
- I see. Actually I was interested in these features to make custom updates, such as the output of special pages. On our site we use a lot of special pages to show profiles, so we'd like to index the special page output instead of the template. Is there any way to do that? I mean some interface to add a bunch of pages to the index using PHP, without writing a custom Java parser.
- You could include those pages into the xml dump of your database (produced by maintenance/dumpBackup.php) and then index everything. The other way would be to include it into the OAI table, although that could be tricky since you would need to have consistent page_ids for those special pages in order for incremental update to work properly. There might be other ways, but they are bound to break something, so my advice is to stick with these two standard ways. --Rainman 11:03, 15 April 2009 (UTC)
Finally it works (for me)
Nothing but the following worked for my install. Here's what I did:
svn co http://svn.wikimedia.org/svnroot/mediawiki/trunk/extensions/MWSearch
mv MWSearch extensions
svn co http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ lucene-search-2
cd lucene-search-2
ant
./configure
./build
Now, add the following to LocalSettings.php:
# lsearch
require_once("extensions/MWSearch/MWSearch.php");
$wgSearchType = 'LuceneSearch';
$wgLuceneHost = 'YourHostName'; # <-- change this!
$wgLucenePort = 8123;
# uncomment this if you use lucene-search 2.1
# (MUST be AFTER the require_once!)
$wgLuceneSearchVersion = 2.1;
Where YourHostName is the result of running 'hostname'. The search doesn't work on my machine if I use the default, "192.168.0.1".
# test lucene, now
./lsearchd
How to customize synonyms and stop words?
How can I edit the synonyms and stop words in order to bring the engine more in line with our needs?
- You need to checkout the source from svn. Then edit resources/dist/wordnet-en.txt (for synonyms) and stopwords-en.txt. If this does not work, then you could also try making your own Filter class and plugging it into the FilterFactory class. --Rainman 18:31, 1 May 2009 (UTC)
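A sketch of the edit-and-rebuild loop described above. Two assumptions to verify first: SRC is a stand-in for your svn checkout, and stopwords-en.txt is guessed to be one word per line, so inspect the file before editing:

```shell
# SRC is a scratch stand-in here; use your real svn checkout directory.
SRC="${SRC:-$(mktemp -d)}"
mkdir -p "$SRC/resources/dist"
STOP="$SRC/resources/dist/stopwords-en.txt"
printf '%s\n' me you the >> "$STOP"   # words the analyzer should ignore
grep -x 'me' "$STOP"                  # confirm the entry landed
# then recompile and reindex so the change takes effect:
#   cd "$SRC" && ant && ./build
```

The rebuild matters: edits under resources/ only take effect after recompiling and re-running the indexer.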
Thanks. I have done as you suggest. However, I do not see any indication that the system is ignoring stop words (e.g. if I search with the word "me", I get results). I also do not know how to confirm that the synonyms are working. Are there some good tests I could run to verify? ----Marc 14:31, 6 May 2009 (MDT)
Searching Attachments
I am running:
MediaWiki 1.13.1
PHP 5.2.4-2ubuntu5.6 (apache2handler)
MySQL 5.0.51a-3ubuntu5.4
I have the FileIndexer and Lucene-search now installed and running.
The Lucene search capability seems to work far better than the default search capability, except that it no longer generates search results from attachments that were turned into text and then inserted in the image field.
Is this a limitation of the present software? I had hoped the Lucene Search would index the attachments, especially given the use of the FileIndexer.
Is it significant that the FQDN is http://wiki.tesla.local/ (on a local LAN) but that the hostname is wiki?
Attached are the configuration files.
lsearch.conf
# By default, will check /etc/lsearch.conf
################################################
# Global configuration
################################################
# URL to global configuration, this is the shared main config file, it can
# be on a NFS partition or available somewhere on the network
MWConfig.global=file:///home/chris/lucene-search-2.1/lsearch-global.conf
# Local path to root directory of indexes
Indexes.path=/home/chris/lucene-search-2.1/indexes
# Path to rsync
Rsync.path=/usr/bin/rsync
# Extra params for rsync
# Rsync.params=--bwlimit=8192
################################################
# Search node related configuration
################################################
# Port of http daemon, if different from default 8123
# Search.port=8000
# In minutes, how frequently will the index host be checked for updates
Search.updateinterval=0.1
# In seconds, delay after which the update will be fetched
# used to scatter the updates around the hour
Search.updatedelay=0
# In seconds, how frequently the dead search nodes should be checked
Search.checkinterval=10
# In milliseconds, for how long should the query be executed
# Search.timelimit=1000
# if to wait for aggregates to warm up before deploying the searcher
Search.warmupaggregate=true
# cache *whole* index in RAM
Search.ramdirectory=false
# Disable wordnet aliases
Search.disablewordnet=true
# If this host runs on multiple CPUs maintain a pool of index searchers
# It's good idea to make it number of CPUs+1, or some larger odd number
SearcherPool.size=1
################################################
# Indexer related configuration
################################################
# In minutes, how frequently is a clean snapshot of index created
Index.snapshotinterval=2880
# Daemon type (http is started by default)
#Index.daemon=xmlrpc
# Port of daemon (default is 8321)
#Index.port=8080
# Maximal queue size after which index is being updated
Index.maxqueuecount=5000
# Maximal time an update can remain in queue before being processed (in seconds)
Index.maxqueuetimeout=12
# If to delete all old snapshots always (default to false - leaves the last good snapshot)
# Index.delsnapshots=true
################################################
# Log, ganglia, localization
################################################
# URL to MediaWiki message files
Localization.url=file:///home/chris/public_html_3/wiki/languages/messages
# Username/password for password authenticated OAI repo
# OAI.username=user
# OAI.password=pass
# Max queue size on remote indexer after which we wait a bit
OAI.maxqueue=5000
# Number of docs to buffer before sending to inc updater
OAI.bufferdocs=500
# Log configuration
Logging.logconfig=/home/chris/lucene-search-2.1/lsearch.log4j
# Set debug to true to diagnose problems with log4j configuration
Logging.debug=false
# Turn this on to broadcast status to a Ganglia reporting system.
# Requires that 'gmetric' be in the PATH and runnable. You can
# override the default UDP broadcast port and interface if required.
#Ganglia.report=true
#Ganglia.port=8649
#Ganglia.interface=eth0
lsearch-global.conf
################################################
# Global search cluster layout configuration
################################################
[Database]
MediaWiki : (single) (spell,4,2) (language,en)
[Search-Group]
wiki : *
[Index]
wiki : *
[Index-Path]
<default> : /search
[OAI]
<default> : http://localhost/index.php
[Namespace-Boost]
<default> : (0,2) (1,0.5)
[Namespace-Prefix]
all : <all>
[0] : 0
[1] : 1
[2] : 2
[3] : 3
[4] : 4
[5] : 5
[6] : 6
[7] : 7
[8] : 8
[9] : 9
[10] : 10
[11] : 11
[12] : 12
[13] : 13
[14] : 14
[15] : 15
config.inc
dbname=MediaWiki
wgScriptPath=
hostname=wiki
indexes=/home/chris/lucene-search-2.1/indexes
mediawiki=/home/chris/public_html_3/wiki
base=/home/chris/lucene-search-2.1
wgServer=http://localhost
- Unfortunately lucene-search won't search attachments no matter what kind of extra extension you use. You could however try Extension:EzMwLucene which is also lucene-based but has a different set of features, doesn't have some lucene-search stuff, but has attachment search. --Rainman 09:31, 26 May 2009 (UTC)
- Thank you so much for the prompt response. I will try the Extension:EzMwLucene search, as attachment searching is a key feature I would like in our company wiki.
- Thanks a bunch Rainman. Do you know offhand what the major differences are between both Lucene extensions? We have Lucene-search installed but would like to enable EzMwLucene but it would be good to know what the feature differences are. --Gkullberg 13:59, 3 July 2009 (UTC)
Search within files?
Is it possible to use Lucene to search within files uploaded to MediaWiki?
On the Lucene page on Wikipedia it says:
"At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others can all be indexed so long as their textual information can be extracted."
It would be great if I could search within PDFs and Docs and whatever else I upload to my MediaWiki instance. --Gkullberg 19:55, 2 July 2009 (UTC)
- See answer to previous question.... --Rainman 10:08, 3 July 2009 (UTC)
How to use CJKAnalyzer
Is it possible to use CJKAnalyzer for indexing pages written in Japanese?
- Yes, just change (language,en) to (language,ja) in your config file (and re-run the build process). --Rainman 08:27, 10 July 2009 (UTC)
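Concretely, in lsearch-global.conf the [Database] line would become something like this (the dbname "MediaWiki" is taken from the config.inc shown earlier on this page; substitute your own):

```
[Database]
MediaWiki : (single) (spell,4,2) (language,ja)
```

Then re-run the build process so the pages are re-analyzed with the new language setting.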
Periodic fatal errors while rebuilding index - "no segments* file"
I'm running Lucene-search on our local wiki. The build script runs correctly and produces a valid index, which is picked up by the daemon, and everything works fine...for a bit. I've created a cron job that runs the build script hourly, with the output of the script being emailed to me. The cron job runs happily for a spell and then I receive this in the output:
MediaWiki lucene-search indexer - rebuild all indexes associated with a database.
Trying config file at path /home/system/mymintel-svc/.lsearch.conf
Trying config file at path /data/mymintel/mediawiki/lucene_search/lsearch.conf
MediaWiki lucene-search indexer - index builder from xml database dumps.
0 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
582 [main] INFO org.wikimedia.lsearch.ranks.Links - Making index at /data/mymintel/mediawiki/lucene_search/indexes/import/it_wiki.links
924 [main] INFO org.wikimedia.lsearch.ranks.LinksBuilder - Calculating article links...
3,759 pages (338.679/sec), 3,759 revs (338.679/sec)
14271 [main] INFO org.wikimedia.lsearch.index.IndexThread - Making snapshot for it_wiki.links
14645 [main] INFO org.wikimedia.lsearch.index.IndexThread - Made snapshot /data/mymintel/mediawiki/lucene_search/indexes/snapshot/it_wiki.links/20090731050111
14696 [main] INFO org.wikimedia.lsearch.search.UpdateThread - Syncing it_wiki.links
15632 [main] INFO org.wikimedia.lsearch.ranks.Links - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links
15637 [main] INFO org.wikimedia.lsearch.related.RelatedBuilder - Rebuilding related mapping from links
15640 [main] FATAL org.wikimedia.lsearch.importer.Importer - Cannot make related mapping: no segments* file found in org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files:
MediaWiki lucene-search indexer - build spelling suggestion index.
16802 [main] INFO org.wikimedia.lsearch.spell.SuggestBuilder - Building spell-check for it_wiki
16802 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
16931 [main] INFO org.wikimedia.lsearch.spell.SuggestBuilder - Rebuilding precursor index...
17037 [main] INFO org.wikimedia.lsearch.analyzers.StopWords - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 68 ms
17039 [main] INFO org.wikimedia.lsearch.spell.CleanIndexWriter - Using phrase stopwords: [only, theirs, some, where, being, after, doing, did, they, herself, as, so, our, than, your, for, down, the, other, of, does, no, ours, with, from, them, by, also, you, hers, until, yourself, has, she, it, up, why, have, this, those, about, between, which, under, these, i, yours, but, his, myself, yourselves, having, more, be, her, into, its, an, he, on, over, was, here, to, such, above, because, nor, had, him, below, and, whoever, during, their, itself, been, most, that, out, each, or, a, own, all, what, in, ourselves, were, themselves, both, not, same, do, am, too, once, any, when, then, who, how, whom, my, through, there, before, very, we, against, few, while, again, me, at, if, himself, are, is, off, further]
17129 [main] INFO org.wikimedia.lsearch.ranks.Links - Opening for read /data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links
java.io.IOException: no segments* file found in org.apache.lucene.store.FSDirectory@/data/mymintel/mediawiki/lucene_search/indexes/search/it_wiki.links: files:
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
From this point onwards, the job will not run correctly until I have deleted the indexes directory and started from scratch.
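For context, the hourly job described above would be a crontab entry along these lines (the path is taken from the log output; the exact entry is a hypothetical reconstruction):

```
# crontab entry: rebuild all indexes at the top of every hour
0 * * * * cd /data/mymintel/mediawiki/lucene_search && ./build
```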
I've dumped the directory structure of the filesystem when the index is working correctly, and when it's broken; the output is below.
Working config
indexes/
|-- import
|   |-- it_wiki
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.hl
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.links
|   |   |-- _8.cfs
|   |   |-- segments.gen
|   |   `-- segments_j
|   |-- it_wiki.related
|   |   |-- _d.cfs
|   |   |-- segments.gen
|   |   `-- segments_t
|   |-- it_wiki.spell
|   |   |-- _1v.cfs
|   |   |-- segments.gen
|   |   `-- segments_3t
|   `-- it_wiki.spell.pre
|       |-- _8.cfs
|       |-- segments.gen
|       `-- segments_j
|-- index
|   |-- it_wiki
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.hl
|   |   |-- _7.cfs
|   |   |-- segments.gen
|   |   `-- segments_h
|   |-- it_wiki.links
|   |   |-- _8.cfs
|   |   |-- segments.gen
|   |   `-- segments_j
|   `-- it_wiki.spell.pre
|       |-- _8.cfs
|       |-- segments.gen
|       `-- segments_j
|-- search
|   |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090730163156
|   |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090730163156
|   |-- it_wiki.links -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090730163123
|   |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090730163127
|   `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090730163230
|-- snapshot
|   |-- it_wiki
|   |   `-- 20090730163156
|   |       |-- _7.cfs
|   |       |-- segments.gen
|   |       `-- segments_h
|   |-- it_wiki.hl
|   |   `-- 20090730163156
|   |       |-- _7.cfs
|   |       |-- segments.gen
|   |       `-- segments_h
|   |-- it_wiki.links
|   |   `-- 20090730163123
|   |       |-- _8.cfs
|   |       |-- segments.gen
|   |       `-- segments_j
|   |-- it_wiki.related
|   |   `-- 20090730163127
|   |       |-- _d.cfs
|   |       |-- segments.gen
|   |       `-- segments_t
|   |-- it_wiki.spell
|   |   `-- 20090730163230
|   |       |-- _1v.cfs
|   |       |-- segments.gen
|   |       `-- segments_3t
|   `-- it_wiki.spell.pre
|       `-- 20090730163210
|           |-- _8.cfs
|           |-- segments.gen
|           `-- segments_j
|-- status
|   `-- it_wiki
`-- update
    |-- it_wiki
    |   `-- 20090730163156
    |       |-- _7.cfs
    |       |-- segments.gen
    |       `-- segments_h
    |-- it_wiki.hl
    |   `-- 20090730163156
    |       |-- _7.cfs
    |       |-- segments.gen
    |       `-- segments_h
    |-- it_wiki.links
    |   `-- 20090730163123
    |       |-- _8.cfs
    |       |-- segments.gen
    |       `-- segments_j
    |-- it_wiki.related
    |   `-- 20090730163127
    |       |-- _d.cfs
    |       |-- segments.gen
    |       `-- segments_t
    `-- it_wiki.spell
        `-- 20090730163230
            |-- _1v.cfs
            |-- segments.gen
            `-- segments_3t
Broken Config
indexes/
|-- import
|   |-- it_wiki
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.hl
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.links
|   |   |-- _5h.cfs
|   |   |-- segments.gen
|   |   `-- segments_bm
|   |-- it_wiki.related
|   |   |-- _4n.cfs
|   |   |-- segments.gen
|   |   `-- segments_9o
|   |-- it_wiki.spell
|   |   |-- _oj.cfs
|   |   |-- segments.gen
|   |   `-- segments_1dh
|   `-- it_wiki.spell.pre
|       |-- _39.fdt
|       |-- _39.fdx
|       |-- segments.gen
|       |-- segments_74
|       `-- write.lock
|-- index
|   |-- it_wiki
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.hl
|   |   |-- _2f.cfs
|   |   |-- segments.gen
|   |   `-- segments_58
|   |-- it_wiki.links
|   |   |-- _5h.cfs
|   |   |-- segments.gen
|   |   `-- segments_bm
|   `-- it_wiki.spell.pre
|       |-- _39.fdt
|       |-- _39.fdx
|       |-- segments.gen
|       |-- segments_74
|       `-- write.lock
|-- search
|   |-- it_wiki -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki/20090731040228
|   |-- it_wiki.hl -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.hl/20090731040228
|   |-- it_wiki.links
|   |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111
|   |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116
|   |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104
|   |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121
|   |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112
|   |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113
|   |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108
|   |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051
|   |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055
|   |-- it_wiki.related -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.related/20090731040125
|   `-- it_wiki.spell -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.spell/20090731040320
|-- snapshot
|   |-- it_wiki
|   |   |-- 20090731030246
|   |   |   |-- _27.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_4r
|   |   `-- 20090731040228
|   |       |-- _2f.cfs
|   |       |-- segments.gen
|   |       `-- segments_58
|   |-- it_wiki.hl
|   |   |-- 20090731030247
|   |   |   |-- _27.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_4r
|   |   `-- 20090731040228
|   |       |-- _2f.cfs
|   |       |-- segments.gen
|   |       `-- segments_58
|   |-- it_wiki.links
|   |   |-- 20090731120051
|   |   |   |-- _58.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_b3
|   |   `-- 20090731130055
|   |       |-- _5h.cfs
|   |       |-- segments.gen
|   |       `-- segments_bm
|   |-- it_wiki.related
|   |   |-- 20090731030132
|   |   |   |-- _49.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_8v
|   |   `-- 20090731040125
|   |       |-- _4n.cfs
|   |       |-- segments.gen
|   |       `-- segments_9o
|   |-- it_wiki.spell
|   |   |-- 20090731030355
|   |   |   |-- _mn.cfs
|   |   |   |-- segments.gen
|   |   |   `-- segments_19o
|   |   `-- 20090731040320
|   |       |-- _oj.cfs
|   |       |-- segments.gen
|   |       `-- segments_1dh
|   `-- it_wiki.spell.pre
|       |-- 20090731030320
|       |   |-- _2z.cfs
|       |   |-- segments.gen
|       |   `-- segments_6c
|       `-- 20090731040253
|           |-- _38.cfs
|           |-- segments.gen
|           `-- segments_6v
|-- status
|   `-- it_wiki
`-- update
    |-- it_wiki
    |   |-- 20090731030246
    |   |   |-- _27.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_4r
    |   `-- 20090731040228
    |       |-- _2f.cfs
    |       |-- segments.gen
    |       `-- segments_58
    |-- it_wiki.hl
    |   |-- 20090731030247
    |   |   |-- _27.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_4r
    |   `-- 20090731040228
    |       |-- _2f.cfs
    |       |-- segments.gen
    |       `-- segments_58
    |-- it_wiki.links
    |   |-- 20090731120051
    |   |   |-- _58.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_b3
    |   `-- 20090731130055
    |       |-- _5h.cfs
    |       |-- segments.gen
    |       `-- segments_bm
    |-- it_wiki.related
    |   |-- 20090731030132
    |   |   |-- _49.cfs
    |   |   |-- segments.gen
    |   |   `-- segments_8v
    |   `-- 20090731040125
    |       |-- _4n.cfs
    |       |-- segments.gen
    |       `-- segments_9o
    `-- it_wiki.spell
        |-- 20090731030355
        |   |-- _mn.cfs
        |   |-- segments.gen
        |   `-- segments_19o
        `-- 20090731040320
            |-- _oj.cfs
            |-- segments.gen
            `-- segments_1dh
As you can see, the contents of indexes/search/it_wiki.links are completely different. I suspect this is what's causing the problem, but I don't know enough about what's going on to diagnose it. The Java version is:
java version "1.5.0_14-p8"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-p8-root_04_sep_2008_18_49)
Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-p8-root_04_sep_2008_18_49, mixed mode)
...and I'm running on FreeBSD 7.0, if that makes a difference. Any ideas what's going on? It'd be nice not to have to delete and rebuild the indexes by hand every day!
So this part
|-- search
|   |-- it_wiki.links
|   |   |-- 20090731050111 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731050111
|   |   |-- 20090731060116 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731060116
|   |   |-- 20090731070104 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731070104
|   |   |-- 20090731080121 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731080121
|   |   |-- 20090731090112 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731090112
|   |   |-- 20090731100113 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731100113
|   |   |-- 20090731110108 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731110108
|   |   |-- 20090731120051 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731120051
|   |   `-- 20090731130055 -> /data/mymintel/mediawiki/lucene_search/indexes/update/it_wiki.links/20090731130055
Looks quite wrong... everything in search/ should be a symlink, and there should not be any subdirectories there. I'm not sure how these were created. Are you sure that the whole build process takes less than an hour? If you get overlapping jobs trying to do the same thing, they might lock each other's indexes. --Rainman 16:43, 3 August 2009 (UTC)
When I run the process by hand, it never takes more than 5 mins to complete, so I'd be very surprised if jobs are overlapping. Would probably be a good idea to make certain though, so I'll change the cronjob to time the process and send an update next time it fails. -- Mrgroucho 16:48, 3 August 2009 (UTC)
OK, I think we can rule out overlapping jobs. The indexer ran successfully as scheduled for over 12 hours yesterday, and then failed early this morning. The time stats for the job, and the one preceding it, are as follows:
Preceding Job
real 3m24.606s
user 2m15.323s
sys  0m16.361s
Failed Job
real 0m59.046s
user 0m20.773s
sys  0m2.410s
The failed job takes less time, but you'd expect that: it failed. I've noted that the output mentions various threads - is there any way this could be some sort of race condition or locking problem between those threads? -- Mrgroucho 13:44, 4 August 2009 (UTC)
Any idea at all on how to fix this? I've put a workaround in place that deletes all of the indexes every midnight and then runs ./build, which means that if it breaks during the day the indexes will never be massively out of date, but it's hardly a pretty fix. -- Mrgroucho 13:33, 7 August 2009 (UTC)
I've had the same problem. The index-building job (cronned for every hour) started to fail every couple of weeks, then every week, then every few days, etc., until I was manually clearing out and rebuilding a couple of times a day! I'd already written a wrapper script around the 'build' script, just to make the process a bit more cron-friendly: checking that the search daemon was running, aborting altogether when my server backup routines are in operation, and so on. I've now decided to have it build the indexes in a new location every hour, and then cut over if no error is encountered. So far, this seems to be effective.
The indexes folder I was using seemed to be growing exponentially, and I think this may have been related to the problem. However, not being a Java or Lucene expert, I think this is a problem I'm going to have to keep working around instead of solving. --140.131.255.2 05:38, 7 September 2009 (UTC)
- Yeah, I have this problem too. If anyone can post a solution - I'd be grateful. Even a workaround script. Thanks. --Robinson Weijman 09:08, 5 March 2010 (UTC)
- We too are experiencing this problem, and I am surprised no solution has been posted, although many have reported this exact problem on the net. The last workaround I tried was to do an "rm -r" of the whole "indexes" directory before each call to "build" the index. The problem still arises, but at least when it fails once it usually succeeds on the next run. I think I've given LuceneSearch 2.1.3 a fair chance, so I'm going back to 2.0, hoping this problem will disappear. Phil Reid 30 Nov 2010
I also have this problem, and do indeed believe it to be some sort of locking issue. The way I (think) I've "solved" it is a bit of a kluge -- I am killing the search daemon (lsearchd) before the update and restarting it after the update. At the moment there is an interruption of 30 seconds every hour. This is not great, but I figure this is better than the index not being updated. Looking forward to a more elegant fix [Thu Dec 16 21:10:30 GMT 2010]
- So much for that. This kluge worked until the symlink lucene-search-2.1.3/indexes/search/wikidb.links (pointing to lucene-search-2.1.3/indexes/update/wikidb.links/<datestamp>) was replaced by a directory of the same name in its own right, containing symlinks to the datestamped dirs. I've now modified the update script to check whether indexes/search/wikidb.links is a symlink and recreate it if not, making the kluge even worse. [Thu Dec 16 23:55:20 GMT 2010]
We also have this stupid problem. I'm planning to change the cron script to remove the index if there is a failure. It would be nice if you posted your dirty cron-script workarounds. At the moment, my cron bash script checks the logfile for errors after the update script runs, e.g.:
CHECK=$(grep -i "Rebuild I/O error:" "$LOGFILE" | head -1 | cut -c10)
If there is a failure, I send a mail... --BSG2000 15:46, 20 December 2010 (UTC)
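Along the same lines, here is a self-contained sketch of that kind of "grep the log, wipe and rebuild on failure" wrapper. The /tmp/lsearch-demo directory and the echoed log line are stand-ins (so the sketch runs as-is); in a real wrapper you would substitute your lucene-search path and the actual ./update and ./build invocations:

```shell
#!/bin/sh
# Sketch of a cron wrapper for the failure mode discussed in this thread.
# /tmp/lsearch-demo stands in for a real lucene-search directory.
DEMO=/tmp/lsearch-demo
LOGFILE="$DEMO/update.log"
mkdir -p "$DEMO/indexes"

# Stand-in for:  ./update > "$LOGFILE" 2>&1
echo '416 [main] FATAL org.wikimedia.lsearch.related.RelatedBuilder - Rebuild I/O error: no segments* file found' > "$LOGFILE"

# The check from this thread: look for the tell-tale rebuild error.
if grep -qi "Rebuild I/O error:" "$LOGFILE"; then
    rm -rf "$DEMO/indexes"       # throw away the broken indexes
    mkdir -p "$DEMO/indexes"     # stand-in for:  ./build >> "$LOGFILE" 2>&1
    echo "indexes wiped and rebuilt after failure"   # or: mail the admin
fi
```

This automates the "rm -r then build" workaround mentioned above, but only on the runs that actually fail, so successful hourly updates stay cheap.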
how to fix this when I run ./lsearchd
0rz </usr/local/search/ls2-bin> # ./lsearchd
RMI registry started.
Trying config file at path /root/.lsearch.conf
Trying config file at path /usr/local/search/ls2-bin/lsearch.conf
Exception in thread "main" java.lang.NullPointerException
        at org.wikimedia.lsearch.config.GlobalConfiguration.makeIndexIdPool(GlobalConfiguration.java:531)
        at org.wikimedia.lsearch.config.GlobalConfiguration.read(GlobalConfiguration.java:413)
        at org.wikimedia.lsearch.config.GlobalConfiguration.readFromURL(GlobalConfiguration.java:247)
        at org.wikimedia.lsearch.config.Configuration.<init>(Configuration.java:116)
        at org.wikimedia.lsearch.config.Configuration.open(Configuration.java:68)
        at org.wikimedia.lsearch.config.StartupManager.main(StartupManager.java:39)
0rz </usr/local/search/ls2-bin> #
my environment is:
0rz </usr/local/search/ls2-bin> # java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
0rz </usr/local/search/ls2-bin> # ant -version
Apache Ant version 1.7.1 compiled on June 27 2008
0rz </usr/local/search/ls2-bin> #
LSearch Daemon Init Script for Ubuntu
- This is just a sample. You will need to adjust the paths based on where you put the lucene-search directory.
#!/bin/sh -e
### BEGIN INIT INFO
# Provides:          lsearchd
# Required-Start:    $syslog
# Required-Stop:     $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      1
# Short-Description: Start the Lucene Search daemon
# Description:       Provide a Lucene Search backend for MediaWiki
### END INIT INFO

test -x /usr/local/lucene-search-2.1/lsearchd || exit 0

OPTIONS=""
if [ -f "/etc/default/lsearchd" ] ; then
    . /etc/default/lsearchd
fi

. /lib/lsb/init-functions

case "$1" in
  start)
    cd /usr/local/lucene-search-2.1
    log_begin_msg "Starting Lucene Search Daemon..."
    start-stop-daemon --start --quiet --oknodo --chdir /usr/local/lucene-search-2.1 --background --exec /usr/local/lucene-search-2.1/lsearchd -- $OPTIONS
    log_end_msg $?
    ;;
  stop)
    log_begin_msg "Stopping Lucene Search Daemon..."
    start-stop-daemon --stop --quiet --oknodo --retry 2 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd
    log_end_msg $?
    ;;
  restart)
    $0 stop
    sleep 1
    $0 start
    ;;
  reload|force-reload)
    log_begin_msg "Reloading Lucene Search Daemon..."
    start-stop-daemon --stop --signal 1 --chdir /usr/local/lucene-search-2.1 --exec /usr/local/lucene-search-2.1/lsearchd
    log_end_msg $?
    ;;
  status)
    status_of_proc /usr/local/lucene-search-2.1/lsearchd lsearchd && exit 0 || exit $?
    ;;
  *)
    log_success_msg "Usage: /etc/init.d/lsearchd {start|stop|restart|reload|force-reload|status}"
    exit 1
esac

exit 0
Error here when use configure
Hi,
After I run "ant" to build the jar and generate the configuration files, I get the error below. What's wrong?
MediaWiki: 1.15.1; Lucene-search: 2.1; OS: CentOS
[root@xxx lucene-search-2.1]# ./configure /var/wk/
Exception in thread "main" java.net.UnknownHostException: 00:16:3h:2d:6c:b0-hk0.localdomain: 00:16:3h:2d:6c:b0-hk0.localdomain
        at java.net.InetAddress.getLocalHost(InetAddress.java:1425)
        at org.wikimedia.lsearch.util.Configure.main(Configure.java:52)
--Alpha3 11:26, 27 August 2009 (UTC)
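(Editorial note, not from the thread: a java.net.UnknownHostException from InetAddress.getLocalHost() usually means the machine's own hostname does not resolve via DNS or /etc/hosts. The diagnostic below only prints what to do; the /etc/hosts line it suggests is the conventional fix, and nothing here modifies the system.)

```shell
#!/bin/sh
# Check whether the local hostname resolves; getLocalHost() throws
# UnknownHostException when it does not.
HOST=$(hostname)
if getent hosts "$HOST" > /dev/null 2>&1; then
    echo "$HOST resolves; configure should get past this point"
else
    echo "add this line to /etc/hosts:  127.0.0.1   localhost $HOST"
fi
```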
In which context will config.inc be used?
--Ans 08:22, 4 September 2009 (UTC)
- It is used in the build process ("./build") --Ans 09:40, 4 September 2009 (UTC)
Install from SVN or Binary.
I would recommend SVN any day. I've been through several installations of Lucene Search this morning, and the most rapid and problem-free method was the SVN approach. It was also the *easiest* way to get Lucene to work - it just works. It also reads your MW config and produces its own *correct* configuration files.
MWSearch works just fine and dandy on top of this Lucene instance.
The 2.02 installation went badly for me, several times - and it chews a LOT more resources. I did get it working, but when it came to rebuilding indexes it brought the whole computer down - 100% CPU, more RAM than the machine had - which caused kernel panics and failures. I had to force a hardware reboot. I re-attempted several times and reconfigured settings to test against a "should be working" configuration - same results, with the machine crawling. So I gave up, went to SVN, and instead of choking on the indexes, it rebuilt them in under 20 seconds. I understand there are Java issues around this. Forget them; it's not worth breaking Java on the system, or putting up with some strange configuration, just to get Lucene working. Gooooo SVN!! :-)
BTW: It should be made apparent on the Lucene-search extension page that the SVN installation DOES work, and works VERY well. I had previously avoided this method as I am SVN-wary - with a bit of prompting, it would have been my first choice.
Cheers, Mike
Brief period with zero search results (using update script)
Our wiki runs the "update" script every 15 minutes, and the update takes about 2 minutes. Updates are done locally on the single wiki server.
Unfortunately, for a brief time while this script runs, searches return zero results. This problem lasts just a few seconds, but our users do encounter it and become confused.
Any advice on eliminating this "zero results" period? We thought about running two different reindexing processes, each writing to a different directory, and switching between them with a symbolic link. Something like:
- At 6:00, update index #1 in /usr/local/lucene1.
- Point symbolic link /usr/local/lucene at /usr/local/lucene1.
- At 6:15, update index #2 in /usr/local/lucene2.
- Point symbolic link /usr/local/lucene at /usr/local/lucene2.
- At 6:30, update index #1 in /usr/local/lucene1.
- Point symbolic link /usr/local/lucene at /usr/local/lucene1.
- ...
But I don't see a way to make Lucene reindex one directory while serving out of another. Any better suggestions? Maiden taiwan 15:03, 23 September 2009 (UTC)
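The symlink switch in the scheme above can at least be made atomic. This is a self-contained sketch under /tmp (all paths are illustrative; substitute /usr/local/lucene1, /usr/local/lucene2 and the real rebuild command), not a tested fix for the zero-results window:

```shell
#!/bin/sh
# Sketch of the alternating-directory idea: rebuild the inactive copy,
# then switch the "live" symlink atomically.
BASE=/tmp/lucene-demo
LINK="$BASE/lucene"
mkdir -p "$BASE/lucene1" "$BASE/lucene2"

# Pick the copy that is NOT currently live; that's the one to rebuild.
if [ "$(readlink "$LINK" 2>/dev/null)" = "$BASE/lucene1" ]; then
    TARGET="$BASE/lucene2"
else
    TARGET="$BASE/lucene1"
fi

# ... run the index rebuild into "$TARGET" here ...

# Publish it: create the symlink under a temporary name, then rename it
# over the live one.  rename(2) is atomic, so readers always see either
# the old index or the new one, never a missing link.
ln -sfn "$TARGET" "$LINK.new"
mv -T "$LINK.new" "$LINK"
echo "now serving from $(readlink "$LINK")"
```

Note that mv -T (GNU coreutils) renames the symlink itself rather than descending into the target directory; a plain mv onto an existing directory symlink would move the link inside it instead.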
- This should not happen. Are there any errors in logs during this period? --Rainman 17:20, 23 September 2009 (UTC)
- Yes: the update script outputs:
MediaWiki lucene-search indexer - build a map of related articles.
...
413 [main] INFO org.wikimedia.lsearch.related.RelatedBuilder - Rebuilding related mapping from links
416 [main] FATAL org.wikimedia.lsearch.related.RelatedBuilder - Rebuild I/O error: no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files:
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/usr/local/lucene-search-2.1/indexes/search/wikidb.links: files:
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:587)
        at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
        at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
        at org.wikimedia.lsearch.ranks.Links.flushForRead(Links.java:213)
        at org.wikimedia.lsearch.ranks.Links.ensureRead(Links.java:239)
        at org.wikimedia.lsearch.ranks.Links.getKeys(Links.java:773)
        at org.wikimedia.lsearch.related.RelatedBuilder.rebuildFromLinks(RelatedBuilder.java:91)
        at org.wikimedia.lsearch.related.RelatedBuilder.main(RelatedBuilder.java:72)
- The named folder (wikidb.links) contains only symbolic links named after timestamps (20090924121512, etc.), pointing to folders. Inside the folders (which do exist) are these files:
-rw-r--r--  2 root root  4952161 Sep 24 12:00 _hq.cfs
-rw-r--r--  2 root root       46 Sep 24 12:00 segments_172
-rw-r--r--  2 root root       20 Sep 24 12:00 segments.gen
No, this appears to be a separate issue. In any case, the extension does use symbolic links to switch quickly between the new and the old index, and it also allows the new and the old index to coexist for a while until all the old searches finish or time out. So that shouldn't be a problem. What could be a problem is if you have the indexer and searcher on the same machine with insufficient RAM: the indexer bogs down the machine, causing high I/O, which then slows down the searchers to the point of searches timing out. --Rainman 08:56, 25 September 2009 (UTC)
- Thanks. We have plenty of RAM (4 GB I believe) on a virtual machine, and while the load average does go up to about 3.0 - 4.0 during indexing, users don't perceive any slowness. That is, the search query returns quickly with zero results. How long is the timeout? Maiden taiwan 11:49, 25 September 2009 (UTC)
svn revision # for 2.1.2
What is the revision number for the 2.1.2 binary hosted at SourceForge?
I am having trouble getting any search results when building from the HEAD of http://svn.wikimedia.org/svnroot/mediawiki/branches/lucene-search-2.1/ The index builds fine, but when I query I get no results. Returns exception:
java.lang.IllegalArgumentException: nDocs must be > 0
Querying directly to http://localhost:8123/search/wikidb/help gives:
267
#info search=[gziebold-15624s.local], highlight=[], suggest=[gziebold-15624s.local] in 46 ms
#no suggestion
#interwiki 0 0
#results 0
However, indexing and running based on 2.1.2 binary works fine.
- It's the svn revision of the date of release, don't know offhand, you'll have to look it up. I've built some indexes but didn't have problems with latest svn, can you provide a full stack trace? --Rainman 11:09, 10 November 2009 (UTC)
RMI registry started.
Trying config file at path /Users/gziebold/.lsearch.conf
Trying config file at path /Users/gziebold/Projects/mediawiki/lucene-search-2.1.built/lsearch.conf
0 [main] INFO org.wikimedia.lsearch.util.Localization - Reading localization for En
733 [main] INFO org.wikimedia.lsearch.interoperability.RMIServer - RMIMessenger bound
737 [Thread-1] INFO org.wikimedia.lsearch.frontend.HTTPIndexServer - Indexer started on port 8321
739 [Thread-2] INFO org.wikimedia.lsearch.frontend.SearchServer - Searcher started on port 8123
746 [Thread-5] INFO org.wikimedia.lsearch.search.SearcherCache - Starting initial deployer for [wikidb, wikidb.hl, wikidb.links, wikidb.related, wikidb.spell]
818 [Thread-5] INFO org.wikimedia.lsearch.search.SearcherCache - Caching meta fields for wikidb ...
2522 [Thread-5] INFO org.wikimedia.lsearch.search.SearcherCache - Finished caching wikidb in 1705 ms
2554 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer - RemoteSearchable<wikidb>$0 bound
2562 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer - RemoteSearchable<wikidb.hl>$0 bound
2567 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer - RemoteSearchable<wikidb.links>$0 bound
2575 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer - RemoteSearchable<wikidb.related>$0 bound
2582 [Thread-5] INFO org.wikimedia.lsearch.interoperability.RMIServer - RemoteSearchable<wikidb.spell>$0 bound
6879 [Thread-8] INFO org.wikimedia.lsearch.frontend.HttpMonitor - HttpMonitor thread started
6881 [pool-2-thread-1] INFO org.wikimedia.lsearch.frontend.HttpHandler - query:/search/wikidb/wind?namespaces=0%2C500&offset=0&limit=20&version=2.1&iwlimit=10 what:search dbname:wikidb term:wind
6919 [pool-2-thread-1] INFO org.wikimedia.lsearch.analyzers.StopWords - Successfully loaded stop words for: [nl, en, it, fr, de, sv, es, no, pt, da] in 21 ms
7052 [pool-2-thread-1] INFO org.wikimedia.lsearch.search.SearchEngine - Using FilterWrapper wrap: {0, 500} []
java.lang.IllegalArgumentException: nDocs must be > 0
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
        at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184)
        at org.apache.lucene.search.Searcher.search(Searcher.java:132)
        at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722)
        at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129)
        at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101)
        at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
        at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:637)
7076 [pool-2-thread-1] WARN org.wikimedia.lsearch.search.SearchEngine - Retry, temporal error for query: [wind] on wikidb : nDocs must be > 0
java.lang.IllegalArgumentException: nDocs must be > 0
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:110)
        at org.wikimedia.lsearch.search.WikiSearcher.search(WikiSearcher.java:184)
        at org.apache.lucene.search.Searcher.search(Searcher.java:132)
        at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:722)
        at org.wikimedia.lsearch.search.SearchEngine.search(SearchEngine.java:129)
        at org.wikimedia.lsearch.frontend.SearchDaemon.processRequest(SearchDaemon.java:101)
        at org.wikimedia.lsearch.frontend.HttpHandler.handle(HttpHandler.java:193)
        at org.wikimedia.lsearch.frontend.HttpHandler.run(HttpHandler.java:114)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:637)
I also confirmed that the index built from HEAD is fine. If I swap in the LuceneSearch.jar from the 2.1.2 binary and run it against indexes built from HEAD, it works. Processing the query with the LuceneSearch.jar from HEAD fails.
Other details: MediaWiki 1.15.1 (r50) MWSearch (Version r45173)
-- Looks like r48153 == 2.1.2. Or at least I was able to get that revision to work successfully with MediaWiki 1.15.1 --GregZ 04:06, 11 November 2009 (UTC)
- Ah I see.. fixed in svn. Thanks for the report. --Rainman 11:23, 11 November 2009 (UTC)
Category search
Can anyone elaborate on the following from OVERVIEW.txt?
searching categories. Syntax is: query incategory:"exact category name". It is important to note that category names are themselves not tokenized. Using logical operators, intersection, union and difference of categories can be searched. Since exact category is needed (only case is not important), it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user.
The incategory: syntax does not appear to work as described (in 2.1.2)
Also, what is meant by ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?
- It does work, but only for categories that are not added via templates, but in main article text. E.g. [1]. --Rainman 11:02, 10 November 2009 (UTC)
Ah ha. That explains why my incategory: query was not working - the categories were added via templates. Is this because lucene-search indexes the wikitext of articles and does not expand templates? Has there been any discussion of indexing the rendered HTML content instead of only the wikitext?
- Yes, however, the current mediawiki architecture makes it difficult to do... in fact, what we would want is article not in html, but in wikitext with expanded templates... In any case, you won't find a search extension that does it. --Rainman 15:51, 10 November 2009 (UTC)
- How about: keep indexing wikitext, but access the MediaWiki database to determine category relationships? Maiden taiwan 18:30, 16 December 2009 (UTC)
- On default mysql install that won't scale very well, otherwise it would be implemented a long time ago. --Rainman 22:31, 16 December 2009 (UTC)
- Thanks. Can you explain why it wouldn't scale? Couldn't it be done while Lucene builds the index - just read the MediaWiki database tables once and you're done? Or maybe it could be dynamic, like DPL: it checks category membership dynamically just fine. Finally, even if it's slow, could you make this behavior an option and let the sysadmin decide whether it works on his/her site? Right now "incategory" produces different results than MediaWiki category pages do. That seems like a bug... Thank you. --Maiden taiwan 01:06, 17 December 2009 (UTC)
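To make the "read it from the database" suggestion concrete: the resolved category graph (template-added categories included) already lives in MediaWiki's categorylinks table, so an indexer could query it instead of parsing wikitext. A sketch, with an in-memory SQLite database standing in for the wiki's MySQL database; table and column names follow the standard MediaWiki schema, and the rows are invented examples:

```shell
#!/bin/sh
# Demo: template-added categories appear in categorylinks just like
# directly-added ones, so a DB query sees what incategory: misses.
sqlite3 :memory: <<'SQL'
CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT);
CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
-- one article categorised in its own wikitext, one via a template
INSERT INTO page VALUES (1, 0, 'Direct_article');
INSERT INTO page VALUES (2, 0, 'Templated_article');
INSERT INTO categorylinks VALUES (1, 'Physics');
INSERT INTO categorylinks VALUES (2, 'Physics');
-- all main-namespace pages in Category:Physics, however categorised
SELECT p.page_title FROM categorylinks c
JOIN page p ON p.page_id = c.cl_from
WHERE p.page_namespace = 0 AND c.cl_to = 'Physics';
SQL
```

Whether running such a join per index build would scale on a large wiki is exactly the concern raised above; this only illustrates where the data lives.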
Also, can you comment on this ...it is maybe best to incorporate this somewhere on category page, and have category name put into query by MediaWiki instead manually by user. ? Is the suggestion to put a search form on the Category page and insert the incategory: syntax there?
- Well, yes... if someone had done it, that would be nice. You need to take into consideration that the file was probably written at 2am on some Sunday, so don't take everything in it too seriously ;) --Rainman 15:51, 10 November 2009 (UTC)
I simply needed the "late-night fog" translation. :) --GregZ 04:03, 11 November 2009 (UTC)
Is incategory going to be fixed?
Is the problem of incategory and transcluded category tags planned to be fixed? If incategory returns wrong results (missing all articles with transcluded category tags), people are going to see incomplete search results and make wrong decisions ("Well, I guess there are no articles that match my query..."). This just happened on our wiki. Can't the extension just look at the wiki database to discover category relationships? Thanks. Maiden taiwan 18:02, 16 December 2009 (UTC)
- No, there are no plans to fix it. Lucene-search is primarily developed towards needs of WMF, and because this would never get enabled on WMF projects due to inefficiency it is not planned to be implemented. However, this is an open-source project and if you need this functionality you can make it yourself or pay someone to do it. --Rainman 12:52, 19 December 2009 (UTC)