Fulltext search engines

This is a list of Fulltext Search Engines, and technologies that could potentially be used to build them, for MediaWiki.

Clusterpoint Server

Clusterpoint (from "cluster" and "point") Server is a hybrid database management system that combines server-based transactional database storage, a fast full-text search engine, and native clustering software in a single cohesive platform with an open API. It is a high-performance, schema-free, XML document-oriented database server written in C++. It manages collections of data objects stored in native XML format, which lets applications store data in a human-readable way that matches their native data types and structures. All database content is indexed automatically and completely for fast structured, unstructured and semi-structured search. By combining several widely used but otherwise separate technologies in one platform, Clusterpoint aims to let database developers substantially simplify their application software.

Development of Clusterpoint Server began in August 2006 by Clusterpoint Ltd. The first public release was in January 2008. The latest stable production version is 2.0.3, released in 2011.

Among the features are:

Fast full-text search performance: delivers sub-second query response times that do not depend on the total database size or particular data structure
Ad hoc search: supports free format user-friendly Internet-style search queries
Real-time full-text index updates: adding new documents or modifying existing ones automatically updates the database index used for full-text search
Capacity for big data: stores, indexes and searches large databases without the performance loss characteristic of SQL-based search solutions
Open data storage platform: uses only industry standard XML data format at database storage level, API and in all client-server transactions
Data structure agnostic database: handles custom schema-less free format XML documents as database objects, tolerates different data structure objects in the same database
Consistent UTF-8 encoding. Non-UTF-8 data can be saved, queried, and retrieved with a special binary data type.
Cross-platform support: binaries are available for Linux, FreeBSD and OS X. Clusterpoint can be compiled on almost any operating system.
Type-rich: supports unstructured data, dates, numbers, meta-data, binary data, and more (all XML types)
XML objects for query results: enables direct integration in programming languages supporting XML parsing, no client software required
Includes rich enterprise search functionality: eliminates the need to integrate database application software with 3rd party search software
Flexible data ranking at search: a customizable mechanism for programming database content ranking for the best search relevance
Transparent cluster software architecture: no single point of failure, any cluster node can serve as master
Horizontal scalability: scales out from a single server to hundreds of servers per database in bigger clusters
Security partitioning: users, administrators and access rights are based on groups and roles, granular to specific storages and API commands
Centralized web GUI-based database administration: lets administrators create, manage and control all Clusterpoint databases, including clustered and replicated databases

Licensing:

Clusterpoint DBMS is available for free under the Clusterpoint DBMS Free Community License for use on a single hardware server or a virtual machine.

Community non-profit projects qualify for a free Clusterpoint DBMS Non-commercial License.

Several commercial Clusterpoint DBMS software licenses are available, starting from a 2-server cluster. For licensing details, see Clusterpoint DBMS Licensing Options.

Clusterpoint DBMS is commercially supported software, with free basic support over email and paid premium technical support services for customers using the software in production environments. See Clusterpoint DBMS Technical Support Options.

More information is here:

JODA

(ioda, because joda was already taken by some other project)

Download http://sourceforge.net/projects/ioda/
See it live at http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (this page demonstrates only the indexer and is not intended as a mirror of Wikipedia)

From a mailing list posting by Jochen Magnus: older versions of Joda have been in use since 1996 as the newspaper archive of the Rhein-Zeitung (Koblenz and Mainz, Germany). It is also used for archive and newsdesk purposes by several other European newspapers. At the moment it is going into service as the full-text index for Europe's biggest magazine. It is also in use for the public index of the state archive of Rheinland-Pfalz (Germany).

Last year I created two mirrors of Wikipedia, one using MediaWiki for demonstration purposes and another - our public one - using our own read-only web frontend. Joda is integrated into both mirrors:

http://wikipedia.rhein-zeitung.de/index.php/Hauptseite (MediaWiki)
http://lexikon.rhein-zeitung.de/ (our special Wikipedia interface)

At the suggestion of Magnus Manske (not related :-) I published Joda under the LGPL and made several improvements for the Wikipedia task. I wrote tools for indexing a whole cur table either from MySQL or from an SQL dump (which is twice as fast). Indexing the German Wikipedia cur table (>210,000 articles, 36 million words) takes approx. 45 minutes. An optional database optimization takes an additional 25 minutes. Both were measured on a dual Athlon 2800+ machine with 1 GB RAM (the indexer is a multi-threaded Perl program).

Joda can erase or update entries on the fly and can handle queries with parentheses and word distance operators, such as http://lexikon.rhein-zeitung.de/?((Albert OR Alfred) AND.1 Einstein) NEAR Quant*) NOT Gravitation. See more features at http://ioda.sourceforge.net/

The Joda kernel is written with the Free Pascal compiler (http://sourceforge.net/projects/freepascal/). The tools are written in Perl. There are libraries for using Joda directly from C, Perl, Python and PHP, all published under the LGPL. The Joda binaries are: a command line program, a TCP socket driven server and a CGI program.

Lucene-search

Lucene is a text search engine written in Java, sponsored by the Apache project.

A Lucene-based search server is now up and running experimentally to cover searches on the English Wikipedia. It is compiled with GCJ, so it is free software and does not rely on Sun's Java VM.

Using a separate search server like this instead of MySQL's fulltext index lets us take some load off the main databases.

To compare our options Brooke did an experimental port to C# using dotlucene; some benchmarking showed that while the C# version running on Mono outpaced the Java version on GCJ for building the index, Java+GCJ did better on actual searches (even surpassing Sun's Java in some tests). Since searches are more time-critical (as long as updates can keep up with the rate of edits), we'll probably stick with Java.

More information on this implementation can be found on the Wikitech LiveJournal and at meta:User:Brooke Vibber/MWDaemon

At the moment the drop-down suggest-while-you-type box is disabled as GCJ and BerkeleyDB Java Edition really don't get along. Brooke has said that he will either hack it to use the native library version of BDB or just rewrite the title prefix matcher to use a different backend.

Here are some step-by-step instructions on how to install this kind of search on a wiki.

Solr

http://lucene.apache.org/solr/
A Lucene-based search server with XML/HTTP interfaces, caching, replication, and a web admin interface.

DBSight

http://www.dbsight.net/
J2EE application
Database + Lucene + Display Template, with Scheduler
Scalable; the online demo at http://search.dbsight.com holds 1.2G of data, 1.7 million records
Works on live systems, whether new or legacy, without changing existing code.
Customizable crawl, customizable indexing, customizable searching, customizable results templates

Domino

IBM Lotus Notes database (.NSF files)

Full-text and rich-text search, including MS Office and PDF documents
Document management includes add, change and delete
Easy, no-code programming

Pylucene

http://pylucene.osafoundation.org/
Can be compiled with GCJ, which avoids the "non-free" Java issue mentioned above.

Plucene

Perl port of Lucene
http://search.cpan.org/perldoc?Plucene

KinoSearch

Search engine library
http://search.cpan.org/perldoc?KinoSearch

Google Search Appliance

Hardware box made by Google
http://www.google.com/enterprise/gsa/index.html
Proprietary, closed-source, etc.
- but it may be possible to receive one gratis.
- but Kate says: "the current situation appears to be that non-free software is not allowed, but software contained on other embedded devices is okay (e.g. switch firmware). given this i don't think there would be an issue with using one of the google devices." (wikitech-l Wed, 30 Mar 2005 08:08:16) gmane official archives
- According to Google, the basic single-slot GSA only handles 500,000 documents (but can be licensed to search up to 1.5 million documents at a rate of 300 queries per minute). For perspective, here are the totals for each of the English projects hosted by Wikimedia:

Project name	Number of pages
Wikipedia	1,487,491
Wiktionary	81,836
Wikibooks	23,643
Wikiquote	7,397
Wikisource	26,565
Wikinews	7,396
Wikispecies	4,906
Commons	100,207
Meta	15,092
sep11	1,627
Grand total:	1,754,533

Note that that's just English! I have not gathered stats on any of the other large languages. Considering that there are several languages with 6-digit figures for articles, the total number of pages hosted by Wikimedia could easily be triple or quadruple this number! I hope Google is willing to give you more than just hosting.
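As a quick check of the numbers above, the following Python sketch sums the table rows and compares the result with the two GSA document limits quoted earlier. The page counts and limits are copied from this page; the function and constant names are made up for illustration. Note that summing the rows exactly as listed gives 1,756,160, which is 1,627 (the sep11 row) more than the stated grand total. Either figure exceeds both limits.

```python
# Page counts copied from the table above (English-language projects only).
PAGE_COUNTS = {
    "Wikipedia": 1_487_491,
    "Wiktionary": 81_836,
    "Wikibooks": 23_643,
    "Wikiquote": 7_397,
    "Wikisource": 26_565,
    "Wikinews": 7_396,
    "Wikispecies": 4_906,
    "Commons": 100_207,
    "Meta": 15_092,
    "sep11": 1_627,
}

# Document limits quoted above for the basic single-slot GSA.
GSA_BASE_LIMIT = 500_000
GSA_LICENSED_LIMIT = 1_500_000


def total_pages(counts):
    """Sum the page counts of all listed projects."""
    return sum(counts.values())


def fits(counts, limit):
    """Return True if every listed page fits within the document limit."""
    return total_pages(counts) <= limit


if __name__ == "__main__":
    print(f"Pages listed: {total_pages(PAGE_COUNTS):,}")
    print(f"Fits base limit: {fits(PAGE_COUNTS, GSA_BASE_LIMIT)}")
    print(f"Fits licensed limit: {fits(PAGE_COUNTS, GSA_LICENSED_LIMIT)}")
```

On these figures, the English Wikipedia alone (1,487,491 pages) sits just under the 1.5 million licensed limit, so there would be almost no room for growth even before the other projects are added.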

Lupy

http://www.divmod.org/projects/lupy

Sphinx

Very fast
Plugs directly into MySQL and PostgreSQL if desired.
Handles some major sites such as ljseek.com (> 100 million records, 120+GB database)
http://sphinxsearch.com/

Installation guide: Extension:SphinxSearch

Zend Framework

Lucene Class of the Zend Framework (http://framework.zend.com/manual/en/zend.search.html).

100% PHP
Lucene Binary Compatible Index
Extension:Zend_Search_Lucene_for_MediaWiki
Extension:Woogle4MediaWiki

swish-e

Very fast
Easy to set up
Can index almost everything
Differential indexing capabilities
http://swish-e.org

Sphider

An easy-to-install PHP web application on top of MySQL that implements a web spider for indexing and a flexible search page. It will index a complete wiki and can easily replace the built-in search functionality.

ScimoreDB

Uses Lucene. Windows only. Supports data compression that reduces disk usage by a factor of 4-10. Scalable (up to 1024 nodes), clustered and fault tolerant. Supports SQL, T-SQL, stored procedures, and a .NET provider targeting .NET 4.0/.NET 2.0.

Ksana Search For Wikipedia

Ksana Search For Wikipedia (剎那搜尋維基百科) is GPL.

myBigSearch.com

Components for developers who want to include full-text search in their own projects.

Component for C++, C#, C#.NET, ASP.NET, Visual Basic
Deployment in Visual Studio project

Big Data - Quick parametric searching in large volumes of text data.

http://www.mybigsearch.com/

points to consider

efficiency is key
- we already have full-text search, but it uses the databases and isn't efficient. Any alternative needs to be sufficiently "cheaper" in terms of hardware to make it worthwhile

http://www.google.com/search?q=site:en.wikipedia.org+&q=search
- we can link to google for free.
- not as fresh, as google won't update as often as wikipedia does
- not 100% coverage
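A link like the one above can be generated for any search term. The following is a minimal Python sketch, assuming Google's `site:` operator in the `q` parameter; the function name and defaults are made up for illustration and are not part of MediaWiki.

```python
from urllib.parse import urlencode


def google_site_search_url(terms, site="en.wikipedia.org"):
    """Build a Google search URL restricted to one site (hypothetical helper)."""
    query = f"site:{site} {terms}"
    # urlencode escapes the colon and turns spaces into '+'.
    return "http://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(google_site_search_url("red wine"))
    # prints http://www.google.com/search?q=site%3Aen.wikipedia.org+red+wine
```

This only builds the link; freshness and coverage of the results remain whatever Google's own crawl provides.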

do we want to be able to search across older versions / diffs?
- if yes, this content should probably not be searched by default. That is, the default should be to search only the current content

can we take the index off-line when we need to update entries?
- swish-e 2.2.0 now supports this feature, lucene as well

do we want to update the index in small chunks (e.g. if only a single file has changed)?
- swish-e can do this but it's somewhat hackish (you would use multiple indexes) while Lucene is designed for this.
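To make "updating the index in small chunks" concrete, here is a toy in-memory inverted index in Python that re-indexes a single changed document without rebuilding anything else. It is only a sketch of the general idea; it does not reflect how Lucene or swish-e store or update their indexes, and the class and method names are invented for this example.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index that supports per-document updates."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # document id -> terms it contains

    @staticmethod
    def tokenize(text):
        """Lowercase the text and return its set of word tokens."""
        return set(re.findall(r"\w+", text.lower()))

    def remove(self, doc_id):
        """Drop one document from the index, touching only its own terms."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        """Add a document, or replace its old entry if it already exists."""
        self.remove(doc_id)
        terms = self.tokenize(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        return set.intersection(*(self.postings.get(t, set()) for t in terms))


if __name__ == "__main__":
    index = TinyIndex()
    index.update("Red wine", "Red wine is made from dark grapes")
    index.update("White wine", "White wine is made from green grapes")
    # A single page is edited: only its own postings are rewritten.
    index.update("Red wine", "Red wine is a fermented drink")
    print(sorted(index.search("wine")))         # ['Red wine', 'White wine']
    print(sorted(index.search("dark grapes")))  # []
```

The cost of an update here depends only on the size of the changed document, which is the property the question above is asking about.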

outstanding question

if we include a summary, like Google, for each result, what should be shown?
- the google style : the section of the document that contains the search terms
- some short meta description of the article
- the first paragraph, or first N words
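Two of the summary options above can be sketched in a few lines of Python: a "first N words" summary and a Google-style keyword-in-context excerpt. This is an illustrative sketch only; the function names, the word-based window, and the default sizes are assumptions, not an existing MediaWiki feature.

```python
import re


def first_words(text, n=30):
    """Summary option: the first n words of the article."""
    words = text.split()
    summary = " ".join(words[:n])
    return (summary + " ...") if len(words) > n else summary


def keyword_in_context(text, query, radius=8):
    """Summary option: the words surrounding the first query-term match."""
    words = text.split()
    terms = {t.lower() for t in re.findall(r"\w+", query)}
    for i, word in enumerate(words):
        # Strip punctuation so that "grapes." still matches "grapes".
        if re.sub(r"\W+", "", word).lower() in terms:
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # No term found in the text: fall back to the opening words.
    return first_words(text, 2 * radius)


if __name__ == "__main__":
    article = ("Wine is an alcoholic drink. Red wine is made from dark "
               "grapes and aged in oak barrels.")
    print(first_words(article, 3))                 # Wine is an ...
    print(keyword_in_context(article, "grapes", radius=2))
    # prints: ... from dark grapes and aged ...
```

The keyword-in-context variant needs the article text at query time, whereas a meta description or first paragraph could be stored alongside the index entry.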

should titles be given more weighting?
- for example, if I search for the term "red wine", and there are two identical documents, except one contains "red wine" as a section title while the other simply mentions it in the text ... should we return the first doc first, or should they be truly equal?
- is text in a title more important than other text
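The effect of title weighting can be shown with a deliberately simple scoring function in Python: matches in the title or section headings are multiplied by a boost factor, matches in the body count once. The document layout (`title`, `headings`, `body`), the boost value, and the raw term-count scoring are all assumptions for illustration; real engines use more elaborate relevance formulas.

```python
import re


def term_count(text, query_terms):
    """Count how many times the query terms occur in the text."""
    words = re.findall(r"\w+", text.lower())
    return sum(words.count(term) for term in query_terms)


def score(doc, query, title_boost=3.0):
    """Score a document, weighting title and heading matches by title_boost."""
    terms = re.findall(r"\w+", query.lower())
    heading_text = " ".join([doc["title"], *doc.get("headings", [])])
    return (title_boost * term_count(heading_text, terms)
            + term_count(doc["body"], terms))


if __name__ == "__main__":
    in_heading = {"title": "Wine", "headings": ["Red wine"],
                  "body": "Made from dark grapes."}
    in_body = {"title": "Wine", "headings": [],
               "body": "Red wine is made from dark grapes."}
    # With a boost, the document with "red wine" as a section title wins.
    print(score(in_heading, "red wine"), score(in_body, "red wine"))  # 9.0 5.0
    # With title_boost=1.0 the two documents are treated as equal.
    print(score(in_heading, "red wine", 1.0), score(in_body, "red wine", 1.0))  # 3.0 3.0
```

Setting `title_boost` to 1.0 corresponds to the "truly equal" answer to the question above; any value above 1.0 ranks heading matches first.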

do we want a page rank style link analysis?
- e.g., a Wikipedia article that is linked to more often from other Wikipedia articles is probably more important
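For reference, a PageRank-style analysis over internal wiki links can be sketched as a short power iteration in Python. This is a textbook-style sketch under stated assumptions (a damping factor of 0.85, a fixed number of iterations, pages without outgoing links spreading their rank evenly); it is not code from MediaWiki or from any of the engines listed above, and the link graph is made up.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links using a simple power iteration.

    links maps each page title to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical link graph: three articles all link to "Wine".
    wiki_links = {
        "Red wine": ["Wine"],
        "White wine": ["Wine"],
        "Grape": ["Wine"],
        "Wine": ["Grape"],
    }
    ranks = page_rank(wiki_links)
    for title in sorted(ranks, key=ranks.get, reverse=True):
        print(f"{title}: {ranks[title]:.3f}")
```

In this example "Wine" ends up with the highest rank because it has the most incoming links. The resulting value could be stored per article and combined with the text relevance score at query time.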

an alternative is length/edit-rank
- articles with more edits, or that are longer, get boosted in the results?

Discussion

Why not find an efficient database solution?
- Because databases aren't the best solution for high volume free text search. In the same way Excel could do tax returns, but there is much better software for cracking that nut in many cases.
I don't agree with that. Keeping the searching as close to the data as possible makes sense, and there are plenty of solutions out there (e.g. tsearch2) that seem efficient enough. Most of them are basically applications that have been joined to the database already, which certainly reduces a step for us.
- tsearch2 is a PostgreSQL feature, AFAIK. do you have an equivalent thing that works with MySQL?
  - MySQL surprisingly does full-text search. Many PHP-based bulletin boards make use of this. It's certainly convenient, but I don't think it's as powerful or flexible as an external engine like Lucene.
    - We already support MySQL's fulltext search. Its uselessness is mainly what inspired me to write the Lucene support :)

Thunderstone makes a product similar to Google's Search Appliance but it appears to be substantially less expensive. Another option to consider. --TidyCat 14:53, 9 December 2005 (UTC)