WikidataEntitySuggester
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
The Wikidata Entity Suggester aims to make the task of adding or editing Items on Wikidata easier by suggesting different entities to the author.
Features
Here is a breakdown of its main features:
- Suggest properties to be used in a claim, based on the properties that already exist in the item's claims.
  - The API can take an item's prefixed ID and recommend properties for it.
  - The API can also be fed a list of properties and it can recommend properties based upon the list.
- Suggest properties to be used in source references, based on the properties that already exist in the claim containing the source ref.
  - The API can take a claim GUID and recommend properties for its source ref.
  - The API can also be fed a list of properties and it can recommend properties based upon the list.
- Suggest qualifiers for a given property.
- Suggest values for a given property.
Basic components and software requirements
The Suggester consists of two main parts: a backend REST API written in Java, and a frontend MediaWiki extension, written in PHP, that contains the API module.
The backend consists of a number of parts: two Myrrix instances (i.e. two WAR files or Java EE apps running on Tomcat) and another Java EE WAR app, the REST API containing the Recommenders, Servlets, etc. The REST API provides a number of servlets to suggest entities and to ingest datasets (i.e. to train the recommendation engine). In order to train the recommendation engine, a number of CSV-style datasets need to be generated. Python MapReduce scripts, run on Hadoop through Hadoop Streaming, have been written to generate the training datasets from a Wikidata data dump such as wikidatawiki-20130922-pages-meta-current.xml.bz2 from the Wikimedia dumps page.
The external software required to run the backend API (assuming Python, Java, PHP, etc. are installed and configured as usual on a LAMP server) is:
- Apache Tomcat (tested with Apache Tomcat 7.0.39)
- Hadoop (tested with Hadoop 0.20.2-cdh3u6)
Everything has been tested with Oracle Java build 1.7.0_25-b15. It is recommended that you use Oracle Java 1.7; otherwise Hadoop may cause problems.
Setup
Software Installation
I have detailed setup and installation instructions for Tomcat and Hadoop here.
Setting up the Entity Suggester
Clone the WikidataEntitySuggester repo and build it:
git clone https://gerrit.wikimedia.org/r/mediawiki/extensions/WikidataEntitySuggester
cd WikidataEntitySuggester
mvn install
Copy the built Myrrix war files to Tomcat's webapps:
cp myrrix-claimprops/target/myrrix-claimprops.war <tomcat_install_directory>/webapps/
cp myrrix-refprops/target/myrrix-refprops.war <tomcat_install_directory>/webapps/
Check the Catalina log file logs/catalina.out in the Tomcat directory to see whether the Myrrix WARs have been deployed successfully. Check machine_ip:8080/myrrix-claimprops and machine_ip:8080/myrrix-refprops to see whether the Myrrix instances are running.
Now copy the REST API WAR to webapps:
cp client/target/entitysuggester.war <tomcat_install_directory>/webapps/
Wait for it to be deployed by the server and check machine_ip:8080/entitysuggester/ to see if the welcome page has come up with examples of possible actions.
Training the Suggester
Download the latest Wikidata data dump, decompress it, and push it to HDFS:
cd <your_work_directory>
wget http://dumps.wikimedia.org/wikidatawiki/latest/wikidatawiki-latest-pages-meta-current.xml.bz2
bzip2 -d wikidatawiki-latest-pages-meta-current.xml.bz2
cd /hadoop
bin/hadoop dfs -copyFromLocal <your_work_directory>/wikidatawiki-latest-pages-meta-current.xml /input/dump.xml
You can find two Python scripts, mapper.py and reducer.py, in the wikiparser source directory. Docstrings at the beginning of each explain how to run them with Hadoop.
There are different ways you may run Hadoop. The method described in this document relies on the Cloudera tarball, without setting the Hadoop environment variables. If you instead install Hadoop through RPMs or DEBs, hadoop will be in your path, and you may accordingly have to make some modifications to the runhadoop.sh shell script in the wikiparser source directory.
Assuming you are going to run Hadoop from /hadoop, change the hadoop and hadoop_command variables in runhadoop.sh to "/hadoop" and "bin/hadoop" respectively. (Yes, you have to make it bin/hadoop and run it with /hadoop as the present directory, or else it will break with access problems.) Make sure the hadoop user has permission to access /hadoop.
cp wikiparser/*.py /hadoop/
cp wikiparser/*.sh /hadoop/
chmod a+x /hadoop/*.py /hadoop/*.sh
cp -rf wikiparser/target /hadoop/
chown -R hadoop:hadoop /hadoop
cd /hadoop
./runhadoop.sh global-ip-pairs /input/dump.xml /output/global-ip-pairs ./train-claimprops.csv
Alternatively, if the Hadoop binaries are already in your path and everything is set up correctly, you may not need to modify runhadoop.sh and can simply run, as an example:
cd <source-directory>/wikiparser
./runhadoop.sh global-ip-pairs /input/dump.xml /output/global-ip-pairs ./train-claimprops.csv
global-ip-pairs is the type of training file to generate; more on this follows. The second and third parameters are paths on HDFS, not on the local Linux filesystem. The fourth parameter is optional, but important: specifying it causes the output written to HDFS to be copied to this file.
There are six examples in each of the two files, which can be used to build the six datasets needed to train the entity suggester. The datasets are:
- To train the claim property suggester
- To train the claim property suggester for empty items (used when no property input is given)
- To train the source ref property suggester
- To train the source ref property suggester for empty claims (used when no property input is given)
- To train the qualifier suggester
- To train the value suggester
Here is one of the six examples to build the dataset for training the value suggester:
bin/hadoop jar contrib/streaming/hadoop*streaming*jar -libjars /path/to/wikiparser-0.1.jar \
    -inputformat org.wikimedia.wikibase.entitysuggester.wikiparser.WikiPageInputFormat \
    -input /input/dump.xml -output /output/prop-values \
    -file /path/to/mapper.py -mapper '/path/to/mapper.py prop-values' \
    -file /path/to/reducer.py -reducer '/path/to/reducer.py prop-values'
After the Hadoop job completes, you may copy the output from HDFS to a local file:
bin/hadoop dfs -cat /output/prop-values/part-* > value-train.csv
Now, to train the Entity Suggester for suggesting values, do an HTTP POST with the file's contents in the POST body:
curl -X POST --data-binary @value-train.csv http://machine_ip:8080/entitysuggester/ingest/values
Similarly, the other five suggesters can be trained using the /ingest/* servlets. Please use runhadoop.sh for convenience.
How to use the backend REST API
The REST API has two types of servlets: suggester servlets (/suggest/*) and ingester servlets (/ingest/*). Please note that all entity IDs the suggester deals with are prefixed IDs, because the training datasets contain prefixed IDs. The suggester makes no internal assumptions about prefixes or the nature of IDs and treats them as raw strings; therefore, it behaves exactly the way the training datasets train it to.
Suggester Servlets
- Claim property suggester:
  - This servlet can suggest properties based on a comma-separated input list of properties, like this:
    /entitysuggester/suggest/claimprops/P41,P24,P345
  - The list of properties can also be omitted to get default suggestions from a popularity-sorted property recommender:
    /entitysuggester/suggest/claimprops/
  - A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Examples:
    /entitysuggester/suggest/claimprops/?howMany=10
    or /entitysuggester/suggest/claimprops/P41,P24,P345?howMany=20
- Source ref property suggester:
  - This servlet can suggest properties based on a comma-separated input list of properties, like this:
    /entitysuggester/suggest/refprops/P41,P24,P345
  - The list of properties can also be omitted to get default suggestions from a popularity-sorted property recommender:
    /entitysuggester/suggest/refprops/
  - A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Examples:
    /entitysuggester/suggest/refprops/?howMany=10
    or /entitysuggester/suggest/refprops/P41,P24,P345?howMany=20
NOTE: The two property suggesters are trained on different datasets; hence they provide different suggestions.
- Qualifier property suggester:
  - This servlet can suggest qualifiers for a mandatory single property input, like:
    /entitysuggester/suggest/qualifiers/P41
  - A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example:
    /entitysuggester/suggest/qualifiers/P41?howMany=10
- Value suggester:
  - This servlet can suggest values for a mandatory single property input, like:
    /entitysuggester/suggest/values/P41
  - A howMany parameter should be added to limit the number of suggestions; it is 0 by default. Example:
    /entitysuggester/suggest/values/P41?howMany=10
Output Format for Suggester Servlets
All the suggester servlets give output in JSON format. As an example, /entitysuggester/suggest/refprops?howMany=2 may yield an output like:
[["P143",0.9924422],["P248",0.007505652]]
It is an array of arrays, where each constituent array consists of the entity ID (string) and the relative score (float).
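For illustration, here is how a client could call a suggester servlet over plain HTTP from Java and print the JSON array. This is a minimal sketch; the host, port, property list, and howMany value are placeholders rather than anything prescribed by the API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class SuggestClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; the property list and howMany value are only examples.
        URL url = new URL("http://machine_ip:8080/entitysuggester/suggest/claimprops/P41,P24?howMany=10");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Read the raw JSON response, e.g. [["P143",0.9924422],["P248",0.007505652]]
        BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder json = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            json.append(line);
        }
        reader.close();
        conn.disconnect();

        System.out.println(json.toString());
    }
}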
Ingester Servlets
All ingest servlets read the datasets from the POST body. As explained in the "Training the Suggester" section above, it's easy to train the suggester using curl to POST the training file to the servlet. Example:
curl -X POST --data-binary @value-train.csv http://machine_ip:8080/entitysuggester/ingest/values
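Equivalently, the training file can be POSTed programmatically. The following is a minimal Java sketch using HttpURLConnection; the host, port, and file name are placeholders, and the chunked-streaming setup is just one reasonable way to avoid buffering the whole dataset in memory.

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class IngestClient {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port and training file path.
        URL url = new URL("http://machine_ip:8080/entitysuggester/ingest/values");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setChunkedStreamingMode(0); // stream the file instead of holding it in memory

        InputStream in = new FileInputStream("value-train.csv");
        OutputStream out = conn.getOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        out.close();
        in.close();

        // The servlet replies once the recommender has ingested the dataset.
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}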
Code Organization
The Java side of the project is built using Maven. The current master branch consists of four main directories:
- client - Contains the Java REST API: suggester servlets, training servlets, custom-written Recommender classes, and Recommenders that act as clients to the two Myrrix instances running on the same Tomcat server.
- myrrix-claimprops and myrrix-refprops - The two Myrrix WAR files are generated using the pom.xml files in these directories. Each has a different contextPath and a different temporary directory location (the path where it stores the generated binary model after training).
- wikiparser - Contains the Java source for InputFormat classes used by Hadoop while parsing the XML data dumps; also, there are Python MapReduce scripts that are run with Hadoop Streaming to generate the training datasets.
The Java packages
org.wikimedia.wikibase.entitysuggester.client.recommender
This package contains a generic Recommender interface and an AbstractRecommender class that implements Recommender<TranslatedRecommendedItem, String>. The latter is an abstract class used by all the recommenders that accept a List of Strings as input and provide recommendations as a List of TranslatedRecommendedItems.
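The exact signatures live in the source tree; purely as an illustration, the contract described above might be sketched like this (the method name and type-parameter names below are assumptions, not copied from the repository):

import java.util.List;

// Illustrative sketch only; names and signatures are assumptions.
// T is the recommendation type (TranslatedRecommendedItem in this project),
// I is the input type (String entity IDs).
public interface Recommender<T, I> {

    // Return up to howMany recommendations for the given input entities.
    List<T> recommend(List<I> input, int howMany);
}

// An AbstractRecommender would then implement Recommender<TranslatedRecommendedItem, String>
// and hold the plumbing shared by the concrete recommenders described below.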
org.wikimedia.wikibase.entitysuggester.client.recommender.util
Contains a simple class called TranslatedRecommendedItemImpl that implements the TranslatedRecommendedItem interface. This interface is used to wrap all kinds of recommendations. It comes from the Myrrix packages, and the Myrrix recommenders return lists of this type, so it is perhaps more convenient to keep using it than to create our own wrapper.
TranslatedRecommendedItemDatasetParser is a class used for reading a specific string format. It can parse a file and iteratively return key-value pairs, where each value is a list of TranslatedRecommendedItems.
RecommenderFactory is a class containing a static create() method that returns the correct type of Recommender depending on the kind of entity to be suggested.
org.wikimedia.wikibase.entitysuggester.client.recommender.impl
The concrete Recommender classes are located here.
MyrrixWebClientRecommender has a TranslatingRecommender instance that it uses to communicate with a Myrrix recommendation engine. It acts as a client to the Myrrix engine and handles the "translating to and fro" of Wikibase property and item IDs ("items" and "users" respectively, in Myrrix's collaborative-filtering lingo).
MultiMapRecommender contains a Guava ListMultimap (essentially a Map with Lists as values). It uses TranslatedRecommendedItemDatasetParser to build a map with String keys and List<TranslatedRecommendedItem> values. This is used for recommending qualifiers and values. For qualifiers, the keys are properties, and each list holds qualifier properties with their normalized relative usage-frequency scores; when the client wants qualifier suggestions for a particular property, the list for that property key is fetched and the top N elements from it are recommended. In essence, it suggests the most frequently used qualifiers for a specific property. Values work the same way, except that values are recommended instead of qualifier properties.
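As an illustration of the idea (not the actual class), a Guava ListMultimap keyed by property ID, with each value list kept sorted by descending score, makes the top-N lookup trivial. The class and method names below are invented for the sketch, and the entries are simplified to string/score pairs:

import com.google.common.collect.ArrayListMultimap;
import com.google.common.collect.ListMultimap;

import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;

public class MultiMapSketch {
    // Property ID -> (qualifier property or value, normalized usage-frequency score),
    // with each list already sorted by descending score when the dataset is ingested.
    private final ListMultimap<String, Map.Entry<String, Double>> suggestions =
            ArrayListMultimap.create();

    public void ingest(String property, String suggestion, double score) {
        suggestions.put(property, new SimpleEntry<String, Double>(suggestion, score));
    }

    // Suggest the top howMany entries for a property, i.e. its most frequently used
    // qualifiers (or values), as described above.
    public List<Map.Entry<String, Double>> suggest(String property, int howMany) {
        List<Map.Entry<String, Double>> all = suggestions.get(property);
        return all.subList(0, Math.min(howMany, all.size()));
    }

    public static void main(String[] args) {
        MultiMapSketch sketch = new MultiMapSketch();
        sketch.ingest("P41", "P580", 0.7);   // hypothetical training rows
        sketch.ingest("P41", "P582", 0.3);
        System.out.println(sketch.suggest("P41", 1)); // prints [P580=0.7]
    }
}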
SortedListRecommender is essentially a List of TranslatedRecommendedItems; a recommendation is just the top N elements of the list.
PropertyRecommender is a class that extends MyrrixWebClientRecommender and contains an instance of SortedListRecommender. It recommends properties via Myrrix based upon a list of properties. If no properties are given as input (i.e. an empty List), it uses the SortedListRecommender to simply fetch the most frequently used properties.
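Purely as a sketch of that fallback (the real class returns TranslatedRecommendedItems via the Myrrix client rather than the plain strings and placeholder used here):

import java.util.List;

// Illustrative sketch of the fallback logic only; the real PropertyRecommender extends
// MyrrixWebClientRecommender and talks to a Myrrix instance.
public class PropertyRecommenderSketch {

    private final List<String> mostUsedProperties;  // stands in for SortedListRecommender

    public PropertyRecommenderSketch(List<String> mostUsedProperties) {
        this.mostUsedProperties = mostUsedProperties;
    }

    public List<String> recommend(List<String> inputProperties, int howMany) {
        if (inputProperties.isEmpty()) {
            // No properties given: fall back to the popularity-sorted list.
            return mostUsedProperties.subList(0, Math.min(howMany, mostUsedProperties.size()));
        }
        // Otherwise ask the Myrrix instance for collaborative-filtering recommendations.
        return recommendViaMyrrix(inputProperties, howMany);
    }

    private List<String> recommendViaMyrrix(List<String> inputProperties, int howMany) {
        // Placeholder for the Myrrix client call made by MyrrixWebClientRecommender.
        throw new UnsupportedOperationException("handled by the Myrrix client in the real code");
    }
}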
org.wikimedia.wikibase.entitysuggester.client.servlets
Contains an AbstractEntitySuggesterServlet class with implemented getRecommender and setRecommender methods. It also has an abstract getEntityType() method that must be implemented in concrete servlets; the return value of this method makes getting and setting recommenders easier, and it is also used by RecommenderFactory to decide which type of Recommender to return.
org.wikimedia.wikibase.entitysuggester.client.servlets.suggest
The AbstractSuggesterServlet class in this package contains all the actual suggester servlet logic. It overrides the doGet method, which splits the path string after the servlet's name using a comma-delimited Splitter, fetches its recommender via getRecommender, and feeds the resulting list into recommendAsJSON to obtain the suggestions as a JSON string, which is ultimately written out. In the init() method, a ServletException is thrown if a recommender is not found (i.e. it was not set by a call to the corresponding ingest servlet).
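The flow can be sketched roughly as follows; this is a hedged approximation rather than the actual servlet code, and the recommendAsJSON hook below merely stands in for the recommender call:

import com.google.common.base.Splitter;
import com.google.common.collect.Lists;

import java.io.IOException;
import java.util.List;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hedged sketch of the doGet flow described above; the real class also resolves its
// recommender via getRecommender()/getEntityType().
public abstract class SuggesterServletSketch extends HttpServlet {

    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // e.g. "/P41,P24,P345" for /entitysuggester/suggest/claimprops/P41,P24,P345
        String pathInfo = request.getPathInfo() == null ? "" : request.getPathInfo();
        String idList = pathInfo.startsWith("/") ? pathInfo.substring(1) : pathInfo;

        // Comma-delimited Splitter; an empty list triggers the default (popularity) suggestions.
        List<String> entityIds =
                Lists.newArrayList(Splitter.on(',').omitEmptyStrings().trimResults().split(idList));

        int howMany = request.getParameter("howMany") == null
                ? 0 : Integer.parseInt(request.getParameter("howMany"));

        response.setContentType("application/json");
        response.getWriter().write(recommendAsJSON(entityIds, howMany));
    }

    // Stands in for the recommender call that produces the JSON array of [id, score] pairs.
    protected abstract String recommendAsJSON(List<String> entityIds, int howMany) throws IOException;
}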
The four concrete classes in this package are the suggester servlets. They are minimal, and only need to implement getEntityType().
org.wikimedia.wikibase.entitysuggester.client.servlets.ingest
AbstractIngestServlet contains the ingest POST logic. In the doPost method, POST data is fed into the recommender instance using a Reader. getRequestReader() is a method that checks the Content-Encoding header of the request and returns the request's reader (or wraps a GZIPInputStream around request.getInputStream() if the data is gzip-compressed).
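A sketch of that content-encoding check might look like the following; the class and method here follow the description above but are not copied from the source:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.zip.GZIPInputStream;

import javax.servlet.http.HttpServletRequest;

// Hedged sketch of getRequestReader(): return the request's reader as-is, or wrap a
// GZIPInputStream around the raw input stream when the POST body is gzip-compressed.
public final class RequestReaders {

    private RequestReaders() {
    }

    public static BufferedReader getRequestReader(HttpServletRequest request) throws IOException {
        String contentEncoding = request.getHeader("Content-Encoding");
        if (contentEncoding != null && contentEncoding.toLowerCase().contains("gzip")) {
            return new BufferedReader(
                    new InputStreamReader(new GZIPInputStream(request.getInputStream()), "UTF-8"));
        }
        return request.getReader();
    }
}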
The rest of the classes in the package (six of them) are the concrete ingester servlets. Two of them, EmptyClaimPropertyIngestServlet and EmptyRefPropertyIngestServlet, override the trainRecommender method of the abstract class. These classes obviously expect a PropertyRecommender from getRecommender(); the PropertyRecommender.trainEmpty method is called here. This is probably a code smell and will be fixed in the future using something like an EmptyTrainable interface.
org.wikimedia.wikibase.entitysuggester.wikiparser
This package contains XMLInputFormat, a class borrowed from the Cloud9 GitHub repo, which is used by WikiPageInputFormat. The latter is used with the Hadoop Streaming Python scripts to generate the training datasets. As of now, this InputFormat will break if strings like <page> or </page> are present in a CDATA section or the like in the data dump. It has survived the latest data dump so far, but may be replaced in the future by WikiHadoop, which requires Apache Hadoop 0.21, 0.22 or 0.23. Since I couldn't get WikiHadoop to run with the Cloudera CDH3 distribution and a few other releases I tried, I decided not to use it for the moment and left it for future experimentation.
The Python wikiparser code
TODO: To be filled in soon.
Progress Reports
I'll be maintaining monthly and weekly reports on this page.