This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
The WDQS service is now in production beta; this page describes the pre-production/testing setup. For information about the production beta, please see the WDQS User Manual.
Wikidata Query Service Beta
Wikidata query service beta deployment
The purpose of this deployment is to provide a testing ground for the query service and to collect basic usage patterns. The service runs at http://wdqs-test.wmflabs.org/ (offline).
Deployment hosts
The deployment hosts are wdq-beta.eqiad.wmflabs and db01.eqiad.wmflabs. wdq-beta serves http://wdqs-test.wmflabs.org/; db01 is an internal host for experiments.
If you need access to either host, ping any member of the wikidata-query project on Labs. Each is an xlarge instance with 160G of storage.
Source code
The code comes from https://github.com/wikimedia/wikidata-query-rdf/. See https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md for a detailed description of how to build and set it up. This is already done on the beta host, so the description is for information and disaster recovery purposes only.
All necessary data except for the nginx configs (see below) is contained in the service-*-dist.zip deployment package, which is what is deployed at /srv/wdqs/blazegraph. Deployment can be done with the puppet role below. Note that the puppet role does not start the Updater service.
Puppet deployment
Puppet uses a self-hosted puppetmaster at wdqs-puppetmaster.
Configuration for puppetmaster:
- check role::puppet::self
- set the puppetmaster to wdqs-puppetmaster
- check role puppetmaster::autosigner
- set puppetmaster_autoupdate to true
Configuration for clients:
- check role::puppet::self
- set the puppetmaster to wdqs-puppetmaster
- enable role role::wdqs
Blazegraph deployment
Blazegraph is deployed in /srv/wdqs/blazegraph, running under user blazegraph. If the service is stopped or crashes, restart it by running:
# ./runBlazegraph.sh | tee $(date +%s).log
from /srv/blazegraph. Preserving logs for at least some time is recommended in case an unexpected failure happens. There is no log rotation scheme in place so far, so just delete the old logs once you're sure nobody needs them anymore.
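Since there is no log rotation, the manual cleanup could be helped by a small script that lists which logs are old enough to consider deleting. The following is only a sketch under assumptions: the helper name old_logs, the age threshold, and the idea of always sparing the newest log (on the guess that the running instance is still writing to it) are illustrative and not part of the deployment. It relies only on the file naming produced by the tee command above, where the file name is a Unix timestamp.

```python
import re
import time
from pathlib import Path

# Matches names such as "1432067575.log" created by `tee $(date +%s).log`;
# the stem is the start time as a Unix timestamp in seconds.
LOG_NAME = re.compile(r"^(\d+)\.log$")


def old_logs(log_dir, max_age_days, now=None):
    """List timestamp-named logs older than max_age_days (hypothetical helper).

    The newest log is never returned, on the assumption that the running
    Blazegraph instance may still be writing to it.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400

    stamped = []
    for path in Path(log_dir).iterdir():
        match = LOG_NAME.match(path.name)
        if match:
            stamped.append((int(match.group(1)), path))

    # Sort by start time and drop the newest entry before applying the cutoff.
    stamped.sort(key=lambda item: item[0])
    return [path for stamp, path in stamped[:-1] if stamp < cutoff]
```

The function only reports candidates. Actually removing them is left as a separate, deliberate step once it is clear nobody needs the files.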
Some interesting settings may be found in /srv/wdqs/blazegraph/blazegraph/WEB-INF/web.xml, namely queryThreadPoolSize and queryTimeout. Changing those probably requires a restart. Note that if you restart the Blazegraph service you may also need to restart the updater, as it may give up if Blazegraph is offline for too long (see below).
The Blazegraph instance has a GUI workbench accessible at http://localhost:9999/. It is not open to public access, as it allows full write access to the database. It can be reached by configuring port forwarding when logging in to the host via ssh.
Updater deployment
The updater is the service that constantly pulls changes from Wikidata and synchronizes them with the current database. If it stops, the query service remains functional but contains data only up to the last successful update. This service can be run under any user, as it communicates with Blazegraph only via the REST API and does not store any persistent data itself; everything is stored in Blazegraph. It currently runs under smalyshev.
It can be run as:
# ./runUpdater -n wdq
from /srv/wdqs/blazegraph. However, running it as a service is recommended: service wdqs-updater start
The updater log is configured by updater-logs.xml. The updater logs progress information like this:
20:32:55.850 [main] INFO org.wikidata.query.rdf.tool.Update - Polled up to 2015-05-19T09:11:50Z at (2.6, 2.7, 2.8) updates per second and (2085.5, 2096.2, 2202.2) milliseconds per second
The date is the point in the main database up to which the service has been updated. The first set of numbers is the number of entities updated per second; the second is how far it got in catching up with the main data in one second. These numbers are relevant only if the service is behind the main DB.
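To make the log format above concrete, here is a minimal sketch that pulls the date and both sets of numbers out of such a line. The function names parse_updater_line and estimate_catch_up_seconds are hypothetical, and the regular expression is written against the single sample line shown here, so other updater versions may log differently. The catch-up estimate rests on an assumption not stated on this page: that the main database itself advances by one second per second, so a rate of 2000 milliseconds per second closes the gap at a net one second per second.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the sample log line on this page (an assumption:
# other updater versions may use a different message format).
PROGRESS = re.compile(
    r"Polled up to (?P<ts>\S+) at \((?P<updates>[^)]*)\) updates per second "
    r"and \((?P<ms>[^)]*)\) milliseconds per second"
)


def _floats(text):
    """Turn '2.6, 2.7, 2.8' into a tuple of floats."""
    return tuple(float(part) for part in text.split(","))


def parse_updater_line(line):
    """Return the progress fields of an updater log line, or None."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    polled = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%SZ")
    return {
        "polled_up_to": polled.replace(tzinfo=timezone.utc),
        "updates_per_second": _floats(match.group("updates")),
        "ms_per_second": _floats(match.group("ms")),
    }


def estimate_catch_up_seconds(lag_seconds, ms_per_second):
    """Rough time until the lag reaches zero, or None if not catching up.

    Assumes the main database moves forward in real time, so only the
    portion of the rate above 1000 ms per second reduces the lag.
    """
    net_rate = ms_per_second / 1000.0 - 1.0
    if net_rate <= 0:
        return None
    return lag_seconds / net_rate
```

The lag itself can be taken as the difference between the current UTC time and polled_up_to. The page does not say what the three values in each group represent, so the sketch returns them as-is and leaves the choice of which one to use to the caller.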
If there are no updates, the updater sleeps and then re-checks the Wikidata site. It can also be stopped and restarted at any moment without affecting query service functionality. If the Blazegraph service is down, it retries for a short time, then exits.
Web access
External access to the service is provided at the URL http://wdqs-beta.wmflabs.org/.
Access goes through an nginx proxy, whose configs are in /etc/nginx/sites-enabled/wdqs. Only GET requests to URLs starting with /bigdata/ are proxied to Blazegraph.
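Because only GET requests under /bigdata/ reach Blazegraph, a client has to send its SPARQL query as a URL parameter. The sketch below shows one way that might look. The endpoint path /bigdata/namespace/wdq/sparql is the one mentioned further down this page; the function names, the timeout, and the use of the standard SPARQL results JSON media type in the Accept header are assumptions, and the beta host is offline, so run_query is illustrative only.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://wdqs-beta.wmflabs.org"
# Path taken from the access-log search string on this page.
SPARQL_PATH = "/bigdata/namespace/wdq/sparql"


def build_query_url(sparql, base_url=BASE_URL):
    """Encode a SPARQL query into a GET URL under /bigdata/."""
    return base_url.rstrip("/") + SPARQL_PATH + "?" + urlencode({"query": sparql})


def run_query(sparql, base_url=BASE_URL, timeout=30):
    """Send the query with GET and return the raw response body.

    Assumption: the endpoint honours content negotiation for
    application/sparql-results+json.
    """
    request = Request(
        build_query_url(sparql, base_url),
        headers={"Accept": "application/sparql-results+json"},
        method="GET",
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Keeping the URL construction separate from the network call makes it possible to check the encoded request without contacting the host.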
The root document for http://wdqs-beta.wmflabs.org/ is the WDQS Beta GUI, which is served from /srv/wdqs/blazegraph/gui/. It is located in the gui/ subdirectory of the sources.
Access logs are in /var/log/nginx. Searching them for "/bigdata/namespace/wdq/sparql" gives the list of queries that were attempted from the GUI.
See also sample nginx config that can be used for the service.
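The access-log search described above can be sketched as a short script that decodes the attempted queries instead of leaving them percent-encoded. This assumes the logs contain the request line in the usual quoted form ("GET <target> HTTP/1.1"), as in nginx's default combined format; the actual log format on the host is not documented here, and the function name extract_queries is hypothetical.

```python
import re
from urllib.parse import parse_qs, urlsplit

SPARQL_PATH = "/bigdata/namespace/wdq/sparql"

# Quoted request line as it appears in the assumed access-log format.
REQUEST_LINE = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+"')


def extract_queries(lines):
    """Return decoded SPARQL queries found in access-log lines."""
    queries = []
    for line in lines:
        match = REQUEST_LINE.search(line)
        if match is None:
            continue
        target = urlsplit(match.group("target"))
        if target.path != SPARQL_PATH:
            continue
        # parse_qs undoes percent-encoding and '+' for spaces.
        queries.extend(parse_qs(target.query).get("query", []))
    return queries
```

It can be fed an open log file directly, for example by iterating over a file under /var/log/nginx, since it only needs an iterable of lines.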
Monitoring
The logs for SPARQL requests are available at Labs Logstash: https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/wdqs
Graphite monitoring is available at http://graphite.wmflabs.org/, e.g.: http://graphite.wmflabs.org/dashboard/#wdq-beta
Other tools
This section should eventually find a better place; for now, this is the list of related tools:
- http://tools.wmflabs.org/wdq2sparql/w2s.php - WDQ to SPARQL translator
- https://tools.wmflabs.org/bene/sparql/ - SPARQL query generator
- https://tools.wmflabs.org/ppp-sparql/ - natural language query generator based on Platypus