WDQS Production

Group:	Discovery
Team members:	Stas Malyshev
Lead:	Stas Malyshev

The following items need to be completed before the Wikidata Query Service (WDQS) can go into production:

#	Item	Severity	Complexity	Who can do it	Depends on	Phab task
1	Productization
11	Packaging to ops standards	MUST	2	Ops	12	T103897
12	Converting services to Debian-model services	MUST	2	Stas+Ops	-	T103904
13	Preparing puppet scripts to ops standards	MUST	1	Stas+Ops	11,12	T95679
14	Automated initial loading from dumps	MAY			-
15	Automated version upgrade	MAY	1		12
16	Backup/disaster recovery story	SHOULD		Ops	-	T103906
17	External hardening	MUST				T103907
18	Internal hardening	SHOULD				T103908
19	Security review completed	MUST			11, 12	T90115
110	Size, request & obtain hardware	MUST	1		-	T86561
2	Monitoring				1
21	Devise performance monitoring criteria	SHOULD	1		-	T103922
22	Set up service alerts	MUST	1		1	T103911
23	Connect to performance monitoring services	SHOULD	1		21	T103931
24	Connect to analytic log collection services	MUST	1		1	T98030
3	Features
31	Negative dates handling	SHOULD	4	Stas	-	T94539
32	Redirects handling	MAY	10	Stas	-	T96490
33	Geocoordinates handling	SHOULD	20	Blazegraph?	-
34	Labels handling	SHOULD			-	T97079
35	User-facing documentation	SHOULD	3		-	T103932

Severity key:

MUST - we cannot go to production before this is done.
SHOULD - ideally we need this for a production-quality service; if we can't deliver it right now we can proceed without it, but we should prioritize it right after the MUSTs are done.
MAY - we should have it eventually, but we can survive for some time without it.

Complexity is a (very) rough estimate of how many work days it may take to get the item working. The estimate covers only the amount of actual work to be done and does not include waiting for resource allocation, bureaucracy, etc.

Detailed description of each item follows.

Productization

Packaging to ops standards

Currently the package is a single ZIP that is built by Maven, uploaded to Maven Central, and meant to be downloaded and deployed manually. We want to keep the Maven option, but we also need something that matches what Ops usually works with. We need to figure out how to build such a package and deploy it.

We should talk to ops about it. We could go with deb packaging but it might be simpler to deploy using git-fat.

Converting services to Debian-model services

WDQS has two Java services, Blazegraph and Updater, which right now are run by manual scripts, with no log rotation, no watchdog/restart, etc. This needs to change so that they become more standard, supported services. The locations of the database, logs, configs, etc. should also be reviewed and possibly changed to match standards.

I (manybubbles) see two options:

Use systemd, which I believe has facilities to capture stdout and stderr and write them to nicely rotated log files.
Use the standard Java loggers and enable something like a size-based rolling policy.

Whatever ops wants, though I suspect it'll be systemd.

Preparing puppet scripts to ops standards

After the above is finished, the puppet scripts that now live in a private repo should be updated and fixed according to ops standards and included in WMF's standard puppet.

This should probably be assigned to someone in Ops. It would be far faster than having the Discovery team do it.

Automated initial loading from dumps

Currently the only way to load a dump into the query service is to follow a manual procedure. We may want a more automated way of doing it.

Maybe build it into the updater? If it can't find the load then it does the download, etc? That'd be super duper convenient for external users. We have a script to do it already, so maybe just have puppet run the script if it can't see any data or something? You just have to make the script idempotent -or- make a check_if_required style script. Not super complex. It'd be interesting to test this using vagrant I think. We do this with Cirrus in Vagrant and that works, at least.
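The "check_if_required" idea above could look roughly like the following. This is only a sketch: the marker file name, the `load_if_required` helper and the `run_load` callable (standing in for the existing load script) are all hypothetical names, not part of WDQS.

```python
from pathlib import Path
from typing import Callable


def load_if_required(
    data_dir: Path,
    run_load: Callable[[], None],
    marker_name: str = ".initial_load_done",  # hypothetical marker file name
) -> bool:
    """Run the initial load only if no completed load is recorded.

    Returns True if the load ran, False if it was skipped.
    """
    marker = data_dir / marker_name
    if marker.exists():
        # A previous run finished successfully; repeating the call is a no-op.
        return False
    data_dir.mkdir(parents=True, exist_ok=True)
    run_load()  # hypothetical wrapper around the existing load script
    # The marker is written only after run_load() returns without raising,
    # so a failed load is retried on the next run.
    marker.touch()
    return True
```

Because the check and the load live in one function, puppet (or the updater) could call it on every run without worrying about loading twice.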

Automated version upgrade

Currently there is no way to move from one version of the service to another automatically: one needs to install the new version, shut down the old one manually, and transfer the DB manually. We need to make this automatic.

On the other hand, Elasticsearch upgrades also require a manual shutdown and restart, and that is fine, even good. They don't require any manual fiddling with the files, though. We should make the updater restart itself automatically on upgrade. Certainly if it's deb-packaged.

Backup/disaster recovery story

There is no recovery option currently except for "set up completely new server and reload from scratch". We need to review this and decide if we want to have better options.

External hardening

The service should be properly firewalled, should not be modifiable from outside (i.e. SPARQL UPDATE requests should be blocked), and should not be able to issue SERVICE requests to outside sources. An external user should not be able to modify the data on the service or cause the service to call out to external processes or over the network.

Internal hardening

A service user should not be able to consume more than a defined share of the resources and cause a DoS for other users. Queries should be limited in time and memory and should time out or abort when the limits are reached.

Security review completed

We need to complete the security review of the WDQS setup. See: https://phabricator.wikimedia.org/T90115

Size, request & obtain hardware

Determine which hardware we need, request it from ops and set it up.

Monitoring

Devise performance monitoring criteria

Devise a set of metrics that we need to collect from the running service in order to monitor its performance.

Set up service alerts

Set up a system that monitors the entry points for Blazegraph and the GUI and the health of the Updater service, and alerts when any of them goes down.
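As a rough illustration of what such a check might do, the sketch below probes a set of HTTP entry points and reports the ones that fail. The endpoint names and URLs are placeholders, the Updater is assumed (hypothetically) to be checkable the same way, and in practice this would be wired into whatever alerting system Ops already uses rather than run standalone.

```python
from typing import Callable, Dict, List
from urllib.error import URLError
from urllib.request import urlopen


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError, ValueError):
        # Connection errors, HTTP error statuses and malformed URLs all
        # count as "down" for alerting purposes.
        return False


def failing_checks(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> List[str]:
    """Return the sorted names of endpoints whose probe failed."""
    return sorted(name for name, url in endpoints.items() if not probe(url))


if __name__ == "__main__":
    # Placeholder URLs; the real entry points are not specified here.
    endpoints = {
        "blazegraph": "http://wdqs.example/sparql",
        "gui": "http://wdqs.example/",
    }
    down = failing_checks(endpoints)
    if down:
        print("ALERT: down: " + ", ".join(down))
```

Passing the probe as a parameter keeps the reporting logic testable without network access.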

Connect to performance monitoring services

Create scripts to measure the metrics described above and send them to Graphite or another metrics collection tool.

Connect to analytic log collection services

Set up log collection and connect it to existing analytic systems.

Features

Negative dates handling

Right now year 0 and negative dates are not handled consistently by WDQS because custom logic is not used for date calculations. We need to fix that. See: https://phabricator.wikimedia.org/T94539

Redirects handling

Some entities are redirects to other entities. Semantically, these entity IDs should be completely interchangeable. This currently does not work. See: https://phabricator.wikimedia.org/T96490
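One way to picture "interchangeable" is that every ID is mapped to its final redirect target before it is compared or looked up. The sketch below shows that idea only; the `resolve_redirect` helper and the example IDs are made up, and it says nothing about how this would be done inside Blazegraph or the Updater.

```python
from typing import Mapping


def resolve_redirect(entity_id: str, redirects: Mapping[str, str]) -> str:
    """Follow redirects until reaching an ID that is not itself a redirect."""
    seen = set()
    while entity_id in redirects:
        if entity_id in seen:
            # Guard against malformed data containing a redirect loop.
            raise ValueError(f"redirect loop involving {entity_id}")
        seen.add(entity_id)
        entity_id = redirects[entity_id]
    return entity_id


def same_entity(a: str, b: str, redirects: Mapping[str, str]) -> bool:
    """Two IDs are interchangeable if they resolve to the same target."""
    return resolve_redirect(a, redirects) == resolve_redirect(b, redirects)
```

With a hypothetical map `{"Q100": "Q200", "Q200": "Q300"}`, both `Q100` and `Q200` resolve to `Q300`, so queries mentioning any of the three IDs would refer to the same entity.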

Geocoordinates handling

Right now we store geographic data but are unable to do any geographic searches at all. We need to be able to at least compute the distance between two points, and ideally have some index that allows us to do geographic searches.
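For the "distance between two points" part, a common choice is the haversine great-circle formula. The sketch below is only an illustration of that calculation, treating the Earth as a sphere with its mean radius; it is an assumption that this level of accuracy is acceptable, and it does not address the indexing question.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    # Clamp to 1.0 so rounding error cannot push asin out of its domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

One degree of longitude along the equator comes out at roughly 111.2 km with this radius.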

Labels handling

To obtain the label for an item, users have to write cumbersome SPARQL queries that are easy to get wrong. We should define a custom function that produces labels in the preferred language with fallback. See: https://phabricator.wikimedia.org/T97079
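The intended "preferred language with fallback" behaviour can be sketched as below. This shows the selection logic only, not the SPARQL function itself; the `pick_label` name and the idea of passing the fallback chain as an ordered list are assumptions for illustration.

```python
from typing import Mapping, Optional, Sequence


def pick_label(labels: Mapping[str, str], languages: Sequence[str]) -> Optional[str]:
    """Return the label in the first language of the chain that has one.

    `labels` maps language codes to label text; `languages` is the preferred
    language followed by its fallbacks, in priority order.
    """
    for lang in languages:
        if lang in labels:
            return labels[lang]
    return None  # no label in any requested language
```

For example, with labels `{"fr": "Londres", "en": "London"}` and the chain `["de", "en"]`, the function returns `"London"` because no German label exists.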

User-facing documentation

We may want to create a better service description and organize the existing documentation into a consistent documentation package that lets users get up to speed with the service quickly.