Draft:ABaso (WMF)/Wikidata Query Service graph database reload times, 2024 edition

This post is about importing Wikidata into the graph database technology used for hosting the Wikidata Query Service.

Summary

I'll go into greater detail, but if you are looking to host your own Blazegraph database of Wikidata data, you might try the following:

  1. Get a desktop with the fastest CPU you can afford, and buy a speedy 4 TB NVMe plus 64 GB or more of DDR5 RAM. Import using the N-Triples format, not the Turtle format.
  2. During pauses in the batch import operations (e.g., every 100 files imported), make a backup copy of the graph database. That way, if your graph database import fails, you can resume from that backup instead of starting over. The backup will slow you down, but it may save you many days in the end (see the sketch after this list).
  3. If you can't afford to build or upgrade a new server of your own, consider using a cloud server to perform the import, then copy the produced graph database to a more budget-friendly computer; remember that in addition to cloud compute and storage costs, there may be data transfer costs.
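
To make the second item above concrete, here is a minimal sketch of a batched import loop with checkpoint backups, assuming the dump has already been split into numbered N-Triples files. The load_file() helper is hypothetical (in practice it might shell out to the loadData.sh tooling that ships with WDQS or send data to the Blazegraph endpoint), and the journal filename and paths are assumptions, not a description of our production import.

  # Minimal sketch: import split N-Triples files in order and copy the
  # Blazegraph journal every 100 files so a failed import can resume from
  # the most recent checkpoint instead of starting over.
  import shutil
  from pathlib import Path

  JOURNAL = Path("/srv/wdqs/wikidata.jnl")   # journal location is an assumption
  BACKUP_DIR = Path("/srv/wdqs/backups")
  BATCH_SIZE = 100

  def load_file(nt_file: Path) -> None:
      """Hypothetical helper: load one N-Triples file into Blazegraph."""
      raise NotImplementedError

  def import_with_checkpoints(split_dir: Path) -> None:
      BACKUP_DIR.mkdir(parents=True, exist_ok=True)
      for i, nt_file in enumerate(sorted(split_dir.glob("*.nt.gz")), start=1):
          load_file(nt_file)
          if i % BATCH_SIZE == 0:
              # Pause point: make sure Blazegraph is idle (or stopped) so the
              # journal is in a consistent state before copying it.
              shutil.copy2(JOURNAL, BACKUP_DIR / f"{JOURNAL.name}.{i:06d}")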

Graph databases and Wikidata

Graph databases are a useful technology for data mining relationships between all kinds of things. Within the Wikimedia content universe, we have a powerful graph database offering called the Wikidata Query Service ("WDQS") which is based on a mid-2010s technology called Blazegraph.

Wikidata community members model topics you might find on Wikipedia, and this modeling makes it possible to answer all kinds of questions about the world via the Wikidata data replicated into WDQS.

Big data headache

As Wikidata has grown, the WDQS graph database has become pretty big, approaching 16 billion rows as of this writing, with many intricate relationships between those rows. Unfortunately, the WDQS graph database has also become unstable as a result, and this seems to be getting worse as the database gets larger. The last time a data corruption occurred, it took about 60 days to reload the graph database to a healthy state.

The long recovery time was a prompt to further enhance the data reload mechanisms and to figure out a way to manage the growth in data volume. Over the course of the last year, the Search Platform Team, which is part of the Data Platform Engineering unit at the Wikimedia Foundation, worked on a project to improve things.

As part of its goal setting, the team determined it should make it possible to support more graph database growth (up to 20 billion rows in total) while being able to recover more reliably and more quickly in the event of a database corruption (within 10 days). This would allow Wikidata's value to continue to be realized more fully while reducing future prolonged disruptions to WDQS, which is one of the most important tools in the Wikidata system.

To support more database growth, it was pretty clear that either the backend graph database would need to be completely replaced or the graph database would need to be split to buy some time, as the clock had run out on keeping the existing database stable. A full backend graph database replacement is necessary, but it is a rather complex undertaking and would push timelines out considerably.

A stopgap solution seemed best. So, the team pursued the approach of splitting the graph database from one monolithic database into separate databases partitioned by two coarse-grained knowledge domains: (1) scholarly article entities and (2) everything else. As fate would have it, these two knowledge domains are roughly equivalent in size, so the split has the nice property of allowing more future growth for both partitions, albeit anecdotally limited to about 14 billion rows apiece.
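
As an illustration of the idea (not the team's actual split tooling), the toy sketch below partitions an N-Triples dump into a scholarly article file and an everything-else file, assuming scholarly article entities can be identified by an instance-of (P31) statement pointing at scholarly article (Q13442814). At Wikidata scale the subject set alone would be enormous, so the real pipeline is considerably more involved.

  # Toy two-pass partitioning of an N-Triples dump: pass 1 collects subjects
  # that are instances of "scholarly article" (Q13442814); pass 2 routes each
  # triple to one of two output files based on its subject.
  P31 = "<http://www.wikidata.org/prop/direct/P31>"
  SCHOLARLY_ARTICLE = "<http://www.wikidata.org/entity/Q13442814>"

  def partition(dump_path: str) -> None:
      scholarly_subjects = set()
      with open(dump_path, encoding="utf-8") as dump:
          for line in dump:
              parts = line.split(" ", 2)
              if len(parts) == 3 and parts[1] == P31 and parts[2].startswith(SCHOLARLY_ARTICLE):
                  scholarly_subjects.add(parts[0])
      with open(dump_path, encoding="utf-8") as dump, \
           open("scholarly.nt", "w", encoding="utf-8") as scholarly, \
           open("main.nt", "w", encoding="utf-8") as main:
          for line in dump:
              subject = line.split(" ", 1)[0]
              (scholarly if subject in scholarly_subjects else main).write(line)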

While working through the split of the graph database, initial testing suggested that a graph of 10 billion rows could be reloaded within ten days and that reloads for both knowledge domains could run in parallel. Still, there wasn't a lot of room for error. What happens if a graph database corruption happens right when the weekend starts? What if some other sort of server maintenance blocks the start of a reload for a day or two? We wanted to be certain that we could reload and still have some breathing room to stay within 10 days.

Hardware to the rescue?

From previous investigations it seemed that more powerful servers could speed up data reloads. Obvious, right?

Well, yes and no. It's a little more complicated. People have tried.

The legacy Blazegraph wiki has some nice guidance on Blazegraph performance optimization, I/O optimization, and query optimization (some of this knowledge is evident in a configuration ticket from as early as 2015 involving one of the original maintainers of Blazegraph). Some of it is still useful and seems to apply, although changes in the JDK plus the sheer scale of Wikidata make some of the settings harder to reason about in practice. Data reloads are so big and time-consuming that it becomes difficult to profile every permutation of hardware, Blazegraph, Java, and OS configuration.

This said, a personal gaming-class machine I bought in 2018 for its machine learning capabilities was able to do much faster WDQS imports than what we were seeing on our data center servers. Noticing this, I wanted to understand whether there were advances in CPU, memory, and disk in the wild that might point the way to even faster data reloads, and whether any software configuration variables could yield bigger performance gains.

Computing configuration

My personal gaming-class machine, after several years of upgrades, has a 6-core CPU (up to 4.6 GHz turbo boost), 64 GB of RAM, and a 2 TB NVMe.


TKTK

CPU: Although CPU speed increases are still being observed with each new generation of processor, many of the advances in computing have to do with parallelizing computation across more cores. And although WDQS's graph database holds up relatively well when parallelizing queries across multiple cores, its data reloading doesn't take great advantage of the many-core architecture.

Memory: Although more memory is commonly beneficial to large data operations, and intuitively you might expect a graph database to work better with more memory, the manner in which memory is used by running programs can drive performance in surprising ways, ranging from good to bad. WDQS runs on Java technology, and configuring the Java heap for good performance without long garbage collection ("GC") pauses is notoriously challenging. We deliberately use a 31 GB heap in production for our Blazegraph instances, just below the roughly 32 GB point at which the JVM can no longer use compressed object pointers. It's also important to remember that a large Java heap requires a lot of RAM, which can become expensive even in this day of in-memory database architectures.
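
For illustration, a standalone Blazegraph instance might be launched with a pinned 31 GB heap roughly as follows; the jar name, garbage collector choice, and paths are assumptions for the sketch rather than our production configuration.

  # Minimal sketch: launch standalone Blazegraph with a fixed 31 GB heap so
  # the JVM stays under the ~32 GB threshold for compressed object pointers.
  import subprocess

  subprocess.run(
      [
          "java", "-server",
          "-Xms31g", "-Xmx31g",   # pin the heap just below the compressed-oops limit
          "-XX:+UseG1GC",         # one common collector choice for large heaps
          "-jar", "blazegraph.jar",
      ],
      check=True,
  )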

Nevertheless, more memory can be helpful for filesystem paging operations. Taking the hardware configuration guidance at face value, an ideal server configuration for today's scale would need about 12 TB of memory (we have about 1,200 GB of data with 16 billion records, and the guidance implies memory on the order of ten times the on-disk data size). We're getting by with 128 GB of memory per server, far less than 12 TB. More memory would be nice, but it's too expensive in a multi-node setup built for redundancy across multiple data centers.

Disk: NVMe disks have brought increased speed to data operations. But backpressure from CPU or memory can also mask gains that speedier NVMe throughput might otherwise deliver.