Wikidata Query Service graph database reload at home, 2024 edition

This post is about importing Wikidata into the graph database technology used for hosting the Wikidata Query Service (WDQS). You'll find commands below to stand up your own local copy of the graph database powering WDQS.

Graph databases and Wikidata

Graph databases are a useful technology for data mining relationships between all kinds of things and for enriching knowledge retrieval and generation systems. Within the Wikimedia content universe, we have a powerful graph database offering called the Wikidata Query Service ("WDQS") which is based on a mid-2010s technology called Blazegraph.

Wikidata community members model topics you might find on Wikipedia, and this modeling makes it possible to answer all kinds of questions about the world via the Wikidata data replicated into WDQS.

Big data growing pains

As Wikidata has grown, the WDQS graph database has become pretty big: about 16 billion rows (known as triples) as of this writing, with many intricate relationships between those rows. Unfortunately, the WDQS graph database has also become unstable as a result, and this seems to be getting worse as the database gets larger. The last time a data corruption occurred, it rippled through the infrastructure, and it took about 60 days to reload the graph database to a healthy state across all WDQS nodes.

The long recovery time was a prompt to further enhance the data reload mechanisms and to figure out a way to manage the growth in data volume. Over the course of the last year, the Search Platform Team, which is part of the Data Platform Engineering unit at the Wikimedia Foundation, worked on a project to improve things.

As part of its goal setting, the team determined it should make it possible to support more graph database growth (up to 20 billion rows in total) while being able to recover more reliably and more quickly (within 10 days) in the event of a database corruption. This would allow Wikidata's value to continue to be realized more fully while reducing future prolonged disruptions to WDQS, which is one of the most important tools in the Wikidata ecosystem.

In order to support more database growth, it was pretty clear that either the backend graph database would need to be completely replaced or the graph database would need to be split to buy some time, as the clock had run out on the graph database being stable. A full backend graph database replacement is still necessary eventually, but it is a rather complex undertaking and would have pushed timelines out considerably.

A stopgap solution seemed best. So, the team pursued the approach of splitting the graph database from one monolithic database into separate databases partitioned by two coarse grained knowledge domains: (1) scholarly article entities and (2) everything else. As fate would have it, these two knowledge domains are roughly equivalent in size, so a split has the nice property of allowing for more growth, albeit limited anecdotally to about 14 billion rows apiece, in the future for both partitions.

After many changes to data ingestion, streaming, network topology, and server automation, the split has been deployed to its first set of production servers, and the plan is to migrate all of the WDQS servers to the split-based architecture in the spring of 2025.

While working through the split of the graph database, initial testing suggested that it should be possible to reload a graph of 10 billion rows within ten days, and that reloads for both knowledge domains could run in parallel (thus allowing for 20 billion rows in total). Still, there wasn't a lot of room for error. What happens if a graph database corruption happens right when the weekend starts? What if some other sort of server maintenance blocks the start of a reload for a day or two? We wanted to be certain that we could reload and still have some breathing room to stay within 10 days.

Hardware to the rescue?

From previous investigations it seemed that more powerful servers could speed up data reloads. Obvious, right?

Well, yes and no. It's a little more complicated. People have tried.

The legacy Blazegraph wiki has some nice guidance on Blazegraph performance optimization, I/O optimization, and query optimization (some of this knowledge is evident in a configuration ticket from as early as 2015 involving one of the original maintainers of Blazegraph). Some of it is still useful and seems to apply, although changes backported into the JDK plus the sheer scale of Wikidata make some of the settings harder to reason about in practice. Data reloads have become so big and time consuming (think on the order of weeks, not hours) that it is impractical (and expensive) to profile every permutation of hardware, Blazegraph, Java, and OS configuration.

That said, after noticing that a personal gaming-class machine I bought in 2018 for its machine learning capabilities (will link to: https://www.youtube.com/watch?v=mFykPLCphqE) could do much faster WDQS imports than we were seeing on our data center servers, I wanted to understand whether advances in CPU, memory, and disk in the wild might point the way to even faster data reloads, and whether any software configuration variables could yield bigger performance gains.

This was explored in T359062, where you'll find an analysis and running log of import performance on various AWS EC2 configurations, a MacBook Pro (2019, Intel-based), my desktop (2018, Intel-based), our bare metal data center servers, and Amazon Neptune. The takeaways from that analysis were:

  • Cloud virtual machines are sufficiently fast for running imports. They may be an option in a pinch.
  • Removal of CPU governor limits on data center class bare metal servers significantly improved performance.
  • Removal of CPU governor limits didn't confer an advantage on prosumer grade computers relative to data center class bare metal servers (the prosumer grade chips are already tuned for high performance), but their high grade drives, together with an increase to a Blazegraph buffer configuration variable, significantly improved performance.
  • The Amazon Neptune service was by far the fastest option for import. It's unclear, though, whether the free or near-free data ingestion observed during the free cloud credit period would extend to additional future imports. It is a viable option for imports, but requires additional architectural consideration after an import.
  • The N-Triples file format (.nt) dramatically improved import speed. It should be (and now, is) used instead of the more complicated Turtle (.ttl) format for imports.

The rest of this post focuses on how you can perform your own full Wikidata import to Blazegraph in about a week if you have a reasonably powerful desktop computer, which was one of the nicer takeaways from the analysis.

Computing configuration and initial setup

My 2018 personal gaming-class machine has a 6-core CPU (up to 4.6 GHz turbo boost) and, after several years of upgrades, 64 GB of DDR4 RAM and a 4 TB NVMe drive.

A full Wikidata graph import into Blazegraph took 5.22 days with this configuration in August 2024.

I had the benefit of pre-split N-Triples files, with no duplicate lines, produced by our Spark cluster as part of an Airflow DAG that runs weekly. If you're doing this at home without a large Spark cluster, though, you can fetch latest-all.nt.bz2 from the Wikidata dumps and run some shell commands to prepare similar files (albeit with more duplicate records, which ultimately get discarded during the Blazegraph import).

You can at present import somewhat reliably and performantly with one 4 TB NVMe internal drive and one 2 TB external (or SATA) SSD drive. In the example that follows, I assume that you have three drives, though: one 4 TB NVMe drive (let's say this is your primary drive), one external (or SATA) 4 TB SSD (that's /media/ubuntu/EXTERNAL_DRIVE in the example), and another external (or SATA) 2 TB SSD (that's /media/ubuntu/SOME_OTHER_DRIVE in the example). On my own desktop I actually do this with just one 4 TB NVMe, modifying the loadData.sh script so that it deletes each split file after it's imported to ensure I have enough disk space (rm $LOCATION/$f after the curl line); I also start the script at the same time as decompressing and splitting the compressed dump file. The decompressed dump is about 3 TB, so it's good to think through disk usage at each step.
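For illustration, the loadData.sh tweak amounts to one added line inside the script's per-file loop (the surrounding script is paraphrased here; only the rm line is the actual addition, and it only makes sense if you can regenerate the split files from the compressed dump should something go wrong):

# Inside loadData.sh's per-file loop, immediately after the existing curl
# command that loads $LOCATION/$f into Blazegraph:
rm "$LOCATION/$f"   # reclaim disk space once this split file has been imported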

The commands

Here are the commands you'll need to download the Wikidata dump, break it up into smaller files that Blazegraph can handle, and import within a reasonable timeframe.

Note that you'll need to have a copy of the logback.xml file downloaded to your home directory.

# Download some dependencies
sudo apt update
sudo apt install bzip2 git openjdk-8-jdk-headless screen 
git clone https://gerrit.wikimedia.org/r/wikidata/query/rdf
cd rdf
./mvnw package -DskipTests
sudo mkdir /var/log/wdqs
# Your username and group may differ from ubuntu:ubuntu
sudo chown ubuntu:ubuntu /var/log/wdqs
touch /var/log/wdqs/wdqs-blazegraph.log
cd dist/target/
tar xzvf service-0.3.*-SNAPSHOT-dist.tar.gz
cd service-0.3.*-SNAPSHOT/
cd /media/ubuntu/EXTERNAL_DRIVE
mkdir wd
cd wd
# Run the next multiline command at the end of your day - be sure to verify
# that your computer will stay awake. The dumps server throttles download speeds
# somewhat, so the download takes a while. And the decompression and splitting
# also take a while. There are faster ways, but this is easy enough. In case
# you were wondering, wget seems to work more reliably than other options.
# Torrents do exist for dumps, but be sure to verify their checksums against
# dumps.wikimedia.org and verify the date of a given dump.
date && \
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2 && \
date && \
bzcat latest-all.nt.bz2 | split -d --suffix-length=4 --lines=7812500 --additional-suffix='.nt' - 'wikidata_full.' && \
date
# You could technically proceed with the next set of commands while the split
# files are being produced if you're eager to get things running.
# Let's head back to where you were:
cd ~/rdf/dist/target/service-0.3.*-SNAPSHOT/
mv ~/logback.xml .
# Using runBlazegraph.sh like production, change heap from 16g to 31g and
# point to logback.xml by updating HEAP_SIZE and LOG_CONFIG to look like so,
# without the # comment symbols, of course.
#  HEAP_SIZE=${HEAP_SIZE:-"31g"}
#  LOG_CONFIG=${LOG_CONFIG:-"./logback.xml"}
vim runBlazegraph.sh
# Modify the buffer in RWStore.properties so it looks like this (1M, not 100K),
# without the # comment symbol, of course.
#   com.bigdata.rdf.sail.bufferCapacity=1000000
vim RWStore.properties
# Let's get Blazegraph running in the background.
screen
# Wait a few seconds after running the next command to ensure it's good.
./runBlazegraph.sh
# Then CTRL-a-d to leave screen session running in background.
# You can chain the following commands together with && \ if you like.
# Let's import the first file to make sure it's working (takes about 1 minute).
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 0 -e 0 -f 'wikidata_full.%04d.nt' 2>&1 | tee -a loadData.log
# If it worked, let's import another 9 files (maybe another ~10 minutes).
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 1 -e 9 -f 'wikidata_full.%04d.nt' 2>&1 | tee -a loadData.log
# Let's see how long it took to import the first ten files: just sum and then
# divide by 1000 for seconds (sum / 1000 / 60 / 60 / 24 for days).
grep COMMIT loadData.log | cut -f2 -d"=" | cut -f1 -d"m"
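# If you'd rather not sum by hand, this (illustrative) variant totals the
# commit times and prints days:
grep COMMIT loadData.log | cut -f2 -d"=" | cut -f1 -d"m" | awk '{s+=$1} END {print s/1000/60/60/24, "days"}'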
# Now let's handle the rest of the files. This could take a week or more - be
# sure to verify that your computer will stay awake.
time ./loadData.sh -n wdq -d /media/ubuntu/EXTERNAL_DRIVE/wd -s 10 -f 'wikidata_full.%04d.nt' 2>&1 | tee -a loadData.log
# Hopefully that worked. Go to http://localhost:9999/bigdata/#query and run the
# following query:
#   SELECT (count(*) as ?ct) WHERE { ?s ?p ?o }
# Is it 16B or more records? It probably worked. Celebrate!
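# You can also run the same count from the command line if you prefer
# (assuming the wdq namespace used by loadData.sh above):
curl -s -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?ct) WHERE { ?s ?p ?o }' \
  http://localhost:9999/bigdata/namespace/wdq/sparql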
# Let's close Blazegraph and make a backup of the Blazegraph journal.
screen -r
# CTRL-c to stop Blazegraph
exit
# Okay, screen session ended, let's look at the size of the file
ls -alh wikidata.jnl
cp wikidata.jnl /media/ubuntu/SOME_OTHER_DRIVE/

You'll notice that I don't take time here to make intermediate backups of the Blazegraph journal file. That's a good exercise for the reader! (A rough sketch of one approach appears near the end of this post.)

Production, in practice

We were a little surprised that my desktop could perform faster imports than what we were seeing on our data center servers. Our colleague Brian King in Data Platform SRE had a hunch, which turned out to be correct, that we could adjust the CPU governor on the production servers. This helped dramatically, and coupled with the graph split it makes recovery much faster. We don't need the buffer size configuration trick described above, but we also have that as an option should it become necessary.
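For reference, on a typical Linux machine you can inspect and change the CPU frequency scaling governor along these lines (a minimal sketch assuming the cpufreq sysfs interface is present; the exact mechanism and policy used on the production servers may differ):

# Show the current governor for each CPU
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Switch every CPU to the "performance" governor (requires root)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor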

Considerations

It would be nice to have unlimited hardware, but there are some practical constraints.

CPU: Although CPU speed increases are still being observed with each new generation of processor, much of the advance in computing has come from parallelizing computation across more cores. And although WDQS's graph database holds up relatively well in parallelizing queries across multiple cores, its data import isn't easily optimized for many-core architectures.

Memory: Although more memory is commonly beneficial to large data operations and intuitively you might expect a graph database to work better with more memory, the manner in which memory is used by running programs can drive performance in surprising ways, ranging from good to bad. WDQS runs on Java technology, and configuration of the Java heap is notoriously challenging for achieving performance without long garbage collection ("GC") pauses. We deliberately use a 31 GB heap in production for our Blazegraph instances. It's also important to remember that a large Java heap requires a lot of RAM, which can become expensive.
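A well-known JVM detail, and a common reason in general for choosing heaps just under 32 GB, is that HotSpot can only use compressed object pointers ("compressed oops") for heaps below roughly 32 GB, which keeps pointer overhead down. If you're curious whether a given heap size still gets compressed oops with the OpenJDK 8 installed earlier, a check along these lines works:

java -Xmx31g -XX:+PrintFlagsFinal -version 2>/dev/null | grep UseCompressedOops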

Nevertheless, more memory can be helpful for filesystem paging operations. Taking the hardware configuration guidance at face value suggests that an ideal server for the scale of data we have today (about 1200 GB of data and about 16 billion records) would need roughly 12 TB of memory. We're getting by with 128 GB of memory per server, far less than that. More memory would be nice, but today it's too expensive in a multi-node setup built for redundancy across multiple data centers.

Disk: NVMe drives have brought increased speed to data operations. But bottlenecks in CPU or memory can mask gains that speedier NVMe throughput might otherwise deliver. NVMe drives did show a material performance gain during testing, although presently in production we're thankfully doing okay with RAIDed data center class SSDs (6 TB). NVMe drives would most likely be an improvement in the data center in the future, but data center quality devices are priced higher, whereas prosumer grade NVMe drives for personal computers are reasonably priced; due to the risks of hardware failure we prefer to avoid prosumer grade NVMe drives in the data center.

Caveats

A few things to remember if you're running Blazegraph at home:

  • It isn't really intended to support queries with SERVICE wikibase:mwapi syntax, which uses other Wikimedia servers.
  • Beware of exposing it on the network: it doesn't have the load balancing, firewall arrangement, and other security controls of the real Wikidata Query Service.

Conclusion

After splitting the Wikidata graph database in two and removing CPU throttling for Wikidata Query Service production data center nodes, we're now able to import the WDQS database and catch Blazegraph up to the Wikidata edit stream in less than a week.

If you are looking to host your own Blazegraph database of Wikidata data without having two graph partitions (i.e., if you want the full graph in one partition), you might try the following:

  1. Get a desktop with the fastest CPU possible and acquire a speedy 4 TB NVMe plus 64 GB or more of DDR5 RAM; get a couple of larger external drives if you can, too. As of this writing, consumer grade 4 TB NVMes can be had with reasonable price-performance tradeoffs; perhaps 6 TB or 8 TB NVMes with the same level of performance will become available in the next year or two.
  2. Import using the N-Triples format split into multiple files.
  3. Consider scripting the batch import operation to make a backup copy of the graph database for every 100 files imported (see the sketch after this list). That way, if your graph database import fails at some point, you can troubleshoot and resume from the point of backup. The intermediate backups will slow things down a little, but they may save you many days in the end.
  4. If you can't build or upgrade a desktop of your own, consider using a cloud server to perform the import, then copy the produced graph database journal file to a more budget friendly computer; remember that in addition to cloud compute and storage costs, there may be data transfer costs.
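Here is one rough way to script item 3, as a sketch only: it loads the split files in batches of 100 using the same loadData.sh invocation as earlier, then copies the journal to a backup drive between batches. It assumes nothing else is writing to Blazegraph while a copy runs (stopping Blazegraph before each copy is the safer, if slower, approach), and the paths, batch size, and file count are illustrative rather than prescriptive.

#!/bin/bash
# Sketch: batch the import and snapshot the journal every 100 files.
LAST_FILE=2100   # illustrative; check how many wikidata_full.*.nt files you actually have
BATCH=100
DATA_DIR=/media/ubuntu/EXTERNAL_DRIVE/wd
BACKUP_DIR=/media/ubuntu/SOME_OTHER_DRIVE

for start in $(seq 0 $BATCH $LAST_FILE); do
  end=$((start + BATCH - 1))
  if [ "$end" -gt "$LAST_FILE" ]; then end=$LAST_FILE; fi
  ./loadData.sh -n wdq -d "$DATA_DIR" -s "$start" -e "$end" \
    -f 'wikidata_full.%04d.nt' 2>&1 | tee -a loadData.log
  # Keep only the most recent snapshot; the journal is large, so multiple
  # copies would quickly exhaust the backup drive. Assumes Blazegraph is idle
  # (or stopped) while the copy runs.
  cp wikidata.jnl "$BACKUP_DIR/wikidata.jnl.bak"
done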

As you saw above, there are a few configuration variables that you need to update in order to speed an import along.

Acknowledgments

I'd like to thank the wonderful colleagues in Search Platform, Data Platform SRE, Infrastructure Foundations, Traffic, and Data Center Operations for their solid work on the graph split and, more specifically for this post, for their support in exploring opportunities to improve performance. I'd like to especially express my gratitude to David Causse (WDQS tech lead and systems thinker), Peter Fischer (Flink-Kafka graph splitter extraordinaire), Erik Bernhardson (thank you for the Airflow environment niceties), Ryan Kemper & Brian King & Stephen Munene & Cathal Mooney (cookbook puppeteers who balance networks and servers with a keyboard), Andrew McAllister (thank you for your analysis of query patterns!), and Willy Pao & Rob Halsell & Sukhbir Singh (thank you for helping investigate NVMe options). As ever, major thanks to Guillaume Lederrey, Luca Martinelli, and Lydia Pintscher (WMDE) for partnership in WDQS and Wikidata and its amazing community.