SQL/XML Dumps/"Other" dump jobs via cron

“Other” dump jobs via cron

So far we’ve looked at sql/xml dumps and at Wikibase dumps for wikidata entities. Now let’s look at how other dumps are generated, and how we might use this approach for the watchlist dumps, if we decided not to include them in the sql/xml dumps but to run them separately on a weekly basis.

This isn’t as far-fetched as you might think. The sql output in our sql/xml dumps is typically a dump of an entire table via mysqldump, which can be loaded wholesale for import into a new wiki. The watchlist table dump, while producing data from a table for each wiki, is in a special format lending itself to use as a dataset rather than as a potential import. And most such files are produced in the manner described below.

Puppet manifests for dumps: layout

Everything about the various dumps is puppetized, and so are these. They run from a dedicated snapshot host, and the role applied to that host is role::dumps::generation::worker::dumper_misc_crons_only

In general, we have roles for dumps generation workers, which are the snapshot hosts that run the python and MediaWiki scripts, and for dumps generation servers, which are the dumpsdata NFS servers.

The profiles break down the same way; there are profiles for the workers, with the prefix profile::dumps::generation::worker, and profiles for the NFS servers, with the prefix, can you guess? That’s right, profile::dumps::generation::server. All of the worker hosts have the profile profile::dumps::generation::worker::common in common, and then they have a secondary profile according to what they run. In our case, the snapshot host we are interested in has the profile profile::dumps::generation::worker::cronrunner for running these “other” dumps via cron.

If we drill down into that, we can see that the one class it applies is snapshot::cron and you notice right away a difference in the naming scheme. At the class level, workers get classes starting with the name “snapshot”, and the NFS servers get classes with the name “dumps”.

Note that other servers also get classes with the name “dumps”, including the public facing web server and the server that supplies NFS service of the dumps to stat1007.

Digging into the cron class

If you look at the beginning of the code block of the cron class, you’ll see the creation of the file /usr/local/etc/dump_functions.sh. You might remember this file from our discussion of the Wikibase dumps. These functions are there for you to use so that you know where to write output, among other things.

After that we have each cron job listed in random order. Please don’t snark about that too much. Some of these are quite complex jobs, like the Wikibase ones, and some are quite simple. Let’s look at a simple case.

A simple “other” dump: shorturls

snapshot::cron::shorturls ([1]) is a tiny little class for a tiny self-contained dump. Let’s have a look.

We know where the config file for dumps is located, thanks to dump_functions.sh ([2]), we know where the dumps repo is located for the same reason, and lastly we know where output is to be written (“cronsdir”).

After we have those values, it’s a simple one-liner to set up the command to cron. This uses onallwikis.py ([3]), a dumps-specific script that is convenient for cases like these.

class snapshot::cron::shorturls(
    $user      = undef,
    $filesonly = false,
) {
    $cronsdir = $snapshot::dumps::dirs::cronsdir
    $repodir = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir

    if !$filesonly {
        cron { 'shorturls':
            ensure      => 'present',
            environment => 'MAILTO=ops-dumps@wikimedia.org',
            user        => $user,
            command     => "cd ${repodir}; python3 onallwikis.py --wiki metawiki --configfile ${confsdir}/wikidump.conf.dumps:monitor --filenameformat 'shorturls-{d}.gz' --outdir '${cronsdir}/shorturls'  --script extensions/UrlShortener/maintenance/dumpURLs.php 'compress.zlib://{DIR}'",
            minute      => '5',
            hour        => '8',
            weekday     => '1',
        }
    }
}

Because shorturls are the same everywhere, we can just ask for them on one wiki, in this case metawiki. But if we wanted to run a script across all wikis, we can do that just as easily by specifying --outdir '${cronsdir}/mediatitles/{d}' and --filenameformat '{w}-{d}-watchlist.tsv.gz', and adjusting the watchlist maintenance script so that it writes compressed gz output files, just as the shorturls maintenance script dumpUrls.php ([4]) does.

The watchlist script does or will take an output file parameter, so we can make sure that parameter has the string {FILE} in it, to pick up the filename from the onallwikis script, and that the output directory parameter to the MW maintenance script has the string {DIR} in it to pick up the output directory from the onallwikis script.

Note that onallwikis.py has built-in retry logic if a dump on some wiki fails, as well as emergency cutoff if too many wikis fail the dump. No need to add that to your maintenance script!

Cleaning up old dumps

We don’t keep dumps forever. Disks are cheap but they still take power and space, and rack space is NOT cheap. So at some point we delete old dumps of all kinds to make way for more recent ones.

In the case of “other” dumps run from cron, there’s a class for this: dumps::web::cleanups::miscdumps ([5]). From the name, you can tell it must run on the dumpsdata and/or web server hosts, and not on the snapshots. This is correct! Each host that gets copies of the dump output files also cleans up old ones on its own. We could rsync --delete but that can be very dangerous when it goes wrong. This is the safe and easily configurable approach.

Let’s have a look.

You don’t need to know the details of the cleanup_old_miscdumps.sh script ([6]). It cleans up old dumps from the various directories, using a config file to tell it how many of which type to keep. This number is different on the dumpsdata servers, which have smaller disks and only need to keep them long enough to rsync them out to the public-facing hosts, than on the web servers, which should keep lots of them for download.

class dumps::web::cleanups::miscdumps(
    $isreplica = undef,
    $miscdumpsdir = undef,
) {
    file { '/usr/local/bin/cleanup_old_miscdumps.sh':
        ensure => 'directory',
        path   => '/usr/local/bin/cleanup_old_miscdumps.sh',
        mode   => '0755',
        owner  => 'root',
        group  => 'root',
        source => 'puppet:///modules/dumps/web/cleanups/cleanup_old_miscdumps.sh',
    }

Here’s the config piece, and you’ll be adding to it, giving the name of the directory where your particular flavour of “other” dumps are written:

    $keep_generator=['categoriesrdf:3', 'categoriesrdf/daily:3', 'cirrussearch:3', 'contenttranslation:3', 'globalblocks:3', 'imageinfo:3', 'machinevision:3', 'mediatitles:3', 'pagetitles:3',
 'shorturls:3', 'wikibase/wikidatawiki:3']
    $keep_replicas=['categoriesrdf:11', 'categoriesrdf/daily:15', 'cirrussearch:11', 'contenttranslation:14', 'globalblocks:13', 'imageinfo:32', 'machinevision:13', 'mediatitles:90', 'pagetitles:90', 'shorturls:7', 'wikibase/wikidatawiki:20']

Replicas are the public-facing servers. Why aren’t the fallback dumpsdata servers considered replicas? Because they can become the main NFS server for generation at any minute, so they should have the same data as the current dumpsdata generator.

You can see below where the config file gets written for use by the cleanup script. If you have access to the dumpsdata servers, you can have a look at one. Not much to it.

    if ($isreplica == true) {
        $content= join($keep_replicas, "\n")
    } else {
        $content= join($keep_generator, "\n")
    }

    file { '/etc/dumps/confs/cleanup_misc.conf':
        ensure  => 'present',
        path    => '/etc/dumps/confs/cleanup_misc.conf',
        mode    => '0755',
        owner   => 'root',
        group   => 'root',
        content => "${content}\n"
    }

    $cleanup_miscdumps = "/bin/bash /usr/local/bin/cleanup_old_miscdumps.sh --miscdumpsdir ${miscdumpsdir} --configfile /etc/dumps/confs/cleanup_misc.conf"

This bit is for the adds-changes dumps, you can ignore it.

    if ($isreplica == true) {
        $addschanges_keeps = '40'
    } else {
        $addschanges_keeps = '7'
    }

    # adds-changes dumps cleanup; these are in incr/wikiname/YYYYMMDD for each day, so they can't go into the above config/cron setup
    $cleanup_addschanges = "find ${miscdumpsdir}/incr -mindepth 2 -maxdepth 2 -type d -mtime +${addschanges_keeps} -exec rm -rf {} \\;"

And here’s the actual cron job itself. Easy peasy. Note the require making sure that the script itself is in place before the cron job gets added, and the MAILTO environment variable so that only people on the ops-dumps alias get spammed in case of error output, and not all of the SRE team.

    cron { 'cleanup-misc-dumps':
        ensure      => 'present',
        environment => 'MAILTO=ops-dumps@wikimedia.org',
        command     => "${cleanup_miscdumps} ; ${cleanup_addschanges}",
        user        => root,
        minute      => '15',
        hour        => '7',
        require     => File['/usr/local/bin/cleanup_old_miscdumps.sh'],
    }
}

The output directory though…

The top-level output directory (like e.g. “watchlists”) should be either created by your script, which is maybe not the best thing in case your script gets confused and tries to create a directory in some weird place, or you can have it created by puppet.

Yes, there’s a puppet manifest for that! dumps::generation::server::dirs ([7]) is the place to add your new dir. Define it in the list of declarations of dirs, and then add it to the stanza commented with “subdirs for various generated dumps”, and it will be automatically generated for you.

But the downloaders!

Your downloaders need to know how to find your shiny new dumps. Otherwise the whole exercise is a bit like navel-gazing. There’s an index.html file where you can add an entry ([8]) This file is long overdue for restructuring, but until then, just add your entry somewhere in the list, probably not at the top though. Don’t add an index.html file in the subdirectory where your datasets go, unless you are prepared to generate it dynamically with links to each file as it’s created.

Rsyncing?

No need to worry about this. We rsync everything in the root of the “other” dumps tree on the dumpsdata NFS server out to the web servers, so if you write it, it will appear. Just make sure not to store some private data over there temporarily, thinking it will remain private!