SQL/XML Dumps/Puppet for dumps maintainers

Introduction to puppet

This is a companion document to the slides from the Puppet for dumps maintainers presentation on December 9, 2020. The slides and speaker notes from the presentation cover much more ground, including linting, testing, and facter, but they don't look at code examples in any detail.

Basic puppet syntax, classes

Let's have a thorough look at the file modules/dumps/manifests/generation/server/dirs.pp ([1]).

Files of puppet code are called "manifests", so we'll call them that too.


We start off with the class definition. Generally classes are defined in files of the same name. The class is applied to internal dumpsdata NFS servers and to the public facing labstore dumps servers.

All arguments to the class must be explicitly given a default, even if that default is "undef". This class creates some directories where dumps output and html files will reside. We don't want to create anything by hand; puppet should do it all for us.



class dumps::generation::server::dirs(
    $datadir         = undef,
    $xmldumpsdir     = undef,
    $tempdir         = undef,
    $miscdatasetsdir = undef,
    $user            = undef,
    $group           = undef,


The class dumps::server_dirs is defined elsewhere, in its own manifest, but we declare it here with specific values for its arguments, taken from the arguments passed in to us.

Why not just hard-code the directory paths? Well, for one thing, on the labstore boxes they are different than on the dumpsdata boxes, because each of those groups of servers is set up to meet different needs.

Additionally, the locations of some directories have changed in the past, as code was simplified, etc. Imagine having to find every location in the puppet manifests where we have hardcoded the path to the root of the dumps output filesystem! (Narrator: I did this. Never again.)


) {
    class {'dumps::server_dirs':
        datadir         => $datadir,
        xmldumpsdir     => $xmldumpsdir,
        miscdatasetsdir => $miscdatasetsdir,
        user            => $user,
        group           => $group,
    }

    # Directories where dumps of any type are generated
    # This list is not for one-off directories, nor for
    # directories with incoming rsyncs of datasets
    $cirrussearchdir              = "${miscdatasetsdir}/cirrussearch"
    $xlationdir                   = "${miscdatasetsdir}/contenttranslation"
    $categoriesrdfdir             = "${miscdatasetsdir}/categoriesrdf"
    $categoriesrdfdailydir        = "${miscdatasetsdir}/categoriesrdf/daily"
    $globalblocksdir              = "${miscdatasetsdir}/globalblocks"
    $medialistsdir                = "${miscdatasetsdir}/imageinfo"
    $incrsdir                     = "${miscdatasetsdir}/incr"


The next line defines a variable for where the machine vision dump output will live. Notice that all variables begin with a $ and that if we want to use a variable within a string, we should double-quote the string and put curly braces {} around the variable name.

Note also that all of the equals signs are lined up. We use puppet-lint ([2]) in our production CI environment, and you'll want to use it too, to save yourself a lot of wailing and moaning from jenkins.


    $machinevisiondir             = "${miscdatasetsdir}/machinevision"
    $mediatitlesdir               = "${miscdatasetsdir}/mediatitles"
    $pagetitlesdir                = "${miscdatasetsdir}/pagetitles"
    $shorturlsdir                 = "${miscdatasetsdir}/shorturls"
    $otherwikibasedir             = "${miscdatasetsdir}/wikibase"
    $otherwikibasewikidatadir     = "${miscdatasetsdir}/wikibase/wikidatawiki"
    $otherwikidatadir             = "${miscdatasetsdir}/wikidata"


This is how files are declared. Files are one of many types of "resources" defined natively in puppet. You can also define your own resource types, should you need to.

Each instance of a resource has a unique name associated with it. In this case, the name is defined implicitly to be the value of the variable $tempdir. Only one resource with a given name can be defined in the manifests applied to a given server.


    # top level directories for various dumps/datasets, on generation hosts only
    file { $tempdir:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }


Note that you can pass a list of filenames to a resource declaration; each file is then managed as a separate resource, using the same attributes.


    # subdirs for various generated dumps
    file { [ $cirrussearchdir, $xlationdir, $categoriesrdfdir,
        $categoriesrdfdailydir, $globalblocksdir, $medialistsdir, $incrsdir,
        $mediatitlesdir, $pagetitlesdir, $shorturlsdir, $machinevisiondir ]:

        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }

    # needed for wikidata weekly crons
    file { [ $otherwikibasedir, $otherwikibasewikidatadir, $otherwikidatadir ]:
        ensure => 'directory',
        mode   => '0755',
        owner  => $user,
        group  => $group,
    }
}
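
The main ideas in this manifest (variables interpolated into double-quoted strings, and one file declaration covering a list of titles) can be tried out in isolation. What follows is a minimal sketch, not the production code: the class name example_dirs and the paths under /tmp/example_datasets are made up for illustration.

```puppet
# Hypothetical class and paths, for illustration only.
class example_dirs (
    String $basedir = '/tmp/example_datasets',
) {
    # Double quotes plus ${} interpolate the variable into the string.
    $pagetitlesdir  = "${basedir}/pagetitles"
    $mediatitlesdir = "${basedir}/mediatitles"

    # One declaration, three separate file resources (one per title).
    # Each title must be unique across everything applied to the host.
    file { [ $basedir, $pagetitlesdir, $mediatitlesdir ]:
        ensure => 'directory',
        mode   => '0755',
    }
}

include example_dirs
```

Saved to a file, a sketch like this can be dry-run on a scratch machine with puppet apply --noop before anything is actually created.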

Not too bad, was it? Next let's look at resources a bit more.

Resources in puppet

We'll have a look at the cron jobs for one of the "other" dumps, and in particular the job that generates lists of titles of articles in the main space for each wiki once a day: modules/snapshot/manifests/systemdjobs/pagetitles.pp ([3]).


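Cron jobs are a type of resource. In the manifest shown here the jobs are declared with systemd::timer::job, which fills the same role as a cron resource, so we will be defining a class that declares one such job resource for each of these dumps.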

All dumps jobs run as a specific user, and these are no exception. Yes, that user too has changed over time, and while I expect it not to change in the future, it's still best practice not to hardcode it into the manifests.


class snapshot::systemdjobs::pagetitles(
    $user      = undef,
    $filesonly = false,
) {
    $systemdjobsdir = $snapshot::dumps::dirs::systemdjobsdir
    $repodir = $snapshot::dumps::dirs::repodir
    $confsdir = $snapshot::dumps::dirs::confsdir


We specify a MAILTO environment variable so that all output created by these jobs is sent to the specific email alias. We don't want to silence error output because then we won't know if the job fails, but we don't want to spam all of root (= all of SRE) with it either, which is what happens by default.


    if !$filesonly {
        systemd::timer::job { 'pagetitles-ns0':
            ensure             => 'present',
            description        => 'Regular jobs to build snapshot of page titles of main namespace',
            user               => $user,
            monitoring_enabled => false,
            send_mail          => true,
            environment        => {'MAILTO' => 'ops-dumps@wikimedia.org'},
            working_directory  => $repodir,
            command            => "/usr/bin/python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-titles-in-ns-0.gz' --outdir '${systemdjobsdir}/pagetitles/{d}' --query \"'select page_title from page where page_namespace=0;'\"",
            interval           => {'start' => 'OnCalendar', 'interval' => '*-*-* 8:10:0'},
        }


This job produces a list of all media files on each wiki, also on a daily basis.

Note the ensure => 'present' in both of these stanzas. This is how we tell puppet we want the cron job to be there if it's missing. We could also ask puppet to make sure it's gone by saying ensure => 'absent'. But doing this sort of cleanup with puppet is often more trouble than it's worth. For more complicated restructuring you may want to use a mixture of puppet absenting and manual checks and cleanup, to be sure you've got the right results.


        systemd::timer::job { 'pagetitles-ns6':
            ensure             => present,
            description        => 'Regular jobs to build snapshot of page titles of file namespace',
            user               => $user,
            monitoring_enabled => false,
            send_mail          => true,
            environment        => {'MAILTO' => 'ops-dumps@wikimedia.org'},
            working_directory  => $repodir,
            command            => "/usr/bin/python3 onallwikis.py --configfile ${confsdir}/wikidump.conf.dumps:monitor  --filenameformat '{w}-{d}-all-media-titles.gz' --outdir '${systemdjobsdir}/mediatitles/{d}' --query \"'select page_title from page where page_namespace=6;'\"",
            interval           => {'start' => 'OnCalendar', 'interval' => '*-*-* 8:50:0'},
        }
    }
}
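
The ensure => 'absent' approach mentioned above can be sketched with a plain file resource. This is only an illustration under assumed names: the path is hypothetical, and the same idea applies to any resource type that supports ensure.

```puppet
# Hypothetical path, for illustration only.

# Step 1: while the file is wanted, the resource says 'present':
#
# file { '/tmp/example_datasets/old_titles.txt':
#     ensure  => 'present',
#     content => "placeholder\n",
# }

# Step 2: to retire it, keep the resource in the manifest but flip ensure.
# Simply deleting the resource from the manifest would leave the file
# in place on every host that already has it.
file { '/tmp/example_datasets/old_titles.txt':
    ensure => 'absent',
}
```

Once puppet has run on every affected host, the resource itself can be dropped from the manifest in a follow-up change.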

Roles, profiles and classes

Each server gets one role. Each role is built from profiles (just another collection of puppet manifests). And each profile is built from classes from various modules, and sometimes from other profiles.
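
The layering described above might be sketched as follows. Every name in this block is hypothetical and chosen only to show the shape of the convention: a module class does the work, a profile wires module classes together, and a role includes profiles.

```puppet
# All names below are hypothetical; none of them exist in the real repo.

# A module class: does one narrowly scoped piece of work.
class example_dumps::workdir (
    String $path,
    String $owner,
) {
    file { $path:
        ensure => 'directory',
        owner  => $owner,
        mode   => '0755',
    }
}

# A profile: declares module classes and supplies their parameters.
class profile::example::worker (
    String $owner = 'root',
) {
    class { 'example_dumps::workdir':
        path  => '/tmp/example_workdir',
        owner => $owner,
    }
}

# A role: only includes profiles. Each server gets exactly one role.
class role::example::worker {
    include profile::example::worker
}
```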

Roles

Let's have a look. We'll start with the dumper role modules/role/manifests/dumps/generation/worker/dumper.pp ([4]), which is applied to snapshot hosts that just run xml/sql dumps and nothing else.


These first two profiles are included on (almost) all production hosts. They set up basic firewall rules, make sure prometheus is running so we can monitor the host, set up system users and, hey, users for us so we can ssh in, and stuff like that.



class role::dumps::generation::worker::dumper {
    include ::profile::standard
    include ::profile::base::firewall


This is the good stuff right here. These profiles get the work done for dumps snapshot hosts.

You might wonder what that crontester profile is; it puts all the files in place for "other" dumps but doesn't actually set up the related cron jobs, so that if someone wants to test those manually when a host is idle, they can.



    include profile::dumps::generation::worker::common
    include profile::dumps::generation::worker::dumper
    include profile::dumps::generation::worker::crontester


The system role thing here basically sets up a MOTD message so that the host tells you what its purpose is when you log in. Every role should have one so that every server advertises what it does. Nice!



    system::role { 'dumps::generation::worker::dumper':
        description => 'dumper of XML/SQL wiki content',
    }
}

Profiles

Let's have a closer look at the common profile modules/profile/manifests/dumps/generation/worker/common.pp ([5]), since it's used in all the roles.

What does every dump worker need? Well, it needs MediaWiki to be set up, of course. MediaWiki installations have a few requirements of their own, but we don't want the full-fledged installation that goes on every app server, since we're not actually serving any web requests from these hosts. So we choose a few "lower-level" MediaWiki profiles and pull them in.


The lookup() call is a reference to puppet's settings for servers and clusters, called hiera. These settings are stored in a directory tree as yaml files, and there are some very specific rules about which directories will be checked for your setting depending on the profile from which the lookup is done ([6]).

This is also the ONE PLACE we set up the dumps user. One place where it's hardcoded.


class profile::dumps::generation::worker::common(
    $dumps_nfs_server = lookup('dumps_nfs_server'),
    $cron_nfs_server = lookup('dumps_cron_nfs_server'),
    $managed_subdirs = lookup('dumps_managed_subdirs'),
    $datadir_mount_type = lookup('dumps_datadir_mount_type'),
    $extra_mountopts = lookup('profile::dumps::generation::worker::common::nfs_extra_mountopts'),
    $php = lookup('profile::dumps::generation::worker::common::php'),
    $dumps_misc_cronrunner = lookup('profile::dumps::generation::worker::common::dumps_misc_cronrunner'),
) {
    # mw packages and dependencies
    require profile::mediawiki::scap_proxy
    require profile::mediawiki::common
    require profile::mediawiki::nutcracker
    class { 'profile::mediawiki::mcrouter_wancache':
        prometheus_exporter => false
    }
    require profile::services_proxy::envoy

    $xmldumpsmount = '/mnt/dumpsdata'

    class { '::dumpsuser': }


If the host is running the non-xml/sql dumps, i.e. all the "other" dumps, it writes to a different NFS server than the hosts running xml/sql dumps. Splitting up the work means we can split up rsyncs and iops, and manage security reboots on better schedules.



    if ($dumps_misc_cronrunner) {
        $nfs_server = $cron_nfs_server
    } else {
        $nfs_server = $dumps_nfs_server
    }

    snapshot::dumps::datamount { 'dumpsdatamount':
        mountpoint      => $xmldumpsmount,
        mount_type      => $datadir_mount_type,
        extra_mountopts => $extra_mountopts,
        server          => $nfs_server,
        managed_subdirs => $managed_subdirs,
        user            => 'dumpsgen',
        group           => 'dumpsgen',
    }


We create a file which can be sourced by bash scripts, containing paths for the more important directories, so that people writing dumps jobs don't have to hardcode these paths either. Note how we get most of our paths from args passed in; only /srv/mediawiki is hardcoded here. If that gets moved one day a lot of stuff is going to break :-D



    # dataset server config files,
    # stages files, dblists, html templates
    class { '::snapshot::dumps::dirs':
        user               => 'dumpsgen',
        xmldumpsmount      => $xmldumpsmount,
        xmldumpspublicdir  => "${xmldumpsmount}/xmldatadumps/public",
        xmldumpsprivatedir => "${xmldumpsmount}/xmldatadumps/private",
        dumpstempdir       => "${xmldumpsmount}/xmldatadumps/temp",
        cronsdir           => "${xmldumpsmount}/otherdumps",
        apachedir          => '/srv/mediawiki',
    }


Here's the code that actually does all the xml/sql dumps setup: config files, cron jobs, everything. Note that we can pass in the value of php so that on the testbed, for example, we can switch to a new version early for testing, leaving the production servers alone.

We also set up scap for deployment of the dumps repo. The scap setup has a few moving pieces, so before you change it for some reason, poke someone who's done it before.



    class { '::snapshot::dumps': php => $php}

    # scap3 deployment of dump scripts
    scap::target { 'dumps/dumps':
        deploy_user => 'dumpsgen',
        manage_user => false,
        key_name    => 'dumpsdeploy',
    }
    ssh::userkey { 'dumpsgen':
        content => secret('keyholder/dumpsdeploy.pub'),
    }
}
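
The two mechanisms this profile relies on, hiera lookups in the parameter list and a conditional that picks one value for later use, can be sketched on their own. The profile name and hiera keys below are hypothetical. The default_value option is an addition made here so that the sketch runs without any hiera data; the real profile does not supply defaults in its lookups.

```puppet
# Hypothetical profile and hiera keys, for illustration only.
class profile::example::mount (
    String  $nfs_server      = lookup('example_nfs_server', {'default_value' => 'nfs1.example.org'}),
    String  $misc_nfs_server = lookup('example_misc_nfs_server', {'default_value' => 'nfs2.example.org'}),
    Boolean $misc_runner     = lookup('example_misc_runner', {'default_value' => false}),
) {
    # Pick the server once, then use the result everywhere below.
    if $misc_runner {
        $server = $misc_nfs_server
    } else {
        $server = $nfs_server
    }

    # Stand-in for the real mount resource.
    notice("would mount from ${server}")
}

include profile::example::mount
```

Keeping the lookups in the profile's parameter list means the classes it declares never need to know that hiera exists; they just receive plain arguments.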

Puppet repo layout

We've talked about roles and profiles and classes, but where does all of this live? Why, in our puppet repo, of course. It's available for public checkout ([7]) and has "production" as its main branch, so you'll want to make sure that's the one checked out.

Our top-level manifest, which declares all the hosts and assigns roles to them, is in manifests/site.pp ([8]). But everything else is a module living somewhere in the modules directory.

If you look at any directory under modules, you'll see the same layout with the following three subdirectories: files, manifests, and templates. Look for yourself: [9].

  • Files are content for file resources that don't get any variable interpolation. They just get plopped right onto the server as is, in the location and with the permissions you specify.
  • Manifests are puppet code, and we've seen some examples of that already.
  • Templates are content for file resources with little bits of ruby code in them, which will be evaluated to shove values from variables and so on in them before they are written out. We'll look at an example later.

Now you may be wondering where profiles live. Remember that profiles are a sort of thing we made up; they are a nice convention but the name "profile" doesn't have any special meaning to puppet. There is just a module that we maintain called "profile" ([10]), and it has manifests, files and templates like any other module.

The same is true of roles. There is a "role" module with files, manifests and templates in it ([11]). It's how we use these modules that makes them special.
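
The difference between the files and templates subdirectories shows up in how a file resource refers to them. The sketch below assumes a hypothetical module called example, with hypothetical file names; only the source and content attributes and the template() function are standard puppet.

```puppet
# Hypothetical module 'example' with this layout:
#   modules/example/files/motd.txt
#   modules/example/templates/settings.conf.erb
#   modules/example/manifests/init.pp   (this class)
class example (
    String $tempdir = '/tmp/example',
) {
    # Static content: copied verbatim from the module's files directory.
    # Note that 'files' does not appear in the puppet:/// URL.
    file { '/etc/example-motd.txt':
        ensure => 'present',
        mode   => '0444',
        source => 'puppet:///modules/example/motd.txt',
    }

    # Templated content: the .erb file is evaluated first, so it can
    # refer to $tempdir as @tempdir. 'templates' is likewise implied.
    file { '/etc/example-settings.conf':
        ensure  => 'present',
        mode    => '0444',
        content => template('example/settings.conf.erb'),
    }
}
```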

Puppet Templates: basic syntax

The easiest way to understand how they work is to look at an example. So, here we go. This is the template used to generate configuration files for the "misc" (not xml/sql) dumps: modules/snapshot/templates/wikidump.conf.other.erb ([12]).


The configuration files are formatted as ini files, which is what the standard python configuration module likes. Each group of settings falls into a specific section, so for example the first group is under the section "wiki".


# This file is maintained by puppet!
# modules/snapshot/templates/wikidump.conf.other.erb

# minimal config file with common settings used by 'misc' dumps
# i.e. not xml/sql dumps, not adds/changes dumps

dblist=<%= @configvals['global']['dblist'] %>
privatelist=<%= scope.lookupvar('snapshot::dumps::dirs::apachedir') -%>/dblists/private.dblist
multiversion=<%= scope.lookupvar('snapshot::dumps::dirs::apachedir') -%>/multiversion


The lookupvar calls here retrieve the value of the corresponding variable, given by its full name in the puppet namespace. The variable must have been set by or available to the manifest that processes this template. The <%= and %> are indicators to puppet that the stuff in between is ruby to be interpreted and the results inserted. A - immediately before the closing %> tells puppet to drop the newline that follows the tag, so nothing is added after the results of the interpolation.

Note how here we can specify the version of php we want to use depending on what server with what role and what profiles gets a copy of the config file from this template.


temp=<%= scope.lookupvar('snapshot::cron::configure::tempdir') %>

php=<%= scope.lookupvar('snapshot::cron::configure::php') %>


The configvals variable is an associative array which currently has one key global for values that apply to all jobs, and another key wikidata which contains some key/value pairs that apply only to wikidata jobs.

First we check to make sure there really is a wikidata key, and then we go through the key/value pairs it has, sort them by key, and for each one, write the key name and value to the output.

Output from a template is stored in a file by declaring a file resource with content derived from the specific template.



<% if @configvals.has_key?('wikidata') -%>
# specific settings for wikidata entity dumps
<% @configvals['wikidata'].keys.sort.each do |wdsetting| -%>
<%= wdsetting %>=<%= @configvals['wikidata'][wdsetting] %>
<% end -%>
<% end -%>
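
The same template constructs can be experimented with from a standalone manifest by using inline_template, which evaluates ERB passed as a string instead of reading a file from a module. This is a sketch under assumed data: the hash contents are invented, and only the structure (a global key and an optional wikidata key) follows the description above.

```puppet
# Invented values; only the shape of the hash matches the real one.
$configvals = {
    'global'   => { 'dblist' => '/tmp/all.dblist' },
    'wikidata' => { 'shards' => '8', 'batchsize' => '250' },
}

# A heredoc keeps the ERB readable; @configvals inside the template
# refers to the puppet variable $configvals in the current scope.
$template = @(END)
dblist=<%= @configvals['global']['dblist'] %>
<% if @configvals.has_key?('wikidata') -%>
# specific settings for wikidata entity dumps
<% @configvals['wikidata'].keys.sort.each do |wdsetting| -%>
<%= wdsetting %>=<%= @configvals['wikidata'][wdsetting] %>
<% end -%>
<% end -%>
| END

notice(inline_template($template))
```

Because the keys are sorted before the loop, the batchsize line should come out ahead of the shards line no matter how the hash was written in the manifest.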

Puppet Templates: usage

Now that we have a template, how do we use it? This was alluded to briefly above, but let's look at an actual invocation. You can check the version of the puppet manifest modules/snapshot/manifests/cron/configfile.pp used for this document ([13]) if you like.


Note that this stanza starts not with "class" but with "define". What does this mean? We are defining our own resource, built out of a file resource and some variables.


define snapshot::cron::configfile(
    $configvals = undef,
    ) {
    $confsdir = $snapshot::dumps::dirs::confsdir


The name of the file, i.e. the full path to it, will be ${confsdir}/${title}, whatever that is. Permissions, group and owner are all spelled out explicitly; this is good to do rather than relying on puppet defaults, because those defaults have changed in the past. That was exciting!

The last line of the file resource declaration tells puppet where to get the content for the file. That .erb on the end of the name is a surefire giveaway that we're talking about a template, and templates live in a special subdirectory of each puppet module, called, funnily enough, "templates".


    file { "${confsdir}/${title}":
        ensure  => 'present',
        path    => "${confsdir}/${title}",
        mode    => '0755',
        owner   => 'root',
        group   => 'root',
        content => template('snapshot/wikidump.conf.other.erb'),
    }
}
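
What makes a define different from a class is that it can be declared more than once, with a different title each time. The following sketch uses a hypothetical defined type and hypothetical paths, and writes a fixed string instead of rendering a template, to keep it self-contained.

```puppet
# Hypothetical defined type and paths, for illustration only.
define example::configfile (
    String $confsdir,
) {
    # $title is whatever name the caller used when declaring the resource.
    file { "${confsdir}/${title}":
        ensure  => 'present',
        mode    => '0644',
        content => "# generated for ${title}\n",
    }
}

# The directory is declared once, outside the define; declaring it inside
# would produce a duplicate resource on the second use below.
file { '/tmp/example_confs':
    ensure => 'directory',
}

# Two instances of the same defined type, told apart by their titles.
example::configfile { 'wikidump.conf.one':
    confsdir => '/tmp/example_confs',
}

example::configfile { 'wikidump.conf.two':
    confsdir => '/tmp/example_confs',
}
```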