SQL/XML Dumps/Running a dump job

Running dump jobs

At some point you actually want to run one or more dump jobs, for testing if nothing else. We’ve talked about the list of jobs that is assembled, how just the jobs requested to run are marked to run, and how a given job runs. Today, let’s look at the worker.py script ([1]) that is run from the command line, along with all of its options.

worker.py

This, like all python dump scripts, is a python3 script. Because debian stretch has python 2 as the default, we will have to invoke python3 explicitly:

ariel@snapshot1009:/srv/deployment/dumps/dumps/xmldumps-backup$ python3 ./worker.py --help

We won’t discuss all of the options here, just the most useful ones.

def usage(message=None):
    if message:
        sys.stderr.write("%s\n" % message)
    usage_text = """Usage: python3 worker.py [options] [wikidbname]
Options: --aftercheckpoint, --checkpoint, --partnum, --configfile, --date, --job,
         --skipjobs, --addnotice, --delnotice, --force, --noprefetch,
         --prefetchdate, --nospawn, --restartfrom, --log, --cleanup, --cutoff,
         --batches, --numbatches\n")
--aftercheckpoint: Restart this job from the after specified checkpoint file, doing the
               rest of the job for the appropriate part number if parallel subjobs each
               doing one part are configured, or for the all the rest of the revisions
               if no parallel subjobs are configured;
               only for jobs articlesdump, metacurrentdump, metahistorybz2dump.
--checkpoint:  Specify the name of the checkpoint file to rerun (requires --job,
               depending on the file this may imply --partnum)
--partnum:     Specify the number of the part to rerun (use with a specific job
               to rerun, only if parallel jobs (parts) are enabled).

configfile: unsurprisingly, the path to the dumps config file. It is optionally followed by a colon and the name of a config section with extra settings. Known names in production are: bigwikis (for wikis like dewiki, frwiki, itwiki and so on), wd (wikidata), and en (enwiki). So you might run python3 ./worker.py --job some-job-here --date 20201201 --configfile /etc/dumps/confs/wikidump.conf.dumps:wd for example.

''' ...(snipped)
--configfile:  Specify an alternative configuration file to read.
               Default config file name: wikidump.conf
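To make the path:section convention concrete, here is a minimal sketch of how such an argument could be split and how an override section could take priority over the regular one. The helper names split_config_arg and get_setting are made up for this page, and the sample config is a trimmed-down illustration; the real config reader in the dumps repo may well do this differently.

```python
import configparser


def split_config_arg(arg):
    """Split 'path[:section]' into (path, section or None)."""
    path, sep, section = arg.partition(":")
    return path, (section if sep and section else None)


def get_setting(conf, name, default_section, override_section=None):
    """Prefer the value from the override section, else use the default section."""
    if override_section and conf.has_option(override_section, name):
        return conf.get(override_section, name)
    return conf.get(default_section, name)


# hypothetical, trimmed-down config text for illustration only
SAMPLE = """
[chunks]
chunksEnabled=0

[bigwikis]
chunksEnabled=1
"""

conf = configparser.ConfigParser()
conf.read_string(SAMPLE)
path, section = split_config_arg("/etc/dumps/confs/wikidump.conf.dumps:bigwikis")
print(path, section, get_setting(conf, "chunksEnabled", "chunks", section))
```

With no section name after the path, the lookup simply falls through to the regular section.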

date: in production, we run dumps on the 1st and 20th of the month, so all jobs have a date of the form 202x0y01 or 202x0y20. If you don’t pass this option, the run will have today’s date, and new directories for output will be created, etc.

''' ...(snipped)
--date:        Rerun dump of a given date (probably unwise)
               If 'last' is given as the value, will rerun dump from last run date if any,
               or today if there has never been a previous run
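If you are rerunning something by hand and want the date of the production run currently in progress, it is the most recent 1st or 20th. A small sketch of that arithmetic follows; latest_run_date is a hypothetical helper written for this page, not something worker.py provides.

```python
from datetime import date


def latest_run_date(today):
    """Return the most recent 1st-or-20th on or before `today`, as YYYYMMDD."""
    day = 20 if today.day >= 20 else 1
    return today.replace(day=day).strftime("%Y%m%d")


print(latest_run_date(date(2020, 12, 5)))
print(latest_run_date(date(2020, 12, 25)))
```

The result is the sort of value you would pass to --date.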

Use the addnotice arg to add a file “notice.txt” in the dumps run directory for this wiki and date which will be inserted into the index.html file for that wiki and dump run. This is useful if for example there is a known problem with these specific dumps for this run and wiki.

Use the delnotice arg to remove any such file.

''' ...(snipped)
--addnotice:   Text message that will be inserted in the per-dump-run index.html
               file; use this when rerunning some job and you want to notify the
               potential downloaders of problems, for example.  This option
               remains in effect for the specified wiki and date until
               the delnotice option is given.
--delnotice:   Remove any notice that has been specified by addnotice, for
               the given wiki and date.

job: which job or jobs to run, comma-separated. If you leave the value blank, you will get a list of all known jobs. Note that if you need to rerun all the sql table dumps (yes, this has happened), you can just specify “tables” instead of each job one at a time. Default, if the option is not given at all: run everything.

''' ...(snipped)
--job:         Run just the specified step or set of steps; for the list,
               give the option --job help
               More than one job can be specified as a comma-separated list
               This option cannot be specified with --force.

You can use skipjobs to supply a comma-separated list of jobs NOT to run, in case you didn’t specify --job. Maybe you want to (re)run everything but the stubs, for example.

''' ...(snipped)
--skipjobs:    Comma separated list of jobs not to run on the wiki(s)
               give the option --job help
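The interplay of job and skipjobs boils down to "start from the requested jobs or from everything, then drop the skipped ones". Here is a rough sketch of that selection. The function jobs_to_run and the short job list are illustrative assumptions; the real script marks jobs on its own job list objects rather than returning a plain list.

```python
def jobs_to_run(all_jobs, job_arg=None, skipjobs_arg=None):
    """Return the jobs that would be marked to run, in their original order."""
    wanted = job_arg.split(",") if job_arg else list(all_jobs)
    skipped = set(skipjobs_arg.split(",")) if skipjobs_arg else set()
    unknown = [name for name in list(wanted) + sorted(skipped) if name not in all_jobs]
    if unknown:
        raise ValueError("unknown job(s): %s" % ", ".join(unknown))
    return [name for name in all_jobs if name in wanted and name not in skipped]


# a few job names taken from this page; the real list is much longer
ALL_JOBS = ["xmlstubsdump", "articlesdump", "metacurrentdump", "metahistorybz2dump"]

print(jobs_to_run(ALL_JOBS, skipjobs_arg="metahistorybz2dump"))
print(jobs_to_run(ALL_JOBS, job_arg="articlesdump,xmlstubsdump"))
```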

dryrun: the most important option in here. I use it every single time I want to run something manually. Every time, after ten years of doing these. And it still saves my butt from time to time.

Note that if you want to get the MediaWiki maintenance script command for some part of the stubs, page logs or abstracts dump jobs, you will have to pass this argument to worker.py, look at the output, choose the page range that interests you, and copy and paste the xml stub, page log or abstract script with its arguments, adding the --dryrun argument yourself, to get the final command that will run.

''' (...snipped)
--dryrun:      Don't really run the job, just print what would be done (must be used
               with a specified wikidbname on which to run
--force:       steal the lock for the specified wiki; dangerous, if there is
               another process doing a dump run for that wiki and that date.
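The general idea behind a dry run option is easy to sketch: build the exact command, and either print it or execute it. The following is a generic illustration of that pattern, not the code worker.py uses; run_command is a name invented here.

```python
import shlex
import subprocess


def run_command(argv, dryrun=False):
    """Print the command instead of running it when dryrun is set; return the command line."""
    line = " ".join(shlex.quote(arg) for arg in argv)
    if dryrun:
        print("would run:", line)
    else:
        subprocess.run(argv, check=True)
    return line


run_command(["echo", "hello world"], dryrun=True)
```

Quoting each argument means the printed line can be copied and pasted into a shell as is, which is exactly what you want when fishing out one command to rerun by hand.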

exclusive is used for every production run. With this option, the wiki is locked for this run date so that nothing else runs a job. This means that different jobs for the same wiki cannot be run at the same time...yet. But it also means that you won’t have multiple processes trying to run the same job. Trust me, that's a good thing.

''' (...snipped)
--exclusive    Even if rerunning just one job of a wiki, get a lock to make sure no other
               runners try to work on that wiki. Default: for single jobs, don't lock
--noprefetch:  Do not use a previous file's contents for speeding up the dumps
               (helpful if the previous files may have corrupt contents)
--prefetchdate:  Read page content from the dump of the specified date (YYYYMMDD)
                 and reuse for the current page content dumps.  If not specified
                 and prefetch is enabled (the default), the most recent good
                 dump will be used.
--nospawn:     Do not spawn a separate process in order to retrieve revision texts
--restartfrom: Do all jobs after the one specified via --job, including that one

The skipdone flag gets used for all production runs, so that if we retry a run, we do not rerun any jobs that completed successfully. If this flag is not passed, all jobs that can be run will be. For manual reruns you almost always want this flag.

''' (...snipped)
--skipdone:    Do only jobs that are not already successfully completed

log: write all progress and other messages to a logging facility as determined by the config file. This is done for all runs in production but when testing something or running it manually, it's not required.

''' (...snipped)
--log:         Log progress messages and other output to logfile in addition to
               the usual console output

The cutoff option is a bit odd. Basically, you provide a date in YYYYMMDD format, and get the name of the next wiki that has no dump run for the specified job(s) for that date and that has the oldest previous run. Age in this case is figured out first by the name of the dump run directory and secondly by timestamp in case of a tie. If there are no such wikis, the script exits with no output.

We want this at the beginning of each run so that we can loop through all wikis one time to create the new run directory, instead of looping forever.

''' (...snipped)
--cutoff:      Given a cutoff date in yyyymmdd format, display the next wiki for which
               dumps should be run, if its last dump was older than the cutoff date,
               and exit, or if there are no such wikis, just exit
--cleanup:     Remove all files that may already exist for the specific wiki and
               run, for the specified job or all jobs
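The selection rule for cutoff can be sketched in a few lines: keep only wikis whose last run is older than the cutoff, then pick the oldest, using the timestamp to break ties. The data structure here (a dict of wiki name to run directory name and modification time) is an assumption made for the example; the real script reads this from the dump directories.

```python
def next_wiki(last_runs, cutoff):
    """Return the wiki with the oldest last run before `cutoff`, or None.

    last_runs maps wiki name -> (last run directory name as YYYYMMDD, or ''
    if there has never been a run; modification time of that run).
    """
    candidates = [
        (run, mtime, wiki)
        for wiki, (run, mtime) in last_runs.items()
        if run < cutoff  # YYYYMMDD strings sort in date order
    ]
    if not candidates:
        return None
    # oldest directory name first, then oldest timestamp on a tie
    return min(candidates)[2]


runs = {
    "elwiki": ("20201120", 100.0),
    "frwiki": ("20201201", 50.0),
    "tawiki": ("20201120", 90.0),
}
print(next_wiki(runs, "20201201"))
```

Here frwiki already has a run for the cutoff date, and tawiki wins the tie with elwiki on its older timestamp.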

Prereqs is used in production, for loose values of "use". If the prerequisite job for any specified job is missing, run it first. Example: the articlesdump job requires the xmlstubsdump job to run first. In practice I don't know that this ever happens in production, but, it could. I guess.

''' (...snipped)
--prereqs:     If a job fails because the prereq is not done, try to do the prereq,
               a chain of up to 5 such dependencies is permitted.
--batches:     Look for a batch file, claim and run batches for an executing dump
               run until there are none left.
               In this mode the script does not update the index.html file or various
               status files. This requires the --job argument and the --date argument.
--numbatches:  If we create a batch file (we are a primary worker), or we simply
               process an existing batch file (we are a secondary worker invoked with
               --batches), claim and run only this many batches; if numbatches is 0,
               do as many as we can with no limit, until done or failure.
               If we are not either creating batches or processing them but are a
               regular nonbatched worker, this setting has no effect.
               default: 0 (do as many batches as we can until done)
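To illustrate the prereqs option and its limit of 5 chained dependencies, here is a simplified sketch that works out the order up front. Note the simplification: per the help text, the real script reacts to a job failing on a missing prerequisite, whereas this example resolves the chain before running anything. The function name and the prerequisite table are illustrative; only the articlesdump/xmlstubsdump pair comes from this page.

```python
def jobs_with_prereqs(job, done, prereqs, max_chain=5):
    """Return `job` preceded by any missing prerequisites, earliest first."""
    chain = [job]
    current = job
    while current in prereqs and prereqs[current] not in done:
        current = prereqs[current]
        chain.append(current)
        if len(chain) - 1 > max_chain:
            raise RuntimeError("more than %d chained prerequisites for %s" % (max_chain, job))
    return list(reversed(chain))


PREREQS = {"articlesdump": "xmlstubsdump"}

print(jobs_with_prereqs("articlesdump", set(), PREREQS))
print(jobs_with_prereqs("articlesdump", {"xmlstubsdump"}, PREREQS))
```

The chain limit also protects against an accidental dependency loop.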

verbose: if you need to keep track of the progress of a manual job, this option is handy. We don't use it by default in production, but if something is badly broken we'll use it when trying to track down the issue.

''' ...(snipped)
--verbose:     Print lots of stuff (includes printing full backtraces for any exception)
               This is used primarily for debugging
"""
    sys.stderr.write(usage_text)
    sys.exit(1)

Sample commands

We talked about “stage files” a while back. These are all located in /etc/dumps/stages and they contain all of the worker.py commands that are run automatically, so let’s look at one of those. We have stage files for full (page content with all revisions) and partial dump runs, so let’s look at what’s required for a partial run. You can look at a copy of the stages file from December 2020 ([2]) which was used for this document.

List of space-separated fields, with the command last since it contains spaces

# slots_used numcommands on_failure error_notify command
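Because the command comes last and contains spaces, a stage file line can be split by limiting the number of splits to four. A minimal parsing sketch follows; the StageEntry type and parse_stage_line function are made up for this example, and the scheduler that actually reads these files may treat the fields differently.

```python
from collections import namedtuple

StageEntry = namedtuple(
    "StageEntry", "slots_used numcommands on_failure error_notify command"
)


def parse_stage_line(line):
    """Parse one stage file line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 4)  # at most 4 splits, so the command keeps its spaces
    if len(fields) != 5:
        raise ValueError("expected 5 fields, got %d: %r" % (len(fields), line))
    slots, numcommands, on_failure, error_notify, command = fields
    return StageEntry(int(slots), numcommands, on_failure, error_notify, command)


entry = parse_stage_line("1 max continue none /bin/bash ./worker --skipdone --job articlesdump")
print(entry.slots_used, entry.command)
```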

Notice the use of skipdone, exclusive, log, prereqs, the job name and the start date.

# stubs and then tables so inconsistencies between stubs and tables aren't too huge
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job xmlstubsdump; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job tables

You see here that the config file is specified with the config section bigwikis which includes special settings for these wikis that run 6 parallel processes at once.

# stubs, recombines, tables for big wikis
6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job xmlstubsdump,xmlstubsdumprecombine; /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job tables

Nothing too exciting here.

# regular articles
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesdump
# regular articles, recombines for big wikis
6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesdump,articlesdumprecombine

More boring entries.

# regular articles multistream
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump
# regular articles, recombines for big wikis multistream
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job articlesmultistreamdump,articlesmultistreamdumprecombine

# articles plus meta pages
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --job metacurrentdump
# articles, recombine plus meta pages for big wikis
6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --job metacurrentdump,metacurrentdumprecombine

Note how we specify certain jobs to be skipped. Thus, they are not marked to run, and when the rest of the jobs are complete, the entire dump is marked as complete.

# all remaining jobs except for the history revs
1 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump
6 max continue none /bin/bash ./worker --skipdone --exclusive --log --configfile /etc/dumps/confs/wikidump.conf.dumps:bigwikis --date {STARTDATE} --onepass --prereqs --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine,xmlflowhistorydump

Configuration

I’ve thought about replacing these standard python config format files with yaml or json. And that’s as far as it’s gotten: thinking about it.

The production configuration settings are in /etc/dumps/confs/wikidump.conf.dumps so let’s look at some of these settings. Note that the file is generated from a puppet template, so I can't link you to a copy of the completed config in our repo. A copy of the generated config file from December 2020 is available though ([3]) for people to look at.

#############################################################
# This file is maintained by puppet!
# modules/snapshot/templates/dumps/wikidump.conf.erb
#############################################################

All of the lists of wiki databases of various sorts go here. Some are maintained by us (skipdblist) but most are not. skipdblist is used for dump runs that do “all the regular wikis” but not big or huge wikis (no wikis that need multiple processes running at once).

[wiki]
dblist=/srv/mediawiki/dblists/all.dblist
privatelist=/srv/mediawiki/dblists/private.dblist
closedlist=/srv/mediawiki/dblists/closed.dblist
skipdblist=/etc/dumps/dblists/skip.dblist
flowlist=/srv/mediawiki/dblists/flow.dblist

This is the path to the MediaWiki repo.

dir=/srv/mediawiki

I think we used to parse this directly. Not any more! I should just remove the adminsettings entry.

adminsettings=private/PrivateSettings.php

These are the tables we will dump via mysqldump. Need a new sql table dumped? Just add it to the list.

tablejobs=/etc/dumps/confs/table_jobs.yaml

If you have a wiki farm with our sort of setup, this is the path to the location of MWScript.php.

multiversion=/srv/mediawiki/multiversion

[output]
public=/mnt/dumpsdata/xmldatadumps/public
private=/mnt/dumpsdata/xmldatadumps/private
temp=/mnt/dumpsdata/xmldatadumps/temp

Those are python (NOT PUPPET!) templates for pieces of html files.

templatedir=/etc/dumps/templs
index=backup-index.html
webroot=http://download.wikimedia.org
fileperms=0o644

This is who gets emails on dump failure. Who wants to be on this alias with me?

[reporting]
adminmail=ops-dumps@wikimedia.org
mailfrom=root@wikimedia.org
smtpserver=localhost

There’s a job that cleans up stale locks every so often, in case a process died or was shot, leaving its lock files around.

# 15 minutes is long enough to decide a lock is expired, right?
staleage=900
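A staleness check against staleage can be as simple as comparing the lock file's modification time with the current time. This sketch assumes the lock is an ordinary file whose mtime reflects the last sign of life from its owner; lock_is_stale is a hypothetical helper, not the actual cleanup job.

```python
import os
import tempfile
import time


def lock_is_stale(lockfile, staleage=900, now=None):
    """Return True if the lock file was last modified more than `staleage` seconds ago."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(lockfile) > staleage


with tempfile.NamedTemporaryFile() as handle:
    now = time.time()
    # pretend the lock was last touched 1000 seconds ago
    os.utime(handle.name, (now - 1000, now - 1000))
    old_lock_stale = lock_is_stale(handle.name, staleage=900, now=now)
    # and now only 10 seconds ago
    os.utime(handle.name, (now - 10, now - 10))
    fresh_lock_stale = lock_is_stale(handle.name, staleage=900, now=now)

print(old_lock_stale, fresh_lock_stale)
```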

Deprecated, we don’t dump private tables ever.

skipprivatetables=1

Mysql/mariadb setting. No easy way to keep it in sync with mariadb config, we have to do it manually :-(

[database]
max_allowed_packet=32M

Full paths to everything. For php, this lets us specify different php for different dump groups if we want. writeuptopageid and recompressxml are part of the collection of c utils for working with MediaWiki xml dump files.

[tools]
php=/usr/bin/php7.2
mysql=/usr/bin/mysql
mysqldump=/usr/bin/mysqldump
gzip=/bin/gzip
bzip2=/bin/bzip2
sevenzip=/usr/bin/7za
lbzip2=/usr/bin/lbzip2
checkforbz2footer=/usr/local/bin/checkforbz2footer
writeuptopageid=/usr/local/bin/writeuptopageid
recompressxml=/usr/local/bin/recompressxml
revsperpage=/usr/local/bin/revsperpage

Used to be useful, third parties might still use it (if any); now we clean up via a separate cron job

[cleanup]
keep=10

Are we writing page content files in page ranges? Honestly the chunk name is awful and will be changed $someday.

[chunks]
chunksEnabled=0
retryWait=30

[otherformats]
multistream=1

[misc]
sevenzipprefetch=1
maxRetries=3

We write stubs to a flat file and then read it and pass it to gzip, which is gross. We want around maxrevs revisions in each such temp file to be nice to the db servers.

[stubs]
minpages=1
maxrevs=100000
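One way to read these two settings: walk through the pages in order and close a batch once it holds roughly maxrevs revisions and at least minpages pages. The sketch below implements that reading. Both the interpretation and the page_batches function are assumptions for illustration, so check the stubs job code for what really happens.

```python
def page_batches(rev_counts, maxrevs=100000, minpages=1):
    """Group pages into (first_page_id, last_page_id) ranges of about maxrevs revisions.

    rev_counts is an iterable of (page_id, revision_count) in page id order.
    """
    batches = []
    first = last = None
    revs = pages = 0
    for page_id, count in rev_counts:
        if first is None:
            first = page_id
        last = page_id
        revs += count
        pages += 1
        if revs >= maxrevs and pages >= minpages:
            batches.append((first, last))
            first = None
            revs = pages = 0
    if first is not None:
        # whatever is left over forms the final, smaller batch
        batches.append((first, last))
    return batches


sample = [(1, 60), (2, 50), (3, 10), (4, 200), (5, 5)]
print(page_batches(sample, maxrevs=100))
```

A single page with a huge number of revisions still ends up in one batch, so maxrevs is a target rather than a hard cap.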

All wikis that require multiple processes to run (except enwiki and wikidatawiki) use these settings. The ‘6’ you see in several settings is the number of output files, and therefore the number of processes, for various jobs. The dblist file is different so that running through all of the bigwikis until there are none left to do means running through just that list. lbzip2 uses multiple cores, but we don’t want to use 6 cores because there will be other (input) processes running too, so we use 3 as a good compromise. We used to produce giant bz2 files with all the history in them; now we don’t. You want full history, download a bunch of smaller files.

[bigwikis]
# generic settings for big wikis
checkpointTime=720
chunksEnabled=1
chunksForAbstract=6
chunksForPagelogs=6
dblist=/etc/dumps/dblists/bigwikis.dblist
fixeddumporder=1
keep=8
lbzip2forhistory=1
lbzip2threads=3
recombineHistory=0
revinfostash=1
revsMargin=100
revsPerJob=1500000
skipdblist=/etc/dumps/dblists/skipnone.dblist

We've skipped a bunch of stuff and moved on to the next interesting bits: per wiki settings, which are picked up automatically by the config file reader. For “big wikis” such as these, all of the special settings are in the bigwikis section except for the number of pages in each output file, which obviously varies per wiki.

The rest of the config file has more per-wiki specific settings so we'll not bother to copy them here.

########################
# wiki-specific settings


[arwiki]
# specific settings for wiki arwiki
pagesPerChunkHistory=340838,864900,1276577,1562792,2015625,1772989

[commonswiki]
# specific settings for wiki commonswiki
pagesPerChunkHistory=10087570,13102946,15429735,17544212,19379466,18705774
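To see what such a setting means in terms of page ranges, here is a small sketch that turns the comma-separated counts into consecutive (first page id, last page id) pairs. It assumes each number is the count of pages in that output file and that numbering starts at page id 1; both are assumptions for illustration, and chunk_ranges is not a function from the dumps code.

```python
def chunk_ranges(setting):
    """Turn 'n1,n2,...' page counts into consecutive (first_id, last_id) ranges."""
    counts = [int(n) for n in setting.split(",")]
    ranges = []
    start = 1
    for count in counts:
        ranges.append((start, start + count - 1))
        start += count
    return ranges


print(chunk_ranges("10,20,5"))
print(chunk_ranges("340838,864900")[1])
```

Six numbers in the setting give six ranges, matching the six output files and processes used for big wikis.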

There may be other settings added from time to time; check the docs and the puppet manifests for details!