SQL/XML Dumps/Anatomy of a dumps job

Anatomy of a dump job (a simple implementation)

Overview

A job sets up its output filename, the human-readable description of the dump job, and the dump job name for inclusion in various report files used by the code and by dumps end users.

It then sets up the command(s) that must be run in order to produce the output.

And finally, it runs these commands, reporting back if there is an error.

Flow.py

This is a good case to look at because it doesn’t involve multiple output files, it doesn’t run commands in parallel, and it doesn’t need to generate a bunch of temporary files. It also doesn’t depend on input files from any other dump jobs.

The file may have been modified since these notes were written but you can always look at the version used for this document ([1]).

Let’s look at the code bit by bit. Don’t worry if you don’t know Python; we don’t care so much about the specific syntax as the general idea of what’s being done. All you really need to know is that indentation levels have meaning, since indentation, rather than brackets, is used to determine which statements are part of a code block.

Basic class setup

 

We dump all flow boards; during the "full" dumps we dump separately all revisions of flow boards. Both dumps use the same MediaWiki maintenance script, with a small difference in command-line arguments. Naturally the output filenames differ as well. Passing in True for the history argument in the constructor makes the difference.

 

 

#!/usr/bin/python3
'''
Dumps of Flow pages
'''

import os
from dumps.exceptions import BackupError
from dumps.utils import MultiVersion
from dumps.fileutils import DumpFilename
from dumps.jobs import Dump, ProgressCallback


class FlowDump(Dump):
    """Dump the flow pages."""

    def __init__(self, name, desc, history=False):
        self.history = history
        Dump.__init__(self, name, desc)

 

The detail line is displayed in the list of output files per dump run that users see when they view the web page for the dump run of a given wiki and date. It helps users decide which files they need to download.

 

 

    def detail(self):
        return "These files contain flow page content in xml format."

 

These are the parts of a dump output filename. Output names are of the form <wiki>-<YYYYMMDD>-<dumpname>.{xml,sql,json,txt,''}.{gz,bz2,7z}

In the case of flow output, example filenames are elwiki-20200801-flow.xml.bz2 for regular dumps, and elwiki-20200801-flowhistory.xml.bz2 for the full revision history dumps.

 

 

    def get_filetype(self):
        return "xml"

    def get_file_ext(self):
        return "bz2"

    def get_dumpname(self):
        if self.history:
            return 'flowhistory'
        return 'flow'
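
To make that concrete, here is a small sketch (illustration only; the real assembly is done by the DumpFilename class, not by hand like this) of how those pieces combine into a name such as elwiki-20200801-flowhistory.xml.bz2:

# Illustration only: compose a dump output filename from its parts.
def example_output_filename(wiki, date, dumpname, filetype, file_ext):
    return "%s-%s-%s.%s.%s" % (wiki, date, dumpname, filetype, file_ext)

print(example_output_filename("elwiki", "20200801", "flowhistory", "xml", "bz2"))
# elwiki-20200801-flowhistory.xml.bz2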

Building a command

 

The dfname used in building a command will be explained later. For now, know that it is an object representing a typical dump output filename with all of the parts that constitute it.

The path for php is configurable; this was useful when we were moving from PHP 5 to PHP 7, and from HHVM to PHP.

 



 
    def build_command(self, runner, output_dfname):
        if not os.path.exists(runner.wiki.config.php):
            raise BackupError("php command %s not found" % runner.wiki.config.php)

 

runner.dump_dir is a DumpDir object. Dump directories are of the form <path-to-dump-tree>/{public,private}/<wiki>/YYYYMMDD/. Only the public directories are rsynced to other hosts. The private data is not copied anywhere and is not backed up in any way.

This method returns the public directory for the given wiki and dump run date unless the wiki itself is private, in which case the private directory is returned.
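
Conceptually, the choice it makes looks something like this (a sketch with made-up helper and variable names; the real logic lives in the DumpDir class):

import os

# Sketch only: choose the public or private dump tree for a wiki and date.
def example_dump_path(dump_tree_base, wiki_name, date, filename, wiki_is_private):
    subtree = "private" if wiki_is_private else "public"
    return os.path.join(dump_tree_base, subtree, wiki_name, date, filename)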

 

 

        flow_output_fpath = runner.dump_dir.filename_public_path(output_dfname)

 

This method determines, based on the dumps configuration file, whether MWScript.php should be called with the maintenance script needed for this dump job, or whether php may be called directly.
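
Roughly speaking, the result takes one of two shapes (a sketch with illustrative names; the real decision is made by MultiVersion based on the dumps configuration):

# Sketch only: the general idea of choosing how to invoke a maintenance script.
def example_script_command(use_mwscript, mwscript_path, mediawiki_dir, maint_script):
    if use_mwscript:
        # MWScript.php picks the right MediaWiki version for the wiki
        return [mwscript_path, maint_script]
    # otherwise the maintenance script is invoked directly from the MediaWiki tree
    return [mediawiki_dir + "/" + maint_script]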

 

 

        script_command = MultiVersion.mw_script_as_array(
            runner.wiki.config, "extensions/Flow/maintenance/dumpBackup.php")

 

The in-progress name is just the output filename with ".inprog" tacked onto the end. We write files out with this suffix so that we can easily clean up all partially written files if a job dies, and so that rsyncs of files to the public-facing servers include only completed files ready for download, saving grief for the users and bandwidth for the servers in the rsync. Remember that the rsync source is an NFS share, so we jealously guard our bandwidth.
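
For the example filenames above, the --output argument therefore ends up looking something like the following (the base path here is made up for illustration):

--output=bzip2:/dumps/public/elwiki/20200801/elwiki-20200801-flow.xml.bz2.inprog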

 

 

        command = [runner.wiki.config.php]
        command.extend(script_command)
        command.extend(["--wiki=%s" % runner.db_name,
                        "--current", "--report=1000",
                        "--output=bzip2:%s" % DumpFilename.get_inprogress_name(flow_output_fpath)])

 

Our command runner ("CommandManagement") expects a list of command series which are run in parallel. Each series in the list is a list of command pipelines to be run one after the other; such a list might result in something like (dump this | gzip > someplace; dump that | gzip > somewhere-else)

Each pipeline is a set of commands joined by pipe symbols, to be run as a unit.

This means that if we have just one command to run, we must convert it to a pipeline (a list of commands) with one entry, wrap that pipeline in a series with one entry, and later wrap that series in a commands-in-parallel unit, also with one entry.
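
Schematically, the two-pipeline example above would be built up like this (a sketch with placeholder commands; output redirection is omitted):

# Sketch only: a pipeline is a list of commands, each command a list of args.
pipeline_one = [["dump", "this"], ["gzip"]]   # dump this | gzip
pipeline_two = [["dump", "that"], ["gzip"]]   # dump that | gzip
series = [pipeline_one, pipeline_two]         # pipelines run one after the other
commands_in_parallel = [series]               # just the one series here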

 

 

        if self.history:
            command.append("--full")
        pipeline = [command]
        series = [pipeline]
        return series

Running the command

 

Old output files for this run are removed first, if that behavior is enabled. In general we don't do this; old output files are complete files and will be left intact, with only missing files (in the case of jobs that write multiple files) filled in on a retry.

 

 

    def run(self, runner):
        self.cleanup_old_files(runner.dump_dir, runner)

 

Defined in jobs.py ([2]), this lister uses the file type, file extension, and dump name to construct the output filename. If multiple files with different dumpnames are produced, that is handled here too. If files are produced by processes running in parallel, they are numbered and those names are returned. To be more precise, DumpFilename objects are returned.

 

 

        dfnames = self.oflister.list_outfiles_for_build_command(
            self.oflister.makeargs(runner.dump_dir))

 

In this case we want just one output file.

 

 

        if len(dfnames) > 1:
            raise BackupError("flow content step wants to produce more than one output file")

 

Defined in jobs.py ([3]), this adds an associative array (dict) to a list we keep of information about commands to be run in this job. It allows us to connect a specific command series to the output files it will produce, so that in case of error the information can be relayed back to the caller, and the files can potentially be cleaned up as well before a retry.

 

 

        output_dfname = dfnames[0]
        command_series = self.build_command(runner, output_dfname)
        self.setup_command_info(runner, command_series, [output_dfname])
        prog = ProgressCallback()

 

Here we have converted the command series into a list of command series to be run in parallel, which has just the one series as an entry.

We jump through all these hoops with special callers not just to catch errors but also to provide a callback that either reports progress as the maintenance script emits progress lines, or checks the size of the in-progress output file and reports on that at regular intervals. This progress information is written to a file in the dump output directory so that it can be used in generating the web page seen by users; if you look at e.g. https://dumps.wikimedia.org/wikidatawiki/ for the most recent dump, assuming it is still in progress, you will see something like:

 

 

2020-08-27 09:01:58 in-progress All pages, current versions only.
b'2020-08-27 09:39:24: wikidatawiki (ID 45809) 2999 pages (150.8|456.9/sec all|curr), 3000 revs (150.8|152.3/sec all
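
The general shape of such a stderr progress callback is roughly as follows (hypothetical names and argument order; the real ProgressCallback is defined in jobs.py):

# Sketch only: record the latest progress line emitted by the maintenance
# script so that it can be shown on the status web page.
def example_progress_callback(runner_arg, line):
    if b"pages" in line and b"revs" in line:
        with open("/tmp/example-status.txt", "wb") as status_file:
            status_file.write(line)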

 

If your maintenance script does not produce such output as it runs, then you will want to check the file sizes at regular intervals and write that information out to various status files; this is done by html_update_callback, which is defined in jobs.py ([4]). The tables dump jobs all do this; they save output to a gzip-compressed file using runner.save_command, which automatically handles the choice of the appropriate callback. You can see this in the apijobs.py file ([5]) as well.

The command completion callback moves the temporary (“.inprog”) files into their permanent location, and may optionally check whether the files are empty or, for compressed files, whether they are complete or have only been partially written (in which case decompression fails).
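
A minimal sketch of what that completion step amounts to (hypothetical helper; the real callback is defined on the Dump class in jobs.py):

import bz2
import os

# Sketch only: verify a bzip2 output file decompresses all the way to the
# end, then publish it by dropping the ".inprog" suffix.
def example_publish_output(inprog_path):
    final_path = inprog_path[:-len(".inprog")]
    try:
        with bz2.open(inprog_path, "rb") as infile:
            while infile.read(1024 * 1024):
                pass
    except (OSError, EOFError):
        return False    # leave the partial file to be cleaned up before retry
    os.rename(inprog_path, final_path)
    return True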

 

 

        error, _broken = runner.run_command([command_series],
                                            callback_stderr=prog.progress_callback,
                                            callback_stderr_arg=runner,
                                            callback_on_completion=self.command_completion_callback)

 

We hope this never happens :-)

 

 

        if error:
            raise BackupError("error dumping flow page files")
        return True