
A dump job using an existing MediaWiki script

Things to consider

* Does your dump job run via a MediaWiki maintenance script (in core or an extension), or via some other command like mysqldump or a custom script?
* Does your dump job script write compressed output directly?
* Does your dump job script produce progress messages that can be used to judge the percentage of entries processed, or to derive an ETA for when the job will complete?

Your job may integrate slightly differently than this example based on the answers to the above questions.

Code of the module

First, we need the code for the new dump job's Python script; see below for the code of sample_job.py ([1]).

While we're here, we might as well see how it works. At this point, nothing here should be surprising.

Comments are inline so that anyone who checks out the repo can study this example and see how it works.

 

#!/usr/bin/python3
'''
Sample job for illustrative purposes only

Dumps of site lists
In reality, this needs to run only on one wiki, but as an
example that can be run against an existing maintenance script
and for which one has to fiddle a bit with the command to
get the output file right, it's useful
'''

import os
import time
from dumps.exceptions import BackupError
from dumps.utils import MultiVersion
from dumps.fileutils import DumpFilename
from dumps.jobs import Dump

Class defining the job

Jobs should have a class name like BlahBlahDump.

 

class SitelistDump(Dump):
    """Dump the sites list in xml format"""

    def __init__(self, name, desc):
        Dump.__init__(self, name, desc)

    def detail(self):
        # this text shows up on the index.html page for the dump run for the wiki.
        return "These files contain a list of wikifarm sites in xml format."

    # the following settings ensure that the output filename will be of
    # the form <wiki>-<YYYYMMDD>-sitelist.xml.gz

    def get_filetype(self):
        return "xml"

    def get_file_ext(self):
        return "gz"

    def get_dumpname(self):
        return 'sitelist'
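
To make the naming concrete, here's a rough illustration (an assumption, not the framework's actual code) of how those three values, together with the wiki name and the run date, end up composing the output filename seen in the test run below:

# illustration only: assumes the framework joins the pieces this way
wiki, date = "elwikivoyage", "20200901"   # example wiki db name and run date
dumpname, filetype, file_ext = "sitelist", "xml", "gz"
filename = "{w}-{d}-{n}.{t}.{e}".format(
    w=wiki, d=date, n=dumpname, t=filetype, e=file_ext)
print(filename)   # elwikivoyage-20200901-sitelist.xml.gz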

Building a command

This method is not mandatory but most jobs include it.

 

    @staticmethod
    def build_command(runner, output_dfname):
        '''
        construct a list of commands in a pipeline which will run
        the desired script, piping all output to gzip
        '''
        if not os.path.exists(runner.wiki.config.php):
            raise BackupError("php command %s not found" % runner.wiki.config.php)

        # the desired script is a maintenance script in MediaWiki core; no additional
        # path info is needed.
        script_command = MultiVersion.mw_script_as_array(
            runner.wiki.config, "exportSites.php")

        # the script does not write compressed output, so we must arrange for that
        # via a pipeline consisting of the script and the following gzip command.
        commands = [runner.wiki.config.php]
        commands.extend(script_command)
        commands.extend(["--wiki={wiki}".format(wiki=runner.db_name),
                         "php://stdout"])
        pipeline = [commands, [runner.wiki.config.gzip]]
        return pipeline
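
For illustration, here is roughly what the returned pipeline looks like for the test wiki used later on this page (the php binary and script paths come from that local setup):

# the outer list is the pipeline; each inner list is one command in it
pipeline = [
    ["/usr/bin/php", "/var/www/html/elwv/maintenance/exportSites.php",
     "--wiki=elwikivoyage", "php://stdout"],
    ["/usr/bin/gzip"],
]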

Running the job

All dump jobs must have this method; it is executed to actually run the job.

 

    def run(self, runner):
        self.cleanup_old_files(runner.dump_dir, runner)

        dfnames = self.oflister.list_outfiles_for_build_command(
            self.oflister.makeargs(runner.dump_dir))
        if len(dfnames) > 1:
            raise BackupError("Site list dump job wants to produce more than one output file")
        output_dfname = dfnames[0]

        command_pipeline = self.build_command(runner, output_dfname)

        # we write to the "in progress" name, so that cleanup of unfinished
        # files is easier in the case of error, and also so that rsyncers can
        # pick up only the completed files

        # the save command series is just a list of the single pipeline but
        # with a redirection to the output file tacked onto the end.
        # this is useful for adding a compression step on the end when
        # scripts don't write compressed data directly.
        command_series = runner.get_save_command_series(
            command_pipeline, DumpFilename.get_inprogress_name(
                runner.dump_dir.filename_public_path(output_dfname)))
        self.setup_command_info(runner, command_series, [output_dfname])

        retries = 0
        maxretries = runner.wiki.config.max_retries

        # this command will invoke the html_update_callback as a timed callback, which
        # allows updates to various status files to be written every so often
        # (typically every 5 seconds) so that these updates can be seen by the users
        error, _broken = runner.save_command(command_series, self.command_completion_callback)

        # retry immediately, don't wait for some scheduler to find an open slot days later.
        # this catches things like network hiccups or a db being pulled out of the pool.
        while error and retries < maxretries:
            retries = retries + 1
            time.sleep(5)
            error, _broken = runner.save_command(command_series)
        if error:
            raise BackupError("error dumping Sites list for wiki {wiki}".format(
                wiki=runner.db_name))
        return True
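
To make the "in progress" naming mentioned in the comments above concrete, here is an illustrative sketch; the exact suffix is an assumption, and the authoritative answer is DumpFilename.get_inprogress_name():

# illustration only: the suffix used here is an assumption
final_path = "elwikivoyage-20200901-sitelist.xml.gz"
inprogress_path = final_path + ".inprog"   # written to while the job runs,
                                           # moved to final_path on success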

Wiring it in

Next we need to make the job known to the infrastructure. We do this by adding an entry for it in the dumpitemlist.py module ([2]):

 

diff --git a/xmldumps-backup/dumps/dumpitemlist.py b/xmldumps-backup/dumps/dumpitemlist.py
index fb9898ad4..a3c449d0b 100644
--- a/xmldumps-backup/dumps/dumpitemlist.py
+++ b/xmldumps-backup/dumps/dumpitemlist.py
@@ -20,6 +20,7 @@ from dumps.xmljobs import XmlLogging, XmlStub, AbstractDump
 from dumps.xmlcontentjobs import XmlDump, BigXmlDump
 from dumps.recompressjobs import XmlMultiStreamDump, XmlRecompressDump
 from dumps.flowjob import FlowDump
+from dumps.sample_job import SitelistDump
 
 
 def get_setting(settings, setting_name):
@@ -241,6 +242,8 @@ class DumpItemList():
         self.append_job_if_needed(
             FlowDump("xmlflowhistorydump", "history content of flow pages in xml format", True))
 
+        self.append_job_if_needed(SitelistDump("sitelistdump", "List all sites."))
+
         if self.wiki.config.revinfostash:
             recombine_prereq = self.find_item_by_name('xmlstubsdumprecombine')
         else:

 

That's all we change: an import line at the end of the imports near the top of the module, so that the class name is recognized without an icky module prefix, and a call adding the job to the list of dump jobs that may or may not be run, passing in just what the class constructor needs, which isn't much.

 

Because this job is just for purposes of illustration and should not be run in a production environment, we also added a config switch that lets you disable jobs on all runs on all wikis; see the commit ([3]) if you're interested in more details. Ordinarily you won't have to worry about that, since jobs you add will be jobs you want run :-)
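
If it helps to picture the idea, here is a purely hypothetical sketch of what append_job_if_needed() and such a switch amount to; the real method takes just the job object, and the real setting and attribute names are in dumpitemlist.py and the linked commit:

skipjobs = ["sitelistdump"]          # assumed config: job names never to run
dump_items = []                      # jobs that this run may execute

def append_job_if_needed(job_name, job):
    # skip jobs disabled everywhere by the (assumed) skip list, otherwise
    # add them to the list of items the run may execute
    if job_name in skipjobs:
        return
    dump_items.append(job)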

Testing

Now we run it:

 

[ariel@bigtrouble dumptesting]$ python ./worker.py  --configfile ./confs/wikidump.conf.current:bigwikis --job sitelistdump
Running elwikivoyage, jobs sitelistdump...
2020-09-01 10:43:53: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/20200901 ...
2020-09-01 10:43:53: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/private/elwikivoyage/20200901 ...
2020-09-01 10:43:53: elwikivoyage Cleaning up old dumps for elwikivoyage
2020-09-01 10:43:53: elwikivoyage No old public dumps to purge.
2020-09-01 10:43:53: elwikivoyage No old private dumps to purge.
Preparing for job sitelistdump of elwikivoyage
command /usr/bin/php /var/www/html/elwv/maintenance/exportSites.php --wiki=elwikivoyage php://stdout (3305310) started...
command /usr/bin/gzip (3305311) started...
returned from 3305311 with 0
2020-09-01 10:43:55: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/latest ...
2020-09-01 10:43:55: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/latest ...
2020-09-01 10:43:55: elwikivoyage adding rss feed file /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/latest/elwikivoyage-latest-sitelist.xml.gz-rss.xml
2020-09-01 10:43:55: elwikivoyage Checksumming elwikivoyage-20200901-sitelist.xml.gz via md5
2020-09-01 10:43:55: elwikivoyage Checksumming elwikivoyage-20200901-sitelist.xml.gz via sha1
2020-09-01 10:43:55: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/latest ...
2020-09-01 10:43:55: elwikivoyage Checkdir dir /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/latest ...
2020-09-01 10:43:55: elwikivoyage Completed job sitelistdump for elwikivoyage

Output check

And finally we check the output:

 

[ariel@bigtrouble dumptesting]$ zcat /home/ariel/wmf/dumps/testing/xmldumps/dumpruns/public/elwikivoyage/20200901/elwikivoyage-20200901-sitelist.xml.gz
<sites version="1.0" xmlns="http://www.mediawiki.org/xml/sitelist-1.0/">
</sites>
[ariel@bigtrouble dumptesting]$

 

Looks empty! But that's because I'm testing on my local instance, which has no wikifarm and hence no list of site instances. Nonetheless, the empty list is formatted properly and written to the correct location. Success!

 

Go forth and do likewise!

The end. Obligatory cute puppies link for reading this through to the end: [4]