SQL/XML Dumps/Becoming a dumps co-maintainer/Deployment-prep

We do a lot of testing in the Deployment-prep project in the Wikimedia Cloud. You'll want to know how to set up a new snapshot server instance there, how to access it once it's ready, and how to run tests. If there's already a working instance, great! You're done! But if you need to upgrade to a new distro for example, this guide is for you.

Setting up a new snapshot instance

All of our dumps are produced by servers named snapshotNNN and written out to NFS fileshares on hosts named dumpsdataNNN, except in the deployment-prep project. There, we have a single instance that has a local filesystem mounted where the NFS filesystem would usually go, and reads and writes are done directly to it. This means that certain features such as NFS locking are not testable there, but everything else is.

So, to set up a new testbed, we need only one instance, a new "deployment-snapshot0x" where x is the next available number ([1]). Currently we are on 03.

You will want the following settings when setting up an instance ([2]):

  • image type: g2.cores4.ram8.disk80 (should have: VCPUs: 4, RAM: 8GB, Disk: 80GB)
  • hostname: deployment-snapshotXX
  • OS: currently buster, change as needed
  • Security group: default
  • Server group: ignore this
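
Horizon is the usual way to do this, but if you prefer the command line, a launch along these lines should work (a sketch: it assumes you have OpenStack CLI credentials for the project, and the image name here is a placeholder for whatever the current buster image is called):

```shell
# Launch the new snapshot testbed instance. The flavor matches the
# settings above; adjust the image name to the current buster image.
openstack server create \
  --flavor g2.cores4.ram8.disk80 \
  --image debian-10.0-buster \
  --security-group default \
  deployment-snapshotXX
```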

Once you click "Launch Instance", the instance will be configured, created, and booted. Don't expect it to come up with everything working. Instead:

Wait a while, then check that the instance is up and running and has gotten through at least part of the initial puppet run. You can see what happened via the logs at https://horizon.wikimedia.org/project/instances/ : select your new instance and then click the "Logs" tab.
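
The same console log is available from the command line, if that's more convenient (again assuming OpenStack CLI credentials for the project):

```shell
# Show the console log of the new instance to see how boot and the
# initial puppet run went.
openstack console log show deployment-snapshotXX
```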

Now go to the "Puppet Configuration" tab. In the "Classes" section, make sure the contents are as follows:

role::beta::mediawiki
role::dumps::generation::worker::beta_testbed

and that the "Hiera Config" section contains

profile::dumps::generation::worker::common::dumps_misc_cronrunner: false
profile::dumps::generation::worker::common::nfs_extra_mountopts: actimeo=0
profile::dumps::generation::worker::common::php: /usr/bin/php7.2
profile::dumps::generation_worker_cron_php: /usr/bin/php7.2
profile::envoy::ensure: absent
profile::services_proxy::envoy::local_clusters:
- swift-https
- search-https
- search-omega-https
- search-psi-https
puppetmaster: deployment-puppetmaster04.deployment-prep.eqiad.wmflabs

You'll want to double-check the list of instances to be sure the current puppetmaster is indeed 04 and not some later number.

Once you submit this change you'll need to wait a little while for it to take effect, typically 15 to 20 minutes to make it both to the puppetmaster and then to your instance.

At this point you should be able to SSH in to your instance but it will likely tell you that puppet failed to run nicely. As root on the instance, do

rm -rf /var/lib/puppet/ssl
puppet agent --test

to generate a new certificate request. Then SSH to the deployment-prep puppetmaster and do

puppet ca list
puppet cert sign deployment-snapshotNN.deployment-prep.eqiad1.wikimedia.cloud   (or whatever snapshot name showed up in the list)

and then back on the instance, do

puppet agent --test

This time it should run and do a whole bunch of stuff. You might want to run it more than once to make sure it's got nothing left to process.
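If you'd rather not eyeball the repeat runs, a loop along these lines should work (a sketch; it relies on the detailed exit codes that --test enables: 0 means nothing left to change, 2 means changes were applied, anything else is a failure):

```shell
# Re-run the agent until a run completes with nothing left to change.
while true; do
  puppet agent --test
  rc=$?
  [ "$rc" -eq 0 ] && break                                   # converged, done
  [ "$rc" -ne 2 ] && { echo "puppet run failed (rc=$rc)" >&2; exit "$rc"; }
done
```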

You now need to make sure this instance isn't in the WMCS backups since the disk is so large, and that it IS in the list of deployment targets for mediawiki and for dumps. Examples of how to do that in puppet: [3] for backups and mediawiki, [4] for dumps.

Once you've made the appropriate patchset and merged it, you'll want to wait for it to make it over to the deployment-prep deployment server, again 15 to 20 minutes.

On the deployment server, as you, be in the /srv/deployment/dumps/dumps/scap directory and do a git pull to make sure that the new target is added to the list.
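That amounts to something like the following (the git log line is just a sanity check that your merged change actually arrived):

```shell
cd /srv/deployment/dumps/dumps/scap
git pull --ff-only
git log --oneline -3   # your target-list change should be visible here
```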

At this point you should be ready to try running the test suite as the dumpsgen user; installation and setup are done!

Troubleshooting

The dreaded failed sync

This procedure can go wrong in a few places. The most common issue is that the sync from our gerrit puppet repo to deployment-prep's puppetmaster is broken for some reason. Typically you will want to do the following to sort it out:

  • Go to the deployment-prep puppetmaster, where the sync runs, and check the file /var/log/git-sync-upstream.log to see if the last run was successful.
  • If not, look through it and the previous (.1 through whatever) logs to see where the error first started happening.
  • On deployment-prep, some local commits are maintained outside of the puppet repo. The sync job updates the local repo from the gerrit copy, and rebases these local commits on top. One of these will likely have failed and still be failing.
  • The rebase attempt should say exactly which commit could not be rebased on top of the current branch in the repo. You can cd to /var/lib/git/operations/puppet and have a look at the problem commit there, and use a local copy of the puppet repo on your laptop to look at the conflicting commit in production.
  • At this point you have a couple of choices: if it looks too messy for you to touch, find the author of one commit or the other and ask them to rebase it for you; or disable the sync cron job (in the root crontab, look for an entry with "git-sync-upstream" in it), run git rebase -i origin/production, and try to resolve the conflict yourself.
  • Once you or your rescuer has finished up with the rebase, try the script directly (as root): /usr/local/bin/git-sync-upstream
  • If you don't see errors at the end, you are done! If you do, ask for help.
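
The checks above boil down to something like this on the puppetmaster (as root; the paths are the ones named in the steps):

```shell
# Did the last sync succeed?
tail -n 50 /var/log/git-sync-upstream.log

# Inspect the local puppet repo: a failed rebase leaves it mid-rebase,
# and the local commits carried on top of production are easy to list.
cd /var/lib/git/operations/puppet
git status
git log --oneline origin/production..HEAD

# After resolving the conflict, run the sync by hand:
/usr/local/bin/git-sync-upstream
```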

Certs gone bad

But wait, there's more!

Because you might have had to clean up CA certificates in an earlier step, you might have bad symlinks in /etc/ssl/certs. You'll know about this on buster ([5]) because innocuous commands like

curl -H 'Content-Type: application/json' 'https://deployment-etcd-01.deployment-prep.eqiad.wmflabs:2379/v2/keys/conftool/v1/mediawiki-config/?recursive=true'

won't work, and most MediaWiki maintenance scripts likely won't either.

To fix this, you should be able to run /usr/sbin/update-ca-certificates as root on your instance. You'll know it did the trick if the curl command is now successful.
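
To confirm that dangling symlinks really are the culprit (and that the fix cleared them), a check like this should work; it assumes GNU find, whose -xtype l matches symlinks with missing targets:

```shell
# List symlinks in /etc/ssl/certs whose targets no longer exist.
# Empty output means the certificate store is clean.
find /etc/ssl/certs -xtype l
```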