Parsing/Visual Diff Testing

The code for generating visual diffs is in the integration/visualdiff repo on gerrit. The main two directories are:

diffserver/ has code for running a visual-diff server for generating diffs on demand.
testreduce/ has code for running mass visual diff testing via the testreduce setup, and for configuring the testreduce server.

This uses the testreduce code, which is in the mediawiki/services/parsoid/testreduce repo on gerrit. The testreduce repository includes the SQL code for setting up a new database and scripts to extract title lists.

Both repositories (visualdiff and testreduce) are also mirrored on GitHub.

Overview

We have visual diff code set up on parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud. parsing-qa-02 is a labs server, so you can run visual diff tests only against public APIs (whether Parsoid, MediaWiki, or something else altogether).

Currently, there is one visual-diff instance on this VM.

http://parsoid-vs-core.wmflabs.org is used with the parsoid_vs_core and other databases, and parsoid-vs-core-vd and parsoid-vs-core-vd-client testreduce services. This instance is set up to compare Parsoid rendering and core parser rendering for production wiki pages.

Other visual diff instances can be set up as long as the right visualdiff, testreduce, proxy domain and nginx configs are updated.

Testreduce code

The testreduce code is in /srv/testreduce, which is used to run the parsoid-vs-core-vd and parsoid-vs-core-vd-client services. The systemd controller files for these services are /lib/systemd/system/parsoid-vs-core-vd.service and /lib/systemd/system/parsoid-vs-core-vd-client.service. These files were derived from the puppetized code for similar services on scandium used for Parsoid's roundtrip testing.

The testreduce server config is in /etc/testreduce/parsoid-vs-core-vd.settings.js. The testreduce client config is in /etc/testreduce/parsoid-vs-core-vd-client.config.js which also includes a section that provides the config for the visual diff tests that are to be run.

Visualdiff code

The visualdiff code is in /srv/visualdiff, which also provides config and hooks to use it with testreduce. The file /etc/testreduce/parsoid-vs-core-vd-client.config.js also provides the visualdiff config. It specifies how to fetch the HTML for the two screenshots, specifies uprightdiff as the diffing engine to use, and sets a few other parameters that control these. The comments should be fairly self-explanatory. The uprightdiff code is in /srv/uprightdiff.

There is a separate helper service for viewing results for a single title without having to go digging for them in the directory containing them. On parsing-qa-02, the code in /srv/visualdiff/diffserver/diffserver.js is run as the visualdiff-item service. The config for this is in /etc/visualdiff/parsoid-vs-core-diffserver.config.js. The systemd controller file is in /lib/systemd/system/parsoid-vs-core-diffserver.service.

Managing services: parsoid-vs-core-vd, parsoid-vs-core-vd-client, parsoid-vs-core-diffserver

To {stop,restart,start} all clients:

sudo service parsoid-vs-core-vd-client stop
sudo service parsoid-vs-core-vd-client restart
sudo service parsoid-vs-core-vd-client start

Client logs are in systemd journals and can be accessed as:

### Logs for the parsoid-vs-core-vd-client service
# equivalent to tail -f <log-file>
sudo journalctl -f -u parsoid-vs-core-vd-client
# equivalent to tail -n 1000
sudo journalctl -n 1000 -u parsoid-vs-core-vd-client

### Logs of the parsoid-vs-core-vd testreduce server
sudo journalctl -f -u parsoid-vs-core-vd

### Logs of the parsoid-vs-core-diffserver service
sudo journalctl -u parsoid-vs-core-diffserver

The public-facing web UIs for these services are managed by an nginx config in /etc/nginx/sites-available/parsoid-vs-core-vd. It provides access to the web UI for the parsoid-vs-core-vd and parsoid-vs-core-diffserver services and also enables directory listing for the screenshots generated during the test runs. The config should be self-explanatory.

Updating the code to test (and being run by the clients)

Unlike Parsoid, where the code to test is determined by the latest git commit, in the parsoid-vs-core setup the code to run lives on a separate VM. Sometimes the change might be in the config files and may not be available in a git repository (at least as of today). The testreduce codebase implicitly assumes that the test to run is a git commit. However, the testreduce client config file (/etc/testreduce/parsoid-vs-core-vd-client.config.js) can declare a getGitCommit function that is then used by the server and clients to identify the test run in the database. In our case, this function simply returns a unique string identifying the test run. So, to initiate a new test run, change the string returned by this function, save the file, and restart the parsoid-vs-core-vd-client service.

Anyway, here are the steps:

Log in to parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud.
Edit /etc/testreduce/parsoid-vs-core-vd-client.config.js and update the string in the getGitCommit function at the bottom.
Restart the parsoid-vs-core-vd-client service.
Restarting the parsoid-vs-core-vd service shouldn't be necessary, but occasionally that service might crash and need restarting.
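
As a rough sketch of the idea, the function in the client config only needs to hand back a string that changes whenever a new run should start. The snippet below is illustrative and not copied from the deployed config: makeRunId and the 'parsoid-vs-core' label are hypothetical, and the exact signature and return type that testreduce expects from this hook are assumptions, so check the config file on the server before copying anything.

```javascript
// Hypothetical helper (not part of testreduce): builds a run identifier
// from a free-form label and a date, in the form "<label>-<YYYYMMDD>".
function makeRunId(label, date) {
	const ymd = date.toISOString().slice(0, 10).replace(/-/g, '');
	return `${label}-${ymd}`;
}

// Assumed shape of the hook: it simply returns the identifier string.
// Changing the label or the date yields a new test run identifier.
function getGitCommit() {
	return makeRunId('parsoid-vs-core', new Date(Date.UTC(2024, 7, 1)));
}

console.log(getGitCommit()); // parsoid-vs-core-20240801
```

Any string works as long as it has not been used for an earlier run; embedding a date is just one convenient way to keep identifiers unique and sortable.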

Updating the testreduce, visualdiff, uprightdiff code

Of course, there will continue to be bug fixes and tweaks to these codebases. To update the relevant code, simply go to /srv/testreduce, /srv/visualdiff, or /srv/uprightdiff, and do a git pull, and restart the affected services. As simple as that!

Generating new title lists (method 1)

There is a script server/scripts/gen_titles.js in the testreduce repo to generate title lists. Read the README file in that repo for hints on its use. Briefly: first create/edit testdb.info.js in that repository to include the target wikis, using their database name in the prefix column. It will probably be much smaller than the file which is checked-in, which is for complete round trip testing. For example:

#!/usr/bin/env node
'use strict';

module.exports = {
	// How many titles do you want?
	size: 10000,

	// How many of those do you want from traffic popularity
	popular_pages_percentage: 50,

	// How many of those do you want from the dumps?
	// Rest will come from recent changes stream
	dump_percentage: 25,

	wikis: [
		// wikivoyage
		{ prefix: 'cswikivoyage', limit: 1 },
		{ prefix: 'hiwikivoyage', limit: 1 },
		{ prefix: 'shnwikivoyage', limit: 1 },
		{ prefix: 'pswikivoyage', limit: 1 },
		{ prefix: 'trwikivoyage', limit: 1 },
	],
};
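
To make the percentages concrete, here is a small sketch of how the settings above would divide the title budget. It assumes both percentages apply to size and that the remainder comes from recent changes, as the comments suggest. splitTitleBudget is a hypothetical helper, not something gen_titles.js exposes, and the real script may round or distribute titles across wikis differently.

```javascript
// Hypothetical helper: splits the requested number of titles between
// the three sources named in the config comments.
function splitTitleBudget(cfg) {
	const popular = Math.round(cfg.size * cfg.popular_pages_percentage / 100);
	const dump = Math.round(cfg.size * cfg.dump_percentage / 100);
	// Whatever is left over is assumed to come from the recent changes stream.
	return { popular, dump, recentChanges: cfg.size - popular - dump };
}

const budget = splitTitleBudget({
	size: 10000,
	popular_pages_percentage: 50,
	dump_percentage: 25,
});
console.log(budget); // { popular: 5000, dump: 2500, recentChanges: 2500 }
```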

Then run the scripts in the order described in the README, after first running npm install at the top level of the testreduce repo:

npm install
cd server/scripts
node fetch_rc.js 
node fetch_top_ranked.js 
node gen_titles.js

You will now have a bunch of *.sql files, one for each target wiki. Transfer these as well as the create_everything.mysql script to parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud with something like:

cat ../sql/create_everything.mysql *.sql > 20240801.sql
scp 20240801.sql parsing-qa-02:

Creating a new database with your generated titles

We're going to assume you're going to make a new database for these new test results. You could optionally delete the old database and use these instructions to create it with the same name. Start by logging in to parsing-qa-02 using ssh. Then edit /etc/testreduce/parsoid-vs-core-vd.settings.js:

	// Database to use.
	database: "parsoid_rv_deploy_targets",

	// User for MySQL login.
	user: "testreduce",

	// Password.
	password: "$PASSWORD"

Change the database name (in our example, to parsoid_rv2_deploy_targets) and remember it. Also look at what the password is set to and remember it. Now start mysql:

mysql -u testreduce -p$PASSWORD

and run the following commands (substituting your own new database name for parsoid_rv2_deploy_targets):

create database parsoid_rv2_deploy_targets;
use parsoid_rv2_deploy_targets;
source 20240801.sql;
quit

Edit /etc/testreduce/parsoid-vs-core-vd-client.config.js and update the string in the gitCommitFetch function at the bottom to match the latest running version of MediaWiki from versions.toolforge.org. Restart all the services:

sudo service parsoid-vs-core-vd restart
sudo service parsoid-vs-core-vd-client restart

And check that everything is running:

sudo journalctl -f -u parsoid-vs-core-vd

Generating new title lists (method 2)

There is a script tools/gen_visualdiff_titles.js in the Parsoid repo, which has some hints for its use in a comment at the top of the file. It starts by using Quarry to get a list of titles from your target wiki(s). Go to quarry.wmcloud.org and log in with your meta.wikimedia.org account. Click the "New Query" button. The first thing you will need to do is enter the dbname corresponding to the wiki you are targeting. Check this list to map from project domain to dbname; the mapping can sometimes be unintuitive.

For testing parsoid read views in the main article space, use the following query:

select page_title,page_namespace from page where page_is_redirect=0 and page_namespace = 0;

For testing discussion tools (i.e., selecting pages in the Talk namespaces), use:

select page_title,page_namespace from page where page_is_redirect=0 and mod(page_namespace,2) = 1;

Use the "Download Data" button and save this as allpages.json.

This will be used to generate a random sample of all pages on a wiki. It is recommended to supplement this with a "most frequently viewed" list, to ensure that the most popular pages are included in the visualdiff. Go to https://pageviews.wmcloud.org/topviews/, enter your target project (in domain form this time, not dbname). If you are looking at main article space pages, leave the "Show only mainspace pages" box checked. If you want discussion tools pages, uncheck the "Show only mainspace pages" box and type "Talk:" in the Search box.

Use the "Download" button and save this as topviews.json.

Now use the gen_visualdiff_titles.js tool as follows:

node tools/gen_visualdiff_titles.js $DBNAME allpages.json 1000 > titles.sql
node tools/gen_visualdiff_titles.js $DBNAME topviews.json 1000 >> titles.sql
sort -u titles.sql > $DBNAME-titles.sql

You will now have a bunch of *-titles.sql files, one for each target wiki. Transfer these to parsing-qa-02.wikitextexp.eqiad1.wikimedia.cloud.

Now follow the instructions under "method 1" above to create a new database and load these .sql files into it.

Retesting a subset of titles

The only way to do this is to clear the result entries in the MySQL database. The MySQL credentials (username, db, password) are in /etc/testreduce/parsoid-vs-core-vd.settings.js.

mysql> update pages set claim_hash="",claim_num_tries=0, claim_timestamp=null,latest_stat=null,latest_result=null,latest_score=0,num_fetch_errors=0 where latest_score > 5000;

That will clear all test results for titles that have a score > 5000, which is equivalent to pages that have a rendering diff > 5%. Score = errors * 1M + truncate(diff%) * 1000 + fractional-part-of-diff%. This weird scoring formula is just a result of shoe-horning the visualdiff results into the testreduce setup that was built for parsoid-rt testing. So, to clear test results for all erroring pages, use latest_score >= 1000000 (or a really high score value less than 1 million).

Look at the schema for the pages table to clear results for other subsets.
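
When picking a threshold for the where clause, it can help to unpack a score back into its parts. The sketch below follows the formula quoted above. decodeScore is a hypothetical helper for illustration only, and it assumes scores are non-negative numbers laid out exactly as that formula describes.

```javascript
// Hypothetical helper: splits a testreduce score into the error count,
// the whole-number diff percentage, and whatever is left below 1000.
function decodeScore(score) {
	const errors = Math.floor(score / 1000000);
	const rest = score % 1000000;
	return {
		errors,
		wholeDiffPercent: Math.floor(rest / 1000),
		remainder: rest % 1000,
	};
}

console.log(decodeScore(5000));    // { errors: 0, wholeDiffPercent: 5, remainder: 0 }
console.log(decodeScore(1012250)); // { errors: 1, wholeDiffPercent: 12, remainder: 250 }
```

Read this way, latest_score > 5000 selects pages above a 5% diff, and latest_score >= 1000000 selects pages whose test errored.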

Deleting 404-ing titles

Occasionally, titles in the test database might be deleted on the wiki. Visual diff tests for these titles will start failing after that point and will show up as erroring titles. Here is how you would go about detecting and deleting those:

mysql> select page_id from results where result like '%code: 404%' and commit_hash='.. latest hash ..';
.. this gives you a set of page ids ...
mysql> delete from stats where page_id in (... above set ...);
mysql> delete from results where page_id in (... above set ...);
mysql> delete from pages where id in (... above set ...);
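
If the set of page ids is long, pasting it into three statements by hand is error-prone. The following is a minimal sketch of generating the statements instead. buildDeleteStatements is a hypothetical helper (it is not part of testreduce), it only produces text for you to review and paste into the mysql prompt, and it assumes the ids from the select above have already been collected into an array of positive integers.

```javascript
// Hypothetical helper: turns a list of page ids into the three delete
// statements shown above. It does not talk to the database.
function buildDeleteStatements(pageIds) {
	if (pageIds.length === 0) {
		throw new Error('no page ids given');
	}
	for (const id of pageIds) {
		// Reject anything that is not a positive integer before building SQL text.
		if (!Number.isInteger(id) || id <= 0) {
			throw new Error(`not a valid page id: ${id}`);
		}
	}
	const idList = pageIds.join(', ');
	return [
		`delete from stats where page_id in (${idList});`,
		`delete from results where page_id in (${idList});`,
		`delete from pages where id in (${idList});`,
	];
}

console.log(buildDeleteStatements([3, 7]).join('\n'));
```

The statements are emitted in the same order as above, so rows in stats and results are removed before the pages rows they refer to.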

Some useful scripts

On parsing-qa-02, there are two scripts in the ~ssastry home directory:

stats.sh: This script can be used to generate a wikitable of stats (you can tweak it to generate stats for all wikis in the db or a subset of wikis) ordered by diff runs in reverse chronological order. This script was used to generate the tables here and here.
diffs.sh: This script can be used to generate a wikitable of diffs (with score > 1000 unless you tweak it for other thresholds) per wiki, which is useful to prioritize work as well as to distribute the work of analyzing diffs by wiki. This script was used to generate the table here.

Some useful sql commands

Here is a command to list diff titles to further inspect (edit suitably):

select latest_score, concat("http://parsoid-vs-core.wmflabs.org/diff/", prefix, "/", title) from pages where prefix='eowikivoyage' and latest_score > 1000 order by latest_score desc;

Resource usage and # of test clients

parsing-qa-01 is a large labs VM with 12 CPU cores, 32 GB memory, and a 400+ GB disk. Even so, visual diff testing can use up all these resources. 20 testreduce clients seems to be about the upper end of how many can be run at the same time. This is enough to sometimes bring CPU load to 13-15 and memory usage to 28+ GB. 16 clients is probably a more comfortable number. The number of test clients to run can be tweaked by editing /lib/systemd/system/parsoid-vs-core-vd-client.service.

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix. These images are overwritten with each test run. It takes too much disk space to store these images per test run. 125GB is used per test run. But, in the future, we could consider storing results from the most recent 2-3 runs or get a larger disk and expand that range a bit more.

Web UI for browsing results

The screenshots from puppeteer and from uprightdiff are written to /data/visualdiffs/pngs organized by wiki prefix and are accessible via HTTP at http://mw-expt-tests.wmflabs.org/visualdiff/pngs/.

However, a better way of browsing these results is via the parsoid-vs-core-vd web UI at http://mw-expt-tests.wmflabs.org. The /topfails link sorts results in descending order of score, which makes it easy to look at pages that generate the most prominent diffs first. The @remote link on these result listing pages is an easy way to look at the 2 HTML screenshots and the uprightdiff screenshot. That output is outsourced to the visualdiff-item service. It simply links to the existing screenshots (or, if they are missing, generates them on demand).

Uprightdiff numeric scoring

Uprightdiff compares the two candidate images and returns 3 metrics:

* modifiedArea : This is a simple count of the number of pixels for which the source does not match the destination (after they have both been expanded to the same size).
* movedArea    : The number of pixels for which nonzero motion was detected.
* residualArea : The number of pixels which differed between the resulting image and the second input image.

In other words,

if modifiedArea == 0, then the images had a pixel-perfect match. In this scenario, movedArea and residualArea will also be zero.
if modifiedArea > 0, then the images obviously differed. If residualArea == 0, then it tells us that all the differences could be accounted for by vertical motion and the rendering differences are mostly insignificant. In this scenario, movedArea tells us how many pixels were affected.
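
The two cases above, plus the remaining case where residualArea is nonzero, can be sketched as a small classifier. classifyDiff and its three labels are hypothetical names chosen for illustration; uprightdiff itself only reports the numbers.

```javascript
// Hypothetical helper: labels an uprightdiff result based on the
// modifiedArea and residualArea metrics described above.
function classifyDiff(metrics) {
	if (metrics.modifiedArea === 0) {
		// Pixel-perfect match; the other two metrics are zero as well.
		return 'identical';
	}
	if (metrics.residualArea === 0) {
		// Every difference is explained by vertical motion.
		return 'motion-only';
	}
	// Some pixels still differ after motion has been accounted for.
	return 'significant';
}

console.log(classifyDiff({ modifiedArea: 0, movedArea: 0, residualArea: 0 }));       // identical
console.log(classifyDiff({ modifiedArea: 1200, movedArea: 900, residualArea: 0 }));  // motion-only
console.log(classifyDiff({ modifiedArea: 1200, movedArea: 900, residualArea: 40 })); // significant
```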

The goal of generating a numerical score is to be able to (a) compare test results for different pages and identify the most significant ones, and (b) compare test results for the same page across test runs and determine whether our fixes improved or worsened the situation. With these goals in mind, the visual diffing code takes the totalArea of the image and uses the above 3 metrics to generate 2 different numbers.

SignificantDiffMetric (when residualArea > 0): 75 * residualArea / totalArea + 0.25 * min(max(2^(residualArea / 100000) - 1, 0), 100)
InsignificantDiffMetric (when residualArea == 0): 50 * modifiedArea / totalArea + 50 * movedArea / totalArea
ErrorMetric: 1 if the test had a fatal error, 0 otherwise.

The total score is then computed as 1,000,000 * ErrorMetric + 1,000 * SignificantDiffMetric + InsignificantDiffMetric. In other words, this can be seen as a number in base-1000 notation.

This scoring technique gives us what we want. In addition, the significant diff metric tries to flag pages that are really large (big totalArea value) and that have a sizeable pixel diff (big residualArea) which is nevertheless fairly small relative to the size of the page (small residualArea / totalArea ratio). A simple residualArea / totalArea ratio would favor small pages with mostly insignificant residualArea values over large pages with mostly significant residualArea values. So, we pick a 1M area as our baseline and figure out how big the residual area is relative to that and use exponentiation to weight those heavily.
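
The scoring described above can be sketched as follows. This is a direct transcription of the formulas as written in this section, not the visualdiff source itself. It assumes that each diff metric counts as zero when its condition does not hold, that no rounding is applied, and that totalArea is greater than zero. The function and field names are chosen for illustration.

```javascript
// SignificantDiffMetric, as written above (used when residualArea > 0).
function significantDiffMetric(residualArea, totalArea) {
	const ratioTerm = 75 * residualArea / totalArea;
	const sizeTerm = 0.25 * Math.min(Math.max(Math.pow(2, residualArea / 100000) - 1, 0), 100);
	return ratioTerm + sizeTerm;
}

// InsignificantDiffMetric, as written above (used when residualArea == 0).
function insignificantDiffMetric(modifiedArea, movedArea, totalArea) {
	return 50 * modifiedArea / totalArea + 50 * movedArea / totalArea;
}

// Total score: 1,000,000 * ErrorMetric + 1,000 * significant + insignificant.
function totalScore(r) {
	const error = r.fatalError ? 1 : 0;
	const significant = r.residualArea > 0 ?
		significantDiffMetric(r.residualArea, r.totalArea) : 0;
	const insignificant = r.residualArea === 0 ?
		insignificantDiffMetric(r.modifiedArea, r.movedArea, r.totalArea) : 0;
	return 1000000 * error + 1000 * significant + insignificant;
}

// Motion-only diff: 1% of the page modified, 1% moved, nothing residual.
console.log(totalScore({ modifiedArea: 1000, movedArea: 1000, residualArea: 0, totalArea: 100000, fatalError: false })); // 1

// Residual diff covering 10% of a 1,000,000 pixel page.
console.log(totalScore({ modifiedArea: 150000, movedArea: 50000, residualArea: 100000, totalArea: 1000000, fatalError: false })); // 7750
```

In the second example the ratio term contributes 7.5 and the exponential term 0.25, so the significant metric is 7.75 and the score is 7750.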

We believe that this numeric metric lets us quickly identify problematic rendering differences and use mass visual diff testing without having to manually sift through thousands of diff images to identify where to focus our efforts.

Updating the VMs

To be clear, the sections above are about the labs VMs. In the following discussion, those labs VMs are the hosts for the VMs that mediawiki-vagrant spins up. This section is about keeping mediawiki-vagrant, and the VMs it spins up, up to date.

In the future, it might be easiest to just create new labs VMs and start from scratch. https://phabricator.wikimedia.org/T204566#4797907 has some notes from when we updated the VMs this way in 2018. The following notes might nevertheless be a useful guide in case problems arise while upgrading.

Troubleshooting notes from May 2020 while upgrading vagrant and mediawiki checkout

Keeping mediawiki-vagrant up to date is supposed to be as simple as git pull && vagrant provision. In practice, that wasn't so. This is most likely because of nfs issues that were left unresolved when setting it up. At the time, vagrant reload was abused until no errors were reported when starting up. To consistently start up without error, the suggestion from T139859 was used to set vagrant config nfs_shares no.

Unfortunately, after booting, the permissions in /vagrant are in a problematic state. In order to work around it, on the hosts, do sudo chown -R mwvagrant:wikidev /srv/mediawiki-vagrant and, in the VM, do sudo chown -R vagrant:www-data /vagrant. That at least allows for basic vagrant commands to work.

Generally, to update mediawiki in the VMs, vagrant ssh in and then fix the permissions. Then, instead of using vagrant git-update on the hosts, just invoke run-git-update from inside the VM.

There were a few other one-off problems when provisioning the VMs that required apt-get install php-redis php-igbinary php-luasandbox and fixing links to the available modules when going to php 7.2. These likely won't need repeating. They came from T213016 and T213993.

All that said and done, the major hurdle was that we were using an import from before the actor migration began. I imagine that because users weren't imported, when the schema migration scripts in maintenance/update.php ran, we ended up in a broken state. In order to fix the revision_actor_temp table, I just assigned all the actions to the Admin user. Based on T249185#6028521, in the VMs, create a file t.sql,

insert into revision_actor_temp (revactor_rev, revactor_actor, revactor_timestamp, revactor_page) select rev_id, 1, rev_timestamp, rev_page from revision r where not exists ( select 1 from revision_actor_temp a where a.revactor_rev = r.rev_id );

and then run,

#!/usr/bin/env bash

for db in $(alldbs); do
	echo "$db"
	mysql "$db" < t.sql
done