Topic on Talk:Quarry

LabsDB replica databases can drift from production originals

Summary last edited by Clump 11:51, 19 June 2018

There is some discussion of causes of replica drift on wikitech. There is also a tracking task in phabricator where you can report specific problems you find.

The TL;DR of why this happens is that MediaWiki performs some database changes which are not deterministic (they are based on the current state of the database at the time the change is made) and the wiki replicas hosted by Cloud Services are not identical copies of production. This combination makes for drift problems that must be resolved manually or by a very slow re-import process. The new cluster that is being prepared is using a different replication engine that we hope will make drift less common. That work is tracked in phab:T140788. The only thing that will eliminate drift is changing how MediaWiki does the problematic database operations.

Le Deluge (talkcontribs)

Quarry is using a copy of the database that doesn't match the production database - there's clearly been some kind of corruption or replication problem. It first reared its head 13 months ago, and has now come back and is active at the moment. Please can Quarry be given a "clean" version of the db to feed on?

Email issues mean I can't get on Phabricator at the moment, but hopefully someone here can get something done. I do a lot of work with red-link and other problem categories on en.wiki, particularly with these three reports: Uncategorized categories (Quarry), Categories categorized in red-linked categories (Quarry), and Red-linked categories with incoming links (Quarry).

Obviously you would expect every category in the first query to exist - but it became clear that there were four that were "stuck" in the query, having been deleted on either 22 April 2016 or 30 April 2016. The second query has fourteen "zombies" - some cats that were also deleted on 22 April 2016 (and so shouldn't be in the report), and some cats that do exist but have a parent category that exists (which should disqualify them from the report) - in those cases the parent category was moved to its current name on 22 April 2016. I've even tried null edits, and recreating some of these zombies and then deleting them, but it doesn't affect what happens in Quarry.
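
For context, the first report boils down to something like this - a rough sketch of my own rather than the exact saved Quarry query, using the standard MediaWiki table names on the enwiki_p replica:

    -- Category pages that are themselves in no category at all.
    -- A category deleted in production should drop out of this result
    -- set once the deletion replicates.
    SELECT page_title
    FROM page
    LEFT JOIN categorylinks ON cl_from = page_id
    WHERE page_namespace = 14      -- the Category: namespace
      AND page_is_redirect = 0
      AND cl_from IS NULL
    LIMIT 100;

The "zombies" are rows that keep coming back from queries like this even though the category page itself was deleted in production.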

It gets worse on the third query, which now has 31 zombie cats, all of which are empty and so shouldn't be in the query. A couple overlap with query 2, there are some birth and date ones whose only thing in common is the removal of a CfD tag on 1 May 2016, and there are maintenance categories - one deleted on 13 May 2016, one on 4 June 2016, and one deleted on both 8 June 2016 and 14 August 2016. So far, a coherent story seems to be emerging - a big problem on 22 April 2016, which trickled on for a few weeks afterwards but was then fixed.

Now it's back. If you look at the third query you'll see a bunch of maintenance categories from the last week, which were mostly deleted on 2 June 2017 but there's at least one which was deleted just 13 hours ago. So whatever this corruption/replication problem is, it has clearly come back.

From my point of view, all I want is for Quarry to be working on a copy of the database that actually matches the production database. However, there's also a wider issue of what the underlying cause of all this is - and does it only affect replication to the Labs copy of the database, or is there a wider problem? Worms, meet can.... Le Deluge (talk) 13:43, 4 June 2017 (UTC)

BDavis (WMF) (talkcontribs)

The problem of deletions not replicating properly is known. The short answer is that MediaWiki does things in the database that the current replication strategy does not always deal with well. Work has been underway for several months to build a new database cluster that uses a different replication strategy that should miss fewer deletes and undeletes. I won't say that it will get rid of all of them because there are some database operations that MediaWiki does that are just very difficult to sync to the filtered replicas in the Cloud Services environment. As soon as the new database servers are fully populated for all shards we expect to change the configuration of Quarry to use them.

Le Deluge (talkcontribs)

Thanks for the reply (and I assume you had something to do with Quarry getting so much quicker recently - I now have queries taking 10-20% of the time and, more importantly, not timing out!)

If what you're saying is that Quarry will be working off a "clean" copy of the database within a few months, then that's good enough for me; I was more worried that I'd turned up something with wider implications for the integrity of the database. And it seemed to have got a whole lot worse recently, although that may just be particular to query 3, which has a lot of maintenance cats passing through it for a day or two - lately they've been coming in but not managing to get out again.

So just my curiosity remains - what's actually been happening with the secondary copies of e.g. en:category:Law_about_religion_by_country? Is it some kind of locking issue or something deeper?
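
(For reference, the kind of spot check I've been running on the replica is roughly this - a hand-written sketch, not one of the saved reports:

    -- Does the replica still have a page row for a category that was
    -- deleted in production? Namespace 14 is the Category: namespace.
    SELECT page_id, page_title, page_touched
    FROM page
    WHERE page_namespace = 14
      AND page_title = 'Law_about_religion_by_country';

If the deletion had replicated cleanly, that should return no rows.)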

BDavis (WMF) (talkcontribs)

There is some discussion of causes of replica drift on wikitech. There is also a tracking task in phabricator where you can report specific problems you find.

The TL;DR of why this happens is that MediaWiki performs some database changes which are not deterministic (they are based on the current state of the database at the time the change is made) and the wiki replicas hosted by Cloud Services are not identical copies of production. This combination makes for drift problems that must be resolved manually or by a very slow re-import process. The new cluster that is being prepared is using a different replication engine that we hope will make drift less common. That work is tracked in phab:T140788. The only thing that will eliminate drift is changing how MediaWiki does the problematic database operations.
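
To make that a bit more concrete, here is the flavour of statement that causes trouble - a made-up illustration, not an actual MediaWiki query. Under statement-based replication the SQL text itself is replayed on the replica, so any write whose effect depends on the current state of the server it runs on can diverge:

    -- Hypothetical example of a non-deterministic write: with no ORDER BY,
    -- the 1000 rows that get removed can differ between production and a
    -- replica that is not an identical copy of it.
    DELETE FROM recentchanges
    WHERE rc_timestamp < '20170101000000'
    LIMIT 1000;

Row-based replication ships the actual changed rows instead of the SQL text, which is part of why we expect the new cluster to drift less.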

Le Deluge (talkcontribs)

And the replication problem is ongoing - another 7 cats that were deleted today are still there, such as here

Achim55 (talkcontribs)
Le Deluge (talkcontribs)

And now it's a bit over 3 hours. Thanks - it's a sign of database stress, but I'm not talking about something that's "just" a bit of replication lag. These queries seem to reflect what's happening in the production database pretty closely - a minute or two of lag at most. What I'm talking about is individual entries getting "stuck" - for over a year in some cases.

It feels like these entries have ended up permanently locked on the Quarry db, apparently as a result of being moved or deleted on the production database (presumably at a time of database stress). The locks have been cleared on the production db but not on the Quarry db, and as a result the entries on the Quarry db can't be updated.

At least, that's my guess.

Achim55 (talkcontribs)

The increasing lag mentioned above is now at 37 hours; subtracting the 14 hours from yesterday, there is no replication at all. Yesterday at 12:32 (UTC) I saved a page on Commons whose content still isn't queryable via Quarry.

Le Deluge (talkcontribs)

Ahem, it does sound less impressive if you accidentally divide by 10 like I did.... Again, we are at slight cross-purposes: I'm talking about the main en database, which is not showing significant lag, even if the Commons database is struggling. But the fact that a major database is under such stress must mean that Wikimedia as a whole has a problem.

BDavis (WMF) (talkcontribs)

There is ongoing production maintenance of the "s4" shard that hosts the commons database. The latest at this writing is phab:T166206. This query of the server admin logs shows others happening recently.

You can check on the lag of particular database replicas using https://tools.wmflabs.org/replag/.
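
You can also check from a query tool directly; something like this should work if the heartbeat_p views are exposed on the replica you are connected to (treat the exact view and column names as my best recollection rather than gospel):

    -- Replication lag per shard as reported by the heartbeat mechanism.
    SELECT shard, last_updated, lag
    FROM heartbeat_p.heartbeat
    ORDER BY lag DESC;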

The DBA team has been working on building a new replica cluster for Labs/Cloud Services that we hope will be ready for everyone to use in the next month or two. This new cluster uses a different backend technology for replication (row-based rather than statement-based binlog replication) that should stay in better sync with the production servers. The process of bringing up a new cluster from scratch is pretty slow because we can't just copy the data directly from production. The wiki replicas are filtered to remove data that has been suppressed, and that filtering needs to happen as the data is copied across rather than as a clean-up pass after a full copy.
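
For the curious, the difference between the two approaches shows up in the server's binlog_format setting (not something you can inspect on the production masters yourself, but on any MariaDB/MySQL server it looks like this):

    -- STATEMENT replays the SQL text on replicas; ROW ships the actual
    -- changed rows, which is the mode the new cluster relies on.
    SHOW GLOBAL VARIABLES LIKE 'binlog_format';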

We don't like it when the replicas are out of sync by more than a second or two, but sometimes there isn't much that can be done to prevent it. You can take some amount of solace in knowing that back in the olden days of the Toolserver, replica lag of more than a week was not uncommon for all of the tables in all of the wikis. We are generally doing a lot better as time goes on.

BDavis (WMF) (talkcontribs)

The s4 shard is back in sync, but there is another planned maintenance starting sometime on 2017-06-12 that will cause replication to lag for a few days (phab:T166206#3331928).