Requests for comment/Multi datacenter strategy for MediaWiki/Progress
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. Current status at: wikitech:Performance/Multi-DC MediaWiki
References:
- Multi-DC master tracking task https://phabricator.wikimedia.org/T88445
Multi DC strategy RFC:
- https://phabricator.wikimedia.org/T88666
- Requests for comment/Multi datacenter strategy for MediaWiki
Multi-DC sync-up meeting regular attendees:
- Aaron
- Stas
- Gabriel
- Brandon
- Giuseppe
- Filippo
- Gilles
- Timo
- JaimeC
2016-08-17
MediaWiki:
- [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
- [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- patch reverted for now (user JS breakage); patch to be tweaked
- needs user input; ask community liaisons, ask Design/Reading?
- [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
- Subtask created: https://phabricator.wikimedia.org/T138376
- First patch: https://gerrit.wikimedia.org/r/#/c/302199/1
Configuration:
- [unstarted] Switch parts of config to something like etcd.
Databases:
- [done] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- [in progress] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
- Make sure new cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (1 continuous connection) with no visible overhead
- Certificate management???
- I need to coordinate with Performance and Availability to standardize all MySQL services on the same HA solution. That may require MediaWiki changes so that most of https://noc.wikimedia.org/conf/highlight.php?file=db-eqiad.php gets simplified to a single IP + port per "micro-service". Also, those 2 files should probably disappear, leaving only db.php, given that we will have a single active-active setup (?) T141547
Media storage / Swift:
- [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
- swiftrepl/MediaWiki cross-DC writes use HTTP now. Let's clean this up before doing active/active though.
Session storage / redis:
- [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
- What is the advantage of using restbase vs. direct cassandra?
- RestBase allows us to narrow the public interface (no way to drop & list all data etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
- https://phabricator.wikimedia.org/T140813
- Last meeting affirmed cautious support for cassandra/hyperswitch
- Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)
CDN / traffic:
- [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
- Patch to distinguish callback updates deployed
- Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
- Mostly logging, parsercache updates, spreadAnyEditBlock() is 20/minute
- [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
Services:
- [in progress] look into mcrouter to see if it can work for WANCache
- Either email some people or use a GitHub question
- initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
- Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
- Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
- ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™:
- When to call it out / how far away are we from starting active-active operation?
- What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
- Workboard: https://phabricator.wikimedia.org/tag/wikimedia-multiple-active-datacenters/
- Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-08-03
MediaWiki:
- [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
- [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- patch reverted for now (user JS breakage); patch to be tweaked
- needs user input; ask community liaisons, ask Design/Reading?
- [in progress] wikidata master queries ( https://phabricator.wikimedia.org/T110399 )
- Subtask created: https://phabricator.wikimedia.org/T138376
- First patch: https://gerrit.wikimedia.org/r/#/c/302199/1
Configuration:
- [unstarted] Switch parts of config to something like etcd.
Databases:
- [unblocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- Config patch at https://gerrit.wikimedia.org/r/#/c/243116/
- datacenter column now present \o/
- [unblocked] Deploy MASTER_GTID_WAIT() support (https://gerrit.wikimedia.org/r/#/c/289985/)
- Patch merged in core
- Config patch at https://gerrit.wikimedia.org/r/#/c/302635/ (might do testwiki first though)
- [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
- Make sure cross-DB TLS connections are rare (opening a connection has 10x worse latency than non-SSL). We already use TLS for replication (1 continuous connection) with no visible overhead
- Certificate management???
- [status?] ES compression...blocker?
- https://phabricator.wikimedia.org/T106386
- Not a blocker
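A rough sketch of how the new 'datacenter' column could feed lag detection: take the newest heartbeat written for the master's datacenter and subtract it from the current time. The row shape (`Heartbeat`), the function name `estimate_lag`, and the in-memory rows are assumptions made for illustration; this is not the MediaWiki or pt-heartbeat implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class Heartbeat:
    """Hypothetical view of one heartbeat row as seen on a replica."""
    ts: datetime        # time the master wrote the heartbeat
    server_id: int      # server that wrote it
    datacenter: str     # the 'datacenter' column mentioned in the notes


def estimate_lag(rows: Iterable[Heartbeat], master_dc: str,
                 now: datetime) -> Optional[float]:
    """Return estimated replication lag in seconds, or None if unknown.

    Only heartbeats written in master_dc are considered, so a stale row
    left behind by another datacenter cannot skew the estimate.
    """
    timestamps = [row.ts for row in rows if row.datacenter == master_dc]
    if not timestamps:
        return None
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, (now - max(timestamps)).total_seconds())


now = datetime(2016, 8, 3, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    Heartbeat(datetime(2016, 8, 3, 11, 59, 58, tzinfo=timezone.utc), 1, "eqiad"),
    Heartbeat(datetime(2016, 8, 3, 11, 59, 30, tzinfo=timezone.utc), 2, "codfw"),
]
print(estimate_lag(rows, "eqiad", now))
```

Filtering by datacenter is the point of the sketch: without it, the replica would have no way to tell which heartbeat row reflects the currently active master.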
Media storage / Swift:
- [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
- swiftrepl uses HTTP now. Do we want to add MediaWiki to this?
- [HARD BLOCKER] let's do SSL first
Session storage / redis:
- [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
- What is the advantage of using restbase vs. direct cassandra?
- RestBase allows us to narrow the public interface (no way to drop & list all data etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
- https://phabricator.wikimedia.org/T140813
- Last meeting affirmed cautious support for cassandra/hyperswitch
- Idea of services team focusing on session before auth storage was floated (would be useful for multi-DC work)
CDN / traffic:
- [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
- Patch to distinguish callback updates deployed
- Graph for GETs: https://graphite.wikimedia.org/render/?width=973&height=470&_salt=1470217359.097&target=highestCurrent(MediaWiki.deferred_updates.GET.*.rate%2C8)
- Mostly logging, parsercache updates, spreadAnyEditBlock() is 20/minute
- [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
Services:
- [unstarted] look into mcrouter to see if it can work for WANCache
- Either email some people or use a GitHub question
- initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
- Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
- Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
- ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™:
- When to call it out / how far away are we from starting active-active operation?
- What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
- Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
- Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-07-20
MediaWiki:
- [done] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
- [assigned] Flow cache purges to use WAN cache ( https://phabricator.wikimedia.org/T120009 )
- [blocked] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- patch reverted for now (user JS breakage); patch to be tweaked
- needs user input; ask community liaisons, ask Design/Reading?
- [unstarted] wikidata master queries (T110399)
- Subtask created: T138376
- [done] notify users to use POST for rollback/markpatrolled/purge tools
Databases:
- [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
- [done] MASTER_GTID_WAIT() support (https://gerrit.wikimedia.org/r/#/c/289985/)
- Initial version done, maybe test in betalabs with mariadb next?
- [done] talk to RE about mariadb version (https://phabricator.wikimedia.org/T138778)
- [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
Media storage / Swift:
- [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
Session storage / redis:
- [in progress] Use a dedicated HyperSwitch/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
- Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
- What is the advantage of using restbase vs. direct cassandra?
- RestBase allows us to narrow the public interface (no way to drop & list all data etc.) and gives independence from the backend. Do we have other backends besides cassandra and sqlite for restbase? Also, we're already choosing the datastore, not the restbase
- https://phabricator.wikimedia.org/T140813
CDN / traffic:
- [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
- T137326: done
- [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
Services:
- [unstarted] change_propagation module for CDN cache purges
- [unstarted] look into mcrouter to see if it can work
- initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
- [unstarted] develop xkey purge strategy: Brandon to set up initial brainstorm meeting
- looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
- new librdkafka based node client looking good, starting beta testing; adds Kafka 0.9/0.10 support
- Firming up design for session & auth service: https://phabricator.wikimedia.org/T140813
- Timeline: Prototype auth service in Q1, deploy with security in Q2. Can push for session storage deploy earlier, pending hardware.
- ACTION: Gabriel to set up meeting for session storage next week.
The Big Active / Active Goal™:
- When to call it out / how far away are we from starting active-active operation?
- What are the critical things we need to have solved / in place before we can call out a technology goal of going active-active?
- Use tracking ticket: ACTION: Aaron to create, discuss at next meeting.
- Aaron: I'd rather use a tag and board, TODO
- Blocking tasks are all in etherpad now
- Timeline: Tentatively looks like Q2 is still busy. Possibly Q3?
2016-06-22
MediaWiki:
- [under review] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
- [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
- [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- reverted for now (user JS breakage); patch to be tweaked
- [unstarted] wikidata master queries (T110399)
- Subtask created: T138376
- [in progress] notify users to use POST for rollback/markpatrolled/purge tools
Databases:
- [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
- [in progress] MASTER_GTID_WAIT() support (https://gerrit.wikimedia.org/r/#/c/289985/)
- Initial version done, maybe test in betalabs with mariadb next?
- [ACTION] talk to RE about mariadb version
- [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
Media storage / Swift:
- [done] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
- done and left on; no noticeable effect on api entry points
- [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
Session storage / redis:
- [in progress] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
- Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
CDN / traffic:
- [done] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
- T137326: done
- [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
Services:
- change_propagation module for CDN cache purges
- [unstarted] look into mcrouter to see if it can work
- initial mcrouter debianization: https://gerrit.wikimedia.org/r/#/c/288196/
- looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
2016-06-08
MediaWiki:
- [unassigned] restbase BagOStuff subclass (https://phabricator.wikimedia.org/T137272 )
- [unassigned] Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
- [assigned] action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- reverted for now (user JS breakage); patch to be tweaked
- [unstarted] wikidata master queries (T110399)
Databases:
- [blocked] pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- Config patch https://gerrit.wikimedia.org/r/#/c/243116/ waiting on 'datacenter' pt-heartbeat table column
- [wip] MASTER_GTID_WAIT() support (https://gerrit.wikimedia.org/r/#/c/289985/)
- Initial version done, maybe test in betalabs with mariadb next?
- [unstarted] mariadb clients (MediaWiki) to use TLS/SSL( https://phabricator.wikimedia.org/T134809 )
Media storage / Swift:
- [unstarted] Experiment with sync/async and watch statsd for api entry point for multiwrite backend
- statsd graphs finally fixed (at https://grafana.wikimedia.org/dashboard/db/api-requests)
- use 'sync' if not too slow (little upload API speed change per statsd) (https://gerrit.wikimedia.org/r/293272)
- [unstarted] HTTPS for swift: https://phabricator.wikimedia.org/T127455
Session storage / redis:
- [unassigned] Use restbase/cassandra cluster? (https://phabricator.wikimedia.org/T134811 )
- Old patch for direct cassandra use: https://gerrit.wikimedia.org/r/#/c/238370/1/includes/libs/objectcache/CassandraBagOStuff.php
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
CDN / traffic:
- [deferred] VCL routing logic: https://phabricator.wikimedia.org/T91820
- VCL or Apache proxying?
- Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
- https://phabricator.wikimedia.org/T92357 tracks master queries on GET/HEAD
- [ACTION] log all post-send DB updates to gauge frequency (we don't want too many threads tied up)
- Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
- Too many deferred updates and a few sync exceptions (writes will be cross-DC then)
- [status?] General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
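The "basic protection" idea floated above (refuse the write and return a 500 when a GET would write on the non-primary datacenter) could look roughly like the sketch below. All names here (`SAFE_METHODS`, `UnsafeWriteError`, `check_write_allowed`, `handle_request`) are hypothetical and only illustrate the rule; they are not MediaWiki code.

```python
# Methods that HTTP defines as safe and that are therefore expected to be
# served from any datacenter without touching the master.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class UnsafeWriteError(Exception):
    """A safe-method request tried to write outside the primary datacenter."""


def check_write_allowed(method: str, is_primary_dc: bool) -> None:
    """Raise if a write is attempted by a safe method on a non-primary DC."""
    if method.upper() in SAFE_METHODS and not is_primary_dc:
        raise UnsafeWriteError(
            f"{method} attempted a write on a non-primary datacenter"
        )


def handle_request(method: str, is_primary_dc: bool, wants_write: bool) -> int:
    """Return the HTTP status a toy application layer would send."""
    try:
        if wants_write:
            check_write_allowed(method, is_primary_dc)
    except UnsafeWriteError:
        return 500  # refuse the write instead of doing it cross-DC
    return 200


print(handle_request("GET", is_primary_dc=False, wants_write=True))
```

The sketch only covers safe methods; it assumes other methods are routed to the primary datacenter before they reach the application layer.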
Services:
- change_propagation module for CDN cache purges
- [unstarted] look into mcrouter to see if it can work
- looking into Kafka fail-over / upgrade; likely 0.9 first, then 0.10 (will bring timestamp indexes): https://phabricator.wikimedia.org/T127718
2016-05-25
MediaWiki:
- EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
- Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
- action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
- Reduce cross DC wiki DB queries
- action=purge and wikidata (T110399)
Databases:
- pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- We need both decent HA and correct lag estimates in all DCs
- MASTER_GTID_WAIT() support (https://gerrit.wikimedia.org/r/#/c/289985/)
- Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate) ( https://phabricator.wikimedia.org/T134809 )
Media storage / Swift:
- FileBackendMultiWrite 'async' upload/thumbnail race conditions
- Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
- Experiment with sync/async and watch statsd for api entry point
- HTTPS for swift: https://phabricator.wikimedia.org/T127455
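To make the 'async' race and the two options concrete, here is a toy model, not FileBackendMultiWrite itself. `MultiWriteStore`, its `replication` argument, `flush()`, and `force_primary` are invented for this illustration; `force_primary` stands in for Option 2 (forcing the 'master' backend for a client that just wrote).

```python
from typing import Dict, List, Optional, Tuple


class MultiWriteStore:
    """Toy two-backend store: one primary and one replica."""

    def __init__(self, replication: str = "sync") -> None:
        if replication not in ("sync", "async"):
            raise ValueError("replication must be 'sync' or 'async'")
        self.replication = replication
        self.primary: Dict[str, bytes] = {}
        self.replica: Dict[str, bytes] = {}
        self._pending: List[Tuple[str, bytes]] = []

    def write(self, key: str, data: bytes) -> None:
        self.primary[key] = data
        if self.replication == "sync":
            # Option 1: the write returns only after both backends have it.
            self.replica[key] = data
        else:
            # 'async': the replica write is deferred until flush().
            self._pending.append((key, data))

    def flush(self) -> None:
        """Apply deferred replica writes (stands in for a post-send job)."""
        for key, data in self._pending:
            self.replica[key] = data
        self._pending.clear()

    def read(self, key: str, force_primary: bool = False) -> Optional[bytes]:
        # Option 2: a client that just wrote can be pinned to the primary.
        source = self.primary if force_primary else self.replica
        return source.get(key)


store = MultiWriteStore(replication="async")
store.write("Example.jpg", b"original")
print(store.read("Example.jpg"))                      # replica has not caught up
print(store.read("Example.jpg", force_primary=True))  # primary already has it
```

In this model a thumbnail request that reads from the replica right after an 'async' upload finds nothing, which is the race the notes describe.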
Session storage / redis:
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 ) and SSL needed
- Use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )
CDN / traffic:
- VCL routing logic: https://phabricator.wikimedia.org/T91820
- Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
- Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
- General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
- Experiment with % of traffic to codfw (avoid loops?)
Services:
- change_propagation module for WAN cache purges
2016-05-11
ACTION ITEMS:
MediaWiki:
- EventBus purge relayer for WAN cache https://phabricator.wikimedia.org/T134535
- Flow cache purges ( https://phabricator.wikimedia.org/T120009 )
- ?action=rollback uses GET ( https://phabricator.wikimedia.org/T88044 )
Databases:
- pt-heartbeat usage for lag detection ( https://phabricator.wikimedia.org/T111266 )
- Related: cross-datacenter state visibility (in general, chronology checks). Use GTID? Use pt-heartbeat? Needs discussion. Joe mentions that this needs to work for "regular/simple" non-WMF MediaWiki setups.
- MASTER_POS_WAIT() does not work cross-DC with current file/coords [Jaime will file a task]
- Cross DC writes and TLS/SSL (e.g. writes via DeferredUpdates or CentralAuth autocreate)
- Parsercache (not really DBs): General consensus on replacing the datastore from MySQL to something else with mult (which should eventually be done). Jaime proposes to do a couple of fixes to have something quickly.
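For the GTID question above: binlog file/offset coordinates are local to each server's own binlog, which is presumably why they do not carry across datacenters, while a MariaDB GTID (`domain-server-sequence`) means the same thing on every replica. A minimal sketch of the "has this replica reached that position?" comparison, assuming sequence numbers increase within a domain; `parse_gtid_pos` and `has_reached` are hypothetical helpers, not MediaWiki or MariaDB APIs.

```python
from typing import Dict, Tuple


def parse_gtid_pos(pos: str) -> Dict[int, Tuple[int, int]]:
    """Parse 'domain-server-seq[,domain-server-seq...]' into a dict.

    Keys are replication domain ids; values are (server_id, sequence).
    """
    result: Dict[int, Tuple[int, int]] = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain, server, seq = (int(field) for field in gtid.split("-"))
        result[domain] = (server, seq)
    return result


def has_reached(replica_pos: str, target_pos: str) -> bool:
    """True if the replica has applied every domain up to target_pos."""
    replica = parse_gtid_pos(replica_pos)
    for domain, (_server, seq) in parse_gtid_pos(target_pos).items():
        if domain not in replica or replica[domain][1] < seq:
            return False
    return True


# Position recorded when a user saved an edit vs. what a replica reports.
print(has_reached("0-171970580-1500", "0-171970580-1400"))
```

A chronology check could store the target position at write time and have later requests wait until `has_reached` holds; the server-side wait itself is what MASTER_GTID_WAIT() would provide.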
Media storage / Swift:
- FileBackendMultiWrite 'async' upload/thumbnail race conditions
- Option 1: use 'sync', Option 2: plug into ChronologyProtector to force 'master' backend
- Experiment with sync/async and watch statsd for api entry point
- HTTPS for swift: https://phabricator.wikimedia.org/T127455
Session storage / redis:
- Sync writes for ChronologyProtector ( https://phabricator.wikimedia.org/T111575 )
- Blocked on TLS/SSL for apaches <=> redis (http://redis.io/topics/encryption not supported)
- Maybe use another system (like a cassandra cluster?) (https://phabricator.wikimedia.org/T134811 )
ElasticSearch:
- Basically ready
CDN / traffic:
- VCL routing logic: https://phabricator.wikimedia.org/T91820
- Should probably block on verification that idempotent/safe methods (GET/HEAD/OPTIONS) do not cause writes, is there a task for this?
- Related: perhaps there should be some basic protection too if there isn't (if GET causes a nasty write on non-primary, throw a 500 and don't do it at the applayer?)
- General Active/Active support (incl non-MW, not sticky-cookie specific): https://phabricator.wikimedia.org/T134404
- Experiment with % of traffic to codfw (avoid loops?)
Services:
- change_propagation module for WAN cache purges