Core Platform Team/Initiatives/Enable Multi-DC Session Storage

Initiative Description

Summary

Develop a multi-master replicated key-value storage service, the semantics of which permit session access from MediaWiki in an active-active, multi-datacenter configuration.  Secondarily, the service decouples MediaWiki from storage, creating additional isolation of sensitive data.

Significance and Motivation

This work is a blocker for running MediaWiki in an active-active, multi-data-center configuration. It enables session access from more than one data center and makes the system more fault tolerant and resilient. Secondarily, it isolates session data.

Outcomes

Increase the scalability of the platform for future applications and new types of content, as well as for a growing user base and a growing amount of content.

Baseline Metrics
  • Sessions are accessed from one data center
Target Metrics
  • Sessions can be accessed from two data centers
Stakeholders
  • SRE
  • Performance
Known Dependencies/Blockers
  • Setup Kubernetes security zone (SRE)
  • Security review (Security - 30 day lead time)

Epics, User Stories, and Requirements

  • Hardware request and setup
  • RFC for the session storage API
  • Investigate the current use of Redis for session storage to determine whether it implies additional work
  • Design implementation (storage, replication semantics, performance)
  • Test and prototype in multiple languages to understand performance/latency/throughput
  • Implementation
  • Figure out deployment method
  • CI for builds, testing, and Docker image creation
  • Cassandra cluster configuration
  • Beta deployment
  • Develop migration plan
  • Integrate with MediaWiki
  • Determine whether “set if not exists” functionality is needed (implement if needed)
  • Determine whether per-operation TTLs are needed (implement if needed)
  • Enable functional testing (set up and tear down of Cassandra)
  • Security review
  • Implement service-checker functionality (endpoint monitoring)
  • Figure out the Kubernetes deployment (Helm charts)
  • Deploy according to migration plan (test wikis, etc…)

Time and Resource Estimates

Estimated Start Date

October 2018

Actual Start Date

Started in October 2018

Estimated Completion Date

None given

Actual Completion Date

None given

Resource Estimates
  • 2 FTE for 6 months (FY1819 Q2-Q3)
  • 2 part time engineers for 3 months during deployment (FY1819 Q4)
Collaborators
  • Core Platform
  • SRE
  • Security

Open Questions

  • Should CentralAuth metadata be stored in the same Kask instance or a different one?
  • Is “set if not exists” functionality needed?
  • Are per-operation TTLs needed?
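
To make the last two questions concrete, the following is a minimal sketch of what the two candidate features would mean for a key-value session store. It is an in-memory stand-in for illustration only, not the service's design: the class name SessionStore, its method names, and the injectable clock are all hypothetical. In a multi-master replicated store a conditional write needs coordination between replicas (Cassandra, for example, offers this through lightweight transactions), which is one reason the feature is worth deciding on explicitly.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class SessionStore:
    """Hypothetical in-memory key-value store, for illustration only."""

    def __init__(self, default_ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._default_ttl = default_ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        # key -> (value, absolute expiry time on the injected clock)
        self._data: Dict[str, Tuple[bytes, float]] = {}

    def get(self, key: str) -> Optional[bytes]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are treated as absent and dropped lazily.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: bytes, ttl: Optional[float] = None) -> None:
        """Unconditional write. A per-operation ttl overrides the default."""
        lifetime = self._default_ttl if ttl is None else ttl
        self._data[key] = (value, self._clock() + lifetime)

    def set_if_not_exists(self, key: str, value: bytes,
                          ttl: Optional[float] = None) -> bool:
        """Write only if no live entry exists. Returns True if written."""
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl)
        return True

    def delete(self, key: str) -> None:
        self._data.pop(key, None)
```

With this sketch, a second set_if_not_exists call on a live key returns False and leaves the stored value unchanged, and a value written with ttl=5.0 stops being returned by get once the clock has advanced five seconds, regardless of the store-wide default.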

Documentation Links

Phabricator

https://phabricator.wikimedia.org/T206016 (master ticket)

Plans/RFCs

Requests for comment/SessionStorageAPI

Other Documents

wikitech:Performance/Multi-DC MediaWiki