Core Platform Team/Initiatives/Multi-DC Echo Notification Storage

Initiative Description

Summary

We will provide a key-value store backend for the Echo notification extension that will work in an active-active data centre environment.

Significance and Motivation

Once multiple data centres run active MediaWiki servers, our application and storage infrastructure will have to adapt. In particular, the assumption that databases will be read only in one data centre is no longer valid; we need replicated databases in our two North American data centres, and our architecture should support further decentralization of data.

Echo is an extension for in-wiki notifications. Because it lists notifications in the Web interface, it keeps a "last-read" timestamp, so that only unread notifications are shown in the user interface. Each registered user of Wikimedia sites has two timestamps: one for alerts and one for notifications.

The notification timestamps are currently stored using the MainStash abstraction in MediaWiki. MainStash is a global key-value store that doesn't require a schema or database setup like most database tables do. In the Wikimedia environment, MainStash is implemented as a RedisBagOStuff, writing to Redis servers in the Wikimedia main data centre.

We do not expect Redis to extend to a multi-DC environment. Much of the data currently stored in Redis has been or will be moved to other storage. The goal of this project is to move Echo notification timestamps to storage that is more multi-DC-friendly.

Our experience with moving Wikimedia session storage to the new Kask key-value server should be helpful in determining storage requirements for Echo.

Outcomes

  • Echo notifications are not a blocker for active-active DC deployment
  • No noticeable degradation of performance for Echo notification UI
  • No noticeable errors in Echo notification UI

Baseline Metrics

  • Echo notification last-read timestamps are stored in MainStash
  • MainStash is implemented as a Redis server
  • WMF Redis configuration is not multi-DC-ready and we don't see a future configuration that will be

Target Metrics

  • Echo notification last-read timestamps are stored in Cassandra
  • Echo extension uses configurable key-value store, such as RESTBagOStuff, to store notification data
  • Kask REST service brokers data storage

Stakeholders

  • Growth (Echo owners)
  • SRE (for new services)

Known Dependencies/Blockers

None given

Epics, User Stories, and Requirements

Personas

  • User - a registered Wikimedia user
  • Infrequent User - a registered Wikimedia user who logs in "infrequently" (boundary of frequent/infrequent TBD)
  • Systems Administrator - a systems administrator

Epic 1

User Stories
| ID | Description | Priority | Notes |
| --- | --- | --- | --- |
| 1 | As a User, I want to see unread notifications, so I get timely notifications without a lot of confusing older messages. | Must have | This is the "read" functionality for the timestamps. The timestamps are read on the server side by the Echo notification code, and change the appearance of the "notifications" and "alerts" indicators on each wiki page. There are potential race conditions if the timestamp was recently written in another data centre and the row has not yet propagated. |
| 2 | As a User, I want the system to remember that I read notifications when I click the Notices popup on the main Web UI, so that I don't have to get notified of those messages again. | Must have | This is one "write" function for the timestamps. When the alerts or notification UI is popped open, a POST request is sent from the browser to the MediaWiki server, handled by the Echo extension, to reset the last-read timestamp to now. |
| 3 | As a User, I want the system to remember that I read notifications when I go to the All Notifications page, so that I don't have to get notified of those messages again. | Must have | This is the other "write" function for the timestamps. When the user navigates to the Special:Notifications page, the server resets the last-read timestamp, even though it is an HTTP GET request. In our multi-DC environment, we usually don't write to data storage on an HTTP GET, since POST, PUT, and DELETE requests are all routed to the primary data centre. In this case, it can mean writing to data storage in a secondary data centre. |
| 4 | As a Systems Administrator, I want to configure Echo to write its notification timestamps to the storage engine of my choice, so I can architect my storage systems without changing the extension's code. | Must have | Echo currently writes to MainStash without any override capability. We need a way to configure Echo to use a different storage server. |
| 5 | As an Infrequent User, I want to see unread notifications even if I haven't logged in for a long time, so that I am not surprised when I look at my notifications list and see notifications marked as "read" that I have not read. | Must have | This is the user story for migrating data from the Redis server to the Kask server. |
| 6 | As a Systems Administrator, I want to decommission the Redis server once it is no longer used by any software components, because it is not well-suited to our multi-DC configuration. | Must have | This is for shifting out of the migration period into the final configuration. |
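
The read and write paths in user stories 1 to 3 can be illustrated with a minimal Python sketch. It is not Echo's actual code (Echo is a PHP extension): the `LastReadStore` class, the key layout, the section labels, and the function names are all hypothetical, and a plain dictionary stands in for the key-value backend. Treating a missing timestamp as "everything is unread" is also an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Notification:
    section: str    # "alert" or "notice" (hypothetical labels)
    timestamp: int  # seconds since the epoch
    text: str


@dataclass
class LastReadStore:
    """Hypothetical stand-in for the key-value store holding last-read timestamps."""
    data: Dict[str, int] = field(default_factory=dict)

    @staticmethod
    def key(user_id: int, section: str) -> str:
        # Assumed key layout; the real key format is not given in this document.
        return f"echo:last-read:{section}:{user_id}"

    def get(self, user_id: int, section: str) -> Optional[int]:
        return self.data.get(self.key(user_id, section))

    def mark_read(self, user_id: int, section: str, now: int) -> None:
        # Write path (user stories 2 and 3): reset the last-read timestamp to "now".
        self.data[self.key(user_id, section)] = now


def unread(store: LastReadStore, user_id: int, section: str,
           notifications: List[Notification]) -> List[Notification]:
    # Read path (user story 1): keep notifications newer than the last-read timestamp.
    # Assumption: with no stored timestamp, every notification counts as unread.
    last_read = store.get(user_id, section)
    return [
        n for n in notifications
        if n.section == section and (last_read is None or n.timestamp > last_read)
    ]
```

In this model, a read served by a data centre that has not yet received the latest `mark_read` write would see an older `last_read` value and show already-read notifications as unread, which is the race condition noted for user story 1.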

Engineering tasks

This is informational, based on Evan's understanding of what needs to be done.

  • Stand up a new Kask server (user stories 1, 2, 3)
  • Change Echo so that it can use a configured object store, with MainStash as a fallback (user story 4)
  • Configure WMF MediaWiki servers to use MultiWriteBagOStuff with Kask and Redis as a fallback so that it gradually migrates from Redis to Kask (user story 5)
  • Write and run a maintenance script to copy all or some Echo notification timestamps from Redis to Kask (user story 5)
  • Configure WMF MediaWiki servers to use RESTBagOStuff only, without the Redis fallback (user story 6)
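
The gradual migration in the multi-write step can be modelled with a short Python sketch. This is a simplified illustration, not the MultiWriteBagOStuff implementation: `DictStore` and `MultiWriteStore` are hypothetical names, in-memory dictionaries stand in for Kask and Redis, and backfilling the primary store on a fallback read is an assumption of the sketch.

```python
from typing import Dict, List, Optional


class DictStore:
    """In-memory stand-in for a key-value backend such as Kask or Redis."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}

    def get(self, key: str) -> Optional[int]:
        return self.data.get(key)

    def set(self, key: str, value: int) -> None:
        self.data[key] = value


class MultiWriteStore:
    """Simplified multi-write model: stores[0] is the new primary, the rest are fallbacks."""

    def __init__(self, stores: List[DictStore]) -> None:
        self.stores = stores

    def set(self, key: str, value: int) -> None:
        # Every write goes to all stores, so new data lands in the primary immediately.
        for store in self.stores:
            store.set(key, value)

    def get(self, key: str) -> Optional[int]:
        # Try each store in order; on a fallback hit, copy the value into the
        # stores that were tried first so the next read is served by the primary.
        for index, store in enumerate(self.stores):
            value = store.get(key)
            if value is not None:
                for earlier in self.stores[:index]:
                    earlier.set(key, value)
                return value
        return None
```

With `MultiWriteStore([kask, redis])`, keys that are read or written during the migration period end up in the primary store. Keys that are never touched stay only in the fallback, which is why a separate copy step is still needed for inactive users.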

The maintenance script is tricky. There are tens of millions of timestamps in Redis, so it will take a long time to run. It will only be run once. However, any timestamps older than when we start doing the multi-write configuration will otherwise be lost when we go to the Kask-only configuration. It's important for us to figure out how far we need to go back, and what happens if there's no data for a user who hasn't been back to Wikipedia in a year or two.
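
One possible shape for the copy step is sketched below in Python. It is illustrative only: the real maintenance script would be a MediaWiki PHP script talking to Redis and Kask, whereas here dictionaries stand in for both stores, and `copy_timestamps`, `batched`, `cutoff`, and `batch_size` are hypothetical names. The `cutoff` parameter represents the still-open decision about how far back to go.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def batched(items: Iterable[Tuple[str, int]], size: int) -> Iterator[List[Tuple[str, int]]]:
    # Group items into lists of at most `size`, so a long run can proceed in small steps.
    batch: List[Tuple[str, int]] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def copy_timestamps(source: Dict[str, int], target: Dict[str, int],
                    cutoff: int, batch_size: int = 1000) -> int:
    """Copy timestamps at or after `cutoff` from source to target; return the number copied."""
    copied = 0
    candidates = ((key, ts) for key, ts in source.items() if ts >= cutoff)
    for batch in batched(candidates, batch_size):
        for key, ts in batch:
            # Never overwrite a newer value already written during the multi-write period.
            if key not in target or target[key] < ts:
                target[key] = ts
                copied += 1
    return copied
```

Skipping keys whose target value is already newer makes the copy safe to run while the multi-write configuration is live, and safe to re-run if it is interrupted.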

