Wikimedia Technical Conference/2018/Session notes/Identifying the requirements and goals for dependency tracking and events

Session Setup - https://phabricator.wikimedia.org/T206068

Theme: Architecting our code for change and sustainability

Type: Technical Challenges

Facilitation Exercise(s):

Leader(s): Alexandros

Facilitator: Jon Katz

Scribe: Nick

Description: Dependency tracking and propagating events for changes is a critical function of our infrastructure that enables us to invalidate the many caches we have for our content, and regenerate artifacts based on it. New use cases, especially around Wikidata, have greatly increased the number of events we need to propagate and process to ensure we always return the latest content to users. We are currently designing a new Modern Event Platform to fill this need as well as the needs of our Analytics stack. This session looks to aggregate needs and requirements for the dependency tracking in order to provide parameters for designing this new system.

Facilitator Instructions: /Session_Guide#Session_Guidance_for_facilitators

Questions to answer during this session

Question Significance:

Why is this question important? What is blocked by it remaining unanswered?

What event propagation issues do we have now? What use cases do we see event propagation as a solution for? Are the current efforts for the Modern event platform poised to solve these? What other gaps do we have and do we have solutions for these? If not what do we need to do to find them? We are building a new event propagation system built on Kafka. This is being used in several contexts including dependency tracking. We should make sure that we understand what the new system is solving and if there are gaps identify them. We should also make sure we have a scalable way to store dependencies, in order to route events.
What is the SLO for dependency tracking / delay for different types of updates? How do we solve the issue of the number of events caused by invalidation of Wikidata items due to the inherent structure of Wikidata? Is there a way to reduce the number of events? If not, is there a way to scale to handle this number of events? If we integrate Wikidata and Wikibase more into other content wikis, what is the impact of this? When editing content, we need to invalidate our caches to ensure clients get the most recent version. Most of this is currently done using the JobQueue. Wikidata and the way that it stores data has increased the number of updates and has made invalidation more difficult when it impacts many pages. With the increase of including Wikidata content on other wikis, this may get worse. Do we know how we are going to solve this?
Would deterministic parsing and closed templates impact the needs of dependency tracking? Dependency tracking is key to invalidating content. Syntactically closed templates would allow us to ensure parts of HTML are independent of each other. How does this impact the need for dependency tracking and re-parsing of content?
How does the product need to increase the use of content from other wikis impact the needs of event propagation? Product is moving towards including content from other projects… specifically Wikipedia. This is already happening with Wikidata. This would seem to increase the need to propagate events to invalidate content across projects.

Daniel’s personal brain dump

  • Purpose: update things when stuff they depend on changes, recursively. Replace “links tables” and “RefreshLinksJobs” with something more scalable, more flexible and extensible, more granular, and cross-wiki.
  • Two components: event bus (or http polling) and dependency graph. My focus is on the dependency graph.
  • Idea: each node in the graph can be “touched”; all nodes that depend on the touched node become “dirty” and need to be touched in turn.
  • Graph properties: directed, not connected, acyclic (guaranteeing this is going to be tricky), shallow. Low median degree (in and out), but some nodes with very high degree. Roughly two orders of magnitude more edges than nodes.
  • Initial graph size: edges ~ size of all link tables combined < 10 billion, nodes ~ number of all pages on all wikis < 1 billion. Increasing granularity -> fewer superfluous updates, larger graph.
  • Splitting by wiki is tempting, but won’t work cleanly (commons, wikidata, global user pages, etc)
  • Writes: 100 nodes / 1000 edges per second.
  • Unfinished idea of how change propagation can work: https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/DependencyEngine
  • Increasing granularity: isolated template rendering, tracking dependency on wikidata statements, distinguishing dependency on page content and page title, etc.
  • Decreasing overhead: deduplication and coalescing of events (e.g. subsequent edits), which always means delay; ignoring redundant updates.
  • Technology questions: interface and model (HTTP and/or events), graph storage technology, horizontal scalability, availability and persistence requirements.
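
As a rough illustration of the “touch” idea in the list above, the following Python sketch keeps the graph in memory and walks it breadth-first. The class name, method names, and node names are hypothetical; a real service would need persistent, horizontally scalable storage, which this toy version does not attempt.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Toy in-memory dependency graph; node names are arbitrary strings."""

    def __init__(self):
        # Maps a resource to the set of nodes that depend on it.
        self.dependents = defaultdict(set)

    def add_dependency(self, dependent, resource):
        """Record that `dependent` must be updated when `resource` changes."""
        self.dependents[resource].add(dependent)

    def touch(self, node):
        """Return every node made dirty by touching `node`, breadth-first."""
        dirty = []
        seen = {node}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for dep in sorted(self.dependents[current]):
                if dep not in seen:  # also guards against accidental cycles
                    seen.add(dep)
                    dirty.append(dep)
                    queue.append(dep)
        return dirty


# Hypothetical example: two pages use an infobox that wraps a core template.
graph = DependencyGraph()
graph.add_dependency("Template:Infobox", "Template:Infobox/core")
graph.add_dependency("Page:Berlin", "Template:Infobox")
graph.add_dependency("Page:Paris", "Template:Infobox")
dirty_nodes = graph.touch("Template:Infobox/core")
```

Here `dirty_nodes` lists the infobox first and then the two pages, which mirrors the recursive “dirty” propagation described in the brain dump.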

Attendees list

  • About 16 people.

Structured notes

There are five sections to the notes:

  1. Questions and answers: Answers the questions of the session
  2. Features and goals: What we should do based on the answers to the questions of this session
  3. Important decisions to make: Decisions which block progress in this area
  4. Action items: Next actions to take from this session
  5. New questions: New questions revealed during this session

Questions and answers

Please write in your original questions. If you came up with additional important questions that you answered, please also write them in. (Do not include “new” questions that you did not answer, instead add them to the new questions section)

Q: What event propagation issues do we have now? What use cases do we see event propagation as a solution for? Are the current efforts for the Modern event platform poised to solve these? What other gaps do we have and do we have solutions for these? If not what do we need to do to find them?
A: Many. But 5 stood out:
  • Edge cache purging is opportunistic
  • Wikidata dispatch lag
  • Rerendering of millions of pages for template changes is laggy
  • Notification of 3rd parties for events is difficult

Other gaps are: wikidata to commons/wikimedia purging lag; lag in watchlist notifications; memcached purge event delivery is not guaranteed; rapid fires of new values in memcached when backends are down; lock acquisition is impossible when memcached backends are down; high granularity edits flood the watchlist; capturing of dependency chains is difficult; identification of scenarios where the impact of a change is a no-op can be tricky; bucketing of purges for multiple variants of a page (e.g. WD language variants) is impossible; batching of change events affecting the same target is tricky; WDQS updates from WD edits are laggy; category pages don’t always reflect article edits (lag); complex template logic may not be purged at times.

The modern event platform could help solve the edge caching issues.

Q: What is the allowed delay for dependency tracking / delay for different types of updates? How do we solve the issue of the number of events caused by invalidation of Wikidata items due to the inherent structure of Wikidata? Is there a way to reduce the number of events? If not, is there a way to scale to handle this number of events? If we integrate Wikidata and Wikibase more into other content wikis, what is the impact of this?
A: Aside from external clients, where we have to come up with an actual number, the allowed delay is more or less answered internally. It’s almost instant for watchlists/recentchanges and minutes/hours for indirectly touched pages.

We came up with a question about looking into a solution implementing a dependency graph in order to deduplicate and track/possibly reduce the number of events.
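
To make the deduplication idea concrete, here is a minimal Python sketch of coalescing events that hit the same target within a time window. The function name, the window length, and the item IDs are assumptions for illustration only; the notes do not specify how coalescing would actually be implemented. Note that each merged event is emitted only when its window closes, which is the delay that coalescing always costs.

```python
def coalesce(events, window_seconds):
    """Merge events for the same target that arrive within one window.

    `events` is an iterable of (timestamp, target) pairs sorted by timestamp.
    Returns sorted (emit_time, target, merged_count) tuples, where emit_time
    is the moment the window closes.
    """
    open_windows = {}  # target -> [window_start, merged_count]
    emitted = []
    for timestamp, target in events:
        window = open_windows.get(target)
        if window is not None and timestamp - window[0] < window_seconds:
            window[1] += 1  # same target, window still open: merge
            continue
        if window is not None:
            # The old window has expired; flush it before opening a new one.
            emitted.append((window[0] + window_seconds, target, window[1]))
        open_windows[target] = [timestamp, 1]
    for target, (start, count) in open_windows.items():
        emitted.append((start + window_seconds, target, count))
    return sorted(emitted)


# Hypothetical burst: four edits to one item, one edit to another.
edits = [(0, "Q64"), (2, "Q64"), (3, "Q90"), (4, "Q64"), (20, "Q64")]
merged = coalesce(edits, window_seconds=10)
```

In this example five incoming events collapse into three outgoing ones, at the price of each being delivered ten seconds after the first edit in its window.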

Q: Would deterministic parsing and closed templates impact the needs of dependency tracking?
A: No time.

Daniel's comment, added later: deterministic parsing and closed templates would allow rendered templates to be cached and re-rendered separately from the rest of the page. This would reduce the amount of wikitext that would have to be re-parsed for a given change, at the cost of tracking more artifacts.

This is one of many instances of the granularity trade-off: higher granularity means more targeted purging and thus less redundant re-generation of artifacts. This is paid for with more tracking metadata (that is, a larger dependency graph) and thus more reads and writes on that graph, and the need for more storage space. Wikidata has experimented with this trade-off as applied to the entity usage tracking. The need for purges has been greatly reduced, but the size of the tracking table is starting to get problematic.
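
The trade-off can be seen in a small Python sketch that tracks usage either per item or per (item, aspect) pair. The data, function names, and the notion of an "aspect" string are hypothetical simplifications; this is not the schema of Wikidata's actual entity usage tracking.

```python
# Hypothetical usage data: (page, item, aspect of the item that the page uses).
usages = [
    ("dewiki:Berlin", "Q64", "mayor"),
    ("dewiki:Berlin", "Q64", "population"),
    ("enwiki:Berlin", "Q64", "population"),
    ("enwiki:Germany", "Q64", "population"),
]


def build_index(usages, fine_grained):
    """Index pages by item only (coarse) or by (item, aspect) (fine)."""
    index = {}
    for page, item, aspect in usages:
        key = (item, aspect) if fine_grained else (item,)
        index.setdefault(key, set()).add(page)
    return index


def pages_to_purge(index, item, aspect, fine_grained):
    """Return the pages affected when `aspect` of `item` changes."""
    key = (item, aspect) if fine_grained else (item,)
    return sorted(index.get(key, set()))


def tracking_rows(index):
    """Count stored (key, page) rows, a stand-in for tracking table size."""
    return sum(len(pages) for pages in index.values())


coarse = build_index(usages, fine_grained=False)
fine = build_index(usages, fine_grained=True)
coarse_purge = pages_to_purge(coarse, "Q64", "mayor", fine_grained=False)
fine_purge = pages_to_purge(fine, "Q64", "mayor", fine_grained=True)
```

With coarse tracking a change to the mayor purges all three pages from three stored rows; with fine tracking it purges one page but needs four rows.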

Q: How does the product need to increase the use of content from other wikis impact the needs of event propagation?
A: This ties in to the question of the dependency graph solution.

Daniel's comment, added later: This particularly means that we need one dependency graph that spans all wikis in the cluster, or a federation mechanism for dependency graphs. Such a federation mechanism is complex and perhaps not worthwhile for on-cluster use, but may be needed to enable dependency tracking and change propagation for 3rd party re-users of our content.

Features and goals

For Use Case and Product Sessions:

Given your discussion of the topic and answering questions for this session, list one or more user stories or user facing features that we should strive to deliver

1. We need to find a way to have bucketing of purges of edge caches (X-Key)
Why should we do this?

In order to allow efficient purges of pages that can have many variants (e.g. wikidata language variants).

Bucket purges enable a separation of the edge cache (CDN) from the lower-level dependency tracking and change propagation mechanism: change propagation would know which "bucket" to purge from the edge cache, without the need to track all cached variants. Implementing this "bucketing" in the dependency tracking service would be possible, but it seems desirable to re-use existing mechanisms for this in the specialized edge cache. This is especially true since dependency tracking is geared towards a semi-persistent cache (like the parser cache) as opposed to a highly transient cache.
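
A minimal Python sketch of what bucketed purging means in practice: the cache keeps a secondary index from a bucket key to every cached URL tagged with it, so one purge message removes all variants. The class, method names, key format, and URLs are hypothetical; this does not reproduce the interface of any real edge cache.

```python
class KeyedCache:
    """Toy cache with a secondary index from bucket key to cached URLs."""

    def __init__(self):
        self.objects = {}  # url -> cached body
        self.by_key = {}   # bucket key -> set of urls tagged with it

    def store(self, url, body, keys):
        """Cache `body` under `url` and tag it with each key in `keys`."""
        self.objects[url] = body
        for key in keys:
            self.by_key.setdefault(key, set()).add(url)

    def purge_key(self, key):
        """Drop every cached object tagged with `key`; return how many."""
        urls = self.by_key.pop(key, set())
        for url in urls:
            # A URL may still be listed under other keys; that is harmless
            # because a later purge simply finds nothing left to remove.
            self.objects.pop(url, None)
        return len(urls)


# Hypothetical example: three language variants of one item page.
cache = KeyedCache()
for lang in ("en", "de", "fr"):
    cache.store(f"/wiki/Q64?uselang={lang}", f"<html lang={lang}>", ["item:Q64"])
cache.store("/wiki/Q90?uselang=en", "<html lang=en>", ["item:Q90"])
purged = cache.purge_key("item:Q64")
```

The sender of the purge only needs to know the bucket key, not the list of variants, which is the separation of concerns described above.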

What is blocking it?

Support for an equivalent of X-Key on the edge caches.

Who is responsible?

SRE Traffic team. Some support in MW core needed.

2. Edge caches need to have a hard 24 hour limit
Why is this important?

We kind of already do this for most cases, but there are edge cases where content stays cached for longer.

What is it blocking?

Reliable representation of pages

Who is responsible?

SRE Traffic team

3. Modern Event platform needs to be persistent and reliable
Why is this important?

Losing events would mean the edge caches would be in the same or worse situation they are today

What is it blocking?

Improvement of the delivery of edge caching purge events

Who is responsible?

Analytics

4. Modern Event platform must support ordered subscription groups
Why is this important?

With two-layered edge caching, it is important that the layers are purged in the correct order so that race conditions are avoided

What is it blocking?

Elimination of race conditions on the edge caching layer

Who is responsible?

Analytics
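
The ordered subscription group requirement in item 4 can be sketched in Python as follows. This is only an in-process illustration of the ordering guarantee; the function names, group names, and fake cache hosts are hypothetical, and nothing here reflects how the Modern Event Platform would actually provide it.

```python
def deliver_in_order(event, groups):
    """Deliver `event` group by group, finishing one group before the next.

    `groups` is a list of (group_name, subscribers) pairs in the required
    order; each subscriber is a callable that takes the event.
    """
    delivered = []
    for group_name, subscribers in groups:
        for subscriber in subscribers:
            # A real transport would wait for an acknowledgement here.
            subscriber(event)
            delivered.append(group_name)
    return delivered


purge_log = []


def make_cache(name):
    """Build a fake cache host that records the purges it receives."""
    def on_purge(url):
        purge_log.append((name, url))
    return on_purge


groups = [
    ("layer1", [make_cache("layer1-a"), make_cache("layer1-b")]),
    ("layer2", [make_cache("layer2-a")]),
]
order = deliver_in_order("/wiki/Berlin", groups)
```

Every host in the first group sees the purge before any host in the second group does, which is the property needed to avoid re-caching stale content between layers.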

Important decisions to make

What are the most important decisions that need to be made regarding this topic?
1. Should we build a dependency tracking service, and what technology should it be based on?
Why is this important?

If we want a unified dependency tracking and change propagation mechanism, we should start to build it soon, otherwise we will grow more code re-inventing that wheel, see below.

In order to decide if this is even feasible, we need to investigate storage technologies for large sparse graphs with many writes.

What is it blocking?

Implementation of the service, use of the service for new features. As long as we do not have it, every feature that needs some kind of caching has to implement its own tracking and purging mechanism.

Who is responsible?

Core Platform

Action items

What action items should be taken next for this topic? For any unanswered questions, be sure to include an action item to move the process forward.
1. Investigate current bottlenecks in events
Why is this important?

We currently throttle wikidata event dispatching in order to not overload the infrastructure.

What is it blocking?

Faster wikidata event processing

Who is responsible?

Daniel Kinzler

2. Investigate tech for representing a dependency graph
Why is this important?

Tracking all dependencies between the various pages is cumbersome

What is it blocking?

-

Who is responsible?

Core Platform Team

3. Draft a spec for the dependency graph service
Why is this important?

We should have a draft to check our use cases against, and to drive technology choice (see 2).

What is it blocking?

Investigation of the technology choice, resourcing of the implementation effort.

Who is responsible?

Core Platform Team

4. Investigate support for bucketing (X-Key or similar) in MediaWiki core.
Why is this important?

Bucketing CDN purges in MediaWiki core can act as a proving ground for two things we want:

  • bucketing support in the edge cache
  • data flows in MediaWiki core that preserve information about which artifact depends on which resource.
What is it blocking?

Support for caching of multi-variant output on the edge (non-English output for anonymous readers of Wikidata and Commons). Compare T152425 and T114662.

Who is responsible?

Daniel Kinzler (to ask Performance Team, SRE traffic team)

New Questions

What new questions did you uncover while discussing this topic?
1. Will the modern event platform that is currently under design survive the millions of events we are poised to send its way?
Why is this important?

If the new platform crumbles under the load, it would cause greater issues than the current platform.

What is it blocking?

-

Who is responsible?

Alex will contact analytics team

2. What is the acceptable delay for event delivery to external/3rd parties?
Why is this important?

External parties having installations mirroring our content and not receiving timely events end up being out of date

What is it blocking?

-

Who is responsible?

-

Detailed notes

Place detailed ongoing notes here. The secondary note-taker should focus on filling any [?] gaps the primary scribe misses, and writing the highlights into the structured sections above. This allows the topic-leader/facilitator to check on missing items/answers, and thus steer the discussion.

  • Goals: Purge the edge cache (the varnishes). The process/protocol used for this is opportunistic. There are cases where purging the cache involves a lot of race conditions; there are people who [make?] past versions of a page.
  • D: is it just about edge cache, or anything? Anything.
  • 10 minutes of writing down issues, including the producer and consumer
  • Examples of eventbus in production? Just the problems,
  • D: Are we asking, if we wanted to rebuild it, what are the problems we'd need to address? - No, just looking at current problems that exist: caching and purging. Don’t think about EventBus. Think lower-case “e” events.

Clustering happened of the issues.

Group dot votes on the issues they want to talk about

5 top issues from voting

  1. How do we notify external users of changes
  2. Wikidata change dispatching
  3. Some edits require millions of pages to be re-rendered
  4. Template changes don’t show up in articles until a purge or a long delay
  5. Content purges (send: all services, rec: all caches, problems: 1: unreliable transport, 2: race conditions, 3: scaling f/ multi-view)

Group 1 Discussion: (Greg, ...) - Purging issues

What is the nature of the issue

What obvious solutions exist

The importance of the issue (low/medium/high)

  • B: short summary of how they fit together
  • Edge caches cache on URI, we have to get purges from cacheable content services, using transport between the services to the caches
  • Uses UDP, efficient but lossy
  • Services, emit purge events, of a URI
  • Problems:
    • unreliable UDP: pubsubhub, deals with downtime, easy to solve conceptually
    • Race condition on purging: there are two layers of caches; if a purge goes to them in random order it can cause stale content to be re-cached. The requirement this imposes is on the transport mechanism, e.g. layer 1 must consume it before layer 2
      • Should the caches themselves transmit the purge events
      • There are many machines in different DCs
    • Scaling: for one article there might be a lot of views (mobile, desktop, page preview); when we invalidate a page there are actually like 15 things that need to be invalidated, which means the scale could be on the order of magnitude of a million destinations due to template purges
      • A key associated with the various views/etc
      • “X-Key” was a potential solution, maybe doesn’t scale to a million articles from a template
      • We probably shouldn’t have a million events in any of the transports
      • Some sort of alternate indexing system like X-Key
  • The other issue: “Some edits require millions of pages to be re-rendered”
  • Cascading updates
  • T: If we were to have X-key or similar, do we actually want everything that uses the infobox to be purged at the same time?
  • B: probably no
  • T: when you edit infobox, we don’t send millions of jobs at once, jobqueue does the purges in chunks of 100 or so
  • B: we wouldn’t want to do this with an X-key
  • We’re trying to reduce our TTL, 4 weeks is way too long, even less than 24 hours is enough, the vast majority is within 6 hours
  • The 24 hours is more operationally important with turning off some cache centers
  • LRU type policy? Yes, and when moved to ATS it’ll be LF(requently)U
  • If the purge will take more than 24 hours to complete, why not just let the cache policy do it for you?
  • Not just HTML, but link-tables in the DB, we don’t want that to be tied to the caching purge system
  • We’ll need to jobqueue the millions of re-parses
  • S: What’s an acceptable stale content window?
  • B: that’s a deep dive….
  • T: If I edit a template, how long is reasonable for a page that uses it to not update?
  • T: from a product perspective, if a user edits an article it’s pretty important; the user who did it is most important
  • J: prioritize reverts
  • B: might not have to
  • T: template edits requiring the million
  • A: wikidata is the content use case
  • B: another wikidata snowflake issue
  • T: infobox edit will take 3 days to re-parse, but 24 hours for the cache layer to get rid of it
  • Rough consensus :)
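
The discussion above mentions that the job queue handles template-triggered purges in chunks of 100 or so rather than sending millions of jobs at once. A minimal Python sketch of that batching pattern follows; the helper name, page titles, and the exact chunk size are illustrative assumptions, not the JobQueue implementation.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical list of pages that transclude an edited template.
pages = [f"Page_{i}" for i in range(250)]
jobs = list(chunked(pages, 100))
batch_sizes = [len(job) for job in jobs]
```

Because `chunked` is a generator, the full list of affected pages never needs to be turned into jobs all at once; batches can be enqueued as capacity allows.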

Group 2 discussion: (Nick, lydia, daniel, markb, erikB, ramsey, corey, Alexia, Karsten) - Wikidata issues

  • Wikidata change dispatching
  • Some edits require millions of pages to be re-rendered

Notes:

  • When someone makes a change, we want to notify the rest of the world about it. Specifically WP, but also all WM, and in the future the rest of the world who uses our items and properties to describe things in their own software. Internal dispatching is kinda working, but not perfect. Problems: When someone changes an item that is used in a lot of articles, like the "label for imdb" property as used in authority control, that means purging a huge amount of pages. We currently limit this to a [?1000 articles?].
  • Er: the limit is not the event system?
  • D: what's capped is recent changes/watchlist system.
  • L: it takes time to make these changes show up in RC/Watchlist, and if it takes too long, they drop off. If it takes 5 minutes to show up, then it's below the fold.
  • We try to reduce redundant purges by having highly granular tracking, not only which items are used on which pages, but also … Decreases by 2 orders of magnitude. You can optimize for purging but that means more churn of tables.
  • Also we don't want to re-render all the time, or flood RC with irrelevant stuff (which was a problem in the past and was turned off).
  • What was irrelevant?
  • E.g. mayor of berlin is used in Dewiki, but only population of Berlin is used in Enwiki, previously if someone changes the mayor then it would show up everywhere.
  • We have one mechanism for purging and for RC. they have different latency requirements and scaling issues.
  • L: 2nd issue: we want more and more people to use items and props in their own software, and we need to tell them if values change.
  • Ramsey: federation?
  • Tim: complete copies?
  • L: commons
  • T: So they only want to know about changes they're subscribed to
  • D: question is transport - if you have to notify a thousand external sites, and then 3 seconds later about the next edit ….
  • T: digests?
  • L: instant notification probably not as important externally as it is internally
  • D: Requirements are: RC - seconds, re-rendering articles - minutes, external users - hours
  • Latency not the same as importance
  • Erik: what is source of latency today?
  • D: application surface doing rendering of stuff
  • E: not enough jobrunners?
  • D: yes-ish
  • T: can't have a burst of 100,000 edits at once,
  • E: bursty-happy?
  • D: going to have bursts of something used 5million times, which fills queue
  • M: re: filling RC..
  • D: we want to report the event with correct timestamp, so kinda need 2 timestamps… how do we make that clear to the user.
  • T: Why do you need to know the original timestamp?
  • C: is it more important to see it, or to show it chronologically
  • E: Stas has had problems with this. Wikidata query is delayed by 10 mins.
  • D: tangent… that makes it more important… edit a page, pushed to RC, RC reacts, API asks for external links on page, API gets old results.
  • M: need to make it faster.
  • D: yes, reduces the problem. But want a guarantee of causality
  • C: order events by internal to outside. Should never give the old view.  Make sure things are updated before re-rendering
  • D: do we not publish the event, i.e. delay publication of the RC event until internal events have all happened? This would be asynchronous, which is a change.
  • C: Islands of synchronicity - finish this before doing that
  • D: still have user waiting. If we say don't push until everything is ready, then
  • D: that reduces the problems but doesn't eliminate it.
  • D: still get into api [race conditions?]
  • M: What are the bottlenecks exactly
  • D: no way for external user to get a consistent view. Poll for latest, and you might be getting old things.  Could have a placeholder in RC saying "something will be added here" - You could poll RC for "give me quickest" or "give me done"
  • M: get efficient data structure of all dependencies - and set a bit for that, so that you know something is pending.
  • C: is this like graphDB
  • D: Do you know a graphDB that meets our scaling and performance for this? We're talking 100 billion edges, and [?] thousand changes a second.
  • C: is that graph problem validated?
  • E: google must have bigger graphs than that, but unsure if it exists in the FOSS world
  • ACTION: Investigate graphDB options.
  • ACTION: Investigate current bottlenecks.
  • We need to build the actual graph and mechanism to feed it.
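
The ideas floated above (two timestamps per change, a pending marker, and polling RC for "give me quickest" versus "give me done") can be sketched in Python as below. The class, field names, revision IDs, and mode strings are hypothetical; the discussion did not settle on any concrete interface.

```python
class ChangeFeed:
    """Toy recent-changes feed whose entries carry two timestamps."""

    def __init__(self):
        self.entries = []  # dicts in edit order

    def record_edit(self, rev_id, edited_at):
        """Add an entry immediately; it stays pending until settled."""
        self.entries.append(
            {"rev_id": rev_id, "edited_at": edited_at, "settled_at": None}
        )

    def mark_settled(self, rev_id, settled_at):
        """Note that all dependent updates for `rev_id` have finished."""
        for entry in self.entries:
            if entry["rev_id"] == rev_id:
                entry["settled_at"] = settled_at

    def poll(self, mode):
        """Return revision IDs: every entry, or only the settled ones."""
        if mode == "quickest":
            return [e["rev_id"] for e in self.entries]
        if mode == "done":
            return [e["rev_id"] for e in self.entries
                    if e["settled_at"] is not None]
        raise ValueError(f"unknown mode: {mode}")


# Hypothetical timeline: two edits, only the first has fully propagated.
feed = ChangeFeed()
feed.record_edit(101, edited_at=0)
feed.record_edit(102, edited_at=5)
feed.mark_settled(101, settled_at=60)
quickest = feed.poll("quickest")
done = feed.poll("done")
```

A client that needs a consistent view would poll in "done" mode and accept the extra latency, while a watchlist-style client could use "quickest" and treat unsettled entries as placeholders.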

Bucketing of changes is important to wikidata people

Goal:

  • Be able to purge multiple urls in varnish

Action:

  • investigate why we have bottlenecks
  • Investigate ways to represent the dependency graph (in Daniel's head)
  • Needs help answering questions about running DBs at scale.

Questions:

  • Will modern event platform survive number of events?
  • What's acceptable latency for external change propagation (aka: when we notify external subscribers of changes)
    • BB: Are we talking browser caches?
    • D: we're talking federated wikis. - Places using wikidata items/props.
      • BB: Once its async…
      • D: async doesn't mean no bounds, what are those bounds.
      • Product question! What do users require

Decisions:

  • Event Transport: Must be persistent and reliable.
  • Event Transport: Must support ordered subscription groups
    • BB: Stream of purge events with pubs and subs, two different groups of subs (A+B) so that we can say, deliver to group A before group B
    • D: It’s one possible way to solve it, but is it a requirement? Have A send along to B?
    • Except A and B are different clusters of machines
    • Possible if we go looking for a transport that gives these things and find there is none.
  • Caches (edge) - hard 24h limit
    • Hit rate? It varies depending on your perspective, from 90-98%