Wikimedia Developer Summit/2017/Asynchronous processing

Session recording

Chronology

Giuseppe: Examples of async processing are the job queue, cron scripts (Wikidata), the changeprop service (https://www.mediawiki.org/wiki/Change_propagation), and HTCP purges (via job queue, changeprop). I don't think we need all of these mechanisms. Can we consolidate & avoid accumulating technical debt with duplicate implementations?

Issues with the current jobqueue:

Scalability: relies on Redis, which has less than perfect scalability, durability, and multi-DC replication.
Hard to debug.
Messages are serialized PHP; does not play well with services.

Need to address issues & consolidate mechanisms. Requirements for a replacement:

- Must still allow pure-PHP MediaWiki installations (shared hosting)

Job queue model

Job based, not publish / subscribe event based
Retry logic
Throttling / backoff
Deduplication (based on root jobs)
Progressive dependency expansion (ex: backlinks)
Delayed execution
At least once execution
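
The job-based model listed above can be illustrated with a toy in-memory queue. This is a minimal sketch, not MediaWiki's JobQueue API: the class `ToyJobQueue`, its parameters, and the plain `dedup_key` (a simplification of root-job deduplication) are all hypothetical. It shows delayed execution, deduplication of pending jobs, and retry with exponential backoff. A job is only dropped once its handler returns without raising or the attempt limit is reached, so handlers may run more than once and should be idempotent.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class _Entry:
    # Entries are ordered by ready time, then by insertion sequence.
    ready_at: float
    seq: int
    job: dict = field(compare=False)
    attempts: int = field(default=0, compare=False)


class ToyJobQueue:
    """In-memory illustration only (hypothetical, not a MediaWiki class)."""

    def __init__(self, max_attempts=3, base_backoff=2.0):
        self._heap = []
        self._pending_keys = set()
        self._seq = itertools.count()
        self.max_attempts = max_attempts
        self.base_backoff = base_backoff

    def push(self, payload, now, delay=0.0, dedup_key=None):
        """Enqueue a job; return False if a duplicate is already pending."""
        if dedup_key is not None:
            if dedup_key in self._pending_keys:
                return False
            self._pending_keys.add(dedup_key)
        job = {"payload": payload, "dedup_key": dedup_key}
        heapq.heappush(self._heap, _Entry(now + delay, next(self._seq), job))
        return True

    def run_ready(self, handler, now):
        """Run every job that is due; return payloads that succeeded."""
        done = []
        while self._heap and self._heap[0].ready_at <= now:
            entry = heapq.heappop(self._heap)
            entry.attempts += 1
            try:
                handler(entry.job["payload"])
            except Exception:
                if entry.attempts < self.max_attempts:
                    # Exponential backoff: base, 2 * base, 4 * base, ...
                    backoff = self.base_backoff * 2 ** (entry.attempts - 1)
                    entry.ready_at = now + backoff
                    entry.seq = next(self._seq)
                    heapq.heappush(self._heap, entry)
                    continue
                # Attempt limit reached: fall through and drop the job.
            else:
                done.append(entry.job["payload"])
            self._pending_keys.discard(entry.job["dedup_key"])
        return done
```

The caller passes the current time explicitly (`now`), which keeps the sketch deterministic; a real runner would read a clock and persist the queue outside the process.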

ChangeProp model

Event / publish-subscribe based (EventBus / Kafka)
- Well-defined, schema-verified JSON events.
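
To make "schema-verified" concrete, here is a minimal sketch of rejecting malformed events before they are accepted. It is a hand-rolled stand-in for real JSON Schema validation, and the field names in `EVENT_SCHEMA` are hypothetical, not EventBus's actual schema.

```python
import json

# Hypothetical, simplified schema: field name -> required Python type.
EVENT_SCHEMA = {"topic": str, "id": str, "dt": str, "page_title": str}


def validate_event(raw, schema=EVENT_SCHEMA):
    """Parse a JSON event and return it, or raise ValueError listing problems."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(event, dict):
        raise ValueError("event must be a JSON object")
    problems = []
    for name, expected in schema.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        elif not isinstance(event[name], expected):
            problems.append(f"wrong type for {name}")
    if problems:
        raise ValueError("; ".join(problems))
    return event
```

Unlike serialized PHP job objects, an event that passes a check like this can be consumed by any service that can read JSON.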

Which features is ChangeProp missing to be a viable job queue replacement?

Producer-controlled delayed execution & deduplication
JSON schemas enforce a lot more discipline (different from job queue)

Option 1: Full transition to EventBus / ChangeProp

Issue: Need to write & maintain pure-PHP fall-back solution

Deprecate PHP job queue
Convert current extensions to use new system
CP would call the PHP API to execute MW jobs

Is it worth it? Not sure it's viable. Not in short term, anyway. Need solution in the mid to long term.

Option 2: Use EventBus as a backend of the jobqueue

Use EventBus as a job queue transport (replacing Redis)
Use ChangeProp to just relay such jobs to a specific endpoint
Gradually convert some specific (high volume) jobs to native CP
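
The first two steps of Option 2 can be sketched as follows: the producer wraps a job specification in a JSON event instead of serialized PHP, and a relay hands each event to a job-running endpoint. This is an illustration under assumptions, not the EventBus or ChangeProp implementation: the envelope fields, the topic prefix, the endpoint URL, and the injected `send` callable are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def job_to_event(job_type, params, topic_prefix="mediawiki.job."):
    """Wrap a job specification in a JSON-serializable event envelope."""
    return {
        "meta": {
            "topic": topic_prefix + job_type,  # hypothetical naming scheme
            "id": str(uuid.uuid4()),
            "dt": datetime.now(timezone.utc).isoformat(),
        },
        "type": job_type,
        "params": params,
    }


def relay(raw_event, send, endpoint="https://wiki.example/job-runner"):
    """Forward one serialized job event to the endpoint that executes it.

    `send(url, body)` is injected by the caller (for example an HTTP POST),
    so the relay itself knows nothing about what the job does.
    """
    event = json.loads(raw_event)
    body = json.dumps({"type": event["type"], "params": event["params"]})
    return send(endpoint, body)
```

Because the relay only forwards, individual high-volume job types could later be replaced by native ChangeProp rules without touching the transport.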

Advantages

Only one system to maintain
Relatively low effort

Option 3: Write a Kafka backend for the job queue

Relatively simple
Doesn't require a lot of changes in MediaWiki

Option 4: Use third party solution

Gearman, Chronos, Nomad, ..

Attraction: No need to maintain the software ourselves

Issues:

Not tailored to our scale / needs
No shared hosting support

Discussion

Giuseppe: Personal view: Best option is 2 (start by encapsulating job queue)
Chad: Please let's not use Gearman. Let's make sure whatever we use works for third party users.
Adam: Start with improving PHP implementation / micro refactors?
Andrew B: Why are transports and message types coupled?
Marko: Currently, the two are coupled, and messages are private.
Andrew B: Seems that first step would be to decouple the two.
Jaime: Might not be as much a problem of the job queue implementation, but of the jobs themselves (e.g. recursive jobs)
Giuseppe: A more instrumentable, commonly understood tool will allow better fixing of broken jobs
Gabriel: Seems some fixes are short term, some are more long term. I agree that Option 2 is the best to start off, and it would still allow the use of Option 1 later on. We might e.g. ditch the requirement of shared hosting support by 2020.
Giuseppe: Agreed. We can start with Option 2, and then do Option 1 later.
Mark: Option 2 is kind of the "acceptable debt" option.
Jaime: Are the cron jobs a bit like periodic / scheduled actions?
Giuseppe: Yes.
Marius (Hoo): Wikidata has several cron jobs like this.
Adam: The ORES changeprop config looks very verbose; can we clean this up?
Marko: Different rules are applied per wiki, not easy to clean up ATM
Giuseppe: The queue solution needs to be independent of the behaviour of the consumers (from an ops perspective).
Adam: Could have some dynamic registration mechanism (like extensions.json).
Petr: Considered dynamic registration early on in ChangeProp, but realized that this would be dangerous. Current config files are easier to reason about & have a code review process.
Giuseppe: Thought experiment: A bug in MediaWiki causes 1000s of events per edit. If a Kafka backend were in use, those events could simply be dropped there.
Aaron: Prefers option 2. Leaves lots of options open longer term.
Giuseppe: Is there anything we missed in requirements / issues?
Aaron: Prioritization / separate queues could be an issue. Within MW jobs as well. What is the multi-DC story for ChangeProp?
Giuseppe: Mostly. Kafka is set up with MirrorMaker, and there are separate prefixed topics per producer DC. ChangeProp in each DC can be configured to consume from one or both DCs. Target endpoints can be switched as well, so CP in DC1 can send jobs to DC2.
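
The per-DC setup described here can be sketched as a small configuration object. The names (`ChangePropSite`, `dc1`, `dc2`) and the `<dc>.<topic>` prefix format are assumptions for illustration, not ChangeProp's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangePropSite:
    """Hypothetical per-DC consumer settings."""
    local_dc: str
    consume_from: tuple  # producer DCs whose prefixed topics are consumed
    target_dc: str       # DC whose endpoint receives the resulting jobs

    def topics(self, base_topic):
        # One prefixed topic per producer DC, e.g. "dc1.<base_topic>".
        return [f"{dc}.{base_topic}" for dc in self.consume_from]

    def endpoint(self, endpoints):
        # Look up the job endpoint of the configured target DC.
        return endpoints[self.target_dc]


# CP in dc1 consuming events produced in both DCs, sending jobs to dc2.
site = ChangePropSite(local_dc="dc1", consume_from=("dc1", "dc2"), target_dc="dc2")
```

Switching over then means changing `consume_from` or `target_dc` rather than moving data.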
Gabriel: The DC switch-over will become more precise with Kafka 0.10's timestamp-based indexing. Currently, we'll have some duplicate processing after switch-over.
Aaron: That's definitely a lot more reliable. In terms of resourcing, who is going to do the work?
Giuseppe: Mark?
Mark: Could be a good cross-team goal in technology.