Parsoid/Minimal performance strategy for July release

See also: Parsoid/Ops needs for July 2013

Peak edit rates are around 50 requests/second across all Wikipedias. Re-parses triggered by template edits can reach single-digit millions per day, which averages out to roughly another 50 requests per second. In July we need to sustain rates close to that, as VisualEditor is scheduled to become the default editor on all Wikipedias. Parsoid itself can be scaled up with more machines. We do however use the MediaWiki API to expand templates and extensions and to retrieve information about images. This can mean hundreds of API requests for large pages, which would overload the API cluster.

We have a long-term performance strategy, as outlined in our roadmap, that will also address the API overload problem. We might, however, not be able to implement enough of it before July, so a minimal backup strategy to avoid overloading the API cluster is needed.

Leverage cached parse results to avoid API overload on edit

We have a Varnish cache in front of the Parsoid cluster which caches the parse result for a given revision (see the Parsoid page on wikitech). We can use this cached parse result to speed up subsequent parses. The main things we are interested in for avoiding API requests (template / extension expansions and image dimensions / paths) are available in the previous version and are marked up in a way that makes them relatively easy to extract and reuse.
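
As a rough illustration of that markup (PHP here only for consistency with the hook code below; the actual extraction would happen inside Parsoid itself), the template expansion data could be pulled out of a cached document along these lines. The typeof and data-mw attribute names follow the Parsoid HTML format; treat the exact selectors as assumptions to be checked against the current spec.

// Sketch: collect template expansion data from the cached HTML of the previous
// revision. $html is assumed to hold that cached HTML.
$doc = new DOMDocument();
@$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );
$expansions = array();
foreach ( $xpath->query( '//*[contains(@typeof, "mw:Transclusion")]' ) as $node ) {
	$dataMw = $node->getAttribute( 'data-mw' );
	if ( $dataMw !== '' ) {
		// data-mw carries the template target and parameters as JSON
		$expansions[] = json_decode( $dataMw, true );
	}
}
// Image dimensions / paths and extension output are marked up similarly and can
// be collected with queries on the corresponding typeof values.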

On edit

  • Retrieve the previous version's HTML DOM from the cache (using the oldid GET parameter)
  • Extract template, extension and image data from it and pre-populate internal caches with it
  • Parse the new page, which will trigger API requests only for changed template transclusions / extensions / images
  • Purge the old version from the cache (a sketch of the retrieval and purge requests follows below)
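
A sketch of the HTTP side of this flow, assuming a hypothetical Varnish front-end URL layout of http://parsoid.example/enwiki/<title>?oldid=<revision>. Parsoid itself issues these requests; the PHP form below only illustrates their shape, and $title / $previousRevId are placeholders.

// Fetch the previous revision's HTML DOM from the Varnish cache
$base = 'http://parsoid.example/enwiki/' . wfUrlencode( $title->getPrefixedDBkey() );
$req = MWHttpRequest::factory( $base . '?oldid=' . $previousRevId );
$status = $req->execute();
if ( $status->isOK() ) {
	$previousHtml = $req->getContent();
	// ... extract template / extension / image data as sketched above ...
}
// Once the new parse is cached, purge the old revision's entry
SquidUpdate::purge( array( $base . '?oldid=' . $previousRevId ) );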

On HTMLCacheUpdate job after template / image edit

Templates and images in particular can be modified, so we'll have to make sure our cached expansions do not get too stale. A simple and promising option is to piggyback on the HTMLCacheUpdate job with a hook. The hook action can then either purge + re-request, or implicitly refresh the Varnish copy.

  • Request the new version with the 'Cache-Control: no-cache' header set. With the proper configuration, this will cause Varnish to go to origin. The Cache-Control header will be forwarded (TODO: verify!), which Parsoid can use as an indication to fully expand all templates from scratch. Varnish will update its cached copy implicitly when configured to do so.
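
A minimal sketch of that re-request, assuming a hypothetical $wgParsoidBaseUrl setting, a $title placeholder and the URL layout used above; MWHttpRequest allows the header to be set explicitly:

// Re-request the page with 'Cache-Control: no-cache' so Varnish goes to origin
// and, if configured to do so, replaces its cached copy.
$url = $wgParsoidBaseUrl . '/' . wfUrlencode( $title->getPrefixedDBkey() )
	. '?oldid=' . $title->getLatestRevID();
$req = MWHttpRequest::factory( $url, array( 'method' => 'GET', 'timeout' => 120 ) );
$req->setHeader( 'Cache-Control', 'no-cache' );
$req->execute(); // ideally done from a deferred job rather than inline in the hook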

For all of this to work, the current cache-busting page_touched parameter needs to be removed from the GET URL.

Cache invalidation hooks

We'll need a chunked and deferred job similar to HTMLCacheUpdate (TODO: can we subclass and override invalidateTables?). Given a title and table name, get all titles to purge and do so. The revision ID (for the oldid GET parameter) can be accessed via Title::getLatestRevID() and WikiPage::getLatest().
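
A rough sketch of such a job, assuming a hypothetical ParsoidCacheUpdateJob class (to be registered in $wgJobClasses) and the $wgParsoidBaseUrl setting from above; the chunked fan-out over templatelinks / imagelinks is omitted here and would follow the HTMLCacheUpdate pattern.

class ParsoidCacheUpdateJob extends Job {
	public function __construct( $title, $params ) {
		parent::__construct( 'parsoidCacheUpdate', $title, $params );
	}

	public function run() {
		global $wgParsoidBaseUrl;
		// Refresh the Varnish entry for this title's latest revision
		$url = $wgParsoidBaseUrl . '/' . wfUrlencode( $this->title->getPrefixedDBkey() )
			. '?oldid=' . $this->title->getLatestRevID();
		$req = MWHttpRequest::factory( $url, array( 'method' => 'GET' ) );
		$req->setHeader( 'Cache-Control', 'no-cache' );
		$req->execute();
		return true;
	}
}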

Main edit hook. Used to schedule updates to the page itself and templatelinks updates.

public static function onArticleEditUpdates( &$article, &$editInfo, $changed ) { ... }
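
One possible body for this hook, assuming the ParsoidCacheUpdateJob sketched above; whether the templatelinks fan-out belongs here or in a separate job is still open:

public static function onArticleEditUpdates( &$article, &$editInfo, $changed ) {
	// Sketch only: queue a Parsoid cache update for the edited page. The
	// 'table' parameter is a placeholder for the templatelinks fan-out the
	// chunked job would handle.
	JobQueueGroup::singleton()->push(
		new ParsoidCacheUpdateJob( $article->getTitle(), array( 'table' => 'templatelinks' ) )
	);
	return true;
}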

Similar to the above. TODO: Are template links etc. gone at this point?

public static function onArticleDeleteComplete( &$article, User &$user, $reason, $id ) { ... }

SpecialUndelete. Links are updated before this hook is run.

public static function onArticleUndelete( $title, $create ) { ... }
public static function onArticleRevisionVisibilitySet( &$title ) { ... }

Basically purge both old and new titles.

public static function onTitleMoveComplete( Title &$title, Title &$newtitle, User &$user, $oldid, $newid ) { ... }

Purge pages in imagelinks. Won't work for Commons IIRC.

public static function onFileUpload( $file ) { ... }
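
The handlers would be registered in the extension setup file in the usual way; the ParsoidHooks class name is a placeholder:

$wgHooks['ArticleEditUpdates'][] = 'ParsoidHooks::onArticleEditUpdates';
$wgHooks['ArticleDeleteComplete'][] = 'ParsoidHooks::onArticleDeleteComplete';
$wgHooks['ArticleUndelete'][] = 'ParsoidHooks::onArticleUndelete';
$wgHooks['ArticleRevisionVisibilitySet'][] = 'ParsoidHooks::onArticleRevisionVisibilitySet';
$wgHooks['TitleMoveComplete'][] = 'ParsoidHooks::onTitleMoveComplete';
$wgHooks['FileUpload'][] = 'ParsoidHooks::onFileUpload';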