Wikimedia Services/Revision storage for HTML and structured data: Use cases

We are considering the costs and benefits of:

offering predictable performance for access to HTML and metadata of old revisions, and
archiving citable HTML renders of articles as they looked at the time.

To this end, we are collecting relevant use cases on this page. Please be bold, and add to or tweak them as needed.

Use cases fall into two categories. In the first, unavailability of stored (previously rendered) HTML is merely a performance / latency issue and does not affect the functionality / feature. In the second, re-rendering the page in the future affects functionality, because the re-rendered output will not be identical to the HTML that was generated when the revision was created. It is useful to identify which category each use case belongs to.

Use cases

Viewing old revisions with predictable performance

Needs: Fast access to old revisions

Both external sites & wiki content occasionally link to specific revisions of an article. Performance for those accesses should be reasonable and predictable.

Fast access to old revisions would also enable functionality that searches the page history. Such functionality may or may not be Elasticsearch-based. Speed is an important factor here, since searching a full history is likely to be an extremely expensive operation.

FIXME: Whether this is a performance issue or a functionality issue depends more specifically on why the old revision was hot-linked to. So, I think this use case should be merged with more specific use cases where the actual reason for hotlinking is articulated.

To clarify, is the "functional difference" use case you have in mind essentially the "Citing Wikipedia content" case below? -- Gabriel

(Also: revision tagging)

Visual diffing

Needs: Fast access to Parsoid HTML of old revisions

The Editing department is working towards a visual diffing service, with a view towards it eventually becoming the default change review experience (in recent changes, histories, …). This would provide a more convenient review format for users, and allow users without wikitext knowledge to review edits.

Most diffs are expected to be against relatively recent revisions, which will mostly already be available on demand, but it would at least be a bonus if performance did not degrade significantly when flipping through older diffs.

Beyond the performance issue, not having stored HTML for older revisions could also affect a diff in some scenarios where wikitext semantics / syntax have evolved, or where HTML versions have been updated.

Reading-focused analysis of diffs

Needs: Long-term storage of specific renders, "as they looked at the time".

Needs: Fast access to "old" revisions ("old" meaning not the most recent, but the last X revisions)

This is a sub-case of diffing that focuses on showing readers how a page has changed. The interface would likely not be a traditional diff UI as used by editors, but would instead process revisions in some way to present relevant changes to readers. To this end, other technologies such as machine learning may be used to analyze the revisions and present the results to the user as a subset or a summary. This type of analysis would likely need full page content for several recent revisions as inputs.

Citing Wikipedia content / permalinks

Needs: Long-term storage of specific renders, "as they looked at the time".

Template and software changes make it difficult to reliably cite a specific revision of a Wikipedia article. MediaWiki always uses the latest version of any transcluded content. Facts in infoboxes can disappear when the template is edited, and news items or featured content on the main page are replaced every day.

Stored HTML versions are required for functionality.

Highlighting and sharing of text within Wikipedia articles

Needs: Long-term storage of specific renders, "as they looked at the time".

Needs: Fast access to old revisions

Highlighting specific content in an article is a feature frequently requested by readers, but it is difficult to implement due to the inherent complexity of tracking a highlight through revisions as the article changes. Supporting this use case requires two features:

Show the original highlight to the user (on the originally highlighted revision)
Transfer the highlight to newer revisions as the article changes

The first is satisfied by having past article renders available. The second would be solved by analyzing diffs and migrating the highlight to the new revision, so fast access to old revisions would be needed.
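As a rough illustration of the second feature, the sketch below migrates a highlight from one revision to the next using Python's standard difflib module. It is only a minimal sketch under simplifying assumptions: revisions are treated as plain text rather than HTML, a highlight is a pair of character offsets, and the function name migrate_highlight is hypothetical. It is not a proposed design.

```python
from difflib import SequenceMatcher
from typing import Optional, Tuple


def migrate_highlight(
    old_text: str, new_text: str, start: int, end: int
) -> Optional[Tuple[int, int]]:
    """Map the highlight old_text[start:end] onto new_text.

    Returns (start, end) offsets into new_text, or None if the
    highlighted span was modified or removed between the revisions.
    Hypothetical helper for illustration only.
    """
    matcher = SequenceMatcher(None, old_text, new_text, autojunk=False)
    for block in matcher.get_matching_blocks():
        # A block says old_text[a:a+size] == new_text[b:b+size].
        if block.a <= start and end <= block.a + block.size:
            shift = block.b - block.a
            return start + shift, end + shift
    # The highlight does not lie inside a single unchanged block.
    return None


old = "The cat sat on the mat."
new = "Yesterday, the cat sat on the mat."
start = old.index("cat sat")
span = migrate_highlight(old, new, start, start + len("cat sat"))
if span is not None:
    print(new[span[0]:span[1]])
```

Applied across many revisions, a migration like this needs the text of each intermediate revision, which is where fast access to old revisions comes in. A real implementation would also have to decide what to do when the highlighted text itself was edited; this sketch simply gives up and returns None.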

Research

Needs: Reasonably fast and high-volume access to old revisions

Research / analytics use cases frequently have a need to extract information from a large number of revisions of an article. Examples include machine learning like ORES, as well as projects aimed at establishing the trustworthiness of specific parts of an article.

Currently, many of these projects use custom wikitext parsers, which presents a high barrier to entry. More accessible HTML and structured data formats would lower that barrier, resulting in more contributions, especially from outside researchers and tool writers.

HTML dumps

Needs: Reasonably fast and high-volume access to old revisions. Incremental updates need fast random revision access.

Archiving all versions of Wikipedia in perpetuity, in a widely supported format, helps ensure that the knowledge remains available if something should happen to the Wikimedia Foundation.