User:EProdromou (WMF)/The hard way to check for reverted edits

Suppose you want to differentiate between revisions of a page that contribute to the final content of the page, and revisions that don't. How could you do this?

Consider a page with a simplified history like the following. Revisions are in forward chronological order (oldest first).


ID	Text
1	charity
2	premiere
3	charity
4	throwback
5	snuggle
6	perish
7	throwback
8	motocross
9	snuggle
10	motocross

(Most of our wiki articles are much larger than a single word, but keeping it short in the example helps with my diagrams.)

We can think of each text value of a revision as a state, and the ID of the revision as an edge that connects the previous state to the current state. We could then draw a graph that shows how the page evolves over time.

In this graph, you can see a direct path along the left hand side from the empty state (before the page existed) to the most recent state ("motocross"). Off to the right are some "loops" that the page went through; sometimes just one state, and sometimes multiple states before it gets back on the "main path".

We can then use a simple graph-walking algorithm to determine the unreverted states:

Start with the last state ("motocross") and add it to our list of unreverted states
Find the earliest incoming edge (8)
Follow that edge to the previous state ('throwback") and add it to our list of unreverted states
Find the earliest incoming edge (4)
Follow that edge to the previous state ("charity") and add it to our list of unreverted states
Find the earliest incoming edge (1)
Follow that edge to the previous state (null) and add it to our list of unreverted states
Stop, because we're at the beginning null state.

So, our unreverted states are "charity", "throwback", and "motocross". Our reverted states are all the other states: "premiere", "snuggle", "perish". Finally, our reverted edges are any edges that are incoming to reverted states: 2, 5, 6, and 9.

One thing about this algorithm is that there's not a direct 1:1 relationship between "reverting revision" and "reverted revision". Revisions 5 and 6 are both reverted by revision 7, for example.

Making it efficient

The difficult part about this algorithm is that it requires loading the whole history of a page into memory, organizing it as a graph, walking the graph, and then determining the reverted states and edges.

A couple of things can be done to make this a little bit efficient.

Use a hash of the text for the state rather than the full text.
Cache the graph.
Cache the reverted edges lists.
When a new edge is added (on edit), update the graph, and flush the reverted edges lists.
When an edge is deleted or undeleted, flush the graph and reverted edges list.

A new edit can radically change the graph; for example, a new edit (11, "perish") would make the reverted edges 2 and 8.