User:SSastry (WMF)/Notes

Very basic dump of notes ... work in progress

Wikitext processing model


Wikitext is primarily a string-based macro-preprocessing language that has accumulated a wide variety of features over time. However, its core nature as a language based on processing macros remains. For the most part, this doesn't matter, and it is possible to treat normal wikitext constructs like lists, headings, bold/italics, and tables as constructs that return a well-formed HTML string, except when it is not possible to do so. For example, errors in wikitext (unclosed tags, overlapping instead of properly-nested tags, mismatched quotes, unclosed quotes, etc.) are one area where this model breaks down, since the output then depends on the details of how, and the order in which, wikitext is parsed.

For example, <ref>foo{{echo|</ref>}} yields "meaningful output" only if transclusions are processed before extension tags. Similarly, what should the output of <div>foo\n==foo==\nbar</div> be? Is it a section that is wrapped in a div-tag? Should it be?

In general, errors in wikitext can have effects elsewhere on the page and are not necessarily restricted to the paragraph, list, table, or section where they occur. Ex: leave a <ref> tag unclosed and see what happens.

The other area where the above DOM-based processing model breaks down is where transclusions enter the picture. For example, the following is perfectly valid wikitext to generate a red-bordered div: <{{echo|div}} style='border:1px solid{{echo| red}}'>foo</div> However, this also means that until all transclusions are fully expanded to plain top-level wikitext, it is not clear what the final wikitext is going to generate. It is hard to reason about any individual transclusion and how it is going to affect the rest of the page. In practice, this is not always a problem, but the limitations of this model show up acutely whenever there is a wikitext error or a template coding error.
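To make the string-based nature of this concrete, here is a minimal sketch in Python. It is not how the PHP preprocessor or Parsoid is implemented; expand_echo is a hypothetical helper that only understands {{echo|...}}, which I am assuming simply returns its first argument unchanged.

```python
import re

# Hypothetical stand-in for the {{echo}} template: it is assumed to return
# its first argument unchanged. Only innermost, brace-free calls are matched.
ECHO = re.compile(r"\{\{echo\|([^{}]*)\}\}")

def expand_echo(wikitext: str) -> str:
    """Expand {{echo|...}} calls as plain string substitution until none remain."""
    previous = None
    while previous != wikitext:
        previous = wikitext
        wikitext = ECHO.sub(lambda m: m.group(1), wikitext)
    return wikitext

source = "<{{echo|div}} style='border:1px solid{{echo| red}}'>foo</div>"
print(expand_echo(source))
# <div style='border:1px solid red'>foo</div>
```

The div tag only exists after expansion, so nothing that looks at the unexpanded string can know that a div is being opened there. The same helper turns <ref>foo{{echo|</ref>}} into <ref>foo</ref>, which is why that example only makes sense if transclusions are expanded before extension tags are processed.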

This string-based model also means that HTML-based visual editing and wikitext do not always fit well together. Templates in particular, because of their string-based preprocessing model, often get in the way of HTML-based WYSIWYG editing.

This also has performance implications in terms of parsing wikitext. If we take any well-developed wiki page, in reality, the size of a large majority of edits will be "minor" relative to the full size of the page. However, in the general case, it is non-trivial to do an incremental update of the HTML based on the edit -- a full reparse is often required to guarantee correct rendering. For example, if I replaced {{some-transclusion}} with {{some-other-transclusion}} on a page, this edit can potentially change the rendering of the rest of the page.

Penalties imposed by current model


So, the current string-based preprocessing model of wikitext imposes the following "penalties" (if you will):

  • Increased complexity in reasoning about how changes to wikitext can affect the output of the page.
  • Forces serialized processing of transclusions on a page.
  • Breakdown in the HTML-based WYSIWYG editing semantics.
  • Difficulty implementing incremental updates of old HTML based on edits.
  • Differences in parsed output (Parsoid vs. PHP parser) on pages where there are wikitext errors.

Incremental parsing


I am now going to discuss what is involved in implementing incremental updates of old HTML (incremental parsing) based on wikitext edits. One question to ask is: given a wikitext edit, what is the smallest substring of wikitext (containing the edit) that we need to reparse so that the resulting HTML can be inserted back into the surrounding DOM context of the old page?

There are also two different issues to deal with here:

  1. well-formedness of the HTML string
  2. even if HTML is well-formed and yields a DOM, you may not be able to insert the new DOM without impacting the surrounding context.

Ex: [[Foo]] will always parse to a well-formed HTML string. However, if [[Foo]] is added inside another link, it affects the output of the outer link because in HTML, you cannot nest one A-tag inside another. In general, given the DOM D for a page, if F is the DOM subtree that we are going to replace with F' (the reparsed DOM subtree) to get a new DOM D' = D.replace(F, F'), we can safely use D' (i.e. expect that a full reparse of the page will also yield D') only if serializing D' and parsing it again gives back the same tree: D' == HTML5.parse(D'.outerHTML)
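A full version of this round-trip test needs a real HTML5 tree builder, which the Python standard library does not have. As a rough stand-in, the sketch below checks just one of the rules that makes the round trip fail, namely the nested A-tag case from the example. NestedAnchorFinder and has_nested_anchor are names I made up for illustration, and passing this check is necessary but not sufficient for the condition above.

```python
from html.parser import HTMLParser

class NestedAnchorFinder(HTMLParser):
    """Flags an <a> start tag seen while another <a> is still open."""

    def __init__(self):
        super().__init__()
        self.open_anchors = 0
        self.nested = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            if self.open_anchors > 0:
                self.nested = True
            self.open_anchors += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.open_anchors > 0:
            self.open_anchors -= 1

def has_nested_anchor(html: str) -> bool:
    """Return True if the serialized HTML contains an <a> inside an <a>."""
    finder = NestedAnchorFinder()
    finder.feed(html)
    finder.close()
    return finder.nested

# F' on its own is fine; F' spliced into an existing link is not.
outer_link = '<a href="./Bar">see {slot} for details</a>'
fragment = '<a href="./Foo">Foo</a>'
print(has_nested_anchor(fragment))                          # False
print(has_nested_anchor(outer_link.format(slot=fragment)))  # True
```

An HTML5 parser would close the outer link before opening the inner one, so the reparsed tree would differ from the spliced tree, which is exactly the situation the condition is meant to rule out.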

So, for every edited wikitext string, to get the smallest incremental update, we need to identify a DOM subtree F within the old DOM D that satisfies the constraint above. If we are considering incremental parsing for performance reasons, this is not going to work very well algorithmically, since the test is expensive (although it may still be cheaper than a full-page wikitext parse). So, we need heuristics or other constraints on wikitext constructs. Let us examine these next.

One obvious candidate for F is the immediate child of the <body> tag that contains the edited string, i.e. given an edit, we find the top-level node within which it is embedded, reparse that entity as a whole, and insert the new DOM subtree back. So, if we think about section-based wikitext editing, a section edit will force a reparse of just the section (including forcing a fixup of unbalanced tags and limiting wikitext errors to just that section). However, this will not work unless the full-page parse applies the same technique to every individual section, i.e. each section is processed to a full DOM independently. This means a construct like <div>foo\n==Section==\nbar</div> will effectively be parsed as if it were <div>foo</div>\n==Section==\nbar.
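Here is a rough sketch of what "each section is processed independently" does to that example. It is a toy, not a wikitext parser: split_sections and balance_tags are hypothetical helpers, headings are recognized with a simple regex, and tag fixup just closes what is left open and drops stray close tags within each chunk.

```python
import re

HEADING = re.compile(r"^(=+)\s*(.+?)\s*\1\s*$")
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*?(/?)>")
VOID = {"br", "hr", "img", "wbr"}  # small illustrative subset of void elements

def split_sections(wikitext):
    """Split at heading lines; each heading line starts a new chunk."""
    sections, current = [], []
    for line in wikitext.split("\n"):
        if HEADING.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    sections.append("\n".join(current))
    return sections

def balance_tags(fragment):
    """Drop stray close tags and append missing close tags within one chunk."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(fragment):
        out.append(fragment[pos:m.start()])
        pos = m.end()
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if closing:
            if name in stack:
                # Close everything opened after the matching open tag.
                while stack:
                    top = stack.pop()
                    out.append(f"</{top}>")
                    if top == name:
                        break
            # Otherwise the close tag has no open tag in this chunk: drop it.
        else:
            out.append(m.group(0))
            if not self_closing and name not in VOID:
                stack.append(name)
    out.append(fragment[pos:])
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)

page = "<div>foo\n==Section==\nbar</div>"
scoped = "\n".join(balance_tags(s) for s in split_sections(page))
print(scoped)
# <div>foo</div>
# ==Section==
# bar
```

The div no longer wraps the heading: the open tag is closed at the end of the first chunk and the dangling close tag in the second chunk is discarded, so an error in one section cannot leak into the next.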

Editing of sections and at finer granularities


An obvious first step that falls out of this is to treat sections of a page as independent documents and process them independently (except for page-level metadata like citations, categories, etc.). In effect, we are treating sections as a typed/scoped wikitext construct that will always return a DOM tree.

More generally, we can apply this DOM scoping to other wikitext constructs like lists, tables, and paragraphs as well, i.e. every list, table, and paragraph in a document would be processed to a DOM tree independently.

Doing so can enable finer-grained editing (and thus reduce the possibility of edit conflicts) at the granularity of individual lists, tables, or paragraphs within a section. The same granularity can also be used for implementing incremental HTML updates after wikitext edits.
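If every block really is processed to a DOM independently, an incremental update reduces to reparsing only the blocks whose source changed. The sketch below illustrates this under that assumption. Everything in it is hypothetical: blank-line-separated chunks stand in for lists, tables, and paragraphs, and render_block is a placeholder for the real (expensive) block-level parse.

```python
import hashlib

def split_blocks(wikitext):
    """Blank-line-separated chunks stand in for lists, tables, and paragraphs."""
    return [b for b in wikitext.split("\n\n") if b.strip()]

def render_block(block):
    """Placeholder for an expensive, independent block-level parse."""
    return f"<p>{block}</p>"

class IncrementalRenderer:
    """Caches rendered blocks by content hash and counts real parses."""

    def __init__(self):
        self.cache = {}
        self.parses = 0

    def render(self, wikitext):
        html = []
        for block in split_blocks(wikitext):
            key = hashlib.sha256(block.encode("utf-8")).hexdigest()
            if key not in self.cache:
                self.cache[key] = render_block(block)
                self.parses += 1
            html.append(self.cache[key])
        return "\n".join(html)

renderer = IncrementalRenderer()
renderer.render("intro\n\n* item one\n* item two\n\noutro")
print(renderer.parses)  # 3
renderer.render("intro\n\n* item one\n* item 2\n\noutro")
print(renderer.parses)  # 4: only the edited list was reparsed
```

Reusing the cached output is only safe because of the scoping assumption. With today's string-based model, an edit in one block can change how later blocks render, so a cache like this would return stale HTML.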

DOM-scoping of template output


Let us now move to templates and what this kind of DOM-scoping can enable.

Templates are often tweaked and edited. For popular templates, these edits can force full reparsing of a lot of pages (sometimes in the millions for common templates). This makes template edits extremely expensive. However, if there is a way to make HTML updates much cheaper, i.e. drop-in replacement of new HTML in place of old HTML, it not only reduces load on the cluster but also makes template edits much cheaper. For this kind of drop-in replacement to be possible, we once again have to be able to treat the output of templates as well-formed DOM trees. However, not all templates produce such output. There are large classes of templates that only work as sets (ex: sets of templates that collectively generate tables, multi-column layouts, infoboxes, etc.). So, we also need a mechanism to demarcate an entire section of wikitext -- this can be done either with special wikitext syntax or by relying on extension tags. (Ex: <domscope> {{table-start}} ... {{table-end}} </domscope>)
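To illustrate why the scope has to cover the whole set, here is a small sketch that checks whether a chunk of HTML is a balanced fragment on its own. The outputs assigned to {{table-start}} and {{table-end}} below are made up for illustration, BalanceChecker and is_well_formed_fragment are hypothetical names, and the check ignores void elements like <br>, so treat it as a toy.

```python
from html.parser import HTMLParser

class BalanceChecker(HTMLParser):
    """Tracks open tags and records any close tag that does not match."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False

def is_well_formed_fragment(html: str) -> bool:
    """True if every tag opened in the fragment is closed in it, in order."""
    checker = BalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and not checker.stack

# Assumed (made-up) expansions of a template set that only works together.
table_start = "<table><tr><td>"
table_end = "</td></tr></table>"

print(is_well_formed_fragment(table_start))                     # False
print(is_well_formed_fragment(table_end))                       # False
print(is_well_formed_fragment(table_start + "x" + table_end))   # True
```

Neither template yields a DOM tree by itself, but the span wrapped by something like <domscope> does, which is what makes it a candidate unit for drop-in replacement.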