User:SSastry (WMF)/Notes/Wikitext

Disclaimer: These are not necessarily all my ideas. They are the result of conversations we have had inside and outside the Parsoid team over the last couple of years, as well as of our experience developing Parsoid. I am just pulling things together, organizing them, and in some cases extending them.

Goals

  • Improve ability to reason about wikitext
  • Reduce edge cases by bounding range of wikitext errors
  • Improve editability in VE
  • Improve performance

DOM scopes

We can introduce the notion of a DOM scope for wikitext, where markup within a DOM scope is processed to yield a forest of well-balanced DOM trees. Here are some benefits of DOM scopes:

  • Editability: Individual DOM scopes can be edited independently and in isolation. This can help VE as well as other wikitext editing tools.
  • Performance: DOM scopes can be processed somewhat in isolation -- a step towards supporting incremental parsing.
  • Simpler semantics: You don't have to look at the rest of the page to make sense of what a given piece of markup does (I am deliberately exaggerating this to highlight that, when enforced, this property does not depend on the page being free of wikitext markup errors).

DOM scopes within a page are properly nested. This means that the page can be represented as an abstract tree with DOM scopes for its nodes.
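As a rough illustration of that nesting property, here is a minimal sketch of how a page might be modeled as a tree of DOM scopes, each owning its own balanced DOM forest. All names here (DomScope, ScopeKind, findScopeAt) are hypothetical and not part of any existing Parsoid API.

  // Hypothetical model of a page as a tree of DOM scopes.
  // Each scope owns a well-balanced DOM forest for its own markup;
  // child scopes are properly nested inside their parent.
  type ScopeKind =
    | 'page'
    | 'section'
    | 'table'
    | 'list'
    | 'extension'
    | 'transclusion'
    | 'image-caption';

  interface DomScope {
    kind: ScopeKind;
    srcStart: number;        // source offsets of this scope in the page wikitext
    srcEnd: number;
    dom: DocumentFragment;   // balanced DOM forest parsed from this scope's markup
    children: DomScope[];    // properly nested child scopes
  }

  // Because scopes nest properly, a tool can locate and re-parse the one
  // scope (e.g. a single table) that covers an edited offset, without
  // touching the rest of the page.
  function findScopeAt(scope: DomScope, offset: number): DomScope {
    for (const child of scope.children) {
      if (offset >= child.srcStart && offset < child.srcEnd) {
        return findScopeAt(child, offset);
      }
    }
    return scope;
  }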

Possible candidates for DOM scoping

Tables, lists, sections, extensions, and specially-marked-transclusions (either via templatedata or whatever other mechanism we come up with) could all be considered DOM scopes. Image captions could potentially be considered as well.

Implementation considerations

Given this, any compliant wikitext runtime would have to parse the string produced by each of these wikitext constructs into a DOM tree and reserialize it.

Based on the DOM scope tree, and a DOM for every node in the tree, we can now get a unified DOM for the entire page by replacing every scope with its DOM. However, given that HTML5 has constraints on tag nesting (<a>-in-<a> and <p>-in-<p> are disallowed, for example), if we serialize this unified DOM to an HTML string and reparse it with an HTML5-compliant parser, the new DOM might not be identical to the DOM that produced the HTML string. So, this needs some more thought.
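For example, the replacement step could look something like the following sketch, which assumes a made-up convention where each scope's DOM contains placeholder elements marking where its child scopes belong:

  // Hypothetical assembly of a unified page DOM from the scope tree.
  // Assumes (purely for illustration) that the i-th child scope is
  // marked in the parent's DOM by an element with data-scope-index="i".
  function assembleDom(scope: DomScope): DocumentFragment {
    // Work on a copy so the per-scope DOM remains reusable.
    const result = scope.dom.cloneNode(true) as DocumentFragment;
    scope.children.forEach((child, i) => {
      const placeholder = result.querySelector(`[data-scope-index="${i}"]`);
      if (placeholder) {
        placeholder.replaceWith(assembleDom(child));
      }
    });
    return result;
  }

This only composes trees; the HTML5 re-parsing caveat above still applies to any serialization of the result.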

DOM scopes have been filed as task T114444 for RFC discussion.

Core ideas

  • Parsing the wikitext for any page returns a DOM tree with metadata annotations on different nodes.
  • All pages are composed using three different forms of syntax:
    1. Basic markup (lists, bold, italic, headings, tables, links, ...):
      • Can be wikitext-1.0 basic markup OR markdown OR whatever else as long as there is a pluggable implementation for it.
      • For Wikipedias, it will continue to be wikitext-1.0.
    2. Metadata markup
      • Markup for references, annotations, edit notices, category links, language links, etc. The following may not be the best syntax for it, but it is something to riff off of:
        • <m:notice>..</m:notice> (instead of <!-- ... some edit-notice here ... -->)
        • <m:cat>foo</m:cat> (instead of [[Category:foo]])
        • <m:ref>..</m:ref>
        • <m:comment>..</m:comment>
        • <m:author>..</m:author>
      • Pluggable processors for different metadata types (a possible interface shape is sketched after this list).
        • All metadata is attached to a specific place in the DOM. Metadata could be page-specific or DOM-node-specific. The former could be treated as a special case of the latter where it is attached to <BODY>.
        • Metadata markup generates visible output in some cases (refs) and JSON data / non-visible HTML markup (category links, annotations, etc.) in others.
    3. Content-generator markup (transclusions, extensions, data widgets, wikidata-driven infoboxes, whatever)
      • Depending on the context in which it is used, the output is treated as a string, a list of k=v HTML attributes, or a DOM forest; there is nothing in between (see the interface sketch after this list).
  • No concept of preprocessing at the top level. The top-level parser is just that: a parser that returns a DOM tree.
    • However, the transclusion generator implementation can support preprocessing. The output of the preprocessor can be wikitext, which is then processed according to what the use context demands (string, HTML attributes, DOM). We can treat this as just another extension -- just that it is shipped and enabled by default on most wikis.
  • No page-global state for anything, even for <ref>s. The effect of global state is reproduced by doing a DOM walk on the final DOM. For example, as a fall-back for browsers that do not support CSS counters, a post-pass might update the DOM by inserting ref counter values (a minimal sketch of such a post-pass follows this list).
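To make the pluggability above a bit more concrete, here is one possible shape for metadata processors and content generators, including the three (and only three) output forms. Everything here is an illustrative assumption, not an existing interface:

  // Hypothetical plugin interfaces; names and shapes are illustrative only.

  // The three forms a content generator's output can take, depending on
  // the context in which it is used: a string, a set of k=v HTML
  // attributes, or a DOM forest. Nothing in between.
  type GeneratorOutput =
    | { kind: 'string'; value: string }
    | { kind: 'attributes'; value: Record<string, string> }
    | { kind: 'dom'; value: DocumentFragment };

  type OutputContext = 'string' | 'attributes' | 'dom';

  interface ContentGenerator {
    name: string;  // a transclusion, extension, data widget, etc.
    expand(args: Record<string, string>, context: OutputContext): GeneratorOutput;
  }

  interface MetadataProcessor {
    type: string;  // e.g. 'ref', 'cat', 'notice', 'comment', 'author'
    // Metadata attaches to a specific DOM node; page-wide metadata
    // attaches to <body>. It may emit visible output (refs) or only
    // JSON data / non-visible markup (category links, annotations).
    process(content: string, attachTo: Element): void;
  }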
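And a minimal sketch of the kind of post-pass mentioned in the last bullet: a walk over the final DOM that numbers refs in document order, with no page-global state during parsing. The selector and attribute names are assumptions:

  // Hypothetical post-pass: number ref markers by walking the final DOM
  // in document order instead of keeping a page-global counter while parsing.
  function numberRefs(body: HTMLElement): void {
    let counter = 0;
    body.querySelectorAll('sup.ref-marker').forEach((marker) => {
      counter += 1;
      marker.textContent = `[${counter}]`;
      marker.setAttribute('data-ref-index', String(counter));
    });
  }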

Other notes

  • A page can be a special case of a content generator.
  • This enables caching at the content-generator level: the output of {{convert|2|km}} can be cached no matter what page it is used on (unlike now, where it is reprocessed for every page it is found on). A sketch of such a cache follows.
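As a sketch of what that could look like, reusing the hypothetical ContentGenerator interface from the earlier sketch, a page-independent cache could be keyed purely by generator name, arguments, and use context (the key derivation here is deliberately simplified):

  // Hypothetical page-independent cache for content-generator output.
  // Because generators carry no page-global state, the key needs only the
  // generator name, its arguments, and the output context, e.g. {{convert|2|km}}.
  const generatorCache = new Map<string, GeneratorOutput>();

  function expandCached(
    gen: ContentGenerator,
    args: Record<string, string>,
    context: OutputContext
  ): GeneratorOutput {
    const key = JSON.stringify([gen.name, args, context]);  // simplified keying
    let output = generatorCache.get(key);
    if (!output) {
      output = gen.expand(args, context);
      generatorCache.set(key, output);
    }
    return output;
  }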

Open questions

  • Where do parser functions fit in this scheme of things?
  • Does this anticipate and meet all current templating uses?