Parsoid/OutputTransform/HtmlHolder

When Parsoid (and Parsoid-aware transforms) hold a DOM object model, there are two important features/extensions that an HtmlHolder interface in core needs to be aware of. The first is structured-value and private attributes within the Document; the second is the representation of standalone document fragments. This document describes these features and presents the consequent design decisions relating to the HtmlHolder abstraction.

Structured-value attributes and the DataBag

The MediaWiki DOM Spec contains a large number of "JSON-valued" attributes, to express structured values in HTML attributes in a compact and bandwidth-friendly way. Parsoid implements support for this primarily in the DOMDataUtils class, based on the DOMDataUtils::getJSONAttribute() method which returns a structured value: in the PHP implementation an associative array; formerly in the JavaScript implementation a JS Object. This value is nominally present in "plain old HTML" or via Element::getAttribute() as the JSON-encoded value of the array/object, but is stored "live": that is, it is not parsed and re-serialized to a string attribute value every time the attribute is read or modified, but instead is kept as a live array or object value attached to the DOM Node. References can be kept to the live value, and it can be mutated; any such change is immediately visible to every other holder of a reference to the value.
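As an illustration of what "live" means here, the following is a minimal sketch in Python rather than Parsoid's PHP. The Element class and the get_json_attribute() helper are hypothetical stand-ins invented for this example; they are not Parsoid's DOMDataUtils API.

```python
import json

class Element:
    """Toy stand-in for a DOM element (hypothetical, not Parsoid's class)."""
    def __init__(self):
        self.attributes = {}  # raw string attributes
        self.live = {}        # live structured values attached to the node

def get_json_attribute(el, name):
    """Return the live value, parsing the string form only on first access."""
    if name not in el.live:
        el.live[name] = json.loads(el.attributes.get(name, "{}"))
    return el.live[name]

el = Element()
el.attributes["data-mw"] = '{"name": "ref"}'

a = get_json_attribute(el, "data-mw")
b = get_json_attribute(el, "data-mw")
a["attrs"] = {"group": "note"}   # mutate through one reference

shared = b is a                  # both callers hold the same object
seen_by_b = b["attrs"]["group"]  # the change is visible through the other reference
```

Because both callers receive the same object, no re-parse or re-serialization happens between the two reads.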

The actual implementation is a bit baroque: in addition to a proposal ("Rich attributes") to extend this basic mechanism to include DOM DocumentFragment values, there are multiple serialization formats for these structured values. The nominal "as a JSON-encoded string" version we call "inline attributes". It suffers from a perceived "ugliness" problem, since inside a quoted HTML attribute value all quotes must be escaped, and JSON-encoded values contain a large number of quotation marks. This is mitigated by using single quotes around the attribute value, a minor departure from standard HTML serialization, but if the structured value contains HTML markup, escaping becomes inevitable because (a) both available quotation marks have been used, and (b) < and & must additionally be escaped. There is a separate, orthogonal issue with the exposure of "private" attributes in this naive serialization, discussed below.

For these two reasons, Parsoid has historically supported two alternative encodings of structured attributes. By adding a unique id attribute to every node, the values of structured attributes can be hoisted out of the HTML and stored as a mapping from ID to attribute value. In one encoding this map is kept as a separate JSON-encoded blob alongside the HTML; the combination of JSON blob and HTML is called a PageBundle (page bundles have further uses described below). In the other, the combination is kept as a single HTML document, with the JSON-encoded map stored in a <script> element in the <head> of the Document. This reduces the bloat caused by escaping all the quotation marks in the structured attributes, but costs additional bandwidth to record ID attributes on every node and to repeat those ID values as the keys of the map in the <head>.
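The difference between the inline encoding and the ID-keyed map can be sketched as follows. This is a toy Python illustration, not Parsoid's serializer: the node list, the "mw0"-style IDs, and both function names are assumptions made for the example.

```python
import html
import json

# Hypothetical toy document: a list of (tag, structured attributes) pairs.
nodes = [
    ("p", {"data-parsoid": {"dsr": [0, 5, 0, 0]}}),
    ("a", {"data-mw": {"caption": "<b>hi</b>"}}),
]

def inline_encoding(nodes):
    """JSON-encode each value into a single-quoted attribute, escaping &, <, > and '."""
    parts = []
    for tag, attrs in nodes:
        rendered = ""
        for name, value in attrs.items():
            text = html.escape(json.dumps(value), quote=False).replace("'", "&#39;")
            rendered += f" {name}='{text}'"
        parts.append(f"<{tag}{rendered}></{tag}>")
    return "".join(parts)

def id_map_encoding(nodes):
    """Hoist values into an ID-keyed map; the HTML keeps only an id attribute."""
    parts, id_map = [], {}
    for index, (tag, attrs) in enumerate(nodes):
        node_id = f"mw{index}"  # illustrative ID scheme
        id_map[node_id] = attrs
        parts.append(f'<{tag} id="{node_id}"></{tag}>')
    return "".join(parts), id_map

inline_html = inline_encoding(nodes)
bundle_html, bundle_map = id_map_encoding(nodes)
```

In the inline form the markup inside the caption has to be entity-escaped, while in the map form it stays as ordinary JSON and the HTML carries only the extra id attributes.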

This id-to-attribute map is also used internally to the implementation: instead of hanging the rich attribute values directly off of the DOM Node, in the PHP implementation the ID-to-value map is stored in a DataBag which is attached to the root Document object. This is because the existing PHP implementation of the DOM uses ephemeral PHP objects to wrap the "actual" representation of the Node implemented by the libxml2 library. Those ephemeral PHP wrapper objects are created and destroyed every time a reference to the Node goes into or out of scope in PHP. When the ephemeral PHP wrapper goes out of scope, any data attached to the Node is destroyed, even if a reference to the Node is still present in the native document model. By keeping a persistent reference to the (wrapper of the) main Document object in Parsoid's Env class, which is kept alive for the duration of the parse, we can prevent the DataBag from being destroyed. (We could also just keep an explicit reference to the DataBag in the Env which would avoid the use of dynamic properties in PHP.)

Parsoid contains a "load" mechanism, run after DOM parsing, which loads structured-value attributes into the DataBag; it is implemented in DOMDataUtils::visitAndLoadDataAttribs(), with a corresponding "store" mechanism in DOMDataUtils::visitAndStoreDataAttribs(). In our implementation, changes made to the live object stored in the DataBag are not reflected in the raw attribute value visible via Element::getAttribute() until a "store" is done. Similarly, several Parsoid helper methods based on structured attributes, like ::getDataParsoid(), will not work correctly until a "load" is done. The "eager" loading mechanism could be replaced by a "lazy" loader which does not locate and load structured values (whether from inline attributes or a map) until requested. Lazy loading could eliminate the need for an explicit "load" step, but since the values are live and can be mutated without notification to the DOM layer, an explicit "store" step will always be necessary to ensure the serialized DOM reflects the latest values for structured-value attributes.
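The load/store cycle can be sketched roughly as below. This is a simplified Python model under stated assumptions: the Element and Document classes, the "bag-id" attribute name, and the load(), store() and get_live() helpers are all hypothetical, and the real visitAndLoadDataAttribs() and visitAndStoreDataAttribs() differ in detail.

```python
import json

STRUCTURED = ("data-parsoid", "data-mw")

class Element:
    def __init__(self, attributes):
        self.attributes = dict(attributes)  # raw string attributes
        self.children = []

class Document:
    def __init__(self, root):
        self.root = root
        self.bag = {}  # toy DataBag: bag ID -> {attribute name: live value}

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def load(doc):
    """Eager 'load': parse JSON attributes into live values held by the bag."""
    for node in walk(doc.root):
        live = {n: json.loads(node.attributes[n]) for n in STRUCTURED if n in node.attributes}
        if live:
            key = str(len(doc.bag))
            doc.bag[key] = live
            node.attributes["bag-id"] = key  # hypothetical linking attribute

def get_live(doc, node, name):
    return doc.bag[node.attributes["bag-id"]][name]

def store(doc):
    """'Store': write the current live values back into the raw attributes."""
    for node in walk(doc.root):
        key = node.attributes.pop("bag-id", None)
        if key is not None:
            for name, value in doc.bag.pop(key).items():
                node.attributes[name] = json.dumps(value)

p = Element({"data-parsoid": '{"dsr": [0, 5]}'})
doc = Document(p)
load(doc)
get_live(doc, p, "data-parsoid")["dsr"][1] = 9  # mutate the live value
stale = p.attributes["data-parsoid"]            # raw attribute not yet updated
store(doc)
fresh = p.attributes["data-parsoid"]            # updated only by the store step
```

The mutation goes through the bag without the DOM layer being notified, which is why the raw attribute stays stale until store() runs.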

Private attributes

The implementation and encoding of structured-value attributes in Parsoid was also influenced by an API decision that the contents of data-parsoid attributes were to be considered implementation-private. This was enforced at an API level by stripping the data-parsoid attributes in HTML provided to most clients, and then re-inserting the attributes from separate storage when necessary, keyed by a render ID assigned to the parse. In addition to strictly enforcing the abstraction boundary, this also saved bandwidth on API responses.

Special treatment was also extended to data-mw attributes, which were used by convention to store information "needed by editing clients but not for readers". The idea was that data-mw attributes would also be stripped in content served for read views or for reader clients to save additional bandwidth.

In this context, an additional benefit to storing the structured attributes outside the Document (or in a separate element in the <head>) was that it allowed API code to efficiently implement these attribute-stripping strategies without requiring node-by-node traversal. In practice this benefit was undercut by the fact that Parsoid's principal client, VisualEditor, used the contents of data-mw attributes, requiring the data-mw attributes to be explicitly reloaded from the separate storage before the HTML was usable.

Since the implementation of separate storage was tied to abstraction boundary design goals for data-parsoid and data-mw specifically, the DataBag and load/store mechanism was initially implemented only for structured values of these two attributes. Other structured value attributes used the uncached storage mechanism of DOMDataUtils::getJSONAttribute() and the values returned were not live but had to be explicitly saved with DOMDataUtils::setJSONAttribute(). The Rich Attribute proposal would extend live storage to all structured-value attributes and separate out the policy decision regarding the precise set of structured attributes to be encoded in separate storage (as opposed to inline).

Design decisions

For an HtmlHolder interface in core, two views of the document are provided: an HTML string and a DOM object model.

We have decided that the DOM representation will contain structured data that has been appropriately "loaded" -- that is, operations provided to core that operate on structured values will work immediately on the DOM returned without requiring an explicit load step. An equivalent to DOMDataUtils::getJSONAttribute() will be provided in core (or more likely, in an HTML library which may also contain parts of Parsoid's DOMCompat library) which will work on the DOM as returned by HtmlHolder. This is consistent with either an eager "load" step occurring after string-form HTML is parsed, or with a lazy load step integrated with the implementation of the structured value API provided to core.

The HTML string provided by HtmlHolder will be the "naive" inline-attribute serialization of the document, not one of the alternate encodings. When converting from DOM to a string, an appropriate "store" step will be performed to serialize the current live values of structured attributes. Private attributes like data-parsoid will not be stripped from the HTML string. HtmlHolder will therefore need to know about "structured value" HTML (again, as an abstraction provided by the HTML library used), but will not need to specially handle data-parsoid or any other Parsoid-internal attributes.

Serialization to ParserCache

Note that the actual representation stored in the ParserCache (i.e., the serialized version of the HtmlHolder) does not need to be the same as the string form of the HTML returned by HtmlHolder. Optimized encodings could be used to reduce the "lots of escaped double-quotes and angle brackets" issue with the naive inline-attribute representation. The primary performance requirement is that, if policy decides that read view HTML is to be cached for a specific page, it must be possible to render that read view HTML directly from the serialized value stored in the ParserCache with minimal additional processing. But read view HTML is not expected to have many (if any) structured-value attributes in it. So long as optimized encodings do not touch the set of attributes present in read views, read views ought to still be servable directly from the ParserCache representation. (Perhaps the optimized serialization can include a flag explicitly indicating when it is suitable for serving directly to clients, based on the absence of structured-value attributes.)

For edit views, the "inline-attribute" representation matches what the VisualEditor client expects, although currently data-parsoid is stripped by the API. The visual editor API which provides access to edit-mode HTML can choose to reimplement data-parsoid stripping for performance/bandwidth reasons, but it is not required. We are already serving content with inline data-parsoid to some VE clients, so the presence or absence of data-parsoid should not cause issues.

The precise details of the ParserCache serialization of HtmlHolder should as far as possible be hidden from clients, and changes made to the serialization format for performance or efficiency reasons should not affect the DOM model or HTML strings provided to callers.

It is worth noting that the JSON serializer used by ParserCache is currently implemented in MediaWiki core. Although probably not strictly required, a JSON codec implemented in an external library would be helpful in ensuring that HtmlHolder is deserialized as an object of the correct type: T346829.

Enumeration of fragments and metadata

In addition to an HTML Document, wikitext parsing results in a collection of metadata. Historically that metadata was stored in the PageBundle and returned to API clients as JSON, although some portions of the metadata were also returned as HTTP headers in the REST response. The integration of Parsoid with core has eliminated the need for a REST API-focused PageBundle structure, and made available the much richer ParserOutput object to hold metadata generated by parsing. For compatibility with existing calling conventions and the REST API, methods in core exist to convert metadata stored in PageBundle objects to "extension data" stored in ParserOutput, and the ContentMetadataCollector interface in Parsoid exists to allow Parsoid to directly write metadata to the ParserOutput object held by core. We currently accommodate the encoding of structured attributes as a standalone map by embedding that map in the PageBundle, which is then reflected into the parsoid-page-bundle extension data key when the PageBundle is stored in a ParserOutput.

The richer variety of metadata represented by ParserOutput and newly implemented by Parsoid introduced another issue: instead of one Document representing the entire result of the parse, certain pieces of metadata were "HTML strings" and thus logically separate DocumentFragments generated by the parse. Many of these fragments were stripped HTML of one sort or another (page title, TOC entries) but, for example, the "page indicator" mechanism in core represented an entire wikitext fragment that certainly requires post-processing (localization) and likely requires appropriate representation of structured attributes within the fragment as well. Extension implementations seem to want to store Parsoid-generated document fragments in ParserOutput's extension data mechanism as well, for later use in a final composition step.

This raises two related questions:

Should short HTML fragments of this sort be represented by individual HtmlHolder objects? If the HtmlHolder objects are separate, is the "owner document" for each fragment unique as well, or are all fragments conceptually part of a single Document?
For post-processing passes which want to operate on all Parsoid-generated HTML (for example, user-specific localization), how can all such fragments be located within the ParserOutput (and its extension data) and enumerated so they can be appropriately transformed?

It's worth noting that similar questions arose in the Parsoid implementation regarding the "owner document" of fragments created internally during parse and that after much work most fragments in Parsoid now share the same owner document (although an awkward Remex API means many of these fragments are created as separate documents that then have to be adopted by the main owner). Unifying the owner documents is not a complete solution to the enumeration question, however, since there exists no DOM API for enumerating all child fragments of a given owner document (and to do so would seem to require weak references at least).

Design decisions

The PageBundle data structure will be removed from Parsoid and moved from core's MediaWiki\Parser namespace into the REST API implementation, as a feature of the REST interface design but not a core Parsoid abstraction. The metadata written by Parsoid which is not already reflected by appropriate ParserOutput properties, such as specific content headers needed by the REST API, will be written by Parsoid directly to ParserOutput extension data using Parsoid's ContentMetadataCollector interface, either using the existing parsoid-page-bundle key or new keys specific to the particular metadata. The main Parsoid entrypoints will use HtmlHolder+ContentMetadataCollector as result types rather than PageBundle; this will also avoid a serialization step and allow Parsoid to return its DOM result (with live structured attributes) directly to core. The XmlSerializer code can also be removed from Parsoid, since Parsoid's APIs will now be DOM-based. HtmlHolder plus support code for structured attributes will likely be moved to a library, which is probably also a good home for serialization code like XmlSerializer.

(Tentatively:) An API will be provided to store and fetch DocumentFragments by ID from <template> elements in the Document <head>. "Child" HtmlHolder instances will serialize themselves as simply the appropriate ID key, and they will fetch the appropriate DocumentFragment from the parent based on ID when necessary. This will allow storage of DocumentFragments (held by the HtmlHolder) in extension data or in ParserOutput fields. Structured data contained within these fragments can then be manipulated live, and will be appropriately loaded and stored by the parent Document (held by its own HtmlHolder). Since these child fragments are part of the main document tree, they can be enumerated and mutated in place by post-processing passes without explicit knowledge of where each fragment is referenced, and structured data attributes within the fragment will be transparently held by the DataBag or other mechanism used by the parent. Inside the HTML library, an API will allow easy creation of a new empty DocumentFragment/HtmlHolder tied to the owner document, as well as (for legacy compatibility) creation of a new DocumentFragment/HtmlHolder tied to the owner document from an HTML string. Enumerating all fragments for post-processing can be done with Document::querySelectorAll('body, head > template'); this can also be exposed as an API helper method.
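A rough sketch of this tentative design, using a toy DOM in Python: the Node and Document classes, the method names store_fragment(), fetch_fragment() and fragment_roots(), and the "indicator-1" ID are all hypothetical. fragment_roots() only emulates the effect of querySelectorAll('body, head > template') in document order; it is not a selector engine.

```python
class Node:
    """Toy DOM node; a stand-in for a real DOM library's Element."""
    def __init__(self, tag, attributes=None, children=None, text=""):
        self.tag = tag
        self.attributes = dict(attributes or {})
        self.children = list(children or [])
        self.text = text

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

class Document:
    def __init__(self):
        self.head = Node("head")
        self.body = Node("body")

    def store_fragment(self, fragment_id, nodes):
        """Keep a fragment as a <template id=...> child of <head>."""
        self.head.children.append(Node("template", {"id": fragment_id}, nodes))

    def fetch_fragment(self, fragment_id):
        for child in self.head.children:
            if child.tag == "template" and child.attributes.get("id") == fragment_id:
                return child
        raise KeyError(fragment_id)

    def fragment_roots(self):
        """Emulates 'body, head > template', in document order."""
        templates = [c for c in self.head.children if c.tag == "template"]
        return templates + [self.body]

def post_process(doc, transform):
    """Apply one pass to the main body and to every stored fragment."""
    for root in doc.fragment_roots():
        for node in walk(root):
            transform(node)

def shout(node):
    node.text = node.text.upper()  # trivial stand-in for a localization pass

doc = Document()
doc.body.children.append(Node("p", text="hello"))
doc.store_fragment("indicator-1", [Node("span", text="draft")])
post_process(doc, shout)
```

The pass reaches the stored fragment without knowing which extension-data key or ParserOutput field refers to it, because the fragment lives in the same tree as the body.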

The JSON codec for child holders will need to use the codec context to ensure that child HtmlHolder objects are properly relinked to the parent on deserialization. The Slack discussion on T346829 seemed to get hung up on whether this sort of stateful deserializer should be "discouraged but possible" or whether the JSON codec should explicitly prohibit anything but value objects. If serialization is restricted to simple value objects, then the HtmlHolder::getHtml() and HtmlHolder::getDom() methods need to take a parent object (ParserOutput, parent HtmlHolder, etc.) as an explicit parameter (or as an explicit parameter of a similar-but-not-identical ChildHtmlHolder class) so that the child holder can be relinked to the parent Document after deserialization. (Perhaps even ParserOutput::getRawText() could return a "child" HtmlHolder, with the contents of the <body> tag as a special case and the full parent document stored elsewhere. This makes all HtmlHolders "children" with references to a special ParentHtmlHolder; i.e., the parent is the exception, not the child.)
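The "stateful deserializer" option can be sketched as follows, again as a toy Python model rather than core's JsonCodec. ParentHolder, ChildHolder, and the shape of the context argument are assumptions made for this example; fragments are held as plain HTML strings for brevity.

```python
import json

class ParentHolder:
    """Toy parent: owns the fragments, keyed by ID."""
    def __init__(self):
        self.fragments = {}

class ChildHolder:
    """Toy child: serializes as just its ID and resolves it via the parent."""
    def __init__(self, parent, fragment_id):
        self.parent = parent
        self.fragment_id = fragment_id

    def get_html(self):
        return self.parent.fragments[self.fragment_id]

    def to_json(self):
        return {"id": self.fragment_id}

    @classmethod
    def from_json(cls, data, context):
        # The codec context supplies the parent, so the child is relinked
        # at deserialization time rather than on each later call.
        return cls(context["parent"], data["id"])

parent = ParentHolder()
parent.fragments["f1"] = "<b>x</b>"

child = ChildHolder(parent, "f1")
wire = json.dumps(child.to_json())
restored = ChildHolder.from_json(json.loads(wire), {"parent": parent})
```

Under the value-object-only alternative, from_json() would take no context and get_html() would instead accept the parent as an explicit argument.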

Note that the initial steps in the Rich Attributes proposal (before proposal 3) require the caller to provide the context type for the deserializer, which is a different design than the JsonCodec used in core, which follows the more typical design where the serialized object contains its own type marker.

I believe that the parsing model for <template> inside <head> allows serialization of fragments in an appropriate way. <meta> and <link> tags are processed using the "in head" insertion mode, but this seems to match how they are processed "in body". If there is some issue, it may be necessary to add another wrapper element inside the <template> (like a <body> tag) to reset the parsing mode so we're not "in template" and <meta> etc. tags are parsed properly.