Parsoid/MediaWiki DOM spec/Rich Attributes

This is an experimental proposal for a future revision of the MediaWiki DOM spec. Patch at gerrit 821281; see also T214994.

The problem

The DOM model of HTML is not orthogonal. Elements can contain elements which can contain elements, in a pleasant tree structure, but attributes of elements are limited to plain strings. You cannot nest further structure inside an attribute, and you cannot store multiple values within an attribute (although there are hacks involving tokens separated by a delimiter, such as spaces). This is a fairly well-known issue with XML, and common advice includes: "If you use attributes as containers for data, you end up with documents that are difficult to read and maintain. Try to use elements to describe data. Use attributes only to provide information that is not relevant to the data." Similar advice is given elsewhere.

But HTML uses attributes all over the place, and in some places they are essential, for example the href attribute of an <a> tag. The most natural rendering of the wikitext:

[http://example.com/{{1x|foo/bar}} {{1x|caption}}]

is something like:

<a href="http://example.com/WHAT GOES HERE"><span...>caption</span></a>

where WHAT GOES HERE is not trivially clear; ideally, the same <span> wrapper we use for the caption to embed metadata about the transclusion (of Template:1x in this case) could be used inside the href attribute as well.

Examples of content often embedded within HTML attributes in the MediaWiki DOM spec:

Transclusions (templates, etc)
Language Converter markup
i18n/l10n markup (system messages/ux)
Annotations (translation boundaries, etc)
Style and title attributes, which can have (eg) boldface or other formatting applied in wikitext
Generated attributes of HTML tags (with special ad-hoc markup in the DOM spec)
- This overlaps with many of the categories above, but the following template-generated attributes are specifically called out in the MediaWiki DOM spec: attributes of literal HTML tags in wikitext, href attributes of links, style/width/height/caption/alt of media

Another related issue is "invisible HTML content", for example the invisible caption of a media file which is currently being displayed inline, the output of a suppressed language converter rule, the output for language variants which are not the current one, etc. These cannot be embedded directly in the output HTML because they may break the HTML content model -- for example, block type content in a paragraph context. That shouldn't break the paragraph because the content is currently invisible, but if you just dropped it into the document with a display: none CSS style it would break its container. We typically "hide" this content in an attribute (currently a JSON-valued attribute) but then it complicates HTML traversal: various html2html transformations need to know enough about these special hiding places in order to recurse inside and mutate the embedded HTML.

Note that we are focusing on structured data in attribute values here; although one can certainly imagine structured values for attribute names (or element tag names!), we are explicitly keeping that out-of-scope. Attribute names are like element tag names and are identifiers, not user-generated content. (A future spec may add "key value pairs" to the argument/output types allowed for transclusion, which would be the way to support dynamic attribute names in our framework.)

Current solutions

The generated attributes of HTML tags portion of the MediaWiki DOM Spec works out a system for recording the template-affected portions of an attribute value, as an array of "parts", stored in the data-mw.attribs value. This mechanism works for <span class="-{foo}-"> and for [[{{1x|Foo}}]] but isn't a fully-general mechanism; eg it doesn't work for <poem class="-{foo}-">.

The key used in the attribs value does not always bear a direct relationship to the HTML attribute it is describing, for example this sample markup from the MediaWiki DOM spec for the wikitext [[File:foo.jpg|{{1x|thumb}}|{{1x|160px}}]]:

<figure
  typeof="mw:File/Thumb mw:ExpandedAttrs"
  about="#mwt3"
  data-mw='{
    "attribs": [
      ["thumbnail",
        {"html":"&lt;span about=\"#mwt1\" typeof=\"mw:Transclusion\"
          data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"thumb\"}},\"i\":0}}]}&apos;>thumb&lt;/span>"}
      ],
      ["width",
        {"html":"&lt;span about=\"#mwt2\" typeof=\"mw:Transclusion\" data-mw=&apos;{\"parts\":[{\"template\":{\"target\":{\"wt\":\"1x\",\"href\":\"./Template:1x\"},\"params\":{\"1\":{\"wt\":\"160px\"}},\"i\":0}}]}&apos;>160px&lt;/span>"}
      ]
    ]
  }'>
    ... Rest of image HTML here ...
</figure>

Note that the template-affected attributes are typeof and the width attribute of the internal <img> tag, but the attribute information is on the <figure> tag and the names in data-mw.attribs are thumbnail and width. This may in fact be the best/only way to handle complicated situations like media, where the attribute values do not bear a one-to-one resemblance to wikitext, but for the simpler <a href> and <div style="{{1x|....}}"> a more direct correspondence would make the Parsoid HTML easier to interpret and traverse. Even for media, you could argue that (eg) the alt attribute should be markup applied to the <img> whereas the mw:ExpandedAttrs markup applies it to the wrapper.

Note that data-mw.attribs in the extended attributes mechanism is an array of [name, value] pairs, where both the name and the value have the nominal structure:

{"txt":"<flattened value>","html":"<document fragment>"}

where a plain string can be used instead of the pair object in cases where the flattened value and the document fragment are identical. This allows for HTML-valued attribute names as well as values.
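The shorthand described above can be illustrated with a small sketch. It is written in Python purely for brevity (Parsoid itself is PHP), and the helper names normalize_part and normalize_attribs are hypothetical, not part of any Parsoid API. It assumes data-mw has already been extracted from the element as a JSON string.

```python
import json


def normalize_part(part):
    """Expand a name or value from data-mw.attribs to the explicit object form."""
    if isinstance(part, str):
        # Shorthand: the flattened value and the document fragment are identical.
        return {"txt": part, "html": part}
    # Already an object; it may carry only "txt" or only "html". Copy it as-is.
    return dict(part)


def normalize_attribs(data_mw_json):
    """Return data-mw.attribs as a list of (name, value) pairs of objects."""
    data_mw = json.loads(data_mw_json)
    return [
        (normalize_part(name), normalize_part(value))
        for name, value in data_mw.get("attribs", [])
    ]
```

Because names go through the same expansion as values, a consumer written this way handles HTML-valued attribute names without a separate code path.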

We also have a "shadow attribute" mechanism (WTSUtils::getAttributeShadowInfo) which is similar, in that it stores a "richer" value for a given attribute in a hidden data-parsoid property.

Relatedly, structured values for data-parsoid and data-mw are supported via a core interface to fetch a "JSON attribute": DOMDataUtils::getJSONAttribute() which returns an associative array. The implementation of this mechanism is discussed further in Parsoid/OutputTransform/HtmlHolder#Structured-value attributes and the DataBag. Many of the features of structured-value attributes in Parsoid (such as live object representation of values) are restricted to the data-parsoid and data-mw attributes. Other attributes with structured values accessed via ::getJSONAttribute() get a copy of the value which must be explicitly re-written using ::setJSONAttribute() after it is mutated.

There is no "built-in" support for storing document fragments in structured-value attributes; in a number of places where this is done the values are manually parsed from/serialized to strings. This does not interoperate well with the DataBag mechanism used for data-parsoid and data-mw attributes (discussed in link above).

We currently have DOM traversal code which is aware of mw:ExpandedAttrs and some other places where embedded markup can be stored. Because of the limited support for embedded HTML in structured-value attributes, the traversal code has to explicitly parse embedded HTML and then restore potentially-modified HTML after the traversal has completed, regardless of whether the traversal actually mutated the embedded DOM. The issues compound if the embedded DOM itself has structured-value attributes, potentially including additional embedded HTML.

Proposals

This proposal is called "rich attributes" to distinguish it from the existing "structured-value attribute" support in Parsoid. There are three main pieces to this proposal, which can be discussed and implemented separately. The first phase makes our existing structured-value support more general, to support attributes other than data-parsoid and data-mw, and to extend support to include DocumentFragment values (both at top level and embedded). The second phase adopts a uniform representation for structured-value attributes in "standard" HTML, including plaintext fallback values and alternative names. This aims to make our MediaWiki DOM spec more internally consistent. The third phase introduces a standardized marker for structured-value attributes to make possible generic traversal through the extended DOM, including DocumentFragments embedded within structured values.

The first phase is primarily targeted at internal users: it provides a cleaner mental model and API for manipulation of DOM trees containing structured data, and better supports traversal and manipulation of document fragments embedded within attributes in Parsoid and core code. It need not require any externally-visible changes to generated HTML.

The second and third phases are aimed at external users: they allow manipulation of a DOM with rich attributes independent of detailed knowledge about the specific attributes containing structured data. This allows the specification and creation of a general purpose "structured-value attribute" or "rich attribute" DOM library without hard-coded details of specific Parsoid attributes and uses. By cleaning up the specification and form of rich attributes the proposals help third-party consumers of HTML conforming to the MediaWiki DOM Spec to understand how to parse and properly manipulate structured-valued attributes.

At this time there is general consensus among the Content Transform Team on proceeding with phase 1 of this proposal, including exploring the "template bank" representation for embedded DocumentFragments. There is no consensus yet on proceeding with phases 2 and 3 until the Parsoid Read Views project is further along, as these may include changes to the generated HTML which are not backward-compatible with third-party clients.

Phase 1: New API for structured-value attributes

First we propose a general API that allows Parsoid code to treat attributes as structured values with complex types. We are already doing this for some particular JSON-valued attributes; this extends it first to arbitrary attributes, and second to a set of rich values that includes not just JSON-encodable arrays but also DocumentFragments, and JSON-encodable arrays which contain DocumentFragments. Fundamental is a separation of the idea of "structured-valued attribute" from the Parsoid "page bundle" representation: not all attributes with structured values are in the page bundle. See HtmlHolder#Private attributes for more discussion of out-of-band representations for private attributes.

This phase of the proposal does not include a standard serialization of these values, nor does it allow generic traversal of a DOM that includes traversing all structured values. At this stage the proposed API can be implemented with ad-hoc serialization strategies to remain consistent with the current MediaWiki DOM Spec.

In the API attributes can have three basic types of value: "string", "object" (json-encodable array), and "DOM" (DocumentFragment). The two main methods provided are:

getAttributeObject(Element $el, string $name): ?object^[1]
getAttributeDom(Element $el, string $name): ?DocumentFragment^[2]

Support for these two methods can be split into pieces corresponding to the individual methods. The getAttributeObject() API is phase 1a, and adding getAttributeDom() is phase 1b. Two additional methods will be discussed later, in the context of phase 3:

getAttributeString(Element $el, string $name): ?string
getAttributeMixed(Element $el, string $name): object|DocumentFragment|string|null

Corresponding setAttribute* setters and getAttribute*Default methods (which set a default value if the attribute is missing) are provided for each of these four primary methods. We will call these the "setters" of the primary method in subsequent sections.

Phase 1a: Uniform live representation of structured values in DataBag

In the first phase of the work, we implement getAttributeObject() and its setters, storing the live value in the DataBag. This is a simple generalization/refactor of the existing getDataParsoid(), getDataMw(), getDataI18n(), getDataParsoidDiff(), etc methods, but allowing an arbitrary attribute name and with corresponding generic support in loadDataAttribs() and storeDataAttribs(). Like the existing DataI18n, the structured values are stored inline as JSON in the attributes when the document is serialized, not hoisted into the page bundle.

WLOG, a rich attribute named foo is stored in the NodeData associated with the Element as a prefixed dynamic property named rich-data-foo. This avoids conflict with the existing parsoid and mw properties of NodeData (used for data-parsoid and data-mw), and the rich-data- prefix ensures that other dynamic properties added to NodeData (eg tmp) don't inadvertently get serialized as rich attributes. Rich attributes are loaded on-demand and serialized using a hook in DOMDataUtils::storeDataAttribs.
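A minimal sketch of this storage scheme follows, in Python for brevity rather than Parsoid's PHP. The Element class here is a toy stand-in for a DOM element plus its NodeData, and the function names only mirror the methods named above; none of this is the actual implementation. Removing the string attribute once the live value is loaded is an assumption of the sketch, and type hints are left out.

```python
import json

RICH_PREFIX = "rich-data-"


class Element:
    """Toy stand-in for a DOM element together with its attached NodeData."""

    def __init__(self, attrs=None):
        self.attrs = dict(attrs or {})  # serialized string attributes
        self.node_data = {}             # live values (the NodeData analogue)


def get_attribute_object(el, name):
    key = RICH_PREFIX + name
    if key not in el.node_data:
        raw = el.attrs.get(name)
        if raw is None:
            return None
        # Load on demand. Assumption: the live value replaces the string form.
        el.node_data[key] = json.loads(raw)
        del el.attrs[name]
    return el.node_data[key]


def set_attribute_object(el, name, value):
    el.attrs.pop(name, None)
    el.node_data[RICH_PREFIX + name] = value


def store_data_attribs(el):
    """Write live rich values back to string attributes before serialization."""
    for key in list(el.node_data):
        # Only prefixed properties are rich attributes; "tmp" and similar
        # dynamic properties are left alone and never serialized.
        if key.startswith(RICH_PREFIX):
            name = key[len(RICH_PREFIX):]
            el.attrs[name] = json.dumps(el.node_data.pop(key))
```

Repeated calls to get_attribute_object() return the same live object, so a caller can mutate it in place and rely on store_data_attribs() to pick up the change.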

We use JsonCodec to serialize/deserialize the JSON-encoded values, so that getAttributeObject() can return a fully-classed object type (like DataMw or DataI18n) instead of just a stdClass. Note that this would be trivial if we could embed type information into the JSON object, as many JSON encodings (including JsonCodec) can do by default; for example:

{
    "_type_": "Wikimedia\\Parsoid\\NodeData\\DataI18n",
    "/": "...",
    "href": "..."
}

However, the HTML bloat caused by the _type_ properties required in such a scheme would be prohibitive. It is also unfortunate that the name embedded in the serialization is for a specific implementation in a specific Wikimedia namespace, although there is precedent in RDF and XML schemas for using explicit namespaces of this sort.

To avoid bloat, JsonCodec allows the type information to be omitted when the caller provides "hints". When a hint is provided for serialization, the same hint must always be provided for deserialization, but the hint does not need to actually match the type of the object provided at run time: JsonCodec will fall back to explicitly encoding the type of the object using a _type_ field when necessary; that is, when the class of the provided object does not match the hint. Objects can also provide type hints for the properties present in their serialization. The DataParsoid and DataMw objects (data-parsoid and data-mw attributes) have been hinted so that explicit _type_ information is never generated in the serialization, and it is probably a good idea to ensure that all object types embedded in rich attributes are similarly static-typed and fully hinted to avoid bloat.
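The hinting behavior can be modeled in a few lines. This Python sketch is not the JsonCodec API; encode, decode, the CLASSES registry and the two small dataclasses are all hypothetical stand-ins that only reproduce the rule described above: the _type_ marker is written only when the runtime class differs from the hint.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DataI18n:  # hypothetical stand-in, not the real Parsoid class
    key: str


@dataclass
class Other:  # a second class, used to show the fallback
    key: str


CLASSES = {cls.__name__: cls for cls in (DataI18n, Other)}


def encode(obj, hint=None):
    fields = asdict(obj)
    if type(obj) is not hint:
        # No hint, or the hint does not match the runtime class:
        # fall back to an explicit type marker.
        fields = {"_type_": type(obj).__name__, **fields}
    return json.dumps(fields)


def decode(text, hint=None):
    fields = json.loads(text)
    # An explicit marker wins; otherwise the caller's hint supplies the class.
    cls = CLASSES[fields.pop("_type_")] if "_type_" in fields else hint
    if cls is None:
        raise ValueError("no _type_ marker and no hint supplied")
    return cls(**fields)
```

With a matching hint the output is just {"key": "k"}, which is why fully hinted types keep the serialized HTML small.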

Thus, when getting or setting a rich attribute a type hint should be provided. The full method signature for get/setAttributeObject is:

getAttributeObject(Element $el, string $attributeName, class-string|Hint $hint): ?object
setAttributeObject(Element $el, string $attributeName, ?object $value, class-string|Hint $hint): void^[3]

This effort to avoid bloat works against our later proposal to make our rich attribute representation self-describing; we would prefer that the information about the proper class type to use for a given attribute is not external to the document itself. We'll discuss this further under phase 3. At this point we'll just briefly note that self-description is an issue only for deserialization: for serialization we can assume that the objects themselves know how to properly serialize their contents to a JSON string, and it is only when trying to deserialize a new unknown attribute that knowledge of the proper type hint is required. In phase 1 we solve this problem by lazily deferring the deserialization until ::getAttributeObject() is first called on the attribute, at which point the caller can/must supply the necessary type hint.

Backward compatibility

For attribute names beginning with data- the JsonCodec serialization of the rich object is used directly for the attribute value string. This is consistent with how existing attributes like data-parsoid, data-mw, and data-mw-i18n have been handled and allows transparent transition to using the new getter/setter methods in Parsoid code without affecting the external HTML representation.

Strictly speaking, attribute names which don't begin with data- are not included in the basic phase 1a proposal; see phase 2 below for a discussion of how rich attributes can be extended to cover these attributes as well.

Phase 1b: Supporting live DocumentFragments in structured values

In this phase we add the implementation of getAttributeDom(Element $el, string $name) and its setters. This is a straightforward extension of the previous work; we primarily need to teach the serializer/deserializer in ::storeDataAttribs() and ::loadDataAttribs() how to respectively serialize/deserialize DocumentFragment values. We need to recurse into DocumentFragment values in order to call ::storeDataAttribs() on an embedded fragment before that fragment is serialized into an array value, and similarly recurse into a DocumentFragment generated by ::loadDataAttribs() in order to load the data attributes of the embedded fragment.
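The recursion on the store side can be sketched as follows, in Python for brevity. The Element model, the rich dictionary and the {"html": ...} wrapper written directly into the attribute are all simplifying assumptions for illustration; the real code works on DOM nodes and the DataBag. The point is only the ordering: an embedded fragment is fully serialized, including its own rich attributes, before it is embedded in the outer attribute value.

```python
import html
import json
from dataclasses import dataclass, field


@dataclass
class Element:
    tag: str
    attrs: dict = field(default_factory=dict)     # plain string attributes
    rich: dict = field(default_factory=dict)      # name -> fragment (list of nodes)
    children: list = field(default_factory=list)  # Element or str nodes


def serialize_fragment(nodes):
    return "".join(serialize(node) for node in nodes)


def serialize(node):
    if isinstance(node, str):
        return html.escape(node, quote=False)
    attrs = dict(node.attrs)
    for name, fragment in node.rich.items():
        # Recurse first: the embedded fragment (and any rich attributes
        # inside it) becomes a string before it is wrapped in JSON.
        attrs[name] = json.dumps({"html": serialize_fragment(fragment)})
    attr_text = "".join(
        f' {name}="{html.escape(value, quote=True)}"'
        for name, value in attrs.items()
    )
    return f"<{node.tag}{attr_text}>{serialize_fragment(node.children)}</{node.tag}>"
```

Each level of nesting adds another layer of escaping to the innermost markup, which is the bloat that the template bank serialization discussed in this section avoids.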

Just as object-valued attributes are stored live in a DataBag attached to the DOM element, HTML-valued attributes are also stored live as DocumentFragments. This allows efficient traversal (we don't have to repeatedly parse and serialize HTML trees) as well as allowing code to (eg) keep persistent handles to certain elements within the DocumentFragment for marking or other purposes.

At this point we have a number of serialization/representation options:

Naive DocumentFragment: At runtime, the object values are DocumentFragments belonging to the main document, but are not otherwise linked into the DOM. They are serialized as HTML strings and loaded as DocumentFragments with the JsonCodec mechanism; the only wrinkle is ensuring that all deserialization is done in the same Document context so that the DocumentFragments don't belong to different Documents.

The method signature is:

getAttributeDom(Element $el, string $name): ?DocumentFragment

The serialization is unchanged from the MediaWiki DOM spec and looks like:^[4]

<span title="Bold Title" data-mw='{"attribs":[["title",{"html":"&lt;b&gt;Bold&lt;/b&gt; Title"}]]}'>

The main disadvantage of this representation is that the component DocumentFragments can't be enumerated and traversed without loading every rich attribute from the DOM. That is, we still need a DOMTraverser which knows exactly which attributes can possibly have "rich data" associated with them, and which does a very specific traversal that loads those attributes and recurses into the specific fields which can contain HTML. However, we do avoid repeated serialization/deserialization of HTML contents: once loaded the attribute value is a live DOM.

Template Bank representation: Another alternative (briefly discussed in Parsoid/OutputTransform/HtmlHolder#Design decisions 2) is to store DocumentFragments at runtime as <template> nodes in the <head> of the main Document. At runtime the live DOM is accessed via $template->content (or DOMCompat::getTemplateContent() until PHP catches up). The method signature looks like:

getAttributeDom(Element $el, string $name): ?HTMLTemplateElement

The actual DOM looks like:

<head>
    <template><b>Bold</b> Title</template>
    <template><div>some other html</div></template>
</head>

Note that we store references to the HTMLTemplateElement in the rich objects, rather than DocumentFragment, so that we can add/remove items from the template bank in the <head> as attribute values are added/removed.^[5] There are two possible serialization strategies for the template bank representation. With the compatible serialization we represent values internally as a template bank, but then use the same serialization as before, eg getAttributeDom($span, 'title') may return an HTMLTemplateElement but when we serialize to HTML we just take the innerHTML of the HTMLTemplateElement as before:

<span title="Bold Title" data-mw='{"attribs":[["title",{"html":"&lt;b&gt;Bold&lt;/b&gt; Title"}]]}'>

With the compatible serialization, once all rich attributes are loaded, we can enumerate all of the embedded HTML simply by traversing the parent document (including its <head>). So we've simplified the traversal-of-embedded-HTML problem slightly: we don't need to know where in the attribute value objects the DOM values live, we just need to ensure all of the rich attribute values are loaded, which will convert their values to <template> nodes that can then be generically traversed.

An improvement which both reduces serialization bloat and improves traversal of embedded HTML is to serialize DOM values as <template> nodes in the <head> and reference them in the serialized attribute by an ID (a content hash or a sequentially assigned value). For example:

<head>
    <template id="beefbeef"><b>Bold</b> Title</template>
    <template id="cafecafe"><span title="Bold Title" data-mw='{"attribs":[["title",{"id":"beefbeef"}]]}'>Foo</span></template>
</head>

<!-- Note that this works even with arbitrary nesting with no additional quoting! -->
<span title="Foo" data-mw='{"attribs":[["title",{"id":"cafecafe"}]]}'>!</span>

This simplifies the ::storeDataAttribs() phase since the HTML fragments are already/always present in the main document and don't require additional serialization into attribute values; we just need to ensure the template node has an ID and put that ID into data-mw.attribs.

When we deserialize an HTML document in this representation we no longer need to load all attributes before traversing all embedded HTML: it is always present in the <head>.

These are different serializations for the same data structure; the underlying internal representation has not changed. We can choose to serialize in either format from the same internal representation. Thus we can (for example) use the template-bank representation internally for parser cache, in order to obtain the compactness and traversal benefits for internal code, and then serialize that using the compatible serialization for third-party clients.
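To make the equivalence concrete, here is a small Python sketch (Python for brevity; the function names are hypothetical) that converts a data-mw.attribs array between the two serializations. The bank is modeled as a plain dictionary from ID to HTML string instead of <template> nodes in the <head>, IDs are a truncated content hash, and nested fragments inside the banked HTML are not processed; all three are simplifications.

```python
import hashlib


def to_template_bank(attribs, bank):
    """Replace inline {"html": ...} values with {"id": ...} references."""
    out = []
    for name, value in attribs:
        if isinstance(value, dict) and "html" in value:
            # Content hash as the ID. Truncating to 8 hex digits keeps the
            # example short; a real scheme would need to handle collisions.
            frag_id = hashlib.sha1(value["html"].encode("utf-8")).hexdigest()[:8]
            bank[frag_id] = value["html"]
            value = {"id": frag_id}
        out.append([name, value])
    return out


def to_compatible(attribs, bank):
    """Inverse: inline each referenced fragment again."""
    out = []
    for name, value in attribs:
        if isinstance(value, dict) and "id" in value:
            value = {"html": bank[value["id"]]}
        out.append([name, value])
    return out
```

A round trip through to_template_bank() and to_compatible() returns the original array, which is what allows one representation to be used for the parser cache and the other for third-party clients.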

Embedded DOM in rich object types

Note that the object returned by ::getAttributeObject() may now also contain embedded DOM, stored as properties with DocumentFragment/HTMLTemplateElement values. The JsonCodec used should be able to identify and serialize/deserialize these, either to/from the innerHTML or a template ID. In phase 3 we would introduce marker values to be able to recognize and properly deserialize embedded DOM values without the need for attribute-specific codec hints, but at this phase we put the burden on the deserializer.

At the end of phase 1, we have live objects and live DocumentFragments representing attribute values, and our own code can access these uniformly with a simple rich-attribute API, but the serialization and deserialization of these rich values to HTML is ad-hoc and potentially inconsistent. A third party user must carefully implement bespoke serialization and deserialization logic for every rich attribute it uses. Unless the template-bank serialization is used, traversal requires the implementation of attribute deserializers which will properly construct live objects from the varied contents of HTML attributes, but once the document is fully parsed traversal can enumerate the contents of the NodeData object at a node to locate additional DocumentFragment or structured object values to recurse into.

Backward compatibility

Assuming the "compatible serialization" option for 1b, nothing in phase 1 requires a backward-compatibility break with generated HTML, although the burden is on the object serialization code to maintain compatible formatting. The "template bank" representation is implemented internally without necessarily affecting the serialization; that is, when an HTML string is parsed into a DocumentFragment during deserialization, that fragment is hung off a new <template> element in the <head>, and during serialization in ::storeDataAttribs() the generated <template>s are removed and converted back to HTML embedded in JSON or attribute values.

If a gradual shift to a template bank representation for external users is desired, it ought to be possible to use a template bank serialization only for selected attributes by adding special cases to the serializer. Similarly, we can use a template bank serialization internally while converting to inline HTML in attributes for external clients, although because structured values are not self-describing (cf phase 3) the conversion will still require specific knowledge of every place where HTML fragments could be embedded in attribute or object values.

Phase 2: Uniform HTML representation for structured-value attributes

The first step toward a uniform serialized HTML representation for structured-value attributes is to introduce a naming and location convention for them, so that, for example, "the structured value of the href attribute of this a element" can be located.

Let's first define "an attribute with special HTML semantics". This is an attribute whose stripped "non-rich" value is semantically meaningful for HTML. For example, the href attribute of an a tag is the URL which the browser will load when you click on the link. For now we'll say that "any attribute whose name doesn't start with data-" is an attribute with special HTML semantics.

For attributes without special HTML semantics we store the structured value directly in the named attribute as a JSON-encoded string (if an object) or an HTML string (if a DocumentFragment).

For attributes with special HTML semantics we will store a "flattened" version of the value directly under the named attribute. For a DOM value this is the textContent of the DocumentFragment. For an object value this is defined by the object class type. For a string value the flattened value is the string itself. The structured value is stored elsewhere.
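The flattening rule can be sketched in Python as below. The Fragment wrapper and the flatten() method expected on object values are hypothetical conventions of this sketch, and the standard library's html.parser is used as a rough approximation of textContent; it is a tolerant tokenizer, not a full HTML5 parser.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Fragment:
    """Stand-in for a DocumentFragment, held here as an HTML string."""
    html: str


class _TextCollector(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text nodes only; tags and comments are skipped.
        self.chunks.append(data)


def text_content(fragment_html):
    collector = _TextCollector()
    collector.feed(fragment_html)
    collector.close()
    return "".join(collector.chunks)


def flatten(value):
    """Compute the plain-string value stored under the named attribute."""
    if isinstance(value, str):
        return value                     # a string flattens to itself
    if isinstance(value, Fragment):
        return text_content(value.html)  # DOM value: its textContent
    return value.flatten()               # object value: defined by its class
```

For example, flatten(Fragment("<b>bold</b>")) yields the string bold, which is what a browser would see in the title attribute.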

Setting the title of a <p> tag to the DOM <b>bold</b> will then result in the "compatible serialization":

<p title="bold" typeof="mw:ExpandedAttrs" data-mw='{"attribs":[["title",{"html":"&lt;b>bold&lt;/b>"}]]}'>

or, if template-bank serialization is used:

<p title="bold" typeof="mw:ExpandedAttrs" data-mw='{"attribs":[["title",{"id":"some-id"}]]}'>

The rich value for attributes with special HTML semantics is stored as the attribs property under the structured-value data-mw attribute, as an array of <attribute name, structured attribute value> pairs, where the structured attribute value is a single-property object with a simple type key. This proposal only uses the type keys "html" (the compatible serialization) and "id" (template bank), but existing code also uses "txt" for a plain-text value. This serialization is compatible with existing markup, and should be supported at least for deserialization even if we transition to a different representation. The data-mw.attribs serialization has the advantage of requiring only two attribute names to be reserved (typeof and data-mw), and all elements with rich attributes can be easily located with Document::querySelectorAll('[typeof~="mw:ExpandedAttrs"]'). (The ~= operator is needed because typeof is a space-separated list which may carry other types as well, such as mw:File/Thumb.) On the other hand, the typeof attribute is not indexed by the DOM so that query must touch every node in the Document anyway, and for practical use the attribs content of data-mw must coexist and be merged with the "other" structured values stored in data-mw by the MediaWiki DOM spec, making data-mw an unusual corner case.^[6]

A more compact future representation might use the attribute name prefixed with data-mw-attr- to store an unwrapped value for attributes with special HTML semantics. (Obviously, this would require attribute names beginning with data-mw-attr- to be reserved.) For the same semantic value as above, the serialization would be simply:

<p title="bold" data-mw-attr-title='&lt;b>bold&lt;/b>'>

or with template-bank serialization:

<template id="some-id"><b>bold</b></template>
...
<p title="bold" data-mw-attr-title='some-id'>

The presence of data-mw-attr-title is then the only signal required that title has rich contents, and the caller would be expected to know how to interpret the contents without an explicit type wrapper.

One possible drawback of both these representations is that a naive client might get lulled into inattention by the fact that href/title/etc are "usually" plain text, and get caught unaware by the need to parse a data-mw-attr-* or data-mw.attribs value when it (unusually) appears.

Regardless, the primary work in phase 2 is migrating corner cases to this uniform representation. Above we identified several attributes in the media representation where the mw:ExpandedAttrs information was misplaced; there are also bespoke structured-value fields which don't use the standard representation, like the data-* attributes used for language conversion or hidden inline captions for media. We can use various hacks in ::loadDataAttribs()/::storeDataAttribs() to accommodate exceptions as we migrate them to use the standard. Of course, any new Parsoid features should use structured value attributes in the standard form.

Attributes without special HTML semantics

The discussion in phase 1a focused on data- attributes, and the discussion in phase 1b focused on "normal" non-data attributes, because those are the primary places where attribute values are objects or html, respectively. In phase 2 we'll ensure that object and html values work for all attribute types. In phase 1b we used a {"html":....} wrapper for DOM values in non-data attributes. At this stage of the proposal we'll propose a simple unwrapped serialization for object values.

This example shows how the serialization would change when (a) a DOM value is assigned to first the attribute title and then the attribute data-title; and (b) a DataParsoid object value is assigned to first the attribute data-parsoid and then the non-data attribute parsoid:^[7]

<p title="bold" typeof="mw:ExpandedAttrs" data-mw='{"attribs":[["title",{"html":"&lt;b>bold&lt;/b>"}]]}'>
<p data-title="&lt;b>bold&lt;/b>"> <!-- compatible serialization -->
<p data-title="some-id-here"> <!-- template bank serialization -->

<p data-parsoid='{"tsr":[1,2]}'>
<p parsoid="..." data-mw-attr-parsoid='{"tsr":[1,2]}'>

It could be argued that a DOM value applied to a data-* attribute could still use a {"html":...} wrapper, but we'll put off that discussion for phase 3.

Backward compatibility

The changes in phase 2 can be rolled out piece by piece. If consensus is reached on this proposal as the long term direction for the MediaWiki DOM Spec, the first step is probably just to locate and mark as deprecated/subject to change any usage of "misplaced" structured-value attributes in the existing MediaWiki DOM Spec; that is, attributes where the mw:ExpandedAttrs/data-mw.attribs is on a different element than the attribute itself, or where the name in data-mw.attribs doesn't correspond to an attribute on the element. Then those misplaced attributes can either be migrated in small steps one-by-one after clients are located, notified, and updated; or else the changes can be bundled together into a single "big" breaking change, with a post-processing downgrade pass available to "move them back" for backward compatibility. The mw:ExpandedAttrs and data-mw-attr-* representations ought to be semantically equivalent, so a postprocessing step could be written to convert back-and-forth between them as well. Transitional code could also read from both but write only the preferred or compatible version.
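A conversion pass of the kind mentioned above might look like the following Python sketch. An element is modeled as a plain dictionary of already-unescaped attribute values, the function names are hypothetical, and two assumptions are made for simplicity: only entries whose value is exactly {"html": ...} are converted, and every data-mw-attr-* value is taken to be inline HTML (not a template bank ID).

```python
import json

MARKER = "mw:ExpandedAttrs"
PREFIX = "data-mw-attr-"


def expanded_to_prefixed(attrs):
    """Rewrite data-mw.attribs entries as data-mw-attr-* attributes."""
    out = dict(attrs)
    data_mw = json.loads(out.pop("data-mw", "{}"))
    remaining = []
    for name, value in data_mw.pop("attribs", []):
        if isinstance(name, str) and isinstance(value, dict) and set(value) == {"html"}:
            out[PREFIX + name] = value["html"]
        else:
            remaining.append([name, value])  # leave anything not understood
    if remaining:
        data_mw["attribs"] = remaining
    else:
        # No attribs left, so drop the marker but keep other typeof tokens.
        types = [t for t in out.get("typeof", "").split() if t != MARKER]
        if types:
            out["typeof"] = " ".join(types)
        else:
            out.pop("typeof", None)
    if data_mw:
        out["data-mw"] = json.dumps(data_mw)  # other data-mw content survives
    return out


def prefixed_to_expanded(attrs):
    """Inverse: fold data-mw-attr-* attributes back into data-mw.attribs."""
    out = {k: v for k, v in attrs.items() if not k.startswith(PREFIX)}
    pairs = [[k[len(PREFIX):], {"html": v}]
             for k, v in attrs.items() if k.startswith(PREFIX)]
    if pairs:
        data_mw = json.loads(out.get("data-mw", "{}"))
        data_mw["attribs"] = data_mw.get("attribs", []) + pairs
        out["data-mw"] = json.dumps(data_mw)
        types = out.get("typeof", "").split()
        if MARKER not in types:
            types.append(MARKER)
        out["typeof"] = " ".join(types)
    return out
```

The merge with the rest of data-mw in both directions is where most of the bookkeeping lives, which reflects the "unusual corner case" status of data-mw noted earlier.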

Phase 3: Uniform marking of structured-value attributes

Up to this point, processing a document containing rich attributes requires an external schema giving the data type of each attribute. In the mw:ExpandedAttrs serialization, we have top-level type information; for example:

<span ... data-mw='{"attribs":[["title",{"html":...}]]}' data-mw-i18n="...">

Here we know that the value of the title attribute is HTML (a DOM type). But without Parsoid-specific knowledge we can't tell whether the value of data-mw-i18n is supposed to be a literal string <b>x</b>, the DocumentFragment resulting from parsing that value as HTML, or a JsonCodec-encoded object value. Similarly, given <span data-foo='{}'> the client can't distinguish between the literal value {} and the empty object value. If we are using the template bank serialization of HTML, we can't tell the difference between the literal string abcd and the use of "abcd" as a fragment reference in the template bank for data-* attributes.

This is a particular problem if we want to traverse the document to process embedded HTML, since without a schema for the document we don't know which attribute values we should check. There are two mitigating factors:

If we use the typeof=mw:ExpandedAttrs representation for structured-value attributes, then every attribute containing embedded HTML should be marked with that typeof. This helps us find a subset of attributes that require further attention, but we still can't distinguish an object-valued attribute from an HTML-valued attribute, nor can we identify HTML within a property of an object-valued attribute.
If we use the <template> bank serialization for embedded HTML, then we can traverse all embedded <template> elements to be certain of discovering all embedded HTML, although we can't determine for certain where those fragments are actually embedded. It is certainly some help to be able to traverse the entire document, but because we don't know exactly how the template bank is embedded we are still unable to traverse a portion or subtree of it.

Traversal/mutation during html2html passes is an important use case for the MediaWiki DOM Spec. It is important that embedded HTML be "visible" to post processing passes, so that (for example):

i18n fragments inside href are expanded
redlinks inside language converter markup still work
redlinks/language converter markup/i18n fragments inside "invisible media captions" are properly processed -- so that if the VE user toggles the media style from inline to thumb the proper i18n/language converter markup/etc should already be present in the caption.

In order to do this with the current system, we need to write a specialized traverser with special knowledge of mw:ExpandedAttrs and data-mw.attribs and a number of other internal Parsoid features. As new embedded HTML is added inside internal data structures, the traversal must continue to be extended to handle each place embedded HTML may be found, and before proposal 1a is implemented the traverser must additionally parse from JSON, parse to DOM, mutate, serialize HTML, and then serialize JSON at each place.
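To make the cost of such a specialized traverser concrete, here is a minimal Python sketch that locates embedded HTML in exactly one place: the data-mw.attribs entries of elements marked with typeof=mw:ExpandedAttrs, in the shape used by the examples on this page. The class and function names are hypothetical, and a real traverser would also need to know about every other location where HTML can hide.

```python
import json
from html.parser import HTMLParser


class ExpandedAttrsFinder(HTMLParser):
    """Collect embedded HTML strings stored in data-mw.attribs."""

    def __init__(self):
        super().__init__()
        self.found = []  # (tag, attribute name, embedded HTML)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # typeof is a space-separated token list.
        if "mw:ExpandedAttrs" not in (attrs.get("typeof") or "").split():
            return
        # HTMLParser has already undone the HTML escaping of the attribute.
        data_mw = json.loads(attrs.get("data-mw") or "{}")
        for name, value in data_mw.get("attribs", []):
            if isinstance(value, dict) and "html" in value:
                self.found.append((tag, name, value["html"]))


def find_embedded_html(html):
    """Return every (tag, name, html) triple found in the document."""
    finder = ExpandedAttrsFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Even this small case requires schema knowledge (the typeof token, the attribs key, the html wrapper), and each string it returns would still have to be parsed, mutated, and re-serialized through both HTML and JSON.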

In phase 3 the serialization of rich values is self-describing to the extent that embedded HTML can be located within rich attributes, even in subtree traversal and without the use of information external to the document. We have three alternative proposals for accomplishing this:

Alternative 1 treats attribute values with leading { or [ specially, allowing attributes to self-describe with very little extra overhead in the common case.
Alternatives 2 and 3 embed a schema in the document, rather than in the attribute values. This is potentially more rigid, since a given attribute may only have a single type, not a union of types, but it potentially provides a more powerful type system than the simple three-type system of proposal 3a. Specific knowledge of Parsoid-specific attribute types and values is decoupled via the schema, which can be reused by third parties without requiring them to port a Parsoid-specific traversal class.

We could also decide that the "enumerate the entire document" capability provided by the <template> bank serialization is sufficient, and no additional action be taken under proposal 3.

In the following we use DocumentFragment to represent embedded DOM. As discussed in phase 1b above, the actual object used might be an HTMLTemplateElement with a DocumentFragment child to allow a template bank representation. In that case please mentally substitute HTMLTemplateElement for DocumentFragment in the discussion below.

Phase 3 Alternative 1: Value marking

In proposal 2 we already attempted to preserve the "normal HTML semantics" of attributes by storing a flattened string representation of well-known attributes like class, href, and alt even when the full structured value is stored elsewhere; this is similar to Parsoid's "shadow attributes".

In this proposal we tweak the encoding of object/array, string, and DocumentFragment values to make them uniquely identifiable:

A plain string value is encoded as { "_s": <value> }
A DocumentFragment is encoded as { "_h": "<DocumentFragment innerHTML>" } or { "_h": "<template bank id>" }
An associative array or object value is encoded by:
- Any key/property with a name starting with an underscore has an additional underscore prepended to its name, otherwise the JSON encoding is used, but
- Any array or object value is recursively encoded with this algorithm.
(Optionally) object values can also use JsonCodec and/or a ::flatten() method to customize their encoding.
- But note that the serialization of DocumentFragment values should not be customized, so that the parent codec can use { "_h": ...} to represent them, and any custom properties named "_h" will be renamed to "__h" by the codec, as described above.
(Optionally, for compatibility with the current MediaWiki DOM spec) a property named "html" has its value parsed as a DocumentFragment, and a non-DOM property named "html" is renamed "_html". (A "real" property named "_html" would already be renamed "__html" by the above.)

So nominally, $p->setAttributeString("data-mw-foo", "hello, world") would result in:

<p data-mw-foo='{"_s":"hello, world"}'>...</p>

(but see the 'optimization' section below), and a complex JSON object type may embed like this:

<p data-mw-foo='{"name":"bar","html":{"_h":"<span>xyz</span>"}}'></p>
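The nominal encoding rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the patch's implementation: Fragment is a hypothetical stand-in for a DocumentFragment that simply holds its innerHTML, the function names are invented, and the optional JsonCodec customization and "html" compatibility rule are left out.

```python
import json


class Fragment:
    """Hypothetical stand-in for a DocumentFragment; holds its innerHTML."""

    def __init__(self, inner_html):
        self.inner_html = inner_html


def encode_nested(value):
    """Encode a value that appears inside an object or array."""
    if isinstance(value, Fragment):
        return {"_h": value.inner_html}
    if isinstance(value, dict):
        # Keys starting with "_" get one more underscore so that they can
        # never collide with the "_s" and "_h" markers.
        return {
            ("_" + key if key.startswith("_") else key): encode_nested(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [encode_nested(item) for item in value]
    return value  # strings, numbers, booleans and None use plain JSON


def encode_rich(value):
    """Encode a top-level attribute value in the nominal (unoptimized) form."""
    if isinstance(value, str):
        return {"_s": value}
    return encode_nested(value)


def to_attribute_text(value):
    """Serialize the encoded value as compact JSON attribute text."""
    return json.dumps(encode_rich(value), separators=(",", ":"))
```

Note that only a top-level string is wrapped in "_s"; a string nested inside an object keeps its ordinary JSON encoding, which is why "name":"bar" appears unchanged in the example above.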

To reduce bloat and increase compatibility, if the value to be stored is a simple string (including a DocumentFragment with a single Text node) and the first character of that string value is not { or [ then the string value is stored directly in the attribute. Thus, the example above where we set data-mw-foo to the string value hello, world would actually be represented as

<p data-mw-foo="hello, world">

but if we set the attribute to the string value {hello} it would be serialized as

<p data-mw-foo='{"_s":"{hello}"}'>

so that the first character of the attribute value can be used to indicate type. If the attribute has "special HTML semantics" (see proposal 2) and the value is a simple string, then the value can be stored directly under the attribute name, without the need for an additional data-mw-attr-* attribute (which would have the identical value). This optimization makes the output look "less weird" in the common cases and preserves HTML semantics for important attributes like title and href. As with proposal 2, one drawback may be that those parsing string-valued data-mw-foo attributes may be caught unaware when the value starts with a { or [ and the representation changes.

Now by looking at any attribute value to see if the first character is { or [ we can determine whether a structured value is stored in a given attribute, and for structured values we can identify any embedded DocumentFragments and properly restore or traverse them. We can implement ::getAttributeMixed() which uses the value marking to return the proper value for a union-typed attribute. On the other hand, we also need to introduce ::getAttributeString() and all code must use it instead of the standard DOM ::getAttribute() if the attribute value could possibly start with a { or [ in order to ensure those are appropriately escaped when necessary. During the transition period we can use an allow or block list to mark attributes which have been ported to be self-describing in this way (i.e., to protect attributes whose values may happen to start with { or [ but which shouldn't be parsed as a rich attribute), and/or require the affirmative presence of a marker attribute before values are parsed as rich.^[8]
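A minimal Python sketch of the first-character rule described above, with the attributes of an element modelled as a plain dict. The helper names echo ::getAttributeMixed() but are hypothetical Python stand-ins, Fragment is again an invented placeholder for a DocumentFragment, and the shortcut for single-Text-node fragments and the transitional allow list are omitted.

```python
import json


class Fragment:
    """Hypothetical stand-in for a DocumentFragment; holds its innerHTML."""

    def __init__(self, inner_html):
        self.inner_html = inner_html

    def __eq__(self, other):
        return isinstance(other, Fragment) and other.inner_html == self.inner_html


def decode_nested(value):
    """Rebuild fragments and undo the key escaping inside a decoded value."""
    if isinstance(value, list):
        return [decode_nested(item) for item in value]
    if isinstance(value, dict):
        if set(value) == {"_h"}:
            return Fragment(value["_h"])
        # Remove the extra underscore added to keys that began with "_".
        return {
            (key[1:] if key.startswith("__") else key): decode_nested(item)
            for key, item in value.items()
        }
    return value


def set_attribute_string(attrs, name, text):
    """Store a string, wrapping it only when it could be mistaken for JSON."""
    if text[:1] in ("{", "["):
        attrs[name] = json.dumps({"_s": text}, separators=(",", ":"))
    else:
        attrs[name] = text


def get_attribute_mixed(attrs, name):
    """Return a str, Fragment, dict or list depending on the value marking."""
    raw = attrs[name]
    if raw[:1] not in ("{", "["):
        return raw  # unmarked value: a plain string
    value = json.loads(raw)
    if isinstance(value, dict) and set(value) == {"_s"}:
        return value["_s"]
    return decode_nested(value)
```

Reading the attribute with an ordinary lookup instead of get_attribute_mixed would hand back the raw '{"_s":...}' text for the escaped case, which is the hazard the paragraph above warns about.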

Phase 3 Alternative 2: Type marking with a type dictionary

Instead of using the first character of an attribute value as a type tag, we can also embed an explicit type tag. One encoding option reduces bloat to 6-7 characters per structured attribute value by compressing the inline type information down to a single-character property name and a short numeric identifier ({"@":5,...}) and then embedding a map from the numeric type ids to complete type names in the <head>. This might look like the following:

<head>
    <script type="rich-schema">
        {
            "0": "DocumentFragment",
            "1": "Wikimedia\\Parsoid\\NodeData\\DataMw",
            "2": "Wikimedia\\Parsoid\\NodeData\\DataI18n",
            ...
        }
    </script>
    <template id="richurl">http://example.org/<span>foo</span></template>
    <template id="cafebad">Some <b>HTML</b>!</template>
</head>

<a href="http://example.org/foo" data-mw-attr-href='{"@":0,"_":"richurl"}'>Foo</a>

<span data-mw='{"@":1,"caption":{"@":0,"_":"cafebad"}}'>...</span>

<a data-mw-i18n='{"@":2,"title":{"lang":"x-page","key":"red-link-title","params":["Non existing page"]}}'>...</a>

To the extent that object types are correlated with attribute names, a gzip encoding of the HTML would be expected to combine the tag and the type prefix into a single dictionary entry (eg data-mw='{"@":1,") making the type information "free" from a bandwidth perspective. Client browsers would still need to store the extra characters in their DOM, however. This proposal is backward compatible with existing markup, as it simply adds a new type property to existing attributes without changing the values of existing properties.
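As a sketch of how a client might use the type dictionary, the following Python walks a decoded attribute value and yields the payload of every value tagged as a DocumentFragment. It assumes the "@" and "_" property names from the example above; the function name and the SCHEMA constant are illustrative only.

```python
import json

# Map from numeric type id to type name, as it would be read from the <head>.
SCHEMA = {
    "0": "DocumentFragment",
    "1": r"Wikimedia\Parsoid\NodeData\DataMw",
    "2": r"Wikimedia\Parsoid\NodeData\DataI18n",
}


def fragment_refs(value, schema):
    """Yield the "_" payload of every value tagged as a DocumentFragment."""
    if isinstance(value, list):
        for item in value:
            yield from fragment_refs(item, schema)
    elif isinstance(value, dict):
        # JSON object keys are strings, so the numeric id is stringified.
        type_name = schema.get(str(value.get("@")))
        if type_name == "DocumentFragment":
            yield value["_"]
        else:
            for item in value.values():
                yield from fragment_refs(item, schema)


caption = json.loads('{"@":1,"caption":{"@":0,"_":"cafebad"}}')
caption_refs = list(fragment_refs(caption, SCHEMA))
```

The traversal needs nothing Parsoid-specific beyond the id-to-name map, which is the decoupling this alternative is after.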

The schema presented in the above example hard-codes classes from the current PHP implementation. It may be preferable to use an abstract type system corresponding more closely to the string|fragment|object type system from alternative 1, where a numbered type could be of the form {foo:DOM} to indicate "an object type that contains as a member named foo a value of DocumentFragment type". Any fields not recursively containing DOM fragments could be elided from the type specification, since the primary purpose is not to give a full semantic type to the value but only to guide traversal of embedded DOM fragments. Other example types could include DOM, {foo:{bar:DOM}}, {foo:[DOM]} etc.
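One way to picture such abstract types is as a nested structure that mirrors the value and marks only the positions holding DOM. The Python sketch below is an assumption about how that could look: the string "DOM", a dict, and a one-element list stand in for DOM, {foo:...}, and [...] respectively, and collect_dom is a hypothetical helper name.

```python
DOM = "DOM"  # marker for "this position holds a DocumentFragment"


def collect_dom(value, spec):
    """Return the values found at the positions the spec marks as DOM."""
    if spec == DOM:
        return [value]
    found = []
    if isinstance(spec, dict) and isinstance(value, dict):
        # Only the fields named in the spec are visited; all others are
        # elided because they cannot contain DOM.
        for key, sub_spec in spec.items():
            if key in value:
                found.extend(collect_dom(value[key], sub_spec))
    elif isinstance(spec, list) and isinstance(value, list):
        # A one-element list spec describes every item of the array.
        for item in value:
            found.extend(collect_dom(item, spec[0]))
    return found
```

For example, collect_dom(value, {"foo": [DOM]}) visits every element of value["foo"] and nothing else.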

Phase 3 Alternative 3: Embedded schema

Another option might be to create a schema document, either embedded in the <head> or as a standalone document provided to the Rich HTML library whenever a document is parsed, which would give the rich type of any attribute. The schema could use CSS-like or XPath specifications, like:

/*/@data-mw-i18n Wikimedia\Parsoid\NodeData\DataI18n
/*/@data-mw Wikimedia\Parsoid\NodeData\DataMw
/*/@href DocumentFragment

or else it could be provided in executable form as a callable mapping an Element and attribute name to the proper type. As above, instead of naming PHP classes from our current implementation, the types could be given as abstract specifications sufficient to locate DocumentFragments.
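The executable form could be as simple as the following Python sketch, in which the three rules above become a lookup table and the schema is a callable taking an element name and an attribute name. All names here are hypothetical; a real schema might well need the element itself rather than just its name.

```python
from typing import Callable, Optional

# Hypothetical schema: attribute name -> rich type, valid on any element.
ATTRIBUTE_TYPES = {
    "data-mw-i18n": r"Wikimedia\Parsoid\NodeData\DataI18n",
    "data-mw": r"Wikimedia\Parsoid\NodeData\DataMw",
    "href": "DocumentFragment",
}

TypeLookup = Callable[[str, str], Optional[str]]


def schema_lookup(element_name: str, attribute_name: str) -> Optional[str]:
    """Return the rich type for an attribute, or None for a plain string.

    The element name is unused because every rule above matches any element,
    but it stays in the signature so that a richer schema could use it.
    """
    return ATTRIBUTE_TYPES.get(attribute_name)


def rich_attributes(element_name, attrs, lookup: TypeLookup):
    """List (attribute name, type) pairs for attributes that need decoding."""
    typed = ((name, lookup(element_name, name)) for name in attrs)
    return [(name, rich_type) for name, rich_type in typed if rich_type is not None]
```

A traverser would call rich_attributes on each element and decode only the attributes it returns, leaving everything else as a literal string.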

Backward compatibility

Alternative 1 was designed to minimize the changes required to existing serialization. Strings with the most common values are stored unmodified, and object-valued attributes are stored with a leading { as is the current practice. The two main differences are (a) alternate encoding of string values which start with { or [, although that can be mitigated using a separate marker attribute, and (b) marking of DocumentFragment values inside object-valued attributes. Currently these are stored directly as the value of an arbitrary property in the JSON, often but not always named t or html, like so:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":"bar<b>baz</b>"}}'></span>

Marking the DocumentFragment moves the HTML string one level down, like so:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":{"_h":"bar<b>baz</b>"}}}'></span>

Since these two alternatives can be distinguished by shape (a string in the first case becomes an object with an _h property in the second) it is likely that a custom deserializer could recognize both alternatives during a transitional period. On the other hand, the change in representation would be visible to any third-party clients. Alternative 2 may actually be easiest from a backward-compatibility standpoint, as it simply adds a new type marker property to existing JSON output. External clients can be expected to simply ignore both the extra property and the type map in the <head>. However, marking the type of DocumentFragments still requires moving from a string to an object in the representation, just like alternative 1. Continuing with the example above, the output in proposal 3b would be:

<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":{"@":9,"_":"bar<b>baz</b>"}}}'></span>
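A transitional reader that accepts all three shapes shown above could look like this Python sketch. It assumes the caller already knows that the property in question (t in the example) holds a DocumentFragment; the function name is hypothetical, and a stricter version would check the "@" type id against the type map instead of accepting any id.

```python
def read_fragment_html(value):
    """Return the embedded HTML (or template bank id) from any known shape."""
    if isinstance(value, str):
        return value  # current form: bare HTML string
    if isinstance(value, dict):
        if "_h" in value:
            return value["_h"]  # alternative 1: value marking
        if "@" in value and "_" in value:
            return value["_"]  # alternative 2: type id plus payload
    raise ValueError("not a recognized DocumentFragment encoding")
```

Writers can then switch to the new form at any time, while readers built this way keep accepting cached output in the old one.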

Alternative 3 would require no compatibility break, but it would require us to come up with a declarative specification of some sort giving the desired type mapping for the MediaWiki DOM Spec, and every third-party user of HTML complying with the MediaWiki DOM Spec would be required to provide a local copy of that specification to their rich attribute library in order to properly parse our output. There is precedent, however, in the IDL used for the DOM representation of HTML. It may not be too difficult to create a simple interface specification of our DataParsoid, DataMw, DataMwI18n, etc. types and write a traverser guided by that. One complication, however, is that the serialized form is often structurally different from our internal representation objects: conceptually the schema should describe traversal of the serialized values, but at runtime it would be running against our internal representation objects.

Work in Progress

Investigate parsing context of <template> elements to ensure that we can round trip any DocumentFragment.
- STATUS: complete, seems okay.
For transition purposes, we need to handle existing attributes which may happen to start with { or [ but shouldn't be treated as a rich attribute. The easiest solution is to require the affirmative presence of a matching data-mw-attr-* attribute before a value is treated as rich, and some similar affirmative marker for attributes which don't use "special HTML semantics".
As mentioned above, the existing mw:ExpandedAttrs markup serves many of the same purposes as rich attributes; the "shadow attributes" inside data-parsoid are also very similar. Neither of these is explicitly used in client code (as far as I am aware); however, we should support existing markup using mw:ExpandedAttrs or shadow attributes which may come from the cache for html2wt, even if we want to shift the "canonical" representation of rich attributes. One approach to this would be to augment the rich attribute "loader" to transparently treat mw:ExpandedAttrs or a shadow attribute as an equivalent rich attribute. Another would be to write a simple html2html preprocessing pass which does this remapping in the html2wt direction.
- STATUS: using mw:ExpandedAttrs natively as the serialization format in present patches.
mw:ExpandedAttrs allows for HTML-valued attribute names as well as values. It seems that the same flattening mechanism we use for "attributes with special HTML semantics" can be used to serialize these using a flattened attribute name. Some outstanding questions:
- How to deal with conflicting flattened names (for example: href, hr<b>ef</b>, and <b>href</b>) which presumably have the same flattened serialized attribute.
- What the API should look like for getting/setting values on a DocumentFragment-valued attribute name. Perhaps object identity is sufficient for the name lookup? (But that implies we might also have multiple attributes with identical names, but differing DocumentFragment instances.)
- STATUS: limiting scope of rich attribute specification to values only for now. It probably wouldn't be too hard to allow string $attributeName to be a rich type using the data-mw.attribs serialization, but this should be considered late phase 2 work where we're trying to migrate the existing AttributeExpander code over to the new rich-attribute methods.
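The name-collision question can be illustrated with a short Python sketch. It assumes, purely for illustration, that flattening an HTML-valued name means taking its text content; the actual flattening rule is whatever proposal 2 settles on, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Accumulate only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def flatten(html):
    """Reduce an HTML fragment to its text content."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


names = ["href", "hr<b>ef</b>", "<b>href</b>"]
flattened = {flatten(name) for name in names}
```

Under that assumption all three names collapse to the single flattened name href, so the flattened name alone cannot serve as a unique key.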

Notes

[1] There is an additional "classHint" argument needed to avoid bloat; see below.
[2] See phase 1b discussion below for a slight alternative to this signature.
[3] Note that the type hint is actually optional for setAttributeObject when the $value object implements the RichCodecable interface; in which case the type can be obtained at runtime as get_class($value)::hint(). This may be magic too advanced, however, and it may be best just to use the rule "always provide a type hint for get and set of rich attributes" and not bother with the RichCodecable interface.
[4] This uses the data-mw.attrs serialization discussed below under phase 2.
[5] You could probably fix this with a magical dynamic property added to the DocumentFragment which points back at its containing <template> but the current PHP DOM implementation doesn't like dynamic properties on DOM objects.
[6] For example, note that ::storeDataAttrs() is going to rewrite the contents of the data-mw attributes as it serializes structured-value attributes, and so it must do so before serializing the data-mw attribute itself or hoisting data-mw into a page bundle or one of our alternate representations.
[7] The "plain text" value of the parsoid attribute would depend on how the DataParsoid object chose to stringify itself; I don't think we have a string form currently defined.
[8] Another variation here is to combine this value marking with a shift from JSON encoding to a more efficient representation which is more easily or compactly embedded in HTML, for example, base85-encoded CBOR, with an even-less-common marker prefix (\t or \f or a deprecated Unicode character like U+0149 ŉ or U+0673 ٳ in the two-byte region of UTF-8) to distinguish encoded from "plain string" values. Using an example from Proposal 1b above:
```
<span data-outer="&lt;span data-inner=&#39;&amp;lt;span>hello,&amp;lt;/span>&#39;> world&lt;/span>">!</span>
```
becomes the following using type marking:
```
<span data-outer='{"_h":"&lt;span data-inner=&#39;{\"_h\":\"&amp;lt;span>hello,&amp;lt;/span>\"}&#39;> world&lt;/span>"}'>!</span>
```
and using a CBOR encoding followed by Ascii85 encoding, and using a leading ŉ to indicate the presence of a structured value:
```
<span data-outer='ŉTjhABGX4H5E+*W,A79Rg/ST*?ATBp]`JIQ/BL+sS,V)Sr5W`416:"O[0dJD.7oLs-.k4Um-U&YsDfTZ)4>1bp@;\\7'>!</span>
```
It is likely that a more complex example may save more bytes, but it isn't obvious there are significant wins to be had here. The template-bank serialization solves whatever issues multiple levels of HTML-escaping would present without the readability penalty of this scheme.
