Parsing/Notes/HTML5

This page records some notes and observations about the HTML5 spec and parsing algorithm as a quick / easy reference and will be filled out progressively.

Non-obvious terminology / notes

Content categories

The spec defines a bunch of content categories. Elements can belong to zero or more categories. The list below should give you a sense of what the categories represent.

  • Flow content - pretty much everything except a few elements
  • Metadata content - link, meta, ..
  • Heading content - h1 - h6, ..
  • Sectioning content - article, aside, nav, section
  • Embedded content - audio, video, embed, object, etc.
  • Interactive content - forms, buttons, and the like
  • Phrasing content - all phrasing content is flow content; heading & sectioning content cannot be phrasing content
  • Palpable content:
    • elements in this category should provide at least one non-empty text node or audio/video.
    • with this category, the spec effectively discourages empty elements
    • we may not enforce this in MediaWiki but rely on linting tools to flag scenarios where this might be happening
  • Script-supporting elements: script, template
  • Media element: audio, video
  • Sectioning roots: blockquote, body, details, dialog, fieldset, figure, td

Observations

  1. Elements that are Flow but not Phrasing: table, lists, headings, p, div, blockquote, section, figure, header, footer and other uncommon ones. Loosely speaking, this is the block node notion from HTML4.
  2. Phrasing content is, loosely speaking, the inline node notion from HTML4.
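
The two observations above can be sketched as set-membership checks. This is only an illustration: the CATEGORIES table is a hand-picked, partial sample, and the helper names is_block_like and is_inline_like are made up for this page, not taken from the spec or from any library.

```python
# Hypothetical, partial category table. Only a handful of elements are listed,
# and the category names are informal shorthand for the spec's categories.
CATEGORIES = {
    "div": {"flow", "palpable"},
    "p": {"flow", "palpable"},
    "table": {"flow", "palpable"},
    "h1": {"flow", "heading", "palpable"},
    "section": {"flow", "sectioning", "palpable"},
    "span": {"flow", "phrasing", "palpable"},
    "em": {"flow", "phrasing", "palpable"},
    "br": {"flow", "phrasing"},
    "li": set(),  # belongs to no category
}


def is_block_like(tag):
    """Flow but not Phrasing: roughly the HTML4 block notion."""
    cats = CATEGORIES.get(tag, set())
    return "flow" in cats and "phrasing" not in cats


def is_inline_like(tag):
    """Phrasing: roughly the HTML4 inline notion."""
    return "phrasing" in CATEGORIES.get(tag, set())
```

Note that an element such as li is neither block-like nor inline-like under this check, because it belongs to no category at all; where it can appear is decided by its context instead.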

Content model

  • Transparent content model: elements with this model inherit the content model of their parent, i.e. effectively that of their nearest non-transparent ancestor.
  • Nothing content model: no content can be present / nested in these elements
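
A minimal sketch of how a transparent content model could be resolved, assuming a small hand-written CONTENT_MODEL table (partial, and hypothetical in its naming). The fallback to flow for a transparent element without a parent follows what the spec says for that case.

```python
# Hypothetical, partial table: tag -> content model.
CONTENT_MODEL = {
    "p": "phrasing",
    "div": "flow",
    "a": "transparent",
    "ins": "transparent",
    "del": "transparent",
    "br": "nothing",
    "hr": "nothing",
}


def effective_content_model(path):
    """Resolve the content model of the last element in `path`.

    `path` lists tag names from the outermost ancestor down to the
    element itself. Transparent elements defer to the nearest
    non-transparent ancestor.
    """
    for tag in reversed(path):
        model = CONTENT_MODEL[tag]
        if model != "transparent":
            return model
    # A transparent element with no (non-transparent) ancestor accepts flow.
    return "flow"
```

For example, effective_content_model(["p", "a", "ins"]) resolves to "phrasing", since both a and ins are transparent and p only accepts phrasing content.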

Paragraphs

  • Paragraphs in HTML5 are a structural concept, not a semantic / logical one.
  • Runs of phrasing content form paragraphs. Correspondingly, p-tags can only contain phrasing content.
  • </p> can be omitted if the p element is immediately followed by one of a specific set of start tags (div, ul, ol, table, h1 - h6, another p, and so on) or, in most cases, if there is no more content in its parent. I imagine this is just grandfathering in the HTML seen in the wild.
  • It is not required to add p tags around runs of phrasing content that form paragraphs. But, it is better to add them for clarity and to avoid edge cases in rendering. We'll probably always add them in MediaWiki.
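
The end-tag omission rule can be illustrated with a toy pass over a flat token list. This is a simplified sketch, not the HTML5 tree builder: CLOSES_P is a partial list of the tags involved, tokens are plain strings, and nesting subtleties are ignored. The function name close_open_paragraphs is made up for illustration.

```python
# Partial list of tags whose appearance implicitly closes an open <p>.
CLOSES_P = {
    "address", "article", "aside", "blockquote", "div", "dl", "fieldset",
    "figure", "footer", "form", "h1", "h2", "h3", "h4", "h5", "h6",
    "header", "hr", "main", "nav", "ol", "p", "pre", "section", "table", "ul",
}


def close_open_paragraphs(tokens):
    """Return a copy of `tokens` with the omitted </p> tags made explicit."""
    out = []
    p_open = False
    for tok in tokens:
        if tok.startswith("<") and tok.endswith(">"):
            is_end = tok.startswith("</")
            name = tok.strip("</>").split()[0].lower()
            if name == "p" and is_end:
                p_open = False
            elif name in CLOSES_P:
                # A block-level start tag, or the end tag of an enclosing
                # block, ends the paragraph that is currently open.
                if p_open:
                    out.append("</p>")
                    p_open = False
                if name == "p":  # only a <p> start tag reaches this point
                    p_open = True
        out.append(tok)
    if p_open:
        out.append("</p>")
    return out
```

With the input ["<p>", "a", "<p>", "b", "<div>", "c", "</div>"], both paragraphs get an explicit </p>, the first before the second <p> and the second before <div>.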

Composition Spec notes

  • For each element, build a map of context in which a node can show up and content model it expects. There is clearly a hierarchical relation here. The content model for a node determines the context in which children can show up. So, these constraints should line up properly.
  • This map can probably be used to come up with a set of composition rules / spec when document fragments need to be composed into a final document.
  • The HTML5 parsing algorithm specifies a fragment parsing mode that can handle this scenario, but we are then left to the whims of what the parsing algorithm does instead of specifying what we would like the behavior to be. For example, we might handle a-in-a differently than what the fragment parsing algorithm would do.
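
One possible shape for such a map, as a rough sketch. ElementSpec, spec, SPECS, can_contain and inconsistencies are all hypothetical names, and the handful of entries is only a sample. The inconsistencies helper checks the "constraints should line up" point: if a parent lists a child explicitly, the child's context should list that parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementSpec:
    categories: frozenset              # content categories the element belongs to
    content: str                       # "flow", "phrasing", "nothing" or "children"
    children: frozenset = frozenset()  # allowed child tags when content == "children"
    parents: frozenset = frozenset()   # required parent tags; empty means category-based context


def spec(categories, content, children=(), parents=()):
    return ElementSpec(frozenset(categories), content,
                       frozenset(children), frozenset(parents))


# Hypothetical, partial map of elements.
SPECS = {
    "div": spec({"flow"}, "flow"),
    "p": spec({"flow"}, "phrasing"),
    "span": spec({"flow", "phrasing"}, "phrasing"),
    "br": spec({"flow", "phrasing"}, "nothing"),
    "ul": spec({"flow"}, "children", children={"li"}),
    "ol": spec({"flow"}, "children", children={"li"}),
    "li": spec((), "flow", parents={"ul", "ol"}),
}


def can_contain(parent, child):
    """Check whether `child` may appear directly inside `parent`."""
    p, c = SPECS[parent], SPECS[child]
    if p.content == "nothing":
        return False
    if p.content == "children":
        return child in p.children
    if c.parents:  # context is tied to specific parent tags
        return parent in c.parents
    return p.content in c.categories


def inconsistencies():
    """List (parent, child) pairs where content model and context disagree."""
    return [(tag, child)
            for tag, s in SPECS.items()
            for child in s.children
            if child in SPECS and tag not in SPECS[child].parents]
```

A composition rule set could then be layered on top of can_contain, deciding per element what to do when the check fails, instead of accepting whatever the fragment parsing algorithm does.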

Composition constraints

One of the things to work out with the balanced templates RFC and the Wikitext 2.0 proposal is how to compose fragments properly so that they yield a well-formed, spec-conformant document. Note that since we have well-formed DOM fragments, we don't need to worry about the parts of the HTML parsing algorithm that deal with unclosed or misnested tags. We only need to worry about the content model constraints.

Looking at the table below, the following is a summary of composition constraints (partial since it only covers a largish subset of elements):

  • Elements that only accept phrasing content: h1 - h6, p, pretty much all the text-content elements (span, i, b, em, strong, small, sup, sub, etc. -- see section 4.5 in the table below). We have two options here:
    1. strip non-phrasing tags from the content: This seems the right approach for h1 - h6 tags
    2. split the parent node to ensure constraints are satisfied: This seems the right approach for p and text-content elements
  • Custom exclusions / constraints: No a-tags inside a; no table-tags inside caption; No main inside nav, aside, ... ; etc.
    • The best solution here is to strip the offending tags from the fragment. So, if an a-tag is used inside another a-tag, the nested a-tag is stripped out. An alternative is to convert the a-tag to text. In either case, the a-tag itself is removed. This has an impact on real use cases on Wikipedias: [http://website.com Company with [[Website WikiPage]] here] seems to be found on wikis, and it leads to broken rendering for reads and headaches for Parsoid for editing and round-tripping. The solution proposed here is a better, uniform solution.
  • Constraint on insertion context: li inside ol/ul, td/th inside tr, etc. Some possibilities are listed below. Option 3 seems like the best approach.
    1. Suppress the fragment content entirely: Might work for some cases, but probably not a good idea.
    2. Insert necessary required tags, i.e., insert a ul-tag or a table tag as necessary: Unclear that this is a good solution.
    3. Strip just the offending tags, i.e. <td>x <i>y</i> z</td> is converted to x <i>y</i> z
  • Deviations from content-model and context constraint:
    • It looks like the HTML5 parser does not enforce content model constraints in some cases. Try parsing <pre>a <ol><li>x</li></ol>y</pre>. The parser allows Flow content inside the pre tag, which violates what the spec says is the normative content model of a pre tag. Since wikitext overrides the <pre> tag as a native extension with wikitext semantics, we don't have to deal with this in MediaWiki: an HTML pre tag can never show up in wikitext.
    • It lets the li tag be used outside a list. Try parsing <li>x</li>. The list item is allowed to exist outside a list. To be clear, the spec does say that context requirements are non-normative, so there is that.
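
The strip, split and unwrap strategies listed above can be sketched on a toy tree. Everything here is an assumption made for illustration: Node is a minimal stand-in for a DOM node, PHRASING is a tiny partial set, and strip_nested, split_paragraph and unwrap are hypothetical helper names, not Parsoid or MediaWiki APIs.

```python
from dataclasses import dataclass, field

PHRASING = {"a", "b", "i", "em", "span"}  # partial, for illustration only


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # items are Node or str


def to_html(item):
    """Serialize a Node or text item (attributes are ignored in this sketch)."""
    if isinstance(item, str):
        return item
    inner = "".join(to_html(c) for c in item.children)
    return f"<{item.tag}>{inner}</{item.tag}>"


def strip_nested(node, tag, inside=False):
    """Strip any `tag` element nested inside another `tag`, keeping its children."""
    here = inside or node.tag == tag
    out = []
    for child in node.children:
        if isinstance(child, str):
            out.append(child)
            continue
        cleaned = strip_nested(child, tag, here)
        if child.tag == tag and here:
            out.extend(cleaned.children)  # drop the tag, keep the content
        else:
            out.append(cleaned)
    return Node(node.tag, out)


def split_paragraph(p):
    """Split a <p> so that non-phrasing children become siblings of <p> runs."""
    out, run = [], []
    for child in p.children:
        if isinstance(child, str) or child.tag in PHRASING:
            run.append(child)
            continue
        if run:
            out.append(Node("p", run))
            run = []
        out.append(child)
    if run:
        out.append(Node("p", run))
    return out


def unwrap(node):
    """Strip just the offending tag (e.g. a td outside a tr), keeping its content."""
    return list(node.children)
```

Usage, under the same assumptions: a link containing another link is flattened by strip_nested(link, "a"), a paragraph containing a list is turned into paragraph / list / paragraph by split_paragraph, and unwrap(Node("td", ["x ", Node("i", ["y"]), " z"])) yields the content that serializes to x <i>y</i> z.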

So, overall it looks like we can come up with a fairly reasonable set of fragment composition rules based on common sense notions (derived from the HTML5 spec). Within the wikitext markup spec, we might even specify exceptions / minor variations from the spec if it aids reasoning and/or eliminates edge cases.

Quick reference table of HTML5 elements and their content model

Element | Content categories | Context (where can this element be used?) | Content model (what elements can be used in its DOM tree)

4.2 Document
html | None | document's doc element / wherever a fragment is allowed | head followed by body
head | None | First elt of html | 1+ elts of metadata with exactly 1 title and at most 1 base elt
title | Metadata | In head without other titles | Text that is not inter-element whitespace (IEW)
base | Metadata | In head without other bases | Nothing
link | Metadata; flow & phrasing if allowed in body | metadata OR noscript OR phrasing | Nothing
meta | Metadata; flow & phrasing if itemprop is present | .. complicated .. | Nothing
style | Metadata | Metadata content | .. complicated ..

4.3 Sections
body | Sectioning root | second elt of html | Flow
article | Flow, Sectioning, Palpable | Flow | Flow
section | Flow, Sectioning, Palpable | Flow | Flow
nav | Flow, Sectioning, Palpable | Flow | Flow - {main}
aside | Flow, Sectioning, Palpable | Flow | Flow - {main}
h1 - h6 | Flow, Heading, Palpable | Flow, child of hgroup | Phrasing
hgroup | Flow, Heading, Palpable | Flow | zero or more h1..h6, template
header | Flow, Palpable | Flow | Flow - {header,footer,main}
footer | Flow, Palpable | Flow | Flow - {header,footer,main}
address | Flow, Palpable | Flow | Flow - {header,footer,main} - Heading - Sectioning

4.4 Grouping content
p | Flow, Palpable | Flow | Phrasing
hr | Flow | Flow | Nothing
pre | Flow, Palpable | Flow | Phrasing
blockquote | Flow, Sectioning root, Palpable | Flow | Flow
ol | Flow, Palpable if li present | Flow | >= 0 li and script-supporting
ul | Flow, Palpable if li present | Flow | >= 0 li and script-supporting
li | None | In ol, ul and <menu type='toolbar'> | Flow
dl | Flow | Flow | >= 0 groups of [dt+, dd+]
dt | None | Before dd or dt inside dl | Flow - {header,footer} - Sectioning - Heading
dd | None | After dd or dt inside dl | Flow
figure | Flow, Sectioning root, Palpable | Flow | Flow with optional figcaption before/after the flow content
figcaption | None | First/Last child of figure | Flow
main | Flow, Palpable | Flow | Flow
div | Flow, Palpable | Flow | Flow

4.5 Text-level
a | Flow, Phrasing, Palpable | Phrasing | Transparent, no interactive or a
em | Flow, Phrasing, Palpable | Phrasing | Phrasing
strong | Flow, Phrasing, Palpable | Phrasing | Phrasing
small | Flow, Phrasing, Palpable | Phrasing | Phrasing
s | Flow, Phrasing, Palpable | Phrasing | Phrasing
cite | Flow, Phrasing, Palpable | Phrasing | Phrasing
q | Flow, Phrasing, Palpable | Phrasing | Phrasing
dfn | Flow, Phrasing, Palpable | Phrasing | Phrasing
abbr | Flow, Phrasing, Palpable | Phrasing | Phrasing
ruby | Flow, Phrasing, Palpable | Phrasing | .. complicated ..
rt | None | child of ruby | Phrasing
rp | None | child of ruby immediately before/after rt | Text
data | Flow, Phrasing, Palpable | Phrasing | Phrasing
time | Flow, Phrasing, Palpable | Phrasing | Phrasing if datetime attr present; otherwise constrained text (see spec for details)
code | Flow, Phrasing, Palpable | Phrasing | Phrasing
var | Flow, Phrasing, Palpable | Phrasing | Phrasing
samp | Flow, Phrasing, Palpable | Phrasing | Phrasing
kbd | Flow, Phrasing, Palpable | Phrasing | Phrasing
sub | Flow, Phrasing, Palpable | Phrasing | Phrasing
sup | Flow, Phrasing, Palpable | Phrasing | Phrasing
i | Flow, Phrasing, Palpable | Phrasing | Phrasing
b | Flow, Phrasing, Palpable | Phrasing | Phrasing
u | Flow, Phrasing, Palpable | Phrasing | Phrasing
mark | Flow, Phrasing, Palpable | Phrasing | Phrasing
bdi | Flow, Phrasing, Palpable | Phrasing | Phrasing
bdo | Flow, Phrasing, Palpable | Phrasing | Phrasing
span | Flow, Phrasing, Palpable | Phrasing | Phrasing
br | Flow, Phrasing | Phrasing | Nothing
wbr | Flow, Phrasing | Phrasing | Nothing

4.7 Edits
ins | Flow, Phrasing, Palpable | Phrasing | Transparent
del | Flow, Phrasing | Phrasing | Transparent

4.8 Embedded
picture | Flow, Phrasing, Embedded | Embedded | 0+ source tags followed by img, optionally intermixed with script-supporting elements
source | None | child of picture, before img; child of a media elt before Flow or track elements | Nothing
img | Flow, Phrasing, Embedded, Form-associated, Interactive?, Palpable | Embedded | Nothing
iframe | Flow, Phrasing, Embedded, Interactive, Palpable | Embedded | Text with constraints (see spec for details)
embed | Flow, Phrasing, Embedded, Interactive, Palpable | Embedded | Nothing
object | Flow, Phrasing, Embedded, Interactive?, Palpable, Listed & submittable form-associated elt | Embedded | 0+ param followed by transparent
param | None | child of object before Flow | Nothing
video | Flow, Phrasing, Embedded, Interactive?, Palpable | Embedded | .. complicated ..
audio | Flow, Phrasing, Embedded, Interactive?, Palpable? | Embedded | .. complicated ..
track | None | child of media element before Flow | Nothing
map | Flow, Phrasing, Palpable | Phrasing | Transparent
area | Flow, Phrasing | Phrasing, but within a map ancestor | Nothing

4.9 Tabular data
table | Flow, Palpable | Flow | caption?, colgroup*, thead?, (tbody* OR tr+), tfoot?, intermixed with optional script-supporting elements
caption | None | first element of table | Flow - {table}
colgroup | None | child of table, after caption, before thead, tbody, tr, tfoot | Nothing if span attr is present; 0+ col and template if not
col | None | child of colgroup without a span attr | Nothing
tbody | | | 0+ tr and script-supporting elements
thead | None | child of table, after caption, colgroup, before tbody, tfoot, tr; no other thead allowed | 0+ tr and script-supporting elements
tfoot | None | child of table, after caption, colgroup, thead, tbody, tr; no other tfoot allowed | 0+ tr and script-supporting elements
tr | None | child of thead, tbody, tfoot; child of table after caption, colgroup and thead, but only if there are no tbody |
td | Sectioning root | child of tr | Flow
th | None | child of tr | Flow - {header, footer} - Sectioning - Heading

4.10 Form
.. skipped ..

4.11 Interaction
.. skipped ..

4.12 Scripting
script | | |
template | .. | .. | contents have no conformance requirements