The second point is mostly a modeling assumption, made for simplicity. It means there is a strict hierarchy of what can contain what:
- branch nodes can have children that are either branch nodes or content branches, but can't contain content directly. Examples of branch nodes are tables and lists.
- content branches can have children, but those children must be content. Examples of content branches are paragraphs, headings and <pre> blocks.
- content nodes "are" content, and can't have children. Examples are text nodes (plain or annotated text), images and <br> line breaks.
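
As a rough illustration of the three rules above, here is a minimal sketch of a containment check. All names here (`Kind`, `ALLOWED_CHILDREN`, `NODE_KINDS`, `Node`, `is_valid`) are made up for this example and are not taken from the actual model code. Classifying list items as branch nodes is an assumption, inferred from the fact that they hold paragraphs.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Kind(Enum):
    BRANCH = "branch"                  # e.g. table, list
    CONTENT_BRANCH = "content_branch"  # e.g. paragraph, heading, pre
    CONTENT = "content"                # e.g. text, image, br


# Which kinds of children each kind of node may hold.
ALLOWED_CHILDREN = {
    Kind.BRANCH: {Kind.BRANCH, Kind.CONTENT_BRANCH},
    Kind.CONTENT_BRANCH: {Kind.CONTENT},
    Kind.CONTENT: set(),  # content nodes can't have children
}

# Hypothetical mapping from node type names to kinds.
NODE_KINDS = {
    "table": Kind.BRANCH,
    "list": Kind.BRANCH,
    "listItem": Kind.BRANCH,  # assumption: list items hold paragraphs
    "paragraph": Kind.CONTENT_BRANCH,
    "heading": Kind.CONTENT_BRANCH,
    "pre": Kind.CONTENT_BRANCH,
    "text": Kind.CONTENT,
    "image": Kind.CONTENT,
    "br": Kind.CONTENT,
}


@dataclass
class Node:
    type: str
    children: List["Node"] = field(default_factory=list)


def is_valid(node: Node) -> bool:
    """Return True if every child in the subtree is allowed by its parent."""
    allowed = ALLOWED_CHILDREN[NODE_KINDS[node.type]]
    return all(
        NODE_KINDS[child.type] in allowed and is_valid(child)
        for child in node.children
    )
```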

This means that some structures that are legal in HTML are not legal in our model. For instance, in the HTML that we get, it's common for <li> elements to contain text directly. In our model, that's represented as a list item containing a paragraph, which in turn contains a text node.
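
The list item case can be sketched as a small normalization step. This is only an illustration, not the actual converter: the dict-based node shape, the `CONTENT_TYPES` set and the `wrap_bare_content` helper are hypothetical, and grouping consecutive content nodes into a single paragraph is an assumption about how such wrapping might work.

```python
CONTENT_TYPES = {"text", "image", "br"}  # content nodes named in the text


def wrap_bare_content(children):
    """Wrap each run of consecutive content nodes in a paragraph node."""
    result = []
    run = []  # content nodes collected since the last non-content child
    for child in children:
        if child["type"] in CONTENT_TYPES:
            run.append(child)
        else:
            if run:
                result.append({"type": "paragraph", "children": run})
                run = []
            result.append(child)
    if run:
        result.append({"type": "paragraph", "children": run})
    return result


# <li>Hello</li> as it might arrive from HTML: text directly inside the item.
html_item = {"type": "listItem", "children": [{"type": "text", "text": "Hello"}]}

# The same item in the model: list item > paragraph > text.
model_item = {**html_item, "children": wrap_bare_content(html_item["children"])}
```

Children that are already branch nodes or content branches pass through unchanged, so an item that mixes bare text with a nested list ends up with the text wrapped and the nested list left as-is.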