Parsoid/Normalizations

While serializing (html2wt), Parsoid performs a number of normalizations.

Most of them are implemented in DOMNormalizer.php.

Normalizations

Parsoid performs the following normalizations:

  • Tag minimization (<i>/<b> tags)
  • Serialize invalid <a> tags to text
  • Enforce single-line context (in headings and lists)
  • Strip empty headings and style tags (only performed on new nodes)
  • Tag minimization (<a> tags, when at least one is new)
  • Whitespace at the start of paragraphs
  • New links that end in spaces
  • New table cells starting with escapable prefixes

Other normalizations work around issues in Parsoid or in VE and other clients. For now, they are a simpler way to generate clean wikitext:

  • Force category links and behaviour switches to serialize before/after headings (only performed on new nodes)
  • Strip <br> tags in headings (Parsoid introduces these in some paragraphs, and they persist when VE converts those paragraphs to headings)
  • Strip trailing <nowiki/> from wikitext lines (this one will be unnecessary once Parsoid stops introducing these)

Examples

Tag minimization (<i>/<b> tags)

<b>X</b><b>Y</b>

// becomes

<b>XY</b>

and

<i>A</i><b><i>X</i></b><b><i>Y</i></b><i>Z</i>

// becomes 

<i>A<b>XY</b>Z</i>
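
The first case above can be sketched in a few lines of Python. This is an illustrative sketch only, not Parsoid's PHP implementation: the function name merge_adjacent and the MERGEABLE set are hypothetical, and the sketch assumes that only attribute-free tags with no text between them are merged. It does not perform the nesting swap that the second example needs.

```python
import xml.etree.ElementTree as ET

# Assumption: only attribute-free <b>/<i> siblings are merged.
MERGEABLE = {"b", "i"}

def merge_adjacent(parent):
    """Merge adjacent sibling elements that share a mergeable tag."""
    children = list(parent)
    i = 0
    while i < len(children) - 1:
        a, b = children[i], children[i + 1]
        same = (a.tag == b.tag and a.tag in MERGEABLE
                and not a.attrib and not b.attrib)
        # Any text between the two elements (a.tail) blocks the merge.
        if same and not a.tail:
            # Append b's leading text to the end of a's content.
            if len(a):
                a[-1].tail = (a[-1].tail or "") + (b.text or "")
            else:
                a.text = (a.text or "") + (b.text or "")
            a.extend(list(b))
            a.tail = b.tail
            parent.remove(b)
            children.pop(i + 1)
        else:
            i += 1
    # Recurse so that newly adjacent descendants are merged too.
    for child in parent:
        merge_adjacent(child)

root = ET.fromstring("<p><b>X</b><b>Y</b></p>")
merge_adjacent(root)
print(ET.tostring(root, encoding="unicode"))  # <p><b>XY</b></p>
```

Applied to the second example, this sketch stops at <i>A</i><b><i>XY</i></b><i>Z</i>. Reaching <i>A<b>XY</b>Z</i> additionally requires reordering the nested tags.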

Force category links and behaviour switches to serialize before/after headings

<h2>hello there<link href="Category:A1" rel="mw:PageProp/Category" /></h2>

// becomes

<h2>hello there</h2>
<link href="Category:A1" rel="mw:PageProp/Category" />

and

<h2><meta property="mw:PageProp/toc" /> ok</h2>

// becomes

<meta property="mw:PageProp/toc" />
<h2> ok</h2>
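
A minimal Python sketch of this hoisting, under stated assumptions: the helper names is_page_prop and hoist are hypothetical, markers are recognised by an "mw:PageProp/" prefix as in the examples above, and the "only performed on new nodes" restriction is not modelled. It is not Parsoid's actual code.

```python
import xml.etree.ElementTree as ET

def is_page_prop(el):
    # Assumption: category links and behaviour switches carry an
    # "mw:PageProp/" prefix in rel or property, as in the examples.
    marker = el.get("rel") or el.get("property") or ""
    return el.tag in ("link", "meta") and marker.startswith("mw:PageProp/")

def hoist(heading):
    """Return [markers before..., heading, markers after...]."""
    before, after = [], []
    # Leading markers: first child with no text in front of it.
    while len(heading) and not heading.text and is_page_prop(heading[0]):
        el = heading[0]
        heading.remove(el)
        heading.text = el.tail  # text after the marker now leads the heading
        el.tail = None
        before.append(el)
    # Trailing markers: last child with no text after it.
    while len(heading) and not heading[-1].tail and is_page_prop(heading[-1]):
        el = heading[-1]
        heading.remove(el)
        after.insert(0, el)
    return before + [heading] + after

h2 = ET.fromstring('<h2><meta property="mw:PageProp/toc" /> ok</h2>')
for node in hoist(h2):
    print(ET.tostring(node, encoding="unicode"))
```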

Serialize invalid <a> tags to text

<a rel="mw:WikiLink" href="[[foo]]">text</a>

// serializes to

text

and

<a rel="mw:WikiLink" href="foo">*a foo</a>

// serializes to

*a [[foo]]

Enforce single-line context

<h2>testing
123</h2>

// becomes

<h2>testing 123</h2>

and

<ul><li>asd
sdf</li></ul>

// becomes

<ul><li>asd sdf</li></ul>

However, newlines in transclusion parameters are preserved.

<h2> hi <span about="#mwt1" typeof="mw:Transclusion" data-mw='{"parts":[{"template":{"target":{"wt":"bogus","href":"./Template:Bogus"},"params":{"1":{"wt":"there\nyou"}},"i":0}}]}'>there</span><span about="#mwt1">
</span><span about="#mwt1">you</span> </h2>

// serializes to

== hi {{bogus|there
you}} ==
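
For plain text content, the single-line rule amounts to replacing each newline with a space. The following Python sketch illustrates that idea only; to_single_line is a hypothetical helper, and it does not model the transclusion exception shown above, where newlines inside template parameters are preserved.

```python
import re

def to_single_line(text):
    """Collapse each newline, plus adjacent spaces or tabs, into one space."""
    return re.sub(r"[ \t]*\n[ \t]*", " ", text)

print(to_single_line("testing\n123"))  # testing 123
print(to_single_line("asd\nsdf"))      # asd sdf
```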

Strip empty headings and style tags

Normally,

<h2></h2>
<i></i><b></b>

// serializes to

==<nowiki/>==
''<nowiki/>'''''<nowiki/>'''

but with scrubbing, all of it is dropped.
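
The dropping step could look roughly like the Python sketch below. It is a hedged illustration rather than Parsoid's implementation: strip_empty and STRIPPABLE are hypothetical names, "empty" is assumed to mean no text and no children, and the restriction to new nodes is not modelled.

```python
import xml.etree.ElementTree as ET

# Assumption: headings and <i>/<b> are the tags subject to stripping.
STRIPPABLE = {"h1", "h2", "h3", "h4", "h5", "h6", "i", "b"}

def strip_empty(parent):
    """Remove strippable elements that have no text and no children."""
    for child in list(parent):
        strip_empty(child)  # children first, so emptied wrappers are caught
        if child.tag in STRIPPABLE and len(child) == 0 and not child.text:
            index = list(parent).index(child)
            # Keep any text that followed the removed element.
            if child.tail:
                if index == 0:
                    parent.text = (parent.text or "") + child.tail
                else:
                    prev = parent[index - 1]
                    prev.tail = (prev.tail or "") + child.tail
            parent.remove(child)

root = ET.fromstring("<div><h2></h2><i></i><b></b><p>kept</p></div>")
strip_empty(root)
print(ET.tostring(root, encoding="unicode"))  # <div><p>kept</p></div>
```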

Tag minimization (<a> tags)

<a href="Football">Foot</a><a href="Football">ball</a>

// becomes

<a href="Football">Football</a>

and

<a href="Football"><i>Foot</i></a><a href="Football"><b><i>ball</i></b></a>

// becomes

<a href="Football"><i>Foot<b>ball</b></i></a>

Move formatting tags outside <a> tags

<a rel="mw:WikiLink" href="./Football"><u><i><b>Football</b></i></u></a>

// becomes

<u><i><b><a rel="mw:WikiLink" href="./Football">Football</a></b></i></u>

This enables the simplified wikilink format when the href and the link text match. Without the reordering, [[Football|<u>'''''Football'''''</u>]] would be emitted. With the reordering, <u>'''''[[Football]]'''''</u> is emitted.

Exceptions (the reordering is not done in these cases):

  • The formatting tags have attributes such as color, style, or class, since the reordering can change rendering in some cases. The <a> tag's color style overrides the outer style, i.e. <i color='brown'>[[Foo]]</i> doesn't render the same as [[Foo|<i color='brown'>Foo</i>]].
  • The link text is not identical to the href, since the simplified link form is not available in that case.
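
The decision described above can be sketched in Python as follows. This is an illustration under assumptions, not Parsoid's code: serialize_link, wrap, and QUOTES are hypothetical names, the target is assumed to be an already-resolved page title (e.g. "Football" rather than "./Football"), and wrappers lists the formatting tags inside the link from outermost to innermost.

```python
# Wikitext quote markup for attribute-free <i> and <b>.
QUOTES = {"i": "''", "b": "'''"}

def wrap(tag, attrs, inner):
    """Wrap inner in quote markup if possible, else in an HTML tag."""
    if not attrs and tag in QUOTES:
        return f"{QUOTES[tag]}{inner}{QUOTES[tag]}"
    attr_text = "".join(f" {k}='{v}'" for k, v in attrs.items())
    return f"<{tag}{attr_text}>{inner}</{tag}>"

def serialize_link(target, text, wrappers):
    """wrappers: list of (tag, attrs) pairs, outermost first."""
    # Hoist only when text matches the target and no wrapper has attributes.
    can_hoist = text == target and all(not attrs for _, attrs in wrappers)
    out = f"[[{target}]]" if can_hoist else text
    for tag, attrs in reversed(wrappers):
        out = wrap(tag, attrs, out)
    return out if can_hoist else f"[[{target}|{out}]]"

plain = [("u", {}), ("i", {}), ("b", {})]
print(serialize_link("Football", "Football", plain))
# <u>'''''[[Football]]'''''</u>
print(serialize_link("Foo", "Foo", [("i", {"color": "brown"})]))
# [[Foo|<i color='brown'>Foo</i>]]
```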

Whitespace at the start of paragraphs

The nowikis in this example prevent the leading whitespace from roundtripping as preformatted text.

<p> hi
 ho</p>

// normally serializes to

<nowiki> </nowiki>hi
<nowiki> </nowiki>ho

// but with scrubbing becomes

hi
ho
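
On the text of a paragraph, the scrubbed behaviour corresponds to dropping the leading spaces of every line. A minimal Python sketch, with strip_paragraph_indent as a hypothetical helper name:

```python
def strip_paragraph_indent(text):
    """Drop leading spaces from each line of a paragraph's text."""
    return "\n".join(line.lstrip(" ") for line in text.split("\n"))

print(strip_paragraph_indent(" hi\n ho"))
# hi
# ho
```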

New links that end in spaces

The nowiki here prevents the text after the link from being parsed as a link trail.

<p><a rel="mw:WikiLink" href="./Berlin" title="Berlin">Berlin </a>is the capital of Germany.</p>

// normally serializes to

[[Berlin ]]<nowiki/>is the capital of Germany.

// but with scrubbing becomes

[[Berlin]] is the capital of Germany.
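
The scrubbed result can be reproduced by moving the trailing spaces out of the link text and into the text that follows. The Python sketch below shows that step only; move_trailing_space is a hypothetical helper, and the link is reduced to its text for simplicity.

```python
def move_trailing_space(link_text, following_text):
    """Move trailing spaces from the link text to the text after the link."""
    trimmed = link_text.rstrip(" ")
    moved = link_text[len(trimmed):]
    return trimmed, moved + following_text

text, rest = move_trailing_space("Berlin ", "is the capital of Germany.")
print(f"[[{text}]]{rest}")  # [[Berlin]] is the capital of Germany.
```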

New table cells starting with escapable prefixes

<table>
<tr><td>a</td></tr>
<tr><td>-</td></tr>
<tr><td>+</td></tr>
</table>

// normally serializes to

{|
|a
|-
|<nowiki>-</nowiki>
|-
|<nowiki>+</nowiki>
|}

// but with scrubbing becomes

{|
|a
|-
| -
|-
| +
|}
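
A rough Python sketch of the scrubbed form: a cell whose content starts with an escapable prefix gets a space after the pipe instead of a nowiki wrapper. The names are hypothetical, and the prefix set contains only the two characters shown in the example above; the real set may be larger.

```python
# Assumption: only the prefixes from the example above are listed.
ESCAPABLE_PREFIXES = ("-", "+")

def cell_wikitext(content):
    """Serialize one cell, separating an escapable prefix from the pipe."""
    if content.startswith(ESCAPABLE_PREFIXES):
        return "| " + content
    return "|" + content

def table_wikitext(rows):
    """Serialize rows of single-line cell contents, one cell per line."""
    lines = ["{|"]
    for index, row in enumerate(rows):
        if index:
            lines.append("|-")  # row separator between consecutive rows
        lines.extend(cell_wikitext(cell) for cell in row)
    lines.append("|}")
    return "\n".join(lines)

print(table_wikitext([["a"], ["-"], ["+"]]))
```

The printed table matches the scrubbed output shown above.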