User:Kephir/XML parse tree

The following is unofficial documentation of the XML parse tree format returned by Special:ExpandTemplates and by API modules such as API:Expandtemplates and API:Revisions when a generatexml argument is passed to the API call.
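As a rough illustration of such an API call, the sketch below only builds a request URL; it does not send it. The endpoint is an assumption (any wiki's api.php would do), and parse_tree_url is a hypothetical helper name. Parameter availability depends on the MediaWiki version, so check the API help of the target wiki before relying on it.

```python
from urllib.parse import urlencode


def parse_tree_url(wikitext, endpoint="https://www.mediawiki.org/w/api.php"):
    """Build an expandtemplates URL that asks for the XML parse tree.

    The endpoint default is an assumption made for illustration only.
    """
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "generatexml": "1",  # the argument this page refers to
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


url = parse_tree_url("{{foo}}")
print(url)
```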

<!DOCTYPE root [
	<!ENTITY % mixed-markup "(#PCDATA|template|comment|h|possible-h|tplarg|ext|ignore)*">

	<!ELEMENT root       %mixed-markup;                     >
	<!ELEMENT template   (title, part*)                     >
	<!ELEMENT tplarg     (title, part*)                     >
	<!ELEMENT part       (#PCDATA|name|value)*              >
	<!ELEMENT title      %mixed-markup;                     >
	<!ELEMENT name       %mixed-markup;                     >
	<!ELEMENT value      %mixed-markup;                     >
	<!ELEMENT h          %mixed-markup;                     >
	<!ELEMENT possible-h %mixed-markup;                     >
	<!ELEMENT comment    (#PCDATA)                          >
	<!ELEMENT ext        (name, attr, inner?, close?)       >
	<!ELEMENT attr       (#PCDATA)                          >
	<!ELEMENT inner      (#PCDATA)                          >
	<!ELEMENT ignore     (#PCDATA)                          >
	<!ELEMENT close      (#PCDATA)                          >

	<!ATTLIST root
		xml:space    CDATA    #FIXED     "preserve"     >
	<!ATTLIST template
		lineStart    CDATA    #IMPLIED                  >
	<!ATTLIST tplarg
		lineStart    CDATA    #IMPLIED                  >
	<!ATTLIST name
		index        CDATA    #IMPLIED                  >
	<!ATTLIST h
		i            CDATA    #REQUIRED
		level        CDATA    #REQUIRED                 >
	<!ATTLIST possible-h
		i            CDATA    #REQUIRED
		level        CDATA    #REQUIRED                 >
]>

Elements

<dt id="root"> root
The root element. Has no interesting attributes by itself.
Since whitespace is significant in reconstructing wiki markup, it is a good idea to parse the XML document as if root had an xml:space="preserve" attribute. MediaWiki does not specify it explicitly, however.
<dt id="template"> template
Indicates a template, variable, or parser function invocation ({{ ... }}). Must contain at least a title element, followed by optional part elements.
The lineStart attribute is present and set to 1 if the template immediately follows a newline.
It is impossible in general to determine whether the node represents a transclusion or a parser function/variable until the contents of <title> are expanded: for example, {{ {{{foo|x2}}}|aye|nay}} expands to "nay" if foo is assigned "#if:".
API:Siteinfo provides several methods to gather the list of variables and parser functions (siprop=magicwords, siprop=variables and siprop=functionhooks), but none of them can be reliably used to recognise their precise syntax as of MediaWiki 1.24.
<dt id="tplarg"> tplarg
Indicates a template argument reference ({{{ ... }}}). Contents are just like template, a title element followed by optional parts. The lineStart attribute has the same meaning as above.
<dt id="part"> part
Indicates a template argument (or the default value of a template argument reference). Always contains a name and a value element, in that order, with an equals sign between them if the name is given explicitly. If the argument is implicitly numbered, the name element is empty and carries an index attribute specifying its position.
For tplarg elements, only the first part child supplies the default value; the rest are ignored, and the split into name and value is disregarded.
<dt id="h"> h and possible-h
Indicates a header (=== ... ===). The level attribute contains the header level, while i contains the section number, regardless of level (the same that the &section= query string parameter uses).
<possible-h> tags appear only in the output of the hashtable-based parser (Preprocessor_Hash.php). They are created in place of <h> tags everywhere except at the highest level of the tree (below <root>). Otherwise they are mostly equivalent to <h>; note that template logic might make them not end up as actual headers in the fully-parsed page.
<dt id="ext"> ext
Indicates a parser extension tag, such as <ref>...</ref>, <source>...</source> or <nowiki>...</nowiki>. Not all tags are parser extension tags; <b>...</b> or <table>...</table>, for example, are not. Which tags are considered parser extension tags depends on the MediaWiki installation. To obtain a list of extension tags, use API:Siteinfo with the siprop=extensiontags query parameter.
This element always contains (possibly empty) name (tag name) and attr (attributes) child elements, optionally an inner element, and optionally close following it. The contents of attr need not conform to HTML or XML attribute syntax.
If the parser tag is specified in a self-closing form (e.g. <nowiki/>), the ext element will lack inner and close child elements.
<dt id="ignore"> ignore
Indicates text to be ignored, usually a <noinclude>...</noinclude>, <onlyinclude>...</onlyinclude> or <includeonly>...</includeonly> tag and/or its contents.
There is no option in the publicly available API to preprocess wikitext in transclusion mode, i.e. ignoring contents of <noinclude>...</noinclude> while parsing <includeonly>...</includeonly> or restricting parsing to <onlyinclude>...</onlyinclude> (T51353, gerrit:168669).
<dt id="comment"> comment
Indicates an HTML-style comment, i.e. <!-- ... -->. The contents of this element include the comment start mark (<!--) and end mark (-->).
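The element descriptions above can be exercised with a small sketch that reads the arguments of a template node. It assumes the tree has already been obtained as an XML string; SAMPLE_TREE is hand-written for illustration, and text_of and template_arguments are hypothetical helper names. Nested markup inside names and values is flattened to plain text here, which is a simplification.

```python
import xml.etree.ElementTree as ET

# Hand-written sample tree (an assumption, not captured parser output).
SAMPLE_TREE = (
    '<root><template lineStart="1"><title>cite</title>'
    '<part><name index="1"/><value>first</value></part>'
    '<part><name>page </name>=<value> 12</value></part>'
    '</template></root>'
)


def text_of(node):
    """Concatenate all text below a node, dropping nested element tags."""
    return "".join(node.itertext()) if node is not None else ""


def template_arguments(template):
    """Map argument names (or implicit indices) to raw values."""
    args = {}
    for part in template.findall("part"):
        name = part.find("name")
        if name is None:
            continue  # not expected, since part always contains name
        # Implicitly numbered arguments carry the index attribute;
        # explicitly named ones keep their name as element content.
        key = name.get("index") or text_of(name).strip()
        args[key] = text_of(part.find("value"))
    return args


root = ET.fromstring(SAMPLE_TREE)
tpl = root.find("template")
args = template_arguments(tpl)
print(tpl.get("lineStart"), args)
```

Values are returned untrimmed, since whitespace is significant when reconstructing the markup; only the name is stripped so that it can serve as a lookup key.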

Serialisation

Note: the following method guarantees only that valid parser output will serialise back into original markup. Modifying parse trees without regard for escaping may produce unexpected results. See below for information on escaping template arguments.

Turning the XML parse tree back into wiki markup is rather simple. It amounts to four substitutions, three of them being:

<template>...</template> → {{...}}
<tplarg>...</tplarg> → {{{...}}}
<part>...</part> → |...

Care has to be taken when handling ext elements. For elements that contain an inner element, the following substitution is appropriate:

<ext><name>...</name><attr>...</attr>...</ext> → <......>...

Otherwise, use:

<ext><name>...</name><attr>...</attr></ext> → <....../>

Other elements can have their contents passed through as is.

The whole process is equivalent to applying the following XSLT stylesheet:

<?xml version="1.0" standalone="yes" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
	<xsl:output method="text" media-type="text/x-wiki" />
	<xsl:preserve-space elements="*" />
	
	<xsl:template match="template">
		<xsl:text>{{</xsl:text>
		<xsl:apply-templates />
		<xsl:text>}}</xsl:text>
	</xsl:template>
	
	<xsl:template match="tplarg">
		<xsl:text>{{{</xsl:text>
		<xsl:apply-templates />
		<xsl:text>}}}</xsl:text>
	</xsl:template>
	
	<xsl:template match="part">
		<xsl:text>|</xsl:text>
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="ext[inner]">
		<xsl:text>&lt;</xsl:text>
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="ext[not(inner)]">
		<xsl:text>&lt;</xsl:text>
		<xsl:apply-templates />
		<xsl:text>/&gt;</xsl:text>
	</xsl:template>
	
	<xsl:template match="inner">
		<xsl:text>&gt;</xsl:text>
		<xsl:apply-templates />
	</xsl:template>

	<xsl:template match="*">
		<xsl:apply-templates />
	</xsl:template>
</xsl:stylesheet>
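
For readers without an XSLT processor at hand, the same substitutions can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustrative equivalent rather than a reference implementation; serialise is a hypothetical function name and SAMPLE is a hand-written tree.

```python
import xml.etree.ElementTree as ET

# Hand-written sample tree (an assumption, not captured parser output).
SAMPLE = (
    '<root>a <tplarg><title>x</title>'
    '<part><name index="1"/><value>d</value></part></tplarg> '
    '<template><title>t</title>'
    '<part><name index="1"/><value>1</value></part>'
    '<part><name>k</name>=<value>v</value></part></template>'
    '<ext><name>ref</name><attr> name="n"</attr></ext>'
    '<ext><name>nowiki</name><attr></attr><inner>[[x]]</inner>'
    '<close>&lt;/nowiki&gt;</close></ext>'
    '</root>'
)


def serialise(node):
    """Turn a parse tree node back into wiki markup."""
    # Text before the first child, then each child followed by its tail.
    inner = (node.text or "") + "".join(
        serialise(child) + (child.tail or "") for child in node
    )
    if node.tag == "template":
        return "{{" + inner + "}}"
    if node.tag == "tplarg":
        return "{{{" + inner + "}}}"
    if node.tag == "part":
        return "|" + inner
    if node.tag == "ext":
        # Without an inner child the tag was written in self-closing form.
        return "<" + inner + ("" if node.find("inner") is not None else "/>")
    if node.tag == "inner":
        return ">" + inner
    return inner  # every other element passes its contents through


markup = serialise(ET.fromstring(SAMPLE))
print(markup)
```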

Escaping and transformations


The pipe character, the equals sign and consecutive curly braces are interpreted specially in template invocations. If you wish to use any of them as literal characters, you have to escape them. Unfortunately, MediaWiki markup does not lend itself to escaping very well. There are many methods of escaping markup, and they come with many caveats. Proper escaping is significant when modifying parse trees, hence we discuss it here.

The simplest method is to wrap special characters, or the whole string, inside a <nowiki> tag, or escape them with numerical HTML escapes: &#124;, &#61;, &#123; and &#125; (and possibly escape other characters as well). This has two disadvantages: first, wikilinks and transclusions stop working (obviously). Second, the escaped text might not be recognised by template or module logic that processes it. In this section, more universal alternatives will be discussed.

If you want to allow wikilinks in an argument, but not templates (or template arguments), the simplest universal method is to perform the following substitutions:

  • {{{ → {<noinclude/>{<noinclude/>{
  • }}} → }<noinclude/>}<noinclude/>}
  • {{ → {<noinclude/>{
  • }} → }<noinclude/>}
  • = → {{lc:=}}
  • | → {{!}} (built-in magic word since MediaWiki 1.24; for older versions, you have to ensure that Template:! expands to the pipe character.)

This method has the disadvantage that piped wikilinks come out as [[link target{{!}}label]], which may be aesthetically unpleasing, although it still renders as expected. It also prevents the pipe trick from working. To avoid that, count the pairs of square brackets preceding each |: if they are balanced, the pipe is not part of a wikilink and needs escaping; otherwise it can be left alone.
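
The substitutions listed above can be sketched as a short Python function. The name escape_no_templates is hypothetical. Instead of matching {{{ and {{ separately, the sketch breaks up every pair of adjacent braces, which gives the same result for runs of two or three braces and, as an assumption on my part, extends it to longer runs. It does not implement the bracket counting for piped wikilinks.

```python
import re


def escape_no_templates(text):
    """Escape braces, '=' and '|' so the text is safe as a template argument."""
    # Braces first: the later replacements deliberately introduce "{{".
    text = re.sub(r"\{(?=\{)", "{<noinclude/>", text)
    text = re.sub(r"\}(?=\})", "}<noinclude/>", text)
    # "{{lc:=}}" contains no pipe and "{{!}}" no equals sign,
    # so these two replacements do not interfere with each other.
    text = text.replace("=", "{{lc:=}}")
    text = text.replace("|", "{{!}}")
    return text


print(escape_no_templates("{{a|b=c}}"))
```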

If you want to allow both links and templates, but prevent misinterpretations of | and premature template closures, take the following steps:

  1. Parse the markup you wish to escape. (The following will assume that you get an XML tree as described above.)
  2. For each direct child text node of the <root> element, escape =, |, }}} and }}, as discussed.
  3. Serialise the parse tree back into wiki markup.

The resultant text will be interpreted as if it were a stand-alone piece of markup, even inside a template argument. Following these steps is the only universal method of escaping wikitext.
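
Steps 2 and 3 above can be sketched as follows, assuming step 1 has already produced the XML tree as a string (parsing itself requires MediaWiki). All function names are hypothetical, the sample tree is hand-written, and the serialiser is a compact restatement of the substitutions from the Serialisation section.

```python
import re
import xml.etree.ElementTree as ET


def escape_top_level(text):
    """Escape closing brace runs, '=' and '|' in one top-level text node."""
    text = re.sub(r"\}(?=\})", "}<noinclude/>", text or "")
    return text.replace("=", "{{lc:=}}").replace("|", "{{!}}")


def serialise(node):
    """Turn a parse tree node back into wiki markup."""
    inner = (node.text or "") + "".join(
        serialise(child) + (child.tail or "") for child in node
    )
    if node.tag == "template":
        return "{{" + inner + "}}"
    if node.tag == "tplarg":
        return "{{{" + inner + "}}}"
    if node.tag == "part":
        return "|" + inner
    if node.tag == "ext":
        return "<" + inner + ("" if node.find("inner") is not None else "/>")
    if node.tag == "inner":
        return ">" + inner
    return inner


def escape_as_argument(tree_xml):
    """Escape only the text directly below <root>, then serialise."""
    root = ET.fromstring(tree_xml)
    # In ElementTree, root's direct text nodes are root.text
    # and the tail of each direct child.
    root.text = escape_top_level(root.text)
    for child in root:
        child.tail = escape_top_level(child.tail)
    return serialise(root)


SAMPLE = (
    "<root>a=b|c <template><title>t</title>"
    "<part><name>k</name>=<value>v</value></part></template> }} d</root>"
)
escaped = escape_as_argument(SAMPLE)
print(escaped)
```

Text inside the nested template is left untouched, so its own = and | keep their meaning, while the same characters at the top level are neutralised.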

Implementation
