Parsoid/Extension API

Known issues with the current draft that will be fixed

  • Functionality:
    • ParserOutput instances aren't currently exposed but need to be exposed. See Talk:Parsoid/Extension API for discussion.
    • setFunctionHook isn't yet supported but will be supported.
  • Documentation: This page doesn't discuss ContentModelHandler support yet. The code supports that functionality.
  • Documentation (and maybe functionality): Hook listener implementations often have to deal with the problem of how to order and invoke listeners. This is not unique to Parsoid, but it needs to be addressed explicitly. If we intend to provide any code or API support for handling it, that support needs to be implemented and documented.

Introduction

Terminology:

  • In the rest of this document, we use the term wikitext engine interchangeably with wikitext parser. Parsoid not only parses wikitext and generates HTML but also serializes HTML to wikitext. As such, wikitext engine is a better term.
  • wt2html is a shortcut for wikitext to HTML transformation.
  • html2wt is a shortcut for HTML to wikitext transformation.

This page only concerns extensions that interact with the wikitext engine either to process wikitext or because they register for one of the many parser hooks currently supported by the MediaWiki core wikitext engine. In this first pass at drafting the extension API, we only deal with extensions that implement tag handlers. The Parsoid codebase also supports extensions that implement content handlers, and the next round of updates to this page will document that support.

Types of extension tags

It is useful to broadly distinguish between a few types of extension tags for the purpose of figuring out how to interact with Parsoid and the extension API.

  • Type 1: Extension tags like <pre> and <nowiki> that don't treat their contents as wikitext at all, i.e. $out = genDOM( $input )
  • Type 2: Extension tags like <ref> that wrap regular wikitext and use the parser's output more or less as is, i.e. $out = parseWT( $input )
  • Type 3: Extension tags like <poem> that wrap regular wikitext but preprocess their source to generate wikitext that they need the parser to generate output for, but then use the parser's output more or less as is, i.e. $out = postProcessDOM( parseWT( mangle( $input ) ) )
  • Type 4: Extension tags like <gallery> that have wikitext-like snippets in their source that they preprocess to feed to the parser, and then stuff the parser's output in a DOM tree they construct separately, i.e. $out = buildDOM( map( $snippet => parseWT( mangle( $snippet ) ), $wtSnippets ) )

For lack of imagination, we are just using names like type 1, 2, 3, 4 so we can refer back to them in the rest of the document. We believe this broad categorization can be mildly helpful in making sense of how to use Parsoid's extension API. This categorization may not have much value outside this page and should not be given more merit than it deserves.
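The four shapes can be compared side by side in a small sketch. Every function below (genDOM, parseWT, mangle, postProcessDOM, buildDOM) is a toy stand-in named after the pseudocode above. None of them exists in Parsoid, and strings stand in for DOM trees to keep the example self-contained.

```php
<?php
// Toy stand-ins only: these mimic the shape of each step, not Parsoid itself.
function genDOM( string $input ): string {
    // Type 1: the content is not wikitext; it is escaped and emitted verbatim.
    return '<pre>' . htmlspecialchars( $input, ENT_NOQUOTES ) . '</pre>';
}

function parseWT( string $wt ): string {
    // Pretend "parser" that only understands '''bold'''.
    return preg_replace( "/'''(.+?)'''/", '<b>$1</b>', $wt );
}

function mangle( string $src ): string {
    // Preprocessing step: here, just trim surrounding whitespace.
    return trim( $src );
}

function postProcessDOM( string $html ): string {
    // Post-processing step: wrap the parser output.
    return '<div class="poem">' . $html . '</div>';
}

function buildDOM( array $parsedSnippets ): string {
    // The extension builds its own tree and places parsed snippets in it.
    $items = array_map( fn ( string $s ): string => "<li>$s</li>", $parsedSnippets );
    return '<ul class="gallery">' . implode( '', $items ) . '</ul>';
}

$type1 = genDOM( "'''not bold'''" );
$type2 = parseWT( "'''bold'''" );
$type3 = postProcessDOM( parseWT( mangle( "  '''bold'''  " ) ) );
$type4 = buildDOM( array_map(
    fn ( string $snippet ): string => parseWT( mangle( $snippet ) ),
    [ " '''a''' ", "b" ]
) );
```

The difference between the types is only in how many steps surround the call to the parser, and who constructs the final tree.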

Core Parsoid concepts to be familiar with

Annotations added to extension wrappers: Parsoid decorates extension wrappers with a few attributes (typeof, about, data-mw) that let Parsoid's clients (reading or editing) demarcate content that comes from an extension and convey additional information about it.

Selective Serialization: Parsoid has an html2wt mode for edited documents where it diffs the original and edited HTML and runs the html2wt transformation only for edited DOM subtrees. For the unedited subtrees, it uses their source offsets to emit the original wikitext unchanged. This technique is critical to prevent "dirty diffs" when edited documents are converted to wikitext. Not all wikis care about this issue, but it is a significant concern for Wikimedia-run wikis.
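The idea can be illustrated with a toy serializer. This is only a sketch under simplifying assumptions: the document is a flat list of top-level nodes, each node carries a hypothetical 'range' (start and end byte offsets into the original wikitext), an 'edited' flag and its current text, and the diffing step has already happened. The function name and node shape are invented for this example and are not Parsoid's.

```php
<?php
/**
 * Toy selective serializer. All names and the node shape are hypothetical.
 */
function selectiveSerialize( string $originalWt, array $nodes, callable $serialize ): string {
    $out = '';
    foreach ( $nodes as $node ) {
        if ( $node['edited'] ) {
            // Only edited subtrees go through the html2wt transformation.
            $out .= $serialize( $node );
        } else {
            // Unedited subtrees reuse the original wikitext byte for byte.
            [ $start, $end ] = $node['range'];
            $out .= substr( $originalWt, $start, $end - $start );
        }
    }
    return $out;
}

$original = "<i>a</i>\n'''b'''\n";
$nodes = [
    [ 'range' => [ 0, 9 ], 'edited' => false, 'text' => 'a' ],
    [ 'range' => [ 9, 17 ], 'edited' => true, 'text' => 'c' ],
];
// Stand-in serializer for edited nodes; it always emits bold wikitext.
$result = selectiveSerialize(
    $original,
    $nodes,
    fn ( array $n ): string => "'''{$n['text']}'''\n"
);
```

Because the first node is copied from the source rather than re-serialized, its original spelling survives the edit, whatever a serializer would have produced for it.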

DSR (DOM Source Range) offsets: As part of transforming wikitext to HTML, Parsoid maps DOM elements to the source wikitext substring that generated that DOM element. Based on this mapping, Parsoid assigns a 4-element offset array ( [ start-offset, end-offset, start-tag-width, end-tag-width ] ) to all DOM elements (with caveats that we'll skip here for now). While this is considered Parsoid-internals information as far as Parsoid's clients are concerned, extensions might need to be aware of this concept in case they want to generate (or ensure the accuracy of) these offsets for DOM elements in their output.
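As a minimal sketch of what the four numbers describe, the helper below slices a source string using such an array. The helper name sliceByDsr is hypothetical and not part of Parsoid, and the example assumes byte offsets into an ASCII string.

```php
<?php
/**
 * Split a source substring using a DSR-style 4-element array.
 * Hypothetical helper, not part of Parsoid; assumes byte offsets.
 */
function sliceByDsr( string $source, array $dsr ): array {
    [ $start, $end, $openWidth, $closeWidth ] = $dsr;
    $outer = substr( $source, $start, $end - $start );
    $len = strlen( $outer );
    return [
        'outer' => $outer,
        'open' => substr( $outer, 0, $openWidth ),
        'inner' => substr( $outer, $openWidth, $len - $openWidth - $closeWidth ),
        'close' => substr( $outer, $len - $closeWidth, $closeWidth ),
    ];
}

$source = "Some '''bold''' text";
// The bold element spans offsets 5..15; each ''' delimiter is 3 bytes wide.
$parts = sliceByDsr( $source, [ 5, 15, 3, 3 ] );
```

Here $parts['outer'] is the full wikitext for the element and $parts['inner'] is the wikitext of its content, which is why accurate tag widths matter when an extension's output wraps parsed content.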

DSR considerations will likely only apply to extensions of types 2-4, and only if they decide to implement their own html2wt transformations and only if they decide to support selective serialization. So, for a majority of extensions, for initial implementations, this may not be relevant. For type 2 extensions, Parsoid's computed DSR offsets are likely going to be accurate and so, only extensions of type 3 and 4 would need to worry about this.

All that said, as a first approximation, it is always safe to set a parse-time option asking the API to null out all computed DSR offsets (see more below).

API & Hooks

In the Parsoid regime, extensions will NOT get direct access to the wikitext engine. All interaction happens through an extension API and hooks; extensions can register "hook listeners" by implementing interfaces that declare those hooks. Unlike the current set of parser hooks, Parsoid hooks are primarily transformation hooks. While some of them might refer to a timeline in the processing of the input document (whether wikitext or DOM) like initialization, post-processing, or finalization, any such exposed events do not reference implementation-specific pipeline events (before/after some pipeline stage).

For now, we only support the following transformation hooks: sourceToDom, domToWikitext, lintHandler, and a DOM post processor. As we analyze more extensions and get feedback, we will consider what other hooks might become necessary and how to support them.

Clarification / disambiguation of an overloaded term: We use events here to refer to implicit parser pipeline timeline events (ex: completion of tokenization, link parsing, DOM building), not to explicit events emitted for logging, metrics, or instrumentation, nor to anything handled by event infrastructure such as Kafka. Those events are outside the purview of the wikitext engines.

No support for sequential, in-order processing of extension tags

Extensions should not expect to maintain global document state within the extension where ordering matters. Parsoid does not guarantee that repeated occurrences of the same extension tag will be processed in the same order in which they are seen on the page (for ex: because of batched processing or cooperative multitasking). Nor should extensions assume that they will be invoked for every instance that is seen in wikitext (for ex: because we reuse parsed content from a cache). All Parsoid guarantees is that in the final output of the page, the output for extension tags will be found in the same order as they showed up in source wikitext.

Given this implementation flexibility that Parsoid reserves for itself, global state like counters cannot be reliably maintained by the extension. Extensions will get access to the fully processed DOM of the page which they could inspect to reconstruct source ordering. Parsoid's Cite implementation is one example of this scenario.
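A minimal sketch of that pattern: instead of incrementing a counter while individual tags are processed, number the placeholders in one pass over the finished document. The example uses PHP's built-in DOM extension rather than Parsoid's DOM classes, and the "ref" class, the [n] label format and the numberRefs function are all made up for illustration.

```php
<?php
/**
 * Number placeholder <sup class="ref"> elements in document order.
 * Hypothetical helper built on PHP's DOM extension, not on Parsoid.
 */
function numberRefs( DOMDocument $doc ): int {
    $xpath = new DOMXPath( $doc );
    $counter = 0;
    // The query returns nodes in document order, regardless of the order
    // in which the individual fragments were originally generated.
    foreach ( $xpath->query( '//sup[@class="ref"]' ) as $sup ) {
        $counter++;
        $sup->appendChild( $doc->createTextNode( "[$counter]" ) );
    }
    return $counter;
}

$doc = new DOMDocument();
$doc->loadHTML( '<p>A<sup class="ref"></sup> B<sup class="ref"></sup></p>' );
$total = numberRefs( $doc );
```

The counter lives only for the duration of the pass, so nothing depends on how often or in what order the tag handler itself was invoked.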

Support for html2wt transformations

Parsoid provides a default html2wt transformation based on information encoded in the data-mw attribute during the wikitext to HTML transformation. With only this basic support, VisualEditor can provide only extremely basic editing support (likely direct editing of the data-mw attribute). If extensions intend to provide custom editing support for editing clients like VisualEditor, they should implement the domToWikitext transformation to convert edited HTML back to appropriate wikitext. Currently, Parsoid does not have selective serialization support for all extensions (support for Cite may have been implicitly baked in). When that support is added, Parsoid will expose more interface methods in the core extension tag interface for extensions that choose to control more carefully how their HTML is serialized back to wikitext.

Extension registration and configuration

Parsoid-compatible extensions can be registered via the ParsoidModules property in the extension.json file. This property can either specify the Parsoid configuration inline OR provide an ObjectFactory declaration that provides an implementation of the Wikimedia\Parsoid\Ext\ExtensionModule interface (preferred!).

See the schema at docs/extension.schema.v2.json for the latest spec (search for "ParsoidModules"). If you use the preferred approach, this can just be the string name of a class implementing the Wikimedia\Parsoid\Ext\ExtensionModule interface, or else a fuller ObjectFactory specification of that class (which allows you to pass in service objects to the constructor). However, you can also use a configuration array object, which is an associative array that currently has the following fields:

  • tags: If an extension implements extension tags (ex: Cite implements <ref> and <references>), this property is an array of configuration objects for each such extension tag (see configuration spec below).
  • domProcessors: This is an array of ObjectFactory declarations, each of which returns a class that extends the Wikimedia\Parsoid\Ext\DOMProcessor abstract class.
  • styles: Additional ResourceLoader styles to include.

Configuring extension tags

The extension tag config object is an associative array with the following fields currently:

  • name: The name of the tag
  • handler: ObjectFactory declaration that provides a class implementing this tag. The class should extend the Wikimedia\Parsoid\Ext\ExtensionTagHandler abstract class.
  • options: This has 2 properties, one for wt2html and another for html2wt
    • wt2html: This options block dictates how the DOM fragment returned by the sourceToDom method should be handled. Currently, only one option exists. The vast majority of extensions will not need this.
      • unpackOutput: By default, Parsoid takes the DOM fragment returned by the sourceToDom method, unpacks it, and splices it into the parent document in the appropriate place. However, if unpackOutput is false, Parsoid will leave a marker instead and store the fragment in a map. It is expected that the extension's wtPostprocess DOM processor will deal with these DOM fragments appropriately. For example, the Cite extension relies on this to migrate the ref's fragments to the references section and leave behind a citation that is appropriately globally numbered.
    • html2wt: This options block influences Parsoid's HTML to wikitext transformation. Given that extensions might implement their own domToWikitext implementation, these options primarily influence how the generated wikitext interacts with its context. Currently, only one option exists.
      • format: By default, the wikitext from converting the HTML is rendered inline. However, if extensions specify a block value for this property, the wikitext output is rendered on its own separate line.

ExtensionTagHandler abstract class

This class provides four methods: sourceToDom, domToWikitext, lintHandler, modifyArgDict.

Extensions are expected to implement the sourceToDom method at the very least. Parsoid annotates the DOM fragment returned by sourceToDom so that clients that process Parsoid HTML can demarcate extension output and extract other information from it. These annotations also enable Parsoid's default html2wt transformation for this tag. Please look at the docs for this class for more specific details about these methods.

Instances of classes that extend ExtensionTagHandler and are configured as extension tags are cached, so they should not maintain state that would leak between invocations of the extension tag.

In all cases, Parsoid passes a ParsoidExtensionAPI object into the methods. Every invocation gets a fresh instance of this API object.

DOMProcessor abstract class

Currently, the only real supported DOM processor is the wtPostprocess method. This method will be provided an instance of the ParsoidExtensionAPI object as well as the DOM for the wikitext being processed.

We do anticipate supporting an htmlPreProcess method in the future, which will be invoked when converting HTML to wikitext. This processor will be invoked at the beginning to give extensions a chance to preprocess the DOM or extract any information necessary for later use. While the DOMProcessor class provides a method for this, it is not hooked up anywhere in Parsoid currently.

In the future, based on need, other DOM processors might be supported.

Extension Registration Example 1: ObjectFactory declaration

{
    "name": "JsonExtension",
    "manifest_version": 2,
    ...
    "ParsoidModules": [ "Wikimedia\\Parsoid\\Ext\\JSON" ]
}

Extension Registration Example 2: Inline configuration

{
   "name": "Cite",
   "manifest_version": 2,
   ...
   "ParsoidModules": [
      {
         "name": "Cite",
         "domProcessors": [ "Wikimedia\\Parsoid\\Ext\\Cite\\RefProcessor" ],
         "tags": [
            {
               "name": "ref",
               "handler": "Wikimedia\\Parsoid\\Ext\\Cite\\Ref",
               "options": {
                  "wt2html": { "unpackOutput": false }
               }
            },
            {
               "name": "references",
               "handler": "Wikimedia\\Parsoid\\Ext\\Cite\\References",
               "options": {
                  "html2wt": { "format": "block" }
               }
            }
         ]
      }
   ]
}

Parsoid API for extensions

As part of implementing the various methods (sourceToDom, domToWikitext, etc) for extension tags and the DOM processors, extensions might need access to certain kinds of information or functionality. For example, extensions that intend to handle wikitext as part of their implementation will rely on Parsoid to convert that wikitext to DOM. Or, they might need access to configuration information for the wiki, or the page. Or, they might need to log error messages or metrics. The Wikimedia\Parsoid\Ext\ParsoidExtensionAPI class provides this API. Please look at the linked docs for specific details about the interface. The following sections document the API methods broadly with some discussion of how / where to use them.

Converting wikitext to DOM

Extensions that used recursiveTagParse when interacting with the MediaWiki core parser have two different methods to choose from:

  • extTagToDOM transforms an extension tag to a DOM tree rooted in a requested wrapper tag (ex: div, span, sup). Extensions of type 2 and 3 are most likely going to use this API method.
  • wikitextToDOM transforms wikitext to a DOM tree. Extensions of type 4 are most likely going to use this API method.

The wikitext passed in to these methods is processed fully: there is no notion of partially processed wikitext in Parsoid. The options provided are declarative / semantic in nature and are listed below. Beyond these, extensions will not be able to turn specific pieces of the parsing pipeline on or off. In the long run, this makes for simpler semantics and more robust code since (a) the underlying implementation can be changed without breaking extensions and (b) wikitext doesn't behave differently inside and outside extensions, which makes for a better user experience.

  • context: With this option, you can specify the embedding context for the DOM. Currently, the only available value is inline, which specifies that the output of this wikitext will be embedded in an inline / phrasing HTML context. This effectively turns off paragraphs and pre behavior. In the future, other context values (like table cell, list item, link etc.) might be supported. Most extensions won't need to specify this option unless the extension is only meant to be used in such contexts. For example, Cite uses this option as a backward-compatibility hack to support its paragraph wrapping and space-indented-pre behavior (which don't make sense when content is meant to be used in an inline / phrasing HTML context).

If the generated DOM requires scripts or styles to be added to the page, addModules and addModuleStyles can be called respectively.

Converting HTML to DOM and DOM to HTML

During both wt2html and html2wt transformations, Parsoid maintains the DOM in an optimized form where data attributes are not directly stored on the DOM. This is an implementation detail that extensions should not be concerned about and the specifics of this representation might change in the future. However, this has implications for when extensions need to convert HTML to a DOM and vice versa.

  • htmlToDOM: Where extensions need to parse HTML and construct a DOM document (for example, creating an empty base document, or for processing HTML snippets), they should use the htmlToDOM method since it returns a DOM that is in Parsoid's canonical form. All of Parsoid core code assumes this canonical representation, and without it, extensions might experience subtle (or not so subtle) failures in certain scenarios.
  • domToHTML: For the same reason as above, where extensions need to serialize a DOM node to a string, they should use the domToHTML method, which knows about the internal data representation while serializing. This method provides two additional options. One selects innerHTML vs. outerHTML output. The other turns on a performance optimization (for callers that don't expect to continue using the DOM) that leaves the DOM in a non-canonical form.

Sanitization helpers

We originally planned to proxy a subset of sanitization helpers through the ParsoidExtensionAPI object. However, after analyzing how current extensions use the Sanitizer code, and in light of T247804 and in the interest of reducing disruption, extensions will be able to use the Sanitizer class directly once T247804 is resolved.

Converting DOM to wikitext

This part of the API is only relevant to extensions that intend to provide custom editing support for their extensions in editing clients like VisualEditor. For example, the Cite and Gallery extensions make use of this API. The methods in this section mirror those in the wikitext to DOM section.

  • extStartTagToWikitext: All extension tags will need to use this. This method takes care of converting the HTML attributes to the extension's arguments while handling Parsoid-specific annotations.
  • domToWikitext: Use this method to convert input DOM to wikitext. Extensions of type 2 or type 3 will primarily benefit from this API method. There are no options provided to specify the context since the resulting wikitext is meant to be used as is between <ext> and </ext>.
    • htmlToWikitext: Extensions that need to convert an HTML string (instead of a DOM) would use this method. As you might imagine, this is just a convenience function that chains htmlToDOM and domToWikitext internally.
  • domChildrenToWikitext: Extensions of type 4 that used the wikitextToDOM method will most likely need this method to convert DOM fragments to wikitext. The extension will have to extract relevant DOM fragments from its input DOM and convert those fragments to wikitext. This method provides additional arguments to control this conversion to wikitext. (FIXME: Should this API method get a better name?)
    • $context: This is a bit-wise OR of one or more flags that specifies context for this wikitext (ex: caption, option, link, start-of-line, etc.).
    • $singleLine: If true, this indicates that the wikitext should be a single-line output (so, for example, lists, tables, and other multi-line constructs cannot be present)
  • escapeWikitext: Type 4 extensions may need to escape wikitext-like constructs in a string so that the string can be used as part of a larger wikitext fragment without breaking the semantics of that fragment. For example, wikitext used in template arguments cannot use |, =, {{, }} as is, wikitext used in table cells cannot use |, -, +, } as is, and so on. This method lets extensions delegate this logic to Parsoid. It currently provides a few pre-defined context options, which will be expanded in the future based on usage and further analysis. Note that type 2 & 3 extensions that use htmlToWikitext or domToWikitext will not have to deal with this, since Parsoid handles it automatically on their behalf.
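To make the escaping problem concrete, here is a deliberately naive sketch for the template-argument case. The function escapeForTemplateArg is hypothetical and is not Parsoid's escaping algorithm. It simply wraps each special construct in <nowiki> and ignores context that a real implementation would have to consider.

```php
<?php
/**
 * Naive escaper for text placed in a template argument.
 * Stand-in to show the idea only; not Parsoid's escaping algorithm.
 */
function escapeForTemplateArg( string $text ): string {
    // Wrap each construct that is special in template-argument context
    // ( {{ , }} , | and = ) in <nowiki> so it is treated as literal text.
    return preg_replace( '/\{\{|\}\}|[|=]/', '<nowiki>$0</nowiki>', $text );
}

$escaped = escapeForTemplateArg( 'a|b=c' );
$untouched = escapeForTemplateArg( 'plain text' );
```

Delegating this to escapeWikitext avoids each extension having to maintain its own per-context rule set like the one hard-coded above.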

Sundry API methods

  • Extension argument methods extArgsToArray, findAndUpdateArg, addNewArg, sanitizeArgs: Please refer to the documentation for the specific details of how to use these methods, but if you need to loop over extension arguments, modify or add arguments, or sanitize them, these API methods are your friends.
  • renderMedia: Use this method if your extension needs to render an image from an image name and an array of options (each of which currently needs to be preceded by a "|" prefix) without having to construct a wikitext string from them. If your extension also intends to generate DSR information, you will need to provide source offsets for the option strings you pass into the method.
  • processHiddenHTMLInDataAttributes: If your extension content can be used in wikitext that doesn't actually render (ex: image captions for inline images, language variant markup, etc.), and you need to run a DOM processor (ex: the wtPostprocess) on all uses of the extension independent of whether it rendered or not, you will need to use this method when you walk the DOM tree by passing in a handler that can be invoked on HTML strings.
  • Many other methods to get information: Site or Page config, extension information like tag offsets, full extension source as seen on the page, whether it was self-closed, whether this extension tag was used in a template, methods to get the URI for the page, get the URI for a title, make a title, etc. Please reference the generated documentation for this class for the full listing of methods and how to use them.

Helpers and Utility classes

Besides the ParsoidExtensionAPI class, extensions also have access to the following additional classes from the Parsoid codebase and these classes are subject to the standard MediaWiki code deprecation and removal policies.

Mapping existing parser hooks

With the current MediaWiki core wikitext parser, extensions have access to a number of parser hooks at different points in the parsing pipeline. A vast majority of use cases are subsumed by the sourceToDom transformation as well as the wtPostprocess DOM pass.

In this document, we'll attempt to provide a mapping from existing parser hooks to equivalent code patterns. There is unlikely to be an exact 1:1 mapping since the processing model is quite different but for the most part, we'll provide guidelines about how to implement your use case that uses an existing parser hook.

Hook(s) | Status | Used on Wikimedia Wikis? | Notes
ParserFirstCallInit | Use Config | Yes | This should be subsumed by the extension config for the most part. You can also use the ExtensionModule::getConfig() method as a direct replacement for this hook to register extension tags and hook handlers.
ParserAfterParse, ParserAfterTidy | Use wtPostprocess | Yes | Parsoid doesn't distinguish between these states and for the most part, the wtPostprocess DOM Processor should cover use cases not handled implicitly by the sourceToDom transformation hook.
ParserCloned, ParserClearState | Likely Not Needed | Yes | As far as we can tell, these hooks might not be necessary with Parsoid at all since (a) extensions can never access Parsoid's internal state directly and can only go through the API, and Parsoid ensures clean state for every use, and (b) you cannot reliably maintain global ordered state within extensions. Nevertheless, if we find legitimate reasons to add support for this, we'll consider it.
ParserGetVariable* (3 hooks) | To be investigated | Yes | Usage in extensions needs to be investigated, but we may add support for this.
ParserLimit* | To be added | Yes | The parser limits functionality will be refactored out of Parser.php into an abstract meta parser functionality that both the core parser and Parsoid will support. As such, this set of parser hooks will be supported. The only difference is that instead of a parser object in callbacks, the ParsoidExtensionAPI object will be passed in.
ParserOptionsRegister | To be investigated | Yes | Looks like it should be possible to support. Needs a closer look at current use cases (Wikibase & SemanticMediaWiki).
InternalParseBeforeLinks | Won't Support | No? | We are extremely unlikely to support this hook. Link syntax is heavily overloaded in wikitext and is a source of a fair amount of complexity in Parsoid. If you are using this hook to modify link output, please use the wtPostprocess DOM Processor instead. Alternatively, use different syntax like an explicit parser function. We believe the SemanticMediaWiki extension has this option available in lieu of this parser hook.
ParserBeforeInternalParse | To be investigated | No | Parsoid has no notion of strip state. We need to investigate how extensions use this and what the equivalent in Parsoid might be.
BeforeParserFetchFileAndTitle, BeforeParserFetchTemplateAndTitle | To be added | No | While Parsoid currently does not have support for this, we realize these hooks might be used for access control, revision control, and remapping when media and template sources are fetched. As we learn more, we are likely going to add support for these hooks in Parsoid.
InternalParseBeforeSanitize | Deprecated Hook | No | Won't support deprecated hooks.
BeforeParserrenderImageGallery | To be investigated | No | Looks like this is a hook for the Gallery extension. Usage needs to be investigated.
ParserBeforePreprocess | To be investigated | No | Usage in extensions needs to be investigated.

... to be completed ...

Mapping parser methods to ParsoidExtensionAPI methods

Currently, extensions make use of one or more of the following methods to deal with wikitext: parse, internalParse, startExternalParse, recursiveTagParse, recursiveTagParseFully. The equivalents of these in ParsoidExtensionAPI would be one of wikitextToDOM, extArgToDOM, extTagToDOM. However, one of the significant differences in functionality is that there is no notion of "half-parsed" or "fully-parsed" wikitext in Parsoid. You always get a DOM that is processed to the same stage in the parsing pipeline.

There is also no strip-tag notion in Parsoid currently. Extensions seem to primarily make use of it to tunnel content through the parser without further processing. In Parsoid, all extension output (the DOM produced by one of the above methods) is always tunneled through the parser and expanded into the DOM before it is handed off to additional processing that operates on the final DOM (including the DOM post processors that extensions might register for). So, extensions should not have to deal with this detail. As such, you will find all such methods absent in Parsoid's extension API.

Examples

Let us look at a few simple examples that will hopefully help make some sense of how this works.

RawHTML

This extension is used by parser tests and the code below is the entirety of the extension. The code should be self-explanatory.

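use Wikimedia\Parsoid\Ext\ExtensionModule;
use Wikimedia\Parsoid\Ext\ExtensionTagHandler;
use Wikimedia\Parsoid\Ext\ParsoidExtensionAPI;

class RawHTML extends ExtensionTagHandler implements ExtensionModule {
    // Registers the <rawhtml> tag, with this same class as its handler.
    public function getConfig(): array {
        return [
            'name' => 'RawHTML',
            'tags' => [ [ 'name' => 'rawhtml', 'handler' => self::class ] ],
        ];
    }

    // Treats the tag body as raw HTML and returns it as a Parsoid DOM.
    public function sourceToDom( ParsoidExtensionAPI $api, string $src, array $args ) {
        return $api->htmlToDOM( $src );
    }
}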

Cite

Let us look at snapshots of a slightly more complex extension. The configuration for this extension is available earlier in this document (Example 2 in the extension registration section).

Ref.php

Let us take a look at the implementation of the ref tag. We won't present the entire implementation, but just a snippet of it to demonstrate the use of the API.

class Ref extends ExtensionTagHandler {
    public function sourceToDom( ParsoidExtensionAPI $extApi, string $src, array $extArgs ) {
        // ... some logic here to drop forbidden nested refs ...
        $allowNestedRef = // ... some logic here ...
        return $extApi->extTagToDOM( $extArgs, '', $src, [
            'wrapperTag' => 'sup',
            'parseOpts' => [
                'extTag' => 'ref',
                'extTagOpts' => [ 'allowNestedRef' => $allowNestedRef ],
                'context' => 'inline', // Ref content doesn't need p-wrapping or indent-pres
            ],
        ] );
    }

    public function lintHandler( ... ) { ... }

    public function domToWikitext( ... ) { ... }
}

This snippet demonstrates the use of the API to convert wikitext to DOM. That code is the entire implementation of the ref tag's processing. It simply parses the wrapped wikitext to DOM and wraps it in a <sup> tag. It does not migrate the content of the ref to the references section, nor does it leave behind a numbered link to that section. This handler cannot do either of those tasks because (a) it does not have access to the entire document, and (b) as we noted earlier, you cannot maintain global counters reliably. Both of these tasks are accomplished by the DOM processor (RefProcessor) registered under domProcessors in the config shown earlier.

RefProcessor.php

... to be completed ...

Related information

  • Retargeting your extensions to work with Parsoid: August 2020 Tech Talk: [ Video, Slides ]
  • Look at the Ext/ namespace in Parsoid docs.