Parsoid/Known differences with Core Parser output
This page tracks known HTML output differences between Parsoid and the PHP parser, along with the proposed resolution for each difference.
Differences because of implementation differences or functionality gaps
|Parsoid generates <figure> tags for block images whereas PHP parser uses <div>||This is fallout from the HTML4 / HTML5 difference. Parsoid uses semantic markup available in HTML5 that wasn't available in HTML4 when the PHP parser was written.
Once the code that changes the PHP parser output is ready to be merged, and before it is deployed, we'll work with bot and gadget authors so that they can use the new markup.
|T118517 is the RFC for updating PHP parser output (and gerrit:196532 is a WIP patch).||In progress|
|Parsoid doesn't handle language variants yet||Parsoid doesn't yet parse language variant markup and doesn't provide a variant-specific rendering for reading clients.||Language variant support in Parsoid has landed and has been deployed. The remaining work is finishing support for all languages.||In progress|
|Edge-case differences between Parsoid's native implementations of some extensions and their PHP implementations||Any extension that processes wikitext (ex: Cite, Gallery) needs a native implementation in Parsoid. However, because of implementation differences, there are edge cases where the output differs (ex: T51538, T96555, and a few others related to gallery).||Some of these (T104662, T96555) will be fixed in Parsoid. Others might be tweaked in the PHP implementation, or we might just treat the edge-case differences as undefined behavior that editors shouldn't rely on. Since these are edge cases, they should be fairly uncommon on wikis (otherwise, we would have fixed them already).||In progress|
|Unavailability of some parser hooks in Parsoid compared to PHP parser||Parsoid and PHP parser have different internals and hence not all the PHP parser's tag hooks are available in Parsoid. This page with parser hook stats lists extensions and the parser hooks they use. Some hooks like ParserBeforeStrip, ParserAfterStrip have no equivalent in Parsoid. So, in a Parsoid-only world, this could affect output and functioning of extensions like
||We are going to develop a parser hooks API that is implementation independent (without exposing the internal details of how parsing happens) and port all the Wikimedia extensions to use this new API.
Parsoid is developing an extension API to support existing Parsoid-native extensions cleanly (Cite, Gallery, Poem, etc). We plan to extend the API gradually based on experience with adapting more extensions to work with Parsoid. In parallel, we will continue to deprecate unnecessary hooks and possibly rename some to reflect desired semantics.
This task is likely going to be completed after Parsoid moves to core.
|Parsoid doesn't handle pages in some namespaces properly (ex: File, Category)||Parsoid doesn't have special handling for pages in namespaces that have generated content. For example, the content for a page in the Category namespace is generated dynamically. A page in the File namespace similarly has some generated content. There is a good argument that Parsoid shouldn't duplicate this support and that clients should fetch this content from the MediaWiki API directly. However, this leaves Parsoid clients in a bit of a bind, because they don't know which namespaces are special in the sense that their content is better fetched from the MediaWiki API. A good resolution of this problem would be helpful. Maybe Parsoid should handle requests for content in all namespaces and, where that content is better served from the MediaWiki API, redirect the client to the right URL?||To be investigated and resolved.
This task is likely going to be picked up after Parsoid moves to core.
|Parsoid doesn't generate metadata needed for updating the links and page_props tables.||We'll have to add that at some point before Parsoid can replace the existing Parser class.||To do|
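The namespace question raised in the table above (whether a client should ask Parsoid or the MediaWiki API for a page's content) can be illustrated with a minimal client-side sketch. It assumes that only the two namespaces named in the table, File and Category, are treated as having generated content. The IDs 6 and 14 are MediaWiki's standard numbers for those namespaces. The function name and the returned labels are hypothetical and not part of any Parsoid or MediaWiki API.

```python
# Hypothetical sketch: decide where a client should fetch page content from.
# Assumption: only File (6) and Category (14) are "generated content"
# namespaces; a real deployment might need a longer or configurable list.
GENERATED_CONTENT_NAMESPACES = {6: "File", 14: "Category"}


def choose_content_source(namespace_id: int) -> str:
    """Return 'mediawiki-api' for generated-content namespaces, else 'parsoid'."""
    if namespace_id in GENERATED_CONTENT_NAMESPACES:
        return "mediawiki-api"
    return "parsoid"


if __name__ == "__main__":
    # Main namespace (0), File (6), and Category (14).
    for ns in (0, 6, 14):
        print(ns, choose_content_source(ns))
```

If Parsoid instead answered requests for every namespace and redirected where appropriate, as the table suggests, clients would not need to maintain a list like this.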
Differences identified via visual diff testing
We run mass visual diff tests comparing rendering of Parsoid output and PHP parser output. This table will be filled out as we inspect the visual diffs and identify the underlying cause for those diffs. In addition to the above source of diffs, here are a few more specific ones that we discovered.
|Difference||Explanation||Bug / Proposed Resolution||Status|
|Long tail of bugs related to read views||Bugs filed under the Read Views column on the Parsoid Phabricator workboard||Fix all the bugs!||In progress|
|Missing resource modules in Parsoid output||http://sv.wikipedia.org/wiki/Mir has a bunch of modules (ext.gadget.*) which the Parsoid output is missing||T161278||In progress|
|CSS differences in Cite||Cite output needs styling (T156351 and T156350). This should also cover the styling requirements for cite ref links: some wikis like eswiki and frwiki skip the brackets. In addition, knwiki (Kannada) uses Kannada numerals for the ref text.||The necessary styles for these various wikis are being added to the visual diffing code. Most of these wiki-specific styles are good candidates for adding to commons.css on the respective wikis.
However, as part of this, we've also identified some limitations in the Cite CSS output. We'll have to figure out how to resolve that.
| In progress
Stalled on trying to figure out how general i18n support in Parsoid should work vis-a-vis visual editing.
|Broken / missing support for some extensions||Pages extension output for wikisource pages is missing some wrapping divs (with associated styles). (Example)
Pages on viwiki are missing mapframe / osm maps (Example)
|To be investigated||To do|
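The "missing resource modules" row above amounts to a set difference between the modules each rendering loads. A minimal sketch of that comparison follows. It assumes the module names have already been extracted from each rendering into plain lists. The helper name and the sample module names are illustrative placeholders, not data taken from the page cited in the table.

```python
def missing_modules(php_modules, parsoid_modules):
    """Return, sorted, the modules present in the PHP parser rendering
    but absent from the Parsoid rendering."""
    return sorted(set(php_modules) - set(parsoid_modules))


if __name__ == "__main__":
    # Placeholder module lists for illustration only.
    php = ["ext.cite.styles", "ext.gadget.Example", "site.styles"]
    parsoid = ["ext.cite.styles", "site.styles"]
    print(missing_modules(php, parsoid))
```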
- https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/513692/2/src/Config/Env.php#721 has some commentary from Brad about handling page metadata that Parsoid would need to emit when it is slated to replace the current parser.
- The known differences column on the Parsoid phab board