The goal of this project is to arrive at a single parser that supports all clients and use cases.
This page tracks the unification project at a high level and links to other pages with additional details. Updates will continue to be published here. The project is primarily driven by the Parsing team, with participation from the MediaWiki Platform team, all the internal teams that develop Parsoid clients, the Community Liaisons team, Wikimedia wiki editor communities, and third-party MediaWiki projects, since parser unification will touch them all.
As of July 1, 2018, this work will be undertaken as part of the Platform Evolution CDP.
In this quarter, we will be preparing the Parsoid codebase for prototyping a port. Specifically, we will be working towards the following:
- Implement unit testing and performance testing features: These features let us port individual token and DOM transformers, verify their correctness, and test their performance without needing a fully functional port.
- Migrate more promises in Parsoid to use newer async/yield code patterns: the benefit of this code pattern is that the code reads as if it is synchronous code and is readily migratable to PHP.
- Ensure the PHP parser and Parsoid generate similar media output
- Explore migrating media processing to a post-processing step: This frees the core parsing step from blocking on database access.
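To illustrate the async/yield item in the list above, here is a minimal sketch of the general pattern, not Parsoid's actual code. The names `Backend`, `renderPage`, `runAsync`, and `runSync` are hypothetical and exist only for this example. The generator body reads top to bottom like blocking code, and a small driver decides how each yielded value is resolved. Error propagation (resuming the generator with a rejection) is omitted for brevity.

```typescript
// Hypothetical backend: each call may answer immediately or with a promise.
interface Backend {
  fetchSource(title: string): string | Promise<string>;
  expand(source: string): string | Promise<string>;
}

// A task yields plain values or promises and is resumed with the resolved string.
type Task<R> = Generator<string | Promise<string>, R, string>;

// Reads like synchronous code; each `yield` marks a point where the
// driver may have to wait for a result before resuming the body.
function* renderPage(backend: Backend, title: string): Task<string> {
  const source = yield backend.fetchSource(title);
  const expanded = yield backend.expand(source);
  return `<p>${expanded}</p>`;
}

// Promise-based driver: awaits every yielded value, then resumes the task.
async function runAsync<R>(task: Task<R>): Promise<R> {
  let step = task.next();
  while (!step.done) {
    step = task.next(await step.value);
  }
  return step.value;
}

// Blocking driver: a stand-in (an assumption of this sketch) for a runtime
// where every call returns its result directly.
function runSync<R>(task: Task<R>): R {
  let step = task.next();
  while (!step.done) {
    const value = step.value;
    if (value instanceof Promise) {
      throw new Error("blocking driver received a promise");
    }
    step = task.next(value);
  }
  return step.value;
}

// Backend whose calls return immediately.
const blockingBackend: Backend = {
  fetchSource: (title) => `source of ${title}`,
  expand: (source) => source.toUpperCase(),
};

const html = runSync(renderPage(blockingBackend, "Main Page"));
console.log(html); // prints "<p>SOURCE OF MAIN PAGE</p>"
```

Because `renderPage` contains no promise chaining of its own, the same body runs unchanged under either driver. That is one way to read the claim that this style translates more directly into sequential code.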
January 2018 - June 2018
In this timeframe, we did a bunch of early experiments to get a sense of feasibility of a PHP port of Parsoid.
Separately, we added unit testing features to Parsoid to let us port, test, and benchmark individual token transformers without requiring all of Parsoid to be ported. In Q1, we will be porting this feature to PHP and then porting a couple of token transformers to get a handle on the complexity and performance of token transformers in PHP.
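As a rough illustration of what testing a single token transformer in isolation can look like, here is a minimal sketch. The `Token` shape, the `mergeText` transformer, and the `checkTransformer` and `timeTransformer` helpers are all hypothetical. They are not Parsoid's real token classes, transformers, or test tooling.

```typescript
// Hypothetical token shape; real parser tokens carry far more data.
type Token =
  | { kind: "text"; value: string }
  | { kind: "tag"; name: string };

type TokenTransformer = (tokens: Token[]) => Token[];

// Example transformer under test: merges adjacent text tokens.
const mergeText: TokenTransformer = (tokens) => {
  const out: Token[] = [];
  for (const tok of tokens) {
    const last = out[out.length - 1];
    if (tok.kind === "text" && last !== undefined && last.kind === "text") {
      out[out.length - 1] = { kind: "text", value: last.value + tok.value };
    } else {
      out.push(tok);
    }
  }
  return out;
};

// Correctness check: compare serialized output against the expected tokens.
function checkTransformer(
  transform: TokenTransformer,
  input: Token[],
  expected: Token[],
): boolean {
  return JSON.stringify(transform(input)) === JSON.stringify(expected);
}

// Rough timing: run the transformer repeatedly, return elapsed milliseconds.
function timeTransformer(
  transform: TokenTransformer,
  input: Token[],
  runs: number,
): number {
  const start = Date.now();
  for (let i = 0; i < runs; i++) {
    transform(input);
  }
  return Date.now() - start;
}

// Fixture: an input token stream and the output the transformer should produce.
const sampleInput: Token[] = [
  { kind: "text", value: "Hello, " },
  { kind: "text", value: "world" },
  { kind: "tag", name: "br" },
  { kind: "text", value: "!" },
];
const sampleExpected: Token[] = [
  { kind: "text", value: "Hello, world" },
  { kind: "tag", name: "br" },
  { kind: "text", value: "!" },
];

console.log(checkTransformer(mergeText, sampleInput, sampleExpected)); // prints true
console.log(timeTransformer(mergeText, sampleInput, 10000), "ms for 10000 runs");
```

Fixtures of this kind are plain data, so in principle the same input and expected token streams could be replayed against a ported transformer to compare both correctness and timing.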
The two parsers use different internal processing models to convert wikitext to HTML.
The PHP parser is largely based on string manipulation via regular expressions with a goal of low latency conversion from wikitext to HTML.
Parsoid was born out of the VisualEditor project to support visual editing, which required bidirectional conversion between wikitext and HTML, with additional constraints on the wikitext generated from edited HTML. In 2012, when this project was in its infancy, it wasn't fully clear how viable the entire project was or where it would go. Since then, Parsoid has proved to be a successful project in its own right and has supported a number of additional projects beyond VisualEditor.
Since around 2015, it has been clear that this two-parser situation is untenable in the long term and that we need to consolidate behind a single parser.
The long and short of it is that there are two aspects to arriving at a single parser.
- Bridging the differing processing models and consequent output and feature differences between the two parsers
- Addressing the language and architectural differences between the two parsers - the Parsing/Notes/Two Systems Problem page documents the differences between the two parsers and various possible scenarios for what the unified parser is going to look like. If you are interested in more details, please check out that page.
We are tackling these two aspects / work categories concurrently.
Replacing the HTML4-based Tidy with the HTML5-based RemexHtml was one of the biggest projects under the first work category, and it has independent utility and purpose above and beyond the parser unification project. In addition, we have been continuously addressing the long tail of incompatibilities between the two parsers while continuing to address editing client features and requests.
As for the second work category, after a lot of internal debate and discussion, we have started evaluating and prototyping a port of Parsoid into PHP. Please check the Parsing/Notes/Moving Parsoid Into Core page for more details and background about this aspect of the parser unification project.