Parsing/Notes/Moving Parsoid Into Core

Background

During the 2016 Parsing Team Offsite, the parsing team decided to evaluate a port of Parsoid into PHP so that it could be integrated into MediaWiki core. That was not a commitment to do the evaluation immediately and, for a number of reasons, it stayed on the back burner. Since then, however, a number of developments have built up considerable momentum behind an attempt to move Parsoid into MediaWiki core. The rest of this document explores this direction, including the why, the how, and the risks and concerns.

Why move Parsoid into Core?

There are two somewhat independent reasons for this.

Architectural concerns

Parsoid was originally developed as an independent stateless service. The idea was to be able to make parsing independent of the rest of MediaWiki. In an ideal interface, MediaWiki would send the parsing service the page source and any additional information and get back HTML that could be further post-processed.

Unfortunately, that ideal has been hard to meet because of the nature of wikitext. As it is used today, wikitext doesn't have a processing model that lends itself to a clean separation of the parsing interface with the rest of MediaWiki. The input wikitext -> output HTML transformation depends on state that Parsoid does not have direct access to. To deal with this, Parsoid makes API calls into MediaWiki to fetch all this state - processed templates, extensions, media information, link information, bad images, etc.

Without additional cleanup to the semantics and processing model of wikitext, a cleaner separation of concerns is hard to accomplish. That will eventually happen and is on our radar. But, for now, without going into too many additional details, the service boundary around Parsoid as it exists today is architecturally inelegant. It is the source of a large number of API calls and introduces performance overheads in two ways: (a) network traffic and (b) MediaWiki API startup overhead for every API call that Parsoid makes. With an ideal service boundary, most of these API calls would be serviced internally as function calls.

There are two ways to address the architectural boundary issue and transform most of the API calls into function calls: either (a) migrate more pieces of MediaWiki core into the parsing service, or (b) integrate the parsing service back into core. Solution (a) is a hard sell for a number of reasons. It is also a slippery slope, since it is unclear how much code would have to be pulled out and what other code dependencies that might expose. Given that, if we want to address this architectural issue, the best solution for now is to move Parsoid back into core. Ideally, though, this would be done in a way that abstracts the parser behind an interface that can be gradually refined to let us pull it out as a service again in the future, if desirable.
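The idea of hiding the parser's data needs behind an interface can be sketched as follows. This is only an illustration under stated assumptions: the class names (DataAccess, ApiDataAccess, InProcessDataAccess), the getPageInfo method, and the request and response shapes are all hypothetical and are not taken from Parsoid or MediaWiki. The point is that parser-side code talks to one interface, which can be backed either by API requests (the current service boundary) or by direct function calls (the in-core case).

```javascript
// Hypothetical interface: the state the parser needs from MediaWiki.
class DataAccess {
    // Resolves to an object mapping title -> { exists: boolean }.
    async getPageInfo(titles) {
        throw new Error('getPageInfo() must be implemented by a subclass');
    }
}

// Service-boundary variant: each lookup is an API request.
// `apiRequest` is an injected function (params) => Promise<object>; the
// parameter and response shapes are illustrative, not the real API.
class ApiDataAccess extends DataAccess {
    constructor(apiRequest) {
        super();
        this.apiRequest = apiRequest;
    }

    async getPageInfo(titles) {
        const res = await this.apiRequest({ action: 'query', titles: titles.join('|') });
        return res.pages;
    }
}

// In-core variant: the same lookup is a plain function call.
// `lookup` stands in for whatever core code can answer "does this page exist?".
class InProcessDataAccess extends DataAccess {
    constructor(lookup) {
        super();
        this.lookup = lookup;
    }

    async getPageInfo(titles) {
        const out = {};
        for (const title of titles) {
            out[title] = { exists: this.lookup(title) };
        }
        return out;
    }
}

// Parser-side code depends only on the interface, not on the transport.
async function findRedLinks(dataAccess, titles) {
    const info = await dataAccess.getPageInfo(titles);
    return titles.filter((title) => !info[title].exists);
}
```

With this shape, swapping the service-backed implementation for the in-process one (or back again later) should not require changes to callers such as findRedLinks.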

Third party MediaWiki installation concerns

Parsoid was written in node.js (a) to minimize I/O blocking from API calls by leveraging async functionality, (b) because PHP5 had much poorer performance than JavaScript, and (c) because HTML5 parsing libraries were readily available for JavaScript but not for PHP.

But this non-PHP service meant that third party MediaWiki users would have to install and configure a new component. In some cases (like shared hosting services), it might not even be possible to install node.js and hence Parsoid. There has been a lot of debate and discussion about this shared hosting limitation and how far we should go to support this case. As part of the MediaWiki Platform team's work, a clearer direction is likely to emerge with respect to this.

2018 is not 2012

Several things are different in 2018 compared to 2012.

Back in 2011/2012, when Parsoid was initiated, a PHP implementation was not viable. The original plan in 2012 was to eventually port the node.js prototype into C++ to get the best performance.
- Unlike the PHP parser, Parsoid does a lot more work to provide the functionality and support for features that depend on it. In 2012, MediaWiki ran PHP5 which would not have been performant enough for Parsoid's design. In 2018, PHP7 has much better performance compared to PHP5.
- In 2012, there was no HTML5 parsing library available in PHP. In 2018, there is RemexHtml, which has been developed with performance in mind.
In 2012, Parsoid was an experimental prototype designed to support the VisualEditor project with a tight deployment timeline. In that context, node.js was the best choice for that time since the feasibility of this project in terms of its ability to support all the wikitext out there was far from clear. In 2018, Parsoid is an established project which has proved its utility multiple times over and is on the way to supplanting the legacy PHP parser in MediaWiki.
In 2018, we have a much better sense of the core pieces of the parsing pipeline: PEG tokenizer, token transformers, HTML5 tree builder, DOM transformations. RemexHtml has a high-performance tree building implementation. PHP's DOM implementation is C-backed and has competitive, if not better, performance compared to domino's implementation. Performance of PHP-based token transformers and PEG tokenizer is unknown at this time and is something to be prototyped and evaluated.

Overall, given this change of context both from a PHP-implementation feasibility and Parsoid's utility point of view, we feel it is time to attempt a closer integration of Parsoid into MediaWiki via a PHP port that addresses both the architectural concerns as well as the third party installation concerns. In 2012, this would have been a questionable decision. Even in 2016, we didn't have sufficient clarity that this was the right decision. But, after a lot of discussions and some early experiments, we have a bit more understanding that gives us confidence that this might be feasible.

This direction is not without its risks. Let us now examine those and think about how we can address them.

Concerns & Risks

Performance

Parsoid generates HTML with auxiliary markup and information that lets it convert edited HTML to wikitext without introducing dirty diffs. As a result, it does more work to generate its output HTML than the PHP parser does. While unifying and consolidating behind a single parser, the goal is to minimize the performance gap, given that Parsoid does additional work. However, if the equivalent PHP code performs worse than node.js, this can widen the difference between the old parser and Parsoid.

That said, as noted in the previous section, we are cautiously optimistic that this performance gap might not come to pass.

Higher request startup costs

A related performance concern is that the MediaWiki request startup overhead is higher than with node.js. Every MediaWiki request loads the config from scratch, unlike the node.js express setup, where the Parsoid, wiki, and other config is pre-loaded and shared across requests. But that is how MediaWiki has always worked, so, if required, we'll have to redo some of the startup code to not assume cached access to config across requests.

Loss of async features

Parsoid relies heavily on the node.js async event loop to overlap I/O requests and CPU processing. This will no longer be available in PHP. But, with the move to core, Parsoid-PHP will not incur cross-service network and MediaWiki API startup latencies, only database (MariaDB / MySQL) access latencies, which are expected to be lower. So, the I/O waits are expected to be smaller and similar to what the existing PHP parser incurs. Nevertheless, this might still require some changes to Parsoid, such as migrating media processing to a post-processing step and other ways of batching database accesses.
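Moving media processing to a batched post-processing step could look roughly like the sketch below. It is a simplified illustration, not Parsoid code: the flat node list, the node fields (type, name, info), and the fetchMediaInfoBatch callback are assumptions made up for the example. The first pass only records which files are referenced; a single batched lookup then annotates every media node.

```javascript
// First pass: walk the already-built document and only record which media
// files are referenced. No lookups happen here.
function collectMediaNames(nodes) {
    const names = new Set();
    for (const node of nodes) {
        if (node.type === 'media') {
            names.add(node.name);
        }
    }
    return [...names];
}

// Post-processing pass: one batched lookup, then annotate every media node.
// `fetchMediaInfoBatch` is a hypothetical function (names[]) => Map(name -> info).
function addMediaInfo(nodes, fetchMediaInfoBatch) {
    const names = collectMediaNames(nodes);
    // Skip the lookup entirely when the page references no media.
    const infoByName = names.length > 0 ? fetchMediaInfoBatch(names) : new Map();
    for (const node of nodes) {
        if (node.type === 'media') {
            node.info = infoByName.get(node.name) || null;
        }
    }
    return nodes;
}
```

Because duplicate references collapse into one set entry, a page that uses the same file many times still costs a single lookup for that file.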

Longer-term, however, as we evolve wikitext semantics to enable features like independent parsing of sections of a page, we will still need to figure out ways to do concurrent / parallel parsing of page fragments. This is not required immediately, but we highlight it here so that we are aware of this requirement going forward.

Independent deploys

Parsoid has so far been deployed independent of MediaWiki and on several occasions since its first deployment in 2012, it has been updated multiple times a week, independent of the status of the MediaWiki train. With the move to core, this flexibility could be lost.

That said, Parsoid has been changing less frequently and could adapt to a once-a-week deployment schedule, if required. Another option to explore would be to make Parsoid-PHP a PHP library that is independently pulled in and deployed via Composer.

Testing

Before each deployment, Parsoid code is pushed through round trip testing on live pages from multiple Wikimedia wikis. The node.js incarnation of Parsoid is able to query the MediaWiki API of any public wiki for this testing. But, when integrated with core, this feature will no longer be available unless we build in support for pointing MediaWiki at an external wiki. That is possible but potentially cumbersome.

That said, if we run round trip testing on a production cluster server (in read-only mode), Parsoid-PHP should be able to query the production databases directly, and hence we can continue doing this round trip testing on live pages from Wikimedia wikis.

But this will nevertheless be a minor loss of functionality, since we are currently able to point Parsoid at any public wiki and process the wikitext of any page there. This feature is especially valuable while debugging and helping troubleshoot third party installations.

How do we do this?

Here is a rough outline of a plan to do this. We'll create a subproject in Phabricator and file tasks for these.

Understand and evaluate performance of the four components of the wt2html pipeline
- PEG (Done, T204617)
- Token transformers (Done, T204598)
- Tree builder (Partially done, T204595)
- DOM transformers (Done, T204614)
Identify a suitable replacement for PEG.js (Done, T204617. Decided on option 3)
- Option 1: add PHP code generation to Tim's fork along the lines of https://github.com/nylen/phpegjs
- Option 2: upstream some of Tim's fork changes, abandon fork and use phpegjs
  - Have contacted the new upstream maintainer almost 6 weeks back -- yet to hear back. Will ping again.
- Option 3: https://github.com/smuuf/php-peg ... see if our PEG grammar can be made functional with that parser generator
- Option 4: https://github.com/gpakosz/peg ... see if our PEG grammar can be made functional with this C parser generator ... option if we have serious perf issues (but this doesn’t solve the “pure PHP” requirement for our port)
Experiment with (potentially throwaway) ports of some pieces of the codebase to understand the stumbling blocks / pitfalls to the porting process (in progress)
Investigate automatic JS -> PHP transpilers
- https://github.com/endel/js2php
- Here’s a mostly-automatic translation of QuoteTransformer.js with that tool: https://gist.github.com/subbuss/ffcfed641784f64945ebd7e49fc448cb
Code changes in Parsoid based on the above experiments
- Eliminate circular dependencies, if possible
- Use a JS class pattern in more places
- One export per file, and it should be a “real” ES6 class
  - No method initialization after class creation (Foo.prototype.method = ….)
  - Be careful with constants; perhaps turn them into static methods
  - Be careful with static properties; perhaps turn them into static methods
  - No lonely top-level functions/variables; everything should be inside a class scope.
- Remove dead code from Parsoid
- Try and remove console.log statements from code wherever possible
  - Probably should refactor our logger implementation to better match MediaWiki’s logging framework so that porting is unsurprising/mechanical
- Replace .bind with appropriate lambda functions
- Use async/yield instead of Promises everywhere that is going to be ported to PHP (except TokenTransformManager maybe)
- For token transformers, maybe extract an API and use class inheritance that implements the API
  - TokenHandler right now implements a stub (init, constructor, reset)
  - Most transformers have a handler for:
    - All tokens (onAny)
    - Newline token (onNewline)
    - EOF token (onEOF)
    - Specific token types ← this would need some generic handler name / signature instead of custom ones (onQuote, onListItem, etc.)
  - Flesh out TokenHandler abstract class based on the above
- Similarly, we have a domHandler interface spec’d out at the top of DOMHandlers.js that should be solidified in code
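The class-pattern and TokenHandler items above can be illustrated with a small sketch. It is hedged throughout: the handler names onAny, onNewline and onEOF and the init / reset stubs come from the list above, while onTag (as the generic replacement for onQuote, onListItem, etc.), rank(), the token shapes, the QuoteCounter subclass, and the runTokens dispatcher are hypothetical names invented for illustration. The dispatch is also simplified, so it should not be read as Parsoid's actual token pipeline.

```javascript
// Hypothetical abstract base class for token transformers.
class TokenHandler {
    constructor(manager, options) {
        this.manager = manager;
        this.options = options;
        this.init();
    }

    // A constant exposed as a static method rather than a static property.
    static rank() { return 0; }

    // Stubs that subclasses may override.
    init() {}
    reset() {}

    // Default handlers pass the token through unchanged.
    onAny(token) { return { tokens: [token] }; }
    onNewline(token) { return { tokens: [token] }; }
    onEOF(token) { return { tokens: [token] }; }
    // Generic entry point in place of custom names like onQuote / onListItem.
    onTag(token) { return { tokens: [token] }; }
}

// Example subclass: all methods are defined inside the class body,
// with no Foo.prototype.method = ... assignments afterwards.
class QuoteCounter extends TokenHandler {
    init() { this.reset(); }
    reset() { this.quotes = 0; }
    onTag(token) {
        if (token.name === 'mw-quote') {
            this.quotes++;
        }
        return { tokens: [token] };
    }
}

// Simplified dispatcher. An arrow function captures `handler` lexically,
// so no .bind() call is needed.
function runTokens(handler, tokens) {
    const dispatch = (token) => {
        switch (token.type) {
            case 'newline': return handler.onNewline(token);
            case 'eof': return handler.onEOF(token);
            case 'tag': return handler.onTag(token);
            default: return handler.onAny(token);
        }
    };
    const out = [];
    for (const token of tokens) {
        out.push(...dispatch(token).tokens);
    }
    return out;
}
```

Code shaped this way (one real class, no post-hoc prototype assignment, no free-floating helpers that rely on `this`) should map fairly directly onto a PHP abstract class with subclasses.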
Other big changes in Parsoid that we need to do prior to the port
- T153080 (batching imageinfo updates at the end just like redlinks instead of doing inline where they occur in the source)
Things we don’t need to port
- Testreduce codebase
- Visualdiff codebase
- Roundtrip-test.js script
- Regression test script
- LanguageConverter (will have PHP version)
- …
Things that can be dropped
- Async TTM
- Test syncing scripts
- …
Potential things to consider during the port
- Drop the Parsoid API and hook RESTBase directly with Parsoid via function calls / VirtualRESTAPI
  - Reason: equivalent express functionality might not be available in MediaWiki
- Research what the complexity of doing Parsoid’s API / URL routing in MediaWiki core would be
  - Ensure a clean separation between HTTP handling and Parsoid internals
    - Idea being that all the same info from RESTBase will probably be present (content-type, etc.) but maybe in different forms (i.e., query parameters instead of HTTP headers). So draw a line between the “extract information from HTTP request” part and the “do something useful with the extracted info” part.
  - Figure out what a reasonable “action API” implementation of Parsoid’s interface would be
  - Determine how much of the Parsoid API is actively used by RESTBase, with the idea of simplifying the port by only supporting the minimal necessary API.
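
The separation between "extract information from HTTP request" and "do something useful with the extracted info" mentioned above might be sketched like this. The function names, the request shape (an express-style object with headers, query and body), the contenttype query parameter, and the response objects are all assumptions for illustration and do not describe Parsoid's actual API.

```javascript
// Step 1: HTTP-specific. Turn a request into a plain, transport-neutral object.
// `req` mimics an express-style request ({ headers, query, body }).
function extractParseRequest(req) {
    const headers = req.headers || {};
    const query = req.query || {};
    const body = req.body || {};
    return {
        // The same information may arrive as a header or as a query parameter.
        contentType: headers['content-type'] || query.contenttype || 'text/plain',
        title: query.title || body.title || null,
        wikitext: typeof body.wikitext === 'string' ? body.wikitext : '',
    };
}

// Step 2: transport-neutral. Knows nothing about HTTP headers or routing.
// `parse` is an injected function (wikitext, title) => html.
function handleParseRequest(params, parse) {
    if (params.title === null) {
        return { status: 400, body: 'missing title' };
    }
    return { status: 200, body: parse(params.wikitext, params.title) };
}
```

Only the first function would need to change if the same information arrived through a different entry point, such as an action API module or a direct function call from RESTBase.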