Simplifying the Wikimedia Foundation technology stack

Note: This is more of a position statement and a starting point for discussion than a roadmap.

As a small organisation serving a large user base, the Wikimedia Foundation (WMF) needs to be lean and efficient in both software development and server hosting.

MediaWiki is a successful, community-developed open source product. Over the last two years, 72% of commits to MediaWiki core were made by Wikimedia Foundation staff, 22% by third-party developers, and 6% by Wikimedia Deutschland staff. In Gerrit-hosted extension repositories over the same period, 53% of commits were made by Wikimedia Foundation staff, 40% by third-party developers, and 7% by Wikimedia Deutschland staff.

MediaWiki is large: ~520 kloc in the core, and ~2700 kloc in extensions, not including libraries. In our opinion, it is not feasible or advisable to rewrite it from scratch at this point.

There has been a trend of rewriting small parts of MediaWiki or developing new features as Node.js services. The MediaWiki core remains closely integrated with these services. The result is that instead of eliminating a language barrier (as Netflix did with their microservice architecture), we are shifting a pre-existing barrier from the edge of the server stack to the middle. It is difficult for services to achieve feature parity with MediaWiki, or to achieve widespread adoption by users, with the result that we are committing to code duplication rather than replacement.

Duplicating parts of the infrastructure also has negative consequences for server spending. For example, we have a RESTBase/Parsoid stack, duplicating the MediaWiki/ParserCache stack in order to achieve a desired variation in the output format.

Additional services make MediaWiki more difficult to install and use, for both developers and non-WMF users. The RESTBase documentation advises users not to bother installing RESTBase in front of Parsoid unless they have a pressing need, whereas MediaWiki’s equivalent ParserCache module is available to all users by default, with no configuration required.

The MediaWiki Platform Team believes that current user needs could be met with a simpler architecture, in which server-side code is preferentially written in PHP and integrated with MediaWiki.

A REST endpoint in PHP

The request routing component of RESTBase could be profitably reimplemented in PHP as part of MediaWiki. We don’t find the performance difference between Node.js and PHP to be a compelling reason to split an important API into a separate service.

Recall that one of the reasons for preferring a REST API over the parameterized Action API is that a REST API is easier to cache in the CDN. Because most requests would be served from the CDN rather than hitting the backend, PHP should have adequate performance to serve as a REST backend. Several REST modules are backed by functionality that already lives inside MediaWiki, so using MediaWiki as the REST endpoint avoids additional network overhead.

A REST API in the MediaWiki core would take advantage of MediaWiki’s extension registration system, routing requests to the relevant extension. No configuration would be required from the installer or packager, and extensions would register routes at fixed relative paths. Compare this to, for example, the configuration required to integrate Mathoid with RESTBase.
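To make the shape of this concrete, here is a minimal sketch of what declarative route registration and a PHP handler could look like. None of it exists today: the RestRoutes attribute, the handler interface, the response object and the MathRenderer call are all placeholders for illustration.

<syntaxhighlight lang="php">
<?php
/**
 * Hypothetical sketch only. An extension would declare its routes in
 * extension.json, e.g.:
 *
 *   "RestRoutes": [
 *       { "path": "/math/v1/render/{format}", "handler": "MathRestHandler" }
 *   ]
 *
 * and the core router would instantiate the named handler class.
 */

/** Minimal interface the core router would call. */
interface RestHandler {
	/**
	 * @param array $pathParams Parameters extracted from the route path
	 * @param string $body Raw request body
	 * @return RestResponse
	 */
	public function execute( array $pathParams, $body );
}

/** Simple value object representing the response. */
class RestResponse {
	public $status;
	public $headers;
	public $body;

	public function __construct( $status, array $headers, $body ) {
		$this->status = $status;
		$this->headers = $headers;
		$this->body = $body;
	}
}

/** Example handler a Math extension might register. */
class MathRestHandler implements RestHandler {
	public function execute( array $pathParams, $body ) {
		// Render the TeX in the request body using the extension's
		// existing PHP code path; renderToSvg() is a placeholder.
		$svg = MathRenderer::renderToSvg( $body, $pathParams['format'] );

		return new RestResponse( 200, [
			'Content-Type' => 'image/svg+xml',
			// Cacheable, so the CDN absorbs repeat requests.
			'Cache-Control' => 'public, max-age=86400',
		], $svg );
	}
}
</syntaxhighlight>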

If necessary, high-traffic paths such as Mathoid rendered images could be routed to non-PHP endpoints by Varnish, as a WMF-specific optimisation.

Parsoid integration with MediaWiki

Parsoid provides rich annotation and manipulation of wikitext, and a DOM-centric structured view of the document, which is necessary for VisualEditor and similar use cases. Its formal approach to parsing is a great improvement over MediaWiki’s parser, providing better predictability for end users and less chance of security vulnerabilities.

However, the choice of language means that certain components of MediaWiki need to be reimplemented in Node.js. MediaWiki’s preprocessor and certain details of media handling were not implemented in Parsoid, and doing so is not feasible; as a result, Parsoid may need to make hundreds of requests to MediaWiki in order to generate the HTML for a single page. The boundary between MediaWiki and Parsoid seems inelegant.

We see Parsoid as the future of wikitext processing, but we would like to investigate strategies for better integrating Parsoid with MediaWiki. Ideally, this would mean rewriting most of Parsoid in PHP.

Parsoid’s first stage is a formal grammar (a PEG), from which JavaScript code is generated; the generated code recognises wikitext and produces a token stream. It may be possible to instead use a library such as ANTLR to generate either C or PHP code. For optimal performance, the C code would be dynamically linked into HHVM or the PHP runtime. For ease of use, the PHP code could be used instead; being generated from the same grammar, it would be guaranteed to match the behaviour of the C code.
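To sketch how the generated code could be made pluggable, the factory below prefers a hypothetical compiled tokenizer extension when it is loaded, and otherwise falls back to PHP code generated from the same grammar. All of the names here are invented; only extension_loaded() is a real PHP function.

<syntaxhighlight lang="php">
<?php
/**
 * Hypothetical sketch: neither the compiled extension nor the generated
 * PHP tokenizer exists yet; the class and extension names are placeholders.
 */
interface WikitextTokenizer {
	/**
	 * @param string $wikitext
	 * @return array A token stream for the later Parsoid stages
	 */
	public function tokenize( $wikitext );
}

class WikitextTokenizerFactory {
	/**
	 * Prefer the tokenizer generated as C from the PEG and compiled into
	 * HHVM/PHP when it is available; otherwise use the PHP code generated
	 * from the same grammar, which should behave identically, only slower.
	 */
	public static function create() {
		if ( extension_loaded( 'wikitext_peg' ) ) {
			return new NativePegTokenizer();   // thin wrapper around the C code
		}
		return new GeneratedPhpTokenizer();    // pure-PHP output of the grammar
	}
}
</syntaxhighlight>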

This approach should allow Parsoid to be integrated into MediaWiki with a minimal performance penalty. However, there is technical risk, and the concept needs to be prototyped. One strategy for evaluation would be to port Parsoid’s DOM postprocessor passes to PHP and to benchmark them using intermediate DOM output from Parsoid.
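A rough benchmark harness for that evaluation could look like the sketch below. It assumes Parsoid’s intermediate DOM output has been dumped to HTML files, and that a single postprocessor pass has been ported to PHP as a hypothetical portedDomPass() function.

<syntaxhighlight lang="php">
<?php
/**
 * Hypothetical harness: portedDomPass() stands in for one DOM postprocessor
 * pass ported from Parsoid, and /tmp/parsoid-dom for a directory of
 * intermediate DOM dumps.
 */
function benchmarkPass( $dumpDir ) {
	$total = 0.0;
	foreach ( glob( "$dumpDir/*.html" ) as $file ) {
		$doc = new DOMDocument();
		// Suppress warnings about HTML5 elements unknown to libxml2.
		@$doc->loadHTML( file_get_contents( $file ) );

		$start = microtime( true );
		portedDomPass( $doc );
		$total += microtime( true ) - $start;
	}
	return $total;
}

printf( "Total time spent in ported pass: %.3f s\n", benchmarkPass( '/tmp/parsoid-dom' ) );
</syntaxhighlight>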

Note that the MediaWiki parser was designed with performance as its primary concern. Parsoid in Node.js is not as fast as the MediaWiki parser (since it does a whole lot more work), and it is unlikely that Parsoid ported to PHP would be that fast either. Also, some MediaWiki extensions depend on details of parsing within the MediaWiki parser, and would need to be rewritten to support a new parser. As such, we recommend retaining the MediaWiki parser as a core component which will be primarily maintained by the community, for the benefit of third-party installations which need optimal performance or backwards compatibility.

Replacing Cassandra with the Parser Cache

We can make Parsoid a pluggable wikitext parser for MediaWiki. For prototyping purposes, this could be done even without porting it to PHP: we just need to support contacting Parsoid from within MediaWiki’s ContentHandler, and to use Parsoid’s results to fill a ParserOutput object.
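A rough sketch of that glue code follows. The class name, the fillFromParsoid() method and the $wgParsoidURL/$wgParsoidDomain settings are invented; the transform path is the one Parsoid’s HTTP API exposes today, and the ParserOutput wiring is simplified.

<syntaxhighlight lang="php">
<?php
/**
 * Hypothetical sketch. In a real implementation this logic would sit behind
 * Content::getParserOutput() / ContentHandler, not a one-off method.
 */
class ParsoidWikitextContent extends WikitextContent {
	public function fillFromParsoid( Title $title, ParserOutput $output ) {
		global $wgParsoidURL, $wgParsoidDomain; // hypothetical settings

		$url = "$wgParsoidURL/$wgParsoidDomain/v3/transform/wikitext/to/html/" .
			wfUrlencode( $title->getPrefixedDBkey() );

		$req = MWHttpRequest::factory( $url, [
			'method' => 'POST',
			'postData' => [ 'wikitext' => $this->getText() ],
		] );
		$status = $req->execute();
		if ( !$status->isOK() ) {
			throw new MWException( 'Parsoid request failed' );
		}

		// The Parsoid HTML (with data-parsoid stored separately) becomes the
		// cacheable body of the ParserOutput.
		$output->setText( $req->getContent() );
	}
}
</syntaxhighlight>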

Parsoid output would then be cached in the parser cache. The proposed REST API could access Parsoid output via the parser cache. Eliminating our Cassandra installation would significantly reduce server costs.

RESTBase uses a change propagation client to prefill its cache of Parsoid output in advance of user requests, in order to hide the latency of Parsoid from users. We think that such a system would be useful for many MediaWiki users, not just those who use Parsoid. A parser cache prefill system would ideally be placed inside MediaWiki, in front of the pluggable parser.
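Such a prefill system could be as simple as a job enqueued by change propagation (for example on template edits or purges). The sketch below uses existing job and parser cache APIs; the job name and the enqueue point are invented for illustration.

<syntaxhighlight lang="php">
<?php
/**
 * Hypothetical prefill job. With a pluggable parser, the getParserOutput()
 * call below would go through Parsoid (or the legacy parser) transparently.
 */
class ParserCachePrefillJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'parserCachePrefill', $title, $params );
	}

	public function run() {
		$page = WikiPage::factory( $this->title );
		if ( !$page->exists() ) {
			return true;
		}

		$parserOptions = $page->makeParserOptions( 'canonical' );
		$output = $page->getContent()->getParserOutput(
			$this->title, $page->getLatest(), $parserOptions
		);

		// Store the result so the first real request is a cache hit.
		ParserCache::singleton()->save(
			$output, $page, $parserOptions, wfTimestampNow()
		);
		return true;
	}
}
</syntaxhighlight>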

VisualEditor requires a stable identifier for cache entries, so that data-parsoid attributes can be re-added to the DOM after an edit. These attributes are stripped before they are delivered to VisualEditor via the REST API. ParserCache could provide this feature, or the cache entry could be temporarily duplicated into session storage while an edit is in progress.

Page composition

We can see good use cases for page composition at the network edge, for example, to inject fundraising banners. However, we don’t want to be in a situation where ordinary page design changes require source code edits across a service/language boundary. So we would prefer the page composition to be controlled by a template which is delivered by MediaWiki, with page components also delivered by MediaWiki.

A good technology for this would appear to be ESI: it is mature and vendor-neutral, being supported by various caching proxies and CDNs, including Varnish. Research done for T106099 suggests an increase in CPU of 50% for pages composed of five ESI components, which seems tolerable. Page views comprise only 11% of the frontend request rate, and typical daily peak CPU usage for Varnish servers is around 10%, so heavy use of ESI might increase this CPU usage to 11% or 12%. The latency effect of such increased CPU usage would be less than 1ms.
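For illustration, a fundraising banner slot could be emitted by MediaWiki as an ESI placeholder and expanded at the edge. The fragment URL below is invented, and Varnish would need ESI processing (do_esi) enabled for the page response; SiteNoticeAfter is an existing MediaWiki hook.

<syntaxhighlight lang="php">
<?php
// Illustrative only: the fragment URL is hypothetical.
$wgHooks['SiteNoticeAfter'][] = function ( &$siteNotice, $skin ) {
	// Emit a placeholder instead of the banner itself. Varnish fetches and
	// inlines the fragment at the edge, so banner changes never require a
	// MediaWiki deployment or a full-page cache purge.
	$siteNotice .= '<esi:include src="/fragments/fundraising-banner" />';
	return true;
};
</syntaxhighlight>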

Services we like

Stateful services, while complicated and requiring more maintenance, are undeniably useful. PHP is a poor platform for stateful services. Wikimedia’s installation of MediaWiki is supported by services such as MySQL, Memcached, ElasticSearch, PoolCounter and Varnish, and we have no problem adding to that list, as long as there is a strong rationale and a minimum of redundancy between services. In particular, we are looking forward to the development of a job queue service.

Stateless services such as Mathoid can be justified in terms of availability of a library in some other language, or in terms of performance. The cost of maintaining a stateless service has to be considered against the prospective cost of providing equivalent functionality in PHP.

Additional services should be optional for non-WMF installations. MediaWiki should have a pluggable client boundary for each service concept, allowing multiple implementations of a given service concept.
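One minimal shape for such a boundary is an interface per service concept, with the implementation chosen through configuration and MediaWiki’s existing service wiring. In the sketch below, the interface, both implementations and the MathBackend setting are invented; only the wiring mechanism and MWHttpRequest are existing MediaWiki facilities, and the Mathoid request shape is illustrative.

<syntaxhighlight lang="php">
<?php
/** Hypothetical service concept: render TeX to SVG markup. */
interface MathRenderClient {
	public function render( $tex );
}

/** Default implementation: pure PHP, works on any installation. */
class LocalMathRenderClient implements MathRenderClient {
	public function render( $tex ) {
		return MathLocalRenderer::render( $tex ); // placeholder for a PHP code path
	}
}

/** Optional implementation: calls a Mathoid service, for large installations. */
class MathoidClient implements MathRenderClient {
	private $url;

	public function __construct( $url ) {
		$this->url = $url;
	}

	public function render( $tex ) {
		// Request shape is illustrative, not Mathoid's exact API.
		$req = MWHttpRequest::factory( $this->url, [
			'method' => 'POST',
			'postData' => [ 'q' => $tex, 'type' => 'tex' ],
		] );
		$req->execute();
		return $req->getContent();
	}
}

// ServiceWiring fragment: pick an implementation from configuration.
return [
	'MathRenderClient' => function ( MediaWiki\MediaWikiServices $services ) {
		$url = $services->getMainConfig()->get( 'MathBackend' ); // hypothetical setting
		return $url ? new MathoidClient( $url ) : new LocalMathRenderClient();
	},
];
</syntaxhighlight>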

Conclusion

This document is a position statement of Tim Starling on behalf of the MediaWiki Platform Team. It was intended to outline a viewpoint on recent architectural decisions which is commonly held but rarely expressed. The goal was to provoke discussion on alternatives to the architectural vision promulgated by Reading and Services. I think it was successful as such, and my thanks go to everyone who has commented on it.

The next step for me will be to take specific project ideas from this document, expanding on them and incorporating the input I received here. These project proposals will be communicated as Phabricator tasks.

October 2017