MediaWiki architecture document/How and why

CC BY Important note: When you edit this page, you agree to release your contribution under the Creative Commons Attribution 3.0 Unported license (CC BY 3.0). If you don't want this, please don't edit. This page is used for the MediaWiki architecture document project, whose content will be included in a book released under CC-BY.
If you had to pick, what are 5 key decisions MediaWiki's developers made that were very insightful?
  1. Making MediaWiki reusable by other people. At one time it was hard to install: you had to run a command-line installer to set it up, and there were plenty of references to Wikipedia and hardcoded paths everywhere. Releasing tarballs is great for that.
  2. Rewriting the schema early to better support web scaling. I think it was in MediaWiki 1.5.
  3. Using a tokenizer to parse wikitext (JeLuF wrote it). Unfortunately, poor performance from PHP array memory allocations led to a revert after 3 days of running on the live site. We have been back to the huge pile of regexps since then.
  4. Writing our own database abstraction layer and load balancer. Well, at that time there was not much around, so we HAD to write one :-b
  5. Opening the source code to a lot of volunteer developers. We have tons of people able to commit :-)
  6. Peer review of every single patch since day 1.
  7. Migrating to svn (next step git?)
  8. Finally using jQuery for JavaScript.
And if you had to pick, what are 5 key decisions MediaWiki's developers made that were, in retrospect, the wrong choice?
  1. Reusing MediaWiki to build Commons. It was not, and still is not, suited to handling millions of media files. We should have started a dedicated project with the goal of handling media files, something less like a wiki and more like Flickr or Picasa.
    • I think we could do a lot better while still working within the MediaWiki framework, but we never really did that either. --brion
  2. Relying on PHP/MySQL, which are probably not the best choices for performance. On the other hand, they are both very popular, so most software developers can submit patches :-)
  3. The templating architecture; Tim Starling can talk about it much more than I can.
  4. Using per-language encodings. That led to a lot of issues. Eventually everything migrated to UTF-8, which makes things easier when you deal with hundreds of different languages.
    • We eliminated this in 1.5, along with the "big schema change". Definitely wish we'd done it first though! --brion
  5. We still have metadata (categories, interwiki) in the body of text. This should really have been coded in a different table / interface. Users have to edit the whole page just to add/remove categories and interwikis :-)
How would you describe our attitude to backwards compatibility?

AFAIK, we are really conservative. Most old methods / functions are kept in the code; nowadays they are marked as deprecated and removed after 2 or 3 releases. We still support the 10-year-old skins.
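
The "keep it, mark it deprecated, remove it later" policy described above can be sketched roughly as follows. This is a minimal illustration in Python rather than MediaWiki's PHP, and every name in it (`deprecated`, `render_title`, `old_render_title`, the version string) is hypothetical, not taken from the MediaWiki codebase.

```python
import functools
import warnings


def deprecated(since, replacement=None):
    """Mark a function as deprecated while keeping it callable."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            message = f"{func.__name__} is deprecated since {since}"
            if replacement:
                message += f"; use {replacement} instead"
            # Warn the caller, then behave exactly as before.
            warnings.warn(message, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return wrapper
    return decorate


def render_title(text):
    """The new, preferred function (hypothetical)."""
    return text.strip().replace("_", " ")


@deprecated(since="1.17", replacement="render_title")
def old_render_title(text):
    # Kept as a thin shim so existing callers keep working for a few releases.
    return render_title(text)
```

Old callers keep getting correct results and only see a warning, which gives extension authors a few releases to move to the replacement before the shim is deleted.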

Extensions that are part of our svn repository are more or less maintained by core developers, at least when it comes to API changes and sometimes coding style. The wikitext parser and renderer still support hacks we really want to remove for performance reasons. Still, since people use them, we keep the features :D

Did MediaWiki get more customizable over time (with extensions, user scripts, gadgets, and skins) or less? Why?

It got more customizable. Given a skin, someone can now apply their own stylesheet just by editing a page (something like User:username/skinname.css).

Gadgets are great. You may want to talk to Brion about them.

At one time, you could not write any extension. I am not sure who added the first hooks; I know I proposed it to some developers, and eventually the hook system was added by someone else.

What decisions and improvements improved or hurt MediaWiki's performance?
  • Rewriting the database schema improved MW.
  • Adding support for memcached (in-memory cache) and APC (PHP opcode cache) had a HUGE impact on performance.
  • The template system degraded it. People came up with creative uses of the template system, which eventually led to templates that take seconds to render. The worst of them are probably the citation templates. We should really have coded them as a PHP extension, which would have given us better performance.
  • Our broken parser has had, and still has, bad performance.
  • ResourceLoader !!! :)
Anything else you think should be included in a document about MediaWiki's architecture and history?

Brion

If you had to pick, what are 5 key decisions MediaWiki's developers made that were very insightful? Why?
  1. FOSS from the beginning. While we're not always the best at accepting patches & sharing, we're far from the worst. :) MediaWiki's first couple of years involved zero engineering budget and a lot of volunteer turnover -- the original authors of the phase 2 and phase 3 PHP codebases were Wikipedians with a technical bent, as were most of our other early devs; folks like me started on fixes, internationalization, and feature support based on our ability to see the source in CVS & the bug tracker on SourceForge, discuss with other devs on the wikis & mailing lists, and get software updates actually pushed to production.
  2. Regular releases and installer. Making sure the software was easy to set up was a BIG factor in getting more volunteer developers involved, both directly (people who were already possibly interested wanting low impedance to start working) and indirectly (easier for 3rd-party usage, leading to people sending fixes and customizations upstream). Release frequency has gotten less regular but we're still pushing them out, and the installer's gotten a big boost in 1.17.
  3. Extension architecture. While I'm not 100% happy with every detail of how we do it, we have a fairly flexible infrastructure which has helped us to make specialized code more modular, keeping the core software from expanding (too) much and making it easier for 3rd-party reusers to build custom stuff on top.
  4. Site/user JS/CSS and gadgets: hugely impactful, this has greatly increased the democratization of MediaWiki's software development. Individual users are empowered to add features for themselves; power users can share these with others both informally and through globally-configurable admin-controlled systems.
  5. Templates. While we have plenty of things to whinge about in the syntax, management etc, the ability to create partial page layouts and reuse them in thousands of articles with central maintenance has been a big boon.
And if you had to pick, what are 5 key decisions MediaWiki's developers made that were, in retrospect, the wrong choice? Why?
  1. Cleaner markup syntax near the beginning would have simplified our lives a lot with template & editing stuff.
  2. The flat namespace for articles is too simple: for Wikipedia it encourages overly long pages (leads to performance problems as we have to parse and copy around huge chunks of text that will not usually get read all at once, and makes it harder to navigate to relevant, more digestible chunks of data). For other sites like wikibooks, wikisource, wikiversity, heck even mediawiki.org we could benefit a lot from more structured entities that consist of multiple subpages. It also means it's harder to separate draft or provisional pages from the published article space.
  3. Not implementing structured messaging / discussion / chat systems from the beginning has left us with a legacy of "talk pages" that are horrid to work with and unaccountable IRC backchannels.
  4. DB & filesystem storage layout for media files is very awkward, with a number of problems that hinder our ability to mirror, cache, and do major online maintenance.
  5. A cleaner accounts system that spanned multiple sites from the beginning would have saved lots of trouble; CentralAuth is still a bit hacky to work with.
What are MediaWiki's idiosyncrasies? What makes it special compared to other PHP software?
What would you say were the main milestones in MediaWiki's history?
  • creation, testing, and initial deployment of Magnus' "phase 2"
  • creation, testing, and initial deployment of Lee's "phase 3"
  • Refactorings & performance improvements in the early Brion & Tim years
    • internationalization & unicode
    • addressable logs
    • 1.5 schema refactor
    • compression & external storage
    • web-installable package
    • regular releases
  • Early empowerment of end-users
    • user/site JS/CSS
    • extensions
    • templates
  • CentralAuth
  • Gadgets
  • API
  • 1.12 or so - preprocessor
  • 1.17 - resourceloader
What decisions and improvements improved or hurt MediaWiki's performance?
  • improvements above? :)
  • hurt: awful awful syntax making it harder to plug in better-performing parser etc bits
  • hurt: PHP has not benefited from performance improvements that some other dynamic languages have seen in recent years (eg JavaScript VMs now have aggressive JITs etc, but Zend's PHP still doesn't ship an opcode cache, much less try to actually compile anything)
  • hurt: MySQL has had a few specific areas it's lagged in that have been problematic:
    • lack of full native UTF-8 (this is finally in the latest versions, but you have to jump through some hoops and we have years of legacy databases)
    • no or limited online alter table makes even simple schema changes painful to deploy, slowing some development
  • data dump format is very hard to parallelize well; even with few changes to the database it takes forever to build one due to the compression.
How would you describe our attitude to backwards compatibility?
  • .... varies. :)
Anything else you think should be included in a document about MediaWiki's architecture and history?
  • will add more at some point

Chad

If you had to pick, what are 5 key decisions MediaWiki's developers made that were very insightful? Why?
  1. Making it FLOSS since the very beginning was very important for several reasons. Not only for those Brion listed above, but also for its popularity. MediaWiki has become the 800-lb gorilla of wiki software, and it wouldn't have happened with a closed development model. Also, I think Wikipedia would've taken quite a bit of flak from the FLOSS community (where a lot of our roots come from) for being "free content" but not running on "free software."
  2. Transitioning from CamelCase to [[Free links]] was before my time, but an excellent decision. It allowed for greater flexibility in link text and page names and made it less confusing. Free links have since become the de-facto standard for internal links in most wiki software now.
  3. The schema overhaul in 1.5 was very well thought out. We've added on to it over time, but by and large it has remained the same to this day and continued to serve us reasonably well.
And if you had to pick, what are 5 key decisions MediaWiki's developers made that were, in retrospect, the wrong choice? Why?
  1. The skin system has been terrible since the beginning. It's damn near impossible for 3rd parties to write their own custom skins without reinventing the wheel.
  2. ParserFunctions never should've seen the light of day. Granted we were responding to the needs of the time, but if we did it over, we probably should've taken the time to solve the problem properly rather than putting PFuncs as a stopgap measure. This page has some interesting history on the subject (Also this and also older versions of this page)
  3. File repository code could've been done slightly differently. Ideally wikis should be able to upload *to* foreign repos, rather than just read from. Also, most of the code assumes a local filesystem or NFS which isn't very flexible--other backends like the database, Swift, etc. shouldn't be so hard to add.
  4. The parser wasn't formally spec'd from the beginning--it just morphed and evolved as needs have demanded. This makes it difficult for alternative parsers to exist and has made changing the parser hard. The parser's spec is whatever the parser spits out, plus a few hundred test cases.
  5. Globals for configuration -- partially this is because we started out in PHP4 world, but it has really hurt 3rd parties over time and made the software seem rather difficult to configure/maintain.
What are MediaWiki's idiosyncrasies? What makes it special compared to other PHP software?
  • It has to be webscale ;-) Unlike most PHP applications MediaWiki has been built for years now with performance as a major design goal since it absolutely must scale to WMF sites (primarily: enwiki)
  • The parser has to remain very very stable. Hundreds of millions of wikipages worldwide depend on the parser to continue outputting HTML the way it always has. It makes changing the parser difficult.
What would you say were the main milestones in MediaWiki's history?
What decisions and improvements improved or hurt MediaWiki's performance?
  • Adding a generic caching layer did amazing things for performance. We can throw practically anything expensive into the cache and expect it to come out :)
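
A generic caching layer of this kind can be sketched as one small interface with interchangeable backends. The sketch below is in Python, not MediaWiki's PHP, and the names (`Cache`, `NullCache`, `DictCache`, `get_with_set`) are hypothetical; the in-process dictionary stands in for a real backend such as memcached.

```python
import time
from abc import ABC, abstractmethod


class Cache(ABC):
    """Interface callers program against, whatever the backend is."""

    @abstractmethod
    def get(self, key):
        """Return the cached value, or None on a miss."""

    @abstractmethod
    def set(self, key, value, ttl=0):
        """Store a value; ttl=0 means no expiry."""

    def get_with_set(self, key, compute, ttl=0):
        # Compute and store only on a miss. None is treated as "not cached".
        value = self.get(key)
        if value is None:
            value = compute()
            self.set(key, value, ttl)
        return value


class NullCache(Cache):
    """Backend that never stores anything; every lookup is a miss."""

    def get(self, key):
        return None

    def set(self, key, value, ttl=0):
        pass


class DictCache(Cache):
    """In-process backend, standing in for memcached or similar."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires and expires <= time.time():
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl=0):
        expires = time.time() + ttl if ttl else 0
        self._data[key] = (value, expires)
```

Calling code only ever uses `get_with_set`, so swapping `DictCache` for `NullCache` changes how often the expensive function runs but never the result.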
How would you describe our attitude to backwards compatibility?
  • Some aspects, such as hooks or configuration variables, remain very stable for a long time. When they change, they typically go through a slow deprecation process to allow users and extension authors to catch up.
  • However, our internal APIs change all the time, which can be frustrating to extension authors (and even core devs!)
    • I think this will improve in the coming releases though. A lot of our "omg rewrite" situations in the past ~2 years have been to bring ancient code into the 21st century.
Anything else you think should be included in a document about MediaWiki's architecture and history?
  • TBA

Tim Starling

If you had to pick, what are 5 key decisions MediaWiki's developers made that were very insightful? Why?
  • I think it would be rather vain to refer to any of my own decisions, or decisions I contributed to, as "very insightful".
And if you had to pick, what are 5 key decisions MediaWiki's developers made that were, in retrospect, the wrong choice? Why?
  1. Configuration variables are placed in the global namespace.
    • This had serious security implications with register_globals.
    • It limits potential abstractions for configuration, and makes optimisation of the startup process more difficult.
    • The configuration namespace is shared with variables used for registration and object context, leading to potential conflicts.
  2. Extension registration based on code execution at startup rather than cacheable data.
    • Limits abstraction and optimisation
  3. The use of triple braces to designate template arguments.
    • Makes templates look ugly.
  4. Unprefixed class names.
    • PHP core and PECL developers have the attitude that all class names that are English words inherently belong to them, and that if any user code is broken when new classes are introduced, it is the fault of the user code.
    • Prefixing e.g. with "MW" would have made it easier to embed MediaWiki inside another application or library.
  5. Lack of a unified, pervasive permissions concept.
    • This stymied the development of new user rights and permissions features, and led to various security issues.
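
Point 2 above contrasts registration by executing code at startup with registration from cacheable data. A rough sketch of the data-driven alternative, in Python and with entirely hypothetical extension, hook, and setting names, might look like this:

```python
import json

# Hypothetical manifests; in a real system these would be files shipped
# with each extension. All names here are made up for illustration.
MANIFESTS = {
    "ExampleFootnotes": (
        '{"hooks": {"PageRender": ["footnotes_render"],'
        ' "PageSave": ["footnotes_save"]},'
        ' "config": {"FootnotesEnabled": true}}'
    ),
    "ExampleBanner": '{"hooks": {"PageRender": ["banner_render"]}}',
}


def build_registry(manifests):
    """Merge declarative manifests into one plain-data registry."""
    registry = {"hooks": {}, "config": {}}
    for name in sorted(manifests):  # deterministic merge order
        data = json.loads(manifests[name])
        for hook, handlers in data.get("hooks", {}).items():
            registry["hooks"].setdefault(hook, []).extend(handlers)
        registry["config"].update(data.get("config", {}))
    return registry


# The registry is plain data, so it can be serialized once and reloaded
# on later requests instead of re-running any extension setup code.
cached = json.dumps(build_registry(MANIFESTS), sort_keys=True)
registry = json.loads(cached)
```

Because nothing in a manifest is executable, the merged result can be stored in any cache and inspected without loading the extensions themselves, and extension settings live in their own structure rather than in the global namespace.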
What are MediaWiki's idiosyncrasies? What makes it special compared to other PHP software?
  • Namespaces
  • Internationalisation system
  • Replication support
What would you say were the main milestones in MediaWiki's history?
  • The phase3 rewrite
  • The web-based installer (1.2)
  • 1.3: The MonoBook skin, categories, templates and extensions.
  • The new schema (1.5)
  • The resource loader (1.17)
What decisions and improvements improved or hurt MediaWiki's performance?
  • As discussed above: the configuration and extension registration systems hurt MediaWiki's performance.
  • Obviously using Java would have been much better for performance, and scaling up the execution of backend maintenance tasks would have been simpler.
  • The template argument default parameter feature {{{arg|default}}} was ultimately the most costly feature in terms of performance. It enabled the construction of a functional programming language implemented on top of PHP, starting with {{qif}}.
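
To see why a default-value feature is enough to build conditionals, here is a toy expander for `{{{arg|default}}}` written in Python. It is a simplified sketch, not MediaWiki's preprocessor, and the `QIF` pattern is an illustration in the spirit of {{qif}} rather than that template's actual source. The trick is that the argument name itself may contain a nested argument, so a non-empty `test` turns the name `else` into an undefined name and the default branch is taken.

```python
def _find_close(text, start):
    """Return the index of the '}}}' matching the '{{{' at `start`."""
    depth, i = 0, start
    while i < len(text):
        if text.startswith("{{{", i):
            depth += 1
            i += 3
        elif text.startswith("}}}", i):
            depth -= 1
            if depth == 0:
                return i
            i += 3
        else:
            i += 1
    raise ValueError("unbalanced triple braces")


def _split_top_level(inner):
    """Split 'name|default' at the first '|' outside nested braces."""
    depth, i = 0, 0
    while i < len(inner):
        if inner.startswith("{{{", i):
            depth += 1
            i += 3
        elif inner.startswith("}}}", i):
            depth -= 1
            i += 3
        elif inner[i] == "|" and depth == 0:
            return inner[:i], True, inner[i + 1:]
        else:
            i += 1
    return inner, False, ""


def expand(text, args):
    """Expand {{{name}}} and {{{name|default}}}; both parts may nest."""
    out, i = [], 0
    while i < len(text):
        if text.startswith("{{{", i):
            end = _find_close(text, i)
            name, has_default, default = _split_top_level(text[i + 3:end])
            name = expand(name, args)  # the name can itself be computed
            if name in args:
                out.append(args[name])
            elif has_default:
                out.append(expand(default, args))
            else:
                out.append("{{{" + name + "}}}")  # undefined: keep literal
            i = end + 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical conditional: yields `else` when `test` is empty, `then` otherwise.
QIF = "{{{else{{{test|}}}|{{{then|}}}}}}"
```

With `test` empty the outer name resolves to `else`; with `test` set to `x` it resolves to the undefined name `elsex`, so the default (`then`) is used. Every such conditional costs extra nested expansions, which is where the performance bill comes from.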
How would you describe our attitude to backwards compatibility?
  • Inconsistent. Such is the result of having many volunteer developers with many different opinions on this fraught issue.
Anything else you think should be included in a document about MediaWiki's architecture and history?

MediaWiki grew organically and is still evolving. It's hard to criticise the founders for not implementing some abstraction which we now find to be critical, when the initial codebase was so small, and the time taken to develop it so short.

We've seen major new architectural elements introduced to MediaWiki throughout its history, for example:

  • The Parser class
  • The SpecialPage class
  • The Database class
  • The Image class, then later the media class hierarchy and the filerepo class hierarchy
  • ResourceLoader
  • The upload class hierarchy
  • The Maintenance class
  • The action hierarchy

MediaWiki started without any of these things, despite the fact that all of them support features that have been around since the beginning. Many developers are driven primarily by feature development -- architecture is often left behind, only to catch up later as the cost of working within an inadequate architecture becomes apparent.

Roan Kattouw

If you had to pick, what are 5 key decisions MediaWiki's developers made that were very insightful? Why?
  1. Strongly focusing on security by providing wrappers around HTML output and DB queries that handle escaping for you, and making their use pretty much mandatory. This means that everyone is expected to write secure code, while at the same time writing secure code is made easy so everyone can do it. Thanks mostly to Tim Starling, we have institutionalized a security-minded development culture, and I think that contributes to the low number of security flaws found in MediaWiki.
  2. When the limitations imposed on us by WMF's caching infrastructure caused problems in MediaWiki or friction with things we wanted to do in MediaWiki, we found ways around that. We didn't try to shape WMF's caching-heavy optimized-to-death architecture around MW, but we did almost the reverse: make MW more flexible so it can handle our crazy caching setup, without compromising on our performance and caching needs.
  3. We are fully committed to internationalizing our software in any imaginable language. This i18n support is quite pervasive and impacts many parts of MediaWiki, but despite that we stuck with it anyway, and we now have a very feature-rich i18n system.
  4. We try to do things cleanly if there are benefits to it (e.g. separation of logic and output in the architecture) but at the same time we're not afraid to ignore standards or rules if that's better for us (e.g. not fully complying with stupid provisions of HTML4, denormalizing the DB schema where that brings performance benefits)
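
The idea in point 1, wrappers that do the escaping so callers cannot forget it, can be sketched in a few lines. This is a Python illustration using the standard `html` and `sqlite3` modules, not MediaWiki's own PHP wrappers; the helper names `element` and `select_page` and the `page` table layout are assumptions made up for the example.

```python
import html
import sqlite3


def element(tag, attrs=None, text=""):
    """Build an HTML element, escaping attribute values and text content."""
    attr_str = "".join(
        f' {name}="{html.escape(str(value), quote=True)}"'
        for name, value in (attrs or {}).items()
    )
    return f"<{tag}{attr_str}>{html.escape(text)}</{tag}>"


def select_page(conn, title):
    """Look up a page by title using a bound parameter.

    The title is passed separately from the SQL text, so it is never
    interpreted as SQL no matter what characters it contains.
    """
    cur = conn.execute("SELECT id, title FROM page WHERE title = ?", (title,))
    return cur.fetchone()
```

If the wrappers are the only sanctioned way to emit HTML or run a query, the secure path is also the easy path, which is the cultural point being made above.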
And if you had to pick, what are 5 key decisions MediaWiki's developers made that were, in retrospect, the wrong choice? Why?
  1. MediaWiki seems to have been started mostly by people who weren't really very expert in their field, and as a result a lot of ugly old code is lying around that lacks proper logic/view separation and has other nasty issues
  2. Making wikitext such a complex and idiosyncratic language that parsing it with 'normal' parsers is very hard was definitely a bad move, and we're feeling the pain now
  3. The visual editor project is way overdue. We're fixing it now, and that's good, but it's kind of ridiculous that, in 2011, the main interface of one of the largest sites on the web is still a <textarea> from the 90s
What are MediaWiki's idiosyncrasies? What makes it special compared to other PHP software?
  • We support all sorts of things that 'normal people' never even think about, such as
    • DB replication/lag handling
    • reverse caching proxies
    • SSL termination proxies and protocol-relative URLs
    • internationalization in 350+ languages
    • the ability to have the interface and content in different languages
    • right-to-left languages
    • mixed directionality (i.e. interface language and content language have opposite directionality)
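
As one example from the list above, replication lag handling amounts to choosing a read server while skipping replicas that have fallen too far behind. The following Python sketch shows only that general idea; the function name, the `(weight, lag_seconds)` layout, and the 5-second threshold are assumptions, not MediaWiki's actual load balancer.

```python
import random


def pick_replica(replicas, max_lag=5.0, rng=random):
    """Choose a read server by weight, skipping replicas that lag too much.

    `replicas` maps a server name to (weight, lag_seconds). Returns None
    when no replica is usable, so the caller can fall back to the primary.
    """
    healthy = {
        name: weight
        for name, (weight, lag) in replicas.items()
        if lag <= max_lag and weight > 0
    }
    if not healthy:
        return None
    names = list(healthy)
    weights = [healthy[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

A caller that has just written data can pass a smaller `max_lag`, or read from the primary outright, so users do not see stale pages right after saving.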
What would you say were the main milestones in MediaWiki's history?
  • The schema migration in 1.5
  • The creation of api.php, and the addition of write actions (including edit) to it
  • The preprocessor rewrite
  • CentralAuth (SUL)
  • ResourceLoader
  • The new installer
What decisions and improvements improved or hurt MediaWiki's performance?
  • Factors that contribute to a 'performance culture':
    • We have a few very expert people on board who know a lot about performance optimization (DB performance is a big chunk of it, but the rest is important too)
    • MediaWiki must run on a top-ten web site. Things that would not scale to that size are fixed, reverted, or put behind a config var
    • MediaWiki must run on a top-ten web site operated on a shoestring budget, which has additional implications for performance and caching
  • Specific things that have improved performance:
    • Generic caching layer support that looks the same to the developer whether the backend is memcached (preferred), APC/ECache/whatever, a database, or even nothing (null cache)
    • Using disk-backed object cache for the parser cache greatly improved the pcache hit rate and produced some awesome hit rate graphs
    • PoolCounter prevents a Michael Jackson-esque cache stampede. It's sort of difficult to verify it actually works, but we've seen things that lead us to believe it does indeed work
  • Specific things that have harmed performance
    • The fact that wikitext slowly evolved into this almost Turing-complete programming language. Wikitext that exploits these features takes forever to parse
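
The cache stampede that PoolCounter guards against can be illustrated with a much simpler, in-process analogue: when many requests miss the cache for the same key at once, only one of them does the expensive work and the rest wait for its result. The Python sketch below shows that idea only; `StampedeGuard` is a hypothetical name, and it uses thread locks where the real service coordinates across servers.

```python
import threading


class StampedeGuard:
    """Let only one thread regenerate a missing cache entry per key."""

    def __init__(self):
        self._cache = {}
        self._locks = {}
        self._meta = threading.Lock()  # protects the per-key lock table

    def get(self, key, compute):
        if key in self._cache:
            return self._cache[key]
        with self._meta:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the entry
            # while this one was waiting for the lock.
            if key not in self._cache:
                self._cache[key] = compute()
            return self._cache[key]
```

Without the per-key lock, every concurrent request for a suddenly popular page would re-parse it at the same time, which is the failure mode described above.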


How would you describe our attitude to backwards compatibility?

This doesn't seem to be particularly consistent across MediaWiki's subsystems. When I maintained the bot API, my policy was to avoid breaking backwards compatibility unless there was a very good reason for it, and to announce changes that broke backwards compatibility on the mediawiki-api-announce list (with the subject line screaming "BREAKING CHANGE") as soon as they were committed to trunk, so they would be well-known to clients by the time they went live on WMF wikis or became part of a tarball release.

Anything else you think should be included in a document about MediaWiki's architecture and history?