Parser 2011/Parser plan

Important note: Parts of the technical content on this page are somewhat dated, but are kept as a reference. See Parsoid for the current implementation effort.

Draft starting at etherpad:mwhack11Sat-ParserDraft -- more will be copied to the wiki after more polishing.

Vision

2011	Today, every editor and commenter must deal with source markup for every edit. The tight integration between the classic Parser class and the rest of MediaWiki has made it difficult to create and maintain compatible parser implementations for other uses -- such as the reversibility needs of rich text editors or the alternate output conversions of many republishers. Over the coming months we will build a specification describing the current -- or very close to the current -- MediaWiki parser behavior in terms of a combination of formal and informal grammars to create, transform, and convert via an intermediate abstract syntax tree. A reference implementation in JavaScript will be built and tested alongside a rich-text editor tool, which will be testable on current production Wikimedia sites using gadget options.
2012	By some time next year, we will have rolled out a rich text editor which will let most editors and commenters contribute text without encountering source markup, based on integration with a better-defined version of MediaWiki's parser. We expect raw wiki markup editing to become the domain of advanced users working on templates; most edits by most editors will not need to dive into markup source directly, though it will still be available. Somewhere in this timeframe, the really-nearly-compatible specified parser should be ready to replace the old MediaWiki PHP Parser class; this swap may actually lag behind the editor rollout, depending on whether the JavaScript parser is sufficient to handle the editor's needs.
2017	The far mystic future! World peace, flying cars, and free software on every desktop? Perhaps. By now, the classic MediaWiki templates will have been supplemented with more flexible programming extensions, further reducing the need for advanced editors to work with markup directly. Many new articles, and newer versions of existing articles, have by now been switched to a future derivative format that eliminates many of the low-level syntax oddities of 2011's classic wikitext. Creation of that future reformed format is not the aim of this current project; we leave that to our future selves. What we aim to do now is to lay the groundwork for those future transitions: by integrating well with a rich text editor, we reduce the dependence of editors and commenters on dealing with low-level wikitext. It will become easier to have different data types in use side by side, such as to start using a new format on some pages, or new versions of existing pages, while the specified classic parser continues to be used when needed on old pages -- and perhaps new articles imported from other wikis. Those who choose -- or for complex template creation need -- to work with low-level markup directly may need to deal with both classic and future markup syntaxes for some purposes, but this will largely be along the lines of "oh there's that funky apostrophe thing in this old-style page". Most editors will never need to encounter it.

Terms

Parsing terminology tends to get used... inexactly within MediaWiki. We'll want to make sure we use some terms consistently. :)

wikitext: the markup format used by MediaWiki
page: an object in MediaWiki containing wikitext data. Pages are referred to by a site-unique title, and may have metadata and versioning.
title: MediaWiki page titles are site-unique within each namespace in the wiki's page database
parsing: the entire process of manipulating wikitext source into an output-ready form -- output may be HTML, renormalized save-ready wikitext, or a syntax tree.
Parser: the Parser class and its related parts perform parsing
Preprocessor: the Preprocessor class converts wikitext to an XML or hashmap tree representing parts of the document -- template invocations, parser functions, tag hooks, section headings, and a few other structures.

...

Spec format

Description of parsing context

Wikitext parser/Context

page title, text contents, notions of other pages accessible
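As a rough illustration only, the context items listed above could be carried in a small record like the one below. The names `ParseContext` and `fetch_page` are hypothetical and are not taken from any existing MediaWiki API; the in-memory dictionary merely stands in for the wiki's page database.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ParseContext:
    """Hypothetical container for what a parser knows about its surroundings."""
    title: str  # title of the page being parsed
    text: str   # wikitext contents of that page
    # Lookup for other pages; returns None when a page is not accessible.
    fetch_page: Callable[[str], Optional[str]]


# A tiny in-memory "wiki" standing in for the page database (assumption).
pages = {"Template:Hello": "Hello, world!"}
ctx = ParseContext(title="Main Page", text="{{Hello}}", fetch_page=pages.get)
```

Passing the lookup in as a callback keeps the parser itself independent of how pages are stored, which is one way to avoid the tight coupling described in the Vision section.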

Stage 1 formal grammar

Wikitext parser/Stage 1: Formal grammar

Rule sets that turn a wikitext source string into an abstract syntax tree. It should be possible to use fairly standard parser generators to produce stub code in many languages with a minimum of manual construction. Some rule combinations will depend on context information such as being inside or outside of a template, but all the rules themselves remain consistent; see Preprocessor ABNF for a similar description of the current MediaWiki Preprocessor, which covers a smaller subset of the syntax.
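To make the idea of "grammar rules in, syntax tree out" concrete, here is a minimal sketch for a toy subset: plain text plus nested `{{name|arg|...}}` invocations. It is a hand-written recursive-descent parser rather than output from a parser generator, and the node names (`Text`, `Template`) and the grammar in the comment are assumptions made for this example, not the planned specification. Unlike real MediaWiki, which falls back to literal text, it simply raises an error on an unclosed `{{`.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Toy grammar (assumed for illustration):
#   document := (template | text)*
#   template := "{{" part ("|" part)* "}}"
#   part     := (template | text)*


@dataclass
class Text:
    value: str


@dataclass
class Template:
    # parts[0] is the template name; the remaining parts are positional arguments.
    parts: List[List["Node"]]


Node = Union[Text, Template]


def parse_nodes(src: str, pos: int, in_template: bool) -> Tuple[List[Node], int]:
    """Parse (template | text)*; inside a template, stop before '|' or '}}'."""
    nodes: List[Node] = []
    buf = ""
    while pos < len(src):
        if src.startswith("{{", pos):
            if buf:
                nodes.append(Text(buf))
                buf = ""
            tpl, pos = parse_template(src, pos)
            nodes.append(tpl)
        elif in_template and (src[pos] == "|" or src.startswith("}}", pos)):
            break  # the same rule behaves differently depending on context
        else:
            buf += src[pos]
            pos += 1
    if buf:
        nodes.append(Text(buf))
    return nodes, pos


def parse_template(src: str, pos: int) -> Tuple[Template, int]:
    pos += 2  # skip the opening '{{'
    parts: List[List[Node]] = []
    while True:
        part, pos = parse_nodes(src, pos, True)
        parts.append(part)
        if pos < len(src) and src[pos] == "|":
            pos += 1  # another part follows
            continue
        break
    if not src.startswith("}}", pos):
        raise ValueError("unclosed template")
    return Template(parts), pos + 2


def parse(src: str) -> List[Node]:
    nodes, _ = parse_nodes(src, 0, False)
    return nodes


tree = parse("a {{echo|x {{b}}}} c")
```

The `in_template` flag is a small instance of the context dependence mentioned above: the `|` character is ordinary text at the top level but a separator inside a template.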

Stage 2 annotated steps

Wikitext parser/Stage 2: Informal grammar

Informal description of processing stages applied to the AST to perform steps that can't be expressed in the formal grammar. We hope to cut down the number of explicit steps significantly compared with the current Parser class.

Expansion stage annotated steps

Wikitext parser/Stage 3: Expansion

Informal description of how to handle expansion of templates: combining the parent page's tree with the template page's tree, resolving template parameters, etc.
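The following sketch shows one possible shape for that expansion step, assuming pages have already been parsed into trees. The node types (`Text`, `Param`, `Template`), the `fetch` callback, and the depth limit of 40 are all hypothetical choices for this example. Leaving unknown templates and unbound parameters in the tree is also an assumption, not a description of current MediaWiki behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Union


@dataclass
class Text:
    value: str


@dataclass
class Param:
    name: str  # e.g. "1" for a {{{1}}} placeholder in a template page


@dataclass
class Template:
    name: str
    args: List[List["Node"]]  # positional arguments, each one a subtree


Node = Union[Text, Param, Template]
# Hypothetical lookup: template name -> already-parsed tree, or None if missing.
Fetch = Callable[[str], Optional[List[Node]]]


def expand(nodes: List[Node], fetch: Fetch,
           params: Dict[str, List[Node]], depth: int = 0) -> List[Node]:
    """Splice template trees into the parent tree and resolve parameters."""
    if depth > 40:  # arbitrary guard against self-referencing templates
        raise RecursionError("template expansion too deep")
    out: List[Node] = []
    for node in nodes:
        if isinstance(node, Param):
            # Substitute the caller's argument; keep the node if it is unbound.
            out.extend(params.get(node.name, [node]))
        elif isinstance(node, Template):
            body = fetch(node.name)
            if body is None:
                out.append(node)  # unknown template: keep the invocation node
                continue
            # Arguments are expanded in the caller's scope, then bound as "1", "2", ...
            bound = {
                str(i + 1): expand(arg, fetch, params, depth + 1)
                for i, arg in enumerate(node.args)
            }
            out.extend(expand(body, fetch, bound, depth + 1))
        else:
            out.append(node)
    return out


templates = {"greet": [Text("Hello, "), Param("1"), Text("!")]}
page = [Text("> "), Template("greet", [[Text("world")]])]
expanded = expand(page, templates.get, {})
rendered = "".join(n.value for n in expanded)  # "> Hello, world!"
```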

Parser function addenda

Wikitext parser/Core parser functions

Parser functions are roughly like template invocations, but they call code instead of fetching contents directly from another page. This is the primary mechanism for extending the document format with new features in MediaWiki, as it creates no new syntax requirements. Most parser functions will expand into a subtree like normal templates do, but some could expand into custom node types (e.g. extensions). The core section describes the abstract API between a core parser and callbacks for parser functions; the addenda describe the standard parser functions (those shipped as part of MediaWiki core today). For example, the {{#time}} parser function implements formatting of times, taken either from a parameter or from the current local time as provided by the parsing context. Each description should be sufficient to write a compatible implementation, or a reasonable fallback behavior if the exact function won't be suitable for some implementation.
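One way to picture the abstract API between a core parser and parser function callbacks is a simple name-to-callback registry, sketched below. Everything here is an assumption for illustration: the `ParseContext` record, the `parser_function` decorator, and the callback signature are hypothetical. The `#time` stand-in uses Python `strftime` codes instead of the format codes of the real function, and it returns plain text instead of a subtree to keep the example short.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class ParseContext:
    now: datetime  # "current time" as supplied by the parsing context


# Hypothetical callback signature: (context, already-expanded arguments) -> text.
ParserFunction = Callable[[ParseContext, List[str]], str]
registry: Dict[str, ParserFunction] = {}


def parser_function(name: str):
    """Decorator that registers a callback under a parser function name."""
    def register(fn: ParserFunction) -> ParserFunction:
        registry[name] = fn
        return fn
    return register


@parser_function("#time")
def time_fn(ctx: ParseContext, args: List[str]) -> str:
    # Simplified stand-in: strftime codes, ISO 8601 date input.
    fmt = args[0] if args else "%Y-%m-%d"
    when = datetime.fromisoformat(args[1]) if len(args) > 1 and args[1] else ctx.now
    return when.strftime(fmt)


def call_parser_function(ctx: ParseContext, name: str, args: List[str]) -> str:
    fn = registry.get(name)
    if fn is None:
        # Fallback for functions this implementation lacks: keep the source form.
        return "{{" + name + ":" + "|".join(args) + "}}"
    return fn(ctx, args)


ctx = ParseContext(now=datetime(2011, 5, 14, 12, 0))
today = call_parser_function(ctx, "#time", ["%Y-%m-%d"])        # "2011-05-14"
year = call_parser_function(ctx, "#time", ["%Y", "2017-01-01"])  # "2017"
```

Taking the current time from the context rather than the system clock keeps the callback deterministic, which makes it easier to compare implementations against each other.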

Tag hook addenda

Wikitext parser/Core tag hooks

Tag hooks have an XML-like source syntax rather than the curly-brace MediaWiki template/parser function syntax. Unlike parser functions, the parameters and text content passed to a tag hook are not automatically expanded as parse trees, though a specific tag hook may choose to run its data back through the parser for this purpose. The core section describes the abstract API between a core parser and callbacks for tag hooks; the addenda describe the standard tag hooks (those shipped as part of MediaWiki core today: nowiki, pre, gallery, html).
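The difference between "content is passed raw" and "a hook may re-parse its content" can be sketched as follows. This is a toy under stated assumptions: the hook signature, the regular-expression tag matcher, and the `demo` tag are hypothetical, `toy_markup` handles only `'''bold'''`, and the `nowiki` stand-in is far simpler than the real hook.

```python
import html
import re
from typing import Callable, Dict

# Hypothetical signature: (raw content, attributes, parse callback) -> output.
TagHook = Callable[[str, Dict[str, str], Callable[[str], str]], str]


def nowiki_hook(content: str, attrs: Dict[str, str], parse: Callable[[str], str]) -> str:
    # Content is never handed to the parser; markup characters are only escaped.
    return html.escape(content, quote=False)


def demo_hook(content: str, attrs: Dict[str, str], parse: Callable[[str], str]) -> str:
    # A hook may opt in to running its content back through the parser.
    cls = attrs.get("class", "demo")
    return '<div class="' + cls + '">' + parse(content) + "</div>"


hooks: Dict[str, TagHook] = {"nowiki": nowiki_hook, "demo": demo_hook}

# Only registered tag names are matched; anything else stays ordinary text.
TAG_RE = re.compile(
    r"<(" + "|".join(map(re.escape, hooks)) + r')((?:\s+\w+="[^"]*")*)\s*>(.*?)</\1>',
    re.DOTALL,
)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def toy_markup(text: str) -> str:
    """Stand-in for the rest of the parser: '''bold''' only."""
    return re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)


def toy_parse(src: str) -> str:
    out = []
    pos = 0
    for m in TAG_RE.finditer(src):
        out.append(toy_markup(src[pos:m.start()]))  # text outside any hook
        attrs = dict(ATTR_RE.findall(m.group(2)))
        out.append(hooks[m.group(1)](m.group(3), attrs, toy_parse))
        pos = m.end()
    out.append(toy_markup(src[pos:]))
    return "".join(out)


result = toy_parse("'''a''' <nowiki>'''b'''</nowiki> <demo>'''c'''</demo>")
```

In this run the bold markup inside `nowiki` survives untouched, while the `demo` hook gets bold output only because it explicitly calls the parse callback on its content.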

Notes from Berlin Hackathon 2011

Berlin_Hackathon_2011/Notes/Saturday/Parser
Future/AST (Abstract Syntax Trees)