Manual:Architectural modules/Parser
Module | Parser |
---|---|
Responsibilities | Parsing of wikitext and several other tasks |
Implementation | The main Parser.php file and 14 other separate files that implement supporting functionality for the parser |
Responsibilities
The Parser module has several responsibilities in MediaWiki. Besides the actual parsing of wikitext, it is used for a number of other tasks. Its role comprises the following functional areas, each presented in more detail below:
- Parse wikitext into HTML
- Apply options for parsing
- Deliver output object to be used for rendering
- Cache the output of parsing
- Provide parser functions
- Provide tag hooks
- Extract and replace sections when they are edited
Parse wikitext into HTML
The content of articles is stored in the database as wikitext. When an article has to be shown in the browser, this wikitext needs to be converted into proper HTML. Performing this task is the core functionality of the Parser module. The entry point for parsing is Parser::parse(), which executes a sequence of operations, each transforming one kind of wikitext construct (tables, lists, headings, etc.) into HTML. The execution steps of parse() are shown in the table Parsing Steps.
Parsing Steps | |
---|---|
startParse() | Sets the title, options and outputType |
internalParse() | |
preprocessToDom() | Preprocesses wikitext and returns the document tree |
$frame->expand() | Expands templates, variables and parser functions |
Sanitizer::removeHTMLtags() | Cleans up HTML, removes dangerous tags and attributes, removes HTML comments |
doTableStuff() | Renders wikitext for tables |
preg_replace() | Inserts <hr /> tag for thematic break (start of sections) |
doDoubleUnderscore() | Removes valid double-underscore items, like __NOTOC__, and puts them into the array $Parser->mDoubleUnderscores |
doHeadings() | Renders section headers, i.e. "==" are replaced with <h2> tags |
replaceInternalLinks() | Puts placeholders for internal links in [[ ]] and stores them in $Parser->mLinkHolders; renders section links |
doAllQuotes() | Replaces single quotes with HTML markup (<i>, <b>, etc) |
replaceExternalLinks() | Renders external links |
doMagicLinks() | Replaces special strings like "ISBN xxx" and "RFC xxx" with magic external links |
formatHeadings() | Auto-numbers headings if that option is enabled; adds an [edit] link to sections for users who have enabled the option and can edit the page |
exit internalParse() | |
$this->mStripState->unstripGeneral() | Inserts back general stripped items from StripState |
preg_replace() | Cleans up special characters |
doBlockLevels() | Renders lists from lines starting with ':', '*', '#', etc.; renders new lines and paragraphs |
replaceLinkHolders() | Replaces link placeholders with actual links from $Parser->mLinkHolders |
$this->getConverterLanguage()->convert() | The text is language converted (when applicable) |
$this->mStripState->unstripNoWiki() | Inserts nowiki text back from the StripState array |
replaceTransparentTags() | Replaces transparent tags with values which are provided by the callback functions in $Parser->mTransparentTagHooks. Transparent tag hooks are like regular XML-style tag hooks, but they operate on HTML instead of wikitext. |
$this->mStripState->unstripGeneral() | Inserts back general stripped items from StripState |
Sanitizer::normalizeCharReferences() | Ensures that any entities and character references are legal for XML and XHTML specifically |
MWTidy::tidy($text) | If HTML tidy is enabled, MWTidy::tidy is called to do the tidying |
Limit report | If limit report is enabled, produces limit report |
mOutput->setText($text) | Sets the parsed text to ParserOutput |
return $this->mOutput | Returns the ParserOutput object with HTML text of the wiki page |
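The table describes parse() as a chain of text transformations applied in a fixed order. The following Python sketch mimics only that idea with three toy passes (thematic breaks, headings, quotes). The function names and regular expressions are hypothetical simplifications chosen for illustration, not MediaWiki's PHP implementation.

```python
import re

def do_horizontal_rules(text):
    # A line consisting of four or more dashes becomes a thematic break.
    return re.sub(r"^-{4,}[ \t]*$", "<hr />", text, flags=re.MULTILINE)

def do_headings(text):
    # "== Title ==" becomes <h2>Title</h2>; the number of "=" signs sets the level.
    def repl(match):
        level = len(match.group(1))
        return f"<h{level}>{match.group(2).strip()}</h{level}>"
    return re.sub(r"^(={1,6})(.+?)\1[ \t]*$", repl, text, flags=re.MULTILINE)

def do_all_quotes(text):
    # Bold (''') is handled before italic ('') so the longer marker wins.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    return re.sub(r"''(.+?)''", r"<i>\1</i>", text)

# The order matters: each pass sees the output of the previous one.
PIPELINE = [do_horizontal_rules, do_headings, do_all_quotes]

def toy_parse(wikitext):
    html = wikitext
    for step in PIPELINE:
        html = step(html)
    return html

result = toy_parse("== Intro ==\nSome '''bold''' and ''italic'' text\n----")
```

Keeping the passes in a list makes the ordering explicit, which is the property the Parsing Steps table documents.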
[Figure: ParserOptions]
The parsing of wikitext depends on the applied ParserOptions. ParserOptions are initialized from the RequestContext, where the Language and the User are the two important parameters. User preferences, such as thumbnail size, numbering of headings or language, are set in the ParserOptions. Other options get their values from MediaWiki's global settings, for example mAllowExternalImages or mMaxTemplateDepth (maximum recursion depth for templates within templates). An example of a ParserOptions object with values can be found in the figure ParserOptions. As a result, the HTML output for the same article (and the same revision) can vary depending on the parser options used.
[Figure: ParserOutput]
An important aspect of the parsing process is the incorporation of extension hooks. Hooks allow custom code to be executed when a defined event occurs. The code in parse() already includes some defined events and runs hooks when they occur. For example, there is an event ParserBeforeInternalParse, on which hooks are run before proceeding with the internal parse. Additional functions can be registered on these events, or new events can be created, as everywhere in MediaWiki.
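The hook mechanism can be pictured as a registry that maps event names to lists of callbacks. The Python sketch below is a minimal illustration under that assumption. The HookRegistry class, its methods and the toy parse() function are hypothetical; only the event name ParserBeforeInternalParse comes from the text above.

```python
from collections import defaultdict

class HookRegistry:
    """Maps event names to handlers that run when the event occurs."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def register(self, event, handler):
        self._handlers[event].append(handler)

    def run(self, event, *args):
        # Handlers run in registration order. In this sketch a handler
        # that returns False stops the remaining handlers.
        for handler in self._handlers[event]:
            if handler(*args) is False:
                return False
        return True

def parse(text, hooks):
    state = {"text": text}
    # Give registered code a chance to modify the text first.
    hooks.run("ParserBeforeInternalParse", state)
    # Stand-in for the internal parse: one trivial transformation.
    return state["text"].replace("----", "<hr />")

def trim_whitespace(state):
    # Example handler: normalizes the text before the internal parse.
    state["text"] = state["text"].strip()

hooks = HookRegistry()
hooks.register("ParserBeforeInternalParse", trim_whitespace)
html = parse("  ----  ", hooks)
```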
The result of parsing is stored in the ParserOutput object. Its main attribute is mText, which holds the whole HTML representation of the article content. In addition, ParserOutput holds categories, links, images, sections, templates and other "parts" of the article as separate variables. It also holds information relevant for caching: cache time, expiry and revision. An example of a ParserOutput object can be found in the figure ParserOutput. After the ParserOutput is produced, the values of its attributes are set on the OutputPage object.
Cache the output of parsing
To optimize the performance of MediaWiki and avoid parsing the wikitext into HTML on every request, the ParserOutput can be cached. Two PHP files in the parser directory relate to the caching mechanism: ParserCache.php and CacheTime.php. A ParserOutput is always created using some ParserOptions. There are more than 30 options, and they can be found in the ParserOptions class. Examples of such options are:
- $mDateFormat – specifies date format
- $mInterfaceMessage – specifies which language to call for plural and grammar
- $mEnableLimitReport – specifies whether to enable limit report in an HTML comment on output
- $mNumberHeadings – specifies whether headings should be automatically numbered
- $mThumbSize – specifies thumb size preferred by user
All of these options are taken into account when producing the ParserOutput, and some of them are critical for the creation of the cache.
After the wikitext is parsed and the ParserOutput is created (in PoolWorkArticleView::doWork()), the output is cached if the cache expiry is greater than 0 and the latest revision of the article is being handled. For that, the function ParserCache::singleton()->save() is called. The key under which the ParserOutput is stored in the cache is generated from the pageid and the ParserOptions used. The options are hashed using ParserOptions::optionsHash() into a sequence of '!value' or '!*' parts: '!' marks the beginning of a new option, and '*' is placed when no value is found for that option. In this example, the key for the cached ParserOutput is made up of the following parts:
- buildings_en – name of the database
- pcache – constant for parser cache
- 31 – page id
- 0 – render key
- !*!0!!en!2 – different options, 'en' for example stands for user language
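The '!value' / '!*' scheme can be reproduced in a few lines. The Python sketch below is an illustration only: the function names are hypothetical, the assignment of values to option slots is an assumption, and the key is modelled as a tuple so that no separator characters of the real key string have to be guessed.

```python
def options_hash(values):
    """Join option values as '!value'; '!*' marks an option without a value."""
    return "".join("!" + ("*" if value is None else str(value)) for value in values)

def parser_output_key(db_name, page_id, render_key, option_values):
    # "pcache" is the constant for the parser cache named in the list above.
    return (db_name, "pcache", page_id, render_key, options_hash(option_values))

# Assumed slot order: an unset option, a numeric flag, an empty string,
# the user language and one more numeric option.
english = parser_output_key("buildings_en", 31, 0, [None, 0, "", "en", 2])
spanish = parser_output_key("buildings_en", 31, 0, [None, 0, "", "es", 2])
```

Here english[-1] is '!*!0!!en!2', matching the options part listed above, while the Spanish variant differs only in the language slot and therefore addresses a different cache entry.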
After the key is created, the output is saved with $this->mMemc->set(). mMemc is the given back-end storage mechanism (a memcached client or a BagOStuff derivative). It is set during the construction of ParserCache by passing $parserMemc, which gets its value from the global variable $wgParserCacheType.
When the current revision of a page is requested and the request is not a redirect, the client cache and the file cache are checked first. If neither the client cache nor the file cache is available and the variable $useParserCache is set to true, the parser cache is tried. Cache retrieval consists of 2 steps: getting the optionsKey from the cache, and getting the actual cached ParserOutput using the parserOutputKey. The optionsKey is a CacheTime object that holds, in particular, $mUsedOptions, $mCacheTime and $mCacheExpiry. The key for getting the optionsKey is generated from the pageid (31 in this example).
If the optionsKey is found in the cache and has not expired, the parserOutputKey can be obtained. For that, ParserCache::getParserOutputKey() is called. The $usedOptions coming from $optionsKey->mUsedOptions are hashed to become part of the parserOutputKey, as described before.
Once the parserOutputKey is available, an attempt is made to retrieve its value from the cache. If the cached value is found, has not expired and no different revision was requested, the cached ParserOutput object is returned and served to the user. Otherwise false is returned and the page is parsed again.
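The two-step retrieval can be sketched with a plain dictionary standing in for the back-end storage. This Python example is a simplified model under stated assumptions: the ToyParserCache class, its key layout and the option names userlang and thumbsize are hypothetical, and revision checks are left out.

```python
import time

class ToyParserCache:
    def __init__(self):
        self._store = {}  # stands in for the back-end storage (mMemc)

    @staticmethod
    def _output_key(page_id, option_names, options):
        # Hash only the options that were recorded as used; '*' marks a missing one.
        hashed = "".join("!" + str(options.get(name, "*")) for name in option_names)
        return ("output", page_id, hashed)

    def save(self, page_id, used_options, html, expiry_seconds, now=None):
        now = time.time() if now is None else now
        names = sorted(used_options)
        expires = now + expiry_seconds
        # Step 1 data: which options mattered for this page, and until when.
        self._store[("options", page_id)] = {"used": names, "expires": expires}
        # Step 2 data: the rendered output under the option-dependent key.
        key = self._output_key(page_id, names, used_options)
        self._store[key] = {"html": html, "expires": expires}

    def get(self, page_id, request_options, now=None):
        now = time.time() if now is None else now
        meta = self._store.get(("options", page_id))
        if meta is None or meta["expires"] <= now:
            return False  # no usable options entry: the page must be parsed
        key = self._output_key(page_id, meta["used"], request_options)
        entry = self._store.get(key)
        if entry is None or entry["expires"] <= now:
            return False  # saved with different options, or expired
        return entry["html"]

cache = ToyParserCache()
cache.save(31, {"userlang": "en"}, "<p>Edit</p>", expiry_seconds=3600, now=1000)
hit = cache.get(31, {"userlang": "en", "thumbsize": 2}, now=1500)
miss = cache.get(31, {"userlang": "es"}, now=1500)
```

Storing the list of used options separately is what lets a request with extra, irrelevant options (thumbsize here) still hit the cache.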
As described before, ParserOptions are important for saving the cache. If a requested page was saved with different options than those required for retrieval, it has to be parsed again. Following the example above, a user whose preferred language is Spanish would get a different parserOutputKey for exactly the same page, with es instead of en in the options part. Because es stands for the user language, the page has to be parsed again in order to provide the correct representation. The Spanish user would then, for example, see 'Editar' instead of 'Edit' for editing sections of the page.
Provide parser functions
A parser function is any magic word that takes parameters and returns a value calculated from these parameters. The Parser module implements the required functions and provides an interface for using them. An example of a built-in parser function is PLURAL, which returns the singular or plural form of a word depending on a given number.
The parser functions can be accessed not only during the parsing of the articles written in wikitext, but at any point when the outcome of this function is needed. MediaWiki developers can implement additional parser functions by creating an extension. This way users will have more options for applying magic words with parameters suited to their specific needs.
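A parser function can be thought of as a named callback that receives the parameters of a magic word and returns replacement text. The Python sketch below illustrates that idea with a toy registry. The names set_function_hook and expand, the regular expression and the English-only plural rule are assumptions made for this example, not MediaWiki's implementation.

```python
import re

PARSER_FUNCTIONS = {}

def set_function_hook(name, callback):
    # Register a callback under a case-insensitive function name.
    PARSER_FUNCTIONS[name.lower()] = callback

def plural(count, singular, plural_form):
    # English-style rule only; real plural rules depend on the language.
    return singular if int(count) == 1 else plural_form

set_function_hook("PLURAL", plural)

# Matches {{NAME:arg1|arg2|...}} without nested braces.
CALL = re.compile(r"\{\{\s*([A-Za-z]+)\s*:([^{}]*)\}\}")

def expand(text):
    def repl(match):
        func = PARSER_FUNCTIONS.get(match.group(1).lower())
        if func is None:
            return match.group(0)  # unknown function: leave the text untouched
        args = [arg.strip() for arg in match.group(2).split("|")]
        return func(*args)
    return CALL.sub(repl, text)

expanded = expand("2 {{PLURAL:2|page|pages}}")
direct = plural(1, "page", "pages")  # the function is also callable outside parsing
```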
Provide tag hooks
MediaWiki markup includes tags that delimit a piece of text and cause it to be processed in a special way, depending on the meaning of the tag. There are 4 core tags in MediaWiki (nowiki, pre, gallery and html) that the parser can process automatically; that is, their processing is built into the parser. Users, however, might want to extend the wiki markup, and they can do so by introducing new tags. This is done by implementing a tag extension and integrating it with the parser. A tag extension consists of a function that is called during parsing and renders the tagged text into HTML.
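A tag extension boils down to a callback registered under a tag name, which receives the tagged text and its attributes and returns HTML. The following Python sketch shows that shape with a made-up shout tag. The set_hook and render_tags functions, the tag itself and the regular expressions are hypothetical and far simpler than real tag handling.

```python
import html
import re

TAG_HOOKS = {}

def set_hook(tag, callback):
    # Register a rendering callback for a custom tag name.
    TAG_HOOKS[tag.lower()] = callback

def render_shout(content, attributes):
    # Hypothetical <shout> tag: upper-cases its content and appends an optional mark.
    text = content.upper() + attributes.get("mark", "")
    return "<strong>" + html.escape(text) + "</strong>"

set_hook("shout", render_shout)

TAG = re.compile(
    r'<([a-z]+)((?:\s+[a-z]+="[^"]*")*)\s*>(.*?)</\1>',
    re.DOTALL | re.IGNORECASE,
)
ATTR = re.compile(r'([a-z]+)="([^"]*)"', re.IGNORECASE)

def render_tags(text):
    def repl(match):
        hook = TAG_HOOKS.get(match.group(1).lower())
        if hook is None:
            return match.group(0)  # no hook registered: leave the markup alone
        attributes = dict(ATTR.findall(match.group(2)))
        return hook(match.group(3), attributes)
    return TAG.sub(repl, text)

rendered = render_tags("Say <shout>hi & bye</shout>!")
```

Escaping inside the callback matters because the hook's return value is inserted into the page as HTML.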
Extract and replace sections when they are edited
MediaWiki makes it possible to edit a specific section instead of the whole article. Extracting the section and replacing it after it has been changed is also done by the Parser module. The entry points for these operations are Parser::getSection() and Parser::replaceSection().
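Section extraction and replacement can be sketched as slicing the wikitext at heading boundaries. The Python example below is a deliberately reduced model: it treats only level-2 headings ("== ... ==") as boundaries and numbers the lead text as section 0, both of which are assumptions of this sketch. The function names merely echo getSection() and replaceSection().

```python
import re

# A level-2 heading on its own line; deeper levels are ignored in this sketch.
HEADING = re.compile(r"^==[^=].*?==[ \t]*$", re.MULTILINE)

def _section_spans(text):
    # Section 0 is the lead text before the first heading.
    starts = [0] + [match.start() for match in HEADING.finditer(text)]
    ends = starts[1:] + [len(text)]
    return list(zip(starts, ends))

def get_section(text, index):
    start, end = _section_spans(text)[index]
    return text[start:end]

def replace_section(text, index, new_text):
    start, end = _section_spans(text)[index]
    return text[:start] + new_text + text[end:]

page = "Lead\n== A ==\nalpha\n== B ==\nbeta\n"
section = get_section(page, 1)
updated = replace_section(page, 1, "== A ==\nALPHA\n")
```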
Implementation Information
Files related to the Parser module are placed in the parser directory, which can be found under includes/parser. The directory consists of the main Parser.php file as well as 14 other separate files which implement supporting functionality for the parser.
Parser.php
- Contains the core functionality of the Parser module.
- The main entry point is parse(), which does the parsing of wikitext.
- Other entry points:
  - preSaveTransform()
  - preprocess()
  - cleanSig()
  - getSection()
  - replaceSection()
  - getPreloadText()
CacheTime.php
- Sets the time of caching.
- Sets the time (in number of seconds) when the cache should expire and checks if it is expired.
- Contains the variable $mUsedOptions with the ParserOptions that were used to produce the ParserOutput.
CoreParserFunctions.php
- Parser functions provided by MediaWiki core. These functions are registered for the Parser in Parser::firstCallInit().
- A parser function is any magic word that takes one or more parameters. Parser functions are sometimes prefixed with a hash to distinguish them from templates.
- Example: {{PAGEID: page name}} – Returns the page identifier of the specified page.
CoreTagHooks.php
- Tag hooks provided by MediaWiki core. These functions are registered for the Parser in Parser::firstCallInit().
- Core tags are:
  - <pre> – works like the normal HTML <pre> tag; the text inside is not treated as wikitext (ignored by the parser).
  - <nowiki> – the text inside is not treated as wikitext (ignored by the parser).
  - <gallery> – images are displayed in rows and columns.
  - <html> – the text inside is treated as raw HTML, but only if $wgRawHtml is enabled.
DateFormatter.php
- The date formatter recognizes dates in plain text and formats them according to user preferences.
LinkHolderArray.php
- Temporarily holds the links of the wiki page. There are 2 types of links: $internals and $interwikis. At some point during parsing, all internal and interwiki links are cut out of the wikitext and placed into this array as key-value pairs. In place of the removed links, their keys are inserted into the wikitext. At the right time, the real links, correctly formatted in HTML, are placed back into the wiki page.
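The placeholder technique can be illustrated with a small Python model. Everything in it is a simplification made for this sketch: the ToyLinkHolders class, the placeholder format, the /wiki/ URL prefix and the restriction to plain [[Title]] links (no pipes, no interwiki prefixes) are assumptions.

```python
import html
import re

class ToyLinkHolders:
    def __init__(self):
        self.internals = {}  # key -> link target, filled while parsing

    def hold(self, text):
        # Cut each [[Title]] out of the text and leave a keyed placeholder behind.
        def repl(match):
            key = len(self.internals)
            self.internals[key] = match.group(1)
            return f"<!--LINK {key}-->"
        return re.sub(r"\[\[([^\[\]|]+)\]\]", repl, text)

    def replace(self, text):
        # Put real HTML links back where the placeholders are.
        def repl(match):
            title = self.internals[int(match.group(1))]
            href = "/wiki/" + title.replace(" ", "_")
            return f'<a href="{html.escape(href)}">{html.escape(title)}</a>'
        return re.sub(r"<!--LINK (\d+)-->", repl, text)

holders = ToyLinkHolders()
held = holders.hold("See [[Main Page]] and [[Help]].")
final = holders.replace(held)
```

Between hold() and replace(), other parsing passes can run without accidentally rewriting the link text, which is the point of the placeholders.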
Parser_DiffTest.php (removed in 1.35)
- Fake parser that outputs the difference of two different parsers.
ParserCache.php
- Handles caching of the ParserOutput:
  - generates cache key
  - saves in cache
  - retrieves from cache
ParserOptions.php
- Holds the options used to create the ParserOutput.
- Generates the hashed options that are used as part of the key to store the ParserOutput in the cache.
ParserOutput.php
- Represents the output of the parsing.
- The main variable is $mText, which contains the HTML text that will be rendered in the browser.
- Also has variables such as:
  - $mLanguageLinks – List of the full text of language links, in the order they appear
  - $mCategories – Map of category names to sort keys
  - $mTitleText – Title text of the chosen language variant
  - ...
Preprocessor.php
- Interfaces for preprocessors (Preprocessor, PPFrame, PPNode).
- Implementation classes can be set in the configuration for the Parser.
Preprocessor_DOM.php (removed in 1.35)
- Preprocessor using PHP's dom extension.
- Contains several classes: Preprocessor_DOM (implements Preprocessor), PPDStack, PPDStackElement, PPDPart, PPFrame_DOM, PPTemplateFrame_DOM, PPCustomFrame_DOM, PPNode.
- Main functionalities:
  - Preprocesses wikitext and returns a document tree.
  - Creates a PPFrame_DOM object and calls its expand() method to build the structure of the wiki page (creates new lines for sections and lists) and to expand templates and parser functions.
Preprocessor_Hash.php
- Preprocessor using PHP arrays.
StripState.php
- Holder for stripped items when parsing wikitext. There are 2 types of stripped items: nowiki and general.
- Holds these items as key-value pairs while parsing is done. The items are removed from the wikitext and their keys are placed there instead. When all the parsing is done, the items are inserted back into the text.
- This is done in order to leave nowiki text as it is while doing normal parsing of wikitext.
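The strip-and-reinsert idea can be shown in a few lines of Python. This sketch models only the nowiki type; the general type would work the same way with a second dictionary. The ToyStripState class and the marker format are hypothetical.

```python
import re

class ToyStripState:
    def __init__(self):
        self.nowiki = {}  # marker -> original text

    def strip_nowiki(self, text):
        # Replace each <nowiki>...</nowiki> span with a unique marker.
        def repl(match):
            marker = f"\x7fSTRIP-nowiki-{len(self.nowiki)}\x7f"
            self.nowiki[marker] = match.group(1)
            return marker
        return re.sub(r"<nowiki>(.*?)</nowiki>", repl, text, flags=re.DOTALL)

    def unstrip_nowiki(self, text):
        # Insert the protected text back once parsing is finished.
        for marker, content in self.nowiki.items():
            text = text.replace(marker, content)
        return text

state = ToyStripState()
stripped = state.strip_nowiki("''a'' <nowiki>''b''</nowiki>")
# Stand-in for normal parsing: an italic pass that must not touch nowiki text.
parsed = re.sub(r"''(.+?)''", r"<i>\1</i>", stripped)
restored = state.unstrip_nowiki(parsed)
```

The marker uses a control character so that ordinary wikitext passes are unlikely to match or alter it.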
Tidy.php
- HTML validation and correction.
- Parsing/Replacing Tidy