Markup spec/BNF/Article


Wiki-page

The top-level element (start symbol) is wiki-page which describes the contents of a page. A page can either be a redirect or a normal article.

<wiki-page>               ::= <redirect> [<article>] | [<article>]
<redirect>                ::= <redirect-tag> <characters> <internal-link-start> <article-link> (<internal-link-end> | <pipe> | EOL)
<redirect-tag>            ::= FROM_LANGUAGE_FILE

<internal-link-start>, <article-link>, <internal-link-end> and <pipe> are defined in Links.

Notes:

The <redirect-tag> is language-specific, and may have more than one possible value. By default the value for the right-hand-side of the expression (replacing FROM_LANGUAGE_FILE) is "#redirect", but in Estonian it is "#redirect" | "#suuna". This match is case-insensitive (though this again may be overridden in the language file).

<characters> should be non-greedy: it matches the longest run of characters that does not contain <internal-link-start>, i.e. everything up to the first occurrence of <internal-link-start>.

For example, <redirect> will match the following, and treat it as a redirect to foo:

#REDireCTnon%^sense[[foo|and this is parsed as article content

Interwiki prefixes may not be supported in redirect links. (Is this configurable?)
The <article> following the redirect link is not rendered. However, it is parsed. So, interwiki links, category links and even normal links are still treated and behave "normally".
Anchors (Article#Section) are supported, but not yet described in the grammar.
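
The <redirect> rule and the notes above can be approximated with a regular expression. The following is a minimal Python sketch, not the parser's actual implementation. The names REDIRECT_TAGS, REDIRECT_RE and match_redirect are hypothetical, the tag list stands in for the language file, and <article-link> (defined in Links) is simplified here to any run of characters other than "]", "|" and newline.

```python
import re

# Hypothetical stand-in for the language file (default tag plus the Estonian one).
REDIRECT_TAGS = ["#redirect", "#suuna"]

# Assumption: <article-link> is simplified to "anything but ], | or newline".
REDIRECT_RE = re.compile(
    r"(?:" + "|".join(re.escape(t) for t in REDIRECT_TAGS) + r")"  # <redirect-tag>
    r"(?:(?!\[\[).)*?"   # <characters>: stops at the first [[
    r"\[\["              # <internal-link-start>
    r"([^\]|\n]+)"       # <article-link> (simplified)
    r"(?:\]\]|\||$)",    # <internal-link-end> | <pipe> | EOL
    re.IGNORECASE | re.MULTILINE,
)


def match_redirect(page_text):
    """Return the redirect target if the page starts with a redirect, else None."""
    m = REDIRECT_RE.match(page_text)
    return m.group(1) if m else None
```

Applied to the example above, match_redirect returns "foo"; everything after the pipe is left for the <article> rule.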

Article

This describes the contents of an article. An article consists of blocks, which come in two flavours: paragraphs and special blocks. Both of them end with a newline. Paragraphs are separated by empty lines.

 <article>                 ::= <special-block-and-more> | <paragraph-and-more> 
 <special-block-and-more>  ::= <special-block> ( EOF | [<newline>] <special-block-and-more> 
                                                     | (<newline> | "") <paragraph-and-more> )
 <paragraph-and-more>      ::= <paragraph> ( EOF | [<newline>] <special-block-and-more> 
                                                 | <newline> <paragraph-and-more> )

The nonterminals special-block-and-more and paragraph-and-more are not disjoint; the parser should first try to match against special-block-and-more.

The expression (<newline> | "") is a greedy version of [<newline>]. If both the empty string and a newline can be matched, then the former expression matches the newline, while the latter expression would match the empty string according to the conventions on Markup spec/BNF.
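
The difference between the two forms can be illustrated with regular expressions. This is only an analogy sketched in Python, assuming ordered alternation and a lazy quantifier behave like the greedy and non-greedy options described above; the pattern names are hypothetical.

```python
import re

# (<newline> | "") : the alternation tries the newline first.
GREEDY_OPTION = re.compile(r"x(\n|)(\n*)")

# [<newline>] : a lazy optional, which tries the empty string first.
NON_GREEDY_OPTION = re.compile(r"x(\n??)(\n*)")

text = "x\n\n"

# Both patterns match the whole text, but they split the newlines differently.
greedy_groups = GREEDY_OPTION.match(text).groups()
non_greedy_groups = NON_GREEDY_OPTION.match(text).groups()
```

Here greedy_groups is ("\n", "\n"), while non_greedy_groups is ("", "\n\n"): the optional part consumes the newline only in the greedy form.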

For the definition of special block, see Markup spec/BNF/Special block.

Note

Any line that does not start with one of the following is not a special block: " " | "{|" | "#" | ";" | ":" | "*" | "="

This should assist in parsing.

Hey, that's almost what it says in the current parser. I must be onto something. Wonder why it doesn't cover space or = though.

if (!$piece['lineStart'] && preg_match('/^(?:{\\||:|;|#|\*)/', $text)) /*}*/{
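
The prefix list from the note can be turned into a cheap pre-check. The following Python sketch is an illustration only: the names SPECIAL_BLOCK_PREFIXES and may_start_special_block are hypothetical, and the test is a necessary condition, not a full match against <special-block>.

```python
# Prefixes listed in the note; a line starting with none of them
# cannot begin a special block.
SPECIAL_BLOCK_PREFIXES = (" ", "{|", "#", ";", ":", "*", "=")


def may_start_special_block(line):
    """Necessary (not sufficient) condition for a line to begin a special block."""
    # str.startswith accepts a tuple and checks each prefix in turn.
    return line.startswith(SPECIAL_BLOCK_PREFIXES)
```

A parser could call this before attempting <special-block-and-more> and fall through to <paragraph-and-more> when it returns False.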

Paragraph

Every paragraph ends with a newline character. A paragraph is translated into a <p> element.

 <paragraph>               ::= <newline> [<lines-of-text>] | <lines-of-text>
 <lines-of-text>           ::= <line-of-text> [<lines-of-text>]
 <line-of-text>            ::= <inline-text> <newline>

For the definition of inline text, see Markup spec/BNF/Inline text.

The recursion in the second rule should be non-greedy, i.e., it should match as few lines as possible. For instance,

abc

----

should be parsed as one line-of-text and one horizontal-rule, but

abc

---

should be parsed as two line-of-text nonterminals.
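
One way to get this behaviour is to test each line against the competing constructs before adding it to the paragraph. The Python sketch below covers only the horizontal-rule case and rests on an assumption not stated on this page: that a horizontal rule is a line starting with four or more hyphens. The names HORIZONTAL_RULE_RE and classify_lines are hypothetical.

```python
import re

# Assumption: a horizontal rule is a line that starts with four or more hyphens.
HORIZONTAL_RULE_RE = re.compile(r"-{4,}")


def classify_lines(text):
    """Label each line as 'horizontal-rule' or 'line-of-text'.

    Only these two constructs are distinguished; every line that is not a
    horizontal rule is reported as a line-of-text.
    """
    labels = []
    for line in text.splitlines():
        if HORIZONTAL_RULE_RE.match(line):
            labels.append("horizontal-rule")
        else:
            labels.append("line-of-text")
    return labels
```

With this check, "abc" followed by "----" yields one line-of-text and one horizontal-rule, while "abc" followed by "---" yields two line-of-text entries.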

If a paragraph starts with a newline, the newline is rendered as a <br> element.

Block HTML

(not referred to yet)

BlockHTML = Pre | Blockquote | TableHTML | Div | HeaderHTML ;

String Types

This text came from Meta-Wiki. It's not immediately compatible with the surrounding text (it's EBNF, rather than BNF, for a start). However it is much more precise about the nature of lines and captures rules about whitespace normalisation.

Fundamental strings

WikiMarkupCharacters = "|" | "[" | "]" | "*" | "#" | ":" | ";" | "<" | ">" | "=" | "'" | "{" | "}" ;

UnicodeCharacter = ? all supported Unicode characters ? - WhiteSpaces ;
UnicodeWiki = UnicodeCharacter - WikiMarkupCharacters ;
PlainText = UnicodeWiki
          | "<nowiki>" { "|" | "[" | "]" | "<" | ">" | "{" | "}" } "</nowiki>"
          | UnicodeWiki { " " } ( "*" | "#" | ":" | ";" )
          | UnicodeWiki [ " " ] "=" [ " " ] UnicodeWiki
          | UnicodeWiki "'"
          | " '" UnicodeWiki ;
WhiteSpaces =  " " | NewLine | ? carriage return ? | ? line feed ? | ? tab ? | ? variants of spaces ? ;
NewLine = ? carriage return and line feed ? ;

Article strings

Line = PlainText { PlainText } { " " { " " } PlainText { PlainText } } ;
Text = Line { Line } { NewLine { NewLine } Line { Line } } ;

Titles

PageName = TitleCharacter , { [ " " ] TitleCharacter } ;
PageNameLink = TitleCharacter , { [ " " | "_" ] TitleCharacter } ;
SectionTitle = ( SectionLinkCharacter - "=" ) { [ " " ] ( SectionLinkCharacter - "=" ) } ;
SectionLink = SectionLinkCharacter { [ "_" ] SectionLinkCharacter } ;
LinkTitle = { UnicodeCharacter { " " } } ( UnicodeCharacter - "]" ) ;

TitleCharacter = UnicodeCharacter - BadTitleCharacters ;
BadTitleCharacters = "[" | "]" | "{" | "}" | "<" | ">" | "_" | "|" | "#" ;
SectionLinkCharacter = UnicodeCharacter - BadSectionLinkCharacters ;
BadSectionLinkCharacters = "[" | "]" | "|" ;
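
As an illustration of how the title rules compose, the following Python sketch checks a string against PageName. It is a reading of the rules above under stated assumptions, not a reference implementation: "all supported Unicode characters" is approximated by any Python character, WhiteSpaces by str.isspace(), and the function names are hypothetical.

```python
BAD_TITLE_CHARACTERS = set("[]{}<>_|#")


def is_title_character(ch):
    # UnicodeCharacter excludes whitespace; TitleCharacter also excludes the bad set.
    return not ch.isspace() and ch not in BAD_TITLE_CHARACTERS


def is_page_name(s):
    """Check s against PageName = TitleCharacter , { [ " " ] TitleCharacter }."""
    if not s or not is_title_character(s[0]):
        return False
    prev_space = False
    for ch in s[1:]:
        if ch == " ":
            if prev_space:
                return False  # two spaces in a row are not allowed
            prev_space = True
        elif is_title_character(ch):
            prev_space = False
        else:
            return False
    return not prev_space  # the name must end with a TitleCharacter
```

Under this reading "Main Page" is a valid PageName, while "Main_Page" is not (it would only be accepted by PageNameLink, which also allows "_" as a separator).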