Markup spec/BNF/Inline text

EBNF grammar project

ANTLR
BNF
- Article title
- Article
- Noparse-block
- Links
- Magic links
- Special block
- Inline text
- Fundamental elements

"Inline text" is the guts of Wikitext formatting. It covers every situation where "normal" text is allowed, such as image captions, table data, and to an extent (yet to work out how to enforce this restriction...), headings and link text.

Code

 <inline-text>             ::= <inline-element> [<inline-text>]
 <inline-element>          ::=
                             | <category-link>           
                             | <internal-link> 
                             | <external-link>
                             | <magic-link> 
                             | <image-inline> | <gallery-block> | <media-inline> 
                             | <text-with-formatting>

 <text-with-formatting>    ::=
                             | <formatting> 
                             | <inline-html> 
                             | <noparseblock>
                             | <behaviour-switch> 
                             | <open-guillemet> | <close-guillemet>
                             | <html-entity>
                             | <html-unsafe-symbol>
                             | <text>
                             | <random-character> 
                             | (more missing?)...
 <text>                    ::= { <harmless-character> }+

Detail:

noparse-block: Markup spec/BNF/Noparse-block
link: Links
magic-link: Magic links

The parser should try the options in order, with text matching if all else fails. In particular, <category-link> should be matched before <link> because a category is a special type of link, and we don't want the normal parsing to occur.

HTML entity

The parser recognises validly constructed HTML entities and leaves them alone.

 <html-entity>           ::= "&" <html-entity-name> ";"
                          | "&#" <decimal-number> ";"
                          | "&#x" <hex-number> ";"                                                   
 <html-entity-name>      ::= Sanitizer::$wgHtmlEntities (case sensitive)
 (* "Aacute" | "aacute" | ... *)

Rendering

The whole sequence is output literally.
The current parser does very complicated things with escaping and de-escaping. So maybe there are places where something more complicated needs to happen.

HTML unsafe symbol

These "unsafe" symbols are turned into HTML entities if they haven't matched part of a valid HTML entity above. It's probably not too efficient having single-character level matching rules...perhaps should be combined with "text".

 <html-unsafe-symbol>     ::= <unescaped-ampersand> | <unespaced-less-than> | <unescaped-greater-than>
 <unescaped-ampersand>    ::= "&" 
 <unescaped-less-than>    ::= "<" 
 <unescaped-greater-than> ::= ">"

Rendering

<unescaped-ampersand> → &
<unescaped-less-than> → <
<unescaped-greater-than> → >

Text

Harmless-characters mean characters that couldn't be anything else. I'm not sure how useful this is as a distinction, but perhaps it will help speed things up?

A "random character" is any character which hasn't matched anything else.

 <harmless-characters>    ::= /[A-Za-z0-9] etc
 <random-character>       ::= ? any character ... ?

Rendering

Both types are written literally.

This section from the "fundamental elements" section...time to mangle!

<character>             ::= <whitespace-char> | <non-whitespace-char> | <html-entity> 


<whitespace>            ::= <whitespace-char> [<whitespace>] | EOF
<newlines>              ::= <newline> [<newlines>]
<space-tabs>            ::= <space-tab> [<space-tabs>]

<whitespace-char>       ::= <space-tab> | <newline>

<space-tab>             ::= <space> | TAB
<spaces>                ::= <space> [<spaces>]
<space>                 ::= " "

<newline>               ::= CR LF | LF CR | CR | LF
<BOL>                   ::= <newline> | BOF
<EOL>                   ::= <newline> | EOF

<non-whitespace-char>   ::= <letter> | <decimal-digit> | <symbol>
<letter>                ::= <ucase-letter> | <lcase-letter>
<ucase-letter>          ::= "A" | "B" | ... | "Y" | "Z"
<lcase-letter>          ::= "a" | "b" | ... | "y" | "z"
<symbol>                ::= <html-unsafe-symbol> | <underscore> | "." | "," | ...

<underscore>            ::= "_"

<decimal-number>        ::= <decimal-digit> [<decimal-number>]
<decimal-digit>         ::= "0" | "1" | ... | "8" | "9"

<hex-number>            ::= <hex-digit> [<hex-number>]
<hex-digit>             ::= <decimal-digit> 
                            | "A" | "B" | "C" | "D" | "E" | "F" 
                            | "a" | "b" | "c" | "d" | "e" | "f"

Formatting

Bold/italics is the biggest problem with switching to a consume-parse-render parser. It will not be possible to describe the current, extremely esoteric rules in simple (E)BNF. The best we can hope for is to store tokens representing the apostrophe clumps and do a second pass to make more sense of them. It would be very useful to define a second, unambiguous set of formatting syntax (most likely // and **), and encourage people to use those wherever apostrophes and bold/italics meet.

Some rules for parsing bold/italics as recognised by the current parser. These must be implemented (Brion said so). In increasing order of complexity:

''italics'', '''bold''', '''''bold-italics'''''
italics, bold, bold-italics
'''''bold-italics''' just italics'' normal
bold-italics just italics normal
Some text about l''''Arc de triomphe'''.
Some text about l'Arc de triomphe.
However: '''bold is l''''Arc de triomphe'''.
However: bold is l'Arc de triomphe.

Optimistic view:

 <formatting>              ::= <bold-italic-toggle> | <bold-toggle> | <italic-toggle>
 <bold-italic-toggle>      ::= "'''''"
 <bold-toggle>             ::= "'''"
 <italic-toggle>           ::= "''"

Reality:

 <formatting>              ::= <apostrophe-jungle>
 <apostrophe-jungle>       ::= "''" { "'" }

Rendering

Once the parser has decided which way the toggles go:
- bold-toggle-on -> 
- bold-toggle-off -> 
- italic-toggle-on -> 
- italic-toggle-off -> 
- bold-italic-toggle-on -> 
- bold-italic-toggle-off->

Determining the behaviour of apostrophes

The following describes the behaviour of repeated apostrophes. "Bold" means "toggle bold", rather than "turn bold on". "Bold, italics" means "Toggle bold and italics independently", rather than "turn bold and italics on" or "toggle bold and italics the same way".

One ( ' ): Always a single apostrophe.
e.g. ( hello ' blah ) → hello ' blah
Two ( '' ): Always italics on or off
e.g. ( hello '' blah ) → hello blah
Three ( ''' ):
1. Bold (default)
 e.g. ( hello ''' blah ) → hello blah
2. Apostrophe, italics
 - If there is otherwise an odd number of both bold and italics
 1. If the preceding characters are <space><non-space> (and there are no earlier such sequences)
 e.g. ( hello l'''amour'' l'''ouest''' blah ) → hello l'amour louest blah
 2. Else if the preceding characters are <non-space><non-space> (and there are no earlier such sequences)
 e.g. ( hello mon'''amour'' blah ) → hello mon'amour blah
 3. Else (the preceding character is <space>) (and there are no earlier such sequences)
 e.g. ( hello '''amour'' '''blah '''blah ) → hello 'amour blah blah
3. Italics, apostrophe (never)
Four ( '''' ):
1. Bold, apostrophe (never)
2. Apostrophe, bold (default, if either bold or italics ends up balanced)
  e.g. ( hello ''''amour''' now ''italics unbalanced, but that's ok ) → hello 'amour now italics unbalanced, but that's ok
  
  e.g. ( hello ''''amour''' now, '''bold unbalanced, but that's ok ) → hello 'amour now, bold unbalanced, but that's ok
3. Apostrophe, apostrophe, italics
  - If the default treatment leads to an odd number of bold and italics then this can meet condition 1 under the second case of three italics, above.
    e.g. ( hello ''''amour''' now '''''bold and italics unbalanced, so invoke this special case ) → hello ''amour now bold and italics unbalanced, so invoke this special case
Five ( ''''' ):
1. Bold, italics; or italics, bold (default, the two cases are equivalent)
  e.g. ( hello ''''' blah ) → hello blah
2. Italics, apostrophe, italics (never)
More than five:
1. Apostrophes, bold+italics (default)
  e.g. ( hello '''''''''' blah ) → hello ''''' blah
  
  e.g. ( hello '''bold '''''''''' blah ) → hello bold ''''' blah
2. Bold+italics, apostrophes (never)

Inline HTML

The parser recognises and cleans a large number of HTML tags, as defined in Sanitizer.php.

A decision has to be made here on whether to attempt to parse these things as a matched set, or whether to leave that to a later pass.

A loose definition assuming they are treated individually:

 <InlineHTML>              ::=  <InlineHTML-Open> | <InlineHTML-Close> | <InlineHTML-OpenClose> | <HTMLComment>
 <InlineHTML-Open>         ::=  "<" <InlineHTMLtagname> [<extra-characters>] ">"
 <InlineHTML-Close>        ::=  "</" <InlineHTMLtagname> [<extra-characters>] ">"
 <InlineHTML-OpenClose>    ::=  "<" <InlineHTMLtagname> [<extra-characters>] "/>"
 <extra-characters>      ::= <word-boundary-char> {characters - ">"}
 <word-boundary-char>      ::= " " | "-" | ":" | " " | "\"" | "/" | "*" | "#" | "!" | "$" | "%" | ...

Remarks

The range of "word-boundary-char" seems to be an artefact of the regular expression: if( preg_match( '!^(/?)(\\w+)([^>]*?)(/{0,1}>)([^<]*)$!', $x, $regs ) ) {

The list of "tags that must be closed":

block elements: p, span, table, div,
lists: ol, ul, dl,
paragraph formatting: h1, h2, h3, h4, h5, h6, cite, center, blockquote, caption, pre,
character formatting: b, del, i, ins, u, font, big, small, sub, sup, code, em, s,; strike, strong, tt, var, u
Ruby: rt, rb , rp, ruby,

Tags that can appear singly, and possibly paired

br, hr, li, dt, dd

Tags that must not be paired

br, hr

Tags that can be nested (source code is dubious on this)

table, tr, td, th, div, blockquote, ol, ul, dl, font, big, small, sub, sup, span

Tags that can only appear inside a table

td, th, tr,

Tags that make lists

ul,ol,

And tags that can appear inside lists

li

The significance of these groupings is shown as follows:

 A <blockquote> B <span>C </blockquote> D </span> E

Here, blockquote and span are both "nesting" tags. When the close-blockquote tag is found inside the span block, it is escaped.

This doesn't work:

 <span>Some text [[Image:foo.jpg|close </span>it.]]

But this does:

 <b>Some text [[Image:foo.jpg|close </b>it.]]

Rendering

Tags that have to be paired are forced closed according to some sort of logic.
<extra-characters> are "sanitized", strip all but pre-approved attributes and styles on a whitelist.
Tags are then written out literally: <InlineHTMLTagname> " " <sanitized-attributes> > etc.
HTML comments are completely discarded, with some whitespace massaging: (sanitizer.php)
To avoid leaving blank lines, when a comment is both preceded and followed by a newline (ignoring spaces), trim leading and trailing spaces and one of the newlines.

Non-breaking spaces

This is pretty trivial and used basically to improve the appearance of punctuation in French, which always places a space before certain punctuation, and places spaces inside guillemets. Other languages use these characters, but without the spaces. Currently performed directly in the parse() method.

<nbsp-before>             ::= [any character] <space> ("&raquo;" | "?" | ":" | ";" | "!" | "%")
<nbsp-after>              ::= "&laquo;" <space>

Rendering

In both cases, the space is converted to a   string.

Behaviour switches

Not to be confused with magic links. These seem to be able to be used virtually anywhere: a table of contents in an image caption even works. See Help:Magic words#Behaviour switches.

 <behaviour-switch>               ::= <behaviourswitch-toc> | <behaviourswitch-forcetoc> | <behaviourswitch-notoc> | <behaviourswitch-noeditsection> | <behaviourswitch-nogallery> 
 <behaviourswitch-toc>            ::= mw("toc")
 <behaviourswitch-forcetoc>       ::= mw("forcetoc")
 <behaviourswitch-notoc>          ::= mw("notoc")
 <behaviourswitch-noeditsection>  ::= mw("noeditsection")
 <behaviourswitch-nogallery>      ::= mw("nogallery")

 /* defaults, i->case insensitive, s->case sensitive */
 mw("notoc")                ::= "__TOC__"i
 mw("forcetoc")             ::= "__FORCETOC__"i
 mw("notoc")                ::= "__NOTOC__"i
 mw("noeditsection")        ::= "__NOEDITSECTION__"i
 mw("nogallery")            ::= "__NOGALLERY__"i

Notes:

These are the "default" strings to be matched. They can be modified in languages/messages/MessagesXx.php where Xx is the language.
Each magicword can have more than one string associated with it.
The magic words are by default case insensitive but this can be changed in the file.
Plenty of other "magic words" exist, including "magic variables" (eg {{CURRENTMONTH}}) which will be handled by the preprocessor. However it looks like all sorts of other "magic words" exist and are processed in different places.

Semantics

behaviourswitch-toc: a miniature contents page will be rendered and inserted at the first instance of this token.
behaviourswitch-forcetoc: a contents box will be rendered even if the normal criteria (typically, 4 sections) have not been met. Irrelevant if magicword-toc is present.
behaviourswitch-notoc: no miniature contents pages will be rendered. Only takes effect if neither magicword-toc nor magicword-forcetoc are present.
behaviourswitch-noeditsection: no edit links are to be displayed for any sections.
behaviourswitch-nogallery: unclear. According to the code (parser::stripNoGallery): if the string (not case-sensitive) occurs in the HTML, do not add TOC. Perhaps it only has an effect in certain namespaces.

Images, media, gallery

Links to images and media should be handled as normal links. It's inline images and media that are being dealt with here.

Originally from MetaWiki.

Images

 ImageInline                ::= "[[" , "Image:" , PageName, ".", ImageExtension, ( { <Pipe>, ImageOption, } ) "]]" ;
 ImageName                  ::= PageName, ".", ImageExtension
 ImageExtension             ::= "jpg" | "jpeg" | "png" | "svg" | "gif" | "bmp" ;
 ImageOption                ::= ImageModeParameter | ImageSizeParameter | ImageAlignParameter  
                             | ImageVAlignParameter | Caption

 ImageModeParameter         ::= ImageModeManualThumb | ImageModeThumb | ImageModeFrame | ImageModeFrameless

 ImageModeManualThumb       ::= mw("img_manualthumb");
 ImageModeAutoThumb         ::= mw("img_thumbnail");  
 ImageModeFrame             ::= mw("img_frame");      
 ImageModeFrameless         ::= mw("img_frameless");  

/* Default settings: */
 mw("img_manualthumb")      ::= "thumbnail=", ImageName | "thumb=", ImageName
 mw("img_thumbnail")        ::= "thumbnail" | "thumb";
 mw("img_frame")            ::= "framed" | "enframed" | "frame";
 mw("img_frameless")        ::= "frameless";

 ImageOtherParameter        ::= ImageParamPage | ImageParamUpright | ImageParamBorder
 ImageParamPage             ::= mw("img_page") 
 ImageParamUpgright         ::= mw("img_upright") 
 ImageParamBorder           ::= mw("img_border")

/* Default settings: */
 mw("img_page")             ::= "page=$1" | "page $1"  ??? (where is this used?)
 mw("img_upright")          ::= "upright" [, ["=",] PositiveInteger]
 mw("img_border")           ::= "border"


 ImageSizeParameter         ::= mw("img_width"); 
/* Default setting: */
 mw("img_width")            ::= PositiveNumber "px" ;


 ImageAlignParameter        ::= ImageAlignLeft | ImageAlign|Center | ImageAlignRight | ImageAlignNone
 ImageAlignLeft             ::= mw("img_left") 
 ImageAlignCenter           ::= mw("img_center") 
 ImageAlignRight            ::= mw("img_right") 
 ImageAlignNone             ::= mw("img_none")

/* Default settings: */
 mw("img_left")             ::= "left"
 mw("img_center")           ::= "center" | "centre"
 mw("img_right")            ::= "right"
 mw("img_none")             ::= "none"


 ImageValignParameter       ::= ImageValignBaseline | ImageValignSub | ImageValignSuper | ImageValignTop 
                             | ImageValignTextTop | ImageValignMiddle | ImageValignBottom | ImageValignTextBottom

 ImageValignBaseline        ::= mw("img_baseline") 
 ImageValignSub             ::= mw("img_sub")
 ImageValignSuper           ::= mw("img_super")
 ImageValignTop             ::= mw("img_top")
 ImageValignTextTop         ::= mw("img_text_top")
 ImageValignMiddle          ::= mw("img_middle")
 ImageValignBottom          ::= mw("img_bottom")
 ImageValignTextBottom      ::= mw("img_text_bottom")

/* By default: */
 mw("img_baseline")         ::= "baseline"
 mw("img_sub")              ::= "sub"
 mw("img_super")            ::= "super" | "sup"
 mw("img_top")              ::= "top"
 mw("img_text_top")         ::= "text-top"
 mw("img_middle")           ::= "middle"
 mw("img_bottom")           ::= "bottom"
 mw("img_text_bottom")      ::= "text-bottom"
 
 Caption                    ::= <inline-text>

Semantics

Renders an image inline using the <img> tag.
It is not an error to specify multiple alignment parameters; the first specified is the one used.
It is not an error to specify multiple captions; the last specified is the one used.
The caption has no effect if ThumbImageParameter is not given.

Media

 MediaInline               ::=  "[[" , "Media:" , PageName "." MediaExtension "]]" ;
 MediaExtension = "ogg" | "wav" ;

Gallery

 GalleryBlock               ::=   "<gallery>" [ NewLine ] GalleryImage { [ NewLine ] GalleryImage } [ NewLine ] "</gallery>" ;
 GalleryImage               ::=   (to be defined: essentially   foo.jpg[|caption] )

Remarks:

The gallery block can technically be used in the middle of a sentence so is not a "special block". It doesn't render particularly nicely when you do that though.