User:OrenBochman/ParserNG
These pages are an attempt at documenting my w:ANTLR-based parser spec for creating a new, efficient analysis chain for indexing wikisource.
1. The ANTLR spec needs to be able to tokenize wikisource.
2. The tokens should then be tagged.
3. Next, the tokens can be filtered.
4. Tokens that are not removed by the filter may be augmented with payloads (sketched below).
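A minimal sketch of the four steps as a Lucene analyzer, assuming recent Lucene analysis APIs; only the Lucene base classes are real, and every Wiki* class name is a hypothetical placeholder:

<syntaxhighlight lang="java">
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical analyzer wiring the four steps together:
// tokenize -> tag -> filter -> attach payloads.
public class WikiSourceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WikiSourceTokenizer();   // step 1: tokenize wikisource (hypothetical)
        TokenStream chain = new WikiTagFilter(source);  // step 2: tag the tokens (hypothetical)
        chain = new WikiMarkupStopFilter(chain);        // step 3: filter out unwanted tokens (hypothetical)
        chain = new WikiPayloadFilter(chain);           // step 4: augment survivors with payloads (hypothetical)
        return new TokenStreamComponents(source, chain);
    }
}
</syntaxhighlight>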
Specs
- I no longer see a need for a monolithic parser in ANTLR.
- Lucene loves TokenStream-based analysis chains: each step in the chain consumes its predecessor's output (see the bridge sketch below).
- These ANTLR grammars do little more than document the wiki syntax while creating a parse tree.
However, ANTLR can generate PHP and JavaScript code.
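One way ANTLR and Lucene could meet is to wrap an ANTLR-generated lexer in a Lucene Tokenizer. A sketch assuming Lucene 3.x-era and ANTLR 3 runtime APIs; WikiLexer is a hypothetical grammar-generated class, the bridge itself is not an existing API:

<syntaxhighlight lang="java">
import java.io.IOException;
import java.io.Reader;
import org.antlr.runtime.ANTLRReaderStream;
import org.antlr.runtime.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical bridge: expose an ANTLR-generated lexer as a Lucene Tokenizer.
public class AntlrWikiTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private final WikiLexer lexer; // generated by ANTLR from the wikitext grammar (hypothetical)

    public AntlrWikiTokenizer(Reader reader) throws IOException {
        super(reader);
        lexer = new WikiLexer(new ANTLRReaderStream(reader));
    }

    @Override
    public boolean incrementToken() {
        Token t = lexer.nextToken();
        if (t.getType() == Token.EOF) return false;
        clearAttributes();
        termAtt.setEmpty().append(t.getText());
        typeAtt.setType(String.valueOf(t.getType())); // tag the token with its grammar type
        return true;
    }
}
</syntaxhighlight>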
Analysis Chain
The specs are planned as parts of a parser chain:
- preprocessor
  - comments
  - templates
  - magic words
  - parser functions
- core
  - extensions
    - maths
    - date
    - <include> etc.
  - tables
  - links, images
  - other simple syntax
  - formatting
- extension tags
  - cite
  - others
- Ideally, the parser should be able to transform the input to an output format.
- For search, however, the overriding concern is to capture the search points in their context, together with any boosting factors.
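On the boosting-factor point, a sketch of a filter that stores a per-token boost as a Lucene payload, assuming Lucene 4+ attribute APIs; the idea that heading text should carry extra weight, the HEADING token type, and the class name are all my own illustration:

<syntaxhighlight lang="java">
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: tokens tagged as coming from a heading get a boost
// payload, so the search points carry their context's weight into the index.
public final class HeadingBoostFilter extends TokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public HeadingBoostFilter(TokenStream input) { super(input); }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        if ("HEADING".equals(typeAtt.type())) { // token type assigned upstream (hypothetical)
            payloadAtt.setPayload(new BytesRef(new byte[] { 4 })); // boost factor, decoded at query time
        }
        return true;
    }
}
</syntaxhighlight>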
To build fully functional output would necessitate:
- a mechanism to resolve
  - parser functions,
  - magic words,
  - extension actions (not realistic), unless they can be invoked fast from a mock PHP Doc interface,
  - globalization of information;
- a mechanism to resolve transclusion of
  - templates,
  - non-template namespace content,
  - transliteration tables;
- parser specs:
  - integration with the parser specs;
- transforming the parse tree into the output tree using tree grammars;
- using a StringTemplate file to construct the output (sketched below).
An analysis of the input, on the other hand, produces a basic parse tree.
- [[Manual:Extending_wiki_markup]]
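A sketch of the StringTemplate step using the ST4 library's actual API; the template text and attribute names are made up for illustration, and a real spec would keep its templates in a .stg group file per output format:

<syntaxhighlight lang="java">
import org.stringtemplate.v4.ST;

public class LinkRenderer {
    // Render an internal-link parse-tree node to HTML via a template.
    public static String renderLink(String target, String label) {
        ST st = new ST("<a href=\"/wiki/$target$\">$label$</a>", '$', '$');
        st.add("target", target);
        st.add("label", label);
        return st.render();
    }

    public static void main(String[] args) {
        // prints: <a href="/wiki/Main_Page">the main page</a>
        System.out.println(renderLink("Main_Page", "the main page"));
    }
}
</syntaxhighlight>

Swapping in a different template file would retarget the same tree walk to, say, plain text for indexing.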
Current Specs
- WikiTable
- Preprocessor
- Translator
- Awk ANTLR
- Sanitizer Antlr Scrubber
Parsing Options
Goal: specify the parser in ANTLR.
- would provide documentation
- would be more efficient and robust
- would simplify other parsing efforts [1]
- can produce different language targets (PHP, JavaScript, Java, C++, Python) for use by many tools
- can be used to migrate or translate content to a better format
- can be extended
Challenges of Parsing MediaWiki Syntax
- The set of all input is not fixed.
- External references:
  - templates
  - transclusion
  - extensions
- Command order mismatch:
  - output is a single file; input can be a recursive set of files.
  - templates, and extensions too, require out-of-order processing.
- The lexer is context-sensitive:
  - it needs to look forward, and sometimes backwards too:
    - backwards to determine a curly construct's meaning (possibly until end of file);
    - the same goes for includeonly, noinclude, comments and nowiki.
- The language is big, and its statements (magic words) can be changed externally.
- Some language statements are very similar:
  - [ in [[.|...]] can mean several things (internal link, external link, audio, picture, video, etc.).
  - { in {{{{{|}}}}} can mean several things.
  - ' in ''''x'''' can mean several things: ' + ''' or ''' + ' (see the sketch after this list).
- White space adds some complexity:
  - TOC placement
  - indentation does matter
  - single vs. multiple newlines matter too.
- Optional case sensitivity in a literal's first letter, but not in commands.
- Error recovery is important.
  - Good reporting is not.
- Poor documentation:
  - The language is not well-defined and is sparsely documented.
  - It was hacked on for ages by non-language designers.
  - The only definition is in the working code of the above hacks.
- The Translator should be fast and modular.
  - However, the current parser is very slow:
    - it would be hard to be slower;
    - extensive caching compensates for the slowness in many situations;
    - modularity and simplicity are more important.
- Content has comments and markup that can occur anywhere in the input and need to go out into the output at the proper locations.
- Multiple syntaxes for features:
  - tables
  - headers, bold, italic can be wiki- or HTML-based
- Output need not be human-editable.
- Input size can be massive, e.g. wikibooks:
  - imposes limits on the number of passes;
  - imposes limits on the viability of memoization.
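To make the apostrophe ambiguity concrete, here is a sketch of the decision a tokenizer has to make for a run of quotes; the rules below are a simplification of what MediaWiki actually does per line, not a faithful port:

<syntaxhighlight lang="java">
// Simplified illustration of the apostrophe ambiguity: a run of n quotes
// must be split into literal ', italic ('') and bold (''') markers.
public class QuoteRuns {

    /** Split a run of n apostrophes into {literal, italic, bold} counts. */
    static int[] split(int n) {
        switch (n) {
            case 2:  return new int[] { 0, 1, 0 };  // ''    -> italic
            case 3:  return new int[] { 0, 0, 1 };  // '''   -> bold
            case 4:  return new int[] { 1, 0, 1 };  // ''''  -> ' + bold
            case 5:  return new int[] { 0, 1, 1 };  // ''''' -> italic + bold
            default: return n >= 6
                ? new int[] { n - 5, 1, 1 }         // extras become literal text
                : new int[] { n, 0, 0 };            // a lone ' stays literal
        }
    }

    public static void main(String[] args) {
        int[] r = split(4); // the ''''x'''' case from the list above
        System.out.println("literal=" + r[0] + " italic=" + r[1] + " bold=" + r[2]);
    }
}
</syntaxhighlight>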
Open Questions
- What are, and what should be, the parser's
  - error handling
  - error recovery capabilities?
- Is a major move to simplify the language being considered?
  - reducing construct ambiguity
  - reducing context dependency
    - links, images, etc. in [[]]
  - simpler is not necessarily weaker.
- How does, and how should, the extension mechanism interact with the parser?
  - protect the parser from extensions' bugs
  - give extensions services
  - separate implementation
- Is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
- What is the importance of semantics in parsing MediaWiki content, as opposed to parsing just the syntax?
  - templates seem important
  - could the parser's complexity be reduced if it had access to semantic metadata?
- Scoping rules (templates, variables, references):
  - are the required variables defined already?
  - when does a definition expire?
Enhancements
- dynamic scoping of template args (sketched below)
  - let the called template see named variables defined in its parent's call
  - as above, but with name munging, like super.argname
- parser functions which evaluate
  - (mathematical) expressions within variables
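A sketch of what dynamically scoped template arguments could look like; the class, its methods, and the super. prefix handling are my illustration of the idea, not existing MediaWiki behaviour:

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Hypothetical frame for one template call: unknown names fall back to the
// parent call's arguments, and a "super." prefix skips the local frame.
public class TemplateScope {
    private final TemplateScope parent; // null for the top-level page
    private final Map<String, String> args = new HashMap<String, String>();

    public TemplateScope(TemplateScope parent) { this.parent = parent; }

    public void define(String name, String value) { args.put(name, value); }

    public String lookup(String name) {
        if (name.startsWith("super.") && parent != null) {
            return parent.lookup(name.substring("super.".length()));
        }
        if (args.containsKey(name)) return args.get(name);
        return parent != null ? parent.lookup(name) : null; // dynamic scoping
    }

    public static void main(String[] argv) {
        TemplateScope page = new TemplateScope(null);
        page.define("author", "Oren");
        TemplateScope template = new TemplateScope(page);
        System.out.println(template.lookup("author"));       // "Oren", found via the caller
        System.out.println(template.lookup("super.author")); // same, but explicitly munged
    }
}
</syntaxhighlight>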
Existing Documentation
- Preprocessor [4]
- Markup Spec [3]
- Alternative_parsers [1]
- Parser Testing script + Test Cases
- Extending_wiki_markup: parser hooks for the extension mechanism
- Category:ParserBeforeStrip: extensions that rely on the ParserBeforeStrip hook.
- Category:ParserAfterStrip: extensions that rely on the ParserAfterStrip hook.
- Category:ParserBeforeInternalParse: extensions that rely on the ParserBeforeInternalParse hook.
- Category:OutputPageBeforeHTML: extensions that rely on the OutputPageBeforeHTML hook.
- Category:ParserBeforeTidy: extensions that rely on the ParserBeforeTidy hook.
- Category:ParserAfterTidy: extensions that rely on the ParserAfterTidy hook.
Missing Specs
- Language conversion -{ }- syntax
- sanitization
- Operator precedence
- Error recovery
Tools
- Mediawiki\maintenance\tests
- Parser Playground gadget
Antlr
- (...)? optional sub-rule
- (...)=> syntactic predicate
- {...}? hoisting disambiguating semantic predicate
- {...}?=> gated semantic predicate
Java Based Parsers
- http://code.google.com/p/gwtwiki/
- http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome
- http://sweble.org/wiki/Sweble_Wikitext_Parser
The last is the most promising!
Todo
- finish the dumpHtmlHarness class:
  - add more options
  - benchmarking
  - log4j output
  - implement extension-tag loading mechanism
  - implement magic-word (localised) loading mechanism
  - input filter support
  - different parser implementations via dependency injection
- write a JUnit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt (sketched below)
- write a JUnit test which runs real page content
- get the lot into Jenkins CI
- fix one of the above parsers
- test the ANTLR version
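A sketch of the parserTests.txt runner; the !! test / !! input / !! result / !! end markers match that file's basic case format (the real file has more section types), while WikiParser and its parse() method are hypothetical stand-ins for the implementation under test:

<syntaxhighlight lang="java">
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.junit.Test;

// Hypothetical JUnit runner for MediaWiki's parser test cases. Each case
// looks like: !! test / <name> / !! input / <wikitext> / !! result / <html> / !! end
public class ParserTestsTest {

    @Test
    public void runParserTests() throws IOException {
        List<String> lines = Files.readAllLines(
                Paths.get("Mediawiki/maintenance/tests/parser/parserTests.txt"));
        String name = null;
        StringBuilder input = null, result = null, current = null;
        for (String line : lines) {
            if (line.startsWith("!! test")) {
                name = ""; input = new StringBuilder(); result = new StringBuilder(); current = null;
            } else if (line.startsWith("!! input")) {
                current = input;
            } else if (line.startsWith("!! result")) {
                current = result;
            } else if (line.startsWith("!! end")) {
                String actual = new WikiParser().parse(input.toString()); // hypothetical parser under test
                assertEquals(name, result.toString().trim(), actual.trim());
                name = null; current = null; // done with this case
            } else if (name != null && name.isEmpty()) {
                name = line.trim(); // the case name follows "!! test"
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
    }
}
</syntaxhighlight>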
References
Subpages
edit- OrenBochman/ParserNG/Preprocessor
- OrenBochman/ParserNG/Preprocessor Antlr
- OrenBochman/ParserNG/Sanitizer Antlr
- OrenBochman/ParserNG/Tests
- OrenBochman/ParserNG/Tests/Test1
- OrenBochman/ParserNG/Tests/Test2
- OrenBochman/ParserNG/Tests/Test3
- OrenBochman/ParserNG/Tests/Test4
- OrenBochman/ParserNG/Transliterator Antlr
- OrenBochman/ParserNG/WikiTable
- OrenBochman/ParserNG/antlr