User:OrenBochman/ParserNG

These pages are an attempt to document my ANTLR-based parser spec, which aims to create a new, efficient analysis chain for indexing wikisource.

  1. The ANTLR spec needs to be able to tokenize wikisource.
  2. The tokens should then be tagged.
  3. Next, the tokens can be filtered.
  4. The tokens that are not removed by the filter may be augmented with payloads.
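The four steps above can be sketched as a chain of generators, each consuming the output of the previous one. This is only an illustration in plain Python, not Lucene or ANTLR code: the `Token` class, the tiny token regex, the tag names and the `boost` payload key are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Token:
    text: str
    tag: str = "word"                             # set by the tagging stage
    payload: dict = field(default_factory=dict)   # set by the payload stage


# Hypothetical, deliberately tiny token grammar: heading markers,
# link brackets and words. Real wikitext needs far more than this.
TOKEN_RE = re.compile(r"==+|\[\[|\]\]|\w+")


def tokenize(text: str) -> Iterator[Token]:
    """Step 1: split the raw text into tokens."""
    for match in TOKEN_RE.finditer(text):
        yield Token(match.group())


def tag(tokens: Iterable[Token]) -> Iterator[Token]:
    """Step 2: tag markup tokens and words that sit inside a heading."""
    in_heading = False
    for tok in tokens:
        if tok.text.startswith("=="):
            in_heading = not in_heading
            tok.tag = "markup"
        elif tok.text in ("[[", "]]"):
            tok.tag = "markup"
        elif in_heading:
            tok.tag = "heading"
        yield tok


def drop_markup(tokens: Iterable[Token]) -> Iterator[Token]:
    """Step 3: filter out tokens that should not be indexed."""
    return (tok for tok in tokens if tok.tag != "markup")


def add_payload(tokens: Iterable[Token]) -> Iterator[Token]:
    """Step 4: attach a payload (here an assumed boost factor) to survivors."""
    for tok in tokens:
        tok.payload["boost"] = 2.0 if tok.tag == "heading" else 1.0
        yield tok


def analyse(text: str) -> List[Token]:
    """Run the whole chain; each stage consumes its predecessor's output."""
    return list(add_payload(drop_markup(tag(tokenize(text)))))
```

Because every stage only needs an iterable of tokens, stages can be added, removed or reordered without touching the others, which is the property that makes a chain attractive compared with a monolithic parser.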

Specs

  • I no longer see a need for a monolithic parser in ANTLR.
  • Lucene loves TokenStream-based analyser chains. Each step in the chain consumes its predecessor's output.
  • These ANTLR grammars do little more than document the wiki syntax while creating a parse tree.

However, ANTLR can generate PHP and JavaScript code.

Analysis Chain

The specs are planned as parts of a parser chain.

  1. preprocessor
    1. comments
    2. templates
    3. magic words
    4. parser functions
    5. core
    6. extensions
      1. maths
      2. date
    7. <include> etc
  2. tables
  3. links, images
  4. other simple syntax
    1. formatting
  5. extension tags
    1. cite.
    2. others
  • Ideally the parser should be able to transform the input into the output format.
  • For search, however, the overriding concern is to capture the search points in their context, together with any boosting factors.

Building fully functional output would require:

  1. a mechanism to resolve
    1. parser functions,
    2. magic words,
    3. extension actions (not realistic unless they can be invoked quickly from a mock PHP Doc interface),
    4. globalization of information
  2. a mechanism to resolve transclusion of
    1. templates,
    2. non-template namespace content

transliteration tables.

  1. parser specs
  2. integration with the parser specs
  3. transform the parse tree into the output tree using tree grammars.
  4. use a StringTemplate file to construct the output.

An analysis of the input, on the other hand, produces a basic parse tree.

  • Manual:Extending_wiki_markup

Current Specs

Parsing Options

Goal: specify the parser in ANTLR

  • would provide documentation
  • would be more efficient and robust.
  • would simplify other parsing efforts [1]
  • can target different languages (PHP, JavaScript, Java, C++, Python) for use by many tools
  • can be used to migrate or translate content to a better format.
  • can be extended


Challenges of Parsing MediaWiki Syntax

  1. The set of all inputs is not fixed.
  2. External references:
    • templates
    • transclusion
    • extensions
  3. Command order mismatch
    • Output is a single file; input can be a recursive set of files.
    • Templates require out-of-order processing, and so do extensions.
  4. Is the lexer context-sensitive?
  5. Need to look forwards, and sometimes backwards too.
    1. backwards to determine the meaning of a curly-brace construct (till end of file).
    2. the same goes for <includeonly>, <noinclude>, comments and <nowiki>.
  6. The language is big, and its set of statements is not fixed (magic words can be changed externally).
  7. some language statements are very similar
    • [ in [[.|...]] can mean several things (internal link, external link, audio, picture, video, etc.).
    • { in {{{{{|}}}}} can mean several things.
    • ' in ''''x'''' can mean several things: ' + ''' or ''' + '
  8. White space adds some complexity.
    • TOC placement
    • indentation matters
    • single vs. multiple newlines matter too.
  9. Optional case sensitivity in the first letter of literals, but not in commands.
  10. Error recovery is important
  11. Good reporting is not
  12. Poor documentation.
    • The language is not well-defined and is sparsely documented;
    • It was hacked on for ages, likely by people who were not language designers.
    • The only definition is in the working code of the above hacks.
  13. The Translator should be fast and modular.
    • However the current parser is very slow.
    • it would be hard to be slower
    • extensive caching compensates for slowness in many situations
    • modularity and simplicity are more important.
  14. Content has comments and markup that can occur anywhere in the input and must be emitted at the proper locations in the output.
  15. Multiple syntaxes for the same feature:
    • tables
    • headers, bold and italic can be wiki- or HTML-based
    • output need not be human editable
  16. Input size can be massive, e.g. wikibooks.
    • imposes limits on the number of passes.
    • imposes limits on the viability of memoization.

Based on:[2] and [3]
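To make the ambiguity in item 7 concrete, the sketch below enumerates every way a run of identical opening characters can be split into known opener lengths. It assumes, for illustration only, that `{{` (length 2) opens a template and `{{{` (length 3) opens a template argument; the function names are hypothetical and nothing here reproduces the rules the real parser uses to pick one reading.

```python
def compositions(n, parts):
    """Return every ordered way to write n as a sum of values from parts."""
    if n == 0:
        return [()]
    found = []
    for part in parts:
        if part <= n:
            for rest in compositions(n - part, parts):
                found.append((part,) + rest)
    return found


def render(char, split):
    """Show one reading of a run, e.g. render('{', (2, 3)) gives '{{ {{{'."""
    return " ".join(char * size for size in split)


# Assumed opener lengths: 2 = template, 3 = template argument.
BRACE_OPENERS = (2, 3)

if __name__ == "__main__":
    for split in compositions(5, BRACE_OPENERS):
        print(render("{", split))
```

For a run of five braces this yields two readings, `{{ {{{` and `{{{ {{`, and the lexer cannot choose between them without looking at the matching closers further along in the input. The same function applied to apostrophes with lengths 1 and 3 yields the `' + '''` and `''' + '` readings mentioned above, plus the all-literal one.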

Open Questions

  1. What are, and what should be, the parser's
    • error handling.
    • error recovery capability.
  2. Is a major move to simplify the language being considered?
    • reducing construct ambiguity.
    • reducing context dependency.
      • Links, images etc in [[]]
    • simple is not necessarily weaker.
  3. How does/should the extension mechanism interact with the parser?
    • protect the parser from extensions' bugs.
    • give extensions services.
    • separate implementation.
  4. Is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
  5. What is the importance of semantics in parsing MediaWiki content, as opposed to parsing just the syntax?
  6. Templates seem important.
  7. Can the parser's complexity be reduced if it had access to semantic metadata?
  8. Scoping rules (templates, variables, references)
    • are the required variables already defined?
    • when does a definition expire?

Enhancements

  1. dynamic scoping of template args
    1. let the called template see named variables defined in its parent's call
    2. as above, but with name munging like super.argname
  2. parser functions which evaluate
    1. (mathematical) expressions within variables.
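The dynamic-scoping idea could be prototyped with a chain of argument frames, one per template call. The following is a minimal sketch, assuming a `super.` prefix means "skip my own frame and ask my callers"; `call_template` and `lookup` are hypothetical names, and no such feature exists in the current parser.

```python
from collections import ChainMap


def call_template(args, parent=None):
    """Build the scope a called template sees.

    The callee's own arguments shadow the caller's, but anything the
    callee does not define falls through to the parent call.
    """
    if parent is None:
        return ChainMap(dict(args))
    return parent.new_child(dict(args))


def lookup(scope, name, default=None):
    """Resolve an argument name, honouring the assumed 'super.' prefix."""
    if name.startswith("super."):
        # Skip the current frame and search only the enclosing calls.
        return scope.parents.get(name[len("super."):], default)
    return scope.get(name, default)


if __name__ == "__main__":
    outer = call_template({"title": "Main", "lang": "en"})
    inner = call_template({"title": "Sub"}, parent=outer)
    print(lookup(inner, "title"))        # own argument
    print(lookup(inner, "lang"))         # inherited from the parent call
    print(lookup(inner, "super.title"))  # parent's value, bypassing the shadow
```

Returning a default for unknown names mirrors how a template argument can carry a fallback value, and it keeps a missing `super.` lookup from raising at the outermost call.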

Existing Documentation

  • Preprocessor [4]
  • Markup Spec [3]
  • Alternative_parsers [1]
  • Parser Testing script + Test Cases
  • Extending_wiki_markup Parser hooks for the extension mechanism

Missing Specs

  • Language conversion -{ }- syntax
  • sanitization
  • Operator precedence
  • Error recovery

Tools

  • Mediawiki\maintenance\tests
  • Parser Playground gadget

ANTLR

  • How to remove global backtracking from your grammar [5]
  • Lookahead analysis [6]
  • (...)? optional sub-rule
  • (...)=> syntactic predicate
  • {...}? hoisting disambiguating semantic predicate
  • {...}?=> gated semantic predicate
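The `(...)=>` syntactic predicate listed above lets a parser try a fragment speculatively and commit to an alternative only if the fragment matches. The hand-written sketch below imitates that idea outside ANTLR: `looks_like_link` inspects the input without consuming it, and `parse_inline` commits to the link alternative only when the predicate succeeds. The rule that a link must close with `]]` before the next newline is a simplifying assumption for the example, and both function names are hypothetical.

```python
def looks_like_link(text, pos):
    """Predicate: does text[pos:] hold '[[' ... ']]' on one line?

    Consumes nothing; the caller's position is left untouched.
    """
    if not text.startswith("[[", pos):
        return False
    end = text.find("]]", pos + 2)
    newline = text.find("\n", pos + 2)
    return end != -1 and (newline == -1 or end < newline)


def parse_inline(text):
    """Split text into ('link', target) and ('text', literal) pieces."""
    out, pos = [], 0
    while pos < len(text):
        if looks_like_link(text, pos):
            # The predicate succeeded, so commit to the link alternative.
            end = text.find("]]", pos + 2)
            out.append(("link", text[pos + 2:end]))
            pos = end + 2
        else:
            # Fall back to literal text, merging adjacent characters.
            if out and out[-1][0] == "text":
                out[-1] = ("text", out[-1][1] + text[pos])
            else:
                out.append(("text", text[pos]))
            pos += 1
    return out
```

An unclosed `[[` is therefore emitted as plain text instead of triggering an error, which is the kind of recovery behaviour a wikitext parser needs.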

Java Based Parsers

the last is the most promising!

Todo

  1. finish the dumpHtmlHarness class.
  2. add more options.
    1. benchmarking.
    2. log4j output.
    3. implement extension tag loading mechanism.
    4. implement magic word (localised) loading mechanism.
    5. input filter support.
    6. different parser implementation via dependency injection
  3. write a JUnit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt
  4. write a JUnit test which runs real page content.
  5. get the lot into Jenkins CI.
  6. fix one of the above parsers
  • test the ANTLR version.
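Before a test harness can run the cases in parserTests.txt it has to split the file into individual cases. Here is a minimal reader sketch, assuming each case is a block that starts with a `!! test` line, ends with `!! end`, and contains named sections such as `!! input` and `!! result` in between. The function name is hypothetical and the reader makes no attempt to interpret options or article fixtures.

```python
def read_parser_tests(lines):
    """Yield one dict per '!! test' ... '!! end' block.

    Keys are the section names found after '!!' (lower-cased);
    values are the section bodies joined with newlines.
    Anything outside a test block is ignored.
    """
    case, section = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("!!"):
            keyword = line[2:].strip().lower()
            if keyword == "test":
                case, section = {"test": []}, "test"
            elif keyword == "end":
                if case is not None:
                    yield {name: "\n".join(body) for name, body in case.items()}
                case, section = None, None
            elif case is not None:
                # Any other marker inside a test opens a new named section.
                section = keyword
                case[section] = []
        elif case is not None and section is not None:
            case[section].append(line)
```

Each yielded dict can then be fed to whichever parser implementation is under test, comparing its output for the `input` section against the `result` section. Section names are passed through unchanged, so a file that uses different names only requires the caller to look up different keys.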

References

Subpages