User:OrenBochman/ParserNG
These pages are an attempt at documenting my w:ANTLR-based parser spec for creating a new, efficient analysis chain for indexing wikisource.
1. The ANTLR spec needs to be able to tokenize wikisource.
2. The tokens should then be tagged.
3. Next, the tokens can be filtered.
4. Tokens that are not removed by the filter may be augmented with payloads (sketched below).
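A minimal sketch of the four steps as a Lucene analyzer, assuming recent Lucene analysis APIs; only the Lucene base classes are real, and every Wiki* class name is a hypothetical placeholder:

<syntaxhighlight lang="java">
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical analyzer wiring the four steps together:
// tokenize -> tag -> filter -> attach payloads.
public class WikiSourceAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new WikiSourceTokenizer();   // step 1: tokenize wikisource (hypothetical)
        TokenStream chain = new WikiTagFilter(source);  // step 2: tag the tokens (hypothetical)
        chain = new WikiMarkupStopFilter(chain);        // step 3: filter out unwanted tokens (hypothetical)
        chain = new WikiPayloadFilter(chain);           // step 4: augment survivors with payloads (hypothetical)
        return new TokenStreamComponents(source, chain);
    }
}
</syntaxhighlight>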
Specs
- I no longer see a need for a monolithic parser in ANTLR.
- Lucene loves TokenStream-based analysis chains: each step in the chain consumes its predecessor's output (see the bridge sketch below).
- These ANTLR grammars do little more than document the wiki syntax while creating a parse tree.
However, ANTLR can generate PHP and JavaScript code.
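One way ANTLR and Lucene could meet is to wrap an ANTLR-generated lexer in a Lucene Tokenizer. A sketch assuming Lucene 3.x-era and ANTLR 3 runtime APIs; WikiLexer is a hypothetical grammar-generated class, the bridge itself is not an existing API:

<syntaxhighlight lang="java">
import java.io.IOException;
import java.io.Reader;
import org.antlr.runtime.ANTLRReaderStream;
import org.antlr.runtime.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical bridge: expose an ANTLR-generated lexer as a Lucene Tokenizer.
public class AntlrWikiTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private final WikiLexer lexer; // generated by ANTLR from the wikitext grammar (hypothetical)

    public AntlrWikiTokenizer(Reader reader) throws IOException {
        super(reader);
        lexer = new WikiLexer(new ANTLRReaderStream(reader));
    }

    @Override
    public boolean incrementToken() {
        Token t = lexer.nextToken();
        if (t.getType() == Token.EOF) return false;
        clearAttributes();
        termAtt.setEmpty().append(t.getText());
        typeAtt.setType(String.valueOf(t.getType())); // tag the token with its grammar type
        return true;
    }
}
</syntaxhighlight>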
Analysis Chain
The specs are planned as parts of a parser chain:
- preprocessor
  - comments
  - templates
  - magic words
  - parser functions
- core
  - extensions
    - maths
    - date
    - <include> etc.
  - tables
  - links, images
  - other simple syntax
  - formatting
- extension tags
  - cite
  - others
- Ideally, the parser should be able to transform the input to an output format.
- For search, however, the overriding concern is to capture the search points in their context, together with any boosting factors.
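On the boosting-factor point, a sketch of a filter that stores a per-token boost as a Lucene payload, assuming Lucene 4+ attribute APIs; the idea that heading text should carry extra weight, the HEADING token type, and the class name are all my own illustration:

<syntaxhighlight lang="java">
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: tokens tagged as coming from a heading get a boost
// payload, so the search points carry their context's weight into the index.
public final class HeadingBoostFilter extends TokenFilter {
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public HeadingBoostFilter(TokenStream input) { super(input); }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        if ("HEADING".equals(typeAtt.type())) { // token type assigned upstream (hypothetical)
            payloadAtt.setPayload(new BytesRef(new byte[] { 4 })); // boost factor, decoded at query time
        }
        return true;
    }
}
</syntaxhighlight>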
To build fully functional output would necessitate:
- a mechanism to resolve
  - parser functions,
  - magic words,
  - extension actions (not realistic), unless they can be invoked fast from a mock PHP Doc interface,
  - globalization of information;
- a mechanism to resolve transclusion of
  - templates,
  - non-template namespace content,
  - transliteration tables;
- parser specs:
  - integration with the parser specs;
- transforming the parse tree into the output tree using tree grammars;
- using a StringTemplate file to construct the output (sketched below).
An analysis of the input, on the other hand, produces a basic parse tree.
- [[Manual:Extending_wiki_markup]]
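A sketch of the StringTemplate step using the ST4 library's actual API; the template text and attribute names are made up for illustration, and a real spec would keep its templates in a .stg group file per output format:

<syntaxhighlight lang="java">
import org.stringtemplate.v4.ST;

public class LinkRenderer {
    // Render an internal-link parse-tree node to HTML via a template.
    public static String renderLink(String target, String label) {
        ST st = new ST("<a href=\"/wiki/$target$\">$label$</a>", '$', '$');
        st.add("target", target);
        st.add("label", label);
        return st.render();
    }

    public static void main(String[] args) {
        // prints: <a href="/wiki/Main_Page">the main page</a>
        System.out.println(renderLink("Main_Page", "the main page"));
    }
}
</syntaxhighlight>

Swapping in a different template file would retarget the same tree walk to, say, plain text for indexing.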
Current Specs
- WikiTable
- Preprocessor
- Translator
- Awk ANTLR
- Sanitizer Antlr Scrubber
Parsing Options
Goal: specify the parser in ANTLR.
- would provide documentation
- would be more efficient and robust
- would simplify other parsing efforts [1]
- can produce different language targets (PHP, JavaScript, Java, C++, Python) for use by many tools
- can be used to migrate or translate content to a better format
- can be extended
Challenges of Parsing MediaWiki Syntax
- The set of all input is not fixed.
- External references:
  - templates
  - transclusion
  - extensions
- Command order mismatch:
  - output is a single file; input can be a recursive set of files.
  - templates, and extensions too, require out-of-order processing.
- The lexer is context-sensitive:
  - it needs to look forward, and sometimes backwards too:
    - backwards to determine a curly construct's meaning (possibly until end of file);
    - the same goes for includeonly, noinclude, comments and nowiki.
- The language is big, and its statements (magic words) can be changed externally.
- Some language statements are very similar:
  - [ in [[.|...]] can mean several things (internal link, external link, audio, picture, video, etc.).
  - { in {{{{{|}}}}} can mean several things.
  - ' in ''''x'''' can mean several things: ' + ''' or ''' + ' (see the sketch after this list).
- White space adds some complexity:
  - TOC placement
  - indentation does matter
  - single vs. multiple newlines matter too.
- Optional case sensitivity in a literal's first letter, but not in commands.
- Error recovery is important.
  - Good reporting is not.
- Poor documentation:
  - The language is not well-defined and is sparsely documented.
  - It was hacked on for ages by non-language designers.
  - The only definition is in the working code of the above hacks.
- The Translator should be fast and modular.
  - However, the current parser is very slow:
    - it would be hard to be slower;
    - extensive caching compensates for the slowness in many situations;
    - modularity and simplicity are more important.
- Content has comments and markup that can occur anywhere in the input and need to go out into the output at the proper locations.
- Multiple syntaxes for features:
  - tables
  - headers, bold, italic can be wiki- or HTML-based
- Output need not be human-editable.
- Input size can be massive, e.g. wikibooks:
  - imposes limits on the number of passes;
  - imposes limits on the viability of memoization.
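To make the apostrophe ambiguity concrete, here is a sketch of the decision a tokenizer has to make for a run of quotes; the rules below are a simplification of what MediaWiki actually does per line, not a faithful port:

<syntaxhighlight lang="java">
// Simplified illustration of the apostrophe ambiguity: a run of n quotes
// must be split into literal ', italic ('') and bold (''') markers.
public class QuoteRuns {

    /** Split a run of n apostrophes into {literal, italic, bold} counts. */
    static int[] split(int n) {
        switch (n) {
            case 2:  return new int[] { 0, 1, 0 };  // ''    -> italic
            case 3:  return new int[] { 0, 0, 1 };  // '''   -> bold
            case 4:  return new int[] { 1, 0, 1 };  // ''''  -> ' + bold
            case 5:  return new int[] { 0, 1, 1 };  // ''''' -> italic + bold
            default: return n >= 6
                ? new int[] { n - 5, 1, 1 }         // extras become literal text
                : new int[] { n, 0, 0 };            // a lone ' stays literal
        }
    }

    public static void main(String[] args) {
        int[] r = split(4); // the ''''x'''' case from the list above
        System.out.println("literal=" + r[0] + " italic=" + r[1] + " bold=" + r[2]);
    }
}
</syntaxhighlight>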
Open Questions
- What are, and what should be, the parser's
  - error handling
  - error recovery capabilities?
- Is a major move to simplify the language being considered?
  - reducing construct ambiguity
  - reducing context dependency
    - links, images, etc. in [[]]
  - simpler is not necessarily weaker.
- How does, and how should, the extension mechanism interact with the parser?
  - protect the parser from extensions' bugs
  - give extensions services
  - separate implementation
- Is the ANTLR backend for PHP or JavaScript good enough to generate the parser with?
- What is the importance of semantics in parsing MediaWiki content, as opposed to parsing just the syntax?
  - templates seem important
  - could the parser's complexity be reduced if it had access to semantic metadata?
- Scoping rules (templates, variables, references):
  - are the required variables defined already?
  - when does a definition expire?
Enhancements
- dynamic scoping of template args (sketched below)
  - let the called template see named variables defined in its parent's call
  - as above, but with name munging, like super.argname
- parser functions which evaluate
  - (mathematical) expressions within variables
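A sketch of what dynamically scoped template arguments could look like; the class, its methods, and the super. prefix handling are my illustration of the idea, not existing MediaWiki behaviour:

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.Map;

// Hypothetical frame for one template call: unknown names fall back to the
// parent call's arguments, and a "super." prefix skips the local frame.
public class TemplateScope {
    private final TemplateScope parent; // null for the top-level page
    private final Map<String, String> args = new HashMap<String, String>();

    public TemplateScope(TemplateScope parent) { this.parent = parent; }

    public void define(String name, String value) { args.put(name, value); }

    public String lookup(String name) {
        if (name.startsWith("super.") && parent != null) {
            return parent.lookup(name.substring("super.".length()));
        }
        if (args.containsKey(name)) return args.get(name);
        return parent != null ? parent.lookup(name) : null; // dynamic scoping
    }

    public static void main(String[] argv) {
        TemplateScope page = new TemplateScope(null);
        page.define("author", "Oren");
        TemplateScope template = new TemplateScope(page);
        System.out.println(template.lookup("author"));       // "Oren", found via the caller
        System.out.println(template.lookup("super.author")); // same, but explicitly munged
    }
}
</syntaxhighlight>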
Existing Documentation
- Preprocessor [4]
- Markup Spec [3]
- Alternative_parsers [1]
- Parser Testing script + Test Cases
- Extending_wiki_markup: parser hooks for the extension mechanism
- Category:ParserBeforeStrip: extensions that rely on the ParserBeforeStrip hook.
- Category:ParserAfterStrip: extensions that rely on the ParserAfterStrip hook.
- Category:ParserBeforeInternalParse: extensions that rely on the ParserBeforeInternalParse hook.
- Category:OutputPageBeforeHTML: extensions that rely on the OutputPageBeforeHTML hook.
- Category:ParserBeforeTidy: extensions that rely on the ParserBeforeTidy hook.
- Category:ParserAfterTidy: extensions that rely on the ParserAfterTidy hook.
Missing Specs
- Language conversion -{ }- syntax
- sanitization
- Operator precedence
- Error recovery
Tools
- Mediawiki\maintenance\tests
- Parser Playground gadget
Antlr
- (...)? optional sub-rule
- (...)=> syntactic predicate
- {...}? hoisting disambiguating semantic predicate
- {...}?=> gated semantic predicate
Java Based Parsers
- http://code.google.com/p/gwtwiki/
- http://rendering.xwiki.org/xwiki/bin/view/Main/WebHome
- http://sweble.org/wiki/Sweble_Wikitext_Parser
The last is the most promising!
Todo
- finish the dumpHtmlHarness class:
  - add more options
  - benchmarking
  - log4j output
  - implement extension-tag loading mechanism
  - implement magic-word (localised) loading mechanism
  - input filter support
  - different parser implementations via dependency injection
- write a JUnit test which runs the tests in Mediawiki\maintenance\tests\parser\parserTests.txt (sketched below)
- write a JUnit test which runs real page content
- get the lot into Jenkins CI
- fix one of the above parsers
- test the ANTLR version
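A sketch of the parserTests.txt runner; the !! test / !! input / !! result / !! end markers match that file's basic case format (the real file has more section types), while WikiParser and its parse() method are hypothetical stand-ins for the implementation under test:

<syntaxhighlight lang="java">
import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import org.junit.Test;

// Hypothetical JUnit runner for MediaWiki's parser test cases. Each case
// looks like: !! test / <name> / !! input / <wikitext> / !! result / <html> / !! end
public class ParserTestsTest {

    @Test
    public void runParserTests() throws IOException {
        List<String> lines = Files.readAllLines(
                Paths.get("Mediawiki/maintenance/tests/parser/parserTests.txt"));
        String name = null;
        StringBuilder input = null, result = null, current = null;
        for (String line : lines) {
            if (line.startsWith("!! test")) {
                name = ""; input = new StringBuilder(); result = new StringBuilder(); current = null;
            } else if (line.startsWith("!! input")) {
                current = input;
            } else if (line.startsWith("!! result")) {
                current = result;
            } else if (line.startsWith("!! end")) {
                String actual = new WikiParser().parse(input.toString()); // hypothetical parser under test
                assertEquals(name, result.toString().trim(), actual.trim());
                name = null; current = null; // done with this case
            } else if (name != null && name.isEmpty()) {
                name = line.trim(); // the case name follows "!! test"
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
    }
}
</syntaxhighlight>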
References
Subpages
edit- OrenBochman/ParserNG/Preprocessor
- OrenBochman/ParserNG/Preprocessor Antlr
- OrenBochman/ParserNG/Sanitizer Antlr
- OrenBochman/ParserNG/Tests
- OrenBochman/ParserNG/Tests/Test1
- OrenBochman/ParserNG/Tests/Test2
- OrenBochman/ParserNG/Tests/Test3
- OrenBochman/ParserNG/Tests/Test4
- OrenBochman/ParserNG/Transliterator Antlr
- OrenBochman/ParserNG/WikiTable
- OrenBochman/ParserNG/antlr