Parsoid/Roundtrip testpages
Kinds of RT problems
- Wikitext escaping diffs: Many pages show diffs caused by wikitext escaping of text in brackets ([..]). This part of the wikitext escaping code needs fixing.
- Whitespace/quote diffs for ref and other extension tags: Lots of pages have whitespace/quote diffs around extension tags (<ref name='foo'/> vs <ref name='foo' />, <source lang='javascript' /> vs <source lang="javascript"/>)
- Template diffs: Missing template output / duplicate template output. Most of these are probably because of incorrect DSR (DOM wikitext source range) information.
- Unbalanced quote diffs: Some wiki pages don't close quotes, and Parsoid adds the missing balancing quotes, which shows up as RT diffs.
- Table wikitext diffs: Missing "-" or missing "|" chars or some such -- maybe a serializer or html/missing-rt-data issue -- to be investigated.
- Lists/lists in tables diffs: List rt diffs or lists in table rt diffs (mostly involving dl-dt-dd lists)
- Unbalanced tag diffs: On several pages, there are unbalanced opening and closing tags that obviously don't RT correctly -- need workarounds/fixes/hacks/or wont-fix.
- Wikitext "syntax errors": Pages where the wikitext syntax is erroneous in the surrounding context and leads to differences in parsing and roundtripping -- unbalanced tags above are one special case of this broader category.
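For the whitespace/quote category above, one way a test runner could separate cosmetic extension-tag diffs from semantic ones is to normalize tags before comparing. The sketch below is only an illustration in Python, not Parsoid code: normalize_tags and is_cosmetic_tag_diff are hypothetical names, and the regex assumes simple attribute syntax (an attribute value that itself contains a double quote is not handled).

```python
import re

# One attribute: name = "value" | 'value' | bare value.
ATTR_RE = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>/]+))""")
# Same pattern without capture groups, for use inside the tag pattern.
ATTR_NC = r"""[\w-]+\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>/]+)"""
# An opening or self-closing tag: <name attrs... /> (closing tags are left alone).
TAG_RE = re.compile(r"<(\w+)((?:\s+" + ATTR_NC + r")*)\s*(/?)>")


def normalize_tags(wikitext):
    """Rewrite tags to one canonical spelling: double quotes, ' />' for self-closing."""
    def repl(match):
        name, attrs, slash = match.group(1), match.group(2), match.group(3)
        parts = []
        for attr in ATTR_RE.finditer(attrs):
            # Exactly one of the three value groups is set for each attribute.
            value = next(g for g in attr.groups()[1:] if g is not None)
            parts.append('%s="%s"' % (attr.group(1), value))
        out = "<" + name
        if parts:
            out += " " + " ".join(parts)
        return out + (" />" if slash else ">")
    return TAG_RE.sub(repl, wikitext)


def is_cosmetic_tag_diff(original, roundtripped):
    """True when two strings differ only in tag whitespace or quote style."""
    return original != roundtripped and normalize_tags(original) == normalize_tags(roundtripped)
```

With this sketch, the two example pairs from the list (<ref name='foo'/> vs <ref name='foo' />, and the two <source> spellings) would be reported as cosmetic, while a changed attribute value would not.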
Zero diffs
- Medha_Patkar
- John_McCain
- Political_science
- Hindu_reform_movements
- PHP
- Middle_Way
- Substitution Cipher
Extension (ref, source, code, etc.) whitespace and quote diffs
Crashes
When I rescue errors and continue, the real problem emerges -- the parsed output for this page is mangled. The output from templates is interleaved rather than being serialized in order. This breaks the template encapsulation code -- something is up with the async pipeline that needs fixing.
- Help:Templates -- parser crashes in template encapsulation code.
- ApocalyPS3 - Cannot call "removeChild" of null at deleteNode (/data/project/parsoid/js/lib/mediawiki.DOMPostProcessor.js:101:15)
- P-S-3 - Same as above
- Book:Bromine - RangeError: Maximum call stack size exceeded - Doesn't seem to happen on the web service, maybe an artifact of the round-trip test runner having a lot of callbacks?
- Gordan Kožulj - TypeError: Cannot read property 'nextSibling' of undefined
Several template diffs
Nothing major right now.
Fixed
- Anexo:Monumentos_Históricos_de_Panamá -- lots of template diffs.
- Hayasdan -- a couple of significant template diffs.
- Buddha
- Anna_Hazare -- " {{cite ... }}" is the reduced test case that causes most of the RT problems on this page because of nested pre-tags showing up in the token stream. Yet to fix.
- Adi_Shankara -- some newline/whitespace issues in one rt-ed cite template use. Yet to investigate.
- Theravada -- now mostly whitespace, nowiki, and mismatched quote diffs.
- Dependent_origination
- Yoga
- Vijayanagar Empire -- there is one notable diff, which is a stray </s> that is rt-ed as </s
- K-1 Rising 2002 -- template / table interaction
Lots of Whitespace/quote diffs
Unbalanced quote diffs
- Theravada
- Emma Cairns
- Gondi bank -- here and on many other pages, quotes are nested improperly across ref-tags. Ex: This is an example of a ''quotes <ref> and ref tags '' overlapping </ref> with each other. This then parses and round-trips improperly. Fixed by moving cite processing earlier in the stage 3 pipeline.
Table wikitext diffs
- Dont_Ask, Dont_Tell Repeal Act of 2010 -- templates used to set td attributes are not RT-ing.
- Nagarjuna -- scroll down to find the table char diffs.
- Enigma Machine
- Simpsons dvd sets -- replaces empty lines with a td-wikitext char (|) -- probably either a parse or serializer error.
Test cases
1. ! chars in links in table cells
This example does not parse and RT as expected.
{|
| [[Foo!! bar]]
|}
2. Table lines starting with a leading space
<-- The 3rd line has a leading space --> {| |- |a||b |}
Another one where the ! marks don't parse properly. !h1!!h2 gets parsed as a string in a td.
<!-- all lines below have a leading space -->
 {|
 |-
 !h1!!h2
 |foo||bar
 |}
3. Unclosed attributes (missing " char) and | char in table cells
This example does not parse and RT as expected.
{|
| style="text-align:center; {{Party shading/Republican}}|'''[[United States presidential election, 2008|2008]]'''
| style="text-align:center; {{Party shading/Republican}}|'''53.1%''' ''4,523''
| style="text-align:center; {{Party shading/Democratic}}|43.9% ''3,743''
| style="text-align:center; background:honeyDew;"|2.9% ''243''
|}
Fixed in https://gerrit.wikimedia.org/r/#/c/45919/ and https://gerrit.wikimedia.org/r/#/c/45699/
4. Table cells separated by "|" instead of "||". Unsure if this should be fixed at all, since there is no clean fix for it. This may just have to be considered bad wikitext that cannot be RT-ed as coded.
{|
| foo bar foo | baz
|}
The 'foo bar foo' in this example gets parsed as 3 attributes of the td and the second 'foo' gets dropped as a duplicate.
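The dropped attribute can be reproduced with a toy model of a cell parser. The Python sketch below is an assumption about the general shape of the behaviour (split on the first "|", treat the left side as attributes, keep the first occurrence of each attribute name); parse_cell and serialize_cell are hypothetical names and not Parsoid functions.

```python
def parse_cell(line):
    """Toy model of a table-cell line of the form '| attrs | content'."""
    body = line.lstrip()[1:]              # drop the leading '|'
    if "|" not in body:
        return {}, body.strip()           # no attribute section at all
    attr_text, content = body.split("|", 1)
    attrs = {}
    for token in attr_text.split():
        name, _, value = token.partition("=")
        # First occurrence wins, so a repeated name is silently dropped.
        attrs.setdefault(name, value)
    return attrs, content.strip()


def serialize_cell(attrs, content):
    """Write the toy cell back out as wikitext."""
    if not attrs:
        return "| " + content
    attr_text = " ".join(k if v == "" else "%s=%s" % (k, v) for k, v in attrs.items())
    return "| %s | %s" % (attr_text, content)
```

Feeding "| foo bar foo | baz" through parse_cell and then serialize_cell gives "| foo bar | baz" in this model: the information about the second 'foo' is already gone after parsing, which is why the serializer alone cannot restore it.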
Lists / lists-in-tables RT diffs
The combination of lists in tables and tables in lists has now been dealt with adequately, and these diffs are no longer present.
Fixed
- Romance languages -- lots of dl-dt list rt diffs
- noticeboard/IncidentArchive541 WP Admin IncidentArchives541 -- lots of list rt diffs
- User page -- lists in indented tables rt diffs (::{|..)
Unbalanced tag diffs
- Complete games -- uses a <li> tag at the start of a #-wikilist but gets rt-ed with an extra </li> tag as expected. This seems related to the bug Roan reported.
Fixed
Idea: If detectable, we could add a flag on automatically-closed tags so those tags can be skipped on RT. But it is unclear whether we can detect this, since the treebuilder closes the tags and we cannot pass along attribute information on closing tags. Implemented in git SHA 051bf97b.
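The idea can be illustrated with a minimal Python sketch. It is not the code from the commit above: the Element class and the auto_closed attribute are hypothetical, and the sketch simply assumes that the flag can be stored on the element itself, since closing tags cannot carry attributes.

```python
class Element:
    """Tiny stand-in for a DOM element (hypothetical, for illustration only)."""

    def __init__(self, tag, children=None, auto_closed=False):
        self.tag = tag
        self.children = children or []
        # Assumed to be set when the tree builder, not the author, closed the tag.
        self.auto_closed = auto_closed


def serialize(node):
    """Serialize back to markup, skipping end tags that were auto-inserted."""
    if isinstance(node, str):
        return node
    inner = "".join(serialize(child) for child in node.children)
    end = "" if node.auto_closed else "</%s>" % node.tag
    return "<%s>%s%s" % (node.tag, inner, end)
```

In this model, an unclosed <small> in the source round-trips as unclosed when the element carries the flag, while an element without the flag still gets its explicit end tag.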
Karate Coyote unbalanced/incorrect closing-br-tag diffs.2002 Australian Formula 3 season -- uses <hiddentext> but an incorrect closing tag <\hiddentext> instead of </hiddentext> -- either because of this or even otherwise, this extension text gets fostered out of the table header and introduces a rt diff.Mallor -- several unclosed <li> tags.SC Canada Opinions lots of unbalanced <small> tags in table cells (opened but not closed). This diff is present in other pages as well. Nova Scotia Lt. Govs. list -- both canadian govt. page lists (clearly same editor is involved in both :-)). These two pages have the most semantic diffs in our RT testing.Pitt Panthers football -- lot of unclosed <font> tags generates RT diffs since Parsoid closes unclosed tags.Indonesia's Got Talent -- lot of unclosed <center> tags generates RT diffs since Parsoid closes unclosed tags.
Wikitext "syntax errors"
- List of scooter manufacturers -- in the table listing the scooter manufacturers, table cell content is added after table row wikitext (|-) instead of table data wikitext (|). The wikitext in question is this: |-[[Gorilla Motor Works]] || China ||. The PHP parser ignores this content (see http://en.wikipedia.org/wiki/List_of_scooter_manufacturers). Parsoid moves the content to the beginning of the table rather than ignoring it. When roundtripped, this content shows up in the wrong place. A possible Parsoid fix would be to keep fostering content that shows up in a table row rather than swallowing it (having it fostered lets the editor recognize the error in the output and fix it), introduce a placeholder tag to rt the fostered content in place, and add a marker on the fostered content so it is suppressed during rt-ing. Or, maybe just not bother about these kinds of errors.
- Food Science Australia -- similar bug as in the page above.
- Chicago's Northwest Side -- bare lists in tables get fostered out of the table (same in php) which doesn't RT back to original wikitext.
- List of Philippine radio stations by province (AM) -- bad <th> code
- Northwest Side Chicago's Northwest Side -- lists inside tables that get fostered out of the table and won't RT as expected.
Wikitext escaping diffs
Nothing here at this time.
Fixed
Math-related pages use a lot of braces, and because brace pairs ({{ or }}) are unconditionally nowiki-escaped right now, several of these pages have many semantic/syntactic diffs.
The wikitext escaping code is due for an overhaul -- it escapes in several cases where escaping is not needed. For example, right now, the escaper always escapes content in "[" and "]". The specifics of how that content is escaped vary (sometimes everything is escaped; other times, only the right bracket is escaped).
[1] -- Parserfunction-generated external link target in bracketed external link, closing bracket nowiki-escaped.
Other RT diffs
- JRuby
- Stomach cutting -- notable template rt diff
- WP page -- lots of missing newlines, among other whitespace and nowiki diffs.
- Difference in error recovery between Parsoid and the PHP parser when extension and template tags nest improperly. This affects parse output and RT-ing on the page Milky way galaxy. Simplified test case:
{{echo|blah <ref name=foo>blah {{echo|blah}} blah}}</ref> blah}}
- Use of {{!}} in non-table contexts (Example snippet from en:Death of the soviet union)
[[Image:Edgar Savisaar 2005.jpg{{!}}100px|left|thumb|this is a caption]]
Performance issues
- Sanskrit -- takes a long time (~80 sec) to parse.
- Anexo:Monumentos_Históricos_de_Panamá -- takes ~80-85 sec to parse.