Parsoid/MediaWiki DOM spec/Language conversion blocks

Status: implemented in Parsoid MediaWiki DOM Spec 1.5.0.

See bug 41716. Also see Writing_systems/Syntax and en:User:cscott/LanguageConversion. Implementation matches #Alternative 2 below.

Alternative 1

Basically as described in bug 41716#c37. Render the default variant according to the fallback chain for output-producing rules.

foo-{bar baz}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw='{"disabled":true}'>bar baz</span> quux
</p>

foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw='{"text":{"zh-cn": "blog", "zh-hk": "WEBJOURNAL", "zh-tw": "WEBLOG"}}'>WEBJOURNAL</span> quux
</p>

foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw='{"text":{"": "blog, WEBJOURNAL, WEBLOG"},"target":["zh","zh-hans","zh-hant"]}'>blog, WEBJOURNAL, WEBLOG</span> quux
</p>
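The "render the default variant according to the fallback chain" step could be sketched as follows. This is a minimal illustration, not Parsoid's implementation: the function name pick_variant_text and the example chain are assumptions made for this sketch, and the real fallback chain would have to come from the language converter.

```python
def pick_variant_text(texts, fallback_chain):
    """Return the text of the first variant in fallback_chain found in texts.

    `texts` mirrors the "text" object stored in data-mw above.
    """
    for variant in fallback_chain:
        if variant in texts:
            return texts[variant]
    # Nothing matched: use untargeted text (stored under "") if there is any.
    return texts.get("")


# Hypothetical chain for a wiki whose default variant is zh-hk.
chain = ["zh-hk", "zh-hant", "zh-tw", "zh"]
texts = {"zh-cn": "blog", "zh-hk": "WEBJOURNAL", "zh-tw": "WEBLOG"}
print(pick_variant_text(texts, chain))  # WEBJOURNAL
```

With this chain the sketch selects WEBJOURNAL, matching the span content shown in the second example.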

Alternative 1b

Same basic idea as Alternative 1, but using more specific typeof attributes. Information that is redundant with the content of the <span> is not stored in data-mw, which helps to make WTS more predictable.

foo-{bar baz}- quux

<p>
  foo<span typeof="mw:LanguageVariant/inline" data-mw='{"text":{"*": "bar baz"}}'>bar baz</span> quux
</p>

foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

<p>
  foo<span typeof="mw:LanguageVariant/inline" data-mw='{"text":{"zh-cn": "blog", "zh-hk": {"default":true}, "zh-tw": "WEBLOG"}}'>WEBJOURNAL</span> quux
</p>

foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

<p>
  foo<span typeof="mw:LanguageVariant/filter" data-mw='{"lang":["zh","zh-hans","zh-hant"]}'>blog, WEBJOURNAL, WEBLOG</span> quux
</p>

For unsupported conversion blocks we use mw:LanguageVariant/unknown: -{T|zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}-

<p>
  <span typeof="mw:LanguageVariant/unknown" data-mw='{"src":"T|zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;"}'></span>
</p>
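Because the variant marked {"default":true} is not repeated in data-mw, a consumer has to recombine data-mw with the span content to recover the full set of variant texts. A minimal sketch of that step, assuming the markup shown above (the helper name expand_text_map is made up for this example):

```python
import json


def expand_text_map(data_mw_json, span_content):
    """Rebuild the full variant -> text map for an Alternative 1b inline span."""
    data = json.loads(data_mw_json)
    expanded = {}
    for variant, value in data.get("text", {}).items():
        if isinstance(value, dict) and value.get("default"):
            # The rendered variant's text lives in the span content itself.
            expanded[variant] = span_content
        else:
            expanded[variant] = value
    return expanded


data_mw = '{"text":{"zh-cn": "blog", "zh-hk": {"default":true}, "zh-tw": "WEBLOG"}}'
print(expand_text_map(data_mw, "WEBJOURNAL"))
```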

Alternative 2

This is the alternative currently implemented in Parsoid.

This option leaves the 'content' portion of the span empty, so that post-processing (or a JS switcher) can swap in the correct content.

The attribute is named data-mw-variant since it affects the read-only rendering of the page, and data-mw attributes are supposed to be ignored for rendering and only needed for editing.

Top-level fields in the JSON are: disabled, bidir, unidir, name, filter, and describe. If the wikitext "show" flag is neither present nor implied, the DOM markup will use the <meta> element. If "show" is present or implied, the DOM markup will use <span> if the contents are inlineable, or <div> otherwise.

foo-{bar baz}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":"bar baz"}}'></span> quux
</p>

foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw-variant='{"bidir":[{"l":"zh-cn","t":"blog"},{"l":"zh-hk","t":"WEBJOURNAL"},{"l":"zh-tw","t":"WEBLOG"}]}'></span> quux
</p>

foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

<p>
  foo<span typeof="mw:LanguageVariant" data-mw-variant='{"filter":{"l":["zh","zh-hans","zh-hant"],"t":"blog, WEBJOURNAL, WEBLOG"}}'></span> quux
</p>

foo-{H|WEBLOG=>zh-cn:blog;WEBLOG=>zh-hk:WEBJOURNAL}- quux

<p>
  foo<meta typeof="mw:LanguageVariant" data-mw-variant='{"add":true,"unidir":[{"f":"WEBLOG","l":"zh-cn","t":"blog"},{"f":"WEBLOG","l":"zh-hk","t":"WEBJOURNAL"}]}'/> quux
</p>

<span>a-{b<div>c</div>d}-e</span>

<p>
  <span>a</span>
</p>
<div typeof="mw:LanguageVariant" data-mw-variant='{"disabled":{"t":"b&lt;div data-parsoid=&#39;{\"stx\":\"html\",\"dsr\":[10,22,5,6]}&#39;>c&lt;/div>d"}}'></div>
<p>
  e
</p>
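A post-processing pass over this markup could look roughly like the sketch below. It only handles the field shapes shown in the examples above and does the naive thing: no character set conversion and no rule tables are applied, so it is not equivalent to what the language converter would render. The names render_variant, VariantCollector and render_all are assumptions made for this illustration.

```python
import json
from html.parser import HTMLParser


def render_variant(info, variant):
    """Naive text for `variant` from a parsed data-mw-variant object."""
    if "disabled" in info:
        return info["disabled"]["t"]
    if "filter" in info:
        return info["filter"]["t"]
    if "bidir" in info:
        for rule in info["bidir"]:
            if rule["l"] == variant:
                return rule["t"]
    # Rule-only markup (e.g. "unidir" on a <meta>) yields no inline text here.
    return ""


class VariantCollector(HTMLParser):
    """Collect the naive rendering of every mw:LanguageVariant element."""

    def __init__(self, variant):
        super().__init__()
        self.variant = variant
        self.rendered = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("typeof") == "mw:LanguageVariant" and "data-mw-variant" in attrs:
            info = json.loads(attrs["data-mw-variant"])
            self.rendered.append(render_variant(info, self.variant))


def render_all(html, variant):
    collector = VariantCollector(variant)
    collector.feed(html)
    return collector.rendered


sample = (
    '<span typeof="mw:LanguageVariant" data-mw-variant=\''
    '{"bidir":[{"l":"zh-cn","t":"blog"},{"l":"zh-hk","t":"WEBJOURNAL"},'
    '{"l":"zh-tw","t":"WEBLOG"}]}\'></span>'
)
print(render_all(sample, "zh-hk"))  # ['WEBJOURNAL']
```

A real switcher would also need to apply the fallback chain when the requested variant has no entry, and to track the rules added by meta elements.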

Alternative 3

This option puts all the alternatives into the DOM, more smoothly handling nested markup. This uses data-mw on the inner spans.

foo-{bar baz}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span data-mw='{"disabled":true}'>bar baz</span></span> quux
</p>

foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span data-mw='{"lang":"zh-cn"}'>blog</span><span data-mw='{"lang":"zh-hk"}'>WEBJOURNAL</span><span data-mw='{"lang":"zh-tw"}'>WEBLOG</span></span> quux
</p>

foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span data-mw='{"filter":["zh","zh-hans","zh-hant"]}'>blog, WEBJOURNAL, WEBLOG</span></span> quux
</p>

Alternative 3b

Like Alternative 3, this option puts all the alternatives into the DOM to smoothly handle nested markup. This variant uses the standard HTML5 lang attribute whenever possible, including setting it to the empty string (signifying "language unknown") where language conversion is disabled.

By making the nested content visible in the DOM, this more easily allows direct editing in the style described by phab:T17161#2354695.

foo-{bar baz}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span lang="">bar baz</span></span> quux
</p>

foo-{zh-cn:blog; zh-hk:WEBJOURNAL; zh-tw:WEBLOG;}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span lang="zh-cn">blog</span><span lang="zh-hk">WEBJOURNAL</span><span lang="zh-tw">WEBLOG</span></span> quux
</p>

foo-{zh;zh-hans;zh-hant|blog, WEBJOURNAL, WEBLOG}- quux

<p>
  foo<span typeof="mw:LanguageVariant"><span data-mw='{"filter":["zh","zh-hans","zh-hant"]}'>blog, WEBJOURNAL, WEBLOG</span></span> quux
</p>
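With all alternatives present in the DOM, selecting the content to display reduces to picking a child by its lang attribute. The following is a small sketch under that assumption; select_variant and the example chain are hypothetical, and the filter case (which still uses data-mw) is not handled.

```python
import xml.etree.ElementTree as ET


def select_variant(span_markup, fallback_chain):
    """Return the text of the child span to show, or None if nothing matches."""
    outer = ET.fromstring(span_markup)
    by_lang = {
        child.get("lang"): child
        for child in outer
        if child.get("lang") is not None
    }
    if "" in by_lang:
        # lang="" marks content for which conversion is disabled.
        return "".join(by_lang[""].itertext())
    for variant in fallback_chain:
        if variant in by_lang:
            return "".join(by_lang[variant].itertext())
    return None


markup = (
    '<span typeof="mw:LanguageVariant">'
    '<span lang="zh-cn">blog</span>'
    '<span lang="zh-hk">WEBJOURNAL</span>'
    '<span lang="zh-tw">WEBLOG</span>'
    '</span>'
)
print(select_variant(markup, ["zh-tw", "zh-hant", "zh"]))  # WEBLOG
```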

General language conversion plan

The wikitext language variant converter interface documented in Writing_systems/Syntax exposes two classes of operations:

  1. Selecting content in place by variant, and
  2. Dynamically modifying conversion rules that apply from that point in the page on.

In-place content selection is used not just for regular translation pairs, but also for constructs like -{zh-cn:foo;zh-tw:bar}-. Content is mostly well-nested, so we can represent this as an element. The exceptions found by grepping (see the regexp used below) are constructs like -{zh-cn:<div title="foo">;zh-tw:<div title="bar">}-. Those partly stem from times when the language converter could not be used inside attributes. We can probably fix this automatically by moving the variant block inside the attribute.

In general, we will render content-producing variant code based on the wiki's default variant and the fallback chain. Regular content conversion will only happen as a post-processing step on the saved Parsoid HTML.

Dynamic modification of rules does not seem to be needed in general. Page-global and per-category rules can replace template-based definitions. Until that is implemented, however, we need to represent existing add / remove rules inline. For constructs that also produce content, like -{A|foo}-, we can both render the content and record the rule modification in data-mw. Pure modifications (H flag) can be represented as meta tags.

Rule format for separately stored page-global rules

-{H|..}- and -{-|..}- can be represented as metas, others as spans. Block-level content seems to be rare.

  • {"*": "XXX"} for rules migrated from -{A|XXX}-
  • {"zh-cn": "tom hanks", "zh-hk": "SOUP HANS", "zh-tw": "TOM HANKS"}
  • {"zh-cn": {"HUGEBLOCK":"macro"}, "zh-hk": {"BLOCKHUGE":"big"}} for migrated -{H|HUGEBLOCK=>zh-cn:macro;BLOCKHUGE=>zh-hk:big;}-
  • {"zh-cn": {"HUGEBLOCK":"macro"}, "zh-hk": {"HUGEBLOCK":"big"}} for migrated -{H|HUGEBLOCK=>zh-cn:macro;HUGEBLOCK=>zh-hk:big;}-

For consumers of this format:

  • If a rule value is a string, it is a direct translation rule
  • If a rule value is an object, it contains one or more unidirectional nested rules
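The two cases for consumers can be told apart with a simple type check. A minimal sketch, assuming the rule objects listed above (classify_rules is a name invented for this example):

```python
def classify_rules(rules):
    """Split a page-global rule object into direct and unidirectional rules.

    Returns (direct, unidirectional), where `direct` maps variant -> text and
    `unidirectional` is a list of (source, variant, target) tuples.
    """
    direct = {}
    unidirectional = []
    for variant, value in rules.items():
        if isinstance(value, str):
            # String value: direct translation rule.
            direct[variant] = value
        elif isinstance(value, dict):
            # Object value: one or more nested unidirectional rules.
            for source, target in value.items():
                unidirectional.append((source, variant, target))
        else:
            raise TypeError(f"unexpected rule value for {variant!r}: {value!r}")
    return direct, unidirectional


rules = {"zh-cn": {"HUGEBLOCK": "macro"}, "zh-hk": {"BLOCKHUGE": "big"}}
print(classify_rules(rules))
```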

Other considerations

  • $wgDefaultLanguageVariant and its fallback chain are not currently exposed in the API. To find the chain, search for variantfallbacks in LanguageZh.php; it is retrievable from array_diff( $title->getPageLanguage()->getConverter()->getVariantFallbacks( $wgDefaultLanguageVariant ), $wgDisabledVariants ). Note that getVariantFallbacks can return a string OR an array depending on the input. It would seem to make more sense for getVariantFallbacks to do the array_diff itself, but it does not currently do so. We'll need both the default variant and its fallback chain to pick the right content to render for -{zh-tw:foo;zh-cn:bar}-.
  • What to store in data-mw='{"text":... when it contains some other structures?
    • HTML, but the content needs to be properly nested. Run node dumpGrepper.js '\-{[^}<]*<.*?}-' on a zhwiki dump to find potentially problematic language conversion blocks, then check nesting for them. Problematic cases seem to be:
      1. Different start tags per variant that really only differ in an attribute (title, for example). Conversion pairs are now also supported in attributes, so try to fix the wikitext to convert only the attribute.

Result of node dumpGrepper.js '\-{[^}<]*<.*?}-':

Total revisions: 2234532
Total matches: 773
Ratio: 0.034593373467016804%

Naive use of dumpGrepper will be misleading, as almost all of the -{ }- markup comes from templates.
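To see what the grep pattern does and does not flag, it can be tried on a few sample strings. The sketch below ports the regexp to Python's re module (with the braces escaped); find_suspect_blocks is a hypothetical helper and is not part of dumpGrepper.

```python
import re

# Python port of '\-{[^}<]*<.*?}-': a "-{" followed by a "<" before any "}",
# then the shortest run up to the next "}-".
PATTERN = re.compile(r"-\{[^}<]*<.*?\}-")


def find_suspect_blocks(wikitext):
    """Return conversion blocks that contain markup and may be badly nested."""
    return PATTERN.findall(wikitext)


print(find_suspect_blocks('-{zh-cn:<div title="foo">;zh-tw:<div title="bar">}-'))
print(find_suspect_blocks("foo-{zh-cn:blog; zh-hk:WEBJOURNAL;}- quux"))  # []
```

Blocks containing only plain text are not matched; blocks containing any tag are, whether or not the tag is actually unbalanced, so each match still needs a nesting check.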

  • Need a way to mark up in-place variant conversions in attributes. Idea that might also be useful for transclusion-affected attributes:

<div title="-{foo}-">

<div title="foo" data-mw='{"genAttrs":{"title":"<span typeof=mw:LanguageConvert data-mw=..>foo</span>"}}'>

Notes

Random notes about language converter discovered while implementing Parsoid support.

  • Language conversion blocks can be arbitrarily nested. One example of this is found in zh:Template:DISPLAYTITLE (trimming a bit):
-{T|
zh:-{zh|{{trim|{{{1|{{FULLPAGENAME}}}}}}}}-;
zh-hans:-{zh-hans|{{trim|{{{1|{{FULLPAGENAME}}}}}}}}-;
zh-hant:-{zh-hant|{{trim|{{{1|{{FULLPAGENAME}}}}}}}}-;
}-

The T flag sets the page title. If your current variant is zh-hans, it uses a nested "filter" block to extract the zh-hans version of {{trim|{{{1|{{FULLPAGENAME}}}}}}} (that is, a trimmed version of either the template argument or the FULLPAGENAME). The "filter" block also does character set conversion and rule evaluation if the argument does not have an explicit zh-hans variant. For example, if the argument is in zh-hant, then it will be converted from traditional to simplified characters using the currently active rule set.

  • Another example of nested rules is found in zh:Module:Template:地区用词 (the title of this module translates as "words used in some region"). The purpose of this script block is to display the preferred variant along with alternatives in parentheses. For example, a call:
{{...|zh-cn=blog|zh-hk=WEBJOURNAL|zh-tw=WEBLOG}}

produces "blog (hk: webjournal, tw: weblog)" for zh-cn readers, or "WEBLOG (CN: BLOG, HK: WEBJOURNAL)" for zh-tw readers. (Note that uppercase indicates traditional characters and lowercase represents simplified characters.)

It does this by emitting something like:

-{zh-cn:blog (hk: -{zh-hans|WEBJOURNAL}-, tw: -{zh-hans|WEBLOG}-); ...}-

Note again that it is the nested -{zh-hans|WEBJOURNAL}- which is responsible for converting "WEBJOURNAL" from traditional to simplified characters ("webjournal") for zh-cn readers.

  • As shown in these examples, emitting the correctly-rendered output for the "default wiki language variant" requires doing character set conversions and running rule sets. It is possible to naïvely emit the text in the block (or text corresponding to a particular variant), but the result may be misleading since it won't correspond to what the block would actually render.
    • If we're going to do the naïve thing, one option would be to just arbitrarily select the first variant specified in the block.
    • Another alternative would be to emit the variant which most closely matches the "default wiki language variant". This is somewhat complicated; there is a fallback chain defined by the language converter instance for a given language. For example, zh, gan, and iu each have their own fallback chain. We'd probably have to export the language fallback chain via the siteinfo API.
  • There is yet another syntax variation for language converter blocks. In line 1123 of LanguageConverter.php we see support for an xxx=>zh-hans:yyy; xxx=>zh-hant:zzz form in addition to the standard/documented zh-hans:xxx;zh-hant:yyy syntax. The former is mostly used in glossary templates (see Requests for comment/Scoped language converter).
  • The default flag seems to be 'S'; it falls back to 'R' behavior (displaying the rule itself) when there's no conversion table entry parsed out from the rule.
  • Semicolons are separators, but only if they are followed by a variant code and a colon, or a string, an arrow, a variant code and a colon [LanguageConverter::getVarSeparatorPattern()]. For example, ensure this test case makes it into parserTests.txt (if not already present):
echo '-{zh-hans:<span style="font-size:120%;">xxx</span>;zh-hant:<span style="font-size:120%;">yyy</span>;}-' | tests/parse.js
  • gwicke notes: "-{foo}- is the moral equivalent of -{*:foo}-" which might be a good way to handle the R flag.
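The semicolon rule noted above can be approximated with a lookahead. This is a simplified sketch written from the description in these notes, not a copy of LanguageConverter::getVarSeparatorPattern(); the variant list is an assumed subset, the names separator_pattern and split_rules are made up here, and treating a trailing semicolon as a separator is also an assumption.

```python
import re

# Assumed subset of variant codes, for illustration only.
VARIANTS = ["zh", "zh-hans", "zh-hant", "zh-cn", "zh-hk", "zh-tw"]


def separator_pattern(variants):
    """Build a regexp matching only those semicolons that separate rules."""
    codes = "|".join(re.escape(v) for v in sorted(variants, key=len, reverse=True))
    return re.compile(
        r";\s*(?="
        rf"(?:{codes})\s*:"              # ";" followed by "variant:"
        rf"|[^;]*?=>\s*(?:{codes})\s*:"  # ";" followed by "text=>variant:"
        r"|\s*$)"                        # trailing ";" at the end of the block
    )


def split_rules(body, variants=VARIANTS):
    """Split the inside of a -{ ... }- block into individual rules."""
    parts = separator_pattern(variants).split(body)
    return [part for part in parts if part.strip()]


body = (
    'zh-hans:<span style="font-size:120%;">xxx</span>;'
    'zh-hant:<span style="font-size:120%;">yyy</span>;'
)
for rule in split_rules(body):
    print(rule)
```

On the test case above, the semicolons inside the style attributes are left alone because they are not followed by a variant code and a colon, so the block splits into exactly two rules.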

See also