Parsoid/Parser Unification/Media structure

See the FAQ for more details about the state of this project, and the project workboard for the work currently in progress.

Intro

As we take steps towards converging with, and eventually replacing, core's legacy parser, one hop along the path is unifying the structure of media output. This is proposed in T118517, which implements T51097.

Two patches make up the bulk of the work:

gerrit:410362 - Move parsoid media styling to content (MERGED)
gerrit:507512 - Emit media structure as piloted in Parsoid (MERGED)

The second patch is hidden behind a config flag. We have yet to add a feature flag so that we can roll this out on a per-wiki basis. T266148 tracks that, with T271129 for testing the different modes.

Work history / log

In T251641, it was decided to stop using the custom element (figure-inline) for inline media, which Parsoid had deployed several years earlier, and to use a span instead. T266143 tracks Parsoid clients adding support for both so that Parsoid can change its output in,

~~gerrit:623906 - s/figure-inline/span/~~
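As an illustration of what "support for both" can look like on the client side, here is a minimal sketch that accepts either wrapper element and identifies inline media by its typeof attribute rather than by tag name. The typeof prefixes listed are assumptions, not taken from this page; check the Parsoid HTML spec for the values actually emitted.

```python
from html.parser import HTMLParser

# Assumption: inline media wrappers carry a typeof attribute starting with one
# of these prefixes. These values are illustrative, not taken from this page.
MEDIA_TYPEOF_PREFIXES = ("mw:Image", "mw:Video", "mw:Audio", "mw:File")

# Accept both the old custom element and the new generic wrapper.
INLINE_WRAPPERS = {"figure-inline", "span"}


class InlineMediaFinder(HTMLParser):
    """Collects (tag, typeof) for each inline media wrapper in a fragment."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_WRAPPERS:
            return
        typeof = dict(attrs).get("typeof") or ""
        # typeof may hold several space-separated values.
        if any(t.startswith(MEDIA_TYPEOF_PREFIXES) for t in typeof.split()):
            self.found.append((tag, typeof))


def find_inline_media(html):
    """Return the inline media wrappers found in an HTML fragment."""
    finder = InlineMediaFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Keying on the attribute means a plain span without a media typeof is ignored, so the same code keeps working before and after the output change.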

Parsoid claims to render identically while adding more semantic elements to the markup (i.e., the use of figure and figcaption instead of generic divs). To verify correctness, it has undergone several rounds of visual diff testing, and it serves as the basis of VisualEditor, which susses out many rendering differences. Another round of visual diff testing is scheduled in T266149.

Nevertheless, new bugs are still being discovered,

~~T193695: Horizontal alignment of media in Parsoid CSS has too much margin when "thumb" isn't present~~
- ~~gerrit:430629 - Only apply tright/tleft margins to frame/thumb~~
~~T269704: Default horizontal alignment should depend on content language, not the UI~~
- ~~See the confusion here, https://gerrit.wikimedia.org/r/c/mediawiki/core/+/196532/18/includes/Linker.php~~

There also remain some known open questions about the output,

~~T171761: Figcaption overflows image width on unbroken words~~
- ~~gerrit:430102 - Set break-word on figcaption~~
- ~~Maybe this is an indication that we should switch back to styling the figcaption as a table-caption, and always emitting it so that the bottom border is present~~
  - ~~Adding the figcaptions always could be useful regardless of switching back to the old css~~
~~T169975: Missing images render as broken img tags, not redlinks -- this is only an issue with Parsoid output, not with the changes to core.~~
- ~~Specs/HTML/2.1.0#Missing media~~
T272186: Extension:ImageMap appears to do regexp post-processing of image media HTML and probably needs an update. Are there other similar extensions?
- Native ImageMap extension in gerrit:585344
T268250: Decide on a structure for galleries
- Galleries are currently a div soup with inline media. The above task is about moving to a list of figures.
T270150: The css we're shipping needs stability and performance review
T271114: Update site specific css for new media structure

Finally, there is the need to proselytize this change in the community,

Come up with a story for how gadgets, user scripts, and other downstream tools will be migrated. See the notes below.
T113258: Draft email announcement about proposed change to output <figure> from legacy parser for media

Migrating Gadgets and User Scripts

Gadget usage on wikis: https://usage.toolforge.org/

Some notes from Parsing/Notes/Figure_for_Media:

Special:GadgetUsage is our friend.

Plus, mwgrep on wikis in the javascript namespace.

most gadgets don't necessarily inspect HTML -- maybe 30% use actual HTML; most things might look for ids ...

taking an inventory of gadgets

no reliable way of knowing how user scripts

a few gadgets everyone uses: popup, hotcat, (5 or so) ... ppl post in village pump in < 30 mins if they break

some 20 or 30 that a few more ppl use and will take a while to notice

last category: used for specialized processes .. about 100

page lists a lot of them
https://en.wikipedia.org/wiki/Wikipedia:User_scripts

commons has quite a lot; wikidata a few

hardest part is fixing things on other wikis, where scripts have been copied over

maintaining documentation about what we fixed could help
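For the inventory step in these notes, a rough first pass could scan gadget or user script source for class names from the legacy thumbnail markup. The sketch below is only a heuristic under stated assumptions: the class list is a hand-picked subset of what the legacy parser emits, and the function name is hypothetical.

```python
import re

# Class names emitted by the legacy parser for thumbnails. A script that
# mentions them may depend on the old structure. The list is a hand-picked
# subset chosen for illustration, not an exhaustive inventory.
LEGACY_CLASSES = ("thumb", "thumbinner", "thumbcaption", "thumbimage", "magnify")

# Match ".thumb" as a whole class token, not ".thumbnail" or ".thumbs".
LEGACY_SELECTOR = re.compile(
    r"\.(" + "|".join(LEGACY_CLASSES) + r")(?![\w-])"
)


def legacy_selectors(source):
    """Return the sorted legacy class names referenced in script source."""
    return sorted(set(LEGACY_SELECTOR.findall(source)))
```

A hit only flags a script for manual review. Property accesses such as `obj.thumb` also match, and scripts that build selectors dynamically are missed.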

The Reading Web Team has provided some additional info on how they've been making changes for the Desktop Improvements project.

generally relying on user notices (tech news) when making breaking changes, like this one
identifying potentially impacted pages and pinging users that own them
- pages can be identified using global-search.toolforge.org, with a script to parse out usernames
for JS breaking changes, they have client-side error logging
- this helps with gadget JS errors but doesn't help with CSS errors
may need to patch broken scripts ourselves
the process for communicating broken gadgets and finding the owners still needs improvement to be effective
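The step of parsing usernames out of a list of impacted pages could look roughly like the following. This is a sketch, not the Reading Web Team's script: it assumes the search results are available as plain page titles, that user pages use the English "User:" namespace prefix, and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Titles such as "User:Example/common.js". The "User:" prefix is the English
# namespace name and is an assumption; other wikis localize it.
USER_PAGE = re.compile(r"^User:([^/]+)/(.+\.(?:js|css))$")


def pages_by_owner(titles):
    """Group user script and style page titles by the owning username."""
    owners = defaultdict(list)
    for title in titles:
        title = title.strip()
        match = USER_PAGE.match(title)
        if match:
            owners[match.group(1)].append(title)
    return dict(owners)
```

Pages outside the user namespace, such as site-wide CSS, are skipped and would need to be routed to their maintainers some other way.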