Reading/Web/Projects/A frontend powered by Parsoid/HTML content research

As part of the preparation for the MediaWiki Developer Summit and the end of this quarter's research, we are taking a deep dive into our HTML content: what it is composed of, and how those parts affect size and rendering time for our readers.

Preparation

Data to compare

restbase article
api.php (action=parse)
api.php (mobileview)
loot transformations (individually)
- No transformations
- No ambox
- No navbox
- No references
- No images
- No superficial markup
 - Empty nodes
 - &nbsp; → ' '
 - Reference spans
 - &#91; → [
 - &#93; → ]
 - Transform mw:Entity to their contents
- No data-mw
- Extraneous markup
 - Parsoid-generated IDs
 - typeof and about attributes
 - rel="mw:WikiLink" attributes
 -  wrapping elements
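
The attribute-level items in the list above (data-mw, typeof, about, rel="mw:WikiLink" and Parsoid-generated IDs) can be sketched as a small filter. This is only an illustration of the idea, not the loot implementation: the names AttributeStripper and strip_markup are hypothetical, and the pattern used to recognise Parsoid-generated IDs is a guess that would need checking against real output. HTML comments are dropped by this sketch.

```python
import re
from html import escape
from html.parser import HTMLParser

# Attributes taken from the "No data-mw" and "Extraneous markup" items above.
STRIP_ATTRS = {"data-mw", "typeof", "about"}
# Assumption: Parsoid-generated IDs look like "mw" plus a short token.
GENERATED_ID = re.compile(r"^mw[A-Za-z0-9_-]{2,4}$")


class AttributeStripper(HTMLParser):
    """Re-serialises HTML while dropping attributes only needed for editing."""

    def __init__(self):
        # Keep entities as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _render(self, tag, attrs, closing=">"):
        kept = []
        for name, value in attrs:
            if name in STRIP_ATTRS:
                continue
            if name == "rel" and value == "mw:WikiLink":
                continue
            if name == "id" and value and GENERATED_ID.match(value):
                continue
            if value is None:
                kept.append(name)
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        self.parts.append("<" + " ".join([tag] + kept) + closing)

    def handle_starttag(self, tag, attrs):
        self._render(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._render(tag, attrs, closing="/>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_markup(html_text):
    """Return html_text without the editing-only attributes."""
    parser = AttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Comparing the byte length of the input and of strip_markup(input) gives the share of the payload that these attributes account for on a given article.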

What we want to measure

HTML size
Webpagetest speed
Device experience? (Timeline on devtools?)

How

reading-web-research.wmflabs.org/api/benchmark/
- restbase/:title
- mw/:title
- mv/:title
- loot/:title?noimages&nomarkup...

A script takes a list of titles, queries those endpoints and stores the output in a folder.
- The stored output is used for the HTML size analysis.
Once the script has run, the responses are cached on the reading-web-research server. The WebPageTest URLs can then be executed against it.
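
A minimal sketch of such a script follows. It is not the script that was actually used: the http:// scheme, the folder layout, and the names ENDPOINTS, http_get and fetch_all are assumptions made for illustration. Only the two loot flags spelled out above (noimages, nomarkup) are used; the remaining flags are left out because they are not listed here.

```python
import os
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the benchmark API is reachable over plain http.
BASE = "http://reading-web-research.wmflabs.org/api/benchmark"

# Endpoint patterns from the list above; the dictionary keys are made-up labels.
ENDPOINTS = {
    "restbase": "restbase/{title}",
    "mw": "mw/{title}",
    "mv": "mv/{title}",
    "loot-noimages-nomarkup": "loot/{title}?noimages&nomarkup",
}


def http_get(url):
    """Fetch a URL and return the raw response body as bytes."""
    with urlopen(url, timeout=60) as response:
        return response.read()


def fetch_all(titles, out_dir, fetch=http_get, base=BASE):
    """Query every endpoint for every title and store each response on disk.

    Returns a dict mapping (title, endpoint label) to response size in bytes.
    """
    sizes = {}
    for title in titles:
        # Percent-encode so the title is safe in a URL and as a file name.
        safe = quote(title.replace(" ", "_"), safe="")
        for name, pattern in ENDPOINTS.items():
            body = fetch(f"{base}/{pattern.format(title=safe)}")
            target = os.path.join(out_dir, name)
            os.makedirs(target, exist_ok=True)
            with open(os.path.join(target, safe + ".html"), "wb") as handle:
                handle.write(body)
            sizes[(title, name)] = len(body)
    return sizes
```

Calling fetch_all(["Barack Obama"], "output") would write one file per endpoint under output/ and, as a side effect, warm the server-side cache before the WebPageTest runs. The fetch parameter exists so the function can be exercised without network access.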

HTML size report

An initial report is available, covering the sample of pages decided in T120504.

This report compares the HTML sizes of the wiki content served from different endpoints. It first gives a general overview comparing Parsoid+RESTBase, MediaWiki action=parse, MediaWiki action=mobileview and the loot transformations (which remove and clean up pieces of the content that can be loaded on demand or automatically after page load).
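
The kind of comparison the report makes can be sketched as follows. The function names and the idea of passing payloads in a dictionary are assumptions for illustration, not how the report was generated. Sizes are measured in bytes, so text is encoded as UTF-8 before counting.

```python
def byte_size(payload):
    """Return the size in bytes; text is measured as UTF-8."""
    if isinstance(payload, str):
        payload = payload.encode("utf-8")
    return len(payload)


def size_report(variants, baseline):
    """Return (name, bytes, percent of baseline) rows, largest first.

    variants maps a variant name to its HTML payload (str or bytes);
    baseline is the name of the variant every other size is compared to.
    """
    base = byte_size(variants[baseline])
    rows = []
    for name, body in variants.items():
        size = byte_size(body)
        rows.append((name, size, round(100 * size / base, 1)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

For one article, variants would hold the responses from the restbase, mw, mv and loot endpoints, with the RESTBase response as the baseline.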

The report highlights a few things:

Parsoid+RESTBase output is always bigger than the output from the MediaWiki endpoints.
Parsoid+RESTBase enables performant, cacheable transformations: loot can transform and restructure the content down to a fraction of its size, and the endpoints stay fast because the results are cached.
- This allows loading different parts of the content separately from the main page content.
data-mw attributes make up a significant fraction of the content served by RESTBase. Work is under way to serve that information separately from a different endpoint and make the HTML leaner.
References consistently take up a very large percentage of the total size once data-mw attributes are stripped. It seems fair to assume that leaving references out of the initial payload would be a net win. Different strategies could be applied depending on further research (loading them automatically after the content has loaded, or loading them on demand when the user wants to check them).
- This seems to be an option only when using RESTBase+Parsoid as the API endpoint for fetching content; it is enabled by the cache infrastructure and the better parsing and transforming capabilities.
The navboxes, which are not mobile-friendly, also take up a considerable percentage of the content. Not serving them, or serving them on demand, seems fair on constrained devices.
The extraneous markup accounts for roughly 10% of the article weight and is only ever used when transforming the article, not when rendering it.
- Of the 6753 ID attributes present in the Parsoid-generated Barack Obama article, roughly 1254 are generated by the Cite extension. These account for 33 KB of the 86 KB that ID attributes add to the payload. It might be worth reconsidering the Cite extension's ID scheme.
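
The ID figures above can be put side by side with a little arithmetic. The numbers are the approximate ones quoted in this section, so the results are indicative only; the variable names are made up for the example.

```python
# Figures quoted above for the Parsoid-generated Barack Obama article.
total_ids = 6753      # all ID attributes
cite_ids = 1254       # IDs generated by the Cite extension (approximate)
total_id_kb = 86      # payload added by all ID attributes
cite_id_kb = 33       # payload added by Cite IDs

# Share of the ID count versus share of the ID bytes.
cite_share_of_ids = 100 * cite_ids / total_ids
cite_share_of_bytes = 100 * cite_id_kb / total_id_kb

# Average cost of a Cite ID relative to any other ID.
other_ids = total_ids - cite_ids
other_id_kb = total_id_kb - cite_id_kb
relative_cost = (cite_id_kb / cite_ids) / (other_id_kb / other_ids)

print(f"Cite IDs: {cite_share_of_ids:.1f}% of IDs, "
      f"{cite_share_of_bytes:.1f}% of ID bytes, "
      f"{relative_cost:.1f}x the average size of other IDs")
```

On these figures, Cite IDs are under a fifth of all IDs but well over a third of the bytes spent on IDs, which is why the Cite ID scheme stands out.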

And opens a few questions:

Why exactly is Parsoid+RESTBase content that much bigger than the MediaWiki API content?
What percentage do navboxes and references amount to when using the MediaWiki APIs?
What do these metrics look like with a much bigger sample of articles?
What percentage of content does superficial markup (see the Data to compare section for the definition) amount to? Is it worth cleaning up?
- Yes: it consistently accounts for ~12-15% of payload size (review the report again).
What percentage of views make use of navboxes and references?

Notes from conversations

data-mw separation is a big win and is being worked on
Unique IDs per element and about attributes would be interesting to measure (done, see Extraneous markup above)
- They consistently account for ~12-15% of payload size
Parsoid seems to reduce the number of tags in the DOM, which is also beneficial for performance in constrained browsers. This would be interesting to measure
- Reviewed: not a big difference by default right now.

Webpagetest report (browser loading time)

Article composition on 2G

Several WebPageTest runs were made on the Barack Obama article over a 2G connection. Images barely affect first paint on 2G. Note, however, that these test pages do not use stylesheets, so no stylesheet download competes with the images before first paint; in the real world images would have more of an impact on first paint. If you are optimising for bytes transferred or for fully loaded time, images are significant.

Note: the most reliable result for each configuration was taken to be the run with the lowest first paint in which the HTML, stylesheet and images all loaded without errors and the number of requests was the maximum possible.

Current content as served by Parsoid with data-mw stripped

First paint	DOM elements	Speed index	Fully loaded	Bytes in
79.295s	13287	81103	257.194s	867 KB

Current content as served by Parsoid with data-mw and images stripped

First paint	DOM elements	Speed index	Fully loaded	Bytes in
64.000s	13288	65600	87.667s	291 KB

Current content as served by Parsoid with data-mw and references/navbox stripped

First paint	DOM elements	Speed index	Fully loaded	Bytes in
35.648s	1619	36048	188.646s	584 KB

Current content as served by Parsoid with data-mw, images and references/navbox stripped

First paint	DOM elements	Speed index	Fully loaded	Bytes in
36.582s	1620	31000	41.491s	94 KB

(Note: ten seconds of each of these timings are spent waiting for the first byte, because the test server is running on a Labs instance.)
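
To make the four runs easier to compare, the reductions relative to the first run (data-mw stripped only) can be computed from the tables above. This is a small worked calculation over the published figures; the labels and the reduction helper are made up for the example, and the ten seconds of first-byte time are not subtracted.

```python
# (first paint s, fully loaded s, bytes in KB) copied from the tables above.
RUNS = {
    "data-mw stripped": (79.295, 257.194, 867),
    "+ images stripped": (64.000, 87.667, 291),
    "+ references/navbox stripped": (35.648, 188.646, 584),
    "+ images and references/navbox stripped": (36.582, 41.491, 94),
}
BASELINE = "data-mw stripped"


def reduction(value, base):
    """Percentage saved relative to the baseline value."""
    return 100 * (1 - value / base)


def reductions(runs=RUNS, baseline=BASELINE):
    """Map each non-baseline run to its (first paint, fully loaded, bytes) savings."""
    base = runs[baseline]
    return {
        name: tuple(reduction(value, ref) for value, ref in zip(values, base))
        for name, values in runs.items()
        if name != baseline
    }


for name, (paint, loaded, size) in reductions().items():
    print(f"{name}: first paint -{paint:.1f}%, "
          f"fully loaded -{loaded:.1f}%, bytes -{size:.1f}%")
```

On these figures, stripping images as well as references and navboxes cuts bytes by about 89% and first paint by about 54%, while stripping images alone saves about two thirds of the bytes but only about 19% of first paint.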