Reading/Web/Projects/A frontend powered by Parsoid/Notes/2015 11 12-Andoid-Web-Sync

Notes from a meeting to sync the Reading Web research project with the Android team.

The Android team is, as we know, developing a content service for the app based on restbase.

They are in the process of migrating the source content from mobileview to parsoid html.

To that end they transform the HTML, as we do in the research project, because the parsoid HTML is huge and designed for two-way editing rather than lean reading.

We went through different transformations and use cases and talked about how we could share these transforms.

HTML parsing library

The services team recommended that Bernd use domino, a node.js library that implements an in-memory DOM. It seems to be pretty fast, and one of the maintainers is cscott.

In our kickstart Sam & I decided to use libxmljs, a node library with bindings to the old and stable C library libxml2. Some benchmarks we found on the interwebs claimed that libxmljs was the most performant option.

At the next sync up we'll talk with the services team about this library and whether the performance gains (if any) are worth it.

The main difference is that libxmljs is queried with XPath, while domino exposes standard DOM APIs (including CSS selectors).

Sharing code

Seems like initially the most interesting way of sharing code would be to publish common transformations as semver-versioned npm libraries. More info on interesting transforms below in other sections. Seems like we could also share API-calling code, like fetching images with their properties, etc.

There is no agreed prefix to use in npm library names (wikimedia-parsoid-transform-images, for example). We'll talk about it in the next meeting with the services team.

Things mobileapps-content-service does that are still TODO on the research side

Auto redirects

There are two types of redirects.

301/302s from restbase. We should check whether our restbase querying follows redirects automatically (we are using the default http/https modules in node, the Android folks use the preq module).
Ad-hoc, editor-written redirects (<link rel="mw:PageProp/redirect" ...).
Example: http://reading-web-research.wmflabs.org/wiki/Confusion
We're not dealing with this now. We should.
https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/parsoid-access.js#L13-L85
We may just copy-paste it, since it's little code and it may be hard to abstract into a library.
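
As a rough idea of what detecting the editor-written redirects could look like on our side, here is a minimal sketch. It is not the mobileapps code linked above: findRedirectTarget is a made-up name, it uses a regex over the raw HTML instead of a DOM library, and it assumes the href is a relative "./Title" link.

```typescript
// Hypothetical helper, not taken from mobileapps-content-service.
// Looks for <link rel="mw:PageProp/redirect" href="..."> in Parsoid HTML and
// returns the redirect target title, or null when the page is not a redirect.
function findRedirectTarget(html: string): string | null {
  const linkTags = html.match(/<link\b[^>]*>/gi) ?? [];
  for (const tag of linkTags) {
    if (!/\brel=["']mw:PageProp\/redirect["']/i.test(tag)) {
      continue;
    }
    const href = /\bhref=["']([^"']*)["']/i.exec(tag);
    if (href) {
      // Assumption: hrefs are relative ("./Title") and percent-encoded.
      return decodeURIComponent(href[1].replace(/^\.\//, ""));
    }
  }
  return null;
}
```

The caller would then fetch the returned title instead, ideally with a cap on the number of hops so that redirect loops don't spin forever.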

Images

We're going to add image metadata soon. What the mobileapps service does is get the images from mw.api pageimages and, using a generator, fill in the image props & video props (license, author, etc).

One thing it doesn't handle properly right now is the order of the images on the page, which is not preserved.

Code here: https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/gallery.js

It's not clear how much of this we're going to need this quarter. For now, just grabbing the images & metadata from the parsoid html may be enough for us. We'll see, depending on how much of a gallery we want to implement in the UI.
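
If we do end up combining API metadata with the images found in the parsoid html, the ordering problem mentioned above could be handled roughly like this. This is only a sketch: ImageInfo and orderByAppearance are made-up names, and it assumes we already have the list of image titles in page order.

```typescript
// Hypothetical shape for the metadata returned by the API.
interface ImageInfo {
  title: string;
  license?: string;
  author?: string;
}

// Reorders API results so they follow the order in which the images appear
// on the page. Images the page order does not mention are kept at the end.
function orderByAppearance(titlesInPageOrder: string[], infos: ImageInfo[]): ImageInfo[] {
  const position = new Map<string, number>();
  titlesInPageOrder.forEach((title, index) => {
    // If an image is used more than once, its first occurrence wins.
    if (!position.has(title)) {
      position.set(title, index);
    }
  });
  const rank = (info: ImageInfo): number =>
    position.get(info.title) ?? Number.MAX_SAFE_INTEGER;
  // Copy before sorting so the caller's array is left untouched.
  return [...infos].sort((a, b) => rank(a) - rank(b));
}
```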

Main page

Main pages are completely broken on mobile viewports (from restbase/parsoid).

Mobileview proxying for Main_Page may be necessary if we want to show it there.
We should really push for removing the mobileview Main_Page hacks and work with community liaisons to go to the wikis and change the templates so that they are mobile friendly. This is not something we should keep fixing with hacks forever.

Extra info of a page

When grabbing a page, there are pieces of metadata that parsoid/restbase does not provide, like the wikibase description, etc.

Those will need to be queried in parallel and aggregated, from either mobileview or the normal query api.
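
A minimal sketch of the parallel query and aggregation, assuming the two fetchers are passed in from outside. All the names here (PageExtras, AggregatedPage, fetchPage) are hypothetical, and which endpoint actually backs fetchExtras is still undecided.

```typescript
// Hypothetical shapes, for illustration only.
interface PageExtras {
  description?: string; // e.g. the wikibase description
}

interface AggregatedPage {
  title: string;
  html: string;
  extras: PageExtras;
}

type Fetcher<T> = (title: string) => Promise<T>;

// The fetchers are injected so the sketch does not assume any particular
// HTTP client or API endpoint.
async function fetchPage(
  title: string,
  fetchHtml: Fetcher<string>,
  fetchExtras: Fetcher<PageExtras>
): Promise<AggregatedPage> {
  // Both requests are started before either is awaited, so they run in parallel.
  const [html, extras] = await Promise.all([
    fetchHtml(title),
    // Extras are treated as optional: fall back to an empty object on failure.
    fetchExtras(title).catch((): PageExtras => ({})),
  ]);
  return { title, html, extras };
}
```

Treating the extras as optional is a design choice in this sketch: the page still renders if the metadata call fails, at the cost of hiding that failure from the caller.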

More cleaning up of HTML

The apps folks have good experience removing the cruft that wikitext leaves behind in the HTML. They perform a bunch of removals to get a leaner HTML.

We should measure payload improvements over a sample group of articles with each transform to document why we do them when/if we productize any of this.

Examples:

Spans with ref brackets ([1])
Elements with display:none
Empty spans
Explicit spaces (&nbsp;)
https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transforms.js#L178-L208

Some of them are removed because they are cruft. Others, like the infobox, are first preprocessed and parsed, and only then removed from the HTML.

We'll probably take a pass at this and look into it more closely.
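
The payload measurement suggested above could start from something like the following sketch. It makes big assumptions: transforms are modeled as plain string to string functions, the two regexes are crude stand-ins for the real DOM-based transforms in the apps service, and all names are made up.

```typescript
// Hypothetical model: a transform takes HTML and returns leaner HTML.
type Transform = (html: string) => string;

const sampleTransforms: Record<string, Transform> = {
  // Remove spans that contain nothing at all.
  emptySpans: (html) => html.replace(/<span\b[^>]*><\/span>/g, ""),
  // Replace explicit non-breaking space entities with plain spaces.
  nbsp: (html) => html.replace(/&nbsp;/g, " "),
};

// UTF-8 size in bytes. Each %XX escape produced by encodeURIComponent stands
// for exactly one byte; every unescaped character is a single byte too.
function byteLength(text: string): number {
  return encodeURIComponent(text).replace(/%[0-9A-F]{2}/g, "x").length;
}

interface Saving {
  name: string;
  before: number;
  after: number;
  savedBytes: number;
}

// Applies each transform on its own to the original HTML and reports how many
// bytes it saves, so that every transform can be justified individually.
function measureSavings(html: string, candidates: Record<string, Transform>): Saving[] {
  const before = byteLength(html);
  return Object.keys(candidates).map((name) => {
    const after = byteLength(candidates[name](html));
    return { name, before, after, savedBytes: before - after };
  });
}
```

Running measureSavings over a sample group of articles and summing savedBytes per transform would give the numbers to document. These are uncompressed sizes; measuring after gzip would need an extra step.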

Sectioning & TOC

We're doing basic top level sectioning for now. The apps service does mobileview-like TOC & sectioning.

We envision a different structure for TOC & sections: instead of contiguous sections we'll probably end up with nested, semantically correct sections. We still have to talk more about that.

Bernd expressed interest in such a different way of structuring sections & TOC, since it would also benefit native rendering in the apps. We'll keep in touch about what we do in this area.
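
To make the contiguous versus nested distinction concrete, here is a small sketch that turns a flat, mobileview-style list of sections into a nested tree based on heading level. Nothing has been decided yet, so FlatSection, SectionNode and nestSections are purely illustrative names.

```typescript
// Hypothetical shapes. level is the heading level: 2 for <h2>, 3 for <h3>, ...
interface FlatSection {
  level: number;
  title: string;
}

interface SectionNode extends FlatSection {
  children: SectionNode[];
}

// Builds a tree in which every section is nested under the closest preceding
// section with a smaller heading level.
function nestSections(flat: FlatSection[]): SectionNode[] {
  const roots: SectionNode[] = [];
  const stack: SectionNode[] = []; // currently open ancestors, shallowest first
  for (const section of flat) {
    const node: SectionNode = { ...section, children: [] };
    // Close every open section that is at the same level or deeper.
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) {
      stack.pop();
    }
    if (stack.length > 0) {
      stack[stack.length - 1].children.push(node);
    } else {
      roots.push(node);
    }
    stack.push(node);
  }
  return roots;
}
```

The same tree could serve as the TOC, since the TOC is just the titles of the nested sections.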

Redlinks

We don't do anything about these; the apps service does. Whether we should within the scope of the research quarter is still unanswered. We'll think about it after we solve more pressing concerns.

Bernd: While we currently have code for it, it hasn't really had any effect since we moved to Parsoid, because there is no way for us to detect red links by just looking at the Parsoid output itself. They are indistinguishable from working links, see T39902. Looks like VE has implemented a solution where they make separate requests to check for red links: T39901.

Other thoughts

We talked about how great it would be to have a canonical service/library that, given a title, returns a data structure with the parsoid html preprocessed and transformed, properly structured/sectioned, and with all the available metadata normalized at the top level.

This service/library could be used by other services wanting to expose only certain subsets of the whole information or do additional preprocessing.

We'll talk more about that.