Page Content Service

The Page Content Service (PCS) is a set of Node.js-based services in Wikimedia production designed to deliver Wikimedia project page content and metadata for modern reading clients. It delivers:

  1. Optimized page content for modern clients to provide a full article reading experience
  2. A standard structured representation for pages that can be used for display within lists and previews
  3. Aggregated common CSS used for styling and theming articles
  4. Client logic for page interactions (changing reading themes, handling references, lazy loading images, and more), provided as a JavaScript interface that clients can execute locally

PCS delivers content in both HTML and JSON formats. It consolidates data from the Wikipedia, Commons and Wikidata MediaWiki APIs as well as the Parsoid and ORES APIs. It will supersede the mobile-sections endpoint of the Mobile Content Service (MCS).

PCS is maintained by the Content Transform team.

Endpoints

HTML endpoints

/page/mobile-html

Stability: experimental

 
Figure: page/mobile-html/{title} endpoint data flow

Provides Parsoid HTML with a few key differences:

  • Additions:
    • Theming support (default, sepia, dark, black) + dimming of images
    • Setting of body margins based on native client guidelines
    • Edit icons for sections + page description
    • Page header
      • Title
      • Pronunciation link (if present on page)
      • Description (from Wikidata or from local wiki) or link to add a new description
    • Page footer
      • Menu items ("last edited on" with link to page history, page issues, disambiguations, link to talk page, geo coordinates if found on page, ...)
      • Read more (makes an XHR to get a configurable number of related pages)
  • Removals:
    • Navboxes
  • Transformations:
    • Infoboxes are collapsible.
    • Reference list sections are collapsible.
    • Images of at least a certain size are lazy loaded.
    • The lead content paragraph is moved above the first infobox.

Mobile HTML transformations:

1 Apply .equation-box-elem to the element that includes .equation-box class.
2 Add class .figcaption-elem to the span elements inside figcaption tag.
3 Remove lazy load for images. This might be deprecated in the future, see T328943.
4 Add metadata (base url, title, description etc.) for tags inside HEAD and HEADER.
5 Prepare nodes for removal.
6 Mark header tag.
7 Determines if an element is a reference list that should be treated as an indicator that a section is dedicated to references. If the reference list is inside of a table, this will return false because that generally doesn't indicate a section that's dedicated to showing references.
8 Determines whether an element is a SECTION tag with the BODY tag as its parent.
9 Mark the current section as a reference section if we've found an indicator that it is a reference section. Reset the indicator if we found it. If this is the first reference section, save its id.
10 Mark element for theme exclusion.
11 Check if string has .no-theme class.
12 Enforce theme exclusion by running helper functions.
13 Exclude TD element if it doesn't have any inline styles but the parent does.
14 Mark for theme exclusion if some TR elements do not have an inline style but have a sibling TR element that is flagged as notheme.
15 Check for rules that do not allow theme exclusion.
16 Do not allow TD elements to receive notheme unless the parent node is already marked as notheme.
17 T269476 - fix corner case for template infobox structure.
18 T268279 - Exclude all elements inside .equation-box DIV.
19 T279568 - Exclude spans inside FIGCAPTION.
20 Prepare a reference for mobile-html output. Adjusts the structure of the HTML and adds the appropriate pcs hooks.
21 Prepare section depending on output mode.
22 Move the lead paragraph in the DOM tree.
23 Create lead section button.
24 Prepare sections for reference output.
25 Replace red links with empty SPAN.
26 Prepare section header.
27 Create edit button.
28 Mark SPAN element for removal if it has brackets and ‘Z3988’ metadata in class list.
29 Check whether a DIV has the .magnify class name and mark it for removal.
30 Mark all links for removal except ones that have rel attribute with value 'dc:isVersionOf'.
31 Remove elements that match one of these patterns:
  • Has coordinates id
  • Has nomobile or navbox in class list
  • Has geo-nondefault or geo-multi-punct or hide-when-compact in class list

Also applies the removal helper functions to DIVs, SPANs and LINKs.

32 Remove https: string from the element attribute.
33 Filter out images that belong to gallery for further transformation.
34 Prepare images for lazy loading.
35 Prepare anchor elements: transform red links, make schemeless.
36 Mark the ordered list if it is inside the reference section.
37 Push some list items into the reference section.
38 Push some span’s text into the reference element.
39 Do infobox transformations.
40 Mark infoboxes for further transformations.
41 Optimize images that come from Template:Gallery.
42 Prepare div elements, such as infoboxes and divs with class 'equation-box'.
43 Mark table as infobox.
44 Remove class names that contain whitebg keyword.
45 Apply extra paddings for SUP element to increase touch area on mobile devices.
46 Localize UI strings.
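
To make two of the rows above concrete, here is a minimal Python sketch of step 31 (the removal patterns) and step 32 (stripping the https: string). It is a restatement of the descriptions in the table, not the PCS source code; the names REMOVABLE_CLASSES, should_remove and make_schemeless are hypothetical, and the sketch assumes that "has coordinates id" means the id attribute equals "coordinates".

```python
# Hypothetical restatement of steps 31 and 32; not the PCS source code.
REMOVABLE_CLASSES = {
    "nomobile",
    "navbox",
    "geo-nondefault",
    "geo-multi-punct",
    "hide-when-compact",
}


def should_remove(element_id: str, class_list: list[str]) -> bool:
    """Step 31: True if the element matches one of the removal patterns."""
    # Assumption: "has coordinates id" means id == "coordinates".
    if element_id == "coordinates":
        return True
    return any(name in REMOVABLE_CLASSES for name in class_list)


def make_schemeless(url: str) -> str:
    """Step 32: drop a leading 'https:' so the URL becomes protocol-relative."""
    prefix = "https:"
    return url[len(prefix):] if url.startswith(prefix) else url


print(should_remove("", ["infobox", "navbox"]))
print(make_schemeless("https://upload.wikimedia.org/example.png"))
```

Keeping each rule as a small predicate or string function mirrors how the table describes the transformations: one narrow check per row, composed by the caller.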

More detail about the JavaScript interface can be found on the pcs.md page.

Examples: Prod | Beta cluster | Labs | Local RB | Local MCS
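
As a rough illustration of how a client might address this endpoint, the sketch below builds a request URL from a wiki domain and a page title. The /api/rest_v1 prefix is taken from the resource URLs listed elsewhere on this page; the function name pcs_url and the title normalization (spaces to underscores, percent-encoding of everything else) are assumptions made for illustration, not a documented PCS client API.

```python
from urllib.parse import quote

# Assumed path prefix, taken from the resource URLs shown on this page.
REST_PREFIX = "/api/rest_v1"


def pcs_url(domain: str, endpoint: str, title: str) -> str:
    """Build a URL for a PCS page endpoint such as 'page/mobile-html'."""
    # Assumption: titles are written with underscores instead of spaces.
    normalized = title.replace(" ", "_")
    # Percent-encode the rest, including '/', so that a title containing
    # a slash stays a single path segment.
    encoded = quote(normalized, safe="")
    return f"https://{domain}{REST_PREFIX}/{endpoint}/{encoded}"


print(pcs_url("en.wikipedia.org", "page/mobile-html", "Albert Einstein"))
```

The same helper works for the JSON endpoints by passing a different endpoint string, for example "page/summary".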

JSON endpoints

/page/summary

Stability: stable

 
Figure: page/summary/{title} endpoint data flow

The Summary serves two purposes:

  1. It provides the data necessary for the representation of a page within a page/link preview, search results, other lists, etc…
  2. It provides basic metadata necessary for clients to make business logic and navigation decisions before displaying a page.

To accomplish number 1, it contains some basic metadata: an image/thumbnail, a description, the first paragraph of the page in plain-text and HTML form (extract and extract_html), and the article language and directionality (RTL or LTR). Prefer extract_html over extract, since some complex formulas are handled better in HTML than in plain text.

To accomplish number 2, it contains some semantic information on the page, its namespace, and various URLs so that clients can understand the content of the page before deciding how to display it.

Additionally, the Summary structure is provided in other APIs (like the feed) that return lists of pages.
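
A client-side sketch of the first purpose might look like the following. It picks display fields from a summary-like dictionary and prefers extract_html over extract, as recommended above. Only extract and extract_html are named on this page; the field names title and dir, the Preview type, and the sample data are assumptions for illustration.

```python
import html
from dataclasses import dataclass


@dataclass
class Preview:
    title: str
    body_html: str
    direction: str  # "ltr" or "rtl"


def build_preview(summary: dict) -> Preview:
    """Pick display fields from a summary-like dict, preferring HTML."""
    body = summary.get("extract_html")
    if not body:
        # Fall back to the plain-text extract, escaped for use as HTML.
        body = html.escape(summary.get("extract", ""))
    return Preview(
        title=summary.get("title", ""),   # assumed field name
        body_html=body,
        direction=summary.get("dir", "ltr"),  # assumed field name
    )


# Hand-written sample data, not a real API response.
sample = {
    "title": "Example",
    "dir": "ltr",
    "extract": "E = mc2",
    "extract_html": "<p>E = mc<sup>2</sup></p>",
}
print(build_preview(sample))
```

The sample shows why the HTML form is preferable: the superscript survives in extract_html but is flattened in the plain-text extract.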

See also: Page_Previews/API_Specification

Example URLs: Prod | Beta cluster | Labs | Local RB | Local MCS

For comparison, here is the action=query request this endpoint replaces: Prod. The current version no longer uses TextExtracts, though. Instead, PCS gets more of the information from the respective Parsoid HTML output and applies some transformations to it.

/page/media-list

Stability: experimental

 
Figure: page/media-list/{title} endpoint data flow

Lists media items shown on a page: images, videos, and audio. This is useful for clients wishing to build a gallery interface for content within a page or for downloading images for offline reading.
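
As an illustration of the gallery use case, the sketch below collects one image URL per image item from a media-list-like dictionary and skips videos and audio. The response shape used here (an items list whose entries carry type and a srcset of src values) is an assumption for this example and should be checked against the API spec; the sample data is hand-written.

```python
def gallery_image_urls(media_list: dict) -> list[str]:
    """Return one https URL per image item in a media-list-like dict."""
    urls = []
    for item in media_list.get("items", []):  # "items" is an assumed key
        if item.get("type") != "image":
            continue  # skip videos and audio
        srcset = item.get("srcset") or []
        if srcset:
            src = srcset[0]["src"]
            # Schemeless URLs get an explicit scheme for downloading.
            urls.append("https:" + src if src.startswith("//") else src)
    return urls


# Hand-written sample data, not a real API response.
sample = {
    "items": [
        {"type": "image", "srcset": [{"src": "//upload.example.org/a.jpg"}]},
        {"type": "audio"},
        {"type": "image", "srcset": []},
    ]
}
print(gallery_image_urls(sample))
```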

Example URLs: Prod | Beta cluster | Labs | Local RB | Local MCS

/page/mobile-html-offline-resources

Stability: experimental

 
Figure: page/mobile-html-offline-resources/{title} endpoint data flow

Lists the schemeless URLs of the CSS and JS resources needed to display a page offline on mobile. The motivation for this endpoint is to tell native clients which additional files they need to download when saving a page for offline use, without having to parse the page.

Example URLs: Prod | Beta cluster | Labs | Local RB | Local MCS

Example output:

[
  "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base",
  "//meta.wikimedia.org/api/rest_v1/data/css/mobile/pcs",
  "//meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pcs",
  "//en.wikipedia.org/api/rest_v1/data/css/mobile/site"
]
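
A client saving a page for offline use could turn this list into a download plan. The sketch below, using the example output above as input, adds a scheme to each schemeless URL and classifies it as CSS or JavaScript from its path. The helper names and the choice of https as the scheme are assumptions for illustration.

```python
from urllib.parse import urlsplit

# The example output shown above.
RESOURCES = [
    "//meta.wikimedia.org/api/rest_v1/data/css/mobile/base",
    "//meta.wikimedia.org/api/rest_v1/data/css/mobile/pcs",
    "//meta.wikimedia.org/api/rest_v1/data/javascript/mobile/pcs",
    "//en.wikipedia.org/api/rest_v1/data/css/mobile/site",
]


def to_absolute(url: str, scheme: str = "https") -> str:
    """Prefix a schemeless ('//host/path') URL with a scheme."""
    return f"{scheme}:{url}" if url.startswith("//") else url


def resource_kind(url: str) -> str:
    """Classify a resource as 'css', 'javascript' or 'other' by its path."""
    path = urlsplit(url).path
    if "/data/css/" in path:
        return "css"
    if "/data/javascript/" in path:
        return "javascript"
    return "other"


for url in RESOURCES:
    print(resource_kind(url), to_absolute(url))
```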



Clients

PCS can be used by any WMF or third-party client that wants to display page content in reading contexts. As mentioned above, the /page/summary endpoint is heavily used elsewhere and is already used by the native apps and the web PagePreview feature. /page/mobile-html has some coupling to the wikimedia-page-library and is somewhat tied to design decisions made for the native WMF apps. If needed, another HTML endpoint could be added that sits somewhere between Parsoid HTML and /page/mobile-html.

Within the WMF, the following clients are expected to integrate use of /page/mobile-html in 2020:

  1. Wikipedia Android App
  2. Wikipedia iOS App

External links

  • Usage documentation can be found at the API spec