Page Previews/API Specification

For documentation on the completed API, see Page Content Service and the live API spec.

This article outlines the specification for a new Node.js-based API that generates summaries for MediaWiki-based wikis, replacing the existing TextExtracts API.

Background & Motivation

Up until now, we've mostly gotten away with using the prop=extracts MediaWiki API provided by TextExtracts and RESTBase to allow us to scale out Page Previews to a couple of large Wikipedias without issue.

However, the requirement that certain classes of pages should be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the prop=extracts API and the new Page Preview API rather than integrating them, but this is not a goal of this work.

To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.

The specification

Intros

The API returns well-formed HTML representing the introductory elements of a page, which are defined as follows:

  • The first paragraph from the introductory section.
  • The first ordered, unordered, or definition list that is the next sibling of the first paragraph.

Herein we'll refer to these elements as an "intro".
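The selection rule above can be sketched as follows. This is a minimal illustration, not the service's implementation: the `Block` type is a hypothetical, simplified stand-in for the top-level elements of the introductory section, and `selectIntro` is an illustrative name.

```typescript
// Hypothetical, simplified model of a top-level element in the
// introductory section (not a real DOM type).
interface Block {
  tag: string;  // lower-case tag name, e.g. "p", "ul", "table"
  html: string; // serialized HTML of the element
}

const LIST_TAGS = ["ol", "ul", "dl"];

// Returns the first paragraph and, if it is the paragraph's next sibling,
// the list that follows it. Returns [] when there is no paragraph.
function selectIntro(blocks: Block[]): Block[] {
  for (let i = 0; i < blocks.length; i++) {
    const block = blocks[i];
    if (block === undefined || block.tag !== "p") {
      continue;
    }
    const intro = [block];
    const next = blocks[i + 1];
    if (next !== undefined && LIST_TAGS.indexOf(next.tag) !== -1) {
      intro.push(next);
    }
    return intro;
  }
  return [];
}
```

A list that is separated from the first paragraph by any other element is not part of the intro under this reading.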

Plaintext intros

Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.

  https://gerrit.wikimedia.org/r/370694

Empty intros

After the HTML intro has been processed (see below), it may not contain text content but still contain HTML, e.g. <p><b></b></p>. Any processed intro that doesn't contain text content must be considered empty.

  Implemented
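A minimal sketch of the emptiness check, assuming a regex-based tag strip is good enough for illustration (a real implementation would more likely inspect the DOM's text content). The function name `isEmptyIntro` is hypothetical, and treating whitespace-only text as empty is an assumption.

```typescript
// Returns true when the processed intro has no text content, even if
// markup remains. Whitespace-only text counts as empty (assumption).
function isEmptyIntro(html: string): boolean {
  const text = html
    .replace(/<[^>]*>/g, "") // drop all tags
    .replace(/\s+/g, "");    // drop all whitespace
  return text.length === 0;
}
```

With this sketch, `isEmptyIntro("<p><b></b></p>")` is `true` and `isEmptyIntro("<p><b>Foo</b></p>")` is `false`.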

Markup allowed in an intro

By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.

Emphasis

The API must retain any bolded or italicised text in the intro, i.e. the Page Preview API must not remove b, i, and em tags.

  Implemented

Formulae/MathML

In order to support browsers that don't support MathML, the API:

  1. Must remove math tags; and
  2. Must not remove either the inline or block layout fallback images generated by Math while parsing the page.
  Implemented
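As a rough sketch of the first rule, the following removes math elements and leaves any sibling fallback images untouched. It assumes math elements are not nested and that a regex is acceptable for illustration; `stripMathTags` is a hypothetical name.

```typescript
// Removes <math>...</math> elements, including their contents.
// Fallback <img> elements next to them are left in place.
function stripMathTags(html: string): string {
  return html.replace(/<math(?:\s[^>]*)?>[\s\S]*?<\/math>/gi, "");
}
```

For example, `stripMathTags('<p><math><mi>x</mi></math><img src="x.svg" alt="x"></p>')` yields `'<p><img src="x.svg" alt="x"></p>'`.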

Super- and subscript

The API must retain all sup and sub tags except those generated by Cite, i.e. <sup class="reference"> elements.

  Implemented
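The emphasis and super-/subscript rules can be expressed as a small predicate. This is a sketch under stated assumptions: `isRetained` and `RETAINED_TAGS` are hypothetical names, and the predicate covers only these two rules, not the MathML fallback images or the other cases in this section.

```typescript
// Tags that the emphasis and super-/subscript rules say must be kept.
const RETAINED_TAGS = ["b", "i", "em", "sup", "sub"];

// Decides whether a tag survives, given its tag name and the raw value
// of its class attribute ("" when absent).
function isRetained(tag: string, classAttr: string): boolean {
  const tagName = tag.toLowerCase();
  if (RETAINED_TAGS.indexOf(tagName) === -1) {
    return false;
  }
  // Cite renders references as <sup class="reference">; those are dropped.
  const classes = classAttr.split(/\s+/);
  if (tagName === "sup" && classes.indexOf("reference") !== -1) {
    return false;
  }
  return true;
}
```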

Stripping of parenthetical statements

The API must remove all content enclosed within balanced parentheses. Parentheses are defined as the following characters: the ASCII pair ( ) and the full-width pair （ ）.

  Implemented
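One way to strip balanced parentheticals is a single pass with a stack of opening positions. The sketch below works on plain text and makes several assumptions that the specification does not state: the second pair of characters in the definition above is taken to be the full-width forms U+FF08 and U+FF09, either opener may be closed by either closer, unbalanced parentheses are left in place, and runs of spaces left behind are collapsed. `stripParentheticals` is a hypothetical name.

```typescript
const OPENERS = "(\uFF08"; // ASCII and full-width opening parentheses
const CLOSERS = ")\uFF09"; // ASCII and full-width closing parentheses

// Removes every balanced parenthetical, including nested ones.
function stripParentheticals(text: string): string {
  let out = "";
  const starts: number[] = []; // positions in `out` of unclosed openers
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (OPENERS.indexOf(ch) !== -1) {
      starts.push(out.length);
      out += ch;
    } else if (CLOSERS.indexOf(ch) !== -1 && starts.length > 0) {
      // Balanced: discard everything since the matching opener.
      const start = starts.pop() as number;
      out = out.slice(0, start);
    } else {
      out += ch;
    }
  }
  // Collapse the double spaces that removal leaves behind.
  return out.replace(/ {2,}/g, " ");
}
```

This sketch does not tidy a space left before punctuation (for example "Foo , baz") and does not account for parentheses inside HTML attributes, so it would need to run on text nodes only.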

Flattening inline elements

The API must replace all span and a tags with their text content, e.g. <span>Foo</span> must be flattened to Foo and <a href="/foo">Foo</a> must be flattened to Foo.

  Implemented
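A minimal regex-based sketch of flattening, assuming attribute values never contain a literal ">" (a DOM-based implementation would not need that assumption). `flattenInline` is a hypothetical name.

```typescript
// Removes opening and closing <span> and <a> tags, keeping their contents.
// The pattern requires whitespace or ">" after the tag name, so tags such
// as <abbr> are not affected.
function flattenInline(html: string): string {
  return html.replace(/<\/?(?:span|a)(?:\s[^>]*)?>/gi, "");
}
```

Because only the tags are removed, any retained markup nested inside a span or link (for example a b tag) is kept.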

noexcerpt

The API must remove any element with the noexcerpt class to replicate the current behaviour of TextExtracts.

  Implemented
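The noexcerpt rule removes whole elements, so it is easier to show on a tree than on a string. The sketch below uses a hypothetical, simplified `ElementNode` type rather than a real DOM; `removeNoExcerpt` is an illustrative name.

```typescript
// Hypothetical minimal element tree: children are elements or text.
interface ElementNode {
  tag: string;
  classes: string[];
  children: Array<ElementNode | string>;
}

// Returns a copy of the tree without any descendant element that has
// the noexcerpt class. Text nodes are always kept.
function removeNoExcerpt(node: ElementNode): ElementNode {
  const kept: Array<ElementNode | string> = [];
  for (let i = 0; i < node.children.length; i++) {
    const child = node.children[i];
    if (typeof child === "string") {
      kept.push(child);
    } else if (child !== undefined && child.classes.indexOf("noexcerpt") === -1) {
      kept.push(removeNoExcerpt(child));
    }
  }
  return { tag: node.tag, classes: node.classes, children: kept };
}
```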

Line breaks

It is assumed that any line breaks in the summary are necessary for the display of the content. We thus do not remove any instance of a line break that appears in the lead paragraph of a summary.

Request

Parameters

Name | Type | Description
title | String | The title of the page to get the intro for.
  Implemented

Responses

A successful response from the Page Preview API, like those of all existing endpoints, must have the following properties:

Name | Type | Description
titles | Titles | The various titles of the page.
lang | String | The 2- or 3-character ISO 639-1/ISO 639-3 code of the language of the intro. This should be the site content language or the page content language.
dir | Enum | The direction of the script used to render the language of the intro. One of "ltr" or "rtl".
last_modified | String | The time at which the page was last modified in ISO 8601 format.
thumbnail | ?Image | The thumbnail of the image associated with the page. The thumbnail's largest side must not exceed 320px. By default, this property should not be present.
original | ?Image | The original of the image associated with the page as determined by PageImages. By default, this property should not be present.
wikidata_description | ?String | The description of the Wikidata item.

The new summary endpoint will hydrate these properties with the additional fields specific to summaries:

Name | Type | Description
type | Enum | The notional type of the intro. One of "disambiguation", "wikidata", or "standard".
intro | String | The intro of the page represented as well-formed HTML5.
plaintext_intro | String | The intro of the page represented as plaintext. This property supersedes the extract property of the current RESTBase Page Summary endpoint.
disambiguation_links | ?Titles[] | The titles of the first N links from the disambiguation page. By default, this property should not be present.

  Done

Where an Image type property must have the following properties:

Name | Type | Description
source | String | The URL of the image.
width | Integer | The width of the image in px.
height | Integer | The height of the image in px.

And a Titles type property must have the following properties:

Name | Type | Description
denormalized | String | The title of the page, e.g. File:Igorrr_(band).
normalized | String | The normalized title of the page, e.g. Igorrr (band).jpg
display | String | The editor-formatted title of the page (see https://www.mediawiki.org/wiki/Help:Magic_words#Displaytitle), e.g. <strong>Igorrr (band).jpg</strong>.
namespace_id | Integer | The ID of the namespace that the page is in on the wiki.
namespace_name | String | The localized name of the namespace, e.g. User, Usuario, etc.
page_id | Integer | The internal ID of the page.
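For readers who prefer types to tables, the response shape described above could be written as TypeScript interfaces. This is an illustrative sketch, not a published type definition: the interface names (`PageTitles`, `PreviewImage`, `SummaryResponse`) are hypothetical, the "?" prefix in the tables is read as "optional", and the `type` union lists only the three values given in the table. The helper `isValidThumbnail` checks the 320px limit stated for thumbnails.

```typescript
interface PageTitles {
  denormalized: string;
  normalized: string;
  display: string;
  namespace_id: number;
  namespace_name: string;
  page_id: number;
}

interface PreviewImage {
  source: string; // URL of the image
  width: number;  // in px
  height: number; // in px
}

interface SummaryResponse {
  titles: PageTitles;
  lang: string;
  dir: "ltr" | "rtl";
  last_modified: string; // ISO 8601
  thumbnail?: PreviewImage;
  original?: PreviewImage;
  wikidata_description?: string;
  type: "disambiguation" | "wikidata" | "standard";
  intro: string;
  plaintext_intro: string;
  disambiguation_links?: PageTitles[];
}

const MAX_THUMBNAIL_SIDE = 320;

// A thumbnail's largest side must not exceed 320px.
function isValidThumbnail(image: PreviewImage): boolean {
  return Math.max(image.width, image.height) <= MAX_THUMBNAIL_SIDE;
}
```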

For a page in the wiki's content namespace(s)

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "standard".

If the page has a corresponding Wikidata item, then the wikidata_description property must be set to the item's description.

  Implemented

For a page outside of the wiki's content namespaces

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".

  Implemented

For a page that doesn't use the wikitext, wikibase-item, or wikibase-property content model

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".

  Implemented

For a disambiguation page

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "disambiguation".

The disambiguation_links property of the response must be set to the first N links from the disambiguation page.

The intro property of the response should be set to the intro of the page so that the client may display it if appropriate.

  Blocked

For a page that doesn't exist

The Page Preview API must respond with 404 Not Found.

The response body must be empty.

  Implemented

For a page that doesn't have a lead section

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "standard".

The intro property of the response must be set to "".

Examples

  1. https://en.wikipedia.org/wiki/Wikipedia:Dashboard
  Implemented

For a page that has an empty intro

The response must be the same as the "For a page that doesn't have a lead section" case.

  Implemented
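The response cases described so far can be summarised as a decision function. This is a sketch under stated assumptions: the order in which the checks are applied is not given by the specification, the `PageInfo` and `Outcome` types are hypothetical, and the `html` field stands in for whichever HTML text property the response carries. Redirects and Wikidata items are not covered.

```typescript
// Hypothetical summary of what the service knows about a page.
interface PageInfo {
  exists: boolean;
  inContentNamespace: boolean;
  contentModel: string;
  isDisambiguation: boolean;
  html: string; // processed intro; "" when there is no lead section or the intro is empty
}

// Hypothetical outcome: HTTP status plus, for 200 responses, type and text.
interface Outcome {
  status: number;
  type?: string;
  html?: string;
}

const SUPPORTED_MODELS = ["wikitext", "wikibase-item", "wikibase-property"];

function decideOutcome(page: PageInfo): Outcome {
  if (!page.exists) {
    return { status: 404 }; // the body must be empty
  }
  if (!page.inContentNamespace || SUPPORTED_MODELS.indexOf(page.contentModel) === -1) {
    return { status: 200, type: "no-extract", html: "" };
  }
  if (page.isDisambiguation) {
    return { status: 200, type: "disambiguation", html: page.html };
  }
  // Includes pages with no lead section or an empty intro (html === "").
  return { status: 200, type: "standard", html: page.html };
}
```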

For a page that redirects to another page

The Page Preview API must respond with 302 Found.

The Location HTTP header must be set to the URL that will get the intro for the target page.

Note: RESTBase handles redirects transparently to the underlying service (see T176517#3634838).

The Page Preview API must respond with 200 OK.

The type property of the response must be set to "no-extract".

The extract and extract_html properties must be set to "".

Responses for Wikidata (from T111231: Page previews for Wikidata)

For a Wikidata item

This overrides the "For a page in the wiki's content namespace" case above.

The type property must be set to "wikidata_preview".

All members of the titles property object must be set to their equivalent of the item's label.

The extract property must be set to the item's description.

If the item has the image property set (to I):

  • The image property must be set to the Image object that represents the Wikimedia Commons file referenced by I.
  • The thumbnail property must be set to the Image object that represents the corresponding thumbnail.
Notes

The item's description should be in the user's language. If the description isn't available in the user's language, then the API must follow the language fallback chain until one is available.
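The fallback rule can be sketched as a lookup over an ordered list of language codes. The function name, the shape of the `descriptions` map, and the way the fallback chain is supplied are all assumptions made for illustration; the specification does not define them. Returning "" when nothing is found is also an assumption.

```typescript
// Picks the description in the user's language, or the first one
// available along the fallback chain. Returns "" when none is found.
function pickDescription(
  descriptions: { [lang: string]: string },
  userLang: string,
  fallbackChain: string[]
): string {
  const candidates = [userLang].concat(fallbackChain);
  for (let i = 0; i < candidates.length; i++) {
    const lang = candidates[i];
    if (lang === undefined || !Object.prototype.hasOwnProperty.call(descriptions, lang)) {
      continue;
    }
    const description = descriptions[lang];
    if (description !== undefined && description !== "") {
      return description;
    }
  }
  return "";
}
```

For example, with descriptions available in "de" and "en" only, a user language of "gsw" and a chain of ["de", "en"] selects the German description.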

For a Wikidata item with no description

The response should be the same as in the "For a Wikidata item" case, apart from the following:

The extract and extract_html properties of the response must be set to "".