User:Peter17/Reasonably efficient interwiki transclusion

This is a draft for my GSoC-2010 project, reasonably efficient interwiki transclusion, written after discussing with my mentor User:Catrope and updated several times during the project.

Initial state

Currently, some functions (interwikiTransclude and fetchScaryTemplateMaybeFromCache in Parser.php) allow interwiki transclusion of distant templates.

The interwiki table contains the known interwiki prefixes. For each of them, the value iw_trans can be set to 1 to allow (or 0 to disallow) transclusion from that wiki.

If $wgEnableScaryTranscluding is set to true, when a transclusion call refers to an article of another wiki and if transclusion from that wiki is allowed by iw_trans, then:

  • fetchScaryTemplateMaybeFromCache checks whether the template was cached less than 1 hour ago (the default expiry)
    • if so, the cached template is used
    • if not, a GET request is made to retrieve the content from the distant wiki
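
The check described above can be sketched as follows. This is only an illustration in Python, not the code of Parser.php: the names fetch_template, fetch_remote and CACHE_TTL_SECONDS are hypothetical, and the cache is a plain dictionary.

```python
import time

CACHE_TTL_SECONDS = 3600  # assumption: the 1-hour default expiry described above


def fetch_template(url, cache, fetch_remote, now=time.time):
    """Return the cached content if it is younger than the TTL, else refetch it.

    cache maps url -> (fetched_at, content); fetch_remote stands in for the
    GET request to the distant wiki.
    """
    entry = cache.get(url)
    current = now()
    if entry is not None and current - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]  # fresh enough: no request is made
    content = fetch_remote(url)
    cache[url] = (current, content)
    return content
```

Note that the decision depends only on the age of the cached copy, never on whether the distant template actually changed.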

There are two different possible formats to retrieve (and cache) the content: raw wikitext and html.

The flaw of this system is that the data is cached for an arbitrary time, which means:

  • When a template is almost never modified, the cache is still refreshed every hour for nothing, so we lose some performance.
  • When a template is actually modified, in the worst case the cache is only updated 1 hour later, and the users of the local wiki do not see the changes made to the template during that time.

So, the cache should be updated if and only if necessary.

I made some tests on May 10th.

Good points

  • It works: I can transclude my userpage from this wiki (mediawiki.org) to a wiki hosted on my computer using the syntax {{mediawikiwiki::User:Peter17}}!
  • The links become full links: [[/Reasonably efficient interwiki transclusion]] becomes http://www.mediawiki.org/wiki/User:Peter17/Reasonably_efficient_interwiki_transclusion
  • If the wanted page calls some templates or parser functions, then, they are used to render the content, which is good!

Issues

  • When transcluding {{interwikiid:templateName|param}}, what is transcluded is just the content of the page templateName of the distant wiki, which means:
    • The parameters are totally ignored.
    • The <noinclude> and <includeonly> tags behave as if the page were not being transcluded, which is the opposite of the expected behavior...
  • The transcluded content stays cached for 1 hour, which means that even purging the cache of the local page will not update it...
  • The fact that links point to the distant pages might not be the expected behavior...
  • All the parsing is done by the distant wiki, which is expensive for it. If foreign wikis want our templates, they could at least parse them by themselves...

Proposed approach

After a discussion on wikitech-l, some people, notably Chad and Aryeh Gregor, suggested using an approach similar to that of FileRepo (see Manual:$wgForeignFileRepos). FileRepo is a class meant to allow the inclusion of distant files. It uses different backends in different cases; the equivalent backends for templates are described below (see "Done work").

Questions and remarks

  • Is it possible to rely on templatelinks to obtain the list of all templates called by a page? If A calls B and B calls C, and B is then modified to call D instead of C, will this be taken into account immediately in A's list of template links?
    • It will be taken into account, yes, although possibly not immediately (deferred through job queue). --Catrope
  • It should be possible to transclude only a section of an article, as Extension:Labeled Section Transclusion does. When using the API, there is a way to do this, using API:Parsing wikitext and defining the sections argument.
  • When a template is used by several distant wikis, it would be great to display an alert encouraging the administrators to protect this template, so that it is not modified too often.
  • The distant messages are not retrieved.
  • The retrieved templates look ugly when they rely on a specific style from the distant wiki's Common.css.

Done work

Add fields to the interwiki table

The former structure of the interwiki table was this one:

+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| iw_prefix | char(32)   | NO   | PRI |         |       |
| iw_url    | blob       | NO   |     |         |       |
| iw_local  | bool       | NO   |     |         |       |
| iw_trans  | tinyint(1) | NO   |     | 0       |       |
+-----------+------------+------+-----+---------+-------+

Here is the new structure I proposed for the interwiki table:

+-----------+------------+------+-----+---------+-------+
| Field     | Type       | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+-------+
| iw_prefix | char(32)   | NO   | PRI |         |       |
| iw_url    | blob       | NO   |     |         |       |
| iw_api    | blob       | NO   |     |         |       |
| iw_wikiid | char(64)   | NO   |     |         |       |
| iw_local  | bool       | NO   |     |         |       |
| iw_trans  | tinyint(1) | NO   |     | 0       |       |
+-----------+------------+------+-----+---------+-------+

So, my changes consisted of adding two optional fields:

  • iw_api: the URL of api.php of that wiki
  • iw_wikiid: the ID of that wiki (used in wfGetDB( DB_SLAVE, array(), $wikiID );)

Explanations

Currently, iw_trans allows the administrator to decide whether the templates from a particular wiki can be transcluded in the current wiki.

  • 0 will forbid this
  • 1 will allow this

With this structure, the software can allow transclusions in two different ways (using the API or using a direct DB access). When iw_trans is set to 1, the presence of iw_wikiid will indicate whether to use the DB access (iw_wikiid set) or the API (iw_wikiid not set).
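
As an illustration of the rule above, here is a minimal Python sketch (not the MediaWiki code). InterwikiRow and transclusion_backend are hypothetical names, and returning None when neither iw_wikiid nor iw_api is set is an assumption, since that case is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterwikiRow:
    """One row of the interwiki table; empty strings mean 'not set'."""
    iw_prefix: str
    iw_url: str
    iw_api: str = ""
    iw_wikiid: str = ""
    iw_trans: int = 0


def transclusion_backend(row: InterwikiRow) -> Optional[str]:
    """Return 'db', 'api' or None (transclusion not possible)."""
    if not row.iw_trans:
        return None          # transclusion from this wiki is forbidden
    if row.iw_wikiid:
        return "db"          # same wiki farm: read the foreign DB directly
    if row.iw_api:
        return "api"         # distant wiki: go through api.php
    return None              # assumption: nothing usable is configured
```

With this rule, a row that has both iw_wikiid and iw_api set uses the DB access, because iw_wikiid is checked first.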

Retrieve and cache the distant templates through the API

As explained before, the address of api.php of a foreign wiki can be stored in the interwiki table.

For a given interwiki prefix, if no wiki ID is given but an API address is defined, a transclusion call will retrieve the wanted wikitext through an API call, cache it in $wgMemc and return it to the parser. The key is wfMemcKey( 'iwtransclustiontext', 'textid', $interwiki, $title['title'] );

Moreover, if the wanted page is returned, the software will retrieve the list of all templates called by this wikitext (subtemplates). It will then determine which of them are not yet in the cache and cache them in $wgMemc, making API requests in groups of at most 50 templates.

This way, on the next loop of the parser, the subtemplates will be found in the cache and will not be requested.
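
The batching step can be sketched like this in Python. It is a simplified illustration, not the actual implementation: the cache is a plain dictionary instead of $wgMemc, and fetch_batch is a hypothetical callable standing in for one API request that returns a {title: wikitext} mapping.

```python
def chunks(items, size=50):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cache_missing_subtemplates(titles, cache, fetch_batch, batch_size=50):
    """Fetch and cache the subtemplates that are not cached yet.

    Returns the number of requests that were needed.
    """
    # Keep the first occurrence of each title and skip those already cached.
    missing = [t for t in dict.fromkeys(titles) if t not in cache]
    requests = 0
    for batch in chunks(missing, batch_size):
        cache.update(fetch_batch(batch))  # one request per group of <= 50
        requests += 1
    return requests
```

For example, 120 subtemplates of which 10 are already cached lead to 3 requests (50 + 50 + 10 titles).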

Retrieve and cache the distant templates through DB access

This is the case of the wiki farms. The interwiki table contains the wikiid of the foreign wiki.

In this case, the most efficient solution is accessing directly the wanted wikitext by reading in the database of the foreign wiki. The software can access the DB inside the wiki farm with:

$dbr = wfGetDB( DB_SLAVE, array(), $wikiID );

When a distant template is called, the software retrieves the corresponding wikitext.

Accessing the distant DB is just as expensive as accessing the local one, so no separate cache of the wikitext is needed. We still need a globaltemplatelinks table, in order to invalidate the cache of the pages that call a template when this template is edited.

Create a globaltemplatelinks table

Inside a wiki farm, it's quite easy to automagically purge the pages that use a distant template when this template is edited. We want to:

  • track the use of each template (know which distant pages are using it and update when pages are edited, deleted or moved)
  • when a template (or a subtemplate of this template!) is edited or deleted, invalidate the cache of those pages by updating page_touched in the page table

Inside a wiki farm, the transclusion links between the local pages and the distant pages would be stored in a shared DB, so that the distant wiki always knows which other wikis of the farm are using its templates and each page of each wiki knows which distant templates it uses.

The "calling" wikis could write in a "globaltemplatelinks table" on the shared database to store their usage of the templates. When a distant template is edited, the distant wiki will look in the globaltemplatelinks table to see who is transcluding the template. Then, it will access the DB of the calling wiki and invalidate the cache of the concerned pages.

When a calling page is edited/moved/created/deleted/undeleted, it will update the globaltemplatelinks table to reflect the change.
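
The bookkeeping described in this section can be sketched with an in-memory SQLite database. This is a simplified Python illustration under stated assumptions: the table only keeps four columns, the target is identified by a single prefix column plus a title, and the function names are hypothetical. The real cache invalidation (updating page_touched in the calling wiki's page table) is left out.

```python
import sqlite3


def make_shared_db():
    """Create an in-memory stand-in for the shared database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE globaltemplatelinks (
               gtl_from_wiki TEXT NOT NULL,
               gtl_from_page INTEGER NOT NULL,
               gtl_to_prefix TEXT NOT NULL,
               gtl_to_title TEXT NOT NULL,
               UNIQUE (gtl_from_wiki, gtl_from_page, gtl_to_prefix, gtl_to_title)
           )"""
    )
    return db


def record_links(db, from_wiki, from_page, targets):
    """Called when a calling page is saved or deleted: replace its rows."""
    db.execute(
        "DELETE FROM globaltemplatelinks"
        " WHERE gtl_from_wiki = ? AND gtl_from_page = ?",
        (from_wiki, from_page),
    )
    db.executemany(
        "INSERT INTO globaltemplatelinks VALUES (?, ?, ?, ?)",
        [(from_wiki, from_page, prefix, title) for prefix, title in targets],
    )


def pages_to_invalidate(db, to_prefix, to_title):
    """Called when a template is edited: list the (wiki, page ID) callers."""
    return db.execute(
        "SELECT gtl_from_wiki, gtl_from_page FROM globaltemplatelinks"
        " WHERE gtl_to_prefix = ? AND gtl_to_title = ?"
        " ORDER BY gtl_from_wiki, gtl_from_page",
        (to_prefix, to_title),
    ).fetchall()
```

The edited wiki would then loop over the returned pairs, open the DB of each calling wiki and update page_touched for the listed page IDs.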

The disadvantage of this approach is that each wiki must know:

  • its own wikiid (used by the distant wikis to access its DB)
  • the wikiid of all the wikis allowed to transclude its templates (in order to access their DBs for the cache invalidation job)

In the case of WMF's wikis, this could be solved by using a nice interwiki prefixes system, with a unique prefix for each wiki (enwikisource instead of en:s or s:en, frwikipedia instead of fr:w or w:fr...), at least for interwiki transclusion.

Proposed structure for the globaltemplatelinks table

As in the templatelinks table, "from" designates the page that contains the link and "to" designates the page the link points to.

For the calling page, we need to store its wiki ID, its page ID (for cache invalidation) and its full title (namespace text and title, for display). We also need to store the wiki ID of the page pointed to.

Proposed schema
-- Table tracking interwiki transclusions in the spirit of templatelinks.
-- This table tracks transclusions of this wiki's templates on another wiki
-- The gtl_from_* fields describe the (remote) page the template is transcluded from
-- The gtl_to_* fields describe the (local) template being transcluded
CREATE TABLE /*_*/globaltemplatelinks (
  -- The wiki ID of the remote wiki
  gtl_from_wiki varchar(64) NOT NULL,

  -- The page ID of the calling page on the remote wiki
  gtl_from_page int unsigned NOT NULL,

  -- The namespace name of the calling page on the remote wiki
  -- Needed for display purposes, since the foreign namespace ID doesn't necessarily match a local one
  gtl_from_namespace varchar(255) NOT NULL,

  -- The title of the calling page on the remote wiki
  -- Needed for display purposes
  gtl_from_title varchar(255) binary NOT NULL,

  -- The interwiki prefix of the transcluded page
  gtl_to_prefix varchar(32) NOT NULL,

  -- The namespace ID of the transcluded page on that wiki
  gtl_to_namespace int NOT NULL,

  -- The namespace name of the transcluded page
  -- Needed for display purposes, since the local namespace ID doesn't necessarily match a distant one
  gtl_to_namespacetext varchar(255) NOT NULL,

  -- The title of the transcluded page on that wiki
  gtl_to_title varchar(255) binary NOT NULL
) /*$wgDBTableOptions*/;
-- The indexes use gtl_to_prefix, the column defined above (no gtl_to_wiki column exists in this table)
CREATE UNIQUE INDEX /*i*/gtl_to_from ON /*_*/globaltemplatelinks (gtl_to_prefix, gtl_to_namespace, gtl_to_title, gtl_from_wiki, gtl_from_page);
CREATE UNIQUE INDEX /*i*/gtl_from_to ON /*_*/globaltemplatelinks (gtl_from_wiki, gtl_from_page, gtl_to_prefix, gtl_to_namespace, gtl_to_title);

New options for LocalSettings.php

Those new options have been documented in DefaultSettings.php.
  • $wgEnableScaryTranscluding is not used anymore
  • $wgEnableInterwikiTranscluding can be set to true (to enable) or to false (to disable) any interwiki transcluding
  • $wgEnableInterwikiTemplatesTracking can be set to true (to enable) or to false (to disable) the use of a global template links table to store the interwiki template links (see globaltemplatelinks table above)
  • $wgGlobalDatabase should contain the wikiID of the database that hosts the globaltemplatelinks table, when $wgEnableInterwikiTemplatesTracking is set to true

Usage tracking

When you edit an article, the distant templates used in it are displayed below the edit box, in a separate list, so, they are not mixed with the local templates.

To display all the distant uses of a local template, you can browse Special:GlobalTemplateUsage and type its name in the box. This code is copied and adapted from Extension:GlobalUsage, which (in my opinion) should be built-in. We should have a GlobalUsage class that deals with the distant usage of both files and templates. This could be done quite easily if GlobalUsage were built-in (or turned into a more flexible extension providing a base class that is extended by GlobalFileUsage and GlobalTemplateUsage).

Final behavior

No parsing is done by the foreign wiki: all of it is done by the local wiki.

This way, the behavior of the interwiki transclusion is quite simple:

  • everything is parsed locally, exactly like the local templates
  • except that the templates and their subtemplates are the distant ones
  • the parameters of those templates are interpreted as on the local wiki, which means that the local templates will be used if they are present in the arguments of the distant templates
  • the links are interpreted as local links, pointing to local pages
Example

On the local wiki, Template:Foo contains "Hello world!".

On the foreign wiki, Template:Bar contains "{{{param}}} Hi!"

Then, an article of the local wiki calling {{foreignid:Bar|param={{Foo}}}} will produce "Hello world! Hi!"
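
The example above can be reproduced with a toy expander, written here in Python. It is only a sketch of the behavior, not the MediaWiki parser: it handles named parameters only, treats any name containing a colon as "prefix:title" on a foreign wiki, does not guard against templates that call themselves, and expand, local and foreign are hypothetical names.

```python
import re

TEMPLATE = re.compile(r"\{\{([^{}]*)\}\}")     # an innermost {{...}} call
PARAM = re.compile(r"\{\{\{([^{}|]*)\}\}\}")   # a {{{name}}} placeholder


def expand(text, local, foreign):
    """Expand template calls locally; only the wikitext may be foreign.

    local maps template names to wikitext; foreign maps an interwiki
    prefix to such a mapping for the distant wiki.
    """
    def call(match):
        name, *raw_args = match.group(1).split("|")
        args = dict(a.split("=", 1) for a in raw_args if "=" in a)
        prefix, _, title = name.partition(":")
        source = foreign.get(prefix, {}) if title else local
        body = source.get(title or name, "")
        # Substitute the (already expanded) arguments into the body.
        return PARAM.sub(lambda p: args.get(p.group(1), ""), body)

    previous = None
    while previous != text:  # innermost calls first, until nothing changes
        previous, text = text, TEMPLATE.sub(call, text)
    return text
```

Because the innermost call is expanded first, {{Foo}} is resolved against the local templates before its result is passed to the distant Template:Bar.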

Running prototype

My mentor, User:Catrope, has set up three test wikis to test iwtransclusion on:

Anybody should feel free to test it and provide feedback on the discussion page of this page or on wikitech-l.

TODO

Create a globalnamespaces table

This is a simplification for the globaltemplatelinks table: instead of storing the distant namespace number + namespace text for each link (which is redundant), we just store the namespace number in the globaltemplatelinks table and the namespace text in this new table, for each namespace of each distant wiki.

Proposed schema
-- Table listing distant wiki namespace texts.
CREATE TABLE /*_*/globalnamespaces (
  -- The wiki ID of the remote wiki
  gn_wiki varchar(64) NOT NULL,

  -- The namespace ID of the transcluded page on that wiki
  gn_namespace int NOT NULL,

  -- The namespace text of transcluded page
  -- Needed for display purposes, since the local namespace ID doesn't necessarily match a distant one
  gn_namespacetext varchar(255) NOT NULL

) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/gn_index ON /*_*/globalnamespaces (gn_wiki, gn_namespace, gn_namespacetext);

Create a globalinterwiki table

The code for reasonably efficient interwiki transclusion often needs to make a link between a wikiid and the corresponding interwiki prefix. It is very easy to get the wiki ID of a given interwiki prefix, because it is an attribute of the Interwiki class. However, there is currently no way to get the interwiki prefix corresponding to a given wiki ID. This problem should be solved by this table.

Isn't that solved now that iw_wikiid is/will be in the interwiki table? Krinkle 00:55, 5 February 2011 (UTC)
No: with the current MediaWiki code, you can create an instance of the Interwiki class and then read its wiki ID. However, you cannot create an instance of the Interwiki class knowing only its wiki ID and not its iw_interwiki. That's why we need an association table (and a global one for other reasons)... We cannot use the interwiki table for this, because it is not global. Peter17 22:19, 10 February 2011 (UTC)
Proposed schema
-- Table associating distant wiki IDs with their interwiki prefixes.
CREATE TABLE /*_*/globalinterwiki (
  -- The wiki ID of the wiki
  giw_wikiid varchar(64) NOT NULL,

  -- The interwiki prefix of that wiki
  giw_prefix varchar(32) NOT NULL

) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/giw_index ON /*_*/globalinterwiki (giw_wikiid, giw_prefix);
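
The reverse lookup this table is meant to provide could look like the following Python sketch, using an in-memory SQLite database. The function names are hypothetical, and picking the alphabetically first prefix when a wiki has several is an assumption made only to keep the result deterministic.

```python
import sqlite3


def make_globalinterwiki(rows):
    """Create an in-memory globalinterwiki table from (wiki ID, prefix) pairs."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE globalinterwiki (
               giw_wikiid TEXT NOT NULL,
               giw_prefix TEXT NOT NULL,
               UNIQUE (giw_wikiid, giw_prefix)
           )"""
    )
    db.executemany("INSERT INTO globalinterwiki VALUES (?, ?)", rows)
    return db


def prefix_for_wiki(db, wiki_id):
    """Return an interwiki prefix for the given wiki ID, or None if unknown."""
    row = db.execute(
        "SELECT giw_prefix FROM globalinterwiki"
        " WHERE giw_wikiid = ? ORDER BY giw_prefix LIMIT 1",
        (wiki_id,),
    ).fetchone()
    return row[0] if row else None
```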