Extension:External Data/Web pages

The External Data extension can be used to extract data from pages or documents on the web in a variety of formats, including CSV, GFF, HTML, INI, JSON, XML and YAML. This retrieval can either be done directly, or, if necessary, using the SOAP protocol.

As of version 3.2, the recommended way to retrieve web data is to use one of the display functions (#external_value, #for_external_table, etc.), passing in the necessary parameters for the data retrieval, most notably either "url=" or "source=". You can also retrieve web data by calling the #get_web_data or #get_soap_data functions, or (for version 3.0 and higher) #get_external_data.

For any of these parser functions, you can also call its corresponding Lua function.

Usage

The following parameters are specific to retrieving web data:

  • |url= - the data source URL
  • |data= - holds the "mappings" that connect local variable names to external variable names. Each mapping (of the form local_variable_name=external_variable_name) is separated by a comma. External variable names are the names of the values in the file (in the case of a header-less CSV file, the names are simply the indexes of the values: 1, 2, 3, etc.), and local variable names are the names that are later passed in to #external_value.
    If the value __all is passed in, then all external variables (if there are any) will be mapped to local variables of the same name; the names are converted to lowercase if field names are case-insensitive in the format being used.
    Unless one of the options use xpath or use jsonpath is set, the parameter data can be omitted altogether: the effect will be the same as setting data=__all.
    Additionally, some "special variables" will be set as well; see Special variables.
  • |filters= - sets filtering on the set of rows being returned. You can set any number of filters, separated by commas; each filter sets a specific value for a specific external variable. It is not necessary to use any filters; most APIs, it is expected, will provide their own filtering ability through the URL's query string.
  • |post data= - an optional parameter that lets you send some set of data to the URL via POST, instead of via the query string.
  • |archive path= - path within the archive, if the file is a .zip, .rar, .tar, .tar.bz2 or .tar.gz archive. Can be a mask.
  • |archive depth= - depth of archive iteration (default is 2)
  • |suppress error= - an optional parameter that prevents any error message from getting displayed if there is a problem retrieving the data.
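
A minimal sketch of how these parameters combine, assuming a hypothetical CSV file at https://example.org/countries.csv whose header row contains the columns Country, Capital and Population (the URL and the column names are illustrative and do not refer to a real data source):

```wikitext
<!-- Hypothetical URL and column names -->
{{#get_web_data:url=https://example.org/countries.csv
|format=CSV with header
|data=country=Country,capital=Capital,population=Population
|filters=Country=Guatemala
}}
The capital of {{#external_value:country}} is {{#external_value:capital}}.
```

Here country, capital and population are the local variable names later passed to #external_value, and the filter keeps only the rows whose Country value is Guatemala.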

The following parameters should be set if the data is being retrieved via the SOAP protocol, instead of HTTP (for example, if #get_soap_data is being used instead of #get_web_data):

  • |request= - the function used to request data
  • |requestData= - parameter1=value1, etc.
  • |response= - the function used to retrieve data

In addition, standard parameters such as |data= can be used, and all of the parameters related to the parsing of data (|format=, |delimiter=, |use xpath=, etc.) can be used as well; see Parsing data.
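
As a rough sketch, a SOAP call might look like the following. The service URL, the function names GetForecast and GetForecastResult, the request parameters and the returned field names are all hypothetical, and the format of the response is assumed to be JSON:

```wikitext
<!-- Hypothetical SOAP service, function names and fields -->
{{#get_soap_data:url=https://example.org/weather?wsdl
|request=GetForecast
|requestData=city=Guatemala City,days=3
|response=GetForecastResult
|format=JSON
|data=temperature=temp,summary=summary
}}
Forecast: {{#external_value:summary}} ({{#external_value:temperature}})
```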

The parameters |cache seconds= and |use stale cache= can also be used; for information on these parameters (and on caching in general), see Caching data.

More than one #get_web_data call can be used in a page. If this happens, though, make sure that every local variable name is unique.
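
For instance, two calls on the same page could be kept apart as sketched below. Both URLs and the column names are hypothetical; the point is only that the local variable names on the left side of each mapping differ, even though both files use the same external column names:

```wikitext
<!-- Hypothetical URLs and columns; the local variable names are distinct -->
{{#get_web_data:url=https://example.org/countries.csv
|format=CSV with header
|data=country_name=Name,country_population=Population
}}
{{#get_web_data:url=https://example.org/cities.csv
|format=CSV with header
|data=city_name=Name,city_population=Population
}}
```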

Getting data from a MediaWiki page or file

If the data you wish to access is on a MediaWiki page or in an uploaded file, you can use the above methods to retrieve it, provided that the page or file contains only data in one of the supported formats:

  • for data on a wiki page, use "&action=raw" as part of the URL;
  • for data in an uploaded file, use the full path.

If the MediaWiki page with the data is on the same wiki, it is best to use the fullurl: parser function, e.g.

  • {{fullurl:Test/test.csv|action=raw}}

Similarly, for uploaded files, you can use the filepath: function, e.g.

  • {{filepath:xyzzy.csv}}
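
Put together, a call reading a wiki page on the same wiki might look like this sketch. It assumes that Test/test.csv holds header-less CSV, so the external variable names are the column indexes; the local names first and second are arbitrary:

```wikitext
<!-- Assumes Test/test.csv contains header-less CSV with at least two columns -->
{{#get_web_data:url={{fullurl:Test/test.csv|action=raw}}
|format=CSV
|data=first=1,second=2
}}
{{#external_value:first}}, {{#external_value:second}}
```

For an uploaded file, the same call could be used with url={{filepath:xyzzy.csv}} instead.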

For wiki pages that have additional information, the External Data extension provides a way to create an API of your own, at least for CSV data. To get this working, first place the data you want accessed in its own wiki page, in CSV format, with the headers as the top row of data (see here for an example). Then, the special page 'GetData' will provide an "instant API" for accessing either certain rows of that data, or the entire table. By adding "field-name=value" to the URL, you can limit the set of rows returned.

A URL for the 'GetData' page can then be used in a call to #get_web_data, just as any other data URL would be; the data will be returned as a CSV file with a header row, so the 'format' parameter of #get_web_data should be set to 'CSV with header'. See here for an example of such data being retrieved and displayed using #get_web_data and #for_external_table. In this way, you can use any table-based data within your wiki without the need for custom programming.
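
A sketch of such a call follows. It assumes that the data page is named Countries CSV, that it is addressed as a subpage of Special:GetData, and that it has the columns Country, City and Population; all of these names are hypothetical:

```wikitext
<!-- Hypothetical page name and columns; the subpage URL form is an assumption -->
{{#get_web_data:url={{fullurl:Special:GetData/Countries CSV|Country=Guatemala}}
|format=CSV with header
|data=city=City,population=Population
}}
{{#for_external_table:<nowiki/>
* {{{city}}}: {{{population}}}
}}
```

The Country=Guatemala part of the URL limits the rows returned by 'GetData', and #for_external_table then prints one list item per remaining row.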

String replacement in URLs

One or more of the URLs you use may contain a string that you would prefer to keep secret, like an API key. If that's the case, you can use the field 'replacements' of the relevant data source to specify a dummy string you can use in its place. For instance, let's say you want to access the URL "http://worlddata.com/api?country=Guatemala&key=123abcd", but you don't want anyone to know your API key. You can add the following to your LocalSettings.php file, after the inclusion of External Data:

// This replacement will be done only in the URL http://www.worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY:
$wgExternalDataSources['http://www.worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY']['replacements'] = [
    'WORLDDATA_KEY'=> '123abcd'
];
// This replacement will be done only in the URLs with the host www.worlddata.com:
$wgExternalDataSources['www.worlddata.com']['replacements'] = [
    'WORLDDATA_KEY'=> '123abcd'
];
// This replacement will be done only in the URLs with the second level domain worlddata.com:
$wgExternalDataSources['worlddata.com']['replacements'] = [
    'WORLDDATA_KEY'=> '123abcd'
];
// This replacement will be done in any URL. Not recommended, as it can leak the API key to a site controlled by an attacker:
$wgExternalDataSources['*']['replacements'] = [
    'WORLDDATA_KEY'=> '123abcd'
];

Then, in your call to #get_web_data, you can replace the real URL with: "http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY".
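
The wiki page would then contain only the dummy string, roughly as in the sketch below. The JSON format and the field name population are assumptions about what this illustrative API returns:

```wikitext
<!-- The response format and the field name are hypothetical -->
{{#get_web_data:url=http://worlddata.com/api?country=Guatemala&key=WORLDDATA_KEY
|format=JSON
|data=population=population
}}
Population: {{#external_value:population}}
```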

Whitelist for URLs

You can create a "whitelist" for URLs accessed by External Data: in other words, a list of allowed URLs or domains, so that only URLs matching that list can be accessed.

As with other extension settings, there can be a common whitelist or a whitelist for a host or second level domain (effectively blacklisting the whole host or domain except the whitelisted URLs).

To create a whitelist with one URL, add the following to LocalSettings.php:

// A whitelist for www.example.org:
$wgExternalDataSources['www.example.org']['allowed urls'] = 'http://www.example.org/good.csv';

// A whitelist for all subdomains of example.org:
$wgExternalDataSources['example.org']['allowed urls'] = 'http://www.example.org/good.csv';

// A global whitelist:
$wgExternalDataSources['*']['allowed urls'] = 'http://www.example.org/good.csv';

To create a whitelist with multiple URLs:

// A whitelist for www.example.org:
$wgExternalDataSources['www.example.org']['allowed urls'] = [
    'http://www.example.org/good.csv',
    'http://www.example.org/even_better.csv'
];

HTTP options

By default, External Data allows for HTTPS-based wikis to access plain HTTP URLs, and vice versa, without the need for certificates (see Transport Layer Security on Wikipedia for a full explanation). If you want to require the presence of a certificate, add the following to LocalSettings.php:

$wgExternalDataSources['*']['allow ssl'] = false;

Additionally, the setting 'options' lets you set a number of other HTTP-related settings. It is an array that can take in any of the following keys:

  • timeout - how many seconds to wait for a response from the server (default is 'default', which corresponds to the value of $wgHTTPTimeout, which by default is 25)
  • sslVerifyCert - whether to verify the SSL certificate, if retrieving an HTTPS URL (default is false)
  • followRedirects - whether to retrieve another URL if the specified URL redirects to it (default is false)

So, for instance, if you want to verify the SSL certificate of any URL being accessed by External Data, you would add the following to LocalSettings.php:

$wgExternalDataSources['*']['options']['sslVerifyCert'] = true;

As with other settings, the global settings (data source '*') can be overridden with the specific settings for a URL, host or second level domain.

ExternalDataBeforeWebCall hook

The ExternalDataBeforeWebCall hook can be used to alter the HTTP request options, alter the URL, make any preparations for data retrieval (such as a complex authentication procedure), or abort data retrieval.

Example:

$wgHooks['ExternalDataBeforeWebCall'][] =
function ( string $method, string &$url, array &$options, array &$errors ): bool {
	// Run the code below only for a certain URL; let every other request proceed:
	if ( !( $method === 'get' && $url === 'https://example.net/path' ) ) {
		return true;
	}
	// ...
	// Correct URL beyond merely replacing some part of it:
	$url = some_function( $url );
	// ...
	// Add some option that can be set only at run time, not in $wgExternalDataSources:
	$options['headers']['Some header'] = some_other_function( /*...*/ );
	// ...
	// To abort instead, add an error message (it is displayed if the hook returns false)
	// and return false explicitly to prevent the actual HTTP request:
	// $errors[] = "No, I will not get $url";
	// return false;
	// ...
	// Return anything except false to proceed with the request
	// (the bool return type declared above requires an explicit value):
	return true;
};

Examples

You can see some example calls to #get_web_data, featuring real-world data sources, at the Examples page.