Extension:JsonConfig/Tabular

This page in a nutshell: This page is primarily of historic interest (and in many respects outdated). It describes the implementation of the DataNamespace proposal using JsonConfig's Tabular content support. See Help:Tabular_Data for help with the current implementation!

Tracked in Phabricator
Task T137929

Tabular content is machine-readable data, similar to the CSV and TSV formats. It allows any user to create a page, e.g. "Data:List of interesting facts.tab" (demo [dead link]), and keep it as a table rather than as wikitext. Tabular storage supports strings, numbers, booleans (true/false), and "localized strings" – strings that have a different value depending on the language. Eventually, it would be good to also implement Q number support, allowing direct links to Wikidata.

Additionally, tabular data can store metadata, such as localized description and data source. More metadata can be added as needed.
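As an illustration of the cell types and metadata described above, here is a minimal sketch in PHP. The layout is an assumption made for this example only: the keys "headers", "types", and "rows", and the helper functions cellMatchesType() and findTypeErrors(), are hypothetical and are not taken from JsonConfig. Only the "info" and "source" metadata names and the four value kinds come from this page.

```php
<?php
// Hypothetical in-memory layout of a tabular page (not the real storage format).
$page = [
	'info' => [ 'en' => 'Interesting facts', 'fr' => 'Faits intéressants' ], // localized description
	'source' => 'Made-up sample values', // data source
	'headers' => [ 'id', 'count', 'verified', 'label' ],
	'types' => [ 'string', 'number', 'boolean', 'localized' ],
	'rows' => [
		[ 'a', 1, true, [ 'en' => 'first' ] ],
		[ 'b', 'oops', false, [ 'en' => 'second', 'fr' => 'deuxième' ] ],
	],
];

// Check a single cell against a declared column type.
function cellMatchesType( $cell, string $type ): bool {
	switch ( $type ) {
		case 'string':
			return is_string( $cell );
		case 'number':
			return is_int( $cell ) || is_float( $cell );
		case 'boolean':
			return is_bool( $cell );
		case 'localized':
			// Assumed shape: non-empty map of language code => text.
			if ( !is_array( $cell ) || $cell === [] ) {
				return false;
			}
			foreach ( $cell as $lang => $text ) {
				if ( !is_string( $lang ) || !is_string( $text ) ) {
					return false;
				}
			}
			return true;
		default:
			return false;
	}
}

// Collect a description of every cell that does not match its column type.
function findTypeErrors( array $page ): array {
	$errors = [];
	foreach ( $page['rows'] as $r => $row ) {
		foreach ( $page['types'] as $c => $type ) {
			if ( !array_key_exists( $c, $row ) || !cellMatchesType( $row[$c], $type ) ) {
				$errors[] = "row $r, column {$page['headers'][$c]}";
			}
		}
	}
	return $errors;
}

$errors = findTypeErrors( $page );
```

In this sample the second row stores a string in the numeric "count" column, so $errors contains one entry naming that cell.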

Tabular storage greatly simplifies storing data for lists, tables, and graphs. Graphs may access tabular data directly, and on-wiki tables and lists can be created with simple Lua scripts. This storage is fundamentally different from Wikidata, because it works with "blobs" (batches) of data, whereas Wikidata works with tiny "facts". Wikidata technology is simply not suited for storing large data sets such as the list of the most expensive paintings, the shoe size comparison table, or the data needed to plot a Moscow subway growth graph.

After a long discussion, it seemed Commons would be the best fit for such data, and over 70% of the Commons community supported hosting tables on Commons. The Commons community is already experienced with international multi-licensed content. Feel free to experiment with it at http://data.wmflabs.org/wiki/Data:Sample.tab [dead link].

Note that you can view it in different languages, e.g. http://data.wmflabs.org/wiki/Data:Sample.tab?uselang=fr [dead link]

Usage

All tabular data will be stored in the Data namespace on Commons, with a ".tab" page title suffix, e.g., Data:Example.tab

The data will be accessible from all other wikis by:

<graph>, which will be able to use the data directly via "tabular://Example.tab" (see Extension:Graph/Guide#External Data); the Data namespace prefix is not needed.
- Is ~~wiki~~tabular a good name? Alternatives: wikitabdata, wikitab, ...?
- Should the name used in the URL include the page title's suffix?
Lua scripts in Scribunto modules, via the mw.ext.data.get("Example.tab") function. The data will be returned as the parsed JSON of the raw page content, so a Scribunto module will also be able to access all other metadata fields. This function is not specific to tabular data. We might also want to introduce mw.ext.data.getTabularData() to get data with localized strings resolved for a specific language.
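
To make the idea of "localized strings resolved for a specific language" concrete, here is a minimal sketch in PHP. It is not the proposed mw.ext.data.getTabularData() function: the helpers resolveLocalized() and resolveRows() are hypothetical, and the sketch assumes that a localized string is a map of language code to text and that English is the fallback language.

```php
<?php
// Hypothetical helper: pick the text for $lang from a localized string.
// Assumes a localized string looks like [ 'en' => 'answer', 'fr' => 'réponse' ].
function resolveLocalized( $value, string $lang, string $fallback = 'en' ) {
	if ( !is_array( $value ) ) {
		// Plain string, number, or boolean: nothing to resolve.
		return $value;
	}
	if ( isset( $value[$lang] ) ) {
		return $value[$lang];
	}
	// Fall back to the assumed default language, or null if that is missing too.
	return $value[$fallback] ?? null;
}

// Resolve every cell of every row for the requested language.
function resolveRows( array $rows, string $lang ): array {
	return array_map(
		function ( array $row ) use ( $lang ) {
			return array_map(
				function ( $cell ) use ( $lang ) {
					return resolveLocalized( $cell, $lang );
				},
				$row
			);
		},
		$rows
	);
}

$rows = [
	[ 'Q1', 42, true, [ 'en' => 'answer', 'fr' => 'réponse' ] ],
	[ 'Q2', 7, false, [ 'en' => 'lucky' ] ],
];
$resolved = resolveRows( $rows, 'fr' );
```

With 'fr' requested, the first row resolves to the French text, while the second row has no French value and falls back to English. Non-localized cells pass through unchanged.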

To access data directly on a wiki page, you can import (if you don't already have them) the tabular data module (which requires the navbar module) and, optionally, the tabular query template (which requires the aforementioned tabular data module). With these tools you can easily get a single cell's value.

Documentation

See Help:Tabular Data and Help:Map Data for in-depth field descriptions.

Future Plans

CSV and TSV copy/paste form to simplify transferring data to and from external spreadsheets.
An in-place spreadsheet table editor
Wikidata support, allowing direct links to localized Wikidata content

TBD / Questions / Ideas

Licenses - if requested per the licensing discussion, how should the license be stored so that it is machine-readable and avoids untranslatable, unparsable free text?
- We may choose to deploy without license support (public domain only), and later add licensing capability. Yurik (talk) 18:20, 30 April 2016 (UTC)
What metadata is needed? Current proposal has "source" (string) and "info" (localized string), but we might need more.
- Support for specifying data source(s). This is to avoid "WW3" over which data is "right" or the truth. Some frequently used sources might need a shortcut, while other sources will be used less often.

Is it enough to have one source for the whole table, or should we introduce a new data type called "source" to allow per-row sourcing? Per-row sourcing could of course be added later. Ideally, we should also support multiple references, just like Wikidata. It would also be good to have multiple pairs of "source type" and "source value", similar to Wikidata's property->value structure. Yurik (talk) 18:20, 30 April 2016 (UTC)

Cross-datacenter cache invalidation - JsonConfig supports remote cache invalidation, but it uses a MediaWiki API call for that. Which server should Commons contact to notify it of a data change?

Internal cross-wiki usage

Cross-wiki data usage is based on the existing JsonConfig mechanisms that have been in production use (Wikipedia Zero) for the past few years. JsonConfig supports multiple content handlers and can easily be used for a cross-wiki shared data namespace.

The JCSingleton::getContent() implementation gets the content for a given page title, even if that page is located on another wiki. It first checks whether the content is stored in memcached (JCCache::get()). The memcached key is not wiki-specific, which allows different wikis to share the same content object. In case of a cache miss, the page is loaded locally (when the current wiki is the storage wiki for that title) or remotely via a query API call, and then cached.

When the page changes (JCSingleton::onArticleChangeComplete), memcached is updated with the new value, and optionally an API call is made to a remote server to notify it that its cache should be updated. This could help with cross-datacenter cache invalidation.
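
The lookup and update flow described above can be sketched as follows. This is a simplified stand-in, not the JsonConfig code: the class SharedContentStore and its methods are hypothetical, an ArrayObject plays the role of memcached, the loaders are plain callbacks instead of real page loads or API requests, and the optional remote notification call is left out.

```php
<?php
// Hypothetical sketch of a cache shared between wikis.
class SharedContentStore {
	private $cache;         // stand-in for memcached, shared by all stores
	private $isStorageWiki; // true on the wiki that stores the pages
	private $loadLocal;     // callback: load a page from the local wiki
	private $loadRemote;    // callback: fetch a page from the storage wiki

	public function __construct(
		ArrayObject $cache, bool $isStorageWiki, callable $loadLocal, callable $loadRemote
	) {
		$this->cache = $cache;
		$this->isStorageWiki = $isStorageWiki;
		$this->loadLocal = $loadLocal;
		$this->loadRemote = $loadRemote;
	}

	// The key contains no wiki ID, so every wiki reads the same entry.
	private function key( string $title ): string {
		return 'JsonConfig:' . $title;
	}

	public function getContent( string $title ) {
		$key = $this->key( $title );
		if ( $this->cache->offsetExists( $key ) ) {
			return $this->cache->offsetGet( $key ); // cache hit
		}
		// Cache miss: load locally on the storage wiki, remotely elsewhere.
		$loader = $this->isStorageWiki ? $this->loadLocal : $this->loadRemote;
		$content = call_user_func( $loader, $title );
		$this->cache->offsetSet( $key, $content );
		return $content;
	}

	// Called when a page changes: overwrite the shared cache entry.
	public function onChange( string $title, $newContent ): void {
		$this->cache->offsetSet( $this->key( $title ), $newContent );
	}
}

$memcached = new ArrayObject();
$remoteCalls = 0;
$unused = function ( $title ) {
	throw new LogicException( 'loader should not be called' );
};
$commons = new SharedContentStore( $memcached, true, $unused, $unused );
$otherWiki = new SharedContentStore(
	$memcached, false, $unused,
	function ( $title ) use ( &$remoteCalls ) {
		$remoteCalls++;
		return "remote:$title";
	}
);

$first = $otherWiki->getContent( 'Example.tab' );  // miss, fetched remotely
$second = $otherWiki->getContent( 'Example.tab' ); // served from the cache
$commons->onChange( 'Example.tab', 'edited' );     // edit on the storage wiki
$third = $otherWiki->getContent( 'Example.tab' );  // sees the updated value
```

Because both stores share one cache and one key scheme, the second read does not trigger another remote fetch, and the edit made on the storage wiki is visible to the other wiki on its next read.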

Configuration

JsonConfig uses a very flexible (and somewhat complicated) settings system. Both the Commons wiki and all other wikis will need this code block to set up cross-wiki shareable storage:

$wgJsonConfigModels['Tabular.JsonConfig'] = 'JsonConfig\JCTabularContent';
$wgJsonConfigs['Tabular.JsonConfig'] = array(
	'namespace' => 486, // === NS_DATA, but the constant is not defined yet
	'nsName' => 'Data',
	'isLocal' => false,
	'pattern' => '/.\.tab$/'
);
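
To see which page titles the 'pattern' value above would select, the regular expression can be tried on its own. This is a small sketch that assumes the pattern is applied to the title text without the namespace prefix; the sample titles are made up.

```php
<?php
// The pattern from the configuration: any character, then a literal ".tab" at the end.
$pattern = '/.\.tab$/';

$titles = [ 'Example.tab', '.tab', 'Example.tab.bak', 'Example.map' ];
$matches = [];
foreach ( $titles as $title ) {
	// preg_match() returns 1 when the pattern matches and 0 when it does not.
	$matches[$title] = preg_match( $pattern, $title ) === 1;
}
```

Only "Example.tab" matches here. The leading dot in the pattern requires at least one character before the ".tab" suffix, so a title consisting of ".tab" alone is rejected, as are titles where ".tab" is not at the end.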

Commons wiki will need to specify that data should be stored locally:

$wgJsonConfigs['Tabular.JsonConfig']['store'] = true;

Other wikis will need to set how to access remote data:

$wgJsonConfigs['Tabular.JsonConfig']['remote'] = 'https://commons.wikimedia.org/w/api.php';
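
The three fragments above can also be combined into one block that is shared by all wikis. This is only a sketch: the helper tabularJsonConfig() and the $isCommons flag are hypothetical and would have to be set per wiki by whatever mechanism the site configuration already uses.

```php
<?php
// Hypothetical helper that builds the settings for either kind of wiki.
function tabularJsonConfig( bool $isStorageWiki ): array {
	$config = [
		'namespace' => 486, // NS_DATA
		'nsName' => 'Data',
		'isLocal' => false,
		'pattern' => '/.\.tab$/',
	];
	if ( $isStorageWiki ) {
		// The storage wiki (Commons) keeps the data locally.
		$config['store'] = true;
	} else {
		// Every other wiki reads the data through the Commons API.
		$config['remote'] = 'https://commons.wikimedia.org/w/api.php';
	}
	return $config;
}

// Assumption: this flag is true only on Commons.
$isCommons = false;

$wgJsonConfigModels['Tabular.JsonConfig'] = 'JsonConfig\JCTabularContent';
$wgJsonConfigs['Tabular.JsonConfig'] = tabularJsonConfig( $isCommons );
```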