Extension:Sofa

MediaWiki extensions manual
Sofa
Release status: beta
Implementation: Data extraction
Description: A reporting extension for MediaWiki
Author(s): Brian Wolff (Bawolff)
Latest version: 0.1
MediaWiki: 1.32+
Database changes: Yes
License: GNU General Public License 2.0 or later

Sofa is a reporting extension for MediaWiki inspired by the data model of CouchDB.

Motivation

The most common reporting extensions for MediaWiki are DPL (DynamicPageList), SMW (Semantic MediaWiki) and Cargo, plus the core category feature. Sofa is an attempt to solve the same problem as those extensions, but with different trade-offs.

Problems with current reporting solutions:

  • They interact badly with caching. You are either stuck with outdated results and manual purging, or you disable the cache and everything is slow.
  • Poor, unpredictable performance. User choices can lead to bad performance outcomes, and a typo can take down the site.
  • Bad scaling. As your wiki gets bigger, the reporting extensions get slower.
  • Complicated query language [not sure I really solve this]. With DPL, the ad hoc query language gets really complicated and hard to understand. Similarly, Cargo and SMW use query languages that can be confusing to users.

The goals of this extension are:

  • Allow extracting complex information. The extension is deliberately restricted in order to prevent the kinds of performance issues that other extensions suffer from, but it should still cover about 85% of their use cases, though perhaps by requiring users to think about problems in a different way.
  • The core unit of data for this extension is the "page" (this is similar to other extensions).
  • User actions should not be able to cause significant performance problems. Certainly not accidentally, and the extension should also be suitable for open wikis where users might be malicious.
  • Horizontal scaling. The data store should be easily shardable, allowing easy scaling by just adding more (inexpensive) servers instead of trying to make your DB server super powerful. It should work out of the box for a small wiki, and you should still be able to use it no matter how big you get.
  • Predictable performance. Cost should scale roughly linearly with the number of pages on your wiki, so sysadmins can easily predict scaling needs.
  • DB queries should touch a relatively small number of rows, resulting in very fast queries. As a trade-off, the overall DB will probably be much bigger (which does have performance implications). For a big wiki, you would probably want this backed by a different DB server than your main DB, or possibly sharded among several servers.
  • Have a simple primitive that users can build structure around, as opposed to a large primitive that is inflexible. DPLv2 has options for a lot of things, but it's hard to understand them all, and it's a bit brittle when something is not covered. Instead, we provide a very simple primitive to Lua, which users can build abstractions around to do whatever they want.
  • Aim to do more work at save time and job queue time than at view time. Most wikis have lots of views and far fewer edits. Wikis are very sensitive to view latency, but relatively insensitive to update latency when things aren't directly being viewed.

The proposed approach is to follow the data model that CouchDB uses (hence the name Sofa). See the CouchDB documentation for how the database that inspired this works. In particular, I envision something like this:

  • On a page, you can emit metadata. Each piece of metadata consists of a schema name, a key (a string value or an array of string values), and an optional JSON document value.
  • Cache invalidation just works. Results are cached. If you edit something, the cache of anything that uses it gets cleared, just like with templates.
  • You can also query this metadata from Lua, given a schema name, a range of keys, and a limit on the number of results. This returns all the metadata within the limit (up to some maximum), along with the associated page names and the JSON document values.
  • [Not yet implemented] You can also define aggregates, which are basically Lua functions that do a map-reduce computation on emitted metadata values. You can then query a range of values (or group by a key or part of a key) and get the aggregated value. This allows simple things like counting documents, as well as more complex calculations.
  • [Not yet implemented] You can define a default Lua module that emits metadata for all pages, so you can do non-existence queries.
  • [Maybe, not sure] Being able to do intersections of ranges would be cool, but the performance is probably questionable. This is a core feature that DPL is used for.
  • [Maybe, not sure] Having a special GPS key type and then doing geometric queries would be cool.
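
The emit-and-query model in the list above can be pictured with a small in-memory sketch. This is only an illustration in Python, not the extension's code: the class `SofaIndex`, its methods, the `MAX_LIMIT` cap and the example schema `birthyear` are all hypothetical, and keys are plain strings rather than arrays for simplicity.

```python
import bisect
from typing import Any, NamedTuple, Optional


class Row(NamedTuple):
    key: str    # emitted key (a single string in this sketch)
    page: str   # page that emitted the metadata
    value: Any  # optional JSON-like document


class SofaIndex:
    """Toy in-memory stand-in for the data store; all names are hypothetical."""

    MAX_LIMIT = 500  # assumed hard cap on results per query

    def __init__(self) -> None:
        # schema name -> (sorted list of keys, parallel list of rows)
        self._data = {}

    def emit(self, schema: str, key: str, page: str, value: Any = None) -> None:
        keys, rows = self._data.setdefault(schema, ([], []))
        pos = bisect.bisect_right(keys, key)  # keep both lists sorted by key
        keys.insert(pos, key)
        rows.insert(pos, Row(key, page, value))

    def query(self, schema: str, start: Optional[str] = None,
              stop: Optional[str] = None, limit: int = 50) -> list:
        keys, rows = self._data.get(schema, ([], []))
        limit = max(0, min(limit, self.MAX_LIMIT))
        lo = 0 if start is None else bisect.bisect_left(keys, start)
        hi = len(keys) if stop is None else bisect.bisect_right(keys, stop)
        # Only rows inside [start, stop] are touched, and never more than limit.
        return rows[lo:min(hi, lo + limit)]


index = SofaIndex()
index.emit("birthyear", "1879", "Albert Einstein", {"field": "physics"})
index.emit("birthyear", "1912", "Alan Turing", {"field": "computing"})
index.emit("birthyear", "1867", "Marie Curie", {"field": "physics"})
index.emit("birthyear", "1906", "Grace Hopper", {"field": "computing"})

hits = index.query("birthyear", start="1870", stop="1910", limit=10)
print([row.page for row in hits])  # ['Albert Einstein', 'Grace Hopper']
```

Because a query is just a bounded slice of a sorted index, its cost depends on the limit rather than on the size of the wiki, which is the property the goals above ask for.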

This approach sounds limited, and in comparison to the features of Cargo it is. However, CouchDB has proven that you can do a lot with this type of data model. The hope is that this is just enough to meet the needs of people who use this type of extension, while also allowing us to make something with predictable performance and scalability.
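
As a rough illustration of how much a map-reduce aggregate over emitted values can express, here is a minimal Python sketch. Aggregates are not implemented in the extension, so everything here is an assumption: the `aggregate` function, the example rows, and the `group_level` parameter (a name borrowed from CouchDB view queries) are hypothetical and do not describe Sofa's eventual interface.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical emitted rows: (key, page, value). Keys are arrays of strings,
# so results can be grouped by a leading part of the key.
rows = [
    (["fruit", "apple"], "Page A", {"stock": 3}),
    (["fruit", "pear"], "Page B", {"stock": 5}),
    (["veg", "leek"], "Page C", {"stock": 2}),
    (["fruit", "apple"], "Page D", {"stock": 4}),
]


def aggregate(rows, map_fn, reduce_fn, group_level):
    """Map each value, then reduce per group of the first group_level key parts."""
    groups = defaultdict(list)
    for key, _page, value in rows:
        groups[tuple(key[:group_level])].append(map_fn(value))
    return {group: reduce(reduce_fn, mapped) for group, mapped in groups.items()}


def add(a, b):
    return a + b


# Total stock per top-level category.
totals = aggregate(rows, lambda v: v["stock"], add, group_level=1)
# Number of documents per full key.
counts = aggregate(rows, lambda _v: 1, add, group_level=2)

print(totals)  # {('fruit',): 12, ('veg',): 2}
print(counts)  # {('fruit', 'apple'): 2, ('fruit', 'pear'): 1, ('veg', 'leek'): 1}
```

The same two small functions cover both counting and summing; only the map step and the grouping depth change.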

How it works

This is very much a proof of concept right now. Currently you can use {{#sofaset:schemaName|KeyName|Value}} to set a metadata property, and {{#sofaget:schema|...options...}} to query, where the options are limit=, start= and stop=. The cache invalidation should work. Lua primitives should come soon. Aggregates might not be implemented for a little while.
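
One way to picture the cache invalidation is as dependency tracking on key ranges: each page that runs a query records the schema and range it read, and a changed key purges every page whose recorded range contains it. The Python sketch below only illustrates that idea and is not how the extension is known to implement it. The `DependencyTracker` class and the page names are hypothetical, and it ignores the limit= option for simplicity.

```python
class DependencyTracker:
    """Toy model of range-based cache invalidation; all names are hypothetical."""

    def __init__(self):
        self._deps = []  # entries of (schema, start, stop, querying page)

    def record_query(self, page, schema, start=None, stop=None):
        # Called when `page` renders a query over [start, stop] in `schema`.
        self._deps.append((schema, start, stop, page))

    def pages_to_purge(self, schema, key):
        # Called when metadata with `key` is added, changed or removed.
        stale = set()
        for dep_schema, start, stop, page in self._deps:
            if dep_schema != schema:
                continue
            if start is not None and key < start:
                continue
            if stop is not None and key > stop:
                continue
            stale.add(page)
        return stale


tracker = DependencyTracker()
tracker.record_query("Report/1800s", "birthyear", start="1800", stop="1899")
tracker.record_query("Report/All", "birthyear")
tracker.record_query("Report/Cities", "city", start="A", stop="M")

print(sorted(tracker.pages_to_purge("birthyear", "1879")))
# ['Report/1800s', 'Report/All']
```

A linear scan is used here only to keep the sketch short; a real store would need an index over the recorded ranges.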

Install

  • Download and place the file(s) in a directory called Sofa in your extensions/ folder.
  • Add the following code at the bottom of your LocalSettings.php:
    wfLoadExtension( 'Sofa' );
    
  • Run the update script (maintenance/update.php), which will automatically create the database tables that this extension needs.
  • Done – Navigate to Special:Version on your wiki to verify that the extension is successfully installed.