User:Dantman/Abstract query language/old

  • Return a list of user objects
  • Query from the list of revisions
  • ?page is an unassigned variable that must be externally defined, expected to be a page, ie Title (the mapping of Revision:page to a page/Title is built into MediaWiki).
  • The page of a revision is defined by ?page
  • ?rev:user is ?u, so the User that is returned is extracted from there
  • Order by the largest timestamp of the revisions for the distinct user.
DISTINCT { User ?u }
FROM { Revision ?rev }
WHERE { ?rev:page IS ?page, ?rev:user IS ?u }
ORDER { MAX(?rev:timestamp) }

Taking into account the database structure which would be described inside MediaWiki the query would translate about into this sql currently.

SELECT user.*, rev_user_text AS user_name, MAX(rev_timestamp) AS timestamp
FROM revision LEFT JOIN user ON (rev_user = user_id)
WHERE (rev_page = 1)
GROUP BY rev_user, rev_user_text
ORDER BY timestamp DESC

The query however when looped over would return objects where ->u would contain a User instance.

The key difference of course is:

  • If the database structure changes, the generated sql will be different, the abstract query will not need to be updated to fit the new db structure as SQL would.
    • This flexibility would actually flex out even to extreme database changes. That query could still be functional even if we dropped the entire user table, the rev_user_text field, and replaced them with a usermap table that maps anon ips and registered users to ids.
  • Because the query is written more abstractly individual db engines could theoretically set rules for certain types of queries to optimize them for that specific DB engine.
    • We should theoretically be able to keep our top MySQL performance.
    • We might also get better performance in other dbs like pgsql.

PHP Api

edit

There's a little undecided on the php api, there are two possibilities:

// Abstract query
$q->query("
	DISTINCT { User ?u }
	FROM { Revision ?rev }
	WHERE { ?rev:page IS ?page, ?rev:user IS ?u }
	ORDER { MAX(?rev:timestamp) }",
	array( 'page' => $this->getTitle() ) );

// Abstract query (php syntaxed?)
$q->query(array(
	"distinct" => "User ?u",
	"from" => "Revision ?rev"
	"where" => array( "?rev:page" => "?page", "?rev:user" => "?u" ),
	"order" => "MAX(?rev:timestamp)" ),
	array( 'page' => $this->getTitle() ) );

The advantage of the PHP syntax is that for a portion of the syntax mistakes php will throw a syntax error, however the abstract string dsl is more flexible for things like != and other comparisons.

We're beginning to target HipHop as a primary target, and we have the option in cases where performance is key to use a php extension, so performance wise a string based dsl shouldn't be a prooblem.