User:DWalden (WMF)/Test data sampling using SQL

Problem

edit
  • I want to come up with a rich set of test data.
  • I already have a test environment with lots of data.
  • Therefore, I don't need to create test data.

Solution

edit

Query your SQL database to find data that is already available.

  • Identify what you want to test.
    • This could be an API, HTML form or something else.
    • We will use as an example deleting pages.
  • Identify which variables might affect its behaviour.
    • For example, if deleting a page, two variables which may affect behaviour are number of edits to the page and whether or not it is a file.
    • It will generally work best if you can identify discrete values for these variables.
      • For example, number of edits could be either less than 1000 or 1000 and greater. Whether something is a file is either true or false.
  • Construct the query to list all these variables

For example:

SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;

Which will output something like:

page_title  Is File     COUNT(rev_id) < 1000
----------  ----------  --------------------
Main_Page   0           1
Foobar      0           1                   
File:Foo    1           0
File:Bar    1           0
File:Baz    1           1
  • On a large database, there will likely be lots of pages in each partition. You will likely only want to test a sample from each.
SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id
GROUP BY page_title;

Which will output something like:

page_title  Is File     size     part
----------  ----------  -------  -------
Main_Page   0           1        1
Foobar      0           1        2
File:Foo    1           0        1
File:Bar    1           0        2
File:Baz    1           1        1
  • Notice that the numbering produced by ROW_NUMBER() will reset to 1 with each new partition. You can then select a subset of a partition based on the numbering. For example:
SELECT * FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;

Which will output something like:

page_title  Is File     size     part
----------  ----------  -------  -------
Main_Page   0           1        1
File:Foo    1           0        1
File:Baz    1           1        1
  • You may wish to order the rows within each partition. For example:
    • Number of revisions: ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY COUNT(rev_id) DESC)
    • Randomly: ROW_NUMBER() OVER (PARTITION BY "Is File", "size" ORDER BY RAND())

API testing

edit

To make this work with the API testing script, name the columns after the API parameters.

For example, the delete API has the parameter title. Therefore:

SELECT page_title AS title FROM
(SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part
FROM revision
INNER JOIN page ON rev_page=page_id) AS foo
WHERE foo.part = 1
GROUP BY page_title;

Pass this to the script:

python3 pwb.py all_params_api_db -m delete -q 'SELECT page_title AS title FROM (SELECT page_title, page_namespace = 6 AS "Is File", COUNT(rev_id) < 1000 AS "size", ROW_NUMBER() OVER (PARTITION BY "Is File", "size") AS part FROM revision INNER JOIN page ON rev_page=page_id) AS foo WHERE foo.part = 1 GROUP BY page_title' -d ~/core/cache/sqlite/my_wiki.sqlite -lang:en -family:localwiki

Examples

edit