Talk:Quarry

About this board

Previous discussion was archived at Talk:Quarry/Archive 1 on 2015-04-17. Discussion area for discussion about Quarry itself and help about individual queries.

Zar2gar1 (talkcontribs)

Hi there,

I've started trying out Quarry recently, and I really like the service. It's focused, snappy, and makes results easy to link to. However, I've hit a couple limitations in what I want to do.

I figured from the start I'll ultimately need to code up something more complex at Toolforge, but I realized a couple things would allow for more heavy-lifting with Quarry:

  1. Is there any reason the watchlist table is entirely redacted, instead of just user fields being blanked? I guess this one's more a general question about the DB replicas
  2. How exactly is the result-set data stored from successful queries? And would it be possible to provide it through SQL, even temporarily in a cache somehow? Maybe similarly to the ToolDB databases? My thinking is that could allow decomposing queries, then joining or filtering their result-sets, all asynchronously through Quarry.

I can fill out feature request tickets at Phabricator, but I thought I'd ask here first in case I'm missing something obvious.

BDavis (WMF) (talkcontribs)
Reply to "Two feature ideas"
Liz (talkcontribs)

Is there a character count for Query names? Because there is a new editor, Yesh0305, who is writing ridiculously long query names and the table at https://quarry.wmcloud.org/query/runs/all gets all out-of-shape. I've posted to their talk page but I don't think they even realize that they have a talk page. I've looked at their global contributions to reach out to them on their home Wikipedia (which I think is tewiki) but they had none so they must use a different username on Quarry. Maybe there could be a reasonable character limit on names, like 20-30 characters. What do you think?

Enhancing999 (talkcontribs)

You mean https://quarry.wmcloud.org/Yesh0305 ? @User:Yesh0305

I don't think name such as "Compare each top editor's total edits against the overall average: How much more than average as percentage" is problematic. It's actually fairly descriptive as name.

Personally, I'm either too lazy or try to keep them short because it becomes the download name, but in principle, in a list of queries, the above can be sensible.

Reply to "Character count"

Why are queries getting stopped?

4
Prototyperspective (talkcontribs)

It says the query was stopped but I did not stop it.

Prototyperspective (talkcontribs)
Enhancing999 (talkcontribs)

I had that too. I assumed a dbadmin stopped it as it was running for a long time.

Prototyperspective (talkcontribs)

That could be the case. It could also be because there were issues with the database or because some limit was hit. I think at a minimum it should display some error message / info. One of the queries did run through now and with the bug report above I guess this is solved here.

Reply to "Why are queries getting stopped?"

OperationalError('table resultsets already exists')

5
Enhancing999 (talkcontribs)

Any idea why that happens? I get it after long running queries. Sample: quarry:query/86096. If I happen to have the window open, I sometimes see the actual results before, meaning the query was successful. Usually I forget to export it before it disappears.

Framawiki (talkcontribs)
Prototyperspective (talkcontribs)

Just had the same problem at query 86864. Is there a bug report about it by now? I could not find one with that error so probably not (please link it here). I refreshed the page and then submitted the query again, hopefully it works now.

Enhancing999 (talkcontribs)

No, I might be mistaken, but frequently nothing happens once one adds a problem to phab. Just creates cost at WMF and work for bug sorters.

Prototyperspective (talkcontribs)

It wouldn't be an issue or unexpected if that frequently happens but it's the most common thing that happens and even for major problems, including major issues open for over a decade. Changing that is I think step 1 and I made a concrete proposal for that here (and several more that are linked there).

Reply to "OperationalError('table resultsets already exists')"
Enhancing999 (talkcontribs)

From locally running python, what's the best way to run a query and download the result?

Supposedly Manual:Pywikibot/MySQL can't work with Quarry, except from toolserver.

Matěj Suchánek (talkcontribs)

There has never been support for Quarry in Pywikibot, but recently support for Wikimedia Superset was added. Use SupersetPageGenerator in code or -supersetquery from command line.

Enhancing999 (talkcontribs)

Interesting suggestion: I should try to figure out how to get Superset to work. The access to create new datasets seems to be limited.

Enhancing999 (talkcontribs)

Is there a way to trigger the update of query from python and then load the result of the most recent run without knowing the run number?

Reply to "use from python"

Page deletions from 2023 seem off

2
Clayoquot (talkcontribs)
Matěj Suchánek (talkcontribs)

I think COUNT(log_namespace = 0) is incorrect. When I reproduce the stats using:

SELECT LEFT(log_timestamp, 4) AS year, COUNT(*) FROM logging_logindex WHERE log_namespace = 0 AND log_type = 'delete' AND log_action = 'delete' GROUP BY LEFT(log_timestamp, 4);

I get 89,872 deleted main space pages in 2022 and 81,099 in 2023. For even namespaces (log_namespace % 2 = 0), it's 440,817 and 391,807.

Reply to "Page deletions from 2023 seem off"

What's more efficient: AND NOT, not in, <>

2
Enhancing999 (talkcontribs)

Which is more efficient?

  • AND NOT ( lt_title = "ABC" ) AND NOT ( lt_title = "XYZ")
  • AND NOT in ( "ABC", "XYZ")
  • AND lt_title <> "ABC" AND lt_title <> "XYZ"

Agree that none is ideal.

Matěj Suchánek (talkcontribs)

As the first thing, the query engine builds a query plan, then executes the query according to the plan. You can check if the query plan is always the same (e.g., using Toolforge SQL Optimizer). If it is, there is no difference.

I prefer NOT IN.

Reply to "What's more efficient: AND NOT, not in, <>"
Enhancing999 (talkcontribs)

How to specify tables from two different databases (wikidatawiki_p and commonswiki_p)?

Enhancing999 (talkcontribs)

I tried

  • SELECT * FROM `commonswiki_p`.`pages` LIMIT 1
  • USE DATABASE commonswiki_p
  • USE commonswiki_p;

to override what's specified in the GUI.

Matěj Suchánek (talkcontribs)
TheDJ (talkcontribs)

They are completely separate DB servers, you cannot make queries across multiple servers in the same query.

Enhancing999 (talkcontribs)

Apparently it was possible (see sample in the topic referenced by Matej) but then un-featured.

I found an easier solution, as the gap between Wikidata and Commons is only partial: one table at Commons is updated ( wbc_entity_usage), but not the other (page_props): quarry:query/86040

TheDJ (talkcontribs)

"Apparently it was possible " Yes, until the infrastructure ran into scaling problems.

Enhancing999 (talkcontribs)

Are there any measure in place to keep the databases in sync? The gap mentioned above is minor in percentages (maybe 0.1%), but in absolute numbers 4600 is a lot.

Reply to "Using 2 databases"

combining SQL and SPARQL query

2
Jarekt (talkcontribs)
Enhancing999 (talkcontribs)
Reply to "combining SQL and SPARQL query"
Summary by GTrang

No more replag

GTrang (talkcontribs)

The enwiki database has been on replag for an entire week now. It should hopefully be fixed in the next week or so.

Return to "Quarry" page.