User:TJones (WMF)/Notes/Review of Commons Queries
July 2020 — See TJones_(WMF)/Notes for other projects. See also T258297 and T252544.
I undertook a review of queries on Wikimedia Commons to get a sense of how people are using Commons, and how we might improve search on Commons.
Full Query Corpus Analysis
Quick Summary
In three months' worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.
Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related, but even the most common queries are not really that common: out of over 1.09M unique (lightly normalized) queries, only 6 were searched 1,000 times or more, and only 660 were searched 50 times or more. Over 950K occurred only once. There is not really a head—it's pretty much all the long tail.
In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)
In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)
Only 1.6% of queries used a namespace, 0.9% had a file extension. Boolean and special operators were very rare.
10% of queries got zero results. Less than 1% got a million results or more.
If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.
Data
I pulled three months' worth of Commons queries from mediawiki_cirrussearch_request to analyze—from April 15 to July 15, 2020.
The sample does not include all queries from the time period; I applied some of the usual filters that the search team has found to be useful to get a reasonable sample from "normal" human users:
- We require the queries to have used the search box in the upper corner. This eliminates some bots, screen scrapers, and queries that come from links.
- We eliminate users who make more than 100 queries in a day. This helps prevent us from oversampling bots, power users and editors, script/gadget users, shared IP addresses, and other users who are either not "normal" or not human.
Other caveats:
- This sample only includes queries performed on Commonswiki. It does not include queries from other wikis that also search Commons data (like Wikipedias do).
- I performed some very minimal normalization on all the data to make it easier to process: I converted tabs and newlines in any query to spaces. This shouldn't change the results shown or the intent of the query, but it sure does make a tab-separated file easier to process. (See below for info on other kinds of normalization done to group queries together.)
- Note that this is a different data source from the one Erik used for his Top N queries per day. See T257361.
The sample contains 1,532,070 (~1.5M) queries.
Scripts and Languages
The first thing I do with a pile of data, of course, is try to figure out what scripts and languages are contained in it.
Scripts
- The vast majority of queries (1,414,693; 92.3%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.
- 1,170,330 (76.4%) of queries consist only of A-Z (upper- and lowercase) and spaces.
- Another 70,876 (4.6%) are only A-Z and the digits 0-9.
- The next biggest groups are CJK (25,239; 1.6%), Arabic (17,675; 1.2%), and Cyrillic (15,180; 1.0%)—again possibly with additional numbers, punctuation, and symbols.
- The 25,239 "CJK" queries include 13,974 queries that are CJK Unified Ideographs, 5,454 queries in Hangul/Korean, 3,007 queries in Japanese Katakana, 497 queries in Japanese Hiragana, and 2,307 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).
The breakdown by script is below:
count | script |
17,675 | Arabic |
231 | Armenian |
1,784 | Bengali |
2 | Bopomofo |
2,307 | CJK |
3 | Canadian Syllabics |
13 | Carian |
1 | Cuneiform |
15,180 | Cyrillic |
987 | Devanagari |
6 | Egyptian |
16 | Ethiopic |
681 | Georgian |
1 | Glagolitic |
992 | Greek |
67 | Gujarati |
8 | Gurmukhi |
5,454 | Hangul |
2,627 | Hebrew |
497 | Hiragana |
13,974 | Ideographic |
73 | Kannada |
3,007 | Katakana |
91 | Khmer |
22 | Lao |
1,414,693 | Latin |
192 | Malayalam |
1 | Mongolian |
190 | Myanmar |
7 | N'Ko |
13 | Ol Chiki |
48 | Oriya |
83 | Sinhala |
2 | Syriac |
1 | Tai Tham |
343 | Tamil |
46 | Telugu |
2 | Thaana |
1,176 | Thai |
8 | Tibetan |
6 | Tifinagh |
- An additional 2,352 queries (0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Latin/CJK (~500), Cyrillic/Latin (~400), and Arabic/Latin (~300). My favorite query in this group is the mixed Cyrillic/Greek/Latin Jolly Zοmbіеѕ (the ο is Greek; the і, е, and ѕ are Cyrillic).
Numbers
A small number of queries are mostly numbers:
- 2,448 (0.2%) are integers (a small number—just three—with invisibles or diacritics, and some—eleven—are longer numbers with commas)
- 457 look like measurements (e.g., 3mm or 5x5)
- 313 look like IP addresses
- Plus a handful (69) of other numbers, including decimals, ordinals, hex numerals, malformed IP addresses, etc.
Misc
There are a small number of additional text patterns not included above.
- 2,720 (0.2%) queries look like identifiable web domains.
- 626 queries look like email addresses.
- 160 queries are Latin-script acronyms (91 uppercase, 69 lowercase). Traditionally we don't handle acronyms very well in search, so at least there aren't a lot of them.
Symbols
A small number of queries are all symbols (711; e.g., $600) or punctuation (107; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).
126 additional queries have characters that my Unicode regexes identify as "unassigned" code points, but these are mostly—but not entirely—emoji. (I assume most are newer emoji that were assigned code points more recently than my regex library's Unicode data was updated.)
Invisibles
1,390 queries (0.1%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.
If these aren't normalized well, they can screw up query results; however, they are clearly not a huge problem on Commons.
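As a rough illustration (again, not the actual analysis code), both the "unassigned" check above and the invisible-character check can be approximated with Unicode general categories from Python's standard unicodedata module: Cn is unassigned, Cf is format (bidi markers, zero-width characters), and Cc is control.

```python
import unicodedata

def unicode_flags(query: str) -> set[str]:
    """Flag unassigned and invisible characters in a query."""
    flags = set()
    for ch in query:
        cat = unicodedata.category(ch)
        if cat == "Cn":
            flags.add("unassigned")  # often emoji newer than the Unicode data
        elif cat in ("Cf", "Cc"):
            flags.add("invisible")   # bidi marks, ZWJ, controls, etc.
    return flags

print(unicode_flags("abc\u200Edef"))  # {'invisible'} (left-to-right mark)
```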
Languages
I took a random sample of 1,000 queries and tried to identify them by language.
The categorizations are almost certainly not perfect, but they should be close enough to get a sense of the proportions of different languages used on Commons.
As a general rule, I don't try to identify the "language" of names. North Americans in particular like to combine names from various ethnolinguistic origins, though they aren't the only ones. Some names—Maria is the most egregious—are too ambiguous to categorize. I make exceptions for names in fairly unambiguous scripts; for example, 엔리코 콜런토니 is arguably "in Korean", even though it's a very Italian name ("Enrico Colantoni") of a Canadian actor.
If a query includes a name and text in a particular language, I count that as in that language. So, Abraham Lincoln is a name, but birthplace of Abraham Lincoln would be categorized as English and local de nascimento de Abraham Lincoln would be categorized as Portuguese.
19 queries were "unidentifiable" because they weren't words (e.g., 11q!) or were too ambiguous as to language (e.g., a single-word query that could be English, French, or Spanish).
I categorized 42 queries as "technical terms", which are either too ambiguous or not really words (e.g., xml, t-800). There was also 1 number and 1 measurement.
A few items are assignable to a given language even though they are in the wrong script. The Russian song В путь is transliterated as V Put or V Put', which doesn't mean anything in English, so I guess it's in Russian? Sure, why not—there were only a small handful. Also, there were two wrong-keyboard Russian queries that I noticed, which I also counted as Russian.
A couple of queries were mixed-language. I counted them as the predominant language and noted the other language.
The most common categories then are English queries (507; 50.7%) and names (237; 23.7%), with a smattering of technical terms (42; 4.2%), German (34; 3.4%), Spanish (28; 2.8%), French (27; 2.7%), Chinese (14; 1.4%), Russian (13; 1.3%), Italian (11; 1.1%), and others.
Non-language groups | |
19 | unidentifiable |
237 | names (1 Category, 1 File) |
42 | technical terms |
1 | measurement |
1 | number |
Languages | |
2 | Arabic |
1 | Armenian |
1 | Bulgarian |
2 | Bengali |
14 | Chinese |
1 | Czech |
1 | Danish |
4 | Dutch |
507 | English |
1 | Finnish |
27 | French |
34 | German |
2 | Hebrew |
1 | Hungarian |
3 | Indonesian |
11 | Italian |
7 | Japanese (3 in transliteration) |
8 | Korean (1 in transliteration) |
4 | Latin (1 Category) |
6 | Persian |
8 | Polish |
8 | Portuguese |
13 | Russian (1 in transliteration, 2 wrong keyboard, 1 with some English) |
28 | Spanish (1 with some English) |
2 | Swedish |
1 | Tagalog |
1 | Tajik |
1 | Thai |
1 | Turkish |
Query Patterns
Here are some potentially interesting patterns I noticed in the queries:
Query Frequency
We wanted to look at the "head, torso, and tail" of the distribution of queries by frequency—however, there appears to be at most a tiny head and a long, long tail.
I did some very basic normalization of the queries for bucketing: I lowercased them and normalized whitespace (removing leading and trailing spaces, and reducing multiple spaces to one), so that " JoHN SMiTh " and "john smith" count as the same query.
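In code terms, the light normalization amounts to something like this (a sketch of the equivalent, not the original script):

```python
def light_norm(query: str) -> str:
    # Lowercase, trim, and collapse each run of whitespace to one space.
    return " ".join(query.lower().split())

assert light_norm("  JoHN   SMiTh ") == light_norm("john smith")
```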
There are 1,090,396 unique normalized queries (out of 1,532,070 queries).
I grouped the query frequencies into quasi-logarithmic binary/decimal buckets (1/2/5/10/20/50...), a scheme that is approximately logarithmic in both binary and decimal, relatively fine-grained, and human-friendly. I also added buckets for 3 and 4, since there are many queries with these very low frequencies.
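The bucketing scheme is compact enough to sketch (illustrative, not the original code):

```python
def freq_bucket(count: int) -> str:
    """Map a query frequency to a quasi-logarithmic 1/2/5 bucket,
    with separate buckets for the very common low counts 1-4."""
    if count <= 4:
        return str(count)
    power = 10 ** (len(str(count)) - 1)  # 1, 10, 100, 1000, ...
    if count < 2 * power:
        return f"{power}-{2 * power - 1}"      # e.g., 10-19
    if count < 5 * power:
        return f"{2 * power}-{5 * power - 1}"  # e.g., 20-49
    return f"{5 * power}-{10 * power - 1}"     # e.g., 50-99

assert freq_bucket(7) == "5-9"
assert freq_bucket(1234) == "1000-1999"
```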
Only 2 queries appeared 2,000 or more times, and only 4 appeared between 1,000 and 1,999 times. All together, only 660 queries (0.06%) appeared 50 times or more.
So, 99.94% of queries occur fewer than 50 times in 3 months. 97.89% occur fewer than 5 times. It's all long tail.
Normalized Query Frequency Distribution
count | bucket |
955,635 | 1 |
77,294 | 2 |
23,539 | 3 |
10,918 | 4 |
15,451 | 5-9 |
4,990 | 10-19 |
1,909 | 20-49 |
433 | 50-99 |
138 | 100-199 |
62 | 200-499 |
21 | 500-999 |
4 | 1,000-1,999 |
2 | 2,000-4,999 |
mean count: 1.40506
Where to break the head and tail in a distribution is generally subjective; however, Wikipedia has a page on the Head/tail Breaks algorithm, which splits the head from the tail at the mean value of the distribution. The mean frequency for the normalized queries is 1.40506, which means the head would be everything with a frequency of 2+ (12.4% of queries), and the tail would be all of the unique queries (87.6% of queries). Having a frequency of 2 (or even 10) out of over a million queries doesn't seem like the "head" to me, so I'm going to stick with my claim that there is no head!
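For reference, here is my reading of head/tail breaks as a short sketch (applied to the list of per-query frequencies):

```python
def head_tail_breaks(values, head_limit=0.4):
    """Split at the mean; values above the mean are the 'head'.
    Recurse on the head while it remains a small minority."""
    breaks = []
    part = list(values)
    while len(part) > 1:
        mean = sum(part) / len(part)
        head = [v for v in part if v > mean]
        breaks.append(mean)
        if not head or len(head) / len(part) >= head_limit:
            break
        part = head
    return breaks

# With a mean frequency of 1.40506, the very first break already puts
# every repeated query (frequency 2+) in the "head".
```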
Query Intent
While discussing this analysis with Cormac, we talked a bit about "modifiers" to queries, such as looking for things with specific colors, etc.—y'know, the kinds of things that might show up in structured data! It's a difficult and subjective line to draw, but I tried to distinguish things that are essential from things that are preferences (like the color of a car), and things that would be reified in Wikidata (e.g., "Chinese art") from those that would not (e.g., "ugly art"), with reifiable things not counting as "modified". There are not many queries with modifiers—none in the top 100 most common queries.
In the random sample of 1,000 queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 95 (9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc. So, roughly 9-10% of Commons queries are likely porn-themed.
I took a sub-sample of 100 queries from this set and tried to categorize them by intent, to compare to the head-ish top 100 queries:
- 30 were about specific or general things.
- 23 things hard to classify more specifically
- 1 had additional modifiers/specifiers
- 3 specific products
- 2 logos (also counted as images)
- 1 film
- 1 list (of Latin phrases—??)
- 22 were about specific people
- 1 had additional modifiers/specifiers
- 1 was about 2 specific people
- 14 were about specific places
- 1 had additional modifiers/specifiers
- 1 was about an activity at that place
- 12 were porn
- 11 were about specific organizations
- 3 were about general concepts
- 2 were about an activity (one in a particular place)
- 2 seemed to be looking for specific images (e.g., something fairly specific.jpg)
- 1 was about art
I reviewed the top 100 most common queries and tried to categorize them by intent:
- 61 were porn, and another 5 were likely porn. That's 2/3 of the most common queries.
- The 4 non-English non-name higher-frequency queries were here: two German words and two Persian words.
- There was one Category in the sample, and it was also in the porn category.
- 7 were about "facts" (e.g., map of a place).
- 1 list of...
- 7 were about specific or general things.
- 6 were about current events and topics in the news.
- 5 were about non-pornographic celebrities or historical figures.
- 2 were about art or artists.
- 2 were about specific places.
- 2 were about tech topics or companies.
- 3 I couldn't categorize (2 of them got 0 results).
Query Generality
I tried to categorize queries in various samples as narrow, broad, or somewhere in the middle.
- From the sample of 1,000 random queries I categorized by language (representing the tail), I took a random subsample of 100 and categorized them by generality:
- 60 were narrow (specific person, place, or object)
- 22 were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
- 17 were broad (many were one word); all but one were porn
- 1 was uncategorizable.
- From the top 100 most common queries (the stubby head):
- 24 were narrow (specific person, place, or object)
- 27 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
- 46 were broad (often one word); all but one were porn
- 3 were uncategorizable
Keywords and Specific Purposes
- 25,047 queries (1.6%) specify a namespace. The most common are Category (15,519), File (7,044), and Template (823). 282 namespace queries were in Talk namespaces. There was one instance of a non-existent namespace: media. (A rough detection sketch for namespaces, file extensions, and URL bits follows this list.)
- 83 queries use the character ː, which is normally used in phonetic transcription to indicate a lengthened vowel (less often a lengthened consonant). It's only used that way once in these 83 queries. 81 of them use it in place of a colon with a namespace—which does not work! (And there was one other weird one I couldn't figure out.)
- 14,118 queries (0.9%) end in a file extension (but don't use the File: keyword). The most common are .jpg (6,482), .svg (3,142), and .png (1,786). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .stl, .tab, .tif, .tiff, .txt, .wav, .webm.
- 13,239 queries (0.9%) have easily detected URL bits—e.g., ?q=query&thing=whatchamacallit. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search, though it isn't clear how the URL components are getting copied from Google to Commons. I suppose it could be a logging error of some sort. A partly sanitized example query is below (the source language is Polish, and kudłaty stwór means "shaggy creature").
- Chewbacca&tbnid=12345678901234&vet=1&docid=12345678901234&w=3456&h=5184&q=kudłaty+stwór&hl=pl&source=sh/x/im
- Searching for kudłaty stwór in images on google.pl does give the Polish Wikipedia article and image for Chewbacca as the 4th result... so something is going on here, but it isn't clear what.
- Of the 12,275 queries with tbnid= in them, all but one start with what looks like a query in Latin script (the other one is Bengali).
- Of the 7,273 tbnid queries with a q= parameter, all but about 25 are largely Latin script; the others are Arabic, Bengali, Cyrillic, Devanagari, Emoji, Greek, Hebrew, Korean, Tamil, and Thai.
- Only a handful of queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
- AND: 83 of 234 queries with AND seem to be using Boolean operators, even though they don't really do anything (everything is ANDed together by default).
- OR: 44 of 77 queries with OR seem to be Boolean operators.
- NOT: 2 of 17 queries with NOT seem to be Boolean operators.
- !: None of the 475 queries with ! seem to be well-formed, intentional Boolean queries. Most are not formatted as Booleans—Welcome! is not a Boolean query. The ones that are formatted as Booleans look like typos, such as !mismatched quotes" or !st (likely a mis-shifted 1st). Others don't seem terribly useful; for example, Category:!Dogs returns everything that is not in Category:Dogs. Fewer than 10 queries are plausible Boolean queries.
- -: 533 queries plausibly use - as negation. I did a quick skim, and the large majority of them look to be using it intentionally. 70 queries start with - and most of them don't look to be using it intentionally (or at least not correctly), since it's very hard (but not entirely impossible) to imagine -172 is a useful query.
- ~: 110 queries use ~. Almost half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Many look like typos for a hyphen, as in 1900~2000. About 15 look to be intentional use of ~.
- (): A handful of the queries using AND, OR, or NOT as Booleans also use parens—even though they don't do anything in our current search system.
- +: Ignoring tbnid queries, 2,304 queries use +. Most seem to be using it in place of a space.
- There are no special keywords with colons other than namespaces in my sample. Carly asked about haswbstatement:, so I looked a little harder for those and found two instances of haswbstatemen, six of haswbstatement, and one of sshaswb, none with any other search terms. There was one malformed query, haswbstatementP180=Q42133786, but it is also missing the colon.
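Here's a rough sketch of how these keyword-ish patterns can be pulled out of queries. The regexes are assumptions for illustration (and the namespace list is abbreviated), not the patterns actually used in the analysis.

```python
import re

NAMESPACE_RE = re.compile(
    r"^(Category|File|Template|Talk|User|Help)\s*:", re.IGNORECASE)
FILE_EXT_RE = re.compile(
    r"\.(jpe?g|png|svg|gif|djvu|og[gv]|pdf|tiff?|webm|wav|stl|txt)$",
    re.IGNORECASE)
URL_BITS_RE = re.compile(r"[?&]\w+=[^&\s]*")  # e.g., &tbnid=...&vet=1

def keyword_tags(query: str) -> list[str]:
    """Tag a query with the special-purpose patterns counted above."""
    tags = []
    if NAMESPACE_RE.match(query):
        tags.append("namespace")
    if FILE_EXT_RE.search(query):
        tags.append("file extension")
    if URL_BITS_RE.search(query):
        tags.append("URL bits")
    return tags

print(keyword_tags("Category:Dogs"))                  # ['namespace']
print(keyword_tags("Chewbacca&tbnid=123&q=kudlaty"))  # ['URL bits']
```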
Light vs Heavy Normalization
editWhile talking to Erik about the Top N queries report he's putting together, he mentioned that he's doing a stronger form of normalization for that report, and replacing all punctuation with spaces before normalizing whitespace.
This is more likely to group queries that get different results than my "light" normalization used above—though even the light normalization may do so because of the way we treat CamelCase search terms. The most obvious case is removing quotes from around a query: John Smith will get many more results than "John Smith". However, as Erik rightly pointed out, in most cases, such variants probably generally represent the same query intent.
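A sketch of the difference (illustrative; Erik's actual normalization may differ in details):

```python
import re

def heavy_norm(query: str) -> str:
    # Replace punctuation/symbols with spaces, then apply the same
    # lowercase-and-collapse-whitespace step as the light normalization.
    return " ".join(re.sub(r"[^\w\s]", " ", query).lower().split())

# Light normalization keeps these distinct; heavy normalization merges them.
assert heavy_norm('"John Smith"') == heavy_norm("john smith")
```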
It turns out that in my sample, light vs heavy normalization makes only a small difference: an increase of about 1.5-2.5% in buckets other than the singleton bucket.
The most common change among the heavily normalized top 100 queries is stripping quotes. The most impactful change—in terms of increasing the number of queries grouped together—is stripping hyphens, though it only applied to one query.
bucket | raw queries | light norm | heavy norm | norm diff% |
1 | 1,041,577 | 955,635 | 945,944 | -1.01% |
2 | 72,776 | 77,294 | 78,374 | 1.40% |
3 | 20,870 | 23,539 | 24,007 | 1.99% |
4 | 9,330 | 10,918 | 11,101 | 1.68% |
5-9 | 12,766 | 15,451 | 15,712 | 1.69% |
10-19 | 4,054 | 4,990 | 5,090 | 2.00% |
20-49 | 1,597 | 1,909 | 1,955 | 2.41% |
50-99 | 329 | 433 | 438 | 1.15% |
100-199 | 111 | 138 | 141 | 2.17% |
200-499 | 48 | 62 | 62 | 0.00% |
500-999 | 17 | 21 | 22 | 4.76% |
1,000-1,999 | 3 | 4 | 4 | 0.00% |
2,000-4,999 | 2 | 2 | 2 | 0.00% |
mean freq | 1.3168 | 1.40506 | 1.41485 | 0.70% |
Distribution of Hits
Below is a summary of the distribution of results from all of the queries in the sample.
153,549 queries (10%) get zero results. 9,566 queries (0.62%) get more than a million results. The ideal number is probably somewhere in between.
Result Count Frequency Distribution
count | bucket |
153,549 | 0 |
48,409 | 1 |
30,234 | 2 |
22,071 | 3 |
18,006 | 4 |
59,228 | 5-9 |
68,978 | 10-19 |
110,981 | 20-49 |
96,986 | 50-99 |
109,809 | 100-199 |
164,736 | 200-499 |
129,358 | 500-999 |
126,694 | 1K-<2K |
137,790 | 2K-<5K |
74,174 | 5K-<10K |
62,176 | 10K-<20K |
53,881 | 20K-<50K |
25,064 | 50K-<100K |
14,318 | 100K-<200K |
11,013 | 200K-<500K |
5,049 | 500K-<1M |
3,659 | 1M-<2M |
2,808 | 2M-<5M |
1,231 | 5M-<10M |
768 | 10M-<20M |
699 | 20M-<50M |
401 | 50M+ |
Distribution of Token Counts
The number of tokens (roughly, words) in a query is an easy proxy for the complexity of a query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.
Here, we're using a very simple process of breaking tokens on spaces and punctuation. So, a long Chinese sentence would be counted as one token, and ain't would be counted as two, and .. ,, ;; -- would be counted as zero. It isn't perfect, but it gives us a reasonable approximation of what we have.
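The counting itself is trivial; something like this sketch:

```python
import re

def token_count(query: str) -> int:
    # Anything that isn't punctuation, symbols, or whitespace counts
    # as (part of) a token, so a spaceless Chinese sentence is one
    # token and "ain't" is two.
    return len(re.findall(r"[^\W_]+", query))

assert token_count("ain't") == 2
assert token_count(".. ,, ;; --") == 0
```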
Of note, 1,011,221 queries (66.0%) are one or two tokens. 1,417,935 queries (92.6%) are one to four tokens.
All of the "zero-token" queries are strings of punctuation and symbols.
Of the 20 queries with 40 or more tokens, 4 are gibberish, and most of the rest seem to be captions from images (presumably looking for the original image, but who knows?). 5 are in French, 1 in Spanish, and 11 are in English.
Token Count Distribution
count | tokens |
107 | 0 |
353,268 | 1 |
657,953 | 2 |
293,343 | 3 |
113,371 | 4 |
48,063 | 5 |
22,677 | 6 |
11,543 | 7 |
7,344 | 8 |
4,068 | 9 |
2,857 | 10 |
2,597 | 11 |
3,032 | 12 |
3,556 | 13 |
2,964 | 14 |
2,038 | 15 |
1,206 | 16 |
708 | 17 |
409 | 18 |
234 | 19 |
135 | 20 |
109 | 21 |
73 | 22 |
74 | 23 |
54 | 24 |
149 | 25-29 |
118 | 30-39 |
17 | 40-48 |
1 | 58 |
2 | 65 |
Miscellaneous Odd or Interesting Queries
- Erik's Top N report doesn't exclude anything, and covers all wikis. We noticed some politics-themed searches on English Wikipedia, like 2020 Texas US Senate Election, with no variation in whitespace or capitalization. I'm not sure I found the source of the query, but I did find a political website that helps you determine which politicians you align with by asking you questions about various topics. For each topic, they have a "Learn More" link which links to a search on English Wikipedia. So it is definitely possible that unexpected Commons searches (e.g., a relatively complex query with no variation) may be the result of links—which may not be publicly available.
- I saw a handful of queries that look like attempts at SQL injection.
- There are some random-looking queries, e.g. 510d47d9-4f0a-a3d9-e040-e00a18064a99, that turn out to be identifiers in collections that images have been taken from.
Zero-Results Query Sub-Corpus Analysis
After writing up the initial report, I realized that 10% of the queries get zero results, which is a lot, but not as many as we see on many Wikipedias. I investigated just the zero-results queries separately.
Zero-Results Summary
In three months' worth of likely-human queries issued on Commons, zero-result queries make up about 10% of all queries (which is less than the zero-results rate on Wikipedias). Subjectively, the zero-results queries seem to have less junk than on Wikipedia, and so may be more salvageable. Also, there seem to be more spelling errors/typos in the zero-results queries.
80% of the zero-results queries are in the Latin script (vs over 90% in the full corpus). Only 32% are in English (vs 50%), and roughly 25% are names (same as overall). Only 6.5% are porn-related (vs 9.5% overall).
Only 31 of the top 100 most common zero-results queries are porn-related, vs 66 overall.
Zero-results queries are more heavily skewed toward unique queries.
In a sample of 200 random zero-results queries (the long tail), 37% were about specific things, 20.5% people, 13% places, 5% facts, 3% organizations, and 6.5% were porn. This is roughly similar to the full corpus, with a bit less porn. 60% of zero-results queries were narrow and fairly specific, 10% were fairly broad, and 21.5% were in the middle. (Broad zero-results queries were often one word.) This is very similar to the full corpus.
In a sample of the 100 most common zero-results queries (the head-ish), 31 were porn, 28 were specific things, 23 were people. This is much more specific and has half the porn of the full corpus. 57 queries were narrow and fairly specific, 31 were fairly broad, and 9 were in the middle. (Broad zero-results queries were often one word.) This is skewed much more toward narrow queries compared to the full corpus.
Breaking on whitespace and punctuation (less than ideal, but easy), 60% of queries are one or two words; 86% are four words or fewer. This is slightly lower than in the full corpus. More than half of all high-token queries (more than 10 tokens) get zero results.
Spelling errors seem more common in the zero-results queries (and there is less junk in the zero-results queries than in Wikipedia data); 32% of the random sample of zero-results queries have spelling errors, and 38% of the top 100 zero-results queries have spelling errors. "Did you mean" suggestions and the completion suggester do okay, but could be much better. The current completion suggester doesn't have much to work with because it is limited to page/file/category names, which are not always good matches with what people are searching for. T250436 could be a big help!
The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or a similar "non-organic" source.
Zero-Results Data
This is a sub-sample of the earlier-described data set, limited to queries that got zero results.
The sample contains 153,525 zero-results queries (almost exactly 10.0% of the full sample).
Zero-Results Scripts and Languages
Zero-Results Scripts
The majority of zero-results queries (125,326; 81.6%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.
- This is notably less than the percentage of queries overall that are in the Latin script (92.3%).
- 88,804 (57.8%) of zero-results queries consist only of A-Z (upper- and lowercase) and spaces.
- This group is notably smaller (57.8% vs 76.4%).
- Another 5,434 (3.5%) are only A-Z and the digits 0-9.
The next biggest groups are Arabic (6,618; 4.3%), CJK (2,733; 1.8%), and Cyrillic (2,361; 1.5%)—again possibly with additional numbers, punctuation, and symbols.
- The Arabic percentage is notably higher (4.3% vs 1.2%)
- The 2,733 "CJK" zero-results queries include 1,737 queries in Hangul/Korean, 432 queries in Japanese Katakana, 1 query in Japanese Hiragana, 139 queries that are CJK Unified Ideographs, and 424 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).
- The mix here is very different from the general queries, with significantly more Hangul, and significantly fewer Ideographs and Hiragana.
The breakdown by script is below:
count | script |
6,618 | Arabic |
78 | Armenian |
758 | Bengali |
424 | CJK, Mixed |
8 | Carian |
1 | Cuneiform |
2,361 | Cyrillic |
416 | Devanagari |
5 | Egyptian Hieroglyphics |
9 | Ethiopic |
235 | Georgian |
354 | Greek |
30 | Gujarati |
2 | Gurmukhi |
1,737 | Hangul |
374 | Hebrew |
1 | Hiragana |
139 | Ideographic |
33 | Kannada |
432 | Katakana |
69 | Khmer |
18 | Lao |
125,326 | Latin |
102 | Malayalam |
102 | Myanmar |
4 | N'Ko |
2 | Ol Chiki |
9 | Oriya |
35 | Sinhala |
1 | Tai Tham |
87 | Tamil |
10 | Telugu |
609 | Thai |
8 | Tibetan |
3 | Tifinagh |
An additional 1,182 zero-results queries (0.8%, up from 0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Arabic/Latin (~240), Cyrillic/Latin (~200), Latin/CJK (~160), and Hangul/Latin (~120). Not surprisingly, the mixed Cyrillic/Greek/Latin query Jolly Zοmbіеѕ got zero results.
Zero-Results Numbers
A small number of zero-results queries are mostly numbers:
- 923 (0.6%) are integers.
- 1 looks like a measurement (e.g., 3mm or 5x5)
- Plus a handful (7) of other numbers, including decimals, etc.
Zero-Results Misc
There are a small number of additional text patterns not included above.
- 666 (0.4%) zero-results queries look like identifiable web domains.
- 572 (0.4%) zero-results queries look like email addresses.
Zero-Results Symbols
A small number of zero-results queries are all symbols (74; e.g., $600) or punctuation (97; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).
Zero-Results Invisibles
393 zero-results queries (0.3%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.
Zero-Results Languages
I took a random sample of 200 zero-result queries and tried to identify them by language.
Reminders:
- I generally don't try to identify the language of names.
- If a zero-results query includes a name and text in a particular language, I count that as in that language.
14 zero-results queries (7.0%) were "unidentifiable" because they weren't words (e.g., 11q!). This is significantly more than in the full corpus (7.0% vs 1.9%).
I categorized 12 (6.0% vs 4.2% in the full corpus) zero-results queries as "technical terms", which are either too ambiguous or not really words (e.g., xml, t-800). There was also 1 number.
A couple of zero-results queries were mixed-language. I counted them as the predominant language and noted the other language.
The most common categories then are English queries (63; 32.5% vs 50.7%) and names (47; 23.5% vs 23.7%), with a smattering of technical terms (12; 6.0% vs 4.2%), German (12; 6.0% vs 3.4%), Arabic (9; 4.5% vs 0.2%), Spanish (8; 4.0% vs 2.8%), Korean (5; 2.5% vs 0.8%), French (4; 2.0% vs 2.7%), and others.
Non-language groups | |
14 | unidentifiable |
47 | names (1 Category, 1 File—same as last time!) |
12 | technical terms |
1 | number |
Languages | |
1 | Albanian |
9 | Arabic (1 with some English) |
1 | Dutch |
63 | English |
4 | French |
12 | German |
1 | Hebrew |
1 | Icelandic |
2 | Indonesian |
2 | Italian |
2 | Japanese |
5 | Korean |
1 | Malayalam (in transliteration) |
1 | Norwegian |
3 | Persian |
3 | Polish |
2 | Portuguese |
1 | Russian |
8 | Spanish (1 with some English) |
1 | Swedish |
1 | Tamil |
2 | Turkish |
Zero-Results Query Patterns
Here are some potentially interesting patterns I noticed in the zero-results queries...
Zero-Results Query Frequency
The frequency distribution of the zero-results queries has a stronger skew towards unique queries, even with light normalization, with the mean frequency being 1.08 (vs 1.41 for all queries).
There are 142,406 unique normalized zero-results queries (out of 153,525 zero-results queries).
Only 2 zero-results queries appeared 200 or more times, and only 2 appeared between 50 and 99 times. All together, only 35 zero-results queries (0.02%) appeared 20 times or more.
So, 99.98% of zero-results queries occur fewer than 20 times in 3 months. 99.7% occur fewer than 5 times. It's all long tail.
Normalized Zero-Results Query Frequency Distribution | |
count | bucket |
136,530 | 1 |
4,268 | 2 |
828 | 3 |
331 | 4 |
329 | 5-9 |
85 | 10-19 |
31 | 20-49 |
2 | 50-99 |
2 | 200-499 |
mean count: 1.07808 |
Zero-Results Query Intent
In the random sample of 200 zero-results queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 13 (6.5% vs 9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc.
I also tried to categorize this set by intent, to compare to the head-ish top 100 zero-results queries:
- 74 were about specific or general things.
- 5 were about specific events
- 3 specific products
- 2 films
- 1 song
- 1 website
- 41 were about specific people
- 26 were about specific places
- 13 were porn
- 11 were about "facts"
- 6 were about specific organizations
- 6 were about general concepts
- 4 were about art
- 8 were junk
- 4 were categories
- 3 seemed to be looking for specific files (e.g., something fairly specific.jpg)
I reviewed the top 100 most common zero-results queries and tried to categorize them by intent:
- 31 were porn
- 28 were about specific or general things.
- 5 films
- 5 websites
- 23 were about specific people
- 5 were about specific organizations
- 3 were about specific places
- 2 were about "facts"
- 2 were about general concepts
- 2 malformed keywords ("category:portal:mathematics")
- 3 I couldn't categorize
- 1 seemed to be looking for a specific file
Zero-Results Query Generality
I tried to categorize zero-results queries in various samples as narrow, broad, or somewhere in the middle.
From the sample of 200 random zero-results queries that I categorized by language (representing the tail), I also categorized each by generality:
- 120 (60.0%) were narrow (specific person, place, or object)
- 43 (21.5%) were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
- 20 (10.0%) were broad (many were one word); all but one were porn
- 17 (8.5%) were uncategorizable.
From the top 100 most common zero-results queries (the stubby head):
- 57 were narrow (specific person, place, or object)
- 9 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
- 31 were broad (often one word); all but one were porn
- 3 were uncategorizable.
Zero-Results Keywords and Specific-Purposes
- 7,292 zero-results queries (5.1% vs 1.6%) specify a namespace. The most common are Category (5,809), File (970), and Template (206). 60 namespace zero-results queries were in Talk namespaces. The previously mentioned non-existent namespace, media, got zero results.
- 70 zero-results queries use the character ː, which is normally used in phonetic transcription, in place of a colon in an obvious namespace-style query (of the 81 such uses in the whole corpus).
- 2,077 zero-results queries (1.5%) end in a file extension (but don't use the File: keyword). The most common are .jpg (1,211), .png (236), and .svg (197). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .tab, .tif, .wav, .webm.
- 13,193 zero-results queries (9.3% vs 0.9%) have easily detected URL bits. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search. As most of the queries with URL bits get zero results, these are pretty much the same set.
- Only a handful of zero-results queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
- AND: 9 of 23 zero-results queries with AND seem to be using Boolean operators.
- OR: There are no zero-results queries with OR.
- NOT: Neither of the 2 zero-results queries with NOT seem to be Boolean operators.
- -: 30 zero-results queries plausibly use - as negation. "Covid -19" is common among the ones that do not seem to be using it intentionally as negation.
- ~: 37 zero-results queries use ~. More than half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Maybe 1 looks to be intentional use of ~.
- +: Ignoring tbnid zero-results queries, 301 queries use +. Most seem to be using it in place of a space.
Zero-Results Distribution of Token Counts
The number of tokens (roughly, words) in a zero-results query is an easy proxy for the complexity of the query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.
Of note, 86,230 zero-results queries (60.6% vs 66.0%) are one or two tokens. 122,420 zero-results queries (86.0% vs 92.6%) are one to four tokens.
All of the "zero-token" zero-results queries are strings of punctuation and symbols.
Given that zero-results queries are 10% of all queries in the sample, the table below shows where zero-results queries are over-represented.
- "Zero-token" queries are punctuation and symbols, so it isn't too surprising most of them get zero results.
- Queries of 7 or more tokens are more than twice as likely to get zero results.
- For queries with more than 10 tokens, more than half get zero results.
Token Count Distribution | | |
count (all) | count (zero-results) | zero/all | tokens |
107 | 96 | 89.72% | 0 |
353,268 | 38,470 | 10.89% | 1 |
657,953 | 47,760 | 7.26% | 2 |
293,343 | 24,288 | 8.28% | 3 |
113,371 | 11,902 | 10.50% | 4 |
48,063 | 6,985 | 14.53% | 5 |
22,677 | 4,132 | 18.22% | 6 |
11,543 | 2,497 | 21.63% | 7 |
7,344 | 1,605 | 21.85% | 8 |
4,068 | 1,086 | 26.70% | 9 |
2,857 | 1,011 | 35.39% | 10 |
2,597 | 1,490 | 57.37% | 11 |
3,032 | 2,303 | 75.96% | 12 |
3,556 | 3,017 | 84.84% | 13 |
2,964 | 2,671 | 90.11% | 14 |
2,038 | 1,794 | 88.03% | 15 |
1,206 | 1,032 | 85.57% | 16 |
708 | 569 | 80.37% | 17 |
409 | 312 | 76.28% | 18 |
234 | 162 | 69.23% | 19 |
135 | 81 | 60.00% | 20 |
109 | 46 | 42.20% | 21 |
73 | 30 | 41.10% | 22 |
74 | 33 | 44.59% | 23 |
54 | 24 | 44.44% | 24 |
149 | 61 | 40.94% | 25-29 |
118 | 57 | 48.31% | 30-39 |
17 | 8 | 47.06% | 40-48 |
1 | 1 | 100.00% | 58 |
2 | 2 | 100.00% | 65 |
Zero-Results Spelling Errors
I noticed more obvious spelling errors in the zero-results sample than in the larger sample, so I tagged items that seemed to be spelling errors. Spelling errors in languages that use the Latin script are easier for me to detect, so there may be more that I missed.
65 (32.5%) of the 200 random zero-results queries look like spelling errors, in several languages: mostly English, but also German, Spanish, French, Polish, Portuguese, and Swedish.
- 26 (13.0%) had good Did You Mean suggestions. 8 DYM suggestions were bad, and 7 were mediocre.
- 5 (2.5%) had good completion suggester suggestions. 1 was mediocre.
- 22 (11.0%) had no useful corrections.
38 (38.0%) of the 100 top zero-result queries look like spelling errors, one in Spanish, the rest in English.
- 23 (23.0%) had good Did You Mean suggestions. 2 DYM suggestions were bad, and 2 were mediocre.
- 7 (7.0%) had good completion suggester suggestions. 1 was bad, and 2 were mediocre.
- 8 (8.0%) had no useful corrections.
The completion suggester often didn't have much to work with, because there are no pages, files, or categories that match the text of the query, though they could match part of the query. For example, nothing matches Waashington state flag, though Waashington by itself gets several suggestions which start with Washington.
Erik's planned work (T250436) on improving suggestions could help a lot with these kinds of queries on Commons.
Zero-Results Miscellaneous Odd or Interesting Queries
Some of the queries that look like attempts at SQL injection unsurprisingly get zero results.
The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or a similar "non-organic" source.
Schematized Top 8 Queries with Frequency:
- 318 Internátional website.com Airport
- international airport with a website in the middle of its name
- all with the same capitalization and accents!
- 232 1234abcd
- 82 "Firstname Lastname"
- 54 Merkel Potrait / merkel potrait / Merkel potrait
- This one looks "organic", because of the variation. I guess portrait is hard to spell. I found several other instances of potrait—though not this many—as well as mnerkel potrait, and other instances of angela maerkel, angela merkeöl, and Angela Merklel.
- 45 Firstname Middlename Lastname
- 36 porn related searc
- porn-related search with the last letter of the last word missing
- 34 titleof famous painting
- famous painting title with two words run together
Other than Merkel potrait, all instances of these are identical under heavy normalization (lowercasing, removing punctuation, and normalizing whitespace).
Zero-Results Examples
Carly asked for some examples of people and places, so I reviewed the items I had identified and provided notes here. These have all been reviewed for PII, and names of non–public figures have been omitted or <schematized>. tbnid examples have also had their <ids> removed, since I don't know whether those might be recoverable somehow. (I doubt it, but why take a chance?)
Note: Looking at these much more closely has led to identifying a few more likely spelling errors, but I’m not going to go back and correct the stats for now. They are still in the right ballpark.
People:
- Cwnciones ana gsbryel
- Mexican singer Ana Gabriel; The first word wants to be canciones (“songs”); no results because of typos and extra words
- quien es <user> wikimedia commons
- “who is <user>”; <user> is a specific user; no results because of extra words
- Johnmichaelelambo
- baseball player; his name is written this way (without spaces) on Japanese Wikipedia.
- cornelious mcarthy
- typo for artist Cornelius McCarthy, known for his male nudes; presumably the searcher was looking for that, but based on his date of birth, his work is likely all under copyright and can’t be on Commons.
- mahatma ganfhi
- typo for Mahatma Gandhi; auto-corrected by DYM and lots of results.
- <firstname> macdonalf
- typo for MacDonald; auto-corrected by DYM, but no obviously relevant results. Looks like a non-notable person.
- Shamsia hassan
- typo for artist Shamsia Hassani; gets autocorrected by DYM and has good results.
- attila hiltman
- misspelling of Attila Hildmann; no results because of typos
- <firstname> Sedighei
- likely typo for Sedighi; does not appear to be a notable person
- Stuhlpfarrer
- on German Wikipedia, that’s Karl Stuhlpfarrer; zero results probably because of no relevant content at the time (now 3 hits)
- Brian Cox (physicist)&tbnid=<id>&vet=1&docid=<id>&w=960&h=1440&q=Professor+Brian+Cox&source=sh/x/im
- Brian Cox (physicist) is a fine query with relevant results, and the name of a category
- Evil Queen (Disney)&tbnid=<id>&vet=1&docid=<id>&w=256&h=228&q=Snow+White's+stepmother&hl=en-US&source=sh/x/im
- Evil Queen (Disney) is also a fine query with relevant results, and the name of a category
- சுவாமி நிரலாம்பா (Gets results through MediaSearch)
- Tamil “Niralamba Swami”; no results because the search is in Tamil.
- Christine Ballestrero.JPG
- likely typo for Cristina Ballestero, a singer/actress; no Wikipedia article, but she’s on YouTube and IMDB. No file with this name; no content on Commons for either spelling.
- Category:nicolo de albate
- Likely misspelling for artist Niccolò dell'Abbate (also spelled Nicolò dell'Abate). Category:Nicolò dell'Abate and Category:Niccolò dell' Abbate exist (the latter is a redirect to the former).
- Photo adjiani 2020
- Likely misspelling of actress Isabelle Adjani; searching for adjiani doesn't get good results or DYM suggestions; searching for adjani gets results, but not from 2020.
Additional Specific People:
I got tired of looking up people to make sure they didn't need to be omitted, so I summarized the rest:
- 10 people with a Wikipedia page (not necessarily on enwiki)
- 1 with extra numbers in the query
- no results because there is no content
- 2 misspelled names that would get decent results on Commons
- no results because of typos; not corrected by DYM
- 2 names that don’t seem to belong to anyone notable (even to Google)
- no results because there is no content
- 3 people trying to be famous online but not yet notable enough for Wikipedia
- 1 is a porn performer
- no results because there is no content
- 1 tbnid query
- no results because of all the extra junk
- 2 people mentioned in a Wikipedia page
- 1 is a fictional character
- no results because there is no content
More people who are less specific:
- נערת איגרו ף
- Hebrew; looks to be an extra space in נערת איגרוף ("boxing girl")—not as specific as most queries; probably gets no results because it's in Hebrew, but the typo doesn't help; female boxing in English gets relevant results.
- penemu pesawat terbang
- Indonesian for “inventor of the airplane”; gets no results for being in Indonesian, but doesn’t get good results in English either.
- ネトウヨ
- Japanese for Netouyo (“Japanese Internet rightists”); no results (or no relevant results) in Japanese or several versions in English (Netto-uyoku, Net uyoku, Netouyo, Netōyo)
Places:
- Mnandilocation
- run-together Mnandi location, which is more of an informational query. The separated form gets a few results; Mnandi by itself gets more, of course—but still not many.
- heindentor
- typo for Heidentor, which gets results and is a Category
- woodland campfl
- Camp Woodland, FL is a place. Still no good results for woodland camp fl or woodland camp florida.
- Hotel longyarbyen
- typo for hotel longyearbyen; DYM auto corrects it.
- محافضه اربي ل
- extra space in محافضه اربيل; likely typo for محافظة اربيل (“Erbil Governorate”), which gets results on Commons.
- مدينة خريبق ه
- extra space in مدينة خريبقه, likely typo for مدينة خريبكة (“City of Khouribga”), which gets a few results; خريبكة (“Khouribga”) by itself gets more results, of course.
- opponitzer tunnel
- Seems to be referring to a tunnel in Opponitz; no results because no content.
- werderau maiacherstraße siedlung
- Werderau is a district of Nuremberg; Maiach is an adjacent district. Siedlung is a settlement. Zero results probably because there is no relevant content tagged in German.
- Armenian Heritage Park&tbnid=<id>&vet=1&docid=<id>&w=2592&h=1683&q=armenian+heritage+park&source=sh/x/im
- Armenian Heritage Park is a category and also returns lots of results
- Gush Katif Airport
- zero results because there’s no content by that name
- Legaslative district of Cloverport ky
- typo for legislative district of Cloverport ky, which DYM corrects.
- Washington, DC Metropolitan Area Special Flight Rules Area&tbnid=<id>&vet=1&docid=<id>&w=900&h=692&q=dc+flight+restricted+zone&hl=en-US&source=sh/x/im
- not a Category, but Washington, DC Metropolitan Area Special Flight Rules Area gets plenty of results
- Weeki Wachee River&docid=<id>&tbnid=<id>&vet=1&w=1200&h=900&itg=1&hl=en-US&source=sh/x/im
- Weeki Wachee River is a category and gets other results, too
- Castle of St Peters Bodrum
- Bodrum Castle is also known as the "Castle of St. Peter". English text processing does not stem Peters to Peter (probably to keep the last name Peters and the first name Peter separate). Castle of St Peters Bodrum currently gets 11 results. Castle of St Peter Bodrum gets 112.
- Castillo de Kalmykov
- “Castle of Kalmykov” is an abandoned castle in Russia. No results because there is no content on Commons.
- HISTOIRE DE L'HOPITAL DE YAKUSU
- “History of Yakusu Hospital”. Gets no results because it’s in French. Yakusu Hospital gets a small number of relevant results.
- akurey kollafirði
- Akurey is ambiguous in Icelandic—a couple of islands and a town have that name—so this specifies the one in Kollafjörður (a fjord), which is in Faxaflói bay. The Commons category Akurey (Faxaflói) seems to be a match.
- 바레인세계무역센터
- Korean “Bahrain World Trade Center”. No results because it is in Korean.
- <name of church>목사<name of person>
- All in Korean; 목사 is "pastor". Seems not to be notable; it's a church between Washington DC and Baltimore.
- Halstaat
- Either a small town in Austria or a thoroughbred horse. No results because no content.
- May Hill&docid=<id>&tbnid=<id>&vet=1&w=640&h=480&itg=1&hl=en-gb&source=sh/x/im
- May Hill is a category on Commons, and May Hill gets over 660K results.
- Villar de Mazarife
- Town in Spain. No results because no content.
- Zabłoty
- Village in Poland. No results because no content.
- wyśnierzyce nad pilicą
- Possible typo for Wyśmierzyce nad Pilicą. “Wyśmierzyce on the Pilica River”. No results because no content. Category and some results for Wyśmierzyce, and results for “nad Pilicą” for other towns on the Pilica River—just not this one.
- Category:Chemin du Coin-de-Terre
- A road in Geneva, Switzerland. Has since been created as a category.