User:TJones (WMF)/Notes/Review of Commons Queries

July 2020 — See TJones_(WMF)/Notes for other projects. See also T258297 and T252544.

I undertook a review of queries on Wikimedia Commons to get a sense of how people are using Commons, and how we might improve search on Commons.

Full Query Corpus Analysis

Quick Summary

In three months' worth of likely-human queries issued on Commons, over 90% are in the Latin script, about 50% are in English, almost 25% are names, and almost 10% are porn-related.

Among the most common queries, 8 of the top 10 and 66 of the top 100 are porn-related. But even the most common queries are not really that common: of the more than 1.09M unique (lightly normalized) queries, only 6 were searched 1,000 times or more, and only 660 were searched 50 times or more. Over 950K were unique. There is not really a head—it's pretty much all the long tail.

In a sample of 100 random queries (the long tail), 30 were specific things, 22 people, 14 places, 11 organizations, and 12 were porn. 60 queries were narrow and fairly specific, 17 were fairly broad, and 22 were in the middle. (Broad queries were often one word.)

In a sample of the 100 most common queries (the head-ish), 66 were porn, 7 were looking for "facts", 7 were specific things, 6 were current events, 5 were people. 24 queries were narrow and fairly specific, 46 were fairly broad, and 27 were in the middle. (Broad queries were often one word.)

Only 1.6% of queries used a namespace, 0.9% had a file extension. Boolean and special operators were very rare.

10% of queries got zero results. Less than 1% got a million results or more.

If we break queries on whitespace and punctuation (less than ideal, but easy), 66% of queries are one or two words; 93% are four words or fewer.

Data

I pulled three months' worth of Commons queries from mediawiki_cirrussearch_request to analyze—from April 15 to July 15, 2020.

The sample does not include all queries from the time period; I applied some of the usual filters that the search team has found to be useful to get a reasonable sample from "normal" human users:

  • We require the queries to have used the search box in the upper corner. This eliminates some bots, screen scrapers, and links that are queries.
  • We eliminate users who make more than 100 queries in a day. This helps prevent us from oversampling bots, power users and editors, script/gadget users, shared IP addresses, and other users who are either not "normal" or not human.

Other caveats:

  • This sample only includes queries performed on Commonswiki. It does not include queries from other wikis that also search Commons data (like Wikipedias do).
  • I performed some very minimal normalization on all the data to make it easier to process: I converted tabs and newlines in any query to spaces. This shouldn't change the results shown or the intent of the query, but it sure does make a tab-separated file easier to process. (See below for info on other kinds of normalization done to group queries together.)
  • Note that this is a different data source from the one Erik used for his Top N queries per day. See T257361.

The sample contains 1,532,070 (~1.5M) queries.

Scripts and Languages

The first thing I do with a pile of data, of course, is try to figure out what scripts and languages are contained in it.

Scripts

  • The vast majority of queries (1,414,693; 92.3%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.
    • 1,170,330 (76.4%) of queries consist only of A-Z (upper- and lowercase) and spaces.
    • Another 70,876 (4.6%) are only A-Z and the digits 0-9.
  • The next biggest groups are Arabic (17,675; 1.2%), Cyrillic (15,180; 1.0%), and CJK (25,239; 1.6%)—again possibly with additional numbers, punctuation, and symbols.
  • The 25,239 "CJK" queries include 13,974 queries that are CJK Unified Ideographs, 5,454 queries in Hangul/Korean, 3,007 queries in Japanese Katakana, 497 queries in Japanese Hiragana, and 2,307 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).

The breakdown by script is below:

17,675 Arabic
231 Armenian
1,784 Bengali
2 Bopomofo
2,307 CJK, Mixed
3 Canadian Syllabics
13 Carian
1 Cuneiform
15,180 Cyrillic
987 Devanagari
6 Egyptian
16 Ethiopic
681 Georgian
1 Glagolitic
992 Greek
67 Gujarati
8 Gurmukhi
5,454 Hangul
2,627 Hebrew
497 Hiragana
13,974 Ideographic
73 Kannada
3,007 Katakana
91 Khmer
22 Lao
1,414,693 Latin
192 Malayalam
1 Mongolian
190 Myanmar
7 N'Ko
13 Ol Chiki
48 Oriya
83 Sinhala
2 Syriac
1 Tai Tham
343 Tamil
46 Telugu
2 Thaana
1,176 Thai
8 Tibetan
6 Tifinagh
  • An additional 2,352 queries (0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Latin/CJK (~500), Cyrillic/Latin (~400), and Arabic/Latin (~300). My favorite query in this group is the mixed Cyrillic/Greek/Latin Jolly Zοmbіеѕ (the ο, і, е, and ѕ are Greek and Cyrillic look-alikes, not Latin letters).
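
The script bucketing was done with Unicode script regexes. Here is a minimal sketch of the approach in Python, assuming the third-party regex module (the standard re module doesn't support \p{Script} classes); my actual buckets were more fine-grained than this:

 import regex  # third-party module with Unicode script property support
 
 def script_bucket(query):
     """Return a rough script label for a query."""
     # Numbers, punctuation, symbols, and whitespace are allowed in any
     # bucket, so strip them before checking the script.
     core = regex.sub(r'[\p{N}\p{P}\p{S}\s]+', '', query)
     if not core:
         return 'symbols/numbers/punctuation only'
     for script in ('Latin', 'Arabic', 'Cyrillic', 'Hangul',
                    'Hiragana', 'Katakana', 'Han'):
         if regex.fullmatch(r'\p{%s}+' % script, core):
             return script
     return 'mixed or other'
 
 print(script_bucket('Jolly Zombies'))   # Latin
 print(script_bucket('Jolly Zοmbіеѕ'))   # mixed or other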

Numbers

A small number of queries are mostly numbers (a sketch of the matching patterns follows the list):

  • 2,448 (0.2%) are integers (a small number—just three—with invisibles or diacritics, and some—eleven—are longer numbers with commas)
  • 457 look like measurements (e.g., 3mm or 5x5)
  • 313 look like IP addresses
  • Plus a handful (69) of other numbers, including decimals, ordinals, hex numerals, malformed IP addresses, etc.
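
As promised, here is a rough sketch of the number patterns. These are abbreviated, hypothetical examples, not the full set I used:

 import re
 
 # Abbreviated example patterns; the real classification covered more
 # cases (decimals, ordinals, hex numerals, malformed IPs, etc.).
 INTEGER_RE     = re.compile(r'^\d+$|^\d{1,3}(,\d{3})+$')             # 2448 or 2,448
 MEASUREMENT_RE = re.compile(r'^\d+ ?(mm|cm|m|km)$|^\d+x\d+$', re.I)  # 3mm, 5x5
 IP_RE          = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')              # 127.0.0.1
 
 for q in ('2,448', '3mm', '5x5', '192.168.0.1'):
     for label, pattern in (('integer', INTEGER_RE),
                            ('measurement', MEASUREMENT_RE),
                            ('IP address', IP_RE)):
         if pattern.match(q):
             print(q, '->', label)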

Misc

There are a small number of additional text patterns not included above.

  • 2,720 (0.2%) queries look like identifiable web domains.
  • 626 queries look like email addresses.
  • 160 Latin-script acronyms (91 uppercase, 69 lowercase). Traditionally we don't handle acronyms very well in search, so at least there aren't a lot of them.

Symbols

A small number of queries are all symbols (711; e.g., $600) or punctuation (107; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).

126 additional queries have characters that my Unicode regexes identify as "unassigned" code points; these are mostly—but not entirely—emoji. (I assume most are emoji that were assigned code points more recently than the Unicode data my regexes rely on.)

Invisibles

1,390 queries (0.1%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.

If these aren't normalized well, they can screw up query results; however, they are clearly not a huge problem on Commons.
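
As a sketch, invisible characters can be flagged with Unicode general categories. This is an approximation of the check, not the exact code I used:

 import unicodedata
 
 def has_invisibles(query):
     """Flag bidi marks, control/format characters, and odd whitespace."""
     for ch in query:
         cat = unicodedata.category(ch)
         if cat in ('Cf', 'Cc'):        # format (RLM, ZWJ, ...) or control
             return True
         if cat == 'Zs' and ch != ' ':  # whitespace other than plain space
             return True
     return False
 
 print(has_invisibles('John\u200bSmith'))  # True (zero-width space)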

Languages

I took a random sample of 1,000 queries and tried to identify them by language.

The categorizations are almost certainly not perfect, but they should be close enough to get a sense of the proportions of different languages used on Commons.

As a general rule, I don't try to identify the "language" of names. North Americans in particular like to combine names from various ethnolinguistic origins, though they aren't the only ones. Some names—Maria is the most egregious—are too ambiguous to categorize. I make exceptions for names in fairly unambiguous scripts; for example, 엔리코 콜런토니 is arguably "in Korean", even though it's a very Italian name ("Enrico Colantoni") of a Canadian actor.

If a query includes a name and text in a particular language, I count that as in that language. So, Abraham Lincoln is a name, but birthplace of Abraham Lincoln would be categorized as English and local de nascimento de Abraham Lincoln would be categorized as Portuguese.

19 queries were "unidentifiable" because they weren't words (11q!) or were too ambiguous as to language (e.g., a single-word query that could be English, French, or Spanish).

I categorized 42 queries as "technical terms," which are either too ambiguous or not really words (xml, t-800). There was also 1 number and 1 measurement.

A few items are assignable to a given language even though they are in the wrong script. The Russian song В путь is transliterated as V Put or V Put', which doesn't mean anything in English, so I guess it's in Russian? Sure, why not—there were only a small handful. Also, there were two wrong-keyboard Russian queries that I noticed, which I also counted as Russian.

A couple of queries were mixed-language. I counted them as the predominant language and noted the other language.

The most common categories, then, are English queries (507; 50.7%) and names (237; 23.7%), with a smattering of technical terms (42; 4.2%), German (34; 3.4%), Spanish (28; 2.8%), French (27; 2.7%), Chinese (14; 1.4%), Russian (13; 1.3%), Italian (11; 1.1%), and others.

Non-language groups
19 unidentifiable
237 names (1 Category, 1 File)
42 technical terms
1 measurement
1 number
Languages
2 Arabic
1 Armenian
1 Bulgarian
2 Bengali
14 Chinese
1 Czech
1 Danish
4 Dutch
507 English
1 Finnish
27 French
34 German
2 Hebrew
1 Hungarian
3 Indonesian
11 Italian
7 Japanese (3 in transliteration)
8 Korean (1 in transliteration)
4 Latin (1 Category)
6 Persian
8 Polish
8 Portuguese
13 Russian (1 in transliteration, 2 wrong keyboard, 1 with some English)
28 Spanish (1 with some English)
2 Swedish
1 Tagalog
1 Tajik
1 Thai
1 Turkish

Query Patterns

Here are some potentially interesting patterns I noticed in the queries:

Query Frequency

We wanted to look at the "head, torso, and tail" of the distribution of queries by frequency—but there appears to be at most a tiny head and a long, long tail.

I did some very basic normalization of the queries for bucketing: I lowercased them and normalized whitespace (removing leading and trailing spaces, and reducing multiple spaces to one), so that " JoHN SMiTh " and "john smith" count as the same query.
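
In Python terms, the light normalization is roughly:

 def light_normalize(query):
     # Lowercase, trim, and collapse runs of whitespace to single spaces,
     # so " JoHN SMiTh " and "john smith" group together.
     return ' '.join(query.lower().split())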

There are 1,090,396 unique normalized queries (out of 1,532,070 queries).

I grouped the query frequencies into quasi-logarithmic binary/decimal buckets (1/2/5/10/20/50...), a scheme that is approximately logarithmic in both binary and decimal, relatively fine-grained, and human-friendly. I also added buckets for 3 and 4, since there are many queries with these very low frequencies.
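
A sketch of the bucketing function, with the extra 3 and 4 buckets special-cased:

 def freq_bucket(n):
     """Map a frequency to a 1/2/3/4/5-9/10-19/20-49/50-99/... bucket."""
     if n <= 4:
         return str(n)
     magnitude = 10 ** (len(str(n)) - 1)  # 1, 10, 100, 1000, ...
     for lo, hi in ((1, 2), (2, 5), (5, 10)):
         if lo * magnitude <= n < hi * magnitude:
             return '{}-{}'.format(lo * magnitude, hi * magnitude - 1)
 
 print(freq_bucket(7))     # 5-9
 print(freq_bucket(1371))  # 1000-1999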

Only 2 queries appeared more than 2,000 times, and only 4 appeared between 1,000 and 2,000 times. Altogether, only 660 queries (0.06%) appeared 50 or more times.

So, 99.94% of queries occur fewer than 50 times in 3 months, and 97.89% occur fewer than 5 times. It's all long tail.

Normalized Query Frequency Distribution

count bucket
955,635 1
77,294 2
23,539 3
10,918 4
15,451 5-9
4,990 10-19
1,909 20-49
433 50-99
138 100-199
62 200-499
21 500-999
4 1,000-1,999
2 2,000-4,999

mean count: 1.40506

For the graph I combined buckets 2-4 to stick with the pure quasi-logarithmic binary/decimal buckets, which show a very nice power-law distribution!


Where to break the head and tail in a distribution is generally subjective; however, Wikipedia has a page on the Head/tail Breaks algorithm, which breaks the head and tail at the mean value of the distribution. The mean frequency for the normalized queries is 1.40506, which means the head would be everything with a frequency of 2+ (12.4% of queries), and the tail would be all of the unique queries (87.6% of queries). Having a frequency of 2 (or even 10) out of over a million queries doesn't seem like the "head" to me, so I'm going to stick with my claim that there is no head!
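
For illustration, the head/tail break is just a split at the mean frequency:

 def head_tail_break(freqs):
     """Split a list of query frequencies at the mean (head/tail breaks)."""
     mean = sum(freqs) / len(freqs)
     head = [f for f in freqs if f > mean]
     tail = [f for f in freqs if f <= mean]
     return head, tail
 
 # With a mean of ~1.405, any query seen 2+ times lands in the "head" and
 # every unique query lands in the "tail", hence my claim of no real head.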

Query Intent

While talking to Cormac about this analysis, we discussed "modifiers" to queries, such as looking for things with specific colors—y'know, the kinds of things that might show up in structured data! It's a difficult and subjective line to draw, but I tried to separate essential elements from preferences (like the color of a car), and things that would be reified in Wikidata (e.g., "Chinese art") from things that would not (e.g., "ugly art"), with reifiable things not counting as "modified". There are not many queries with modifiers—none in the top 100 most common queries.

In the random sample of 1,000 queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 95 (9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc. So, roughly 9-10% of Commons queries are likely porn-themed.

I took a sub-sample of 100 queries from this set and tried to categorize them by intent, to compare to the head-ish top 100 queries:

  • 30 were about specific or general things.
    • 23 things hard to classify more specifically
      • 1 had additional modifiers/specifiers
    • 3 specific products
    • 2 logos (also counted as images)
    • 1 film
    • 1 list (of Latin phrases—??)
  • 22 were about specific people
    • 1 had additional modifiers/specifiers
    • 1 was about 2 specific people
  • 14 were about specific places
    • 1 had additional modifiers/specifiers
    • 1 was about an activity at that place
  • 12 were porn
  • 11 were about specific organizations
  • 3 were about general concepts
  • 2 were about an activity (one in a particular place)
  • 2 seemed to be looking for specific images (e.g., something fairly specific.jpg)
  • 1 was about art

I reviewed the top 100 most common queries and tried to categorize them by intent:

  • 61 were porn, and another 5 were likely porn. That's 2/3 of the most common queries.
    • The 4 non-English non-name higher-frequency queries were here: two German words and two Persian words.
    • There was one Category in the sample, and it was also in the porn category.
  • 7 were about "facts" (e.g., map of a place).
    • 1 list of...
  • 7 were about specific or general things.
  • 6 were about current events and topics in the news.
  • 5 were about non-pornographic celebrities or historical figures
  • 2 were about art or artists.
  • 2 were about specific places.
  • 2 were about tech topics or companies.
  • 3 I couldn't categorize (2 of them got 0 results).

Query Generality

I tried to categorize queries in various samples as narrow, broad, or somewhere in the middle.

  • From the sample of 1,000 random queries I categorized by language (representing the tail), I took a random subsample of 100 and categorized them by generality:
    • 60 were narrow (specific person, place, or object)
    • 22 were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
    • 17 were broad (many were one word); all but one of these were porn
    • 1 was uncategorizable.
  • From the top 100 most common queries (the stubby head):
    • 24 were narrow (specific person, place, or object)
    • 27 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
    • 46 were broad (often one word); all but one of these were porn
    • 3 were uncategorizable

Keywords and Specific Purposes

  • 25,047 queries (1.6%) specify a namespace. The most common are Category (15,519), File (7,044), and Template (823). 282 namespace queries were in Talk namespaces. There was one instance of a non-existent namespace: media. (A sketch of this kind of pattern matching follows this list.)
    • 83 queries use the character ː, which is normally used in phonetic transcription to indicate a lengthened vowel (less often a lengthened consonant). It's only used that way once in these 83 queries; 81 of them use it in place of a colon with a namespace—which does not work. (And there was one other weird one I couldn't figure out.)
  • 14,118 queries (0.9%) end in a file extension (but don't use the File: keyword). The most common are .jpg (6,482), .svg (3,142), and .png (1,786). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .stl, .tab, .tif, .tiff, .txt, .wav, .webm.
  • 13,239 queries (0.9%) have easily detected URL bits—e.g., ?q=query&thing=whatchamacallit. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search, though it isn't clear how the URL components are getting copied from Google to Commons. I suppose it could be a logging error of some sort. A partly sanitized example query is below (the source language is Polish, and kudłaty stwór means "shaggy creature").
    • Chewbacca&tbnid=12345678901234&vet=1&docid=12345678901234&w=3456&h=5184&q=kudłaty+stwór&hl=pl&source=sh/x/im
      • Searching for kudłaty stwór in images on google.pl does give the Polish Wikipedia article and image for Chewbacca as the 4th result... so something is going on here, but it isn't clear what.
    • Of the 12,275 queries with tbnid= in them, all but one start with what looks like a query in Latin script (the other one is Bengali).
      • Of the 7,273 tbnid queries with a q= parameter, all but about 25 are largely Latin script; the others are Arabic, Bengali, Cyrillic, Devanagari, Emoji, Greek, Hebrew, Korean, Tamil, and Thai.
  • Only a handful of queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
    • AND: 83 of 234 queries with AND seem to be using Boolean operators, even though they don't really do anything (everything is ANDed together by default).
    • OR: 44 of 77 queries with OR seem to be Boolean operators.
    • NOT: 2 of 17 queries with NOT seem to be Boolean operators.
    • !: None of the 475 queries with ! seem to be well-formed, intentional Boolean queries. Most are not formatted as Booleans—Welcome! is not a Boolean query. The ones that are formatted as Booleans look like typos, such as !mismatched quotes" or !st (1st typed with an errant shift). Others don't seem terribly useful—Category:!Dogs, for example, returns everything that is not in Category:Dogs. Fewer than 10 queries are plausible Boolean queries.
    • -: 533 queries plausibly use - as negation. I did a quick skim, and the large majority of them look to be using it intentionally. 70 queries start with - and most of them don't look to be using it intentionally (or at least not correctly), since it's very hard (but not entirely impossible) to imagine -172 is a useful query.
    • ~: 110 queries use ~. Almost half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Many look like typos for a hyphen, as in 1900~2000. About 15 look to be intentional use of ~.
    • (): A handful of the queries using AND, OR, or NOT as Booleans also use parens—even though they don't do anything in our current search system.
    • +: Ignoring tbnid queries, 2,304 queries use +. Most seem to be using it in place of a space.
  • There are no special keywords with colons other than namespaces in my sample. Carly asked about haswbstatement: so I looked a little harder for those and found two instances of haswbstatemen, six haswbstatement, and one sshaswb, none with any other search terms. There was one malformed query: haswbstatementP180=Q42133786, but it is also missing the colon.
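
As noted above, here is a minimal sketch of the kind of pattern matching used to flag keyword-ish queries. The namespace and file-extension lists are abbreviated examples, and the helpers are hypothetical, not the exact code I used:

 import re
 
 NS_RE  = re.compile(r'^(Category|File|Template|Talk)\s*[:ː]', re.I)
 EXT_RE = re.compile(r'\.(jpe?g|png|svg|gif|tiff?|pdf|ogg|webm)$', re.I)
 URL_RE = re.compile(r'[?&]\w+=')  # catches &tbnid=..., ?q=..., etc.
 
 def keyword_flags(query):
     flags = set()
     if NS_RE.match(query):
         flags.add('namespace')       # includes the bogus ː "colon"
     elif EXT_RE.search(query):
         flags.add('file extension')  # extension without a File: keyword
     if URL_RE.search(query):
         flags.add('URL bits')
     return flags
 
 print(keyword_flags('Category:Dogs'))  # {'namespace'}
 print(keyword_flags('chewbacca.jpg'))  # {'file extension'}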

Light vs Heavy Normalization

While talking to Erik about the Top N queries report he's putting together, he mentioned that he's doing a stronger form of normalization for that report: replacing all punctuation with spaces before normalizing whitespace.

This is more likely to group queries that get different results than my "light" normalization used above—though even the light normalization may do so because of the way we treat CamelCase search terms. The most obvious case is removing quotes from around a query: John Smith will get many more results than "John Smith". However, as Erik rightly pointed out, in most cases, such variants probably generally represent the same query intent.
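
A sketch of the heavier normalization, approximating "punctuation" as anything that isn't a word character or whitespace:

 import re
 
 def heavy_normalize(query):
     # Replace punctuation with spaces, then lowercase and collapse
     # whitespace, so John Smith and "John-Smith" group together.
     no_punct = re.sub(r'[^\w\s]', ' ', query)
     return ' '.join(no_punct.lower().split())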

It turns out that in my sample, light vs heavy normalization makes only a small difference: about a 1.5-2.5% increase in buckets other than the singleton bucket.

The most common change among the heavily normalized top 100 queries is stripping quotes. The most impactful change—in terms of increasing the number of queries grouped together—is stripping hyphens, though it only applied to one query.

bucket raw queries light norm heavy norm light→heavy diff%
1 1,041,577 955,635 945,944 -1.01%
2 72,776 77,294 78,374 1.40%
3 20,870 23,539 24,007 1.99%
4 9,330 10,918 11,101 1.68%
5-9 12,766 15,451 15,712 1.69%
10-19 4,054 4,990 5,090 2.00%
20-49 1,597 1,909 1,955 2.41%
50-99 329 433 438 1.15%
100-199 111 138 141 2.17%
200-499 48 62 62 0.00%
500-999 17 21 22 4.76%
1,000-1,999 3 4 4 0.00%
2,000-4,999 2 2 0.00%
mean freq 1.3168 1.40506 1.41485 0.70%

Distribution of Hits

Below is a summary of the distribution of results from all of the queries in the sample.

153,549 queries (10%) get zero results. 9,566 queries (0.62%) get more than a million results. The ideal number is probably somewhere in between.

Result Count Frequency Distribution

count bucket
153,549 0
48,409 1
30,234 2
22,071 3
18,006 4
59,228 5-9
68,978 10-19
110,981 20-49
96,986 50-99
109,809 100-199
164,736 200-499
129,358 500-999
126,694 1K-<2K
137,790 2K-<5K
74,174 5K-<10K
62,176 10K-<20K
53,881 20K-<50K
25,064 50K-<100K
14,318 100K-<200K
11,013 200K-<500K
5,049 500K-<1M
3,659 1M-<2M
2,808 2M-<5M
1,231 5M-<10M
768 10M-<20M
699 20M-<50M
401 50M+
For the graph I combined buckets 2-4 to stick with the pure quasi-logarithmic binary/decimal buckets.

Distribution of Token Counts

The number of tokens (roughly, words) in a query is an easy proxy for the complexity of a query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.

Here, we're using a very simple process of breaking tokens on spaces and punctuation. So, a long Chinese sentence would be counted as one token, and ain't would be counted as two, and .. ,, ;; -- would be counted as zero. It isn't perfect, but it gives us a reasonable approximation of what we have.
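
In code, the token counting used here is approximately:

 import re
 
 def token_count(query):
     # Split on runs of non-word characters; empty strings from leading
     # or trailing punctuation are discarded. "ain't" counts as two
     # tokens, and a punctuation-only query counts as zero.
     return len([t for t in re.split(r'\W+', query) if t])
 
 print(token_count("ain't"))        # 2
 print(token_count('.. ,, ;; --'))  # 0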

Of note, 1,011,221 queries (66.0%) are one or two tokens. 1,417,935 queries (92.6%) are one to four tokens.

All of the "zero-token" queries are strings of punctuation and symbols.

Of the 20 queries with 40 or more tokens, 4 are gibberish, and most of the rest seem to be captions from images (presumably looking for the original image, but who knows?). 5 are in French, 1 in Spanish, and 11 are in English.

Tokens Count Distribution

count tokens
107 0
353,268 1
657,953 2
293,343 3
113,371 4
48,063 5
22,677 6
11,543 7
7,344 8
4,068 9
2,857 10
2,597 11
3,032 12
3,556 13
2,964 14
2,038 15
1,206 16
708 17
409 18
234 19
135 20
109 21
73 22
74 23
54 24
149 25-29
118 30-39
17 40-48
1 58
2 65

Miscellaneous Odd or Interesting Queries

  • Erik's Top N report doesn't exclude anything, and covers all wikis. We noticed some politics-themed searches on English Wikipedia, like 2020 Texas US Senate Election, with no variation in whitespace or capitalization. I'm not sure I found the source of the query, but I did find a political website that helps you determine which politicians you align with by asking you questions about various topics. For each topic, they have a "Learn More" link which links to a search on English Wikipedia. So it is definitely possible that unexpected Commons searches (e.g., a relatively complex query with no variation) may be the result of links—which may not be publicly available.
  • I saw a handful of queries that look like attempts at SQL injection.
  • There are some random-looking queries, e.g. 510d47d9-4f0a-a3d9-e040-e00a18064a99, that turn out to be identifiers in collections that images have been taken from.

Zero-Results Query Sub-Corpus Analysis

After writing up the initial report, I realized that 10% of the queries get zero results, which is a lot, but not as many as we see on many Wikipedias. I investigated just the zero-results queries separately.

Zero-Results Summary

In three months' worth of likely-human queries issued on Commons, zero-results queries make up about 10% of all queries (which is less than the zero-results rate on Wikipedias). Subjectively, the zero-results queries seem to have less junk than on Wikipedia, and so may be more salvageable. Also, there seem to be more spelling errors/typos in the zero-results queries.

80% of the zero-results queries are in the Latin script (which is less than in the total corpus, which is 90% Latin text). Only 32% are in English (vs 50%), and roughly 25% are names (same as overall). Only 6.5% are porn-related (vs 9.5% overall).

Only 31 of the top 100 most common zero-results queries are porn-related, vs 66 overall.

Zero-results queries are more heavily skewed toward unique queries.

In a sample of 200 random zero-results queries (the long tail), 37% were about specific things, 20.5% people, 13% places, 5.5% facts, 3% organizations, and 6.5% porn. This is roughly similar to the full corpus, with a bit less porn. 60% of zero-results queries were narrow and fairly specific, 10% were fairly broad, and 21.5% were in the middle. (Broad zero-results queries were often one word.) This is very similar to the full corpus.

In a sample of the 100 most common zero-results queries (the head-ish), 31 were porn, 28 were specific things, 23 were people. This is much more specific and has half the porn of the full corpus. 57 queries were narrow and fairly specific, 31 were fairly broad, and 9 were in the middle. (Broad zero-results queries were often one word.) This is skewed much more toward narrow queries compared to the full corpus.

Breaking on whitespace and punctuation (less than ideal, but easy), 60% of queries are one or two words; 86% are four words or fewer. Both figures are slightly lower than in the full corpus. More than half of all high-token queries (more than 10 tokens) get zero results.

Spelling errors seem more common in the zero-results queries (and there is less junk in the zero-results queries than in Wikipedia data); 32% of the random sample of zero-results queries have spelling errors, and 38% of the top 100 zero-results queries have spelling errors. "Did you mean" suggestions and the completion suggester do okay, but could be much better. The current completion suggester doesn't have much to work with because it is limited to page/file/category names, which are not always good matches with what people are searching for. T250436 could be a big help!

The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or similar "non-organic" source.

Zero-Results Data

This is a sub-sample of the earlier-described data set, limited to queries that got zero results.

The sample contains 153,525 zero-results queries (almost exactly 10.0% of the full sample).

Zero-Results Scripts and Languages

Zero-Results Scripts

The majority of zero-results queries (125,326; 81.6%) are in the Latin script—possibly with additional numbers, fairly common punctuation, symbols, etc.—and didn't fall into any other category.

  • This is notably less than the percentage of queries overall that are in the Latin script (92.3%).
  • 88,804 (57.8%) of zero-results queries consist only of A-Z (upper- and lowercase) and spaces.
    • This group is notably smaller than in the full corpus (57.8% vs 76.4%).
  • Another 5,434 (3.5%) are only A-Z and the digits 0-9.

The next biggest groups are Arabic (6,618; 4.3%), Cyrillic (2,361; 1.5%), and CJK (2,733; 1.8%)—again possibly with additional numbers, punctuation, and symbols.

  • The Arabic percentage is notably higher (4.3% vs 1.2%).
  • The 2,733 "CJK" zero-results queries include 1,737 queries in Hangul/Korean, 432 queries in Japanese Katakana, 1 query in Japanese Hiragana, 139 queries that are CJK Unified Ideographs, and 424 that are "mixed" CJK characters (mostly Japanese Hiragana or Katakana with Chinese characters).
    • The mix here is very different from the general queries, with significantly more Hangul, and significantly fewer Ideographs and Hiragana.

The breakdown by script is below:

6,618 Arabic
78 Armenian
758 Bengali
424 CJK, Mixed
8 Carian
1 Cuneiform
2,361 Cyrillic
416 Devanagari
5 Egyptian Hieroglyphics
9 Ethiopic
235 Georgian
354 Greek
30 Gujarati
2 Gurmukhi
1,737 Hangul
374 Hebrew
1 Hiragana
139 Ideographic
33 Kannada
432 Katakana
69 Khmer
18 Lao
125,326 Latin
102 Malayalam
102 Myanmar
4 N'Ko
2 Ol Chiki
9 Oriya
35 Sinhala
1 Tai Tham
87 Tamil
10 Telugu
609 Thai
8 Tibetan
3 Tifinagh

An additional 1,182 zero-results queries (0.8%, up from 0.2%) are mixed-script (here "scripts" include less common symbols, punctuation, and emoji). The largest groups are Arabic/Latin (~240), Cyrillic/Latin (~200), Latin/CJK (~160), and Hangul/Latin (~120). Not surprisingly, the mixed Cyrillic/Greek/Latin Jolly Zοmbіеѕ got zero results.

Zero-Results Numbers

A small number of zero-results queries are mostly numbers:

  • 923 (0.6%) are integers.
  • 1 looks like a measurement (e.g., 3mm or 5x5)
  • Plus a handful (7) of other numbers, including decimals, etc.

Zero-Results Misc

There are a small number of additional text patterns not included above.

  • 666 (0.4%) zero-results queries look like identifiable web domains.
  • 572 (0.4%) zero-results queries look like email addresses.

Zero-Results Symbols

A small number of zero-results queries are all symbols (74; e.g., $600) or punctuation (97; ,,,,,,,,,,,,,,,,,,) or emoji (79; 🤣).

Zero-Results Invisibles

393 zero-results queries (0.3%) include invisible characters: bi-directional markers, control characters, formatting characters, or odd whitespace characters.

Zero-Results Languages

I took a random sample of 200 zero-result queries and tried to identify them by language.

Reminders:

  • I generally don't try to identify the language of names.
  • If a zero-results query includes a name and text in a particular language, I count that as in that language.

14 zero-results queries (7.0%) were "unidentifiable" because they weren't words (11q!). This is significantly more than the full corpus (7.0% vs 1.9%).

I categorized 12 (6.0% vs 4.2% in the full corpus) zero-results queries as "technical terms," which are either too ambiguous or not really words (xml, t-800). There was also 1 number.

A couple of zero-results queries were mixed-language. I counted them as the predominant language and noted the other language.

The most common categories, then, are English queries (63; 32.5% vs 50.7%) and names (47; 23.5% vs 23.7%), with a smattering of technical terms (12; 6.0% vs 4.2%), German (12; 6.0% vs 3.4%), Arabic (9; 4.5% vs 0.2%), Spanish (8; 4.0% vs 2.8%), Korean (5; 2.5% vs 0.8%), French (4; 2.0% vs 2.7%), and others.

Non-language groups
14 unidentifiable
47 names (1 Category, 1 File—same as last time!)
12 technical terms
1 number
Languages
1 Albanian
9 Arabic (1 with some English)
1 Dutch
63 English
4 French
12 German
1 Hebrew
1 Icelandic
2 Indonesian
2 Italian
2 Japanese
5 Korean
1 Malayalam (in transliteration)
1 Norwegian
3 Persian
3 Polish
2 Portuguese
1 Russian
8 Spanish (1 with some English)
1 Swedish
1 Tamil
2 Turkish

Zero-Results Query Patterns

Here are some potentially interesting patterns I noticed in the zero-results queries...

Zero-Results Query Frequency

The frequency distribution of the zero-results queries has a stronger skew towards unique queries, even with light normalization, with the mean frequency being 1.08 (vs 1.41 for all queries).

There are 142,406 unique normalized zero-results queries (out of 153,525 zero-results queries).

Only 2 zero-results queries appeared more than 200 times, only 2 between 50 and 100. All together, only 35 zero-results queries (0.02%) appeared more than 20 times.

So, 99.98% of zero-results queries occur fewer than 20 times in 3 months, and 99.7% occur fewer than 5 times. It's all long tail.

Normalized Zero-Results Query Frequency Distribution
count bucket
136,530 1
4,268 2
828 3
331 4
329 5-9
85 10-19
31 20-49
2 50-99
2 200-499
mean count: 1.07808

Zero-Results Query Intent

In the random sample of 200 zero-results queries that I tried to identify by language, I also categorized them as "porn" or not while I was working through the list. 13 (6.5% vs 9.5%) were about or likely about porn, porn actors, nudity, sexual acts, etc.

I also tried to categorize this set by intent, to compare to the head-ish top 100 zero-results queries:

  • 74 were about specific or general things.
    • 5 were about specific events
    • 3 specific products
    • 2 films
    • 1 song
    • 1 website
  • 41 were about specific people
  • 26 were about specific places
  • 13 were porn
  • 11 were about "facts"
  • 6 were about specific organizations
  • 6 were about general concepts
  • 4 were about art
  • 8 were junk
  • 4 were categories
  • 3 seemed to be looking for specific files (e.g., something fairly specific.jpg)

I reviewed the top 100 most common zero-results queries and tried to categorize them by intent:

  • 31 were porn
  • 28 were about specific or general things.
    • 5 films
    • 5 websites
  • 23 were about specific people
  • 5 were about specific organizations
  • 3 were about specific places
  • 2 were about "facts"
  • 2 were about general concepts
  • 2 malformed keywords ("category:portal:mathematics")
  • 3 I couldn't categorize
  • 1 seemed to be looking for a specific file

Zero-Results Query Generality

I tried to categorize zero-results queries in various samples as narrow, broad, or somewhere in the middle.

From the sample of 200 random zero-results queries that I had categorized by language (representing the tail), I also categorized them by generality:

  • 120 (60.0%) were narrow (specific person, place, or object)
  • 43 (21.5%) were in the middle; a somewhat specific category of things or type of thing (e.g., smart home)
  • 20 (10.0%) were broad (many were one word); all but one of these were porn
  • 17 (8.5%) were uncategorizable.

From the top 100 most common zero-results queries (the stubby head):

  • 57 were narrow (specific person, place, or object)
  • 9 were in the middle; a somewhat specific category of things or type of thing (e.g., hyena cub)
  • 31 were broad (often one word); all but one of these were porn
  • 3 were uncategorizable.


Zero-Results Keywords and Specific Purposes

  • 7,292 zero-results queries (5.1% vs 1.6%) specify a namespace. The most common are Category (5,809), File (970), and Template (206). 60 namespace zero-results queries were in Talk namespaces. The previously mentioned non-existent namespace, media, got zero results.
    • 70 zero-results queries use the character ː, which is normally used in phonetic transcription, instead of a colon in an obvious namespace-style query. (There are 81 such uses of ː in the whole corpus.)
  • 2,077 zero-results queries (1.5%) end in a file extension (but don't use the File: keyword). The most common are .jpg (1,211), .png (236), and .svg (197). Others include .djvu, .gif, .jpeg, .js, .ogg, .ogv, .pdf, .php, .srt, .tab, .tif, .wav, .webm.
  • 13,193 zero-results queries (9.3% vs 0.9%) have easily detected URL bits. The most common elements are tbnid= and source=sh/x/i, which seem to come from Google image search. As most of the queries with URL bits get zero results, these are pretty much the same set.
  • Only a handful of zero-results queries use Boolean or other special operators or characters. It can be hard to be 100% sure about user intent, but Portland, OR and PRIDE AND PREJUDICE are probably not intending to use Boolean operators. So, based on context and capitalization, I've done my best to categorize them.
    • AND: 9 of 23 zero-results queries with AND seem to be using Boolean operators.
    • OR: There are no zero-results queries with OR.
    • NOT: Neither of the 2 zero-results queries with NOT seem to be Boolean operators.
    • -: 30 zero-results queries plausibly use - as negation. "Covid -19" is common among the ones that do not seem to be using it intentionally as negation.
    • ~: 37 zero-results queries use ~. More than half of them are of the form User~commonswiki, which are likely leftovers from implementing unified login. Maybe 1 looks to be intentional use of ~.
    • +: Ignoring tbnid zero-results queries, 301 queries use +. Most seem to be using it in place of a space.

Zero-Results Distribution of Token Counts

The number of tokens (roughly, words) in a zero-results query is an easy proxy for the complexity of the query. It's not perfect, but it is easy to calculate—mostly. For spaceless languages (Chinese, Japanese, Korean, Thai, and others), counting the actual words is much more difficult.

Of note, 86,230 zero-results queries (60.6% vs 66.0%) are one or two tokens. 122,420 zero-results queries (86.0% vs 92.6%) are one to four tokens.

All of the "zero-token" zero-results queries are strings of punctuation and symbols.

Given that zero-results queries are 10% of all queries in the sample, the table below shows where zero-results queries are over-represented.

  • "Zero-token" queries are punctuation and symbols, so it isn't too surprising that most of them get zero results.
  • Queries of 7 or more tokens are more than twice as likely to get zero results.
  • For queries with more than 10 tokens, more than half get zero results.
Tokens Count Distribution
count (all) count (zero-results) zero/all tokens
107 96 89.72% 0
353,268 38,470 10.89% 1
657,953 47,760 7.26% 2
293,343 24,288 8.28% 3
113,371 11,902 10.50% 4
48,063 6,985 14.53% 5
22,677 4,132 18.22% 6
11,543 2,497 21.63% 7
7,344 1,605 21.85% 8
4,068 1,086 26.70% 9
2,857 1,011 35.39% 10
2,597 1,490 57.37% 11
3,032 2,303 75.96% 12
3,556 3,017 84.84% 13
2,964 2,671 90.11% 14
2,038 1,794 88.03% 15
1,206 1,032 85.57% 16
708 569 80.37% 17
409 312 76.28% 18
234 162 69.23% 19
135 81 60.00% 20
109 46 42.20% 21
73 30 41.10% 22
74 33 44.59% 23
54 24 44.44% 24
149 61 40.94% 25-29
118 57 48.31% 30-39
17 8 47.06% 40-48
1 1 100.00% 58
2 2 100.00% 65

Zero-Results Spelling Errors

I noticed more obvious spelling errors in the zero-results sample than in the larger sample, so I tagged items that seemed to be spelling errors. Spelling errors in languages that use the Latin script are easier for me to detect, so there may be more that I missed.

65 (32.5%) of the 200 randomly sampled zero-results queries look like spelling errors, in several languages: mostly English, but also German, Spanish, French, Polish, Portuguese, and Swedish.

  • 26 (13.0%) had good Did You Mean suggestions. 8 DYM suggestions were bad, and 7 were mediocre.
  • 5 (2.5%) had good completion suggester suggestions. 1 was mediocre.
  • 22 (11.0%) had no useful corrections.

38 (38.0%) of the 100 top zero-result queries look like spelling errors, one in Spanish, the rest in English.

  • 23 (23.0%) had good Did You Mean suggestions. 2 DYM suggestions were bad, and 2 were mediocre.
  • 7 (7.0%) had good completion suggester suggestions. 1 was bad, and 2 were mediocre.
  • 8 (8.0%) had no useful corrections.

The completion suggester often didn't have much to work with, because there are no pages, files, or categories that match the text of the query, though they could match part of the query. For example, nothing matches Waashington state flag, though Waashington by itself gets several suggestions which start with Washington.

Erik's planned work (T250436) on improving suggestions could help a lot with these kinds of queries on Commons.

Zero-Results Miscellaneous Odd or Interesting Queries

Some of the queries that look like attempts at SQL injection unsurprisingly get zero results.

The most common zero-results queries are very specific and don't show much variation under normalization (e.g., variation in capitalization or punctuation), which I interpret as either one person repeating the search over and over, someone linking to the search results, or similar "non-organic" source.

Schematized Top 8 Queries with Frequency:

  • 318 Internátional website.com Airport
    • international airport with a website in the middle of its name
    • all with the same capitalization and accents!
  • 232 1234abcd
  • 82 "Firstname Lastname"
  • 54 Merkel Potrait / merkel potrait / Merkel potrait
    • This one looks "organic", because of the variation. I guess portrait is hard to spell. I found several other instances of potrait—though not this many—as well as mnerkel potrait, and other instances of angela maerkel, angela merkeöl, and Angela Merklel.
  • 45 Firstname Middlename Lastname
  • 36 porn related searc
    • porn-related search with the last letter of the last word missing
  • 34 titleof famous painting
    • famous painting title with two words run together

Other than Merkel potrait, all instances of these are identical under heavy normalization (lowercasing, removing punctuation, and normalizing whitespace).

Zero-Results Examples

Carly asked for some examples of people and places, so I reviewed the items I had identified and provided notes here. These have all been reviewed for PII, and names of non–public figures have been omitted or <schematized>. tbnid examples have also had their <ids> removed, since I don't know whether those might be recoverable somehow. (I doubt it, but why take a chance?)

Note: Looking at these much more closely has led to identifying a few more likely spelling errors, but I’m not going to go back and correct the stats for now. They are still in the right ballpark.

People:

  • Cwnciones ana gsbryel
    • Mexican singer Ana Gabriel; the first word wants to be canciones ("songs"); no results because of typos and extra words
  • quien es <user> wikimedia commons
    • “who is <user>”; <user> is a specific user; no results because of extra words
  • Johnmichaelelambo
  • cornelious mcarthy
    • typo for artist Cornelius McCarthy, known for his male nudes; presumably the searcher was looking for that, but based on his date of birth, his work is likely all under copyright and can’t be on Commons.
  • mahatma ganfhi
  • <firstname> macdonalf
    • typo for MacDonald; auto-corrected by DYM, but no obviously relevant results. Looks like a non-notable person.
  • Shamsia hassan
    • typo for artist Shamsia Hassani; gets autocorrected by DYM and has good results.
  • attila hiltman
  • <firstname> Sedighei
    • likely typo for Sedighi; does not appear to be a notable person
  • Stuhlpfarrer
    • on German Wikipedia, that’s Karl Stuhlpfarrer; zero results probably because of no relevant content at the time (now 3 hits)
  • Brian Cox (physicist)&tbnid=<id>&vet=1&docid=<id>&w=960&h=1440&q=Professor+Brian+Cox&source=sh/x/im
    • Brian Cox (physicist) is a fine query with relevant results, and the name of a category
  • Evil Queen (Disney)&tbnid=<id>&vet=1&docid=<id>&w=256&h=228&q=Snow+White's+stepmother&hl=en-US&source=sh/x/im
    • Evil Queen (Disney) is also a fine query with relevant results, and the name of a category
  • சுவாமி நிரலாம்பா (Gets results through MediaSearch)
  • Christine Ballestrero.JPG
    • likely typo for Cristina Ballestero, a singer/actress; no Wikipedia article, but she’s on YouTube and IMDB. No file with this name; no content on Commons for either spelling.
  • Category:nicolo de albate
  • Photo adjiani 2020
    • Likely misspelling of actress Isabelle Adjani; searching for adjiani doesn't get good results or DYM suggestions; searching for adjani gets results, but not from 2020.

Additional Specific People:

I got tired of looking up people to make sure they didn't need to be omitted, so I summarized the rest:

  • 10 people with a Wikipedia page (not necessarily on enwiki)
    • 1 with extra numbers in the query
    • no results because there is no content
  • 2 misspelled names that would get decent results on Commons
    • no results because of typos; not corrected by DYM
  • 2 names that don’t seem to belong to anyone notable (even to Google)
    • no results because there is no content
  • 3 people trying to be famous online but not yet notable enough for Wikipedia
    • 1 is a porn performer
    • no results because there is no content
  • 1 tbnid query
    • no results because of all the extra junk
  • 2 people mentioned in a Wikipedia page
    • 1 is a fictional character
    • no results because there is no content

More people who are less specific:

  • נערת איגרו ף
    • Hebrew; looks to be extra spaces in נערת איגרוף (“boxing girl”)—not as specific as most queries; probably gets no results because it’s in Hebrew, but the typo doesn’t help; female boxing in English gets relevant results.
  • penemu pesawat terbang
    • Indonesian for “inventor of the airplane”; gets no results for being in Indonesian, but doesn’t get good results in English either.
  • ネトウヨ
    • Japanese for Netouyo (“Japanese Internet rightists”); no results (or no relevant results) in Japanese or several versions in English (Netto-uyoku, Net uyoku, Netouyo, Netōyo)

Places:

  • Mnandilocation
    • run-together Mnandi location, which is more of an informational query. The separated form gets a few results; Mnandi by itself gets more, of course—but still not many.
  • heindentor
    • typo for Heidentor, which gets results and is a Category
  • woodland campfl
    • Camp Woodland, FL is a place. Still no good results for woodland camp fl or woodland camp florida.
  • Hotel longyarbyen
  • محافضه اربي ل
    • extra space in محافضه اربيل; likely typo for محافظة اربيل (“Erbil Governorate”), which gets results on Commons.
  • مدينة خريبق ه
    • extra space in مدينة خريبقه, likely typo for مدينة خريبكة (“City of Khouribga”), which gets a few results; خريبكة (“Khouribga”) by itself gets more results, of course.
  • opponitzer tunnel
    • Seems to be referring to a tunnel in Opponitz; no results because no content.
  • werderau maiacherstraße siedlung
    • Werderau is a district of Nuremberg; Maiach is an adjacent district. Siedlung is a settlement. Zero results probably because there is no relevant content tagged in German.
  • Armenian Heritage Park&tbnid=<id>&vet=1&docid=<id>&w=2592&h=1683&q=armenian+heritage+park&source=sh/x/im
    • Armenian Heritage Park is a category and also returns lots of results
  • Gush Katif Airport
    • zero results because there’s no content by that name
  • Legaslative district of Cloverport ky
    • typo for legislative district of Cloverport ky, which DYM corrects.
  • Washington, DC Metropolitan Area Special Flight Rules Area&tbnid=<id>&vet=1&docid=<id>&w=900&h=692&q=dc+flight+restricted+zone&hl=en-US&source=sh/x/im
    • not a Category, but Washington, DC Metropolitan Area Special Flight Rules Area gets plenty of results
  • Weeki Wachee River&docid=<id>&tbnid=<id>&vet=1&w=1200&h=900&itg=1&hl=en-US&source=sh/x/im
    • Weeki Wachee River is a category and gets other results, too
  • Castle of St Peters Bodrum
    • Bodrum Castle is also known as the "Castle of St. Peter". English text processing does not stem Peters to Peter (probably to keep the last name Peters and the first name Peter separate). Castle of St Peters Bodrum currently gets 11 results; Castle of St Peter Bodrum gets 112.
  • Castillo de Kalmykov
    • “Castle of Kalmykov” is an abandoned castle in Russia. No results because there is no content on Commons.
  • HISTOIRE DE L'HOPITAL DE YAKUSU
    • “History of Yakusu Hospital”. Gets no results because it’s in French. Yakusu Hospital gets a small number of relevant results.
  • akurey kollafirði
    • Akurey is ambiguous in Icelandic—a couple of islands and a town have that name—so this specifies the one in Kollafjörður (a fjord), which is in Faxaflói bay. The Commons category Akurey (Faxaflói) seems to be a match.
  • 바레인세계무역센터
    • Korean “Bahrain World Trade Center”. No results because it is in Korean.
  • <name of church>목사<name of person>
    • All in Korean; 목사 is "pastor". Seems not to be notable; it's a church between Washington DC and Baltimore.
  • Halstaat
    • Either a small town in Austria or a thoroughbred horse. No results because no content.
  • May Hill&docid=<id>&tbnid=<id>&vet=1&w=640&h=480&itg=1&hl=en-gb&source=sh/x/im
    • May Hill is a category on Commons, and May Hill gets over 660K results.
  • Villar de Mazarife
    • Town in Spain. No results because no content.
  • Zabłoty
    • Village in Poland. No results because no content.
  • wyśnierzyce nad pilicą
    • Possible typo for Wyśmierzyce nad Pilicą. “Wyśmierzyce on the Pilica River”. No results because no content. Category and some results for Wyśmierzyce, and results for “nad Pilicą” for other towns on the Pilica River—just not this one.
  • Category:Chemin du Coin-de-Terre
    • A road in Geneva, Switzerland. Has since been created as a category.