User:TJones (WMF)/Notes/Survey of Zero-Results Queries
July/August 2015 — See TJones_(WMF)/Notes for other projects.
Review of a multilingual 500K sample
Introduction
In an effort to reduce the rate of Wikipedia search queries that produce no results (see the Discovery team's proposal), I've undertaken a manual review of three batches of 500,000 full-text queries that returned no results (taken from the top 52 wikis, those with 100K+ articles; for future reference, at this time that's en, sv, de, nl, fr, war, ru, it, ceb, es, vi, pl, ja, pt, zh, uk, ca, fa, sh, no, ar, fi, id, ro, hu, cs, sr, ko, ms, tr, min, eo, kk, eu, da, sk, bg, hy, he, lt, hr, sl, et, uz, gl, nn, vo, la, simple, el, hi, & ce).
Samples
My first sample ("7/24") is the first 500,000 zero-result full-text queries from the 2015-07-24 Cirrus Search Request log. The queries are time-stamped from 2015-07-23 07:51:29 to 2015-07-23 10:11:29. (The time zone is not indicated, but I assume it is consistent from file to file.) I reviewed this sample the most extensively, reviewing patterns I had previously found in a sample of 100,000 similar queries restricted to enwiki, as well as looking for new patterns.
I also reviewed similar samples of 500,000 zero-result full-text queries from the Cirrus Search Request logs dated 2015-07-10 ("7/10", time-stamped from 2015-07-09 07:43:26 to 2015-07-09 10:20:40) and 2015-07-17 ("7/17", 2015-07-16 07:42:37 to 2015-07-16 09:52:23). In these samples I only looked for the patterns I had previously identified in the 7/24 sample.
Caveats and Limitations
An important part of this process has been looking for patterns of queries that would not show up when listing the individual top zero-result queries. However, I could not manually review every unique query individually in a timely fashion, so I have resorted to heuristics (mostly grep patterns) for counting instances of the various patterns. The numbers are not exact, but they sometimes vary significantly from sample to sample anyway. I do believe that large systematic query patterns have been identified.
The samples were limited to the first 500K relevant queries in each log file, and do not represent the full 24-hour day.
This review is necessarily subjective and also limited by my familiarity with the languages and writing systems involved. (Hence, there's often more detail in enwiki and the top wikis in various Romance languages.)
Recurring Patterns
I've sorted the patterns by maximum frequency of occurrence in my three samples. Unless otherwise noted, we haven't yet tracked down a source for the queries, and their intent is generally unclear.
DOI
Examples:
- "10.3897/zookeys.457.6760" OR "http://zookeys.pensoft.net/articles.php?id=4267"
- "10.1371/journal.pntd.0003900" OR "http://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0003900"
- "10.3332/ecancer.2013.301"
- "10.7821/naer.1.1.2-6"
We had between 15,393 and 96,998 such queries in our samples, representing 3.08% to 19.40% of all zero-results queries. And while there were more in enwiki, there were many in these wikis as well: nl, ja, zh, war, vi, uk, sv, pt, pl, no, ko, it, id, hu, fr, fi, fa, de, cs, ceb, ca, ar, es, ru.
We were able to track the source of these queries down to a software package called Lagotto, used to track references to articles made online and elsewhere.
These queries are well-formed and can return results, but many do not because not all published academic articles are referenced in Wikipedia.
The Lagotto developer indicated that some traffic may be the result of an error in a demo, and so it may decrease, but will not disappear.
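DOI queries are regular enough to count with a simple heuristic. The sketch below shows one way such a count could be done; it is not the grep pattern used for this review. The regular expression and the name `looks_like_doi_query` are illustrative assumptions.

```python
import re

# Assumed heuristic, not the pattern used in the original review:
# "10.", a 4-9 digit registrant code, a slash, then a non-empty suffix.
DOI_RE = re.compile(r'10\.\d{4,9}/[^\s"]+')

def looks_like_doi_query(query: str) -> bool:
    """Return True if the query contains something shaped like a DOI."""
    return DOI_RE.search(query) is not None

# Two example queries from this section plus one unrelated query.
examples = [
    '"10.3897/zookeys.457.6760" OR "http://zookeys.pensoft.net/articles.php?id=4267"',
    '"10.3332/ecancer.2013.301"',
    'Meryl Streep',
]
flags = [looks_like_doi_query(q) for q in examples]
```

A pattern this loose will also flag non-DOI strings that happen to have the same shape, so the resulting counts should be read as estimates.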
Unix Timestamps
Examples:
- 1431786835781:بيت لحم
- 1436436482196:Илюзия
- 1432198699732:Meryl Streep
We had between 26,351 and 42,650 such queries in our samples, representing 5.27% to 8.53% of all zero-results queries. They were spread across many wikis, with substantial numbers in en, it, ru, ja, fa, tr, nl, he, ar, id, cs, hi, vi, ro, hu, and uk.
Though the leading number looks like a Unix timestamp (in milliseconds since the epoch), the values don't make sense as query times: the times indicated span from last year to next month.
These queries do not return results, though removing the number and colon from the beginning of the query results in a title match on the relevant wiki in all the examples tested.
We tracked down these queries to a bug in the Wikipedia Mobile App, which has since been fixed.
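To illustrate the observations above, here is a minimal sketch that splits off the numeric prefix and decodes it. It assumes the prefix is a 13-digit epoch value in milliseconds; the function name `split_timestamp_query` is hypothetical and this is not the Mobile App's code.

```python
import re
from datetime import datetime, timezone

# Assumption: the prefix is a 13-digit epoch value in milliseconds,
# followed by a colon and the query the user actually typed.
PREFIX_RE = re.compile(r'^(\d{13}):(.*)$', re.DOTALL)

def split_timestamp_query(query: str):
    """Return (utc_datetime, remaining_query), or None if there is no prefix."""
    match = PREFIX_RE.match(query)
    if match is None:
        return None
    millis, rest = match.groups()
    when = datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)
    return when, rest.strip()

result = split_timestamp_query('1432198699732:Meryl Streep')
no_prefix = split_timestamp_query('Meryl Streep')
```

Under that assumption the example decodes to a date in May 2015, and the remaining text is the plain title that does match on the wiki.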
"Article_title" AND "title of link taken from article"
Examples:
- "Argentine_football_league_system" AND "Football in Moldova"
- "Argentine_football_league_system" AND "Football in Mongolia"
- "Argentine_football_league_system" AND "Football in Mozambique"
- "Argentine_football_league_system" AND "Football in Papua New Guinea"
We had an estimated 8,174 to 16,657 such queries in our samples, representing 1.63% to 3.33% of all zero-results queries. These are generally restricted to enwiki.
These queries seem to consist of a quoted article title (with underscores) ANDed with the quoted title of an article (without underscores) linked to in the first article. There can be hundreds of different queries with the same first component.
Variants of the queries with spaces instead of underscores or with neither quotes nor underscores do not return results. Searching for just the first component does give results.
All of these queries are coming from a single IP address. We are trying to contact the owner. We have at least temporarily blocked the IP for failing to adhere to our terms of use and failing to contact us.
TV Episodes / Movies—"..." film
Examples:
- "88 Minutos" film
- "Castle S1E1" film
- "Como treinar o seu Dragão 2 Filmes Completos Dublados" film
- "30 Rock - Season 3 S3E18" film
We had between 7,878 and 8,794 such queries in our samples, representing 1.58% to 1.76% of all zero-results queries. These were common in the following wikis: en, nl, de, fr, ja.
These queries consist of quoted material (generally a movie or TV show title, often followed by a season/episode number, e.g. S3E18), followed by the word film, even if the quoted material is not a film.
Many of the titles used in the queries (88 Minutos, Como treinar o seu Dragão 2) return results in the appropriate language wiki.
We tracked down these queries to a media player, which searches for the title of the file being played.
quot
Examples:
- quot Anesthesia quot
- quot Albert Payne quot
- Canberra AND quot Andy Fisher quot
- Moira East quot
We had between 5,888 and 7,768 such queries in our samples, representing 1.18% to 1.55% of all zero-results queries. These were generally found in enwiki.
These queries contain the word quot. It appears that quotation marks in the original query were converted to HTML entities (&quot;) and then sanitized (removing & and ; from the query), leaving quot.
Many of the queries without the quot's are exact matches for article titles, and many match with the quot's either dropped or converted to straight quotation marks (").
These are coming from the National Library of Australia. They have been contacted.
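The suspected mechanism can be reproduced in a few lines. This is a sketch of the hypothesis only, not the National Library of Australia's actual code. It assumes the sanitizer replaces & and ; with spaces, which matches the spacing seen in the examples. The names `garble` and `drop_quot` are made up for illustration.

```python
import html
import re

def garble(query: str) -> str:
    """Entity-encode the query, then blank out '&' and ';' (assumed sanitizer)."""
    encoded = html.escape(query, quote=True)   # '"' becomes '&quot;'
    stripped = re.sub(r'[&;]', ' ', encoded)   # assumed sanitization step
    return ' '.join(stripped.split())          # collapse runs of whitespace

def drop_quot(query: str) -> str:
    """Heuristic repair: remove stand-alone 'quot' tokens from a query."""
    return ' '.join(tok for tok in query.split() if tok != 'quot')

garbled = garble('"Albert Payne"')
repaired = drop_quot('Canberra AND quot Andy Fisher quot')
```

Dropping the tokens is the simpler of the two repairs mentioned above. Converting them back to quotation marks would also need to handle unbalanced cases such as "Moira East quot".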
term+term+term country
Examples:
- ópera+del+estado+de+hamburgo Bangladés
- zanthoxylum+thomasianum Bélgica
- finance+and+revenue+f+c Germany
- finance+and+revenue+f+c Ecuador
We had an estimated 3,437 to 6,725 such queries in our samples, representing 0.69% to 1.35% of all zero-results queries. These were generally found in eswiki, with some in enwiki. The country names are generally in Spanish or English, though some were in other languages.
The countries included in the query don't seem to necessarily have any relationship with the other search terms.
Many of the queries do return results if the country name is excluded, in both enwiki and eswiki.
term+term+term
(This pattern is listed out of frequency order so that it sits next to the previous one.)
Examples:
- accountable+care+organizations+and+evidence+based+payment+reform
- android+phone+gone+awol+just+google
- como+afilar+una+batidora+picadora+electrica+recupera+batidoras
- el+peine+de+las+sirenas
We had an estimated 1,382 to 2,536 such queries in our samples, representing 0.28% to 0.51% of all zero-results queries. These were generally found in eswiki, with some in enwiki.
These queries are characterized by being run together with pluses instead of spaces.
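A possible repair for both plus-joined patterns is to turn the pluses back into spaces, but only when the query starts with words joined by pluses. The sketch below is an assumption about how that could be done and is not something the search backend is known to do. The regular expression and the name `replace_pluses` are illustrative.

```python
import re

# Assumed heuristic: the query starts with two or more word tokens
# joined by '+', optionally followed by more text (e.g. a country name).
PLUS_JOINED_RE = re.compile(r'^\w+(?:\+\w+)+(?:\s|$)')

def replace_pluses(query: str) -> str:
    """Replace '+' with spaces, but only in plus-joined queries."""
    if PLUS_JOINED_RE.match(query) is None:
        return query  # leave queries such as 'form 1+ 3dprinter' alone
    return query.replace('+', ' ')

fixed_plain = replace_pluses('el+peine+de+las+sirenas')
fixed_country = replace_pluses('finance+and+revenue+f+c Germany')
untouched = replace_pluses('form 1+ 3dprinter')
```

The guard matters because a plus sign can be a legitimate part of a query; a blanket replacement would change those too.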
paint
Examples:
- ""abel boulineau"" paint
- ""wilhelmina k. lagerholm"" paint
- Carl Friedrich Schulz - Zeitungslektüre am Biertisch (1851) paint
- Karr par Breton d'après Petit paint
We had an estimated 1,094 to 3,554 such queries in our samples, representing 0.22% to 0.71% of all zero-results queries. These were generally found in enwiki.
These queries are characterized by ending with the word paint. They come in two formats:
- ""<artist>"" paint
- <commons file> paint
<artist> seems to be the name of an artist, double quoted twice. <commons file> is the name of a file on Wikimedia Commons, without the file suffix (e.g., .jpg). The artist names I tried did not return results, though Google found them in various art galleries. The Commons files generally return results when searched on Commons.
Highly repeated searches
These are idiosyncratic searches, but can be repeated up to hundreds of times per hour, indicating that they are probably not driven by a human typing in the search over and over.
Examples:
- ou as I can get tonight without being detected, and Tuck and Clay will be there too, along with an undercover team. You’ll have an earpiece no one will be able to see, so we
  - Google Books says this is a snippet from a novel. In a two-week sample of the top 100 queries per day, this came up 12 times. It was probably there the other two days, but didn't make the top 100. There are up to 964 queries in a day.
  - Upon further investigation, the user agent for these queries (which have since stopped) appears to be that of a Nook tablet e-reader, which does support searching Wikipedia. Perhaps someone set their Nook down in a way that the "search Wikipedia" button was continuously pressed (for a couple of weeks!).
- Iamlookingfornodethree
  - There is a weird pattern here of iamlookingfornode<x>, where <x> can be a few different things.
- form 1+ 3dprinter
  - Found on just one day, but 668 times in less than three hours.
- Dounload feer game
  - Found on just one day, but 248 times in less than three hours.
These are hard to quantify, but I looked for queries that were repeated more than 50 times in my samples, didn't fall into other categories, and were unique enough not to be driven by the day's events or random searching (e.g., one word searches).
We had an estimated 892 to 3,019 such queries in our samples, representing 0.18% to 0.60% of all zero-results queries.
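The first filtering step (queries repeated more than 50 times) is easy to automate; the later steps, excluding other categories and event-driven or one-word searches, were judgment calls. A minimal sketch of the counting step follows. The function name and the toy counts are made up for illustration and are not taken from the logs.

```python
from collections import Counter

def highly_repeated(queries, threshold=50):
    """Return {query: count} for queries seen more than `threshold` times."""
    counts = Counter(q.strip() for q in queries)
    return {q: n for q, n in counts.items() if n > threshold}

# Toy data: the counts here are invented, not measured values.
sample = (
    ['Iamlookingfornodethree'] * 60
    + ['Dounload feer game'] * 51
    + ['weather'] * 50
)
repeated = highly_repeated(sample)
```

The output of a step like this is only a candidate list; each entry still needs a manual look before it is counted as a highly repeated search.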
{searchTerms}
Examples:
- {searchTerms}
- %7bsearchTerms%7d
- Liste der {searchTerms} Episoden
- {searchTerms}'||'
We had between 1,909 and 2,314 such queries in our samples, representing 0.38% to 0.46% of all zero-results queries. These were generally found in ruwiki.
This is likely developer error in an app or other automated search.
Similarly, we had a number of examples of search_suggest_query (440 to 509, 0.09% to 0.10%, en, de, fr) and \{@} (148 to 205, 0.03% to 0.04%, en, de, ru).
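The literal {searchTerms} is the placeholder used in OpenSearch URL templates, which would be consistent with a client that never substituted the user's text. The sketch below shows one way to detect it, including the percent-encoded variant seen in the examples. The function name is hypothetical and this is not the heuristic used for the counts above.

```python
from urllib.parse import unquote

PLACEHOLDER = '{searchTerms}'

def has_unfilled_placeholder(query: str) -> bool:
    """True if the query still contains the raw or percent-encoded placeholder."""
    return PLACEHOLDER in unquote(query)

examples = [
    '{searchTerms}',
    '%7bsearchTerms%7d',
    'Liste der {searchTerms} Episoden',
    "{searchTerms}'||'",
    'Meryl Streep',
]
flags = [has_unfilled_placeholder(q) for q in examples]
```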
## <countrycode> tel fax
Examples:
- Aluminum Bracket 44 uk tel fax
- diecast aluminum housing 31 nl tel fax
- plastic injection mold makers 34 es tel fax
We had between 33 and 1,293 such queries in our samples, representing 0.01% to 0.26% of all zero-results queries. These were generally found in dewiki.
These queries seem to be manufacturing terms, a two digit number, a country code, and "tel fax".
Chinese product descriptions .xyz
Examples:
- 花店用品/蓝色妖姬着色剂 根部吸水浅宝蓝玫瑰*13230782866*QQ34040316座机03177896222*psuddf.zhijieranliao.xyz
- 直接染料/直接深棕GTL *13230782866*QQ34040316座机03177896222*zkdact.baohuabanranliao.xyz
We had an estimated 172 to 989 such queries in our samples, representing 0.03% to 0.20% of all zero-results queries. These were generally found in enwiki.
These appear to be product descriptions in Chinese, along with additional information. They all end in .xyz. Note that 座机 means landline, indicating contact info.
Online searches for parts of these reveal a similar pattern on Chinese-language business/manufacturing sites.
Massive snippets
This particular category is fairly rare, but may incur significant computational cost, so it is worth noting. These are searches that are 500 characters in length or more (up to more than 5,000 characters). Many look like snippets from larger texts, such as books or articles, and are in several different languages.
We had between 183 and 261 such queries in our samples, representing 0.04% to 0.05% of all zero-results queries. These were generally found in en, fr, de, and ru.
Miscellany
Below are some very general impressions from the larger collections of zero-results (10K+ from a given wiki). My ability to analyze languages I don't know is limited, but here is what I noticed:
- dewiki has a few hundred OR'd-together wildcard searches, some of which seem to be trying to handle variations in declension.
- jawiki has lots of "..." film searches.
- ruwiki has a few non-Cyrillic searches.
- itwiki has lots of queries that are multi-word phrases with underscores instead of spaces.
- eswiki and frwiki have a fair number of build up searches and searches in Arabic, and frwiki has a fair number of searches in Chinese.
- zhwiki has lots of non-Chinese searches in various languages.
- plwiki has a fair number of queries of the form *<musical thing>* AND (muzyk* OR Dyskografia) (with asterisks) where <musical thing> seems to be an artist, band, album, or something similar.
Summary Table
Query type | Sample 7/10 | Sample 7/17 | Sample 7/24 | Min % of zero-results queries (per 500K sample) | Max % of zero-results queries (per 500K sample) | Most affected wikis |
DOI | 15393 | 96998 | 50181 | 3.08% | 19.40% | en, nl, ja, zh, war, vi, uk, sv, pt, pl, no, ko, it, id, hu, fr, fi, fa, de, cs, ceb, ca, ar, es, ru |
Unix timestamps | 42650 | 26351 | 28089 | 5.27% | 8.53% | en, it, ru, ja, fa, tr, nl, he, ar, id, cs, hi, vi, ro, hu, uk, etc. |
"Article_title" AND "title of link taken from article" | 10524 | 8174 | 16657 | 1.63% | 3.33% | en |
TV Episodes / Movies—"..." film | 7989 | 7878 | 8794 | 1.58% | 1.76% | en, nl, de, fr, ja |
quot | 7768 | 5888 | 6297 | 1.18% | 1.55% | en |
term+term+term country | 6725 | 3437 | 5645 | 0.69% | 1.35% | es, en |
paint | 3554 | 1917 | 1094 | 0.22% | 0.71% | en |
Highly repeated searches | 892 | 1186 | 3019 | 0.18% | 0.60% | ? |
term+term+term | 2247 | 1382 | 2536 | 0.28% | 0.51% | es, en |
{searchTerms} | 2314 | 1909 | 1997 | 0.38% | 0.46% | ru |
## <countrycode> tel fax | 572 | 33 | 1293 | 0.01% | 0.26% | de |
Chinese product descriptions .xyz | 172 | 221 | 989 | 0.03% | 0.20% | en |
search_suggest_query | 449 | 509 | 440 | 0.09% | 0.10% | en, de, fr |
Massive snippets | 258 | 183 | 261 | 0.04% | 0.05% | en, fr, de, ru |
\{@} | 205 | 150 | 148 | 0.03% | 0.04% | en, de, ru |
One Month Followup
I've conducted a followup study to track changes to these patterns over time. Since my investigation is more focused this time (on these queries) I was able to process a lot more data.
As before, I've limited my investigation to full text queries against the 52 wikis with 100K+ articles (see above).
My new samples are taken from the entire day (i.e., one log file, labeled from ~8am to ~8am) for each day queried. For zero-result queries, I sampled 1-in-15, to get a 300K-400K query sample. For all queries, I sampled 1-in-45 to get a 600K to 800K query sample.
I focused on the largest patterns above, and included some of the easier to automatically detect patterns as well.
In particular: DOI, Unix Timestamp, "Title_1" AND "Title 2", "..." film, quot, "..." paint, tel fax, .xyz, {searchTerms}, \{@}, search_suggest_query, term+term+term Country, and long lines (150+ characters).
I counted the frequency and ratio of these kinds of queries for each day of one recent week (8/19-8/25) and one week around the original investigation date (7/22-7/28); both are Wednesday to Tuesday (since I was investigating on a Wednesday, 8/26).
Last Month (July 2015)
Below are the results for a week in July. Keep in mind that Zero Queries are sampled 1-in-15 and All Queries are sampled 1-in-45. The percentage to the right of the sample size for the Zero Queries is the zero results rate, (zero_queries*15)/(all_queries*45).
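Because the two samples use different sampling rates, each count has to be scaled back up before dividing. The sketch below works this through for Wednesday 07-22 using the sample sizes from the table; the function and constant names are illustrative.

```python
ZERO_SAMPLING = 15  # zero-result queries were sampled 1-in-15
ALL_SAMPLING = 45   # all queries were sampled 1-in-45

def zero_results_rate(zero_sampled: int, all_sampled: int) -> float:
    """Estimate the zero results rate from the two differently sampled counts."""
    return (zero_sampled * ZERO_SAMPLING) / (all_sampled * ALL_SAMPLING)

# Wednesday 07-22: 304845 sampled zero-result queries, 683966 sampled queries.
rate_0722 = zero_results_rate(304845, 683966)
rate_0722_pct = round(rate_0722 * 100, 2)

# A pattern's share of zero-result queries needs no scaling, since both
# counts come from the same 1-in-15 sample (DOI on 07-22: 11753 of 304845).
doi_share_pct = round(11753 / 304845 * 100, 2)
```

The rounded values agree with the 14.86% and 3.86% shown in the table for that day.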
Day | Wednesday 07-22 | | | | Thursday 07-23 | | | | Friday 07-24 | | | |
Query type | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | |
samples / zero% | 304845 | 14.86% | 683966 | | 310079 | 15.92% | 649420 | | 354738 | 17.63% | 670675 | |
DOI | 11753 | 3.86% | 3868 | 0.57% | 15410 | 4.97% | 5092 | 0.78% | 44656 | 12.59% | 15022 | 2.24% |
Unix Timestamp | 23611 | 7.75% | 8033 | 1.17% | 23220 | 7.49% | 7526 | 1.16% | 23119 | 6.52% | 7792 | 1.16% |
"Title_1" AND "Title 2" | 16226 | 5.32% | 32664 | 4.78% | 11029 | 3.56% | 21975 | 3.38% | 14793 | 4.17% | 25581 | 3.81% |
"..." film | 5490 | 1.80% | 2219 | 0.32% | 6454 | 2.08% | 2471 | 0.38% | 5199 | 1.47% | 1983 | 0.30% |
quot | 3886 | 1.27% | 1896 | 0.28% | 3868 | 1.25% | 2005 | 0.31% | 3809 | 1.07% | 1887 | 0.28% |
paint | 563 | 0.18% | 261 | 0.04% | 684 | 0.22% | 292 | 0.04% | 576 | 0.16% | 281 | 0.04% |
tel fax | 270 | 0.09% | 85 | 0.01% | 130 | 0.04% | 45 | 0.01% | 109 | 0.03% | 51 | 0.01% |
.xyz | 80 | 0.03% | 28 | 0.00% | 363 | 0.12% | 101 | 0.02% | 406 | 0.11% | 137 | 0.02% |
{searchTerms} | 1026 | 0.34% | 757 | 0.11% | 1026 | 0.33% | 750 | 0.12% | 1045 | 0.29% | 759 | 0.11% |
\{@} | 74 | 0.02% | 20 | 0.00% | 74 | 0.02% | 25 | 0.00% | 67 | 0.02% | 26 | 0.00% |
search_suggest_query | 365 | 0.12% | 127 | 0.02% | 394 | 0.13% | 113 | 0.02% | 374 | 0.11% | 131 | 0.02% |
term+term Country | 2063 | 0.68% | 849 | 0.12% | 2674 | 0.86% | 1297 | 0.20% | 2354 | 0.66% | 1249 | 0.19% |
long lines (150+) | 1347 | 0.44% | 493 | 0.07% | 1572 | 0.51% | 566 | 0.09% | 1197 | 0.34% | 416 | 0.06% |
Day | Saturday 07-25 | | | | Sunday 07-26 | | | | Monday 07-27 | | | | Tuesday 07-28 | | | |
Query type | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | |
samples / zero% | 333763 | 16.87% | 659440 | | 379430 | 17.70% | 714432 | | 404075 | 16.99% | 792621 | | 385144 | 18.18% | 706337 | |
DOI | 31388 | 9.40% | 10309 | 1.56% | 65501 | 17.26% | 21910 | 3.07% | 70647 | 17.48% | 23619 | 2.98% | 70225 | 18.23% | 23551 | 3.33% |
Unix Timestamp | 23019 | 6.90% | 7691 | 1.17% | 25149 | 6.63% | 8336 | 1.17% | 27026 | 6.69% | 8980 | 1.13% | 23915 | 6.21% | 8077 | 1.14% |
"Title_1" AND "Title 2" | 16639 | 4.99% | 24661 | 3.74% | 9516 | 2.51% | 16438 | 2.30% | 15114 | 3.74% | 24345 | 3.07% | 16195 | 4.20% | 24692 | 3.50% |
"..." film | 5885 | 1.76% | 2237 | 0.34% | 7177 | 1.89% | 2820 | 0.39% | 6554 | 1.62% | 2623 | 0.33% | 5636 | 1.46% | 2258 | 0.32% |
quot | 3599 | 1.08% | 1708 | 0.26% | 3850 | 1.01% | 1865 | 0.26% | 3954 | 0.98% | 1919 | 0.24% | 3807 | 0.99% | 1893 | 0.27% |
paint | 226 | 0.07% | 136 | 0.02% | 243 | 0.06% | 110 | 0.02% | 218 | 0.05% | 151 | 0.02% | 221 | 0.06% | 96 | 0.01% |
tel fax | 181 | 0.05% | 62 | 0.01% | 204 | 0.05% | 54 | 0.01% | 82 | 0.02% | 25 | 0.00% | 264 | 0.07% | 76 | 0.01% |
.xyz | 281 | 0.08% | 103 | 0.02% | 557 | 0.15% | 179 | 0.03% | 450 | 0.11% | 140 | 0.02% | 417 | 0.11% | 137 | 0.02% |
{searchTerms} | 987 | 0.30% | 690 | 0.10% | 959 | 0.25% | 653 | 0.09% | 956 | 0.24% | 680 | 0.09% | 1039 | 0.27% | 720 | 0.10% |
\{@} | 65 | 0.02% | 22 | 0.00% | 50 | 0.01% | 11 | 0.00% | 53 | 0.01% | 20 | 0.00% | 70 | 0.02% | 25 | 0.00% |
search_suggest_query | 381 | 0.11% | 127 | 0.02% | 474 | 0.12% | 171 | 0.02% | 528 | 0.13% | 176 | 0.02% | 457 | 0.12% | 134 | 0.02% |
term+term Country | 2223 | 0.67% | 1143 | 0.17% | 2220 | 0.59% | 1071 | 0.15% | 1940 | 0.48% | 1030 | 0.13% | 1971 | 0.51% | 1031 | 0.15% |
long lines (150+) | 1375 | 0.41% | 496 | 0.08% | 1298 | 0.34% | 459 | 0.06% | 1264 | 0.31% | 462 | 0.06% | 1799 | 0.47% | 589 | 0.08% |
This Month (August 2015)
Below are the results for a week in August. Keep in mind that Zero Queries are sampled 1-in-15 and All Queries are sampled 1-in-45. The percentage to the right of the sample size for the Zero Queries is the zero results rate, (zero_queries*15)/(all_queries*45).
Day | Wednesday 08-19 | | | | Thursday 08-20 | | | | Friday 08-21 | | | |
Query type | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | |
samples / zero% | 401789 | 19.29% | 694139 | | 393384 | 19.56% | 670250 | | 354555 | 18.12% | 652231 | |
DOI | 30369 | 7.56% | 9992 | 1.44% | 45115 | 11.47% | 15167 | 2.26% | 16416 | 4.63% | 5599 | 0.86% |
Unix Timestamp | 5045 | 1.26% | 1710 | 0.25% | 4831 | 1.23% | 1656 | 0.25% | 4618 | 1.30% | 1474 | 0.23% |
"Title_1" AND "Title 2" | 4834 | 1.20% | 6768 | 0.98% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% |
"..." film | 6065 | 1.51% | 2466 | 0.36% | 5682 | 1.44% | 2275 | 0.34% | 5851 | 1.65% | 2385 | 0.37% |
quot | 3566 | 0.89% | 1751 | 0.25% | 3879 | 0.99% | 2036 | 0.30% | 3983 | 1.12% | 2008 | 0.31% |
paint | 194 | 0.05% | 90 | 0.01% | 119 | 0.03% | 55 | 0.01% | 198 | 0.06% | 78 | 0.01% |
tel fax | 49 | 0.01% | 8 | 0.00% | 233 | 0.06% | 87 | 0.01% | 120 | 0.03% | 43 | 0.01% |
.xyz | 265 | 0.07% | 85 | 0.01% | 283 | 0.07% | 106 | 0.02% | 462 | 0.13% | 166 | 0.03% |
{searchTerms} | 1047 | 0.26% | 752 | 0.11% | 1064 | 0.27% | 746 | 0.11% | 1060 | 0.30% | 716 | 0.11% |
\{@} | 86 | 0.02% | 31 | 0.00% | 53 | 0.01% | 28 | 0.00% | 52 | 0.01% | 26 | 0.00% |
search_suggest_query | 349 | 0.09% | 126 | 0.02% | 377 | 0.10% | 127 | 0.02% | 400 | 0.11% | 125 | 0.02% |
term+term Country | 0 | 0.00% | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% |
long lines (150+) | 1420 | 0.35% | 545 | 0.08% | 1487 | 0.38% | 0 | 0.00% | 1657 | 0.47% | 576 | 0.09% |
Day | Saturday 08-22 | | | | Sunday 08-23 | | | | Monday 08-24 | | | | Tuesday 08-25 | | | |
Query type | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | | Zero Queries | | All Queries | |
samples / zero% | 341387 | 17.86% | 637122 | | 355099 | 17.47% | 677356 | | 391179 | 17.50% | 744908 | | 394403 | 18.89% | 695853 | |
DOI | 17008 | 4.98% | 5681 | 0.89% | 18768 | 5.29% | 6295 | 0.93% | 26501 | 6.77% | 8890 | 1.19% | 39322 | 9.97% | 13236 | 1.90% |
Unix Timestamp | 4412 | 1.29% | 1529 | 0.24% | 4542 | 1.28% | 1579 | 0.23% | 4566 | 1.17% | 1560 | 0.21% | 4206 | 1.07% | 1378 | 0.20% |
"Title_1" AND "Title 2" | 1 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% | 1 | 0.00% |
"..." film | 6578 | 1.93% | 2690 | 0.42% | 8623 | 2.43% | 3387 | 0.50% | 7442 | 1.90% | 3160 | 0.42% | 5174 | 1.31% | 2138 | 0.31% |
quot | 3801 | 1.11% | 1998 | 0.31% | 3896 | 1.10% | 1869 | 0.28% | 4028 | 1.03% | 1869 | 0.25% | 4219 | 1.07% | 2064 | 0.30% |
paint | 136 | 0.04% | 78 | 0.01% | 129 | 0.04% | 50 | 0.01% | 84 | 0.02% | 30 | 0.00% | 61 | 0.02% | 28 | 0.00% |
tel fax | 41 | 0.01% | 15 | 0.00% | 0 | 0.00% | 0 | 0.00% | 252 | 0.06% | 69 | 0.01% | 490 | 0.12% | 143 | 0.02% |
.xyz | 302 | 0.09% | 92 | 0.01% | 381 | 0.11% | 123 | 0.02% | 527 | 0.13% | 187 | 0.03% | 455 | 0.12% | 160 | 0.02% |
{searchTerms} | 1002 | 0.29% | 666 | 0.10% | 980 | 0.28% | 582 | 0.09% | 1043 | 0.27% | 622 | 0.08% | 949 | 0.24% | 580 | 0.08% |
\{@} | 57 | 0.02% | 16 | 0.00% | 58 | 0.02% | 20 | 0.00% | 41 | 0.01% | 19 | 0.00% | 68 | 0.02% | 29 | 0.00% |
search_suggest_query | 388 | 0.11% | 119 | 0.02% | 423 | 0.12% | 160 | 0.02% | 464 | 0.12% | 150 | 0.02% | 354 | 0.09% | 142 | 0.02% |
term+term Country | 0 | 0.00% | 1 | 0.00% | 1 | 0.00% | 1 | 0.00% | 923 | 0.24% | 434 | 0.06% | 2095 | 0.53% | 1066 | 0.15% |
long lines (150+) | 1163 | 0.34% | 400 | 0.06% | 1359 | 0.38% | 498 | 0.07% | 1666 | 0.43% | 601 | 0.08% | 1222 | 0.31% | 452 | 0.06% |
Analysis
The overall zero results rate hasn't changed much from July to August. Friday to Tuesday track pretty closely in the graph below, and Wednesday and Thursday have higher zero results rates in August than July.
The decrease in some queries, such as the "Title_1" AND "Title 2" queries, has little overall effect on the zero results rate. The pattern's share of zero-results queries is only slightly higher than its share of all queries (usually within one percentage point), so removing both its zero-results queries and its queries that do return results changes the numerator and the denominator by nearly the same proportion.
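A worked example with the Wednesday 07-22 figures makes the size of the effect concrete. The sketch removes the "Title_1" AND "Title 2" counts from both samples and recomputes the rate; the function names are illustrative, and the 15 and 45 are the sampling rates stated earlier.

```python
def zero_rate(zero_sampled: int, all_sampled: int) -> float:
    """Zero results rate from a 1-in-15 zero sample and a 1-in-45 full sample."""
    return (zero_sampled * 15) / (all_sampled * 45)

def rate_without_pattern(zero_sampled, all_sampled, pattern_zero, pattern_all):
    """Recompute the rate after removing one pattern from both samples."""
    return zero_rate(zero_sampled - pattern_zero, all_sampled - pattern_all)

# Wednesday 07-22: totals, then the "Title_1" AND "Title 2" counts.
before = zero_rate(304845, 683966)
after = rate_without_pattern(304845, 683966, 16226, 32664)

before_pct = round(before * 100, 2)
after_pct = round(after * 100, 2)
```

On these figures, removing the pattern entirely moves the rate by less than a tenth of a percentage point.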
"Title_1" AND "Title 2"
This IP was banned (apparently around 8/19), and queries dropped to zero, both overall and zero results queries.
Unix Timestamps
The fix to the Wikipedia Mobile App was deployed, and rates have dropped significantly, though not to 0%, because not everyone updates their apps right away.
DOI
DOI rates have always fluctuated considerably across various samples (3% to 19%, see above), so it's hard to be sure, but in the samples for these weeks, the overall rate seems to be down.
quot and "..." film
Not much has happened with the quot and "..." film queries. We've contacted the respective sources, but it isn't clear anything will be done soon. The media player is behaving as designed, so these queries probably won't stop. The NLA may eventually fix their quot queries.
Comparison to Dashboards
Despite the large dip in "Title_1" AND "Title 2" and Unix Timestamp queries, we haven't seen any corresponding movement in the Discovery KPI dashboard.
Part of the reason for this is that this analysis and the dashboards are measuring different things: the dashboards look at all zero results (including prefix searches), while I'm looking only at full text queries to large wikis (to limit the scope of my initial investigation). I'm going to try to confirm this (and see if anything else pops out!) by quickly looking at everything I didn't look at here.
Everything Else (full text)
I took a 1-in-45 sample of full text queries from 8/25 for wikis not in the 52 100K+ wikis, and found some interesting stuff.
There were 269,924 samples (compared to 695,853 in the "big" wikis)—so, roughly 28% of traffic.
There were 152,726 zero result queries, for a 56.6% zero results rate (compared to ~20% in the "big" wikis).
The usual suspects do not show up as much, other than DOI:
samples/ zero% | 152726 | 56.6% |
DOI | 3967 | 2.60% |
Unix Timestamp | 961 | 0.63% |
"Title_1" AND "Title 2" | 0 | 0.00% |
"..." film | 152 | 0.10% |
quot | 73 | 0.05% |
paint | 0 | 0.00% |
tel fax | 166 | 0.11% |
.xyz | 0 | 0.00% |
{searchTerms} | 30 | 0.02% |
\{@} | 3 | 0.00% |
search_suggest_query | 7 | 0.00% |
term+term Country | 0 | 0.00% |
long lines (150+) | 117 | 0.08% |
Other interesting facts:
There were only about 3K non-Latin searches (lots of Greek, Cyrillic, Hebrew, Arabic, Devanagari, Thai, Japanese, Chinese, emoji, and more).
A quick skim of the zero results shows lots of names, but not a lot else stuck out.
The top 12 wikis are responsible for 145,325 of the zero results queries (95.2%) (with Wikidata on top and most of the rest coming from itwiki...):
- 20699 wikidatawiki_content (13.5%)
- 19798 itwiktionary_content (13.0%)
- 18753 itwikiversity_content
- 18080 itwikinews_content
- 16587 itwikivoyage_content
- 15504 itwikibooks_content
- 13460 itwikisource_content
- 12148 itwikiquote_content
- 3773 eswiki
- 3764 nlwiktionary_content
- 1419 sewiki_content
- 1340 thwiki_content
Change in Zero Results Rate by Wiki (July to August)
The table below shows the change in zero results rate between two one-week samples (1-in-750), one from July (7/22-7/28) and one from August (8/19-8/25). The wikis shown are those with at least 1000 queries in either week sampled. In total, they account for just under 98% of queries each week.
- zero rate is the zero results rate.
- % of all is the percentage of all queries that come from this wiki.
- % of zero is the percentage of zero results that come from this wiki.
- zero/all is the ratio of the previous two percentages (% of zero divided by % of all). Values > 1 indicate more than average zero results; values < 1 indicate fewer than average zero results.
- query vol Δ is the change in query volume from July to August.
- zero rate Δ is the change in zero results rate from July to August.
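The derived columns can be checked against the table's own figures. The sketch below recomputes two of them for enwiki from the rounded percentages shown; the function names are illustrative, and zero rate Δ is treated as a difference in percentage points, which is what the table values suggest.

```python
def zero_to_all_ratio(pct_of_zero: float, pct_of_all: float) -> float:
    """zero/all: a wiki's share of zero results over its share of all queries."""
    return pct_of_zero / pct_of_all

def rate_delta_points(july_rate: float, august_rate: float) -> float:
    """zero rate delta, in percentage points."""
    return august_rate - july_rate

# enwiki, July: 26.54% of zero results, 47.13% of all queries.
enwiki_july_ratio = zero_to_all_ratio(26.54, 47.13)

# enwiki zero rate: 14.18% in July, 13.79% in August.
enwiki_rate_delta = rate_delta_points(14.18, 13.79)

# The same ratio also falls out of wiki zero rate / overall zero rate.
enwiki_july_ratio_alt = 14.18 / 25.18
```

Small disagreements with the table in the last decimal place are expected, since the table's deltas were presumably computed from unrounded rates.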
N.B.: This analysis does not properly account for itwiki's interwiki search (which searches several Italian wikis at once). Because this is a 1-in-750 sample, it will likely see only one element of any given itwiki interwiki search, and the itwik* wikis are overrepresented. The Discovery Dashboards use unsampled data and so can account for this.
wiki | July zero rate | July % of all | July % of zero | July zero/all | August zero rate | August % of all | August % of zero | August zero/all | query vol Δ | zero rate Δ |
TOTAL | 25.18% | 100.00% | 100.00% | 1.00 | 28.75% | 100.00% | 100.00% | 1.00 | 6.66% | 3.56% | ||
enwiki | 14.18% | 47.13% | 26.54% | 0.56 | 13.79% | 38.03% | 18.25% | 0.48 | -13.94% | -0.39% | ||
dewiki | 14.78% | 8.19% | 4.81% | 0.59 | 17.51% | 8.28% | 5.04% | 0.61 | 7.75% | 2.73% | ||
eswiki | 21.87% | 2.83% | 2.46% | 0.87 | 18.83% | 3.97% | 2.60% | 0.65 | 49.57% | -3.04% | ||
frwiki | 11.58% | 3.30% | 1.52% | 0.46 | 12.67% | 3.13% | 1.38% | 0.44 | 1.06% | 1.10% | ||
itwiktionary | 66.60% | 2.27% | 6.01% | 2.64 | 67.87% | 2.94% | 6.95% | 2.36 | 38.25% | 1.27% | ||
itwiki | 18.52% | 2.80% | 2.06% | 0.74 | 19.87% | 2.93% | 2.02% | 0.69 | 11.38% | 1.35% | ||
itwikinews | 61.16% | 2.35% | 5.71% | 2.43 | 62.69% | 2.90% | 6.31% | 2.18 | 31.48% | 1.53% | ||
itwikisource | 47.03% | 2.33% | 4.35% | 1.87 | 46.19% | 2.90% | 4.65% | 1.61 | 32.56% | -0.83% | ||
itwikiquote | 43.33% | 2.29% | 3.94% | 1.72 | 42.85% | 2.89% | 4.31% | 1.49 | 34.83% | -0.48% | ||
itwikiversity | 64.41% | 2.31% | 5.90% | 2.56 | 65.77% | 2.89% | 6.61% | 2.29 | 33.63% | 1.36% | ||
itwikibooks | 53.88% | 2.29% | 4.90% | 2.14 | 53.43% | 2.88% | 5.36% | 1.86 | 34.22% | -0.44% | ||
itwikivoyage | 60.09% | 2.35% | 5.60% | 2.39 | 58.49% | 2.86% | 5.82% | 2.03 | 29.95% | -1.60% | ||
wikidatawiki | 88.28% | 1.23% | 4.31% | 3.51 | 72.08% | 2.83% | 7.11% | 2.51 | 145.62% | -16.20% | ||
ruwiki | 16.14% | 2.79% | 1.79% | 0.64 | 17.19% | 2.52% | 1.51% | 0.60 | -3.60% | 1.06% | ||
jawiki | 23.74% | 2.30% | 2.17% | 0.94 | 26.11% | 2.25% | 2.05% | 0.91 | 4.40% | 2.37% | ||
ptwiki | 23.41% | 2.15% | 2.00% | 0.93 | 22.33% | 2.13% | 1.65% | 0.78 | 5.28% | -1.08% | ||
eswiki | 25.25% | 1.03% | 1.03% | 1.00 | 25.77% | 1.61% | 1.44% | 0.90 | 66.91% | 0.52% | ||
nlwiki | 23.76% | 0.82% | 0.78% | 0.94 | 34.47% | 1.33% | 1.60% | 1.20 | 72.53% | 10.71% | ||
zhwiki | 27.77% | 1.05% | 1.16% | 1.10 | 41.58% | 1.06% | 1.53% | 1.45 | 7.19% | 13.81% | ||
plwiki | 33.45% | 0.48% | 0.63% | 1.33 | 44.24% | 0.88% | 1.35% | 1.54 | 96.25% | 10.79% | ||
trwiki | 19.99% | 0.47% | 0.37% | 0.79 | 34.40% | 0.79% | 0.95% | 1.20 | 78.98% | 14.41% | ||
viwiki | 53.43% | 0.25% | 0.53% | 2.12 | 42.57% | 0.77% | 1.14% | 1.48 | 231.56% | -10.86% | ||
arwiki | 31.50% | 0.71% | 0.88% | 1.25 | 27.86% | 0.61% | 0.59% | 0.97 | -7.98% | -3.65% | ||
idwiki | 36.12% | 0.41% | 0.58% | 1.43 | 31.42% | 0.65% | 0.71% | 1.09 | 68.86% | -4.70% | ||
svwiki | 23.31% | 0.65% | 0.60% | 0.93 | 23.39% | 0.50% | 0.40% | 0.81 | -18.21% | 0.08% | ||
fawiki | 32.52% | 0.51% | 0.66% | 1.29 | 28.80% | 0.42% | 0.42% | 1.00 | -11.78% | -3.72% | ||
nlwiktionary | 99.59% | 0.46% | 1.82% | 3.95 | 99.26% | 0.41% | 1.42% | 3.45 | -4.59% | -0.33% | ||
commonswiki | 8.94% | 0.32% | 0.11% | 0.35 | 9.17% | 0.38% | 0.12% | 0.32 | 25.04% | 0.23% | ||
kowiki | 37.46% | 0.36% | 0.53% | 1.49 | 37.18% | 0.35% | 0.45% | 1.29 | 5.24% | -0.28% | ||
cswiki | 42.39% | 0.17% | 0.29% | 1.68 | 37.32% | 0.25% | 0.32% | 1.30 | 50.62% | -5.07% | ||
fiwiki | 38.62% | 0.23% | 0.35% | 1.53 | 36.15% | 0.24% | 0.31% | 1.26 | 14.42% | -2.47% | ||
hewiki | 25.61% | 0.25% | 0.26% | 1.02 | 27.37% | 0.18% | 0.17% | 0.95 | -23.59% | 1.76% | ||
thwiki | 42.47% | 0.08% | 0.14% | 1.69 | 59.33% | 0.20% | 0.41% | 2.06 | 159.87% | 16.86% | ||
commonswiki_file | 44.95% | 0.20% | 0.35% | 1.78 | 40.25% | 0.18% | 0.25% | 1.40 | -2.07% | -4.70% | ||
dawiki | 22.48% | 0.07% | 0.06% | 0.89 | 35.95% | 0.17% | 0.21% | 1.25 | 156.59% | 13.47% | ||
huwiki | 51.06% | 0.15% | 0.31% | 2.03 | 44.28% | 0.12% | 0.18% | 1.54 | -16.90% | -6.78% | ||
nowiki | 58.41% | 0.12% | 0.28% | 2.32 | 47.78% | 0.14% | 0.23% | 1.66 | 19.47% | -10.63% | ||
hiwiki | 33.09% | 0.11% | 0.15% | 1.31 | 37.28% | 0.13% | 0.17% | 1.30 | 23.50% | 4.19% | ||
ukwiki | 61.76% | 0.14% | 0.34% | 2.45 | 58.64% | 0.10% | 0.20% | 2.04 | -25.10% | -3.13% |
Notes
- eswiki is listed twice; the first instance is searches against "eswiki_content", the second against "eswiki". (I'm not sure what that means.)
- enwiki had a significant drop in traffic from July to August
- wikidata, eswiki, and many of the itwikis had significant increases in traffic from July to August
- wikidata had a significant decrease in zero rate, though it is still incredibly high.
- nlwiktionary has a phenomenal zero results rate, over 99%. A quick check shows that it is getting spammed with URLs for German (.de) websites.
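The details of the quick check on nlwiktionary are not given. As a minimal sketch, assuming the queries are available as plain strings, a pattern match along these lines could estimate how many of them look like .de URLs. The regular expression, the function name `de_url_share`, and the sample queries are illustrative assumptions, not taken from the logs.

```python
import re

# Hypothetical heuristic: a query "looks like" a German URL if it contains a
# host name ending in .de, optionally preceded by a scheme or "www.".
DE_URL = re.compile(
    r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)*\.de(?:[/:?#]|\s|$)",
    re.IGNORECASE,
)

def de_url_share(queries):
    """Return the fraction of queries that match the .de URL heuristic."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if DE_URL.search(q))
    return hits / len(queries)

# Made-up examples, for illustration only.
sample = [
    "www.example.de",
    "http://shop.example.de/angebot",
    "woordenboek",
    "example.com",
]
print(f"{de_url_share(sample):.1%}")
```

Like the grep patterns used elsewhere in this review, a heuristic of this kind gives an approximate count rather than an exact one.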
Full manual review of a 1K enwiki sample
Introduction
Looking for large recurring patterns of searches will not reveal the frequency of idiosyncratic errors (like typos, gibberish, and queries in foreign languages) that can't easily be recognized automatically. So I undertook a manual review of 1047 randomly sampled queries from one day's worth of logs of enwiki full-text searches, starting on 2015-07-29. This was a random sample (or as random as pseudorandom can be) rather than an every-N sample.
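The difference between the two sampling approaches can be sketched as follows, assuming the day's queries are held in a list. The function names and the placeholder data are hypothetical; the tool actually used to draw the sample is not stated.

```python
import random

def random_sample(queries, k, seed=None):
    """Pick k queries uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(queries, k)

def every_n_sample(queries, n):
    """Pick every n-th query; periodic patterns in the log can bias this."""
    return queries[::n]

# Placeholder data standing in for one day's worth of logged queries.
queries = [f"query {i}" for i in range(10_000)]

print(len(random_sample(queries, 1047, seed=0)))
print(len(every_n_sample(queries, 10)))
```

An every-N sample is simpler to draw, but if automated clients issue queries at regular intervals it could over- or under-represent them, which a pseudorandom draw avoids.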
Caveats
All categories are at least somewhat subjective, and depend in part on my ability to recognize (or uncover) user intent. Many items could have been put in multiple categories, but I chose just one each time (some comments about this are included below). I know others would disagree on some categories, and I know that there are errors in categorization (i.e., I'd disagree with myself), but the overall trends are still illustrative.
Requestors
Overall, 67.2% were requested via API, 32.8% via web.
Typos
I broke typos up into two categories: apparent mistakes, and incomplete words or phrases. Note that the vast majority in the incomplete category come via API, hinting that some app may be sending incomplete queries.
153 14.6% TYPO, 54.2% web, 45.8% api
93 8.9% INCOMP, 97.8% api, 2.2% web
Total: 23.5%
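The category shares in this tally are counts divided by the 1047 queries in the sample. A minimal sketch of that arithmetic follows; the helper name `share` is hypothetical.

```python
SAMPLE_SIZE = 1047  # queries reviewed in this sample

def share(count, total=SAMPLE_SIZE):
    """Format a category count as a percentage of the sample."""
    return f"{100 * count / total:.1f}%"

counts = {"TYPO": 153, "INCOMP": 93}

for name, count in counts.items():
    print(count, share(count), name)
print("Total:", share(sum(counts.values())))
```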
Typo autosuggestions
I took a random sample of 40 of the queries in the TYPO category and searched for them in enwiki via the web. The results were oddly evenly distributed:
- 10/40 = 25.0% zero results
- 10/40 = 25.0% some results
- 20/40 = 50.0% correct results
Half had clearly correct suggestions, a quarter had no results, and a quarter had some non-zero results that were either wrong, or not clearly correct.
Typo reverse index
By manual inspection, 20 of the 153 TYPOs (13.1%) had an error in the first two characters of one of the search terms, and thus might benefit from a reverse index.
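Why a reverse index might help can be sketched under one assumption that is not stated here: that the suggester only considers candidate words sharing the first two characters with the query term. Under that assumption, a typo in those characters hides the intended word, while the same lookup on reversed strings still finds it. The function names, the tiny vocabulary, and the typo are made up for illustration.

```python
def candidates(term, vocabulary, prefix_len=2):
    """Words sharing the first prefix_len characters with term."""
    return {w for w in vocabulary if w[:prefix_len] == term[:prefix_len]}

def reverse_candidates(term, vocabulary, prefix_len=2):
    """Same lookup on reversed strings, i.e. words sharing the last characters."""
    rev = term[::-1]
    return {w for w in vocabulary if w[::-1][:prefix_len] == rev[:prefix_len]}

vocabulary = {"philosophy", "physics", "filament"}
typo = "fhilosophy"  # made-up typo with the error in the first character

print(candidates(typo, vocabulary))          # forward prefix "fh" matches nothing
print(reverse_candidates(typo, vocabulary))  # reversed prefix "yh" finds "philosophy"
```

The reverse lookup has the mirror-image weakness: it would miss typos in the last two characters, so the two indexes would complement rather than replace each other.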
Previously seen categories
These are the previously discussed large categories of zero-result queries. The distributions are different from previous samples, sometimes drastically different. This could be attributed to the current small sample size, the time skew in the earlier sample, day-of-the-week effects, and random vagaries of millions of people searching (though that last can account for almost any variance!).
125 11.9% AND, 100.0% api    "Article_title" AND "title of link taken from article"
91 8.7% UNIX, 100.0% api    Unix Timestamps
29 2.8% FILM, 100.0% web    TV Episodes / Movies: "..." film
28 2.7% QUOT, 100.0% api    quot
20 1.9% DOI, 100.0% api    DOI
3 0.3% TERM+, 100.0% api    term+term+term / term+term+term country
Total: 28.3%
Foreign languages
Though enwiki has many pages with titles in other languages, these searches didn't get any results. I didn't dig too deeply into most of them. However, numerous entries filed under MOVIES, MUSIC, and YOUTUBE are in Spanish, Portuguese, Turkish, or other languages.
13 1.2% CHINESE, 61.5% api, 38.5% web
12 1.1% ARABIC, 58.3% web, 41.7% api
7 0.7% CYRILLIC, 57.1% web, 42.9% api
7 0.7% TAGALOG, 85.7% web, 14.3% api
6 0.6% SPANISH, 83.3% api, 16.7% web
5 0.5% MALAY, 80.0% web, 20.0% api
3 0.3% GERMAN, 66.7% api, 33.3% web
3 0.3% DEVTRANS, 100.0% api
2 0.2% NORWEGIAN, 100.0% api
2 0.2% SWAHILI, 100.0% web
1 each (0.1%, 100% api) for GREEK, THAI, TAMIL
1 each (0.1%, 100% web) for PORTUGUESE, DUTCH, FRENCH, LATIN, SWEDISH, ITALIAN, CROATIAN, HINDI, ESTONIAN, HMONG, KANNADA, FINNISH
Total: 7.2%
One or Two Word Queries
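How many of these foreign language queries are one or two words, and thus, for example, more likely to get a match in Wiktionary or Wikivoyage? In the list below, the first figure counts one-word queries and the second counts two-word queries; where only one figure is given, it is the one-word count.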
- Arabic: 0/12, 3/12
- Chinese: hard to judge, but apparently 0/13; not sure on "two words"
- Cyrillic: 4/7, 1/7
- Dev-Trans: 3/3
- German: 1/3, 1/3
- Malay: 0/5, 1/5
- Norwegian: 0/2, 2/2
- Spanish: 1/6, 2/6
- Swahili: 0/2, 0/2
- Tagalog: 0/7
- Croatian, Estonian, Swedish, Tamil: 1/1
- Dutch, Greek, Hindi, Hmong, Italian, Portuguese, Thai: 0/1, 0/1
- Finnish, French, Kannada, Latin: 0/1, 1/1
Totals:
- One word: 13/75
- Two words: 14/75
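The totals can be checked by summing the per-language figures. This sketch assumes that the first figure on each line is the one-word count and the second the two-word count, that a line with a single figure gives only the one-word count, and that missing or uncertain two-word figures (Chinese, Dev-Trans, Tagalog, and the Croatian group) count as 0. Rows covering several languages are summed over the languages in the group.

```python
# (label, one-word queries, two-word queries)
rows = [
    ("Arabic", 0, 3),
    ("Chinese", 0, 0),
    ("Cyrillic", 4, 1),
    ("Dev-Trans", 3, 0),
    ("German", 1, 1),
    ("Malay", 0, 1),
    ("Norwegian", 0, 2),
    ("Spanish", 1, 2),
    ("Swahili", 0, 0),
    ("Tagalog", 0, 0),
    ("Croatian, Estonian, Swedish, Tamil", 4, 0),
    ("Dutch, Greek, Hindi, Hmong, Italian, Portuguese, Thai", 0, 0),
    ("Finnish, French, Kannada, Latin", 0, 4),
]
FOREIGN_TOTAL = 75  # all foreign-language queries in the sample

one_word = sum(row[1] for row in rows)
two_word = sum(row[2] for row in rows)

print(f"One word: {one_word}/{FOREIGN_TOTAL} = {one_word / FOREIGN_TOTAL:.1%}")
print(f"Two words: {two_word}/{FOREIGN_TOTAL} = {two_word / FOREIGN_TOTAL:.1%}")
```

Under these assumptions the sums agree with the listed totals, so roughly a third of the foreign-language queries (27 of 75) are one or two words long.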
Mystery queries
These are ones that I just couldn't figure out. They weren't clearly junk. They could be typos, but often they are too ambiguous to be sure.
66 6.3% ??, 71.2% api, 28.8% web
Not encyclopedic
These categories are potentially problematic, since they may depend in part on what you think should or should not be in Wikipedia.
Total: 28.0%
Wrong website
- PROD: These are queries that appear to be about general or specific products, including video games, clothing brands, drugs, decorations, laptop replacement parts, etc.
- QUESTION: These are queries that seem to be asking for non-encyclopedic information, including advice on romance, study habits, and home furnishing, job searches, making travel arrangements, celebrity facts, scholarly research, etc.
- URL: These look like they are or tried to be URLs.
- NEWS: Questions about current events.
- TWITTER: Actual tweets or parts of tweets.
57 5.4% PROD, 66.7% api, 33.3% web
45 4.3% QUESTION, 57.8% web, 42.2% api
32 3.1% URL, 62.5% web, 37.5% api
3 0.3% NEWS, 66.7% web, 33.3% api
2 0.2% TWITTER, 100.0% api
Total: 13.3%
Content
These appear to be searches for particular content, including particular songs, albums, music by a particular artist, movies, TV episodes, books, or scholarly articles. The YOUTUBE category covers queries that exactly match the titles of individual YouTube videos. The ISBN category covers specific, plain ISBNs.
42 4.0% MUSIC, 78.6% api, 21.4% web
25 2.4% YOUTUBE, 92.0% api, 8.0% web
10 1.0% MOVIE, 50.0% api, 50.0% web
7 0.7% ARTICLE, 85.7% web, 14.3% api
7 0.7% ISBN, 71.4% web, 28.6% api
Total: 8.8%
People and places
These are queries for particular people or places that are not in Wikipedia, including named individuals or parts of names (PERSON), online usernames (USER), business addresses (ADDRESS; note that all are in Las Vegas), and email addresses (EMAIL).
The LINKED category covers searches of this form:
- "SURNAME" "FIRST MIDDLE" "COMPANY" LINKEDIN
Both LINKEDIN and VIADEO (professional social networking sites) were used.
26 2.5% PERSON, 53.8% web, 46.2% api
16 1.5% USER, 56.2% api, 43.8% web
10 1.0% ADDRESS, 100.0% api
4 0.4% LINKED, 100.0% api
2 0.2% EMAIL, 100.0% web
Total: 5.6%
Misc
Stuff I could at least partly identify, but couldn't categorize elsewhere. One was just a number, one a slightly mangled Wikimedia Commons file name, the other a Wikimedia Commons category name.
2 0.2% COMMONS, 50.0% api, 50.0% web
1 0.1% NUMBER, 100.0% web
Total: 0.3%
Junk
- JUNK includes snippets of larger texts, multiply repeated letters, keyboard banging, and the like.
- OCR are "words" that seem to appear primarily as OCR errors in Google Books.
- ERRORs include "search_suggest_query" and the like.
- EMOJI are strings of emoji characters.
- SPAM includes an actual advertisement and what looks like a hacking probe attempt.
- NODE is a pattern that's come up before: Iamlookingfor[...]node[...]
34 3.2% JUNK, 85.3% web, 14.7% api
9 0.9% OCR, 55.6% api, 44.4% web
3 0.3% ERROR, 66.7% api, 33.3% web
3 0.3% EMOJI, 100.0% web
2 0.2% SPAM, 50.0% api, 50.0% web
1 0.1% NODE, 100.0% web
Total: 5.0%
Misses
Actual contentful queries for which there are no entries.
- DICT: mostly obscure words, but they are in Wiktionary.
- SPECIES: Latin species names.
- MISS: other misses that seem to describe reasonable things that could be in Wikipedia, but either aren't or weren't found.
12 1.1% MISS, 50.0% api, 50.0% web
4 0.4% SPECIES, 75.0% web, 25.0% api
5 0.5% DICT, 80.0% api, 20.0% web
Total: 2.0%