User:TJones (WMF)/Notes

This is an index for the reports I've written up on various search- & discovery-related topics. I also have a Project Wishlist of 10% projects and other tasks I would like to do if I had infinite time. If you want to help out on any of those, let me know!

If the technical jargon is confusing, check out our Search Glossary. Feel free to request definitions there, too.

Elasticsearch Analysis Chain Analysis

For help with the technical jargon used in the Analysis Chain Analysis, check out the Language Analysis section of the Search Glossary. For an overview of how speakers review analysis chain changes, see Speaker Review Notes.

Harmonization Notes

Language Analyzer Harmonization Notes—Notes on "harmonizing" language analysis across languages so that non–language-specific analysis is as much the same as possible. (Epic: T219550)

Normalization for Arabic Script Across Languages

Normalization for Arabic Script Across Languages (April 2024) Enable normalization for Arabic script variants of kaf, yeh, and heh

Enable Yiddish Ligatures

Enable Yiddish Ligatures (April 2024) Enable global mappings for Yiddish ligatures to decomposed forms. (Merged, waiting for reindexing.)

Stopwords 2023

Stopwords 2023—A quick overview of where our stopwords come from, where they live, and how to go about updating them.

Unpacking Notes

Unpacking Notes: Analysis of the effect of unpacking monolithic analyzers and enabling homoglyph processing, ICU normalization, and ICU folding. (Related blog post, April 2023)

Elasticsearch Update Analyzer Analysis

Analyzer Analysis for Elasticsearch Upgrade from 6.8 to 7.10 (August 2022) Analysis of the language analysis changes that result from the upgrade to Elasticsearch 7.10.

Analyzer Analysis for Elasticsearch Upgrade from 6.5 to 6.8 (February 2022) Analysis of the language analysis changes that result from the upgrade to Elasticsearch 6.8.

Elasticsearch 6 vs Elasticsearch 5 Analyzer Analysis (February 2019) Analysis of the language analysis changes that result from the upgrade to Elasticsearch 6.

Homoglyph Analysis

Homoglyph Analysis (April 2020, with Maryum) Analysis of the effect of adding homoglyph normalization to the analysis chains of English, French, Russian, Polish, and Serbian.

English Homoglyph Before and After Reindexing Report (April 2021) Second time trying to measure the impact of an analysis change at reindexing time—this time for a small change on a very active wiki.

Khmer Analysis

Syllable Re-Writing to Improve Khmer Search Performance (September 2019) Development and analysis of an algorithm to re-order ambiguous Khmer syllables. (Related blog post, June 2020)

Khmer Reordering Analysis Analysis (January 2021) Analysis of the impact of adding the Khmer reordering plugin to the Khmer analysis chain.

Khmer Reordering Before and After Reindexing Report (March 2021) A first attempt at quantifying the impact of analysis changes & reindexing by running a sample right before and after reindexing. (Related video presentation, May 2021)

... Older Projects ...

Projects concluded in 2020 or earlier are in the section Elasticsearch Analysis Chain Analysis (2016–2020) below.

How Many Languages Does Search Support?

How Many Languages Does Search Support? (July 2024) A quick count of the hard-to-pin-down number of "languages" that are "supported" by search. (Also available as a blog post.)

Blog Posts

Sometimes I write posts for various WMF blogs (Tech Blog, Diff, Foundation Blog), some related to the work on this page. Related blog posts are mentioned with the topic they are based on. Blogs and other communications (like recorded videos) include:

  • How Many Languages Does Search Support?—A quick count of the hard-to-pin-down number of "languages" that are "supported" by search—July 2024. (Also available, with updates, on-wiki. Last wiki update in September 2024.)
  • Language, Harmony, and Unpacking—A Year in the Life of a Search Nerd—An overview of the language analyzer unpacking project I've been working on for a loooong time, with special attention to fun language facts encountered along the way. (April 2023)
    • Video re-recording of internal presentation—July 2023
  • Improving Breton Search—How it started vs. How it's going, a lightning talk at the 2021 Arctic Knot conference (video—June 2021)
  • Khmer—I spent so long working on Khmer that I've presented it a couple of different times in a couple of different ways.
    • Khmer, Khmer, Khmer!—All my Khmer links (including code) in one place.
    • A 5½-minute video presentation I gave at a WMF Tech Department meeting, which provides a quick sketch of the background and problem, and a brief overview of the impact of the changes made. (video—May 2021)
    • Permuting Khmer: Restructuring Khmer syllables for search (blog—June 2020)
  • Search Support for Minority Languages, a presentation at the 2020 Celtic Knot conference (July 2020—video, etherpad with notes, slides).
  • Computational knowledge: Wikidata, Wikidata Query Service, and women who are mayors! (March 2020)—an overview of Wikidata and WDQS.
  • Edit Distance Learning Circle (January 2020—video)—a team-internal presentation on the basics of edit distance, plus my updates to the algorithm to make it token-aware.
  • The Anatomy of Search: a series on how full-text search engines work.
    • A token of my affection (August 2018)—tokenization.
    • Variation under nature (September 2018)—normalization.
    • The root of the problem (November 2018)—stemming, stop words, and thesauri.
    • A place for my stuff (March 2019)—storing text in an index (or indexes!).
    • In search of… (September 2019)—the user's query, and matching & ranking results.
  • Hello, my name is ________: Searching for names is not always straightforward (May 2018/new link)—a collection of name trivia and how the details make searching for names harder than you might think it would be.
  • Confound it!—Supporting languages with multiple writing systems (March 2018/new link)—reading, editing, and searching in multi-script languages.
  • Bare-Bones Basics of Full-Text Search (Jan 2018)—a video recording of a presentation I gave on the fundamentals of full-text search. No pre-requisites, about 45 minutes long.
  • Gnomes and trolls and hobgoblins (oh my!)—Failed queries and the vicarious fear of missing out (Dec 2017)—zero results queries are, collectively, crap.
  • So -happy to meet you: Advanced searching techniques on Wikimedia sites (Nov 2017)—when search syntax interferes with real life.
  • Admittedly loopy but not entirely absurd—Understanding our Search Relevance Survey (Sept 2017)
  • Wikipedia, search, and the “цкщтп” keyboard (Aug 2017)
  • Stripping question marks from Wikimedia searches (Aug 2016)
  • Wikipedia seeks to speak your language (July 2016—collaboration with Deb Tankersley)—TextCat deployed!
  • Wikipedia Search Isn’t Necessarily Third BESt (Sep 2015)—Write-up in The Signpost defending Wikipedia search after it got severely dissed in an academic article.

Glent "Did You Mean" Suggestions

Analysis of DYM Method 2 (October 2020 / February 2021) A comparison of the suggestions from the current DYM suggester and "Method 2" of the new NLP-based approach. (Korean, Japanese, and Chinese)

Glent Update Notes (May–June 2020) Notes on a number of smaller topics related to the ongoing Glent updates (currently mostly driven by Method 1 improvements).

Analysis of DYM Method 1 (October 2019) A comparison of the suggestions from the current DYM suggester and "Method 1" of the new NLP-based approach.

Analysis of DYM Method 0 (May 2019) A comparison of the suggestions from the current DYM suggester and "Method 0" of the new NLP-based approach.

Wrong Keyboard/Encoding Detection

DWIM as API

DWIM as API (January 2021) What would it take to replace the DWIM gadgets with an update to the completion suggester API? A brief overview of requirements and options.

Implementation Design and Parameter Optimization for Wrong Keyboard Detection and Suggestion

Implementation Design and Parameter Optimization for Wrong Keyboard Detection and Suggestion (January 2019) Notes on working through the design for implementing wrong keyboard and wrong encoding detection and optimizing the TextCat parameters.

Typing on the Wrong Keyboard / Russian and English

Typing on the Wrong Keyboard / Russian and English (June 2016) A quick attempt to identify and convert queries typed on the wrong keyboard on the English and Russian Wikipedias. (Related WMF blog post, Aug 2017.)
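
As a rough illustration of the conversion step only (not the method used in the report), a query typed on the wrong layout can be remapped key by key. The sketch below assumes a US QWERTY layout and the standard Russian ЙЦУКЕН layout, handles lowercase characters only, and uses hypothetical function names. Deciding whether a query was typed on the wrong keyboard in the first place is a separate problem and is not shown.

```python
# Hypothetical sketch: characters at the same key positions on a US QWERTY
# layout and on the standard Russian (ЙЦУКЕН) layout, lowercase only.
LATIN = "qwertyuiop[]asdfghjkl;'zxcvbnm,."
CYRILLIC = "йцукенгшщзхъфывапролджэячсмитьбю"

# Both strings must list the same keys in the same order.
assert len(LATIN) == len(CYRILLIC)

_TO_CYRILLIC = str.maketrans(LATIN, CYRILLIC)
_TO_LATIN = str.maketrans(CYRILLIC, LATIN)


def qwerty_to_russian(query: str) -> str:
    """Reinterpret Latin keystrokes as if the Russian layout had been active."""
    return query.translate(_TO_CYRILLIC)


def russian_to_qwerty(query: str) -> str:
    """Reinterpret Cyrillic keystrokes as if the QWERTY layout had been active."""
    return query.translate(_TO_LATIN)


print(qwerty_to_russian("ghbdtn"))  # привет
print(russian_to_qwerty("цкщтп"))   # wrong
```

Characters that are not in the mapping (spaces, digits, uppercase letters) pass through unchanged.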

Review of Commons Queries

Review of Commons Queries (July–August 2020) A review of 3 months' worth of queries issued on Commons to determine the distribution of queries along various dimensions, and look for "interesting" patterns. Including separate review of zero-results queries.

Token-Aware Edit Distance

Token-Aware Edit Distance (on GitHub) (March 2020) As part of my 10% time, I pulled out the token-aware edit distance algorithm from Glent, and wrote a simple command-line driver for it, and created a repo on GitHub. (Related Learning Circle presentation.)

Elasticsearch Analysis Chain Analysis (2016–2020)

What the Heck Does ICU Normalization Do, Anyway?

What the Heck Does ICU Normalization Do, Anyway? (March 2020) Review of all the characters modified by Elasticsearch's ICU Normalization token filter and character filter.

Slovak Analysis

Analyzer

Slovak Analyzer Analysis (March 2018) Analysis of the Slovak analysis chain, built from the stemmer we recently wrapped into an Elasticsearch plugin.

Folding Diacritics in Slovak (June/July/October 2019) Analysis of the impact of folding all letters—including Slovak-specific diacritical letters.

Stemmer

Slovak Stemmer Analysis (March 2018) Review of performance of two Slovak stemmers with the potential to be wrapped into Elasticsearch language analyzers.

Language-Specific Lowercasing and ICU Normalization

Language-Specific Lowercasing and ICU Normalization (March 2019) Analysis of the effects of preserving lang-specific lowercasing when upgrading lowercase filter to ICU normalizer filter, particularly in the plain field.

Greek and Unexpected Empty Tokens

Greek and Unexpected Empty Tokens (February/March 2019) Analysis of unpacking Greek analyzer to add empty-token filter to deal with erroneous empty tokens. Surprise Bonus: lang-specific lowercasing matters!

Nori Analyzer Analysis

Nori Analyzer Analysis (August–October 2018) Analysis of the Nori language analyzer for Korean.

Strip Empty Tokens Generated by ICU Folding

Strip Empty Tokens Generated by ICU Folding (August 2018) Look into the empty tokens generated by ICU folding and make sure patching it has no unintended side effects.

Esperanto Analysis

Stemmer

Esperanto Stemmer Analysis (June/July 2018) Review of the performance of an Esperanto stemmer with the potential to be wrapped into an Elasticsearch language analyzer.

Analyzer

Esperanto Analysis Chain Analysis (August 2018) Analysis of the Esperanto analysis chain, built from the stemmer we recently wrapped into an Elasticsearch plugin.

Stempel Analyzer Analysis

Stempel Analyzer Analysis (February 2017) Analysis of Stempel Polish Analyzer from Elasticsearch, which we'd like to deploy for Polish wiki projects. Generally it works well, but it has some interesting bugs.

Stempel Analyzer Patch Filters (July 2018) Analysis of patches (mostly filters) we can apply to the unpacked Stempel Polish Analyzer to decrease the impact of occasional poor stemming.

Malay Analysis (with a bit of Indonesian)

Analysis of Applying Indonesian Analysis Chain to Malay (June 2018) Analysis of the effect of the Elasticsearch Indonesian analysis chain on Malay-language data, and the effect of unpacking the Indonesian analyzer and adding ICU normalization to Indonesian.

Bosnian, Croatian, Serbian, & Serbo-Croatian Analysis

Stemmer

Serbian Stemmer Analysis (November 2017) Review of performance of three Serbian stemmers with the potential to be wrapped into Elasticsearch language analyzers.

Stemmer No. 4 (December 2017) Review of stemmer #4, which was labeled "Croatian" but which actually works for Serbian, too, after a quick update from the developer.

Analyzer

Serbian Analyzer Analysis (March 2018) Analysis of the Serbian analysis chain, built from the stemmer we recently wrapped into an Elasticsearch plugin.

Bosnian, Croatian, and Serbo-Croatian Analyzer Analysis (April 2018) Analysis of the Serbian analysis chain, applied to Bosnian, Croatian, and Serbo-Croatian.

(Vaguely related WMF blog post, Mar 2018.)

Overview of Phonetic Algorithm Performance

Overview of Phonetic Algorithm Performance (January 2018) Analysis of available Elasticsearch phonetic algorithm plugins for possible implementation of phonetic searching.

Language Analysis Morphological Libraries

Language Analysis Morphological Libraries (October 2017) Review of available morphological analysis libraries that have potential to be wrapped into Elasticsearch language analyzers. Covers Japanese, Vietnamese, Korean, Serbian, Malay, Estonian, Slovak.

Chinese Analyzer Analysis

Chinese Analyzer Analysis (February–April 2017) Analysis of several Chinese Elasticsearch plugins for traditional-to-simplified character conversion and for word segmenting. (Vaguely related WMF blog post, March 2018.)

Punctuation config update (August 2017) About 16% of tokens are punctuation, all indexed as commas, which is silly.

Vietnamese Analyzer Analysis

Vietnamese Analyzer Analysis Analysis of the Vietnamese language analyzer.

Analysis Analysis Tools

Analysis Analysis Tools (July 2017) The first draft of the README file for my Language Analysis Analysis tools, which are being added to the RelForge repo.

Kuromoji Analyzer Analysis

Kuromoji Analyzer Analysis (June-July 2017) Analysis of the Kuromoji language analyzer for Japanese.

HebMorph Analyzer Analysis

HebMorph Analyzer Analysis (May 2017) Analysis of the HebMorph language analyzer for Hebrew.

Ukrainian Morfologik Analysis

Ukrainian Morfologik Analysis (March 2017) Analysis of Elasticsearch plugin for Ukrainian Morfologik Analyzer, recommended by Elasticsearch. It looks good, but because we were originally using the Russian analyzer, the situation is complicated.

Swedish Analyzer Analysis

Swedish Analyzer Analysis (March 2017) Quick analysis of the impact of folding on Swedish.

On Generic ICU Folding

On Generic ICU Folding (December 2016) Copy of quick discussion on Phab about the goals of generic ICU folding, and how to apply it to specific language wikis, hopefully with input from the wiki/language communities.

Upgrading ASCII Folding to ICU Folding for French and English

Upgrading ASCII Folding to ICU Folding for French and English (September 2016) A quick analysis of the effects of enabling ICU folding instead of simple ASCII folding for French and English.

Removing Stress Accents and Folding Ё to Е for Russian Wikis

Removing Stress Accents and Folding Ё to Е for Russian Wikis (September 2016) A quick-ish test on the effects of adding stress-accent-stripping and ё-folding to Russian wikis.

On Merging Apostrophes and Other Unicode Characters

On Merging Apostrophes and Other Unicode Characters (August 2016) Copied from a quick analysis in a Phab ticket on merging Unicode characters so I can easily find it later.

Adding Ascii-Folding to French Wikipedia

Adding Ascii-Folding to French Wikipedia (August 2016) A not-so-quick test on the effects of adding ascii-folding to French Wikipedia. Many unexpected twists and surprises!

Re-Ordering Stemming and Ascii-Folding on English Wikipedia

Re-Ordering Stemming and Ascii-Folding on English Wikipedia (August 2016) A quick test of the effects of moving ascii-folding before stemming on English Wikipedia.

TextCat, Language ID, Etc.

Review of Language Identification in Production, with a Special Focus on Stupid Identification Tricks

Review of Language Identification in Production, with a Special Focus on Stupid Identification Tricks (March 2019) What started as a look into how often punctuation-only queries caused weird cross-language results turned into an overview of current TextCat performance in production and ended up finding bad data in the Chinese query-based language models. What a ride!!

November 2016—These are all on one wiki page if you want to browse them all, or jump to a specific section.

December 2016

January 2017

  • Optimization Framework updates (Dec/Jan): now with coordinate descent!
  • Bucketing and Bonuses (Dec/Jan): give the most likely languages—esp. the "host" language—a boost so that ambiguity or near ambiguity comes out in their favor. Also, re-evaluate whether we've made enough progress to warrant putting back some languages we had to exclude (spoiler: we have!)
  • Unknown n-gram Penalty: Maybe an extra penalty for unknown n-grams will reduce ambiguity; or maybe the penalty is too high and we're throwing out the baby with the bathwater; or maybe it's just right.
  • Final Summary & Recommendations: stick a fork in it; it's done! A review of the overall findings, and the general improvement in F0.5 accuracy across the nine corpora we currently have.
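
For readers unfamiliar with the F0.5 measure mentioned in the last item: the general F-measure combines precision P and recall R with a weight β, and β = 0.5 counts precision more heavily than recall. This is the standard textbook definition, given here as background rather than taken from the reports themselves.

```latex
F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R},
\qquad
F_{0.5} = \frac{1.25\, P\, R}{0.25\, P + R}
```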

TextCat Released into Production!

July 2016—There's a blog post on the Wikimedia blog that Deb and I worked on, announcing TextCat/language ID being in production for five wikis, and a PDF of a longer first draft I wrote on Commons. And while I'm here, I'll suggest the online demo if you want to play around with language identification directly.

TextCat and Confidence

TextCat and Confidence (July 2016) Quick summary of concerns and ideas for assigning a confidence score to TextCat's language identification.

Favoring Recall in Language Identification

Favoring Recall in Language Identification (May 2016) Analysis of recall-favoring options for language detection (rather than precision-favoring), using the same data from frwiki, eswiki, itwiki, and dewiki as below.

Balanced Language Identification Evaluation Set for Queries

Balanced Language Identification Evaluation Set for Queries (February 2016) Creation of a 21-language balanced query corpus, and the evaluation of TextCat against that corpus.

TextCat with Additional Non-Word Characters

TextCat with Additional Non-Word Characters (January 2016) A follow up on an idea from Stas about modifying the non-word characters in TextCat. Ignoring parens helps a wee bit.

ElasticSearch Plugin—Limiting Languages & Retraining

ES Plugin, Limiting Language Options and Retraining on Query Data (December 2015) David retrained the ES Plugin models using the data from the TextCat evaluations, and figured out how to limit the plugin to the "useful" languages. The results are much improved and on par with TextCat.

Language Detection Evaluation—TextCat

Language Detection with TextCat (December 2015)—An evaluation of TextCat (an n-gram–based language identifier) on the enwiki zero-results queries. Includes updates to TextCat, re-training on query data, and limiting language identification to "useful" languages. Offers an improvement over the ES Plugin.

Language Detection Evaluation—Update: Thresholds by Language

Language Detection Evaluation—Update: Thresholds by Language (October 2015)—Evaluated adding a language-specific threshold (i.e., "it's never Romanian" on enwiki!) to the ElasticSearch language detection plugin. Results are overfitted because of the small available data set, but are indicative of significant improvement to precision in language detection.

Language Detection Evaluation

Language Detection Evaluation (September 2015)—A test of language detection against a representative sample of hand-coded zero-results queries from enwiki.

  • ElasticSearch language detection plugin—A language detection plugin available for ElasticSearch;
    • also evaluated with initial and final spaces added (which gives better results, probably because of better recognition of letters at the edges of words)
  • Always "English" detector—Baseline against the current de facto default; also demonstrates that F-score is not necessarily the only relevant measure for search purposes.
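
A minimal sketch of why added initial and final spaces could help a detector based on character n-grams: padding produces n-grams that mark which letters occur at word edges. The function name and the choice of trigrams are illustrative assumptions and are not taken from the plugin.

```python
def char_ngrams(text: str, n: int = 3, pad: bool = False) -> list[str]:
    """Return the character n-grams of text, optionally padded with spaces."""
    if pad:
        # Padding exposes word-initial and word-final letters as their own n-grams.
        text = f" {text} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("the"))            # ['the']
print(char_ngrams("the", pad=True))  # [' th', 'the', 'he ']
```

Without padding, a short query yields very few n-grams. With padding, the same query also yields edge n-grams such as ' th' and 'he ', which tend to differ between languages.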

Conferences and Trip Reports

WikiConference North America 2018 Trip Report (October 2018) Some highlights and thoughts from the conference.

April 2018 Conference Trip Report (April 2018) An overview of talks (with as many links as possible) from the OpenSource Connections Haystack Search Relevance Conference and Tom Tom Founders Festival Machine Learning Conference.

Searching for Punctuation Gives Weird Results

Searching for Punctuation Gives Weird Results (June 2018)—An explanation of the weirdness that comes from searching for single punctuation characters without good redirect support.

Survey of Regular Expression Searches

Survey of Regular Expression Searches (May 2018)—A quick overview of the kinds of regex searches performed across all wikis.

Potential Applications of Natural Language Processing to On-Wiki Search (May 2018)—Outline of lots of ways NLP could improve search on Wikipedia, Wiktionary, etc.

Myanmar Zawgyi Encoding, Initial Survey

Myanmar Zawgyi Encoding, Initial Survey (March/April 2018) A quick review of the Zawgyi font encoding for Myanmar text and plan to assess its impact on search on Myanmar wikis (which are uniformly Unicode-encoded).

extra-analysis Elasticsearch Plugin

extra-analysis Elasticsearch Plugin (February-March 2018) Brief overview of the reasons for creating search/extra-analysis rather than search/serbian-analysis to incorporate an open-source Serbian stemmer.

Hiragana to Katakana Mapping for English and Japanese

Hiragana to Katakana Mapping for English and Japanese (November 2017) Investigate effects of folding hiragana and katakana together for English-language and Japanese-language projects.

Fallback Languages

Fallback Redux (September/October 2017) A more careful analysis of what fallback languages are enabled where, and general notes on likely compatibility.

Fallback Languages (October 2016) A list of languages that are potentially used as fallbacks for other languages in language analysis.

Crimean Tatar Transliteration

Crimean Tatar Transliteration (May-July 2017) An analysis of a work-in-progress transliteration module, adapted from previous work from 2010.

(Vaguely related WMF blog post, Mar 2018.)

Accents, Dead Keys, and Suggestions

Accents, Dead Keys, and Suggestions (July 2017) Copy from Phabricator of discussion of accented characters not generating completion suggester suggestions.

Some Thoughts on the Math of Scoring

Some Thoughts on the Math of Scoring (April 2017) A cleaned up and slightly expanded version of a discussion I had with David about the math of scoring functions. Use your hyperoperations, kids!

So Many Search Options

So Many Search Options (December 2016): an initial proposal to encourage thinking about how to deal with all the different additional ways of searching when a query doesn't give great results ("Did you mean" suggestions, language detection, quote stripping, wrong keyboard detection, etc).

January 2017: Lots of updates and refinements, and the first draft of a proposal to update the API. Now moved out of my Notes to a more generic page.

TextCat Optimizations

TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki

TextCat Optimization for plwiki, arwiki, zhwiki, and nlwiki (September 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.

TextCat Optimization for ptwiki, ruwiki, and jawiki

TextCat Optimization for ptwiki, ruwiki, jawiki (July 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.

TextCat Re-optimization for enwiki

TextCat Re-optimization for enwiki (June 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection; plus comparison to similar previous ZRR-based enwiki corpus from 2015.

TextCat Optimization for frwiki, eswiki, itwiki, and dewiki

TextCat Optimization for frwiki, eswiki, itwiki, and dewiki (April 2016) Analysis of low-performing queries (< 3 results) to optimize languages to be used for language detection.

Spaceless Writing Systems and Wiki-Projects

Spaceless Writing Systems and Wiki-Projects (November 2016) A quick review of languages/projects that don't use spaces between most words in their writing systems.

Top Unsuccessful Search Queries

Top Unsuccessful Search Queries (July 2016) Analysis of the top 100 most frequent zero-results queries for enwiki for the month of May, 2016, to help determine whether mining such queries is worthwhile. (Related WMF blog post, Dec 2017.)

Dropping Final Question Marks in the Top 10 Wikipedias

Dropping Final Question Marks in the Top 10 Wikipedias (June 2016) More detailed look at the effects on search results (especially Zero Results Rate and Poorly Performing Queries) of dropping final question marks from queries on the top 10 Wikipedias. (Related WMF blog post, Aug 2016.)

Quotes and Questions

Quotes and Questions (May 2016) Quick write-up of the effects of removing quotation marks and question marks from poorly performing queries.

How Wrong Would Using Out of Date Page View Data Be?

How Wrong Would Using Out of Date Page View Data Be? (January 2016) We want to integrate page view information into the scoring algorithms we use for both the completion suggester and our regular search results. Our initial idea is we only update this page view information when doing normal document updates after a page edit (for technical reasons, page view data is available/provided when a page is edited). We need to analyze if this page view data will be "good enough" or if we need to do something more.

Relevance Lab!

Relevance Lab (October 2015)—High-level description and design of a Relevance Lab for Discovery, which would allow us (and others!) to experiment with proposed modifications to our search process and gauge their effectiveness and impact before deploying them.

Why People Use Search Engines

Why People Use Search Engines (September 2015)—An overview of how well English Wikipedia Search performs on a sample of ~4K queries that came from Google, with analysis of categories of unsuccessful queries and lots of ideas (not all necessarily practical) for Wikimedia search improvements.

Cross Language Wiki Searching

Cross Language Wiki Searching (September 2015)—An attempt to estimate the impact on enwiki's zero-results rate given "perfect" (or at least human-level) language identification.

Phrase Slop Pre-Test

Phrase Slop Pre-Test (August 2015)—An in vitro test of the ElasticSearch phrase slop parameter against ptwiki and dewiki before the in vivo A/B test. The final report, prepared by Mikhail, is here.

Survey of Zero-Results Queries

Survey of Zero-Results Queries (July 2015)—A survey of the readily identifiable patterns in full-text zero-results queries. Lots of potential bots and bugs identified.

  • One Month Followup (August 2015)—Overview of day-by-day changes in full-text traffic for known bots and bugs one month later, and monthly changes in zero-results rate for top wikis by volume.
  • Full manual review of a 1K enwiki sample (August 2015)—Hand coding and categorization of a 1K sample of full-text zero-results queries.