User:EBernhardson (WMF)/Notes/Accept-Language

November 2015 - Questions and comments are welcome on the talk page

  • This is currently a work in progress and is not complete.

Hypothesis

Using the first non-English language from the Accept-Language HTTP header will provide a good proxy for the language a query is written in when that query returns no results against the wiki it was run against. Further, this will be a better proxy than the existing Elasticsearch langdetect plugin.
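
As a concrete illustration, here is a minimal sketch of the proposed mechanism. The header, the parsed list, and the map excerpt are made up; parseAcceptLang() (reproduced in the appendix) would produce the ordered list shown for a header like "sv-SE,sv;q=0.8,en-US;q=0.5,en;q=0.3".

<?php
// Minimal sketch of the hypothesis: take the first non-English language from the
// parsed Accept-Language header and map it to a wiki to retry the query against.
$acceptedLangs = array( 'sv-se', 'sv', 'en-us', 'en' );       // ordered by descending q-value
$langMap       = array( 'sv' => 'svwiki', 'ko' => 'kowiki' ); // excerpt of the language => wiki map

foreach ( $acceptedLangs as $lang ) {
    $short = preg_replace( '/-.*$/', '', $lang );             // "sv-se" -> "sv"
    if ( $short !== 'en' && isset( $langMap[$short] ) ) {
        echo "retry the zero result query against {$langMap[$short]}\n"; // svwiki
        break;
    }
}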

Final Results

 180,298 full text desktop queries to enwiki for the hour
  28,929 filtered to those with a usable non-English accept-language header
   2,015 number of those queries that give zero results
     395 number of those queries which convert to non-zero results via accept-language
     21% conversion rate from zero to non-zero results among queries with a usable accept-language header
  21,022 estimated number of full text desktop queries to enwiki for the hour that have zero results
    1.9% estimated conversion rate from zero to non-zero results across the full data set

Comparison to language detector:

   2,015 number of zero result queries from above
   1,429 number of those that detect to non-English
     128 number of those which convert to non-zero results via language detection
      9% conversion rate from zero to non-zero results via language detection

Overall this implies we should prefer the accept-language header over language detection when choosing a second wiki to query. We will still need language detection, though, as only 16% of queries had a usable accept-language header.

Caveats

This only analyzed traffic to enwiki; running the same analysis against traffic to ruwiki or zhwiki would likely give somewhat different results. Additionally this only considers desktop search. Query rewriting for the API is not done by default, but is instead hidden behind a feature flag; as such, even though we could have an effect on API requests, most of them do not enable the rewrite flag and so will not be affected. Note that the API makes up something like 75% of all search requests.

Additionally this data set, once filtered to queries with non-English accept-language headers, has a zero result rate of only 6.9%. This is roughly 40% lower ((11.6-6.9)/11.6) than the overall zero result rate recorded in our CirrusSearchRequestSet logs for the same hour. I'm not sure if this means anything, but it seems like a large variance.

Process

Extract one hour's worth of desktop full text searches from webrequest logs

Started by taking an hour's worth of queries plus accept-language headers for enwiki from the hive wmf.webrequest table using the following query. The specific day and hour were chosen arbitrarily. This gives us a set of 180,298 queries to start with, which we will use to calculate the expected change to the zero result rate. This feels much too low to be the total number of full text queries on enwiki for that hour, but is probably a reasonable number to run this test against.

INSERT OVERWRITE LOCAL DIRECTORY '/home/ebernhardson/hive'
   ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE
SELECT accept_language, query_string['search']
  FROM (SELECT accept_language, str_to_map(substr(uri_query, 2), '&', '=') as query_string
          FROM wmf.webrequest
         WHERE year = 2015 AND month = 11 AND day = 04 AND hour = 04
           AND uri_host = 'en.wikipedia.org'
           AND uri_path <> '/w/api.php') x
 WHERE length(query_string['search']) > 0
   AND (query_string['title'] = 'Special:Search' OR query_string['title'] = 'Special%3ASearch');
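
For reference, the str_to_map(substr(uri_query, 2), '&', '=') expression just splits the raw query string into a key/value map without URL-decoding anything. Roughly the following PHP does the same thing to a made-up uri_query:

<?php
// Rough PHP equivalent of the Hive expression above, applied to a made-up uri_query.
$uriQuery = '?search=some+search+terms&title=Special%3ASearch&fulltext=1';
$queryString = array();
foreach ( explode( '&', substr( $uriQuery, 1 ) ) as $pair ) {
    list( $key, $value ) = explode( '=', $pair, 2 );
    $queryString[$key] = $value;
}
print_r( $queryString );
// Array ( [search] => some+search+terms [title] => Special%3ASearch [fulltext] => 1 )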

Filter out requests with only English or invalid accept-language

The result of the above query was then run through the following PHP script to filter out queries that had an invalid accept-language header recorded, or one that only included English. Run against the above set of 180,298 queries, we end up with 28,929 queries that could be affected. This means around 16% of our search queries to enwiki carry a non-English Accept-Language header.

<?php
// parseAcceptLang() is reproduced in the appendix at the bottom of this page
$data = array();
while( false !== ( $line = fgets( STDIN ) ) ) {
    list( $acceptLang, $term ) = explode( "\t", rtrim( $line, "\n" ) );
    $term = trim( $term );
    if ( strlen( $term ) === 0 ) {
        continue;
    }
    $parsed = parseAcceptLang( $acceptLang );
    foreach ( array_keys( $parsed ) as $lang ) {
        if ( substr( $lang, 0, 2 ) !== 'en' ) {
            $data[] = array( $acceptLang, $term );
            break;
        }
    }
}

usort( $data, function( $a, $b ) {
        return strcmp( $a[1], $b[1] );
} );
foreach ( $data as $value ) {
        echo implode( "\t", $value ) . "\n";
}

Run the queries through enwiki to find which are zero result queries

These queries are then run against the enwiki index we have in the hypothesis-testing cluster to see which are zero result queries. This was done with the following command line:

cat queries.accept-lang.sorted | \
    ssh -o Compression=yes suggesty.eqiad.wmflabs 'sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php --wiki=enwiki --decode --options='\''{"wgCirrusSearchEnableAltLanguage":false}'\' | \
    pv -l -s $(wc -l queries.accept-lang.sorted | awk '{print $1}') \
    > queries.accept-lang.sorted.results

The results of this were filtered down to only the zero result queries, giving only 2,015 queries to test our original hypothesis against. Note that 2,015 out of 28,929 means this set had a zero result rate of 6.9%, which is much lower than expected. I queried the same hour from the CirrusSearchRequestSet in hive and came up with a ZRR of 11.7% (see appendix for the query used).

cat queries.accept-lang.sorted.results | \
    jq -c 'if (.totalHits > 0) then . else empty end' |\
    wc -l

Sort zero result queries into a bucket per target wiki

For the next step I needed a map from the languages to their wikis. This was sourced from this gerrit patch.
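
The scripts below only rely on langs.php returning a plain language code => wiki database name map; the entries here are an illustrative excerpt rather than the real file.

<?php
// Hypothetical excerpt of langs.php: language code => wiki database name.
// The real map came from the gerrit patch mentioned above.
return array(
    'zh' => 'zhwiki',
    'ko' => 'kowiki',
    'sv' => 'svwiki',
    'es' => 'eswiki',
    'ja' => 'jawiki',
);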

These queries were then separated out into a file per wiki using the following PHP script.

<?php

$langMap = include __DIR__ . '/langs.php';
$queryFile = fopen( $argv[1], "r" );
$resultFile = fopen( $argv[2], "r" );

$match = $total = $zeroResult = $hasAcceptLang = $error = 0;
while ( !feof( $queryFile ) && !feof( $resultFile ) ) {
    $line = rtrim( fgets( $queryFile ), "\n" );
    list( $accept, $encodedQuery ) = explode( "\t", $line, 2 );
    $query = urldecode( $encodedQuery );
    // not sure why this is necessary, there is some sort of bug in runSearch.php
    // most likely, but a quick review didn't turn anything up.
    while ( $query === "0" ) {
        $line = rtrim( fgets( $queryFile ), "\n" );
        list( $accept, $encodedQuery ) = explode( "\t", $line, 2 );
        $query = urldecode( $encodedQuery );
    }

    $result = json_decode( rtrim( fgets( $resultFile ), "\n" ), true );
    $total++;
    if ( $query !== $result['query'] ) {
        continue;
    }
    $match++;
    if ( isset( $result['error'] ) ) {
        $error++;
        continue;
    }
    // totalHits will not be set for empty queries, such as ' '
    if ( !isset( $result['totalHits'] ) || $result['totalHits'] > 0 ) {
        continue;
    }
    $zeroResult++;

    // we now have a query and know it's a zero result, strip the accept
    // language header down to the first accepted language that is not
    // english
    $parsedAcceptLang = array_keys( parseAcceptLang( $accept ) );
    $tryWiki = null;
    foreach ( $parsedAcceptLang as $lang ) {
        $shortLangCode = preg_replace( '/-.*$/', '', $lang );
        if ( isset( $langMap[$shortLangCode] ) ) {
            $tryWiki = $langMap[$shortLangCode];
            break;
        }
    }
    if ( $tryWiki === null ) {
        continue;
    }
    file_put_contents( $tryWiki, urlencode( $query ) . "\n", FILE_APPEND );
    $hasAcceptLang++;
}
fwrite( STDERR, "\nMatch: $match\nTotal: $total\nError: $error\nZero Result: $zeroResult\nHas accept-language: $hasAcceptLang\n");
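
The script only relies on each line of the results file decoding to a JSON object carrying at least the query and totalHits fields (plus error when something went wrong); the example line below is made up, and the real runSearch.php output may include more fields.

<?php
// Made-up example of a line from the .results file, showing only the fields
// the bucketing script above actually reads.
$line = '{"query":"exempel på en sökning","totalHits":0}';
$result = json_decode( $line, true );
var_dump( $result['totalHits'] ); // int(0) => counted as a zero result query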

A quick look at which wikis the queries were assigned to was done with:

wc -l * | sort -rn | head -n 21 | tail -n 20

This gives the following top 20 targets of queries from enwiki in the analyzed hour:

 545 zhwiki
 329 kowiki
 241 svwiki
 220 eswiki
 114 jawiki
  53 dewiki
  51 ruwiki
  50 arwiki
  45 thwiki
  44 ptwiki
  43 frwiki
  33 hiwiki
  32 idwiki
  19 viwiki
  17 mswiki
  14 hewiki
  13 plwiki
  12 nlwiki
  11 fiwiki
  10 trwiki

I don't have the resources available to run the full 180k query set to calculate its ZRR, but we can estimate it using #Calculate_zero_result_rate_from_hive_CirrusSearchRequestSet_table. This gives a ZRR of 11.7%, which suggests 21k of the 180k queries would have zero results. We were able to convert 395 of those 21k queries into results, for a conversion rate of 1.9%.
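
The arithmetic behind those two figures, using the ZRR from the appendix query:

<?php
// Back-of-the-envelope check of the 21k and 1.9% figures quoted above.
$totalQueries   = 180298;   // full text desktop queries to enwiki for the hour
$zeroResultRate = 0.1166;   // ZRR from the CirrusSearchRequestSet query in the appendix
$converted      = 395;      // zero result queries rescued via accept-language

$estimatedZero = (int)round( $totalQueries * $zeroResultRate );           // ~21,000
printf( "estimated zero results: %d\n", $estimatedZero );
printf( "conversion rate: %.1f%%\n", 100 * $converted / $estimatedZero ); // ~1.9%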

Search target wikis

Now that we have all the zero result queries that have a usable accept-language header broken out into files per wiki, we can run them with the following:

for i in $(wc -l * | sort -rn | head -n 21 | tail -n 20 | awk '{print $2}'); do
    cat $i | \
        ssh -o Compression=yes suggesty.eqiad.wmflabs 'sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php --wiki='$i' --decode --options='\''{"wgCirrusSearchEnableAltLanguage":false}'\' | \
        pv -l -s $(wc -l $i | awk '{print $1}') \
        > $i.results
done

These can then be processed to get, for each target wiki, the fraction of queries that now return results:

(for i in *.results; do
     TOTAL="$(wc -l < $i)"
     NONZERO="$(cat $i | jq -c 'if (.totalHits > 0) then . else empty end' | wc -l)"
     echo $i total: $TOTAL non-zero percent: $(echo "scale=3; $NONZERO / $TOTAL" | bc)
done) | sort -rnk3

Which results in:

 zhwiki.results total: 545 non-zero percent: .216
 kowiki.results total: 329 non-zero percent: .079
 svwiki.results total: 241 non-zero percent: .522
 eswiki.results total: 220 non-zero percent: .222
 jawiki.results total: 114 non-zero percent: .140
 dewiki.results total: 53 non-zero percent: .207
 ruwiki.results total: 51 non-zero percent: .235
 arwiki.results total: 50 non-zero percent: .120
 thwiki.results total: 45 non-zero percent: .066
 ptwiki.results total: 44 non-zero percent: .136
 frwiki.results total: 43 non-zero percent: .162
 hiwiki.results total: 33 non-zero percent: 0
 idwiki.results total: 32 non-zero percent: .281
 viwiki.results total: 19 non-zero percent: 0
 mswiki.results total: 17 non-zero percent: .058
 hewiki.results total: 14 non-zero percent: 0
 plwiki.results total: 13 non-zero percent: .076
 nlwiki.results total: 12 non-zero percent: .083
 fiwiki.results total: 11 non-zero percent: .090
 trwiki.results total: 10 non-zero percent: .200

Across the top 20 wikis there are 1896 queries, of which 395 found results, giving an overall conversion rate of 20.8%. Considering the full set of queries, including those for long-tail languages we did not run, we have 395/2011 = 19.6%.

Compare against language detection

To determine if language detection does a better job than the accept-language header, we take the set of 1896 queries used above and re-bucket them based on language detection. That was done with the following code:

<?php
// $langs is the same language code => wiki map (langs.php) used in the
// accept-language bucketing step above.
$langs = include __DIR__ . '/langs.php';
$errors = $undetectable = $detected = $unknown = 0;
while ( false !== ( $line = fgets( STDIN ) ) ) {
        $encodedQuery = rtrim( $line, "\n" );
        $lang = detectLanguage( $encodedQuery );
        if ( $lang === null ) {
                $undetectable++;
                continue;
        }
        $lang = preg_replace( '/-.*$/', '', $lang );
        if ( isset( $langs[$lang] ) ) {
                file_put_contents( $langs[$lang], $line, FILE_APPEND );
                $detected++;
        } else {
                // usually 'en'
                $unknown++;
        }
}

fwrite( STDERR, "Errors: $errors\nUndetectable: $undetectable\nUnknown: $unknown\nDetected: $detected\n" );

// taken from CirrusSearch\Searcher::detectLanguage()
function detectLanguage( $encodedText ) {
        $ch = curl_init( 'http://localhost:9200/_langdetect' );
        curl_setopt( $ch, CURLOPT_POST, true );
        curl_setopt( $ch, CURLOPT_POSTFIELDS, $encodedText );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        $value = json_decode( curl_exec( $ch ), true );
        if ( $value && !empty( $value['languages'] ) ) {
                $langs = $value['languages'];
                if ( count( $langs ) === 1 ) {
                        return $langs[0]['language'];
                }
                if ( count( $langs ) === 2 ) {
                        if ( $langs[0]['probability'] > 2*$langs[1]['probability'] ) {
                                return $langs[0]['language'];
                        }
                }
        }
        return null;
}
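
To make the acceptance rule in detectLanguage() explicit: a single detected language is taken at face value, and when two candidates come back the top one is only used if it is more than twice as probable as the runner-up; anything else is treated as undetectable. A small sketch of that rule with made-up probabilities:

<?php
// Sketch of the acceptance rule used by detectLanguage() above, run against
// made-up _langdetect style candidate lists (the probabilities are invented).
function pickLanguage( array $langs ) {
    if ( count( $langs ) === 1 ) {
        return $langs[0]['language'];
    }
    if ( count( $langs ) === 2 && $langs[0]['probability'] > 2 * $langs[1]['probability'] ) {
        return $langs[0]['language'];
    }
    return null;
}

var_dump( pickLanguage( array( array( 'language' => 'sv', 'probability' => 0.99 ) ) ) ); // "sv"
var_dump( pickLanguage( array(
    array( 'language' => 'de', 'probability' => 0.6 ),
    array( 'language' => 'nl', 'probability' => 0.2 ),
) ) ); // "de": clear winner
var_dump( pickLanguage( array(
    array( 'language' => 'pt', 'probability' => 0.5 ),
    array( 'language' => 'es', 'probability' => 0.4 ),
) ) ); // NULL: too close to call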

The full detection script was run as follows:

mkdir per-wiki.lang-detect
cd per-wiki.lang-detect
cat ../per-wiki.accept-lang/*wiki | php ../lang-detect.php

Re-running the same per-wiki analysis as above, we get:

 ptwiki.results total: 193 non-zero percent: .015
 itwiki.results total: 149 non-zero percent: .120
 rowiki.results total: 148 non-zero percent: .081
 dewiki.results total: 143 non-zero percent: .118
 tlwiki.results total: 87 non-zero percent: 0
 frwiki.results total: 67 non-zero percent: .119
 eswiki.results total: 65 non-zero percent: .307
 sqwiki.results total: 57 non-zero percent: 0
 idwiki.results total: 53 non-zero percent: .207
 huwiki.results total: 47 non-zero percent: 0
 svwiki.results total: 40 non-zero percent: .350
 ltwiki.results total: 38 non-zero percent: 0
 nowiki.results total: 36 non-zero percent: .138
 zhwiki.results total: 31 non-zero percent: .129
 dawiki.results total: 31 non-zero percent: .032
 hrwiki.results total: 29 non-zero percent: .034
 fiwiki.results total: 29 non-zero percent: .034
 plwiki.results total: 28 non-zero percent: .071
 etwiki.results total: 23 non-zero percent: 0
 trwiki.results total: 22 non-zero percent: .045
 nlwiki.results total: 20 non-zero percent: .250
 viwiki.results total: 12 non-zero percent: 0
 lvwiki.results total: 7 non-zero percent: 0
 cswiki.results total: 7 non-zero percent: .142

This finds results for 124 queries out of 2,015, for a conversion rate of 6.1%.

Additional functions used in above scripts

parseAcceptLang

// sourced from mediawiki WebRequest::getAcceptLang()
function parseAcceptLang( $acceptLang ) {
    if ( !$acceptLang ) {
        return array();
    }

    $acceptLang = strtolower( $acceptLang );
    $lang_parse = null;
    preg_match_all(
        '/([a-z]{1,8}(-[a-z]{1,8})*|\*)\s*(;\s*q\s*=\s*(1(\.0{0,3})?|0(\.[0-9]{0,3})?)?)?/',
        $acceptLang,
        $lang_parse
    );

    if ( !count( $lang_parse[1] ) ) {
        return array();
    }

    $langcodes = $lang_parse[1];
    $qvalues = $lang_parse[4];
    $indices = range( 0, count( $lang_parse[1] ) - 1 );

    foreach ( $indices as $index ) {
        if ( $qvalues[$index] === '' ) {
            $qvalues[$index] = 1;
        } elseif ( $qvalues[$index] == 0 ) {
            unset( $langcodes[$index], $qvalues[$index], $indices[$index] );
        }
    }

    array_multisort( $qvalues, SORT_DESC, SORT_NUMERIC, $indices, $langcodes );
    return array_combine( $langcodes, $qvalues );
}
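
A quick illustration of what this returns for a made-up header: languages come back ordered by descending q-value, with a missing q defaulting to 1.

<?php
// Made-up header; requires parseAcceptLang() from above.
print_r( parseAcceptLang( 'de-DE,de;q=0.8,en-US;q=0.6,en;q=0.4' ) );
// Array ( [de-de] => 1 [de] => 0.8 [en-us] => 0.6 [en] => 0.4 )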

Calculate zero result rate from hive CirrusSearchRequestSet table

SELECT SUM(results.outcome) AS non_zero,
       COUNT(*) - SUM(results.outcome) AS zero,
       1 - SUM(results.outcome) / COUNT(*) AS zero_result_rate
  FROM ( SELECT IF(array_sum(requests.hitstotal) > 0, 1, 0) AS outcome
           FROM ebernhardson.cirrussearchrequestset
          WHERE year=2015 AND month=11 AND day=4 AND hour=4
            AND wikiid='enwiki'
            AND source='web'
            AND array_contains(requests.querytype, 'full_text')
       ) AS results;

Result:

 non_zero zero    zero_result_rate
 131003   17298   0.11664115548782539