User:EBernhardson (WMF)/Notes/Accept-Language
November 2015 - Questions and comments are welcome on the talk page
This is currently a work in progress and is not complete.
Hypothesis
Using the first non-English language in the Accept-Language HTTP header will provide a good proxy for the language a query is in when the query returns no results against the wiki it was run against. Further, this is a better proxy than the existing Elasticsearch langdetect plugin.
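For illustration (a made-up header, not one taken from the data set), a browser configured for German might send:

Accept-Language: de-DE,de;q=0.8,en-US;q=0.5,en;q=0.3

The first non-English language here is de, so a zero-result query on enwiki would be retried against dewiki.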
Final Results
180,298 full text desktop queries to enwiki for the hour
28,929 filtered to those with a usable non-English Accept-Language header
2,015 of those queries give zero results
395 of those queries convert to non-zero results via Accept-Language
21% conversion rate from zero result with a usable Accept-Language header to non-zero result
21,022 estimated full text desktop queries to enwiki for the hour that have zero results (≈ 180,298 × the 11.7% ZRR estimated in the appendix)
1.9% conversion rate from zero result to having a result for the full data set (395 / 21,022)
Comparison to language detector:
2,015 zero result queries from above
1,429 of those detect as non-English
128 of those convert to non-zero results via language detection
9% conversion rate from zero result to non-zero result via lang-detect (128 / 1,429)
Generally this implies we should prefer the Accept-Language header over language detection when choosing a second wiki to query. We will still need language detection though, as only 16% of queries had a usable Accept-Language header.
Caveats
This only analyzed traffic to enwiki. It is likely that running the same analysis against traffic to ruwiki or zhwiki would give somewhat different results. Additionally, this only considers desktop search. Query rewriting for the API is not done by default, but is instead hidden behind a feature flag. As such, even though we could have an effect on API requests, most of them do not enable the rewrite flag and will not be affected. Note that the API makes up something like 75% of all search requests.
Additionally this data set, once filtered to queries with non-English accept headers, has a zero result rate of only 6.9%. This is roughly 40% lower ((11.7 − 6.9) / 11.7) than the overall zero result rate recorded in our CirrusSearchRequestSet logs for the same hour. Not sure if this means anything, but it seems like a large variance.
Process
Extract one hour's worth of desktop full text searches from webrequest logs
Started by taking an hour's worth of queries plus Accept-Language headers for enwiki from the hive wmf.webrequest table using the following query. The specific day and hour to work with was chosen arbitrarily. This gives us a set of 180,298 queries to start with, which we will use to calculate the expected change to the zero result rate. This feels much too low to be the total number of full text queries on enwiki for that hour, but is probably a reasonable number to run this test against.
INSERT OVERWRITE LOCAL DIRECTORY '/home/ebernhardson/hive'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
SELECT accept_language, query_string['search']
FROM (SELECT accept_language, str_to_map(substr(uri_query, 2), '&', '=') as query_string
FROM wmf.webrequest
WHERE year = 2015 AND month = 11 AND day = 04 AND hour = 04
AND uri_host = 'en.wikipedia.org'
AND uri_path <> '/w/api.php') x
WHERE length(query_string['search']) > 0
AND (query_string['title'] = 'Special:Search' OR query_string['title'] = 'Special%3ASearch');
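For reference, substr(uri_query, 2) drops the leading ? and str_to_map splits the remainder on & and =, so a hypothetical uri_query value such as

?search=m%C3%BCnchen&title=Special:Search&fulltext=1

parses to the map {search → m%C3%BCnchen, title → Special:Search, fulltext → 1}, which the outer WHERE clause then filters on.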
Filter out requests with only English or invalid accept-language
The result of the above query was then run through the following php script to filter out queries that had an invalid Accept-Language header recorded, or one that only included English. Run against the above set of 180,298 queries we end up with 28,929 queries that could be affected. This means around 16% of our search queries to enwiki contain a non-English Accept-Language header.
<?php
// uses parseAcceptLang(), defined in the appendix below
$data = array();
while ( false !== ( $line = fgets( STDIN ) ) ) {
    list( $acceptLang, $term ) = explode( "\t", rtrim( $line, "\n" ) );
    $term = trim( $term );
    if ( strlen( $term ) === 0 ) {
        continue;
    }
    // keep the query if any accepted language is not English; headers that
    // fail to parse come back as an empty array and are dropped
    $parsed = parseAcceptLang( $acceptLang );
    foreach ( array_keys( $parsed ) as $lang ) {
        if ( substr( $lang, 0, 2 ) !== 'en' ) {
            $data[] = array( $acceptLang, $term );
            break;
        }
    }
}
// sort by search term to produce queries.accept-lang.sorted
usort( $data, function( $a, $b ) {
    return strcmp( $a[1], $b[1] );
} );
foreach ( $data as $value ) {
    echo implode( "\t", $value ) . "\n";
}
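Assuming the filter script is saved as filter-accept-lang.php (name mine, with parseAcceptLang from the appendix pasted in), it reads the hive output on stdin:

php filter-accept-lang.php < hive-output.tsv > queries.accept-lang.sorted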
Run the queries through enwiki to find which are zero result queries
These queries are then run against the enwiki index we have in the hypothesis-testing cluster to see which are zero result queries. This was done with the following command line:
cat queries.accept-lang.sorted | \
ssh -o Compression=yes suggesty.eqiad.wmflabs 'sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php --wiki=enwiki --decode --options='\''{"wgCirrusSearchEnableAltLanguage":false}'\' | \
pv -l -s $(wc -l queries.accept-lang.sorted | awk '{print $1}') \
> queries.accept-lang.sorted.results
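Each line of queries.accept-lang.sorted.results is a JSON blob emitted by runSearch.php. Judging from the fields consumed in the scripts below, a line looks roughly like this (abbreviated sketch, not verbatim output):

{"query": "some search term", "totalHits": 0, ...}
{"query": "another search term", "error": ...}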
The results of this were filtered down to only the zero result queries, giving 2,015 queries to test our original hypothesis against. Note that 2,015 out of 28,929 means this set had a zero result rate of 6.9%, which is much lower than expected. I queried the same hour from the CirrusSearchRequestSet in hive and came up with a ZRR of 11.7% (see appendix for the query used).
cat queries.accept-lang.sorted.results | \
jq -c 'if (.totalHits > 0) then empty else . end' | \
wc -l
Sort zero result queries into a bucket per target wiki
For the next step I needed a map from languages to their wikis. This was sourced from this gerrit patch.
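langs.php, included by the script below, maps short language codes to wiki IDs. A sketch of its shape (not the actual contents from the gerrit patch):

<?php
return array(
    'zh' => 'zhwiki',
    'ko' => 'kowiki',
    'sv' => 'svwiki',
    'es' => 'eswiki',
    // ... one entry per language with a wiki; notably no 'en' entry
);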
These queries were then separated out into a file per wiki using the following php script.
<?php
// uses parseAcceptLang() from the appendix; langs.php is the language => wiki map
$langMap = include __DIR__ . '/langs.php';
$queryFile = fopen( $argv[1], "r" );
$resultFile = fopen( $argv[2], "r" );
$match = $total = $zeroResult = $hasAcceptLang = $error = 0;
while ( !feof( $queryFile ) && !feof( $resultFile ) ) {
    $line = rtrim( fgets( $queryFile ), "\n" );
    list( $accept, $encodedQuery ) = explode( "\t", $line, 2 );
    $query = urldecode( $encodedQuery );
    // not sure why this is necessary, there is some sort of bug in runSearch.php
    // most likely, but a quick review didn't turn anything up.
    while ( $query === "0" ) {
        $line = rtrim( fgets( $queryFile ), "\n" );
        list( $accept, $encodedQuery ) = explode( "\t", $line, 2 );
        $query = urldecode( $encodedQuery );
    }
    $result = json_decode( rtrim( fgets( $resultFile ), "\n" ), true );
    $total++;
    // only count lines where the query and result files line up
    if ( $query !== $result['query'] ) {
        continue;
    }
    $match++;
    if ( isset( $result['error'] ) ) {
        $error++;
        continue;
    }
    // totalHits will not be set for empty queries, such as ' '
    if ( !isset( $result['totalHits'] ) || $result['totalHits'] > 0 ) {
        continue;
    }
    $zeroResult++;
    // we now have a query and know it's a zero result, strip the accept
    // language header down to the first accepted language that is not
    // english (this assumes $langMap carries no entry for 'en')
    $parsedAcceptLang = array_keys( parseAcceptLang( $accept ) );
    $tryWiki = null;
    foreach ( $parsedAcceptLang as $lang ) {
        $shortLangCode = preg_replace( '/-.*$/', '', $lang );
        if ( isset( $langMap[$shortLangCode] ) ) {
            $tryWiki = $langMap[$shortLangCode];
            break;
        }
    }
    if ( $tryWiki === null ) {
        continue;
    }
    // append the url-encoded query to a file named after the target wiki
    file_put_contents( $tryWiki, urlencode( $query ) . "\n", FILE_APPEND );
    $hasAcceptLang++;
}
fwrite( STDERR, "\nMatch: $match\nTotal: $total\nError: $error\nZero Result: $zeroResult\nHas accept-language: $hasAcceptLang\n" );
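With the bucketing script saved as, say, bucket-by-wiki.php (name mine), it takes the query and result files from the earlier steps and writes one file per target wiki into the current directory:

mkdir per-wiki.accept-lang
cd per-wiki.accept-lang
php ../bucket-by-wiki.php ../queries.accept-lang.sorted ../queries.accept-lang.sorted.results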
A quick look at which wikis the queries were assigned to was done with
wc -l * | sort -rn | head -n 21 | tail -n 20
This gives the following top 20 targets of queries from enwiki in the analyzed hour:
545 zhwiki
329 kowiki
241 svwiki
220 eswiki
114 jawiki
53 dewiki
51 ruwiki
50 arwiki
45 thwiki
44 ptwiki
43 frwiki
33 hiwiki
32 idwiki
19 viwiki
17 mswiki
14 hewiki
13 plwiki
12 nlwiki
11 fiwiki
10 trwiki
I don't have the resources available to run the full 180k query set to calculate its ZRR, but we can estimate using #Calculate_zero_result_rate_from_hive_CirrusSearchRequestSet_table. This gives a ZRR of 11.7%, which suggests 21k of the 180k queries would have zero results. We were able to convert 395 of those 21k queries into results, for a conversion rate of 1.9%.
Search target wikis
Now that we have all the zero result queries with a usable Accept-Language header broken out into files per wiki, we can run them with the following:
for i in $(wc -l * | sort -rn | head -n 21 | tail -n 20 | awk '{print $2}'); do
cat $i | \
ssh -o Compression=yes suggesty.eqiad.wmflabs 'sudo -u vagrant mwscript extensions/CirrusSearch/maintenance/runSearch.php --wiki='$i' --decode --options='\''{"wgCirrusSearchEnableAltLanguage":false}'\' | \
pv -l -s $(wc -l $i | awk '{print $1}') \
> $i.results
done
These can then be processed to get the new zero result rate:
(for i in *.results; do
TOTAL="$(wc -l < $i)"
NONZERO="$(cat $i | jq -c 'if (.totalHits > 0) then . else empty end' | wc -l)"
echo $i total: $TOTAL non-zero percent: $(echo "scale=3; $NONZERO / $TOTAL" | bc)
done) | sort -rnk3
Which results in:
zhwiki.results total: 545 non-zero percent: .216
kowiki.results total: 329 non-zero percent: .079
svwiki.results total: 241 non-zero percent: .522
eswiki.results total: 220 non-zero percent: .222
jawiki.results total: 114 non-zero percent: .140
dewiki.results total: 53 non-zero percent: .207
ruwiki.results total: 51 non-zero percent: .235
arwiki.results total: 50 non-zero percent: .120
thwiki.results total: 45 non-zero percent: .066
ptwiki.results total: 44 non-zero percent: .136
frwiki.results total: 43 non-zero percent: .162
hiwiki.results total: 33 non-zero percent: 0
idwiki.results total: 32 non-zero percent: .281
viwiki.results total: 19 non-zero percent: 0
mswiki.results total: 17 non-zero percent: .058
hewiki.results total: 14 non-zero percent: 0
plwiki.results total: 13 non-zero percent: .076
nlwiki.results total: 12 non-zero percent: .083
fiwiki.results total: 11 non-zero percent: .090
trwiki.results total: 10 non-zero percent: .200
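As a cross-check, the total number of converted queries across the top 20 buckets can be counted directly; for this data set it should print the 395 used below:

cat *.results | jq -c 'if (.totalHits > 0) then . else empty end' | wc -l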
Across the top 20 wikis there are 1,896 queries, of which 395 found results. This gives an overall 20.8% conversion rate. Considering the full set of bucketed queries, including the long-tail wikis we did not run, we have 395/2011 = 19.6%.
Compare against language detection
To determine whether language detection does a better job than the Accept-Language header, we take the set of 1,896 queries bucketed above and re-bucket them based on language detection. That was done with the following code:
<?php
// $langs is the same language => wiki map used above; the original snippet
// left this include implicit
$langs = include __DIR__ . '/langs.php';
$errors = $undetectable = $detected = $unknown = 0;
while ( false !== ( $line = fgets( STDIN ) ) ) {
    $encodedQuery = rtrim( $line, "\n" );
    $lang = detectLanguage( $encodedQuery );
    if ( $lang === null ) {
        $undetectable++;
        continue;
    }
    // reduce e.g. 'zh-cn' to 'zh'
    $lang = preg_replace( '/-.*$/', '', $lang );
    if ( isset( $langs[$lang] ) ) {
        file_put_contents( $langs[$lang], $line, FILE_APPEND );
        $detected++;
    } else {
        // usually 'en'
        $unknown++;
    }
}
fwrite( STDERR, "Errors: $errors\nUndetectable: $undetectable\nUnknown: $unknown\nDetected: $detected\n" );

// taken from CirrusSearch\Searcher::detectLanguage()
function detectLanguage( $encodedText ) {
    $ch = curl_init( 'http://localhost:9200/_langdetect' );
    curl_setopt( $ch, CURLOPT_POST, true );
    curl_setopt( $ch, CURLOPT_POSTFIELDS, $encodedText );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    $value = json_decode( curl_exec( $ch ), true );
    if ( $value && !empty( $value['languages'] ) ) {
        $langs = $value['languages'];
        if ( count( $langs ) === 1 ) {
            return $langs[0]['language'];
        }
        // accept the top of two candidates only when it is more than
        // twice as probable as the runner-up
        if ( count( $langs ) === 2 ) {
            if ( $langs[0]['probability'] > 2 * $langs[1]['probability'] ) {
                return $langs[0]['language'];
            }
        }
    }
    return null;
}
It was run as follows:
mkdir per-wiki.lang-detect
cd per-wiki.lang-detect
cat ../per-wiki.accept-lang/*wiki | php ../lang-detect.php
Re-running the same analysis as above for per-wiki ZRR we get:
ptwiki.results total: 193 non-zero percent: .015
itwiki.results total: 149 non-zero percent: .120
rowiki.results total: 148 non-zero percent: .081
dewiki.results total: 143 non-zero percent: .118
tlwiki.results total: 87 non-zero percent: 0
frwiki.results total: 67 non-zero percent: .119
eswiki.results total: 65 non-zero percent: .307
sqwiki.results total: 57 non-zero percent: 0
idwiki.results total: 53 non-zero percent: .207
huwiki.results total: 47 non-zero percent: 0
svwiki.results total: 40 non-zero percent: .350
ltwiki.results total: 38 non-zero percent: 0
nowiki.results total: 36 non-zero percent: .138
zhwiki.results total: 31 non-zero percent: .129
dawiki.results total: 31 non-zero percent: .032
hrwiki.results total: 29 non-zero percent: .034
fiwiki.results total: 29 non-zero percent: .034
plwiki.results total: 28 non-zero percent: .071
etwiki.results total: 23 non-zero percent: 0
trwiki.results total: 22 non-zero percent: .045
nlwiki.results total: 20 non-zero percent: .250
viwiki.results total: 12 non-zero percent: 0
lvwiki.results total: 7 non-zero percent: 0
cswiki.results total: 7 non-zero percent: .142
This finds results for 124 queries out of 2,015, for a conversion rate of 6.1%.
Additional functions used in the above scripts
parseAcceptLang
// sourced from mediawiki WebRequest::getAcceptLang()
function parseAcceptLang( $acceptLang ) {
    if ( !$acceptLang ) {
        return array();
    }
    $acceptLang = strtolower( $acceptLang );
    $lang_parse = null;
    preg_match_all(
        '/([a-z]{1,8}(-[a-z]{1,8})*|\*)\s*(;\s*q\s*=\s*(1(\.0{0,3})?|0(\.[0-9]{0,3})?)?)?/',
        $acceptLang,
        $lang_parse
    );
    if ( !count( $lang_parse[1] ) ) {
        return array();
    }
    $langcodes = $lang_parse[1];
    $qvalues = $lang_parse[4];
    $indices = range( 0, count( $lang_parse[1] ) - 1 );
    foreach ( $indices as $index ) {
        if ( $qvalues[$index] === '' ) {
            // a missing q-value means q=1
            $qvalues[$index] = 1;
        } elseif ( $qvalues[$index] == 0 ) {
            // q=0 means "not acceptable"; drop the language entirely
            unset( $langcodes[$index], $qvalues[$index], $indices[$index] );
        }
    }
    // sort languages by descending q-value
    array_multisort( $qvalues, SORT_DESC, SORT_NUMERIC, $indices, $langcodes );
    return array_combine( $langcodes, $qvalues );
}
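As a quick worked example (header made up for illustration):

// parseAcceptLang( 'de-DE,de;q=0.8,en-US;q=0.5,en;q=0.3' ) returns
// array( 'de-de' => 1, 'de' => '0.8', 'en-us' => '0.5', 'en' => '0.3' )
// i.e. languages keyed by lowercased code, sorted by descending q-value.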
Calculate zero result rate from hive CirrusSearchRequestSet table
SELECT SUM(results.outcome) AS non_zero,
COUNT(*) - SUM(results.outcome) AS zero,
1 - SUM(results.outcome) / COUNT(*) AS zero_result_rate
FROM ( SELECT IF(array_sum(requests.hitstotal) > 0, 1, 0) AS outcome
FROM ebernhardson.cirrussearchrequestset
WHERE year=2015 AND month=11 AND day=4 AND hour=4
AND wikiid='enwiki'
AND source='web'
AND array_contains(requests.querytype, 'full_text')
) AS results;
Result:
non_zero   zero    zero_result_rate
131003     17298   0.11664115548782539

That is, 17,298 zero-result queries out of 148,301 total, for a ZRR of roughly 11.7%.