User:BDavis (WMF)/Notes/Finding Files To Purge

Trying to help with bugzilla:54647.

Some unspecified network error caused HTCP purge messages to be lost for an extended period of time. Once the network issue was resolved, we needed to purge a lot of potentially stale pages from the Varnish cache layer.

Figuring out what to do

  • Reedy knew that the purgeList.php maintenance script could be used to send the purges (see the sketch after this list)
  • Nik knew where to find a query that would grab all pages changed in a date range
  • I volunteered to try to bring the two together.
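
For reference, the purgeList.php half of the job is roughly this (a sketch only, assuming the script's usual behaviour of reading one title or URL per line from stdin; options vary by MediaWiki version):

# Sketch only: feed a file of fully-qualified URLs (one per line) to
# purgeList.php, which sends the corresponding purge requests.
mwscript purgeList.php enwiki < enwiki-misses.txt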

Finding the pages

Nik's SQL query came in the form of PHP code from CirrusSearch. I took that and put it into phpsh to get the generated SQL.

$minUpdate = new MWTimestamp( '2013-09-22T00:00:00Z' );
$maxUpdate = new MWTimestamp( '2013-09-26T00:00:00Z' );
$dbr = wfGetDB( DB_SLAVE );
$minId = $dbr->addQuotes( '-1' );
$minUpdate = $dbr->addQuotes( $dbr->timestamp( $minUpdate ) );
$maxUpdate = $dbr->addQuotes( $dbr->timestamp( $maxUpdate ) );
$res = $dbr->selectSQLText(
               array( 'page', 'revision' ),
               array_merge(
                   array( 'page_touched', 'page_counter', 'page_restrictions' ),
                   Revision::selectPageFields(),
                   Revision::selectFields()
               ),
               'page_id = rev_page'
               . ' AND rev_id = page_latest'
               . " AND ( ( $minUpdate = rev_timestamp AND $minId < page_id ) OR $minUpdate < rev_timestamp )"
               . " AND rev_timestamp <= $maxUpdate",
               // Note that redirects are allowed here so we can pick up redirects made during search downtime
               'INTERACTIVE',
               array( 'ORDER BY' => 'rev_timestamp, rev_page')
           );

The resulting SQL was:

SELECT page_touched
    ,page_counter
    ,page_restrictions
    ,page_namespace
    ,page_title
    ,page_id
    ,page_latest
    ,page_is_redirect
    ,page_len
    ,rev_id
    ,rev_page
    ,rev_text_id
    ,rev_timestamp
    ,rev_comment
    ,rev_user_text
    ,rev_user
    ,rev_minor_edit
    ,rev_deleted
    ,rev_len
    ,rev_parent_id
    ,rev_sha1
    ,rev_content_format
    ,rev_content_model
FROM `page`
    ,`revision`
WHERE page_id = rev_page
  AND rev_id = page_latest 
  AND (
    ( '20130922000000' = rev_timestamp AND '-1' < page_id )
    OR '20130922000000' < rev_timestamp
  )
  AND rev_timestamp <= '20130926000000'
  ORDER BY rev_timestamp, rev_page

This was TMI, but it led me to formulate a simpler version that was "good enough for now". I ran that on Tool Labs against the enwiki and dewiki replicas:

ssh tools-login.wmflabs.org
become bd808-test
mysql --defaults-file="${HOME}/replica.my.cnf" -h enwiki.labsdb enwiki_p -e "SELECT page_title
FROM page, revision
WHERE page_id = rev_page
  AND rev_id = page_latest 
  AND page_namespace = 0
  AND (
    ( '20130922000000' = rev_timestamp AND '-1' < page_id )
    OR '20130922000000' < rev_timestamp
  )
  AND rev_timestamp <= '20130926000000'
;" > enwiki-misses.txt
mysql --defaults-file="${HOME}/replica.my.cnf" -h dewiki.labsdb dewiki_p -e "SELECT page_title
FROM page, revision
WHERE page_id = rev_page
  AND rev_id = page_latest 
  AND page_namespace = 0
  AND (
    ( '20130922000000' = rev_timestamp AND '-1' < page_id )
    OR '20130922000000' < rev_timestamp
  )
  AND rev_timestamp <= '20130926000000'
;" > dewiki-misses.txt

Massaging the data

Now that I had dumps of page titles, I needed to turn them into full URLs that Reedy could use to do the purges:

First, URL-encode the titles:

perl -pi.raw -e 'chomp();s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;print "\n"' enwiki-misses.txt dewiki-misses.txt
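
To see what that substitution does, here is the same encoding applied to a made-up title (every non-alphanumeric character becomes a %XX escape):

# Hypothetical example: '_', '(' and ')' become %5F, %28 and %29.
echo 'Foo_(bar)' | perl -ne 'chomp; s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg; print "$_\n"'
# => Foo%5F%28bar%29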

Then turn them into full URLs and make sure there aren't any dupes:

perl -pi.enc -e 's@^@http://de.wikipedia.org/@' dewiki-misses.txt
mv dewiki-misses.txt dewiki-misses.txt.url
sort -u dewiki-misses.txt.url > dewiki-misses.txt
perl -pi.enc -e 's@^@http://en.wikipedia.org/@' enwiki-misses.txt
mv enwiki-misses.txt enwiki-misses.txt.url
sort -u enwiki-misses.txt.url > enwiki-misses.txt

Now What?

After a lot of futzing around, I scp'd the files to a web-accessible location where Reedy picked them up and applied them.

Here's the start of a script that may help with this the next time it happens.

#!/usr/bin/env bash
# Find all wiki pages changed between a given start and end date.
# TODO:
# - Add command line flags
# - Add namespace support

LANG=${1:-en}
WIKI=${LANG}wiki
START_TS=20130922000000
END_TS=20130926000000

# NOTE: only gets changes to namespace 0 (Main)
mysql --defaults-file="${HOME}/replica.my.cnf" -h ${WIKI}.labsdb \
  ${WIKI}_p --skip-column-names -e "
SELECT page_title
FROM page, revision
WHERE page_id = rev_page
  AND rev_id = page_latest
  AND page_namespace = 0
  AND (
    ( '${START_TS}' = rev_timestamp AND '-1' < page_id )
      OR '${START_TS}' < rev_timestamp
  )
  AND rev_timestamp <= '${END_TS}'
;" |
perl -e 'while (<STDIN>) {
  next if (/^\s+$/);
  chomp;
  s/([^A-Za-z0-9])/sprintf("%%%02X", ord($1))/seg;
  print $_, "\n";
}'|
perl -p -e "s@^@http://${LANG}.wikipedia.org/wiki/@" |
sort -u > ${WIKI}-changed.txt

# vim:sw=2 ts=2 sts=2 et:
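
Example run on Tool Labs (find-changed-pages.sh is just a placeholder name for the script above):

# Hypothetical file name; pass the language code as the only argument.
bash find-changed-pages.sh de   # writes dewiki-changed.txt
bash find-changed-pages.sh      # defaults to en, writes enwiki-changed.txt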

Maintenance Scripts

purgeChangedPages

Tim suggested that I go ahead and make a real maintenance script for this. See gerrit:86883 for progress on the new purgeChangedPages script.

Once the maintenance script is approved, somebody (maybe even me) will use it to send purges to esams for all wikis.

Testing

Tested in the beta cluster to see if it really does what it is supposed to do.

  1. ssh deployment-bastion.pmtpa.wmflabs
  2. Check that script runs at all:
    1. mwscript purgeChangedPages.php enwiki --starttime 20130922000000 --endtime 20130926000000 -v --dry-run
    2. manually examined list of changed files generated
  3. Check that PURGE messages are received by Varnish:
    1. mwscript purgeChangedPages.php enwiki --starttime 20130922000000 --endtime 20130926000000 -v
    2. watched PURGE requests come in on deployment-cache-text1.pmtpa.wmflabs with varnishncsa -n frontend|grep PURGE
  4. Check that using --htcp-dest flag does something useful:
    1. started htcpsnoop to listen for HTCP packets at 127.0.0.1:31337 on deployment-bastion.pmtpa.wmflabs (see the minimal stand-in sketch after this list)
    2. mwscript purgeChangedPages.php enwiki --starttime 20130922000000 --endtime 20130926000000 -v --htcp-dest=127.0.0.1:31337
      1. watched HTCP arrive on python daemon
      2. verified that no PURGE requests come in on deployment-cache-text1.pmtpa.wmflabs at same time
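
htcpsnoop was a small throwaway Python daemon; any UDP listener will do for this check. A minimal stand-in (not the script actually used), assuming an OpenBSD-style netcat:

# Minimal stand-in for htcpsnoop: dump UDP datagrams arriving on
# 127.0.0.1:31337 as a hex/ASCII listing. Netcat flag syntax varies
# between implementations; this is the OpenBSD form.
nc -klu 127.0.0.1 31337 | xxd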

Production purge

  1.   Done Find out when script will land on terbium: Thursday 2013-10-10
    •   Done terbium:/usr/local/apache/common/php-1.22wmf21/maintenance/purgeChangedPages.php
    •   Done terbium:/usr/local/apache/common/php-1.22wmf20/maintenance/purgeChangedPages.php
  2.   Done Get IP address of hooft to target HTCP packets
    • 91.198.174.113:4827
    • Ori said that he verified that UDP packets can reach it from terbium
  3.   Done Schedule deployment window with Greg
  4.   Done Announce intentions to ops-l
  5.   Done Get updated patch with rate limit approved
  6.   Done Get updated patch with rate limit pushed to terbium
  7.   Done Wait... for... deployment... window...
  8.   Done ssh terbium
  9.   Done Purge all the things!
while read wiki; do
  echo "== ${wiki} =="
  LOG="purge-${wiki}-$(date +%Y%m%dT%H%M).log"
  date >${LOG}
  mwscript purgeChangedPages.php $wiki \
    --starttime 20130922000000 \
    --endtime 20130926000000 \
    --htcp-dest 91.198.174.113:4827 \
    --sleep-per-batch 500 \
    --verbose |
  tee -a ${LOG}
done </usr/local/apache/common/all.dblist

Purge post-mortem

  • Started at 2013-10-16T20:01Z
  • Brandon provided Ops support
  • Brandon discovered that traffic was not arriving at esams as expected
  • Changed HTCP traffic target to 91.198.174.106 (nescio)
  • Still couldn't get packets into esams
  • Mark pointed out that terbium has no route to external addresses, at about the same time that Brandon figured that out himself
  • Mark suggested directing traffic to the eqiad UDP relay host dobson @ 208.80.152.173
  • There was some discussion about the rate of sending packets and the possibility of overflow
    • Decided to reduce batch size from 100 to 10 and adjust delay to 50ms
  • Restarted at 2013-10-16T21:08Z
  • Brandon became concerned by IOWAIT on cp3xxx
    • Stopped run in the middle of commonswiki
    • Brandon investigated historical IOWAIT and decided it was a daily trend
    • Decided that we should add an additional delay just in case
    • I changed the runner script to make it possible to skip the wikis that we had already purged
  • Restarted at 2013-10-16T21:27Z with 100ms delay and skipping wikis before commonswiki
  • Process continued to run with no further interruption until completion at 2013-10-17T02:43Z
  • Approximately 1.6M purges were sent

Final driver script used:

#!/usr/bin/env bash

set -o nounset
set -o errexit

LOGDIR="./run-$(date +%Y%m%dT%H%M)"
mkdir ${LOGDIR}

#hooft: RELAY=91.198.174.113
# dobson
RELAY=208.80.152.173

START_AT=commonswiki

while read wiki; do
  if [[ ${wiki} < ${START_AT} ]]; then
    echo "> SKIPPING ${wiki}"
    continue
  fi
  echo "== ${wiki} =="
  LOG="${LOGDIR}/${wiki}.log"
  date >${LOG}
  mwscript purgeChangedPages.php $wiki \
    --starttime 20130922000000 \
    --endtime 20130926000000 \
    --htcp-dest ${RELAY}:4827 \
    --batch-size 10 \
    --sleep-per-batch 100 \
    --verbose |
  tee -a ${LOG}
  date >>${LOG}
done </usr/local/apache/common/all.dblist

purgeChangedFiles

Faidon and Aaron volunteered me to create a similar script for Files. After discussion with Aaron I think the plan is to rename the existing purgeDeletedFiles maintenance script to purgeChangedFiles and add some new command line switches:

--type <created|modified|deleted|all>
  Types of file changes to send purges for
--htcp-dest <IP:port>
  HTCP announcement destination (IP:port)

The --type flag will control which log entries are examined:

created
  upload/upload, import/upload, import/interwiki
modified
  upload/overwrite, upload/revert, move/move, move/move_redir
deleted
  delete/delete, delete/revision, suppress/delete, suppress/revision

The move/* logs will need special handling to purge not only the original page but also the move destination.
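
If the plan holds, a run for the same incident window might look roughly like this (hypothetical: the script name and the --type/--htcp-dest switches come from the proposal above, and the --starttime/--endtime flags are assumed to match purgeChangedPages):

# Hypothetical invocation of the proposed purgeChangedFiles script for
# commonswiki, reusing the page-purge time window and the eqiad relay.
mwscript purgeChangedFiles.php commonswiki \
  --starttime 20130922000000 \
  --endtime 20130926000000 \
  --type modified \
  --htcp-dest 208.80.152.173:4827 \
  --verbose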

See Also