Analytics/Archive/Editor Engagement Vital Signs/Backfilling

This page is archived! Find new documentation at https://wikitech.wikimedia.org/wiki/Analytics/Vital_Signs

Backfilling

edit

Some benchmarks of how log did it take to backfill data for Rolling Active Editor EEVS metric. The point is to have a ballpark estimate for our backfilling and make sure future code changes do not blow up these numbers.

Numbers will fluctuate depending on labs DB access but they should not diverge much from these.

We use as baseline our master branch on 2014-06-12 versus changes on this patchset: https://gerrit.wikimedia.org/r/#/c/150475/

Labs db infrstructure ( labsdb1002 dewiki, commons, etc) was upgraded to maria db about the last week of July. All data is on an SSD now.


Config for celery was:

BROKER_URL                          : redis://localhost:6379/0
CELERY_RESULT_BACKEND               : redis://localhost:6379/0
CELERY_TASK_RESULT_EXPIRES          : 2592000
CELERY_DISABLE_RATE_LIMITS          : True
CELERY_STORE_ERRORS_EVEN_IF_IGNORED : True
CELERYD_CONCURRENCY                 : 10
CELERYD_TASK_TIME_LIMIT             : 3630
CELERYD_TASK_SOFT_TIME_LIMIT        : 3600
DEBUG                               : False
LOG_LEVEL                           : INFO
MAX_PARALLEL_PER_RUN                : 10
MAX_INSTANCES_PER_RECURRENT_REPORT  : 365
CELERY_BEAT_DATAFILE                : /var/run/wikimetrics/celerybeat_scheduled_tasks
CELERY_BEAT_PIDFILE                 : /var/run/wikimetrics/celerybeat.pid
CELERYBEAT_SCHEDULE                 :
   'update-daily-recurring-reports':
       'task'      : 'wikimetrics.schedules.daily.recurring_reports'
       # The schedule can be set to 'daily' for a crontab-like daily recurrence
       'schedule'  : debug

rowiki

edit
  • Backfilling of 3 months of data takes about 3 minutes
  • Backfilling of 1 year of data takes about 10 minutes

eswiki

edit
  • Backfilling of 3 months of data took 8 minutes.
  • Backfilling of 5 months of data took 10 minutes
  • Backfilling of 1 year of data took 30 minutes

frwiki

edit
  • backfilling 3 months took 12 mins

Results with master branch (72ac421affa0c90183d9dde743cc79a91525fe12)

edit

rowiki

edit
  • Backfilling of 3 months of data takes about 3 minutes
  • Backfilling of 1 year of data takes about 6 minutes

frwiki

edit
  • Backfilling 3 months: 7 mins


rowiki RollingNewActiveEditor

edit
  • Backfilling 3 months: 42 seconds

rowiki RollingSurvivingNewActiveEditor

edit
  • Backfilling 3 months: 44 seconds

frwiki RollingNewActiveEditor

edit
  • Backfilling 3 months: 3.5 minutes

frwiki RollingSurvivingNewActiveEditor

edit
  • Backfilling 3 months: 4.5 minutes


RollingRecurringOldActiveEditor, patch: https://gerrit.wikimedia.org/r/#/c/161521/

edit

ruwiki

edit
  • Backfilling 1 day - 4 minutes
  • Backfilling 1 week - 4 minutes
  • Backfilling a month - 7minutes

frwiki

edit
  • Backfilling 2 months: 12 minutes


Rolling Recurrent old active editor https://gerrit.wikimedia.org/r/#/c/161521/

edit

Select as is did not run (as in it run forever)

SELECT anon_1.user_id AS anon_1_user_id, 
IF(SUM(anon_1.count_one) >= %s AND SUM(anon_1.count_two) >= %s, %s, %s) 
AS `IF_1` FROM (SELECT anon_2.user_id AS user_id, anon_2.count_one AS count_one, anon_2.count_two AS count_two 
FROM (SELECT revision_userindex.rev_user AS user_id, SUM(IF(revision_userindex.rev_timestamp <= %s, %s, %s)) AS count_one, 
SUM(IF(revision_userindex.rev_timestamp > %s, %s, %s)) AS count_two FROM revision_userindex 
INNER JOIN user ON user.user_id = revision_userindex.rev_user INNER JOIN logging ON user.user_id = logging.log_user 
WHERE logging.log_type = %s AND logging.log_action = %s AND logging.log_timestamp < %s 
AND revision_userindex.rev_timestamp BETWEEN %s AND %s GROUP BY revision_userindex.rev_user 

UNION ALL 

SELECT archive.ar_user AS user_id, SUM(IF(archive.ar_timestamp <= %s, %s, %s)) AS count_one, SUM(IF(archive.ar_timestamp > %s, %s, %s)) AS count_two 
FROM archive INNER JOIN user ON user.user_id = archive.ar_user INNER JOIN logging ON user.user_id = logging.log_user WHERE logging.log_type = %s 
AND logging.log_action = %s AND logging.log_timestamp < %s AND archive.ar_timestamp BETWEEN %s AND %s GROUP BY archive.ar_user) AS anon_2) 
AS anon_1 GROUP BY anon_1.user_id


Pages created

edit

Changes were done for pages created to default to all name spaces, gerrit change: https://gerrit.wikimedia.org/r/#/c/167214/

enwiki

edit

We were able to backfill a month for enwiki in 20 minutes

ruwiki

edit

We were able to backfill a month for ruwiki in 3 minutes