Analytics/Archive/Editor Engagement Vital Signs/Backfilling
This page is archived! Find new documentation at https://wikitech.wikimedia.org/wiki/Analytics/Vital_Signs
Backfilling
Some benchmarks of how long it took to backfill data for the Rolling Active Editor EEVS metric. The point is to have a ballpark estimate for our backfilling and to make sure future code changes do not blow up these numbers.
Numbers will fluctuate depending on labs DB access but they should not diverge much from these.
We use as baseline our master branch on 2014-06-12 versus changes on this patchset: https://gerrit.wikimedia.org/r/#/c/150475/
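One way to use these numbers as a guard against regressions is sketched below. It is only an illustration: the baseline values are the master-branch timings reported on this page, while the function name `check_against_baseline` and the `tolerance` factor are hypothetical and not part of Wikimetrics.

```python
# Master-branch timings in minutes, as reported on this page.
# Keys are (wiki, months of data backfilled).
BASELINE_MINUTES = {
    ("rowiki", 3): 3,
    ("rowiki", 12): 6,
    ("frwiki", 3): 7,
}


def check_against_baseline(wiki, months, measured_minutes, tolerance=2.0):
    """Return True if a measured backfill time stays within
    `tolerance` times the master-branch baseline.

    `tolerance` is an assumed slack factor to absorb labs DB fluctuation.
    """
    baseline = BASELINE_MINUTES[(wiki, months)]
    return measured_minutes <= baseline * tolerance


# frwiki, 3 months: 12 minutes measured against a 7 minute baseline.
frwiki_ok = check_against_baseline("frwiki", 3, 12)
# The same measurement fails with a stricter, equally hypothetical factor.
frwiki_strict = check_against_baseline("frwiki", 3, 12, tolerance=1.5)
```

With the default factor of 2 the 12 minute frwiki run passes (12 <= 14), and with a factor of 1.5 it is flagged (12 > 10.5). The right slack depends on how much labs DB access times are expected to vary.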
Labs DB infrastructure (labsdb1002: dewiki, commons, etc.) was upgraded to MariaDB around the last week of July. All data is on an SSD now.
Config for celery was:
BROKER_URL : redis://localhost:6379/0
CELERY_RESULT_BACKEND : redis://localhost:6379/0
CELERY_TASK_RESULT_EXPIRES : 2592000
CELERY_DISABLE_RATE_LIMITS : True
CELERY_STORE_ERRORS_EVEN_IF_IGNORED : True
CELERYD_CONCURRENCY : 10
CELERYD_TASK_TIME_LIMIT : 3630
CELERYD_TASK_SOFT_TIME_LIMIT : 3600
DEBUG : False
LOG_LEVEL : INFO
MAX_PARALLEL_PER_RUN : 10
MAX_INSTANCES_PER_RECURRENT_REPORT : 365
CELERY_BEAT_DATAFILE : /var/run/wikimetrics/celerybeat_scheduled_tasks
CELERY_BEAT_PIDFILE : /var/run/wikimetrics/celerybeat.pid
CELERYBEAT_SCHEDULE :
    'update-daily-recurring-reports':
        'task' : 'wikimetrics.schedules.daily.recurring_reports'
        # The schedule can be set to 'daily' for a crontab-like daily recurrence
        'schedule' : debug
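The second-valued settings in this config are easier to read once converted. The sketch below copies three values from the config and derives the result expiry in days, the soft limit in minutes, and the gap between the soft and hard task limits. The helper name `summarize_limits` is hypothetical. In Celery the soft limit raises an exception inside the task, and the hard limit terminates the worker process running it, so the soft limit is expected to be the smaller of the two.

```python
# Values copied from the celery config above.
config = {
    "CELERY_TASK_RESULT_EXPIRES": 2592000,   # seconds
    "CELERYD_TASK_TIME_LIMIT": 3630,         # hard limit, seconds
    "CELERYD_TASK_SOFT_TIME_LIMIT": 3600,    # soft limit, seconds
}


def summarize_limits(cfg):
    """Convert the second-valued settings into friendlier units and
    check that the soft limit fires before the hard limit."""
    soft = cfg["CELERYD_TASK_SOFT_TIME_LIMIT"]
    hard = cfg["CELERYD_TASK_TIME_LIMIT"]
    if soft >= hard:
        raise ValueError("soft time limit should be below the hard time limit")
    return {
        "result_expiry_days": cfg["CELERY_TASK_RESULT_EXPIRES"] / 86400,
        "soft_limit_minutes": soft / 60,
        "grace_seconds": hard - soft,
    }


summary = summarize_limits(config)
```

Under these values, task results are kept for 30 days, a task gets a soft limit of 60 minutes, and it has 30 seconds to clean up before the hard limit applies.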
Results with patchset https://gerrit.wikimedia.org/r/#/c/150475/
rowiki
- Backfilling of 3 months of data takes about 3 minutes
- Backfilling of 1 year of data takes about 10 minutes
eswiki
- Backfilling of 3 months of data took 8 minutes
- Backfilling of 5 months of data took 10 minutes
- Backfilling of 1 year of data took 30 minutes
frwiki
- Backfilling of 3 months of data took 12 minutes
Results with master branch (72ac421affa0c90183d9dde743cc79a91525fe12)
rowiki
- Backfilling of 3 months of data takes about 3 minutes
- Backfilling of 1 year of data takes about 6 minutes
frwiki
- Backfilling 3 months: 7 minutes
Results with patchset https://gerrit.wikimedia.org/r/#/c/158630/
rowiki RollingNewActiveEditor
- Backfilling 3 months: 42 seconds
rowiki RollingSurvivingNewActiveEditor
- Backfilling 3 months: 44 seconds
frwiki RollingNewActiveEditor
- Backfilling 3 months: 3.5 minutes
frwiki RollingSurvivingNewActiveEditor
- Backfilling 3 months: 4.5 minutes
RollingRecurringOldActiveEditor, patch: https://gerrit.wikimedia.org/r/#/c/161521/
ruwiki
- Backfilling 1 day: 4 minutes
- Backfilling 1 week: 4 minutes
- Backfilling 1 month: 7 minutes
frwiki
- Backfilling 2 months: 12 minutes
Rolling Recurrent old active editor https://gerrit.wikimedia.org/r/#/c/161521/
The select as originally written did not complete (it ran forever):
SELECT anon_1.user_id AS anon_1_user_id,
       IF(SUM(anon_1.count_one) >= %s AND SUM(anon_1.count_two) >= %s, %s, %s) AS `IF_1`
FROM (
    SELECT anon_2.user_id AS user_id,
           anon_2.count_one AS count_one,
           anon_2.count_two AS count_two
    FROM (
        SELECT revision_userindex.rev_user AS user_id,
               SUM(IF(revision_userindex.rev_timestamp <= %s, %s, %s)) AS count_one,
               SUM(IF(revision_userindex.rev_timestamp > %s, %s, %s)) AS count_two
        FROM revision_userindex
        INNER JOIN user ON user.user_id = revision_userindex.rev_user
        INNER JOIN logging ON user.user_id = logging.log_user
        WHERE logging.log_type = %s
          AND logging.log_action = %s
          AND logging.log_timestamp < %s
          AND revision_userindex.rev_timestamp BETWEEN %s AND %s
        GROUP BY revision_userindex.rev_user
        UNION ALL
        SELECT archive.ar_user AS user_id,
               SUM(IF(archive.ar_timestamp <= %s, %s, %s)) AS count_one,
               SUM(IF(archive.ar_timestamp > %s, %s, %s)) AS count_two
        FROM archive
        INNER JOIN user ON user.user_id = archive.ar_user
        INNER JOIN logging ON user.user_id = logging.log_user
        WHERE logging.log_type = %s
          AND logging.log_action = %s
          AND logging.log_timestamp < %s
          AND archive.ar_timestamp BETWEEN %s AND %s
        GROUP BY archive.ar_user
    ) AS anon_2
) AS anon_1
GROUP BY anon_1.user_id
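To make the intent of the query easier to follow, here is a small in-memory sketch of what it appears to compute. It is not the Wikimetrics implementation. It assumes that the `IF` placeholders are 1 and 0, that both `>=` placeholders are the same edit threshold, and that the timestamp placeholder splits the window into two periods. The function name and its parameters are hypothetical.

```python
from collections import defaultdict


def rolling_recurring_old_active(edits, midpoint, threshold):
    """Flag users with at least `threshold` edits in each half of the window.

    `edits` is an iterable of (user_id, timestamp) pairs, standing in for
    the UNION ALL of revision and archive rows, already restricted to the
    window and to users matched by the logging join.
    Timestamps are MediaWiki-style strings (YYYYMMDDHHMMSS), which compare
    correctly as plain strings.
    """
    count_one = defaultdict(int)  # edits with timestamp <= midpoint
    count_two = defaultdict(int)  # edits with timestamp > midpoint
    for user_id, timestamp in edits:
        if timestamp <= midpoint:
            count_one[user_id] += 1
        else:
            count_two[user_id] += 1
    users = set(count_one) | set(count_two)
    return {
        user: count_one[user] >= threshold and count_two[user] >= threshold
        for user in users
    }


sample_edits = [
    (1, "20140801000000"),
    (1, "20140915000000"),
    (2, "20140801000000"),
]
flags = rolling_recurring_old_active(sample_edits, "20140901000000", 1)
```

In this sample, user 1 edits in both periods and is flagged, while user 2 only edits in the first period and is not.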
Pages created
The Pages Created metric was changed to default to all namespaces. Gerrit change: https://gerrit.wikimedia.org/r/#/c/167214/
enwiki
We were able to backfill a month for enwiki in 20 minutes.
ruwiki
We were able to backfill a month for ruwiki in 3 minutes.