Analytics/Server Admin Log/Archive/2021
19:13 milimetric: Additional context on the last delete message: on an-launcher1002, which had filled up
19:12 milimetric: Marcel and I are deleting files from /tmp older than 60 days
15:55 mforns: finished refinery deployment for anomaly detection queries
14:54 mforns: starting refinery deployment for anomaly detection queries
18:59 mforns: finished deployment of refinery, adding anomaly detection hql for airflow job
18:39 mforns: started to deploy refinery, adding anomaly detection hql for airflow job
12:32 btullis: Upgraded druid packages, with pool/depool on druid1004
11:20 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
11:18 btullis: updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files
11:01 btullis: btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord
11:01 btullis: upgrading druid on the test cluster with new packages to test log4j changes.
08:51 joal: Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart
07:20 elukey: elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics
19:02 milimetric: finished deploying the weekly train as per etherpad
18:04 joal: Rerun failed cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-12-13 after cluster reboot
17:51 btullis: rebooting aqs1015
17:25 btullis: rebooting aqs1013
17:19 btullis: rebooting aqs1012
16:00 btullis: rebooting aqs1011
15:53 btullis: rebooting aqs1010
15:00 btullis: btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
14:59 btullis: cassandra@cqlsh> ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; on aqs1010-a
14:25 btullis: btullis@aqs1011:$ sudo systemctl start cassandra-b.service
12:44 joal: Rerun failed cassandra-hourly-wf-local_group_default_T_pageviews_per_project_v2-2021-12-14-10
12:42 joal: Kill late spark cassandra loading job
10:06 elukey: kill process 2560 on stat1005 to allow puppet to clean up the related user (offboarded)
10:04 elukey: kill process 2831 on stat1008 to allow puppet to clean up the related user (offboarded)
11:08 btullis: roll restarting druid historical daemons on analytics cluster T297148
10:46 btullis: roll restarting druid brokers on analytics cluster
20:09 ottomata: deploy wikistats2 with doc updates
17:36 razzi: restart aqs-next to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cumin A:aqs-next 'systemctl restart aqs'`
17:36 razzi: restart aqs to pick up new mediawiki snapshot: `razzi@cumin1001:~$ sudo cookbook sre.aqs.roll-restart aqs`
07:33 elukey: move kafka-test to fixed uid/gid
20:05 ottomata: restarting pageview-druid-daily-coord (killing 0062888-210701181527401-oozie-oozi-C) - I can't seem to rerun a particular hour, so just starting again from that hour.
17:57 elukey: drop "EventLogging MySQL" datasource from Superset (not valid anymore)
17:26 joal: Kill paragon job to prevent more nodemanagers from OOMing
09:56 elukey: powercycle analytics1071, soft lockup stacktraces in the tty
17:30 mforns: Deployed refinery using scap, then deployed onto hdfs
12:31 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
07:10 elukey: drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 on stat1006 to free space on the root partition
11:56 btullis: roll-restarting the cassandra services on the aqs cluster. (Not the aqs_next cluster)
11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart presto-server.service
11:49 btullis: btullis@an-coord1001:~$ sudo systemctl restart oozie.service
12:18 btullis: failed back the hive services to an-coord1001 via CNAME change
11:36 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
10:44 btullis: deploying DNS change to switch hive to the standby server.
10:18 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
17:26 elukey: varnishkafka-webrequest on cp3050 is running with /etc/ssl/localcerts/wmf_trusted_root_CAs.pem
10:03 elukey: restart prometheus-druid-exporter on Druid Analytics to clear unnecessary metrics
07:32 elukey: restart prometheus-druid-exporter on Druid Public to see metrics difference
16:01 btullis: roll-restarting kafka-test brokers
12:12 btullis: roll-restarting the presto analytics workers
11:44 btullis: btullis@archiva1002:~$ sudo systemctl restart archiva.service
07:29 elukey: `apt-get clean` on an-tool1005 to free space in the root partition
07:28 elukey: `sudo pkill -U jmixter` on stat100[5,8] to allow puppet to run and remove the offboarded user
19:40 joal: Deploying refinery to HDFS
19:15 joal: Deploying refinery with scap
18:23 joal: Releasing refinery-source v0.1.21
11:32 btullis: btullis@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers public
10:20 btullis: roll-restarting hadoop masters
16:37 joal: Rerun failed mediawiki-wikitext-history-wf-2021-10
06:56 elukey: `systemctl start prometheus-mysqld-exporter@analytics_meta` on db1108
18:20 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed.service
10:19 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
16:52 razzi: restart presto server on an-coord1001 to apply change for T292087
16:30 razzi: set superset presto version to 0.246 in ui
16:30 razzi: set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
12:23 btullis: btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
07:23 elukey: `apt-get clean` on stat1006 to free some space (root partition full)
19:51 ottomata: an-coord1002: drop user 'admin'@'localhost'; start slave; to fix broken replication - T284150
19:44 razzi: create admin user on an-coord1001 for T284150
18:07 razzi: run `create user 'admin'@'localhost' identified by <password>; grant all privileges on *.* to admin;` to allow milimetric to access mysql on an-coord1002 for T284150
16:39 razzi: add "can sql json on superset" permission to Alpha role on superset.wikimedia.org
16:14 razzi: drop and restore superset_staging database to test permissions as they are in production
17:07 razzi: razzi@an-tool1010:~$ sudo systemctl stop superset
16:57 razzi: dump mysql in preparation for superset upgrade
02:23 milimetric: deployed refinery with regular train
23:04 btullis: deleted all remaining old cassandra snapshots on aqs100x servers.
22:58 btullis: deleted old snapshots from aqs1006 and aqs1009
17:45 razzi: set presto_analytics_hive extra parameter engine_params.connect_args.session_props.query_max_run_time to 55s on superset.wikimedia.org
10:39 elukey: roll restart of kafka-test to pick up new truststore (root PKI added)
19:13 ottomata: re-enable hdfs-cleaner for /wmf/gobblin
09:01 btullis: reverted hive services back to an-coord1001.
16:03 btullis: btullis@an-coord1001:~$ sudo systemctl restart hive-server2 hive-metastore
13:02 btullis: btullis@an-coord1002:~$ sudo systemctl restart hive-server2 hive-metastore
12:51 btullis: btullis@aqs1007:~$ sudo nodetool-a clearsnapshot
14:05 ottomata: rerun refine_eventlogging_analytics refine_eventlogging_legacy and refine_event with -ignore-done-flag=true --since=2021-10-21T01:00:00 --until=2021-10-21T04:00:00 for backfill of missing data after gobblin problems
13:39 btullis: btullis@an-launcher1002:~$ sudo systemctl restart gobblin-event_default
10:35 joal: Re-refine netflow data after fix to gobblin-pulled data
08:41 joal: Rerun webrequest-load jobs for hour 2021-10-21T02:00
18:11 razzi: Deployed refinery using scap, then deployed onto hdfs
16:36 razzi: deploy refinery change for https://phabricator.wikimedia.org/T287084
07:15 joal: rerun webrequest-load-wf-upload-2021-10-20-1 after node issue
06:27 elukey: reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage
07:14 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_top_files-2021-10-17
19:29 joal: Rerun cassandra-daily-wf-local_group_default_T_top_pageviews-2021-10-17
18:36 joal: Rerun cassandra-daily-wf-local_group_default_T_unique_devices-2021-10-17
16:22 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-10-17
16:16 joal: Rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_referer-2021-10-17
15:17 joal: Rerun failed instances from cassandra-hourly-coord-local_group_default_T_pageviews_per_project_v2
14:49 elukey: restart hadoop-yarn-nodemanager on an-worker1119 and an-worker1103 (Java OOM in the logs)
12:09 btullis: root@aqs1013:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
12:09 btullis: root@aqs1012:/srv/cassandra-b/tmp# systemctl restart cassandra-b.service
09:25 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1013.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1013-b/
09:17 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/aqs1012-b/
09:16 btullis: btullis@cumin1001:~$ sudo transfer.py aqs1012.eqiad.wmnet:/srv/cassandra-b/tmp/local_group_default_T_pageviews_per_article_flat an-presto1001.eqiad.wmnet:/srv/cassandra_migration/cassandra_migration/aqs1012-b/
08:33 btullis: btullis@aqs1007:~$ sudo nodetool-b clearsnapshot
19:49 mforns: re-ran cassandra-daily-coord-local_group_default_T_pageviews_per_article_flat for 2021-10-12 successfully
17:58 ottomata: deleting files on stat1008 in /tmp older than 10 days and larger than 20M: sudo find /tmp -mtime +10 -size +20M | xargs sudo rm -rfv
17:54 ottomata: removed /tmp/spark-* files belonging to aikochou on stat1008
15:43 btullis: btullis@aqs1008:~$ sudo nodetool-b clearsnapshot
13:17 btullis: btullis@analytics1069:~$ sudo shutdown -h now
13:15 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-hdfs-*
13:14 btullis: btullis@analytics1069:~$ sudo systemctl stop hadoop-yarn-nodemanager.service
07:26 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-10-11
07:37 joal: rerun refine_event for `event`.`mediawiki_content_translation_event` year=2021/month=10/day=10/hour=16
18:07 joal: Rerun webrequest-load-wf-text-2021-10-10-10 - failed due to network issue
14:30 elukey: upgrade stat1005 to ROCm 4.2.0
13:20 btullis: btullis@aqs1004:~$ sudo nodetool-a clearsnapshot
10:20 elukey: upgrade ROCm to 4.2 on stat1008
11:28 elukey: failover analytics-hive back to an-coord1001 after maintenance
16:56 elukey: restart java daemons on an-coord1001 (standby)
13:43 elukey: failover analytics-hive to an-coord1002 (to restart java daemons on 1001)
07:43 joal: Kill-restart mediawiki-history-reduced job after deploy (more resources)
07:32 joal: Deploy refinery to hdfs
07:10 joal: Deploy refinery for mediawiki-history-reduced hotfix
06:56 joal: Kill-restart pageview-monthly_dump-coord to apply fix for SLA
15:11 btullis: sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='editoractivation' --since='2021-09-29T22:00:00.000Z' --until='2021-09-30T23:00:00.000Z'
19:55 ottomata: not changing stats uid to 499; it already exists as another system user
19:54 ottomata: changing stats uid and gid on an-launcher1002 and stat1005 to 499
09:32 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_netflow --ignore_failure_flag=true --since=2021-09-28T11:00:00 --until 2021-09-28T12:00:00
09:16 elukey: restart hive-* units on an-coord1002 for openjdk upgrades (standby node)
13:14 btullis: Deployed refinery using scap, then deployed onto hdfs
12:34 btullis: deploying refinery
09:55 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100*.eqiad.wmnet' 'nodetool-a snapshot -t T291472 local_group_default_T_pageviews_per_article_flat' 'nodetool-b snapshot -t T291472 local_group_default_T_pageviews_per_article_flat'
09:36 elukey: restart java daemons on an-test-coord1001 to pick up new openjdk
11:18 btullis: btullis@stat1005:~$ sudo apt purge usrmerge
11:11 btullis: btullis@stat1005:~$ sudo apt install usrmerge
22:33 razzi: restart an-test-coord presto coordinator service to experiment with web-ui.authentication.type=fixed
15:06 btullis: btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a snapshot -t T291469' 'nodetool-b snapshot -t T291469'
14:47 btullis: btullis@aqs1007:~$ sudo nodetool-a repair --full local_group_default_T_mediarequest_per_file data
11:02 btullis: btullis@an-master1001:~$ sudo systemctl restart hadoop-mapreduce-historyserver
10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-namenode
10:47 btullis: btullis@an-master1002:~$ sudo systemctl restart hadoop-hdfs-zkfc
10:35 btullis: btullis@an-master1001:~$ sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
10:07 btullis: btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_eventlogging_legacy --ignore_failure_flag=true --table_include_regex='centralnoticeimpression' --since='2021-09-23T04:00:00.000Z' --until='2021-09-24T05:00:00.000Z'
17:23 razzi: razzi@an-test-coord1001:/etc/presto$ sudo systemctl restart presto-server
17:05 joal: Kill-restart oozie jobs after deploy (mediawiki-history-denormalize-coord, mediawiki-history-check_denormalize-coord, mediawiki-history-dumps-coord, mediawiki-history-reduced-coord)
11:54 joal: release refinery-source v0.1.18 to archiva with Jenkins
08:12 elukey: remove old /reportcard (password protected, old files from 2012) httpd settings for stats.wikimedia.org
06:48 joal: Rerun webrequest-load-wf-text-2021-9-18-0 for errors after last night's production issue
16:03 btullis: Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
15:15 btullis: btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)
10:18 btullis: btullis@an-web1001:~$ sudo find /srv/published-rsynced -user systemd-coredump -exec chown stats {} \;
09:47 milimetric: deployed refinery to sync sanitize allowlist, deleting event_sanitized data per decision in the task
08:21 elukey: disable mod_cgi/mod_cgid on an-web1001 (and remove cgi-perl related httpd configs/settings)
19:25 ottomata: pointing analytics-web cname at new an-web1001, this moves stats and analytics .wm.org from thorium to an-web1001 - T285355
18:30 joal: Create HDFS home folder for user 'analytics-research'
07:03 elukey: stop jupyter-kaywong-singleuser.service on stat1005 to allow puppet to clean up
16:26 joal: Deploying refinery
18:25 razzi: (I stopped replication earlier but forgot to !log)
18:24 razzi: razzi@dbstore1007:~$ for socket in /run/mysqld/*; do sudo mysql --socket=$socket -e "START SLAVE"; done - reenable replication for T290841
18:19 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s4.service for T290841
18:13 razzi: razzi@dbstore1007:~$ sudo systemctl restart mariadb@s3.service for T290841
18:05 razzi: sudo systemctl restart mariadb@s2.service
11:41 joal: Restarting cassandra hourly loading job after C2 snapshot taken and C3 tables truncated
11:37 joal: Re-Add test rows in cassandra3 cluster after tables got truncated
10:25 hnowlan: truncating data tables on aqs_next cluster
10:12 joal: Kill cassandra-hourly loading job for cluster-migration first step
11:43 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs (second)
09:57 joal: Deploy AQS on new AQS servers
09:45 joal: Kill-restart mediarequest-top cassandra loading jobs after deploy
09:12 joal: Rerun mediawiki-history-denormalize-wf-2021-08 after failure
09:07 joal: Deploying refinery to hotfix mediarequest cassandra3 loading jobs
16:44 mforns: finished one-off deployment of refinery to fix cassandra3 loading
15:57 joal: Kill cassandra loading jobs and restart them after deploy
15:55 mforns: starting one-off deployment of refinery to fix cassandra3 loading
13:15 joal: Restart cassandra jobs to load cassandra3 with spark
08:21 joal: Rerun webrequest-load-wf-upload-2021-9-1-0
23:04 mforns: finished deployment of refinery (regular weekly train v0.1.17); successful everywhere except an-test-coord1001.eqiad.wmnet, which failed
22:41 mforns: starting deployment of refinery (regular weekly train v0.1.17)
22:27 mforns: Deployed refinery-source using jenkins
10:30 hnowlan: sudo cookbook sre.aqs.roll-restart aqs-next
06:53 elukey: drop an-airflow1001's old airflow logs to fix root partition almost filled up
06:22 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type d -empty -delete
06:21 elukey: root@an-launcher1002:/var/lib/puppet/clientbucket# find -type f -delete -mtime +60
13:40 joal: Kill-restart pageview-monthly_dump job and 2 backfilling jobs
13:34 joal: Deploy refinery onto HDFS
13:09 joal: Deploying refinery using scap
10:30 btullis: btullis@an-launcher1002:~$ sudo systemctl start hdfs-balancer.service
08:46 btullis: btullis@druid1001:~$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
19:05 razzi: razzi@deploy1002:/srv/deployment/analytics/aqs/deploy$ scap deploy "Deploy aqs 9c062f2"
19:02 razzi: note that the aqs-deploy repo's commit message DOES NOT include the changes of aqs in its changes list (though it has the correct SHA in the first line)
18:26 razzi: Beginning aqs deploy process
17:55 razzi: razzi@labstore1007:~$ sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service
17:53 razzi: sudo systemctl start analytics-dumps-fetch-geoeditors_dumps.service on labstore1006
17:37 btullis: on an-coord1001: MariaDB [superset_production]> update clusters set broker_host='an-druid1001.eqiad.wmnet' where cluster_name='analytics-eqiad';
15:08 joal: Restart oozie jobs loading druid to use new druid-host
08:55 joal: Deploying refinery with scap
16:46 elukey: cleanup /srv/discovery on stat1007 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/712422
15:16 milimetric: reran the other three failed jobs successfully
14:52 milimetric: rerunning webrequest-druid-hourly-wf-2021-8-13-13 because of failure to connect to Hive metastore
14:46 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl disable druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
14:45 btullis: btullis@druid1002:/etc/zookeeper/conf$ sudo systemctl stop druid-broker druid-coordinator druid-historical druid-middlemanager druid-overlord
19:43 btullis: btullis@druid1003:~$ sudo systemctl stop druid-overlord && sudo systemctl disable druid-overlord
19:41 btullis: btullis@druid1003:~$ sudo systemctl stop druid-historical && sudo systemctl disable druid-historical
19:40 btullis: btullis@druid1003:~$ sudo systemctl stop druid-coordinator && sudo systemctl disable druid-coordinator
19:37 btullis: btullis@druid1003:~$ sudo systemctl stop druid-broker && sudo systemctl disable druid-broker
19:30 btullis: btullis@druid1003:~$ curl -X POST http://druid1003.eqiad.wmnet:8091/druid/worker/v1/disable
12:13 btullis: migration of zookeeper from druid1002 to an-druid1002 complete, with quorum and two synced followers. Re-enabling puppet on all druid nodes.
09:48 btullis: suspended the following oozie jobs in hue: webrequest-druid-hourly-coord, pageview-druid-hourly-coord, edit-hourly-druid-coord
09:45 btullis: btullis@an-launcher1002:~$ sudo systemctl disable eventlogging_to_druid_editattemptstep_hourly.timer eventlogging_to_druid_navigationtiming_hourly.timer eventlogging_to_druid_netflow_hourly.timer eventlogging_to_druid_prefupdate_hourly.timer
09:21 elukey: run "sudo find /var/log/airflow -type f -mtime +15 -delete" on an-airflow1001 to free space (root partition almost full)
17:27 razzi: resume the following schedules in hue: edit-hourly-druid-coord, pageview-druid-hourly-coord, webrequest-druid-hourly-coord
17:10 razzi: sudo cookbook sre.druid.roll-restart-workers analytics (errored out)
09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_prefupdate_hourly.service
09:04 btullis: btullis@an-launcher1002:~$ sudo systemctl restart eventlogging_to_druid_netflow_daily.service
10:45 btullis_: btullis@an-druid1003:/var/log/druid$ sudo chown -R druid:druid /srv/druid /var/log/druid
10:25 btullis_: btullis@an-druid1003:~$ sudo puppet agent -tv
09:12 btullis: btullis@an-coord1001:~$ sudo systemctl start hive-metastore.service hive-server2.service
09:12 btullis: btullis@an-coord1001:~$ sudo systemctl stop hive-server2.service hive-metastore.service
09:00 btullis: sudo systemctl start hive-metastore && sudo systemctl start hive-server2
09:00 btullis: btullis@an-coord1002:~$ sudo systemctl stop hive-server2 && sudo systemctl stop hive-metastore
19:23 ottomata: bump Refine to refinery version 0.1.16 to pick up normalized_host transform - now all event tables will have a new normalized_host field - T251320
19:02 ottomata: Deployed refinery using scap, then deployed onto hdfs
14:57 ottomata: rerunning webrequest refine for upload 08-03T01:00 - 0042643-210701181527401-oozie-oozi-W
18:49 razzi: sudo cookbook sre.druid.roll-restart-workers analytics
17:57 razzi: sudo cookbook sre.druid.roll-restart-workers public
22:22 razzi: razzi@cumin1001:~$ sudo cookbook sre.druid.roll-restart-workers test
18:12 razzi: sudo cookbook sre.aqs.roll-restart aqs
10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl start hive-metastore.service hive-server2.service
10:46 btullis: btullis@an-test-coord1001:/etc/hive/conf$ sudo systemctl stop hive-server2.service hive-metastore.service
20:54 razzi: reran the failed workflow of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-7-25
18:38 ottomata: deploy refinery to an-launcher1002 for bin/gobblin job lock change
20:30 joal: rerun webrequest timed-out instances
18:58 mforns: starting refinery deployment
18:40 razzi: razzi@an-launcher1002:~$ sudo puppet agent --enable
18:39 razzi: razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
18:37 razzi: razzi@an-master1002:~$ sudo -i puppet agent --enable
18:34 razzi: razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
18:32 razzi: razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service
18:31 razzi: razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service
18:22 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
18:21 razzi: re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732
17:27 razzi: razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
17:17 razzi: stop all hadoop processes on an-master1001
16:52 razzi: starting hadoop processes on an-master1001 since they didn't failover cleanly
16:31 razzi: sudo bash gid_script.bash on an-master1001
16:29 razzi: razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"
16:25 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver
16:25 razzi: sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again
16:25 razzi: sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again
16:23 razzi: sudo systemctl stop hadoop-hdfs-namenode on an-master1001
16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc
16:19 razzi: razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager
16:18 razzi: sudo systemctl stop hadoop-hdfs-namenode
16:10 razzi: razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
16:03 razzi: root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
15:57 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
15:52 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
15:37 razzi: kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
15:08 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
14:52 razzi: sudo systemctl stop 'gobblin-*.timer'
14:51 razzi: sudo systemctl stop analytics-reportupdater-logs-rsync.timer
14:47 razzi: Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372 )
14:46 razzi: razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
08:32 mforns: restarted webrequest bundle (messed up a coord when trying to rerun some failed hours)
08:54 elukey: run 'sudo find -type f -name '*.log*' -mtime +30 -delete' on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304
16:44 ottomata: deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232
13:39 joal: Kill refine_event application_1623774792907_154469 to let manual run finish
13:35 joal: Kill currently running refine job (application_1623774792907_154014)
11:20 joal: Kill stuck refine application
17:39 razzi: sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067
00:34 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper
00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator
00:33 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker
00:28 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager
00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord
00:24 razzi: razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical
19:29 joal: move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14
19:02 razzi: razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics
13:03 joal: remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job
18:37 joal: Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event
18:36 joal: Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events
18:17 ottomata: stopped puppet and refines and imports for event data on an-launcher1002 in prep for gobblin finalization for event_default job
12:31 joal: Rerun failed webrequest hour after having checked that loss was entirely false-positive
03:21 joal: Rerun webrequest descendent jobs for 2021-07-08T10:00 problem
17:22 joal: Deploy refinery to HDFS
16:57 joal: Kill-restart webrequest oozie job after gobblin time-format change
16:44 joal: Deploying refinery to an-launcher and hadoop-test
16:05 joal: Manually add /wmf/data/raw/webrequest/webrequest_text/year=2021/month=7/day=8/hour=9/_IMPORTED
17:03 joal: Deploy refinery to HDFS
16:52 joal: Deploy refinery to an-launcher1002
16:05 joal: Deploy refinery to test-cluster
13:30 joal: kill-restart webrequest using gobblin data
13:12 ottomata: deploying refinery to an-launcher1002 for webrequest gobblin migration
13:09 joal: Move data for webrequest camus-gobblin migration
13:03 ottomata: disabled camus-webrequest and gobblin-webrequest timer on an-launcher1002 in prep for migration
17:33 joal: Deploy refinery onto HDFS
16:41 joal: Deploy refinery for gobblin
16:03 joal: Kill webrequest_test oozie job
15:55 joal: Drop and recreate wmf_raw.webrequest table on analytics-test-hadoop
15:52 joal: Moved camus and gobblin data for webrequest on analytics-test-hadoop
15:48 ottomata: deploying refinery to test cluster for webrequest_test gobblin job
14:16 ottomata: restarted aqs for july mw history snapshot deploy
13:29 joal: Run first manual empty job for webrequest_test on analytics-test-hadoop
13:29 joal: Clean gobblin state_store and data before starting webrequest_test on analytics-test-hadoop
19:57 joal: rerun learning-features-actor-hourly-wf-2021-7-2-11
13:47 joal: Reset failed timer refinery-sqoop-mediawiki-private.service
12:21 joal: Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes
00:04 razzi: razzi@an-coord1002:~$ sudo mount -a
00:04 razzi: razzi@an-coord1002:~$ sudo umount /mnt/hdfs
00:03 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service
00:02 razzi: razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service
18:56 razzi: razzi@authdns1001:~$ sudo authdns-update
18:19 razzi: razzi@an-coord1001:~$ sudo mount -a
18:18 razzi: razzi@an-coord1001:~$ sudo umount /mnt/hdfs
18:17 razzi: razzi@an-coord1001:~$ sudo systemctl restart presto-server.service
18:16 razzi: razzi@an-coord1001:~$ sudo systemctl restart hive-metastore.service
18:16 razzi: sudo systemctl restart hive-server2.service
18:15 razzi: sudo systemctl restart oozie on an-coord1001 for https://phabricator.wikimedia.org/T283067
16:38 razzi: sudo authdns-update on ns0.wikimedia.org to apply https://gerrit.wikimedia.org/r/c/operations/dns/+/702689
18:19 razzi: unmount and remount /mnt/hdfs on an-test-client1001 for java security update
22:55 razzi: sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067
22:53 razzi: sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067
22:52 razzi: sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067
22:51 razzi: sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067
13:31 ottomata: deploying refinery for weekly train
17:00 elukey: apt-get reinstall llvm-gpu on stat100[5-8] - T285495
08:01 elukey: reboot an-worker1101 to unblock stuck GPU
07:57 elukey: execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as attempt to unblock the GPU
06:38 elukey: drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values duplicated in the new hiera config)
06:34 elukey: rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart
14:46 XioNoX: remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429
14:37 XioNoX: start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429
14:28 XioNoX: remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429
14:12 XioNoX: start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429
13:35 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
06:37 elukey: execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition
17:46 razzi: remove hdfs namenode backup on stat1004
17:45 razzi: enable puppet on an-launcher
17:45 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
16:55 razzi: sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
16:53 razzi: run uid script on an-master1002
16:33 elukey: restart hadoop-yarn-resourcemanager on an-master1001
16:16 razzi: sudo systemctl stop 'hadoop-*' on an-master1002
16:14 razzi: sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-*
16:11 razzi: downtime an-master1002
15:55 razzi: sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage
15:42 razzi: tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001
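The dated archive name in the tar command above comes from `date --iso-8601` (GNU date, `YYYY-MM-DD`); a sketch of just the name construction, with no HDFS data involved:

```shell
# Build the snapshot filename the same way the tar command does.
name="hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz"
echo "$name"
```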
15:38 razzi: backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001
15:33 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
15:27 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
15:25 razzi: kill running yarn applications via for loop
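A dry-run sketch of the kill loop mentioned above; the application IDs below are made up for illustration (on the cluster they would come from `yarn application -list`), and the commands are printed, not executed:

```shell
# Print the kill command for each (hypothetical) running application.
apps="application_1615988861843_0001 application_1615988861843_0002"
for app in $apps; do
  printf 'yarn application -kill %s\n' "$app"
done
```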
15:11 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
15:09 razzi: disable puppet on an-masters
15:08 razzi: run puppet on an-masters to update capacity-scheduler.xml
15:02 razzi: disable puppet on an-masters
15:01 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues
14:35 razzi: disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641
18:45 ottomata: remove packages from hadoop common nodes: sudo cumin 'R:Class = profile::analytics::cluster::packages::common' 'apt-get -y remove python3-pandas python3-pycountry python3-numpy python3-tz' - T275786
18:43 ottomata: remove packages from stat nodes: sudo cumin 'stat*' apt-get -y remove subversion mercurial tofrodos libwww-perl libcgi-pm-perl libjson-perl libtext-csv-xs-perl libproj-dev libboost-regex-dev libboost-system-dev libgoogle-glog-dev libboost-iostreams-dev libgdal-dev
07:18 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-6-11
21:17 razzi: sudo systemctl restart monitor_refine_eventlogging_analytics
18:17 razzi: sudo systemctl restart hadoop-mapreduce-historyserver
17:24 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1002
17:24 razzi: sudo systemctl restart hadoop-hdfs-zkfc on an-master1002
17:12 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
16:25 razzi: rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
14:07 ottomata: altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562
16:56 elukey: move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop)
13:42 ottomata: roll restart an-conf zookeepers - T283067
13:22 ottomata: roll restarting analytics presto-servers - T283067
06:08 elukey: restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it)
18:14 ottomata: rolling restart of kafka jumbo brokers - T283067
17:53 ottomata: rolling restart of kafka jumbo mirror makers - T283067
17:07 ottomata: remove packages from analytics cluster nodes: sudo apt-get -y remove r-cran-rmysql python3-matplotlib python3-sklearn python3-enchant python3-nltk gfortran liblapack-dev libopenblas-dev - T275786
16:50 ottomata: restarting mysqld analytics-meta replica on db1108 to apply config change - T272973
17:42 razzi: sudo cookbook sre.aqs.roll-restart aqs to deploy new mediawiki history snapshot
22:32 razzi: sudo manage_principals.py create jdl --email_address=jlinehan@wikimedia.org
22:32 razzi: sudo manage_principals.py create phuedx --email_address=phuedx@wikimedia.org
15:46 ottomata: add airflow_2.1.0-py3.7-1_amd64.deb to apt.wm.org
15:20 ottomata: created airflow_analytics database and user on an-coord1001 analytics-meta instance - T272973
18:09 ottomata: remove .deb packages from stat boxes: python3-mysqldb python3-boto python3-ua-parser python3-netaddr python3-pymysql python3-protobuf python3-unidecode python3-oauth2client python3-oauthlib python3-requests-oauthlib python3-ua-parser - T275786
06:56 joal: Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-29
19:14 ottomata: deploying refinery and refinery source 0.1.13
17:29 ottomata: killing and restarting oozie cassandra loader jobs coord_unique_devices_daily and coord_pageview_top_percountry_daily after revert of oozie job to load to cassandra 3
14:18 ottomata: deploying refinery...
14:17 ottomata: Deployed refinery-source using jenkins
18:16 razzi: sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002
18:14 razzi: sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service
18:01 razzi: manually edit /etc/hadoop/conf/capacity-scheduler.xml to make queues running and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
17:52 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002
17:28 razzi: sudo systemctl restart refine_eventlogging_legacy
17:28 razzi: sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again
17:08 razzi: re-enabled puppet on an-masters and an-launcher
17:04 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
17:03 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
16:43 razzi: sudo systemctl restart hadoop-hdfs-namenode on an-master1001
16:38 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
16:35 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
16:28 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
16:23 razzi: sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave
16:06 razzi: sudo systemctl restart hadoop-hdfs-namenode
15:52 razzi: checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
15:51 razzi: enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
15:36 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again
15:35 razzi: re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
15:32 razzi: disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet
14:39 razzi: stop puppet on an-launcher and stop hadoop-related timers
01:09 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
01:07 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
00:34 razzi: sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
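Read bottom-up, the entries above record a standard namenode checkpoint-and-failover cycle (safemode, saveNamespace, restart, failover, leave safemode). A dry-run sketch of the ordering, with the `kerberos-run-command` wrappers omitted and nothing executed against HDFS:

```shell
# Print the checkpoint sequence in execution order, numbered.
steps='hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
systemctl restart hadoop-hdfs-namenode
hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet
hdfs dfsadmin -safemode leave'
printf '%s\n' "$steps" | nl -ba
```

Safemode plus `-saveNamespace` guarantees a consistent fsimage before the namenode restart and failover.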
18:05 ottomata: resume failing cassandra 3 oozie loading jobs, they are also loading to cassandra 2: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
18:04 ottomata: suspend failing cassandra 3 oozie loading jobs: cassandra-daily-coord-local_group_default_T_top_percountry (0011318-210426062240701-oozie-oozi-C), cassandra-daily-coord-local_group_default_T_unique_devices (0011324-210426062240701-oozie-oozi-C)
15:19 ottomata: rm -rf /tmp/analytics/* on an-launcher1002 - T283126
06:05 elukey: kill christinedk's jupyter process on stat1007 (offboarded user) to allow puppet to run
16:31 razzi: restart turnilo for T279380
20:22 razzi: restart oozie virtualpageview hourly, virtualpageview druid daily, virtualpageview druid monthly
18:57 razzi: deployed refinery via scap, then deployed to hdfs
18:46 ottomata: removing extraneous python-kafka and python-confluent-kafka deb packages from analytics cluster - T275786
12:40 joal: Add monitoring data in cassandra-3
06:50 joal: run manual unique-devices cassandra job for one day with debug logging
02:20 ottomata: manually running drop_event with --verbose flag
11:09 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after host generating failures has been moved out of cluster
10:41 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing after drop/create of keyspace
10:28 joal: Restart cassandra-daily-wf-local_group_default_T_unique_devices-2021-5-4 for testing
09:45 joal: Rerun of cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-5-15
11:41 hnowlan: running truncate "local_group_default_T_pageviews_per_article_flat".data; on aqs1012
15:17 ottomata: dropped event.mediawiki_job_* tables and data directories with mforns - T273789 & T281605
13:56 ottomata: removing refine_mediawiki_job Refine jobs - T281605
21:00 mforns: finished repeated refinery deployment (matching source v0.1.11) - missed unmerged change
19:59 mforns: repeating refinery deployment (matching source v0.1.11) - missed unmerged change
19:53 mforns: finished refinery deployment (matching source v0.1.11)
18:41 mforns: starting refinery deployment (matching source v0.1.11)
17:26 mforns: deployed refinery-source v0.1.11
21:27 razzi: sudo manage_principals.py reset-password nahidunlimited --email_address=nsultan@wikimedia.org
13:29 elukey: roll restart of hadoop yarn nodemanagers to pick up TasksMax=26214
12:39 elukey: restart Yarn RMs to apply the dominant resource calculator setting - T281792
12:15 hnowlan: changed eventlogging CNAME to point to eventlog1003
09:19 hnowlan: starting decommission of eventlog1002
17:36 razzi: create principal for sihe: sudo manage_principals.py create sihe --email_address=silvan.heintze@wikimedia.de
12:22 joal: Reset monitor_refine_eventlogging_legacy after manual rerun of failed job
12:02 joal: rerun cassandra-daily-wf-local_group_default_T_top_percountry-2021-5-4
20:31 joal: Kill-restart 16 cassandra jobs
20:29 joal: Kill-restart referer-daily job
20:12 joal: Deploy refinery onto HDFS
19:46 joal: Deploying refinery using scap
19:34 joal: refinery v0.1.10 released to Archiva
14:23 ottomata: stopping all venv based jupyter singleuser servers - T262847
13:59 ottomata: dropped all obsolete (upper cased location) event_sanitized.*_T280813 tables created for T280813
10:43 joal: Add _SUCCESS flag to /wmf/data/raw/mediawiki_private/tables/cu_changes/month=2021-04 after having manually sqooped missing tables
09:57 joal: restart refinery-sqoop-mediawiki-private timer after patch
09:56 joal: Reset refinery-sqoop-mediawiki-private timer
09:38 joal: Drop already sqooped data to restart jobs
08:53 joal: Deploy refinery for sqoop hotfix
08:33 elukey: clean up libmariadb-java from hadoop workers and clients
07:46 joal: Kill prod sqoop job to restart after fix
07:04 elukey: hue restarted using the database 'hue' instead of 'hue_next'
06:56 elukey: stop hue to allow database rename (hue_next -> hue)
15:55 razzi: restart hadoop-yarn-nodemanager and hadoop-hdfs-datanode on an-worker1100 for hadoop to recognize new disk /dev/sdl
15:38 ottomata: enabling event_sanitized_main jobs - T273789
14:57 elukey: run mysql_upgrade on an-coord1001 to complete the buster upgrade - T278424
14:44 hnowlan: restored all eventlogging jobs to eventlog1003
14:21 hnowlan: bump eventlog1003 CPUs to 6
13:53 joal: Rerun failed pageview-hourly-wf-2021-4-29-11 and pageview-hourly-wf-2021-4-29-12
13:09 joal: Rerun failed pageview-hourly-wf-2021-4-29-11
12:35 hnowlan: restarting 2 processors on eventlog1002
12:02 hnowlan: stopping processors on eventlog1002 to migrate to eventlog1003
11:50 elukey: manual stop of one of the eventlog processors on eventlog1002 to see if 1003 takes it over
02:59 milimetric: deployed hotfix for referrer job
17:46 hnowlan: eventlog1003 joined to groups successfully
17:36 razzi: sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to workaround missing directory puppet error (to be puppetized later)
17:31 razzi: remove deployment cache on eventlogging1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/
17:26 razzi: manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error
16:53 hnowlan: stopping deployment-eventlog05 in deployment-prep
14:42 milimetric: deployed refinery with 0.1.9 jars and synced to hdfs
14:30 elukey: chown -R analytics-deploy:analytics-deploy /srv/deployment/analytics on an-coord1001
12:50 ottomata: applied data_purge jobs in analytics test cluster; old data will now be dropped there - T273789
08:33 elukey: run mysql_upgrade for analytics-meta on an-coord1002 (should be part of the upgrade process) - T278424
07:11 elukey: restart yarn resource managers to pick up yarn label settings
08:01 elukey: restart hadoop-mapreduce-historyserver on an-master1001 after changes to the yarn ui user
07:36 elukey: re-enable timers after setting the capacity scheduler
07:31 elukey: restart hadoop RM on an-master* to pick up capacity scheduler changes
06:44 elukey: stop timers on an-launcher1002 again as prep step for capacity scheduler changes
06:32 elukey: roll restart of hadoop-yarn-nodemanagers to pick up new log4j settings - T276906
06:25 elukey: re-enable timers
06:20 elukey: reboot an-coord1001 to pick up kernel security settings
05:57 elukey: stop timers on an-launcher1002 to allow a reboot of an-coord1001
08:03 joal: Rerun failed webrequest-druid-hourly-wf-2021-4-23-13
14:23 elukey: roll restart an-master100[1,2] daemons to pick up new log4j settings - T276906
10:30 elukey: restart hadoop daemons (NM, DN, JN) on an-worker1080 to further test the new log4j config - T276906
09:12 elukey: change default log4j hadoop config to include rolling gzip appender
21:30 ottomata: temporarily disabling sanitize_eventlogging_analytics_delayed jobs until T280813 is completed (probably tomorrow)
20:04 ottomata: renaming event_sanitized hive table directories to lower case and repairing table partition paths - T280813
09:28 elukey: roll restart druid-overlord on druid* after an-coord1001 maintenance
09:09 elukey: upgrade hue on an-tool1009 to 4.9.0-2
08:31 elukey: re-enable timers on an-launcher1002 and airflow on an-airflow1001 after maintenance on an-coord1001
07:08 elukey: reimage an-coord1001 after partition reshape (/var/lib/mysql folded into /srv)
06:51 elukey: stop airflow on an-airflow1001
06:49 elukey: stop all services on an-coord1001 as prep step for reimage
06:45 elukey: PURGE BINARY LOGS BEFORE '2021-04-14 00:00:00'; on an-coord1001 to free some space before the reimage
06:00 elukey: stop timers on an-launcher1002 as prep step for an-coord1001 reimage
15:51 elukey: move analytics-hive.eqiad.wmnet back to an-coord1001 (test on an-coord1002 successful)
15:38 ottomata: deployed refinery to hdfs
13:59 ottomata: deploying refinery and refinery source 0.1.6 for weekly train
13:37 ottomata: deployed aqs
13:16 elukey: failover analytics-hive to an-coord1002 to test the host (running on buster)
12:40 elukey: PURGE BINARY LOGS BEFORE '2021-04-12 00:00:00'; on an-coord1001 - T280367
16:45 ottomata: make RefineMonitor use analytics keytab - this should be a no-op
16:07 razzi: run kafka preferred-replica-election on jumbo cluster (kafka-jumbo1002)
06:50 elukey: move /var/lib/hadoop/name partition under /srv/hadoop/name on an-master1001 - T265126
05:45 elukey: cleanup Lex's jupyter notebooks on stat1007 to allow puppet to clean up
07:25 elukey: run "PURGE BINARY LOGS BEFORE '2021-04-11 00:00:00';" on an-coord1001 to free some space - T280367
15:14 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00'; on an-coord1001 to free space for /var/lib/mysql - T280367
15:13 elukey: execute PURGE BINARY LOGS BEFORE '2021-04-09 00:00:00';
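The `PURGE BINARY LOGS` statements above keep the replica binlog directory from filling the partition; a dry-run sketch that only builds and prints the statement with a computed cutoff (the 3-day retention window here is illustrative, nothing is sent to MySQL):

```shell
# Compose the purge statement with a date cutoff (printed, not executed).
cutoff=$(date -d '3 days ago' +%F)
echo "PURGE BINARY LOGS BEFORE '$cutoff 00:00:00';"
```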
07:54 elukey: drop all the cloudera packages from our repositories
21:13 razzi: rebalance kafka partitions for webrequest_text partition 23
14:56 elukey: deploy refinery via scap - weekly train
09:50 elukey: rollback hue on an-tool1009 to 4.8, it seems that 4.9 still has issues
06:32 elukey: move hue.wikimedia.org to an-tool1009 (from analytics-tool1001)
01:36 razzi: rebalance kafka partitions for webrequest_text partitions 21,22
14:05 elukey: run build/env/bin/hue migrate on an-tool1009 after the hue upgrade
13:10 elukey: rollback hue-next to 4.8 - issues not present in staging
13:00 elukey: upgrade Hue to 4.9 on an-tool1009 - hue-next.wikimedia.org
10:02 elukey: roll restart yarn nodemanagers on hadoop prod (attempt to see if they entered in a weird state, graceful restart)
09:54 elukey: kill long running mediawiki-job refine erroring out application_1615988861843_166906
09:46 elukey: kill application_1615988861843_163186 for the same reason
09:43 elukey: kill application_1615988861843_164387 to see if any improvement to socket consumption is made
09:14 elukey: run "sudo kill `pgrep -f sqoop`" on an-launcher1002 to clean up old test processes still running
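A safer dry-run variant of the `kill $(pgrep -f ...)` pattern above: list the matching PIDs first and print what would be killed instead of signalling anything (the `sqoop` pattern is kept from the log entry; in a clean environment it typically matches nothing):

```shell
# Show what the kill would target without sending any signal.
pids=$(pgrep -f sqoop || true)
if [ -n "$pids" ]; then
  echo "would kill: $pids"
else
  echo "no matching processes"
fi
```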
16:17 razzi: rebalance kafka partitions for webrequest_text partitions 19, 20
13:18 ottomata: Refine now uses refinery-job 0.1.4; RefineFailuresChecker has been removed and its function rolled into RefineMonitor -
10:23 hnowlan: deploying aqs with updated cassandra libraries to aqs1004 while depooled
06:17 elukey: kill application application_1615988861843_158645 to free space on analytics1070
06:10 elukey: kill application_1615988861843_158592 on analytics1061 to allow space to recover (the truncate process was stuck in D state)
06:05 elukey: truncate logs for application_1615988861843_158592 on analytics1061 - one partition full
14:21 ottomata: stop using http proxies for produce_canary_events_job - T274951
16:33 elukey: reboot an-worker1100 again to check if all the disks come up correctly
15:43 razzi: rebalance kafka partitions for webrequest_text partitions 17, 18
15:35 elukey: reboot an-worker1100 to see if it helps with the strange BBU behavior in T279475
14:07 elukey: drop /var/spool/rsyslog from stat1008 - corrupted files (root partition had filled up) caused a SEGV in rsyslog
11:14 hnowlan: created aqs user and loaded full schemas into analytics wmcs cassandra
08:35 elukey: apt-get clean on stat1008 to free some space
07:44 elukey: restart hadoop hdfs masters on an-master100[1,2] to apply the new log4j settings for the audit log
06:44 elukey: re-deployed refinery to hadoop-test after fixing permissions on an-test-coord1001
23:03 ottomata: installing anaconda-wmf-2020.02~wmf5 on remaining nodes - T279480
22:51 ottomata: installing anaconda-wmf-2020.02~wmf5 on stat boxes - T279480
22:47 mforns: finished refinery deployment up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
22:39 mforns: deployment of refinery via scap to hadoop-test failed with Permission denied: '/srv/deployment/analytics/refinery-cache/.config' (deployment to production went fine)
21:44 mforns: starting refinery deploy up to 1dbbd3dfa996d2e970eb1cbc0a63d53040d4e3a3
21:26 mforns: deployed refinery-source v0.1.4
21:25 razzi: sudo apt-get install --reinstall anaconda-wmf on stat1008
20:15 razzi: rebalance kafka partitions for webrequest_text partitions 15, 16
19:53 ottomata: upgrade anaconda-wmf everywhere to 2020.02~wmf4 with fixes for T279480
14:03 hnowlan: setting profile::aqs::git_deploy: true in aqs-test1001 hiera config
22:34 razzi: rebalance kafka partitions for webrequest_text partitions 13, 14
09:37 elukey: reimage an-coord1002 to Debian Buster
16:07 razzi: remove old hive logs on an-coord1001: sudo rm /var/log/hive/hive-*.log.2021-02-*
14:54 razzi: remove empty /var/log/sqoop on an-launcher1002 (logs go in /var/log/refinery); sudo rmdir /var/log/sqoop
14:51 razzi: rebalance kafka partitions for webrequest_text partitions 11, 12
16:28 razzi: rebalance kafka partitions for webrequest_text partitions 9,10
16:19 elukey: the whole Hadoop test cluster is now on Debian Buster
07:28 elukey: manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b
20:27 razzi: restore superset_production from backup superset_production_1617306805.sql
20:14 razzi: manually run bash /srv/deployment/analytics/superset/deploy/create_virtualenv.sh as analytics_deploy on an-tool1010, since somehow it didn't run with scap
20:01 razzi: sudo chown -R analytics_deploy:analytics_deploy /srv/deployment/analytics/superset/venv since it's owned by root and needs to be removed upon deployment
19:54 razzi: dump superset production to an-coord1001.eqiad.wmnet:/home/razzi/superset_production_1617306805.sql just in case
16:50 razzi: rebalance kafka partitions for webrequest_text partitions 7 and 8
14:18 hnowlan: starting copy of large tables from aqs1007 to aqs1011
20:25 joal: Kill-Restart data_quality_stats-hourly-bundle after deploy
20:19 joal: Deploying refinery onto HDFS
19:57 joal: Deploying refinery using scap
19:57 joal: Refinery-source released to archiva and new jars committed to refinery (v0.1.3)
17:07 razzi: rebalance kafka partitions for webrequest_text partitions 5 and 6
12:35 hnowlan: Depooling aqs1004 for another transfer of local_group_default_T_pageviews_per_article_flat
12:30 elukey: restart reportupdater-codemirror on an-launcher1002 for T275757
11:30 elukey: ERRATA: upgrade to 2.3.6-2
11:29 elukey: upgrade hive client packages to 2.3.6-1 on an-launcher1002 (already applied to all stat100x)
15:58 elukey: disable vmemory checks in Yarn nodemanagers on Hadoop
13:53 elukey: systemctl restart performance-asotranking on stat1007 for T276121
08:14 elukey: upgrade hive packages on stat100x to 2.3.6-2 - T276121
08:12 elukey: upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia
18:49 elukey: systemctl restart refinery-import-* failed jobs (/mnt/hdfs errors caused by my unmounting of the mountpoint)
18:43 elukey: kill fuse hdfs mount process on an-launcher1002, re-mounted /mnt/hdfs, too many processes in D state
15:46 razzi: rebalance kafka partitions for webrequest_text partitions 3 and 4
05:40 razzi: sudo chown analytics /var/log/refinery/sqoop-mediawiki.log.1 on an-launcher1002 and restart logrotate
18:12 elukey: drop /srv/.hardsync* to clean up hardlinks not needed
18:07 elukey: run rm -rfv .hardsync.*/archive/public-datasets/* on thorium:/srv to clean up files to drop (didn't work)
18:01 elukey: drop /srv/.hardsync*trash* on thorium - old hardlinks that should have been trashed
15:52 razzi: rebalance kafka partitions for webrequest_text partition 2
09:28 elukey: move the yarn scheduler in hadoop test to capacity
15:44 razzi: rebalance kafka partitions for webrequest_text partition 1
19:30 razzi: rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so back
19:29 razzi: temporarily rename /usr/lib/python2.7/dist-packages/cqlshlib/copyutil.so on aqs1004 to fix https://issues.apache.org/jira/browse/CASSANDRA-11574
19:02 ottomata: hdfs dfs -chgrp -R analytics-privatedata-users /wmf/camus - T275396
16:47 razzi: rebalance kafka partitions for webrequest_text partition 0
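The per-partition rebalances logged throughout this period feed Kafka's `kafka-reassign-partitions` tool a JSON reassignment spec; a hypothetical sketch of that spec's shape for partition 0 (the broker IDs 1001-1003 are made up, and nothing is submitted to a cluster here):

```shell
# Write an example reassignment spec and show it.
cat <<'EOF' >/tmp/reassign-webrequest_text-p0.json
{"version":1,"partitions":[{"topic":"webrequest_text","partition":0,"replicas":[1001,1002,1003]}]}
EOF
cat /tmp/reassign-webrequest_text-p0.json
```

Moving one partition at a time, as the log shows, bounds the replication traffic each rebalance adds.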
06:32 elukey: force a manual run of create_virtualenv.sh on an-tool1010 - superset down
20:45 razzi: release wikistats 2.9.0
20:15 ottomata: install anaconda-wmf 2020.02~wmf3 on analytics cluster clients and workers - T262847
18:10 ottomata: started oozie/cassandra/coord_pageview_top_percountry_daily
15:21 razzi: rebalance kafka partitions for webrequest_upload partitions 22 and 23
13:54 razzi: sudo cookbook sre.hosts.reboot-single an-conf1001.eqiad.wmnet
13:47 razzi: sudo cookbook sre.hosts.reboot-single an-conf1003.eqiad.wmnet
13:41 razzi: sudo cookbook sre.hosts.reboot-single an-conf1002.eqiad.wmnet
13:39 ottomata: deploying refinery for weekly train
13:28 ottomata: deploy aqs as part of train - T207171, T263697
01:28 razzi: rebalance kafka partitions for webrequest_upload partition 21
14:43 razzi: rebalance kafka partitions for webrequest_upload partition 20
03:17 razzi: rebalance kafka partitions for webrequest_upload partition 19
16:53 razzi: rebalance kafka partitions for webrequest_upload partition 18
08:25 elukey: stop/start hdfs-balancer on an-launcher1002 with bw 200MB
07:48 joal: Manually start mediawiki-history-drop-snapshot.service to check the run succeeds
07:47 joal: Drop hive wmf.mediawiki_wikitext_history snapshot partitions (2020-08, 2020-09, 2020-10, 2020-11)
20:49 joal: Manually clean some data (mediawiki-history-drop-snapshot.service seems not to be working)
20:46 joal: Force a run of mediawiki-history-drop-snapshot.service to clean up some data
17:20 elukey: kill duplicate mediawiki-wikitext-history coordinator failing and sending emails to alerts@
07:21 elukey: re-run monitor_refine_event_failure_flags
22:31 razzi: rebalance kafka partitions for webrequest_upload partition 17
20:20 razzi: disable maintenance mode for matomo1002
20:08 razzi: starting reboot of matomo1002 for kernel upgrade
18:52 razzi: systemctl restart hadoop-hdfs-datanode on analytics1059
18:50 razzi: systemctl restart hadoop-yarn-nodemanager on analytics1059
18:35 razzi: apt-get install parted on analytics1059
15:34 razzi: rebalance kafka partitions for webrequest_upload partition 17
10:52 elukey: drop /home/bsitzmann on all stat100x hosts - T273712
08:25 elukey: drop database dedcode cascade in hive - T276748
08:15 elukey: hdfs dfs -rmr /user/dedcode on an-launcher1002 (data in trash for a month) - T276748
23:15 razzi: rebalance kafka partitions for webrequest_upload partition 16
18:44 mforns: finished deployment of refinery (session length oozie job)
18:16 mforns: starting deployment of refinery (session length oozie job)
16:54 razzi: rebalance kafka partitions for webrequest_upload partition 15
07:05 elukey: all hadoop worker nodes on Buster
06:28 elukey: force the re-run of refine_eventlogging_legacy - failed due to worker reimage in progress
06:17 elukey: reimage an-worker1111 to buster
22:00 razzi: rebalance kafka partitions for webrequest_upload partition 14
20:42 elukey: reimaged an-worker1091 to buster
18:26 elukey: reimage an-worker1087 to buster
16:40 elukey: reimage analytics1077 to buster
15:36 razzi: rebalance kafka partitions for webrequest_upload partition 13
15:18 elukey: reimage analytics1072 (hadoop hdfs journal node) to buster
14:29 elukey: drain + reimage an-worker1090/89 to Buster
13:26 elukey: reimage an-worker1102 and an-worker1080 (hdfs journal node) to Buster
12:59 elukey: drain + reimage an-worker1103 to Buster
09:14 elukey: drain + reimage analytics1076 and an-worker1112 to Buster
07:01 elukey: drain + reimage an-worker109[4,5] to Buster
23:22 razzi: rebalance kafka partitions for webrequest_upload partition 12
18:49 razzi: rebalance kafka partitions for webrequest_upload partition 11
18:11 elukey: drain + reimage an-worker11[15,16] to Buster
17:12 elukey: drain + reimage an-worker11[13,14] to Buster
16:17 elukey: drain + reimage an-worker1109/1110 to Buster
14:54 elukey: drain + reimage an-worker110[7,8] to Buster
14:52 ottomata: altered topics (eqiad|codfw).mediawiki.client.session_tick to have 2 partitions - T276502
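The partition-count change above maps to Kafka's `kafka-topics` admin tool; a dry-run sketch of the command form for one of the two topics (printed, not executed; the zookeeper address is a placeholder, not the real an-conf quorum string):

```shell
# Print the alter command that bumps the topic to 2 partitions.
echo kafka-topics --alter --topic eqiad.mediawiki.client.session_tick \
  --partitions 2 --zookeeper zk-placeholder:2181
```

Note that partition counts can only be increased, never decreased, with `--alter`.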
13:51 elukey: drain + reimage an-worker110[4,5] to Buster
10:41 elukey: drain + reimage an-worker1104/1089 to Debian Buster
09:19 elukey: drain + reimage an-worker108[3,4] to Buster
08:20 elukey: drain + reimage an-worker108[1,2] to Buster
07:23 elukey: drain + reimage analytics107[4,5] to Buster
08:00 elukey: "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066
07:49 elukey: umount /var/lib/hadoop/data/e on analytics1059 and restart hadoop daemons to exclude failed disk - T276696
18:30 razzi: run again sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
18:18 razzi: sudo cookbook sre.dns.netbox -t T269211 "Move clouddb1021 to private vlan"
18:17 razzi: re-run interface_automation.ProvisionServerNetwork with private vlan
18:16 razzi: delete non-mgmt interface for clouddb1021
17:07 razzi: sudo -i wmf-auto-reimage-host -p T269211 clouddb1021.eqiad.wmnet --new
16:54 razzi: sudo cookbook sre.dns.netbox -t T269211 "Reimage and rename labsdb1012 to clouddb1021"
16:52 razzi: run script at https://netbox.wikimedia.org/extras/scripts/interface_automation.ProvisionServerNetwork/
16:47 razzi: edit https://netbox.wikimedia.org/dcim/devices/2078/ device name from labsdb1012 to clouddb1021
16:30 razzi: delete non-mgmt interfaces for labsdb1012 at https://netbox.wikimedia.org/dcim/devices/2078/interfaces/
16:28 razzi: rename https://netbox.wikimedia.org/ipam/ip-addresses/734/ DNS name from labsdb1012.mgmt.eqiad.wmnet to clouddb1021.mgmt.eqiad.wmnet
16:08 razzi: sudo cookbook sre.hosts.decommission labsdb1012.eqiad.wmnet -t T269211
15:52 razzi: stop mariadb on labsdb1012
15:39 razzi: rebalance kafka partitions for webrequest_upload partition 10
15:07 elukey: drain + reimage analytics1073 and an-worker1086 to Debian Buster
13:36 elukey: roll restart HDFS Namenodes for the Hadoop cluster to pick up new Xmx settings (https://gerrit.wikimedia.org/r/c/operations/puppet/+/668659 )
10:20 elukey: force run of refinery-druid-drop-public-snapshots to check Druid public's performances
10:06 elukey: failover HDFS Namenode from 1002 to 1001 (high GC pauses triggered the HDFS zkfc daemon on 1001 and the failover to 1002)
08:32 elukey: drain + reimage an-worker107[8,9] to Debian Buster (one Journal node included)
07:22 elukey: drain + reimage analytics107[0-1] to debian buster
07:13 elukey: add analytics1066 back with /dev/sdb removed
07:01 elukey: stop hadoop daemons on analytics1066 - disk errors on /dev/sdb after reimage
21:19 razzi: rebalance kafka partitions for webrequest_upload partition 9
16:27 elukey: drain + reimage analytics106[8,9] to Debian Buster (one is a journalnode)
15:12 elukey: drain + reimage analytics106[6,7] to Debian Buster
14:21 elukey: drain + reimage analytics1065 to Debian Buster
13:32 elukey: drain + reimage analytics10[63,64] to Debian Buster
12:48 elukey: drain + reimage analytics10[61,62] to Debian Buster
10:40 elukey: drain + reimage analytics1059/1060 to Debian Buster
09:32 elukey: reboot an-worker[1097-1101] (GPU workers) to pick up the new kernel (5.10)
09:02 elukey: kill/start mediawiki-geoeditors-monthly to apply backtick change (hive script)
08:48 elukey: deploy refinery to hdfs
08:34 elukey: deploy refinery to fix https://gerrit.wikimedia.org/r/c/analytics/refinery/+/668111
07:38 elukey: reboot an-worker1096 to pick up 5.10 kernel
17:10 elukey: update druid datasource on aqs (roll restart of aqs on aqs100*)
17:06 razzi: rebalance kafka partitions for webrequest_upload partition 8
14:20 elukey: reimage an-worker1099,1100,1101 (GPU worker nodes) to Debian Buster
10:16 elukey: add an-worker113[2,5-8] to the Analytics Hadoop cluster
23:15 mforns: finished deployment of refinery to hdfs
21:59 mforns: starting refinery deployment using scap
21:48 mforns: deployed refinery-source v0.1.2
17:26 razzi: rebalance kafka partitions for webrequest_upload partition 7
13:42 elukey: Add an-worker11[19,20-28,30,31] to Analytics Hadoop
10:21 elukey: roll restart druid historicals on druid public to pick up new cache settings (enable segment caching)
10:14 elukey: roll restart druid brokers on druid public to pick up new cache settings (no segment caching, only query caching)
08:01 elukey: manual start of performance-asotranking on stat1007 (requested by Gilles) - T276121
21:24 razzi: rebalance kafka partitions for webrequest_upload partition 6
18:14 razzi: restart timer that wasn't running on an-worker1101: sudo systemctl restart prometheus-debian-version-textfile.timer
17:40 elukey: reimage an-worker1098 (GPU worker node) to Buster
14:48 elukey: reimage an-worker1097 (gpu node) to debian buster
11:55 elukey: roll restart druid broker on druid-analytics (again) to enable query cache settings (missing config due to typo)
11:34 elukey: roll restart historical daemons (again) on druid-analytics to remove stale config and enable (finally) segment caching.
11:02 elukey: roll restart druid-broker and druid-historical daemons on druid-analytics to pick up new cache settings (disable segment caching on broker and enable it on historicals)
09:12 elukey: restart hadoop daemons on an-worker1112 to pick up the new disk
09:11 elukey: remount /dev/sdl on an-worker1112 (wasn't able to make it fail)
16:03 razzi: rebalance kafka partitions for webrequest_upload partition 4
12:33 elukey: reimaged an-worker1096 (GPU node) to Debian buster (preserving datanode dirs)
09:52 elukey: reimaged analytics1058 to debian buster (preserving datanode partitions)
07:50 elukey: attempt to reimage analytics1058 (part of the cluster, not a new worker node) to Buster
07:29 elukey: added journalnode partition to all hadoop workers not having it in the Analytics cluster
07:01 elukey: reboot an-worker1099 to clear out kernel soft lockup errors
06:59 elukey: restart datanode on an-worker1099 - soft lockup kernel errors
17:04 razzi: rebalance kafka partitions for webrequest_upload partition 3

13:36 elukey: drop /srv/backup/wikistats from thorium
13:35 elukey: drop /srv/backup/backup_wikistats_1 from thorium
11:14 elukey: add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes)
08:50 elukey: move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher)
19:16 ottomata: service hadoop-yarn-nodemanager start on an-worker1112
16:03 milimetric: deployed refinery
14:09 elukey: roll restart druid brokers on druid public to pick up caffeine cache settings
14:03 elukey: roll restart druid brokers on druid analytics to pick up caffeine cache settings
11:08 elukey: restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache
09:01 elukey: roll restart druid brokers on druid public - locked
07:47 elukey: change gid/uid for druid + roll restart of all druid nodes
21:20 ottomata: started nodemanager on an-worker1112
21:15 razzi: rebalance kafka partitions for webrequest_upload partition 2
19:31 elukey: roll out new uid/gid for mapred/druid/analytics/yarn/hdfs for all buster nodes (no op for stretch)
17:47 elukey: change uid/gid for yarn/mapred/analytics/hdfs/druid on stat100x, an-presto100x
15:57 elukey: an-launcher1002's timers restored
15:28 elukey: stop timers on an-launcher1002 to change gid/uid for yarn/hdfs/mapred/analytics/druid and to reboot for kernel updates
15:23 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-tool100[8,9]
15:22 elukey: deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-airflow1001, an-test* buster nodes
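The uid/gid scheme rollouts above boil down to moving the accounts to fixed numeric ids and then re-chowning any files that still carry the old ids. A sketch of that command sequence (the numeric ids shown are placeholders, not the real scheme):

```python
def uid_migration_commands(user: str, old_uid: int, new_uid: int) -> list[str]:
    """Commands to move a user to a fixed uid and fix up stale file ownership."""
    return [
        f"usermod -u {new_uid} {user}",
        # Files created under the old uid keep the stale numeric owner,
        # so re-chown anything on the local filesystem still referencing it:
        f"find / -xdev -uid {old_uid} -exec chown -h {user} {{}} +",
    ]

print(uid_migration_commands("druid", 115, 907))
```

Doing this consistently across all buster nodes is what lets NFS-less shared storage (HDFS, rsync targets) agree on ownership.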
15:05 klausman: an-master1001 ~ $ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/raw/webrequest/webrequest_text/hourly/2021/02/22/01/webrequest*
14:51 elukey: drop /srv/backup-1007 on stat1008 to free space
19:27 ottomata: restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384
14:38 ottomata: upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384
14:12 ottomata: upgrade spark2 on an-coord1001 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed), will remove and auto-re-add spark-2.4.4-assembly.zip in hdfs after running puppet here
14:07 ottomata: upgrade spark2 on stat1004 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed)
09:01 elukey: reboot stat1005/stat1008 for kernel upgrades
15:53 elukey: restart oozie again to test another setting for role/admins
15:43 ottomata: installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384
15:31 elukey: restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352
14:34 joal: rerun mobile_apps-uniques-daily-wf-2021-2-18
09:16 elukey: stop and decom the hadoop backup cluster
18:38 razzi: rebalance kafka partition for webrequest_upload partition 1
17:27 elukey: an-coord1002 back in service with raid1 configured
15:48 elukey: stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array
13:10 elukey: failover analytics-hive to an-coord1001 after maintenance (DNS change)
11:32 elukey: restart hive daemons on an-coord1001 to pick up new parquet settings
10:07 elukey: hive failover to an-coord1002 to apply new hive settings to an-coord1001
10:00 elukey: restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change
09:46 elukey: upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x
17:44 razzi: rebalance kafka partitions for webrequest_upload partition 0
16:14 razzi: rebalance kafka partitions for eqiad.mediawiki.api-request
07:04 elukey: reboot stat1004/stat1006/stat1007 for kernel upgrades
22:31 razzi: rebalance kafka partitions for codfw.mediawiki.api-request
17:44 razzi: rebalance kafka partitions for netflow
17:42 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
07:32 elukey: restart hadoop daemons on an-worker1099 after reconfiguring a new disk
06:58 elukey: restart hdfs/yarn daemons on an-worker1097 to exclude a failed disk
20:38 mforns: running hdfs fsck to troubleshoot corrupt blocks
17:28 elukey: restart hdfs namenodes on the main cluster to pick up new racking changes (worker nodes from the backup cluster)
09:38 joal: Restart and backfill mediacount and mediarequest, and backfill mediarequest-AQS and mediacount archive
09:38 joal: deploy refinery onto hdfs
09:14 joal: Deploy hotfix for mediarequest and mediacount
19:19 milimetric: deployed refinery with query syntax fix for the last broken cassandra job and an updated EL whitelist
18:34 razzi: rebalance kafka partitions for atskafka_test_webrequest_text
18:31 razzi: rebalance kafka partitions for __consumer_offsets
17:48 joal: Rerun wikidata-articleplaceholder_metrics-wf-2021-2-10
17:47 joal: Rerun wikidata-specialentitydata_metrics-wf-2021-2-10
17:43 joal: Rerun wikidata-json_entity-weekly-wf-2021-02-01
17:08 elukey: reboot presto workers for kernel upgrade
16:32 mforns: finished deployment of analytics-refinery
15:26 mforns: started deployment of analytics-refinery
15:16 elukey: roll restart druid broker on druid-public to pick up new settings
07:54 elukey: roll restart of druid brokers on druid-public - locked after scheduled datasource deletion
07:47 elukey: force a manual run of refinery-druid-drop-public-snapshots on an-launcher1002 (3 days before its scheduled start) - controlled execution to see how druid + 3x dataset replication reacts
14:26 joal: Restart oozie API job after spark sharelib fix (start: 2021-02-10T18:00)
14:20 joal: Rerun failed clickstream instance 2021-01 after sharelib fix
14:16 joal: Restart oozie after having fixed the spark-2.4.4 sharelib
14:12 joal: Fix oozie sharelib for spark-2.4.4 by copying oozie-sharelib-spark-4.3.0.jar onto the spark folder
02:19 milimetric: deployed again to fix old spelling error :) referererererer
00:05 milimetric: deployed refinery and synced to hdfs, restarting cassandra jobs gently
21:46 razzi: rebalance kafka partitions for eqiad.mediawiki.cirrussearch-request
21:10 razzi: rebalance kafka partitions for codfw.mediawiki.cirrussearch-request
19:11 elukey: drop /user/oozie/share + chmod o+rx -R /user/oozie/share + restart oozie
17:56 razzi: rebalance kafka partitions for eventlogging-client-side
01:07 milimetric: deployed refinery with some fixes after BigTop upgrade, will restart three coordinators right now
22:04 razzi: rebalance kafka partitions for eqiad.resource-purge
20:51 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T07:00 after data was imported to camus
20:50 razzi: rebalance kafka partitions for codfw.resource-purge
20:31 joal: Rerun webrequest-load-coord-[text|upload] for 2021-02-09T06:00 after data was imported to camus
16:30 elukey: restart datanode on an-worker1100
16:14 ottomata: restart datanode on analytics1059 with 16g heap
16:08 ottomata: restart datanode on an-worker1080 with 16g heap
15:58 ottomata: restart datanode on analytics1058
15:55 ottomata: restart datanode on an-worker1115
15:38 elukey: restart namenode on an-master1002
15:01 elukey: restart an-worker1104 with 16g heap size to allow bootstrap
15:01 elukey: restart an-worker1103 with 16g heap size to allow bootstrap
14:57 elukey: restart an-worker1102 with 16g heap size to allow bootstrap
14:54 elukey: restart an-worker1090 with 16g heap size to allow bootstrap
14:50 elukey: restart analytics1072 with 16g heap size to allow bootstrap
14:50 elukey: restart analytics1069 with 16g heap size to allow bootstrap
14:08 elukey: restart analytics1069's datanode with bigger heap size
13:39 elukey: restart hdfs-datanode on analytics10[65,69] - failed to bootstrap due to issues reading datanode dirs
13:38 elukey: restart hdfs-datanode on an-worker1080 (test canary - not showing up in block report)
10:04 elukey: stop mysql replication an-coord1001 -> an-coord1002, an-coord1001 -> db1108
08:29 elukey: leave hdfs safemode to let distcp do its job
08:25 elukey: set hdfs safemode on for the Analytics cluster
08:19 elukey: umount /mnt/hdfs from all nodes using it
08:16 joal: Kill flink yarn app
08:08 elukey: stop jupyterhub on stat100x
08:07 elukey: stop hive on an-coord100[1,2] - prep step for bigtop upgrade
08:05 elukey: stop oozie an-coord1001 - prep step for bigtop upgrade
08:03 elukey: stop presto-server on an-presto100x and an-coord1001 - prep step for bigtop upgrade
07:28 elukey: roll out new apt bigtop changes across all hadoop-related nodes
07:19 joal: Killing yarn users applications
07:12 elukey: stop airflow on an-airflow1001 (prep step for bigtop)
07:09 elukey: stop namenode on an-worker1124 (backup cluster), create two new partitions for backup and namenode, restart namenode
06:14 elukey: disable timers on labstore nodes (prep step for bigtop)
06:11 elukey: disable systemd timers on an-launcher1002 (prep step for bigtop)
22:29 elukey: the previous entry was related to the Hadoop backup cluster
22:29 elukey: hdfs master failover an-worker1118 -> an-worker1124, created dedicated partition for /var/lib/hadoop/name (root partition filled up), restarted namenode on 1118 (now recovering edit logs)
18:44 razzi: rebalance kafka partitions for eventlogging_VirtualPageView
15:12 ottomata: set kafka topic retention to 31 days for (eqiad|codfw.rdf-streaming-updater.mutation) in kafka main-eqiad and main-codfw - T269619
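Topic retention changes like the 31-day setting above are applied through Kafka's `retention.ms` topic config. A sketch of the millisecond conversion and the corresponding `kafka-configs.sh` invocation (the bootstrap address is a placeholder):

```python
def retention_ms(days: int) -> int:
    """Convert a retention period in days to the retention.ms value Kafka expects."""
    return days * 24 * 60 * 60 * 1000

def set_retention_cmd(topic: str, days: int) -> str:
    """Assemble a kafka-configs.sh alter command for a topic's retention."""
    return (
        "kafka-configs.sh --bootstrap-server localhost:9092 "
        f"--alter --entity-type topics --entity-name {topic} "
        f"--add-config retention.ms={retention_ms(days)}"
    )

print(set_retention_cmd("eqiad.rdf-streaming-updater.mutation", 31))
```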
20:31 razzi: rebalance kafka partitions for eventlogging_SearchSatisfaction
19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.client.session_tick
18:38 razzi: rebalance kafka partitions for codfw.mediawiki.client.session_tick
17:53 razzi: rebalance kafka partitions for codfw.resource_change
17:53 razzi: rebalance kafka partitions for eqiad.resource_change
11:31 elukey: restart turnilo to pick up changes to the config (two new attributes to webrequest_128)
19:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.wikibase-addUsagesForPage
19:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.wikibase-addUsagesForPage
19:22 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
17:04 elukey: restart presto coordinator on an-coord1001 to pick up logging settings (log to http-request.log)
17:02 elukey: roll restart presto on an-presto* to finally get http-request.log
11:28 elukey: move aqs druid snapshot config to 2021-01
09:01 elukey: restart superset and disable memcached caching
08:08 elukey: move an-worker1117 from Hadoop Analytics to Hadoop Backup
21:38 razzi: rebalance kafka partitions for eventlogging_MobileWikiAppLinkPreview
20:04 razzi: rebalance kafka partitions for eqiad.mediawiki.job.RecordLintJob
20:03 razzi: rebalance kafka partitions for codfw.mediawiki.job.RecordLintJob
18:28 razzi: rebalance kafka partitions for eqiad.mediawiki.job.refreshLinks
18:28 razzi: rebalance kafka partitions for codfw.mediawiki.job.refreshLinks
17:52 razzi: rebalance kafka partitions for eqiad.wdqs-internal.sparql-query
17:50 razzi: rebalance kafka partitions for codfw.wdqs-internal.sparql-query
14:48 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/mediawiki/history_reduced
14:45 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf/mediawiki
14:40 elukey: kill + restart webrequest-druid-{hourly,daily} to pick up new changes after refinery deployment
14:30 elukey: kill + relaunch webrequest_load to pick up new changes after refinery deployment
14:28 elukey: relaunch edit-hourly-druid-coord 02-2021 after chmods
14:25 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o+rx /wmf/data/wmf/edit
14:24 elukey: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/wmf
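The permission fixes above all share the same shape: an `hdfs dfs -chmod` run as the hdfs superuser through `kerberos-run-command`. A sketch that generates those command lines for a batch of paths, with a per-path recursive flag:

```python
def hdfs_chmod_cmds(mode: str, paths: dict[str, bool]) -> list[str]:
    """Build kerberos-wrapped hdfs chmod commands; value True means recursive."""
    base = "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod"
    return [
        f"{base}{' -R' if recursive else ''} {mode} {path}"
        for path, recursive in paths.items()
    ]

for cmd in hdfs_chmod_cmds("o+rx", {
        "/wmf/data/wmf/mediawiki": False,
        "/wmf/data/wmf/mediawiki/history_reduced": True}):
    print(cmd)
```

Opening the parent non-recursively and only the leaf recursively (as done above) keeps sibling datasets locked down.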
10:57 elukey: deploy refinery to hdfs
10:36 elukey: released Refinery Source 0.1.0
08:54 elukey: drop v0.1.x tags from Refinery source upstream repo
08:48 elukey: drop refinery source artifacts v0.1.2 from Archiva
20:39 razzi: rebalance kafka partitions for eqiad.mediawiki.job.htmlCacheUpdate
20:39 razzi: rebalance kafka partitions for codfw.mediawiki.job.htmlCacheUpdate
19:29 ottomata: manually altered event.codemirrorusage to fix incompatible type change: https://phabricator.wikimedia.org/T269986#6797385
19:28 elukey: change archiva-ci password in pwstore, archiva and jenkins
17:53 razzi: rebalance kafka partitions for eqiad.wdqs-external.sparql-query
17:17 razzi: rebalance kafka partitions for eventlogging_CentralNoticeImpression
16:39 razzi: rebalance kafka partitions for eventlogging_InukaPageView
08:42 elukey: decommission an-worker1117 from the Hadoop cluster, to move it under the Backup cluster
21:27 razzi: rebalance kafka partitions for eqiad.mediawiki.job.cdnPurge
21:27 razzi: rebalance kafka partitions for codfw.mediawiki.job.cdnPurge
20:51 razzi: rebalance kafka partitions for eventlogging_PaintTiming
19:01 razzi: rebalance kafka partitions for eventlogging_LayoutShift
18:58 razzi: rebalance kafka partitions for eqiad.mediawiki.job.recentChangesUpdate
18:58 razzi: rebalance kafka partitions for codfw.mediawiki.job.recentChangesUpdate
18:23 razzi: rebalance kafka partitions for codfw.mediawiki.recentchange
18:09 razzi: rebalance kafka partitions for eqiad.resource_change
20:23 razzi: rebalance kafka partitions for eventlogging_NavigationTiming
19:30 razzi: rebalance kafka partitions for eqiad.mediawiki.revision-score
19:29 razzi: rebalance kafka partitions for codfw.mediawiki.revision-score
19:14 razzi: rebalance kafka partitions for eventlogging_CpuBenchmark
19:11 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
19:10 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
14:33 elukey: roll back presto upgrade, workers seem unable to announce themselves to the query coordinator
14:08 elukey: upgrade presto to 0.246 (from 0.226) on an-presto1001 - worker node
14:02 elukey: upgrade presto to 0.246 (from 0.226) on an-coord1001 - query coordinator
07:44 joal: Copy /wmf/data/event_sanitized to backup cluster (T272846)
22:23 razzi: rebalance kafka partitions for eqiad.mediawiki.page-links-change
22:22 razzi: rebalance kafka partitions for codfw.mediawiki.page-links-change
22:01 razzi: rebalance kafka partitions for eventlogging_QuickSurveyInitiation
21:13 razzi: rebalance kafka partitions for topic eventlogging_EditAttemptStep
19:49 mforns: finished deployment of refinery (for v0.0.146)
18:57 mforns: starting deployment of refinery (for v0.0.146)
18:54 mforns: deployed refinery-source v0.0.146 using Jenkins
18:45 razzi: rebalance kafka partitions for topic eqiad.mediawiki.job.ORESFetchScoreJob
18:42 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.ORESFetchScoreJob
18:22 razzi: rebalance kafka partitions for topic codfw.mediawiki.job.wikibase-InjectRCRecords
17:26 razzi: rebalance kafka partitions for topic eqiad.mediawiki.revision-tags-change
17:26 razzi: rebalance kafka partitions for topic codfw.mediawiki.revision-tags-change
16:32 razzi: rebalance kafka partitions for topic eventlogging_CodeMirrorUsage
16:16 elukey: manual failover of hdfs namenode active/master from an-master1002 to an-master1001
13:02 joal: Copy /wmf/data/event to backup cluster (30Tb) - T272846
11:15 elukey: add client_port and debug fields to X-Analytics in webrequest varnishkafka streams
16:39 razzi: reboot kafka-test1006 for kernel upgrade
09:37 elukey: reboot dbstore1005 for kernel upgrades
09:35 joal: Copy /wmf/data/discovery to backup cluster (21Tb) - T272846
09:31 elukey: reboot dbstore1003 for kernel upgrades
09:15 elukey: reboot dbstore1004 for kernel upgrades
09:07 joal: Copy /wmf/refinery to backup cluster (1.1Tb) - T272846
09:01 joal: Copy /wmf/discovery to backup cluster (120Gb) - T272846
08:42 joal: Copy /wmf/camus to backup cluster (120Gb) - T272846
20:42 razzi: rebalance kafka partitions for eqiad.mediawiki.page-properties-change.json
20:41 razzi: rebalance kafka partitions for codfw.mediawiki.page-properties-change
18:58 razzi: rebalance kafka partitions for eventlogging_ExternalGuidance
18:53 razzi: rebalance kafka partitions for eqiad.mediawiki.job.ChangeDeletionNotification
17:13 joal: Copy /user to backup cluster (92Tb) - T272846
16:23 elukey: drain+restart cassandra on aqs1004 to pick up the new openjdk (canary)
16:21 elukey: restart yarn and hdfs daemon on analytics1058 (canary node for new openjdk)
12:25 joal: Copy /wmf/data/archive to backup cluster (32Tb) - T272846
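The backup-cluster copies tracked in T272846 are DistCp jobs between the two HDFS namespaces. A minimal sketch of assembling such a command (the destination namenode address and flag choice are illustrative assumptions):

```python
def distcp_cmd(src_path: str, dst_nn: str) -> str:
    """Build a hadoop distcp command copying an HDFS path to a backup namenode."""
    return (
        "hadoop distcp -update -p "  # -p preserves perms/owner/timestamps
        f"{src_path} hdfs://{dst_nn}{src_path}"
    )

print(distcp_cmd("/wmf/data/archive", "an-worker1124.eqiad.wmnet:8020"))
```

`-update` makes reruns incremental, which matters for multi-day copies in the tens of terabytes like the ones logged here.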
10:20 elukey: restart memcached on an-tool1010 to flush superset's cache
10:18 elukey: restart superset to remove druid datasources support - T263972
09:57 joal: Changing ownership of archive WMF files to analytics:analytics-privatedata-users after update of oozie jobs
17:38 mforns: finished refinery deploy to HDFS
17:28 mforns: restarted refine_event and refine_eventlogging_legacy in an-launcher1002
17:11 mforns: starting refinery deploy using scap
17:09 mforns: bumped up refinery-source jar version to 0.0.145 in puppet for Refine and DruidLoad jobs
16:44 mforns: Deployed refinery-source v0.0.145 using jenkins
09:48 joal: Raise druid-public default replication-factor from 2 to 3
18:54 razzi: rebooting nodes for druid public cluster via cookbook
16:49 ottomata: installed libsnappy-dev and python3-snappy on webperf1001
15:17 joal: Kill mediawiki-wikitext-history-wf-2020-12 as it was stuck and failed
11:19 elukey: block UA with 'python-requests.*' hitting AQS via Varnish
21:48 milimetric: refinery deployed, synced to hdfs, ready to restart 53 oozie jobs, will do so slowly over the next few hours
18:11 joal: Release refinery-source v0.0.144 to archiva with Jenkins
09:21 elukey: roll restart druid brokers on druid public - stuck after datasource drop
07:26 elukey: execute 'sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/mediawiki' on launcher to fix dir perms
15:11 elukey: restart timers 'analytics-*' on labstore100[6,7] to apply new permission settings
08:31 elukey: restart the failed hdfs rsync timers on labstore100[6,7] to kick off the remaining jobs
08:30 elukey: execute hdfs chmod o+x of /wmf/data/archive/projectview /wmf/data/archive/projectview/legacy /wmf/data/archive/pageview/legacy to unblock hdfs rsyncs
08:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/pageview" to unblock labstore hdfs rsyncs
08:21 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod o+rx /wmf/data/archive/geoeditors" to unblock labstore hdfs rsync
18:54 joal: Restart jobs for permissions-fix (clickstream, mediacounts-archive, geoeditors-public_monthly, geoeditors-yearly, mobile_app-uniques-[daily|monthly], pageview-daily_dump, pageview-hourly, projectview-geo, unique_devices-[per_domain|per_project_family]-[daily|monthly])
18:14 joal: Restart projectview-hourly job (permissions test)
18:03 joal: Deploy refinery onto HDFS
17:50 joal: deploy refinery with scap
10:01 elukey: restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka doesn't seem to recover well
08:46 elukey: force restart of check_webrequest_partitions.service on an-launcher1002
08:44 elukey: force restart of monitor_refine_eventlogging_legacy_failure_flags.service
08:18 elukey: raise default max executor heap size for Spark refine to 4G
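Raising the Refine executor heap as above maps to Spark's `spark.executor.memory` setting. A sketch of how that surfaces in a submit invocation (the jar name and argument shape are placeholders, not the actual Refine job definition):

```python
def spark_submit_args(job_jar: str, executor_mem: str) -> list[str]:
    """Assemble spark-submit arguments with an explicit executor heap size."""
    return [
        "spark2-submit",
        "--conf", f"spark.executor.memory={executor_mem}",
        job_jar,
    ]

print(" ".join(spark_submit_args("refinery-job.jar", "4g")))
```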
18:22 elukey: chown -R analytics:analytics-privatedata-users /tmp/analytics (tmp dir for data quality stats tables)
18:21 elukey: "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics:analytics-privatedata-users /wmf/data/wmf/data_quality_stats"
18:10 elukey: temporarily disable hdfs-cleaner.timer to prevent /tmp/DataFrameToDruid from being dropped
18:08 elukey: chown -R analytics:druid /tmp/DataFrameToDruid (was: analytics:hdfs) on hdfs to temporarily unblock Hive2Druid jobs
16:31 elukey: remove /etc/mysql/conf.d/research-client.cnf from stat100x nodes
15:40 elukey: deprecate the 'researchers' posix group for good
11:24 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event_sanitized" to fix some file permissions as well
10:36 elukey: execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chmod -R o-rwx /wmf/data/event" on an-master1001 to fix some file permissions (an-launcher executed timers during the past hours without the new umask) - T270629
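The cleanup above was needed because files written before the umask change kept world-readable bits. As a quick check of what a restrictive umask yields, a sketch of the permission arithmetic (027 is an assumed value for illustration):

```python
def apply_umask(create_mode: int, umask: int) -> int:
    """Permissions a new file actually gets: requested mode minus umask bits."""
    return create_mode & ~umask & 0o777

# A file created with the default 666 request under umask 027
# loses group-write and all 'other' bits:
print(oct(apply_umask(0o666, 0o027)))
```

Changing the umask only affects files created afterwards, which is why the recursive `chmod -R o-rwx` over existing data was still required.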
09:37 elukey: forced re-run of monitor_refine_event_failure_flags.service on an-launcher1002 to clear alerts
08:26 joal: Rerunning 4 failed refine jobs (mediawiki_cirrussearch_request, day=6/hour=20|21, day=7/hour=0|2)
08:14 elukey: re-enable puppet on an-launcher1002 to apply new refine memory settings
07:59 elukey: re-enabling all oozie jobs previously suspended
07:54 elukey: restart oozie on an-coord1001
20:42 ottomata: starting remaining refine systemd timers
20:19 ottomata: restarted eventlogging_to_druid timers
20:19 ottomata: restarted drop systemd timers
20:18 ottomata: restarted reportupdater timers
20:14 ottomata: re-starting camus systemd timers
16:45 razzi: restart yarn nodemanagers
16:08 razzi: manually failover hdfs haadmin from an-master1002 to an-master1001
15:53 ottomata: stopping analytics systemd timers on an-launcher1002
21:32 ottomata: bumped mediawiki history snapshot version in AQS
20:45 ottomata: Refine changes: event tables now have is_wmf_domain, canary events are removed, and corrupt records will result in a better monitoring email
20:43 razzi: deploy aqs as part of train
19:17 razzi: deploying refinery for weekly train
09:29 joal: Manually reload unique-devices monthly in cassandra to fix T271170
22:22 razzi: reboot an-test-coord1001 to upgrade kernel
14:24 elukey: deprecate the analytics-users group
14:11 milimetric: reset-failed refinery-sqoop-whole-mediawiki.service
14:10 milimetric: manual sqoop finished, logs on an-launcher1002 at /var/log/refinery/sqoop-mediawiki.log and /var/log/refinery/sqoop-mediawiki-production.log
14:54 milimetric: deployed refinery hotfix for sqoop problem, after testing on three small wikis