Wikimedia Technology/Annual Plans/FY2019/TEC6: Address Infrastructure Gaps/Goals

Program Goals and Status for FY18/19

  • Goal Owner: Mark Bergsma
  • Program Goals for FY18/19: At the conclusion of this program, Zend PHP7 will be the only PHP runtime supported or used in the Wikimedia Foundation production environment.
  • Annual Plan: TEC6 Address Infrastructure Gaps

Q1

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Dependencies on: Search Platform; Primary team: Infrastructure Foundations

Goal(s)

Adopt Logstash   Done

  • Review Logstash/Kibana's architecture and installation and identify next steps and gaps to be addressed.
  • Audit log producers across the infrastructure and plan their transition to centralized logging.
  • Investigate log shipping methods and standardize on them.

Status

  Note: July 2018

  In progress

  Note: August 14, 2018

  In progress

  Note: September 11, 2018

  In progress. A comprehensive design document for logging has been prepared and is currently in final review.


Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Dependencies on: Infrastructure Foundations; Primary teams: Data Persistence

Goal(s)

Monitor database backup generation for failure or incorrect generation   Done

  • Generate metrics and historical data about databases (objects, table and wiki sizes, growth over time, etc.)
  • Detect and alert on backup metrics anomalies
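
As a rough illustration of the second bullet, one simple way to flag a backup anomaly is to compare the newest backup's size against the median of recent backups and alert when it deviates by more than a tolerated fraction. This is a minimal sketch, not the software the team wrote; the function name `is_backup_anomalous`, the 20% tolerance and the sample sizes are all assumptions made for illustration.

```python
from statistics import median


def is_backup_anomalous(previous_sizes, latest_size, tolerance=0.2):
    """Return True if latest_size deviates from the median of
    previous_sizes by more than `tolerance` (a fraction, 0.2 = 20%).

    Hypothetical helper: the name and threshold are illustrative only.
    """
    if not previous_sizes:
        # No history yet, so there is nothing to compare against.
        return False
    baseline = median(previous_sizes)
    if baseline == 0:
        # With an empty baseline, any non-empty backup is a change to flag.
        return latest_size != 0
    return abs(latest_size - baseline) / baseline > tolerance


# Made-up sizes in bytes: the first check is close to the baseline,
# the second is a backup that shrank to roughly half its usual size.
history = [100_000, 102_000, 101_000, 103_000]
print(is_backup_anomalous(history, 101_500))
print(is_backup_anomalous(history, 50_000))
```

Using the median rather than the mean keeps a single earlier bad backup from shifting the baseline much; a real deployment would likely need per-database thresholds.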

Status

  Note: July 30, 2018

  In progress

  Note: August 14, 2018

  In progress

  Note: September 11, 2018

  In progress. Software to generate and track metrics for database backups has been written and will soon be used to set up alerts on backup anomalies.


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Dependencies on: Traffic; Primary teams: Infrastructure Foundations, Data center operations

Goal(s)

Migrate the hardware inventory from Racktables to Netbox   Done

  • Define Netbox existing and custom fields usage standards/best practices
  • Switch over from Racktables to Netbox
  • Stretch: Investigate Netbox reporting capabilities to automatically validate data
  • Stretch: Investigate Netbox potential future integrations, towards a single source of truth   To do

Status

  Note: July 30, 2018

  In progress

  Note: August 2018

  In progress

  Note: September 11, 2018

  In progress. A final proposal for custom fields usage standards/best practices is under discussion; work (including the switch to Netbox) will continue after the data center switchover from eqiad to codfw.

Q2

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Begin the implementation of Q1's Logging Infrastructure design

  • Procure and provision Logging pipeline hardware in multiple datacenters   In progress
  • Migrate >=90% of existing Logstash traffic to the logging pipeline   In progress
  • Onboard at least 10 new non-sensitive log producers to the logging pipeline   In progress
  • Investigate approaches to ingest sensitive log producers   To do
  • [stretch] Deprecate >= 50% of udp2log producers   To do

Expand modern metrics infrastructure coverage

Status

  Note: November 14, 2018

Updated goals to reflect current status.

  Note: December 12, 2018

The implementation of the logging infrastructure is going well; it is mostly still in progress and is expected to be done by the end of December. The stretch goals will be done in Q3.
Expanding the metrics infrastructure is going well; it is in progress and should be done by the end of the quarter.


Outcome 3 / Output 4

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

  • Design and prepare infrastructure for database binary backups   In progress
    • Research options for producing binary backups (lvm snapshots, cold backups, mariabackup)   Done
    • Implement a proof of concept of a snapshot cycle automation for a mediawiki section database   In progress
    • Procure hardware for binary backups   In progress

Status

  Note: November 14, 2018

Updated goals to reflect current status.

  Note: December 12, 2018

This goal is progressing much more slowly than expected, for a variety of reasons, and will be completed in Q3.


Outcome 3 / Output 4 (Performance)

Wikimedia projects and content are protected against major disasters that threaten availability.

Primary teams: SRE / Data Persistence, Performance

Goal(s)

  • Test Performance implications of MySQL TLS connectivity in production, once ready (carried over from 1718Q4)   To do
  • Start migrating watchlist last-view updates to hybrid stash/async-DB to avoid the huge rate of DB writes on page views   To do

Status

  Note: November 14, 2018

Updated goals to reflect current status.

  Note: December 12, 2018

TLS is still stalled on DBA technology selection/implementation because other work has higher priority.
Watchlist is also stalled because of emergent work and other higher-priority work; we hope to get it done in early Q3.


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Expand Spicerack library and SRE Cookbooks

  • Split and convert the existing wmf-auto-reimage-lib into Spicerack modules   In progress
  • Convert wmf-auto-reimage scripts to Cookbooks   In progress
  • Convert other wmf-* scripts to Cookbooks (e.g. decom, downtime, upgrade & reboot, upgrade Varnish)   To do
  • Generate documentation for Spicerack   To do

Expand Netbox usage

  • Upgrade Netbox to the latest version (>= 2.4)   Done
  • Track additional categories of infrastructure topology information (e.g. VLANs, IP space, network circuits, etc.)   Done
  • Explore Netbox/NAPALM integration to pull live data from network devices   Done
  • Develop and deploy at least three Netbox reports to assist with data correctness and consistency   In progress
  • [stretch] Add a Cumin backend for Netbox   To do

Status

  Note: November 14, 2018

The migration of logging to Logstash and metrics into Prometheus is in progress. Logstash hardware for the codfw data center is still being procured. Spicerack modules are being written and refactored to incorporate wmf-auto-reimage functionality. Netbox has been upgraded to a new version.

  Note: December 12, 2018

Convert wmf-auto-reimage scripts to Cookbooks is in progress and will mostly be finished in Q3 because of the holidays. The other two goals will start after the conversion is done.
Upgrade Netbox to the latest version is done, but the stretch goal will mostly be tackled in Q3.


Q3

Outcome 1 / Output 1

Technical staff are able to deploy their changes to Production with confidence that their improvements have been tested in a credible staging environment to work according to expectation.

Create a staging cluster comparable to production infrastructure

Primary teams: SRE / Service Operations, Release Engineering

Goal(s)

First steps towards Canary Deployments

  • Introduce progressive rollouts to the mediawiki train
  • Introduce deployment run state in scap to keep track of successful scap runs
  • Investigate the use of versioning in MediaWiki, allowing scap to keep track of deployed revisions

Status

  Note: April 8, 2019

  • This has been postponed to Q4


Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build an understanding of our needs around external monitoring services

  • Produce a short document with a cost/benefit analysis of our current external monitoring systems
  • Gather a set of requirements, desires, and likely technology choices for an external monitoring system, with a focus on achievability in a short timeframe (1-2 quarters)

Increase utilization of application logging pipeline

  • Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch
  • Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs (candidates: log4j, udp2log, syslog/syslog_tls etc.)
  • Retire udp2log: onboard its producers and consumers to the logging pipeline
  • [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Upgrade metrics monitoring infrastructure core components

  • Serve >= 50% of production Prometheus systems with Prometheus v2
  • Upgrade production prometheus-node-exporter to >= 0.16
  • [stretch] Investigate distributed and long term storage solutions for Prometheus
    • Formulate requirements around aggregation, retention, hardware, etc.
    • Evaluate M3 and Thanos

Status

  Note: April 8, 2019

  • Build an understanding of our needs around external monitoring services is partially done in Q3
  • Increase utilization of application logging pipeline is partially done: there is still work to be done on the 'Migrate at least 3 existing Logstash inputs' goal, and retiring udp2log and the stretch goal have been postponed to Q4


Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Design and prepare infrastructure for database binary backups

  • Design a backup policy for logical and binary backups for both short term and long term storage
  • Procure and set up final hardware for binary backups
  • Fully implement binary backups and their rotation policy for all MediaWiki metadata and misc databases

Status

  Note: April 8, 2019

  • The backup policy is done, but hardware procurement and implementation have been postponed to Q4


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE / Infrastructure Foundations

Goal(s)

Build automated workflows for server provisioning

  • Take additional steps towards a "single source of truth" system (Netbox)
    • Upgrade Netbox to v2.5 and use the new cable tracking feature
    • Expose production VMs to Netbox and keep them synchronized with Ganeti
    • Incorporate at least two more categories of data (server interfaces, server IPs, MAC addresses, network device IPs, management/OOB, etc.)
  • Redesign the server provisioning and decommissioning process to facilitate orchestration
    • Add Netbox module to Spicerack and integrate it in the reimage and decom cookbooks
    • Convert virtual machine creation script to a cookbook
    • Reduce the number of manual steps involved in the provisioning process by at least 4
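
The synchronization mentioned above (keeping Netbox's view of production VMs in step with Ganeti) comes down to a set comparison. The sketch below is only an illustration under stated assumptions: `plan_vm_sync` is a hypothetical helper, the VM names are made up, and it does not call the real Netbox or Ganeti APIs; it only computes which VM records would need to be created or removed.

```python
def plan_vm_sync(ganeti_vms, netbox_vms):
    """Compute the changes needed for Netbox to match Ganeti.

    Both arguments are iterables of VM names. Ganeti is treated as the
    authority on which VMs exist (an assumption made for this sketch).
    """
    ganeti = set(ganeti_vms)
    netbox = set(netbox_vms)
    return {
        "create": sorted(ganeti - netbox),  # in Ganeti, missing from Netbox
        "delete": sorted(netbox - ganeti),  # in Netbox, gone from Ganeti
    }


# Made-up inventories for illustration.
plan = plan_vm_sync(
    ganeti_vms=["vm1001", "vm1002", "vm1003"],
    netbox_vms=["vm1002", "vm1003", "vm1004"],
)
print(plan)
```

Computing a plan first and applying it in a separate step makes it possible to review or dry-run the changes before touching the inventory.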

Status

  Note: April 8, 2019

  • Both goals are in progress and will continue into Q4

Q4

Outcome 2 / Output 3

Technical staff have increased visibility into the operation of our services and infrastructure.

Modernize logging, alerting and metrics monitoring infrastructure

Primary teams: SRE / Infrastructure Foundations

Dependencies on:

Goal(s)

Logging

  • Deprecate all non-Kafka logstash inputs
  • [stretch] Implement sensitive log access control, onboard 3 sensitive log producers

Metrics

  • 100% of Prometheus traffic served by Prometheus v2
  • Migrate all metrics originated by PoPs from statsd to Prometheus
  • Investigate distributed and long term storage solutions for Prometheus

Status

  Note: May 8, 2019

  • Logging: deprecating non-Kafka Logstash inputs is in progress; the stretch goal is still to do
  • Metrics: 100% of Prometheus traffic served by Prometheus v2 is now done! :)
  • Migrating the metrics and investigating distributed storage solutions are in progress

  Note: June 13, 2019

  • Logging: in progress, but might be pushed into next quarter along with the stretch goal.
  • Metrics: serving 100% of Prometheus traffic from Prometheus v2 is done; migrating all metrics is currently blocked, but the blocker should be resolved by the end of the quarter; investigating long term storage is partially done and will be completed by the end of the quarter.


Outcome 3 / Output 4 (SRE / Data Persistence)

Wikimedia projects and content are protected against major disasters that threaten availability.

Strengthen backups with reliable and redundant backup infrastructure

Primary teams: SRE / Data Persistence

Goal(s)

Stretch: Set up and deploy backup hardware

  • Install and set up eqiad/codfw backup/recovery hosts
  • Install and set up dump slaves
  • Perform fine tuning of snapshot and dumps performance on final hardware
  • Decommission old backup hosts dbstore1001, dbstore2001 and dbstore2002

Status

  Note: May 8, 2019

  • Installation and setup of the backup hosts and dump slaves are done; the rest is still in progress. Fine tuning is ongoing, and decommissioning of the old hosts will take place later.

  Note: June 13, 2019

  • Install and set up eqiad/codfw backup/recovery hosts is done
  • Install and set up dump slaves is done
  • Perform fine tuning of snapshot is still in progress and will be done by the end of the quarter
  • Decommission old backup hosts is blocked on time: it has to wait until the other work is done by the end of the quarter


Outcome 4 / Output 6

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Automate common operational tasks around service deployment, maintenance and incident response and build automated workflows for data center infrastructure, network, and equipment lifecycle management

Primary teams: SRE (Infrastructure Foundations, Data Persistence, Service Operations)

Goal(s)

Database workflows automation

  • Complete and deploy the tool for pooling/depooling databases dynamically from MediaWiki (dbconfig)
  • Migrate MediaWiki to use etcd for the database configuration in production
  • Write Spicerack abstractions for common database operations (pool/depool)
  • [stretch] Write Spicerack cookbooks to automate 2 common DBA workflows

Status

  Note: May 8, 2019

  • All of this is in progress except for the stretch goal

  Note: June 13, 2019

  • Completing and deploying the tool should be finished by the end of this quarter; the rest of this goal will move into next quarter.


Outcome 4 / Output 7

Technical staff are able to implement and maintain services and infrastructure in an efficient manner with a minimal amount of manual tasks.

Provision a centralized, self-service identity and access management for privileged staff and volunteer accounts

Primary teams: SRE / Infrastructure Foundations

Dependencies on: Cloud Services, Security

Goal(s)

Developer account management

  • Audit production and WMCS infrastructure and document all authenticated services and their authentication & authorization capabilities
  • Engage with stakeholders and collect functional and non-functional requirements for identity and access management for web services
  • Evaluate free & open source Identity Management/SSO software solutions against our requirements and create a short list of 1-2
  • Build a migration plan from OpenStackManager and Striker towards a unified identity and access management system for developer accounts

Status

  Note: May 8, 2019

  • Audit production and WMCS infrastructure and document is in progress, and the others are awaiting its completion.

  Note: June 13, 2019

  • Audit production and WMCS infrastructure is done
  • Engage with stakeholders and collect functional and non-functional requirements is in progress and should be done by the end of the quarter
  • Evaluate free & open source Identity Management/SSO software solutions is partially done
  • Build a migration plan is still to do; the team met this week, so it should soon be in progress, and it will probably finish early next quarter.