Wikimedia Release Engineering Team/Offsites/2018-05-Barcelona/Notes

These are the raw notes from our 2 days of offsite discussions.

Summary of action items

Data Data Data

  • Talk with Analytics - JR
  • Talk with CE/Bitergia - JR
  • Explore Bitergia - JR
  • Identify data sources we want to collect - RelEng (who know what systems)
  • Erik Bernhardson / Guillaume Lederrey

SWATs/Trains

  • Tyler to reassess scap swat in mw-config (from Mukunda)
  • Look into parsing scap messages for known patterns and pulling out the data
  • Look into enabling scap start/done
  • Look into recording if mwdebug was used during the deploy (eg: 'scap stage')
  • How/when will we get time for this?
  • Have Mukunda do a couple weeks of SWATs
    • Mukunda has a lot to say about this subject.... writeup incoming

Staging

  • Greg to talk with Deb about what to do next with talking to Victoria
  • Greg to figure out how we can better market what we are accomplishing (eg "monthly showcase")
  • Get a k8s cluster from SRE for CI to deploy to.

Data Data Data

Lead: Jean-René

  • Data for code-stewardship reviews (historic data)
    • Commits & patch sets
    • Jenkins & CI: test results are discarded after 15 or 30 days
  • Where can we put new kinds of data/metrics? Is there a shared environment to store them?
  • JR: for example, talking to exploratory testers; we have no idea about the results of their work. It is hard to get new QA testers on board. The role is broad, but one sure thing is that they will either produce or consume testing data.
  • We have lots of data/dashboards, but we do not have statistics over the long term
  • Antoine: Raita was the dashboard (but it has been decommissioned)
    • Historic dashboard for metrics and data
    • Dan: targeted towards browser-tests
  • Hypothetical Entity Relationship (ER) diagram (see the data-model sketch after this list)
    • Patchsets relate to deployments
    • Deployments relate to outages
    • Relationships in a tree format
  • Relationships between gerrit change and phabricator tasks
  • Developer/maintainers page. For an extension/skin, JR would like to see:
    • Activity (commits and changes)
    • Outstanding tasks
    • How it follows the latest MediaWiki standards (e.g. extension.json, versions of linters, test coverage, etc.)
    • Tests that are running:
      • How frequent are errors
      • How many tests are failing
      • Average resolution time for a failed test (e.g. E2E or unit tests failing on an unrelated change because core changed months ago and the extension is barely active)
      • The pace of changes being merged
    • Extension status: alpha, maintenance, Wikimedia-deployed, obsolete. That is mostly on mediawiki.org (partly in CI config as "archived")
  • Overview of stewardship
  • github pulse ( https://github.com/wikimedia/mediawiki/pulse ) -- do we want that?
    • Human process oriented vs repository oriented (merges vs task closing)
    • time to resolution (TTR) for tasks (filed to resolved/declined/whatever)
      • but this is only meaningful for "bugs", not other planning-type tasks
  • What are the systems we have, how do we normalize the data from those systems, and where do we put it?
    • A consistent interface for retrieving data
    • We need to keep all the data that we can and get it out of Jenkins (for example, we could send test results to Elasticsearch; currently this data is locked up in Jenkins); see the sketch after the Next Steps list below
  • We have an agreement that we'd like to collect all the test data...somewhere somehow
  • Stewardship creates these open questions, useful for annual planning as well
  • Going through, system-by-system, and finding out what data we want to store
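A rough illustration of the hypothetical ER model discussed above, written as Python dataclasses. All entity and field names here are assumptions made for discussion, not an existing schema.

    # Sketch of the hypothetical ER model: patchsets relate to deployments,
    # deployments relate to outages, changes relate to tasks. Names are assumptions.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List, Optional

    @dataclass
    class Task:
        task_id: str                    # Phabricator task, e.g. "T456"
        created: datetime
        closed: Optional[datetime] = None

    @dataclass
    class Patchset:
        change_id: str                  # Gerrit change
        repo: str
        tasks: List[Task] = field(default_factory=list)   # from "Bug: Txxx" footers

    @dataclass
    class Deployment:
        window: str                     # e.g. "Morning SWAT 2018-05-15"
        deployer: str
        patchsets: List[Patchset] = field(default_factory=list)

    @dataclass
    class Outage:
        incident_id: str
        started: datetime
        suspected_deployments: List[Deployment] = field(default_factory=list)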

Open Questions

Next Steps:

  • Talk with Analytics - JR
  • Talk with CE/Bitergia - JR
  • Explore Bitergia - JR
  • Identify data sources we want to collect - RelEng (who know what systems)
  • Erik Bernhardson / Guillaume Lederrey
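As a strawman for getting test data out of Jenkins (mentioned above), a minimal sketch that pushes one test result into Elasticsearch over its plain HTTP API. The index name, host, and document fields are assumptions, not an agreed design.

    # Minimal sketch: index one CI test-result document in Elasticsearch.
    # Index name, host, and document shape are assumptions for discussion.
    import json
    import urllib.request

    def index_test_result(host, result):
        """POST one document to a hypothetical ci-test-results index."""
        req = urllib.request.Request(
            url=host + "/ci-test-results/_doc",
            data=json.dumps(result).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # Example of data we might extract from a Jenkins build before it is discarded.
    example = {
        "job": "quibble-vendor-mysql-php70-docker",
        "build": 1234,
        "repo": "mediawiki/extensions/Math",
        "suite": "phpunit",
        "test": "MathRendererTest::testSomething",
        "status": "passed",
        "duration_ms": 42,
        "timestamp": "2018-05-15T10:23:00Z",
    }
    # index_test_result("http://localhost:9200", example)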


SWATs/Trains

Lead: Tyler

  • Automating/improving logging of SWATs and Trains: https://phabricator.wikimedia.org/T193311
  • It would be nice to have concrete data about SWAT windows without having to dig in the SAL. Some nice-to-have info: number of syncs per SWAT window and time spent deploying patches for a given SWAT window.
  • Problem: We've wanted to change SWAT windows/deploys. People hated that we wanted to change things (namely: reduce # of patchsets deployed and how they are done). We need data to make informed decisions. eg: correlating syncs with swats and outages.
  • Definition: SWAT is three 1-hour windows per day for developers to propose hotfixes/config changes. Served by RelEng / deployment-group users.
  • Now we have syncs and we have windows, and their only relation is through the wiki pages
  • out of scope:
    • relating patches -> swat window
    • proposing patches in a window
    • Zeljko: we are just pushing buttons. We do not have much added value
  • NEEDs:
    • Given a time window, get the list of syncs / patchsets deployed (and ultimately a developer / point of contact)
    • we need the data
    • a place to display/query it
  • Minimal Viable Solution
    • Have scap ask "is this a SWAT? y/n" each time it's not a full scap or --force (a rough sketch follows this list)
  • This Deployment did this Change associated with this Task.
  • what about...
  • current documentation https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#Full_deployment
    • current command: scap sync-file path/to/file 'SWAT: Commit message (T456)'
    • if the comment is not in this format, scap asks you for the SWAT/Gerrit/Phabricator info
  • Do not allow deploys without first indicating which window you're starting
    • scap swat start or scap deploy start (or --force)
    • that informs scap how to act/log
  • mw-config.php
    • assume as soon as it's merged it's deployed
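A minimal sketch of that Minimal Viable Solution, assuming a prompt hook around the sync command; the function names, log location, and format below are hypothetical, not existing scap code.

    # Sketch of the MVP: ask whether a sync is part of a SWAT and record the answer.
    # Hypothetical helper, not existing scap code; log path and format are guesses.
    import json
    import time

    def prompt_swat_metadata(message, is_full_scap=False, force=False):
        """Ask the deployer to classify the sync unless it is a full scap or --force."""
        if is_full_scap or force:
            return {"swat": False, "message": message, "timestamp": time.time()}
        answer = input("Is this a SWAT deploy? [y/n] ").strip().lower()
        entry = {"swat": answer == "y", "message": message, "timestamp": time.time()}
        if entry["swat"]:
            entry["window"] = input("Which window (e.g. 'Morning SWAT')? ").strip()
        return entry

    def log_swat_metadata(entry, path="swat-deploys.log"):
        """Append the structured entry somewhere queryable (location is a guess)."""
        with open(path, "a") as f:
            f.write(json.dumps(entry) + "\n")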

TODO

  • Tyler to reassess scap swat in mw-config (from Mukunda)
  • Look into parsing scap messages for known patterns and pulling out the data (a parsing sketch follows this list)
  • Look into enabling scap start/done
  • Look into recording if mwdebug was used during the deploy (eg: 'scap stage')
  • How/when will we get time for this?
  • Have Mukunda do a couple weeks of SWATs
    • Mukunda has a lot to say about this subject.... writeup incoming
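For the message-parsing TODO above, a sketch of pulling the SWAT marker, task IDs, and synced path out of log lines that follow the documented format (scap sync-file path/to/file 'SWAT: message (T456)'); the regular expressions are guesses at the current conventions, not a tested parser.

    # Sketch: extract SWAT marker, task IDs, and synced path from SAL/scap lines.
    # The regular expressions are guesses at conventions, not a tested parser.
    import re
    from typing import Optional

    SAL_LINE = re.compile(
        r"Synchronized (?P<path>\S+): (?P<message>.*?)"
        r"(?: \(duration: (?P<duration>[\dms ]+)\))?$"
    )
    TASK_ID = re.compile(r"\bT\d+\b")

    def parse_sal_line(line: str) -> Optional[dict]:
        """Return structured data for one 'Synchronized ...' entry, or None."""
        m = SAL_LINE.search(line)
        if not m:
            return None
        message = m.group("message")
        return {
            "path": m.group("path"),
            "message": message,
            "is_swat": message.startswith("SWAT:"),
            "tasks": TASK_ID.findall(message),
            "duration": m.group("duration"),
        }

    # Example (the exact shape of SAL entries is an assumption):
    # parse_sal_line("Synchronized wmf-config/InitialiseSettings.php: "
    #                "SWAT: enable thing (T456) (duration: 00m 58s)")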

Staging

https://docs.google.com/document/d/1CT_pKjwiDmFhZZ9LW9mz0z434-wgr3NFdapUPWUvMNA/edit?ts=5aba5398#heading=h.ra4sbg2fs7zl
2018-2019 annual plan: https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019

Lead: Greg

  • The presentation
  • The project as defined by operations is incomplete
  • The response to Victoria
    • We are here due to the initial issue of a choice between doing the Pipeline project vs a Staging project. That either/or is now a both/and.
    • Operations wants an environment that can potentially prevent outages, depending on how they define it. It could potentially prevent outages of services that we neither control nor deploy.
    • We are making a survey to gather the current usage of the Beta Cluster that can help inform SRE's decisions/planning.
    • We have defined use cases
    • The other questions are best answered by SRE as they heavily depend on technical implementation decisions
    • Protocol changes as proposed are out of scope for this discussion and truthfully feel like reach-through micromanagement without any real data or reasoning.

What RelEng needs:

  • Just to continue to do our positive interaction with SRE in our weekly Pipeline meetings
  • A simple part of that is for SRE to provide a k8s cluster and/or namespace for CI to deploy to (as previously discussed and agreed upon)
  • Idea (Dan): rebrand the "deployment pipeline" project as "Continuous Delivery of MediaWiki Stack"

NEXT:

  • Greg to talk with Deb about what to do next with talking to Victoria
  • Greg to figure out how we can better market what we are accomplishing (eg "monthly showcase")
  • Get a k8s cluster from SRE for CI to deploy to.


Developer Productivity JD

Lead: Greg. Blog post: https://squiggle.city/~frencil/archives/20150625.html#anatomy_of_a_healthy_job_post

You will be leading the effort to improve overall developer productivity. We will want you to create a replacement for our homebuilt Vagrant-based local development environment using the latest technologies such as Kubernetes (minikube), Docker, and Helm. You will be working closely with several teams and volunteers in the community.

Responsibilities

  • Help engineer container based tooling for MediaWiki application development and deployment
  • Maintain integration of developer tooling into a continuous delivery pipeline
  • Proactively find and create productivity improvements
  • Working in a highly collaborative and open organization and community

Requirements

  • Proficiency with software, systems, or devops engineering
  • Collaboration skills are as important as, if not more important than, technical skills
  • Experience with continuous integration/deployment systems
  • Experience with virtualization or container technologies
  • Experience with server configuration management software

Nice to haves

  • Free Software experience
  • Experience working in a remote-first organization
  • Experience using a Kubernetes environment
  • MediaWiki and/or Wikimedia project experience
  • Golang experience


Moving to an "everyone deploys their own changes" model (for SWAT)

  • Why are SWATs scheduled?
  • Why are there only a limited number of people in-charge of doing them?

Z: Would like everyone who is already staff/contractor to be able to do their own deploys. Z: A lot of European SWAT users now self-deploy (e.g. Amir, David Causse).

  • Turn SWATs into "volunteer patch deployment" windows. If you are staff/contractor, you deploy your own thing when you need to do it.


Pipeline Demo

Lead: Dan/Tyler. Job: https://integration.wikimedia.org/ci/job/service-pipeline-test-only-debug (uses Jenkins Pipeline, defined in Groovy).

  • Presentation of Blubber and pipeline
  • What is minikube

Blubber and MediaWiki + extensions

  • We use docker-pkg with Quibble and Blubber in the pipeline. Is that a problem? No, not really.
    • Use of docker-pkg is appropriate in domains that require/allow full control of Dockerfile and image build (root)
    • Base images are controlled by SRE (operations/docker-images/production-images)
    • CI images for use with Quibble are controlled by RelEng (integration/config)
  • Talked about whether we should use Quibble as the entrypoint in pipeline testing. Should we? No, probably not.
    • Different use case: Quibble depends on an environment that has a superset of MediaWiki + extension dependencies, while Blubber is meant to be repository-authoritative.
    • EVERYTHING IS GREAT, AGAIN.
  • What does a Blubberized MediaWiki look like? For the limited scope of the FY2017-18 Q4 goal ((MediaWiki + Math) + Mathoid)? For the far future?
  • Discussion about how to deal with Debian dependencies and extensions depending on each other.
    • For Q4 goal, we don't technically need to solve the ext dependency issue (Math does not depend on other extensions or skins)

Are we testing a lot

All Quibble jobs run these combinations: mysql/vendor/php70, mysql/composer/php70, mysql/vendor/php55, mysql/vendor/hhvm. Each job runs the following (a coverage sketch follows this list):

  • php/js lint/eslint
  • qunit/phpunit
  • webdriver.io
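To make the question concrete, a small sketch that compares the combinations we actually run against the full database / dependency-mode / runtime matrix; the axes and job list come from the notes above, everything else is an assumption.

    # Sketch: which combinations do we run, and which are missing from the full matrix?
    from itertools import product

    databases = ["mysql"]
    dependency_modes = ["vendor", "composer"]
    runtimes = ["php55", "php70", "hhvm"]

    # Combinations currently exercised by Quibble jobs (from the list above).
    running = {
        ("mysql", "vendor", "php70"),
        ("mysql", "composer", "php70"),
        ("mysql", "vendor", "php55"),
        ("mysql", "vendor", "hhvm"),
    }

    full_matrix = set(product(databases, dependency_modes, runtimes))
    for combo in sorted(full_matrix - running):
        print("not covered:", "/".join(combo))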