Beta Cluster/2014-15-Q3/20141110-meeting
https://www.mediawiki.org/wiki/Wikimedia_Engineering/2014-15_Goals/Q3#Beta_Cluster
The purpose of this meeting is to determine what cross-team support we should plan on for this work. Yuvi is the obvious crossover person from Ops and Antoine is the Deputy of Beta Cluster. Mark and I just fill out paperwork.
Attendees: Andrew B., YuviPanda, Mark B, Robla, Greg, Antoine, Damon
HHVM fcgi restarts during scap runs cause 503s (and failed tests)
https://bugzilla.wikimedia.org/show_bug.cgi?id=72366
- scap restarted HHVM, causing 503s; bd808 reverted the change
- Ori started improving unit-test code coverage of PyBal; he is also looking at eventually adding PyBal to the Beta Cluster
- The Beta Cluster has no LVS though, so app servers cannot be depooled before a restart (see the sketch below)
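A rough sketch of the depool/restart/repool flow that LVS + PyBal would make possible, and that a straight HHVM restart during scap skips. The host names and the pool/depool commands are hypothetical placeholders, not the actual scap or PyBal interface:

    #!/usr/bin/env python
    """Hypothetical rolling HHVM restart: depool, restart, wait, repool.

    The host names and the pool/depool commands are placeholders, not the
    real scap or PyBal interface.
    """
    import subprocess
    import time

    APP_SERVERS = ["mw-app01.example.org", "mw-app02.example.org"]  # made-up hosts


    def run(host, command):
        """Run a shell command on a host over ssh, raising on failure."""
        subprocess.check_call(["ssh", host, command])


    def healthy(host):
        """Crude health probe: does the web service answer locally again?"""
        return subprocess.call(["ssh", host, "curl -sf http://localhost/ >/dev/null"]) == 0


    for host in APP_SERVERS:
        run(host, "depool")                    # placeholder: drop host from the LVS pool
        run(host, "sudo service hhvm restart")
        for _ in range(30):                    # give HHVM up to ~30s to come back
            if healthy(host):
                break
            time.sleep(1)
        else:
            raise RuntimeError("%s did not come back after restart" % host)
        run(host, "pool")                      # placeholder: put host back in the pool

Without LVS on beta there is nothing to depool from, which is why adding PyBal to the Beta Cluster is a prerequisite for this kind of graceful restart.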
Dev code pipeline/Nightly
Instead of a second Beta Cluster (the maintenance cost would be overkill), use multiversion to run two versions in parallel (see the sketch below)
- Phabricator tasks have been filed
What is going to be our definition of Done?
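A minimal sketch of the multiversion idea, assuming a wikiversions.json-style mapping like the one MediaWiki's multiversion setup uses; the wiki names and checkout directories are made up for illustration:

    import json

    # Made-up wikiversions-style mapping: point a few beta wikis at a second
    # MediaWiki checkout so two versions run side by side on one cluster.
    wikiversions = {
        "enwiki": "php-master",     # wikis on the regular beta code
        "dewiki": "php-master",
        "testwiki": "php-nightly",  # wikis opted in to the nightly/candidate build
    }

    with open("wikiversions.json", "w") as f:
        json.dump(wikiversions, f, indent=4, sort_keys=True)

    # multiversion then routes each request to the checkout named above,
    # e.g. /srv/mediawiki/php-master vs. /srv/mediawiki/php-nightly.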
Prod/BC reconciliation
- Start with a diff between prod and BC (see the sketch below)
- Have Antoine, Andrew, and Yuvi flesh out the list of things we need to fix
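One possible shape for that diff, sketched below: dump comparable key/value state (package versions, applied puppet roles, hiera settings) from a production host and its beta counterpart, then list every key that differs. The dump format and file paths are assumptions:

    #!/usr/bin/env python
    """Sketch of a prod vs. Beta Cluster diff: load two key/value dumps
    (package versions, applied roles, hiera settings, ...) and print every
    key that differs. The dump files themselves are hypothetical.
    """
    import json
    import sys


    def load(path):
        with open(path) as f:
            # e.g. {"hhvm": "3.3.0", "role::mediawiki::appserver": true, ...}
            return json.load(f)


    prod = load(sys.argv[1])   # dump taken on a production app server
    beta = load(sys.argv[2])   # dump taken on the matching beta instance

    for key in sorted(set(prod) | set(beta)):
        if prod.get(key) != beta.get(key):
            print("%-45s prod=%r beta=%r" % (key, prod.get(key), beta.get(key)))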
Truly responsive repair of Beta Cluster
Vision: have bugs auto-filed whenever anything screws up on the Beta Cluster. The Beta Cluster is a shared resource and needs a sheriff to babysit it: a code-sheriff team monitors the shared resource, reports traces/bugs, and grabs people to fix things up.
- Result: build trust across the org
Sheriff-like process (https://wiki.mozilla.org/Sheriffing)
- Someone on call whose responsibility it is to shepherd a fix for any breakage (across all of Engineering, prod/beta)
- Advertise Logstash more; it is really useful for devs to look at and helps with babysitting (see the query sketch below)
- Lots of developers fix issues by themselves, filing bugs against their own software and not bothering anyone (but fixing the issue nonetheless)
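As a concrete example of what "look at Logstash" could mean for a sheriff, here is a sketch that pulls the last hour of errors out of the Logstash Elasticsearch index; the endpoint, index pattern and field names are assumptions about the setup rather than verified values:

    #!/usr/bin/env python
    """Sketch of pulling the last hour of errors out of Logstash's Elasticsearch
    index, the kind of query a sheriff (or an auto-bug-filing bot) would start
    from. The endpoint and field names are assumptions about the setup.
    """
    import json
    import urllib.request

    ES = "http://logstash-es.example.org:9200"   # hypothetical Elasticsearch endpoint

    query = {
        "size": 20,
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "filtered": {
                "query": {"query_string": {"query": "level:ERROR"}},
                "filter": {"range": {"@timestamp": {"gte": "now-1h"}}},
            }
        },
    }

    req = urllib.request.Request(
        ES + "/logstash-*/_search",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    hits = json.load(urllib.request.urlopen(req))["hits"]["hits"]
    for hit in hits:
        src = hit["_source"]
        print("%s %s %s" % (src.get("@timestamp"), src.get("host"), src.get("message")))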
Goal: data-driven infrastructure/development planning for Beta. For example, gather enough data to know if we would benefit from a move to non-virtualized hardware for app servers.
Monitoring
- Diamond collects metrics on each instance (CPU, disk usage, etc.)
- Reported to a central Graphite
- JS frontend replacing Ganglia: https://tools.wmflabs.org/nagf/?project=deployment-prep
- Moving to Shinken, which will give us more Labs-specific checks
- Not sure how to give BC the same set of checks as production, due to a limitation of Labs (technical detail: no Puppet collection)
- Labs is more complicated than production :)
- Beta cluster could use its own Shinken instance (complicates things for other instances)
- In production the monitoring checks are mostly active ones (connecting to the instance to execute commands)
- Slowly moving to passive checks via Graphite (see the sketch after this list)
- Puppet failures are now broadcast
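A minimal sketch of such a passive, Graphite-driven check: instead of connecting to the instance, it queries a metric Diamond already ships to Graphite and maps it to an OK/CRITICAL status. The Graphite host, metric path and threshold are made up:

    #!/usr/bin/env python
    """Sketch of a passive, Graphite-driven check: query the metrics Diamond
    already ships to Graphite and turn them into an OK/CRITICAL status, instead
    of connecting to the instance. Host, metric path and threshold are assumptions.
    """
    import json
    import urllib.request

    GRAPHITE = "http://graphite.example.org"              # made-up host
    TARGET = "deployment-prep.mediawiki01.loadavg.01"     # made-up metric path
    THRESHOLD = 8.0

    url = "%s/render?target=%s&from=-10min&format=json" % (GRAPHITE, TARGET)
    series = json.load(urllib.request.urlopen(url))

    datapoints = series[0]["datapoints"] if series else []
    values = [v for v, _ts in datapoints if v is not None]

    if not values:
        print("UNKNOWN: no recent datapoints for %s" % TARGET)
    elif values[-1] > THRESHOLD:
        print("CRITICAL: %s = %.2f (threshold %.1f)" % (TARGET, values[-1], THRESHOLD))
    else:
        print("OK: %s = %.2f" % (TARGET, values[-1]))

The same pattern should work for any metric Diamond collects, which is part of what makes passive checks attractive on Labs, where matching production's active checks is harder.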
Puppet related
- Beta lags behind and is sometimes broken by operations/puppet changes; most breakages are figured out within hours thanks to monitoring
- hiera() is definitely helping and will improve things further
- Puppet compiler: run it for every changeset against the Beta Cluster? (it currently runs on request for prod; see the sketch after this list)
- Needs to be async / overridable by Ops so it does not block them (sometimes a run does not make sense)
- Ops convention is to +2 / merge, test on one prod machine, then generalize
- maybe another step could be introduced to test it on beta
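A rough, self-contained sketch of what a per-changeset compiler run against a beta node could report. compile_catalog() stands in for the real puppet compiler (here it just returns a tiny fake catalog so the diff can be demonstrated end to end); the node name and change id are made up:

    #!/usr/bin/env python
    """Rough sketch of the per-changeset puppet-compiler idea for beta: compile a
    node's catalog with and without a proposed change and report what differs.
    compile_catalog() is a placeholder, not a real puppet command, and the whole
    run is meant to be advisory (async, never blocking an ops merge).
    """


    def compile_catalog(node, change=None):
        """Placeholder for the real puppet compiler: returns a tiny fake catalog
        so the diffing below can be demonstrated."""
        catalog = {
            "resources": [
                {"type": "Package", "title": "hhvm", "parameters": {"ensure": "3.3.0"}},
                {"type": "Service", "title": "hhvm", "parameters": {"ensure": "running"}},
            ]
        }
        if change is not None:  # pretend the change bumps the package version
            catalog["resources"][0]["parameters"] = {"ensure": "3.3.1"}
        return catalog


    def resources(catalog):
        """Index catalog resources by (type, title) so they are easy to compare."""
        return {(r["type"], r["title"]): r.get("parameters", {})
                for r in catalog.get("resources", [])}


    def report(node, change):
        before = resources(compile_catalog(node))
        after = resources(compile_catalog(node, change))
        for key in sorted(set(before) | set(after)):
            if before.get(key) != after.get(key):
                print("%s[%s] would change on %s" % (key[0], key[1], node))


    if __name__ == "__main__":
        report("deployment-mediawiki01", change="Iabc123")  # made-up node and change id

Keeping the run advisory-only matches the point above: it reports a catalog diff asynchronously but never blocks the ops +2 / merge workflow.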
Next?
- Test cluster on bare metal? Could be done in the context of a performance cluster
- Can't really replicate the whole cluster stack, though
- OpenStack could in theory provision bare metal
- Swift?
- Wikimedia labs infra is reliable