Wikimedia Cloud Services team/Onboarding Chico/Sessions
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date.
Chico Questions
next session
Flapping alerts in shinken T161898
Diamond
- Is there a reason for collecting fewer metrics about puppet?
- We have an adapted minimalpuppetagent.py that collects a lot less than the original puppetagent.py (it also adds a _check_sudo method)
- Maybe puppetagent.changes.total, puppetagent.events.failure, puppetagent.events.success, and servers.hostname.puppetagent.events.total could be useful as well?
- https://diamond.readthedocs.io/en/latest/collectors/PuppetAgentCollector/
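The event and change metrics above are derived from Puppet's last_run_summary.yaml. As a rough sketch of that derivation (this is illustrative, not the actual minimalpuppetagent.py; the sample data and the naive parser are assumptions):

```python
# Illustrative sketch: build Diamond-style metric names such as
# puppetagent.events.failure from the two-level structure of Puppet's
# last_run_summary.yaml. Sample data below is made up.

SAMPLE_SUMMARY = """\
events:
  failure: 0
  success: 3
  total: 3
changes:
  total: 3
"""

def parse_summary(text):
    """Naively parse the two-level YAML-like summary into flat metric names."""
    metrics, section = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(' '):
            # top-level key, e.g. "events:" -> section name
            section = line.strip().rstrip(':')
        else:
            key, _, value = line.strip().partition(':')
            metrics['puppetagent.%s.%s' % (section, key)] = int(value)
    return metrics

metrics = parse_summary(SAMPLE_SUMMARY)
print(metrics['puppetagent.events.failure'])  # 0
print(metrics['puppetagent.changes.total'])   # 3
```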
Puppet errors analysis
- https://phabricator.wikimedia.org/P6712 (JS to fetch and play with JSON datapoints)
- I was thinking of doing a simple sum and tolerating somewhere between 3 and 5 failures before warning
- Ended up with -W 5 -C 15 on https://gerrit.wikimedia.org/r/411315
- Track Flaps
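The "simple sum with tolerated failures" idea can be sketched like this, mapping the summed failure datapoints to a Nagios-style state with the -W 5 -C 15 thresholds from the Gerrit change (the `[value, timestamp]` datapoint shape follows Graphite's JSON render API; the function name is illustrative):

```python
# Hedged sketch of the check logic behind -W 5 -C 15: sum the failure
# datapoints returned by Graphite over the window being checked and map
# the total to OK / WARNING / CRITICAL.

def check_failures(datapoints, warn=5, crit=15):
    # Graphite emits [value, timestamp] pairs; value can be None for gaps.
    total = sum(v for v, _ts in datapoints if v is not None)
    if total >= crit:
        return 'CRITICAL', total
    if total >= warn:
        return 'WARNING', total
    return 'OK', total

# e.g. puppet failures per interval over the check window
points = [[0, 1518534000], [2, 1518534060], [None, 1518534120], [4, 1518534180]]
print(check_failures(points))  # ('WARNING', 6)
```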
Bastion alerts
- Create alerts specific to the bastions
Using puppet on VPS
- Docs
https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster
PAWS
- How to test changes?
- Staging env
- tools-beta
K8s logging retention
- We can force pods to restart after 30 days, but it sounds like a terrible idea
- Revisit after tools-beta
Other tasks
- Is there something else I should be looking into?
2018-02-13
How is monitoring configured?
- We have Icinga, which is being phased out in favor of Prometheus on production servers
- Shinken on labs instances
- My goal is to add alerts for tools-bastions; it seems this should be done in the Icinga/Prometheus task T186552 https://phabricator.wikimedia.org/T186552
- We already collect CPU and IO data for tools-bastions (https://tools.wmflabs.org/nagf/?project=tools#h_tools-bastion-03_cpu )
- I see we can use a check_graphite_series_threshold to get the loadavg like we are doing with iowait (from https://graphite-labs.wikimedia.org/ )
- There is no total_cpu metric, we need number of cores to know what to set for warning and critical in loadavg
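Since there is no total_cpu metric, the loadavg thresholds have to be derived from the instance's core count. A minimal sketch of that derivation (the 1.5x/3x multipliers are illustrative assumptions, not values from the task):

```python
# Hypothetical sketch: scale loadavg warning/critical thresholds by the
# number of cores, for use with check_graphite_series_threshold.
# The multipliers below are assumptions chosen for illustration.
import os

def loadavg_thresholds(cores, warn_factor=1.5, crit_factor=3.0):
    return cores * warn_factor, cores * crit_factor

warn, crit = loadavg_thresholds(os.cpu_count() or 1)
print('check_graphite_series_threshold -W %.1f -C %.1f' % (warn, crit))
```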
- https://etherpad.wikimedia.org/p/chicoandchase
- https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518534146.333&target=tools.tools-bastion-03.cpu.total.idle&from=-30d
- tc / ifb: tc can only manipulate send queues
- iotop, iotop -ao
- https://graphite-labs.wikimedia.org/render/?width=674&height=377&_salt=1518535047.584&target=tools.tools-bastion-03.nfsiostat.labstore.ops&target=tools.tools-bastion-03.nfsiostat.labstore.ops_per_sec&target=tools.tools-bastion-03.nfsiostat.labstore1003.ops&target=tools.tools-bastion-03.nfsiostat.labstore1003.ops_per_sec
WMCS Phabricator etiquette
- Do we have documentation about how to triage tasks and move them around projects and workboards?
- TBD
Cloud VPS / Horizon stuff
- I am still unfamiliar with the interfaces and common questions; maybe I should create a temp project and go through the docs.
- make a task for a chicotestproject T187213
- Where are things configured?
- Wikitech
- operations-puppet repo
- Horizon
https://wikitech.wikimedia.org/wiki/Hiera:tools
~/git/wmf/puppet cpettet@cair>ls hieradata/labs/tools
toolsadmin.wikimedia.org
Wikitech docs
- Portal namespace for user-facing docs
- /admin subpage for WMCS team
Other tasks
- Is there something else I should be looking into?
- Let's start slow and I'll try to integrate you into my sort of normal workflow
- Flapping alerts in shinken
- host* as a way to get the % of hosts in failure
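The host* idea can be sketched as follows: given the latest datapoint per host (e.g. from a Graphite host* wildcard query), compute the fraction of hosts currently failing. The data shape and failing-value convention are illustrative assumptions:

```python
# Sketch of deriving "% of hosts in failure" from per-host latest values,
# such as those returned for a host* wildcard target. Assumes a value of 1
# means the host's check is failing; None means no recent data.

def percent_failing(latest_by_host, failing_value=1):
    hosts = [h for h, v in latest_by_host.items() if v is not None]
    if not hosts:
        return 0.0
    failing = sum(1 for h in hosts if latest_by_host[h] == failing_value)
    return 100.0 * failing / len(hosts)

latest = {'host-01': 0, 'host-02': 1, 'host-03': 1, 'host-04': 0}
print(percent_failing(latest))  # 50.0
```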