Wikimedia Release Engineering Team/Checkin archive/20180924
2018-09-24
Vacations/Important dates
- September 27th (Thursday) - Antoine busy handling paperwork
- Early to mid-October - Antoine to take some weeks/days off or work part time
- October 5th (Friday) - Željko at a conference (https://2018.webcampzg.org/ )
- October 8th - Holiday (Indigenous People's Day, Independence Day - Željko)
- October 8th - New hire start date
- November 1 (Thursday) - Holiday (All Saints' Day - Željko)
- November 9th - Holiday (Veterans Day)
- November 22+23 - Holidays (Thanksgiving)
- November 25 - December 2nd: Mukunda vacation (in California ahead of the offsite)
- Week of December 3rd - Team offsite
- December 24-28 - Holidays (Christmas)
Rotating positions
Train
- Maniphest query for deployment blocker tasks: https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-fmcvjrkfvvzz3gxavs3a&statuses=open%28%29&group=none&order=newest#R
- July 02 - wmf.11 - Zeljko - no train, Fourth of July
- July 09 - wmf.12 - Zeljko
- July 16 - wmf.13 - Zeljko
- July 23 - wmf.14 - Zeljko
- July 30 - wmf.15 - Mukunda
- Aug 06 - wmf.16 - Mukunda
- Aug 13 - wmf.17 - Mukunda (No train - Wednesday is a holiday)
- Aug 20 - wmf.18 - Tyler
- Aug 27 - wmf.19 - Dan && Antoine lurking over the shoulders
- Sep 03 - wmf.20 - Antoine
- Sep 10 - wmf.21 - Antoine (No train due to DC switchover)
- Sep 17 - wmf.22 - Antoine
- Sep 24 - wmf.23 - Zeljko <----
- Oct 01 - wmf.24 - Dan
- Oct 08 - wmf.25 - Dan (No train due to DC switchover)
- Oct 15 - wmf.26 - Mukunda (last 1.32 wmf.XX release, 1.33 starts the next week)
- Oct 22 - wmf.1 - Mukunda
SoS
- July 04 - Dan
- July 11 - Antoine
- July 18 - Antoine
- July 25 - Tyler
- Aug 01 - Tyler
- Aug 08 - Zeljko
- Aug 15 - Dan (No SoS this week)
- Aug 22 - Zeljko
- Aug 29 - Zeljko
- Sep 05 - Tyler / Željko
- Sep 12 - Tyler / Željko
- Sep 19 - Dan / Željko
- Sep 26 - Zeljko <----
- Oct 03 - Zeljko
- Oct 10 - Zeljko
- Oct 17 - Zeljko
- Oct 24 - Zeljko
- Oct 31 - Zeljko
Team Business
Hiring
- Software Engineer position is open; reviewing candidates/hiring for now
First Offsite
Details:
- Week of December 3rd
- At the Queen Mary hotel in Long Beach
- Deb T will be facilitating
Topics!
Development plans
- Due end of the week!
Needs attention
- 2018-09-10 -- Gerrit Privacy Policy & CoC patch
- https://phabricator.wikimedia.org/T196835
- 2018-09-17 -- Patches for new UI:
- (ops/puppet) Replace polygerrit theme in repo: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458523/
- (gerrit) Remove from repo: https://gerrit.wikimedia.org/r/#/c/operations/software/gerrit/+/458524/
- (ops/puppet) Add footer link for new UI: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458833/
- (ops/puppet) Add footer link for old UI: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460914/
- All applied to: http://gerrit.tylercipriani.com:8080
- 2018-09-24 -- for puppet swat tomorrow
- 2018-09-10 -- Run mediawiki::maintenance scripts in Beta Cluster
- https://phabricator.wikimedia.org/T125976
- Tyler to create instance
- 2018-09-17 - not done
- 2018-09-24 -- done (deployment-mwmaint01)
Operational Excellence posts
- Greg got it at 5:45 on Friday, hasn't had a chance to review yet.
- CI: https://docs.google.com/document/d/181-LQJ-iyxKYXEo93tEGEiAsm9dtPiA3iXVDCj_RSRI/edit?ts=5ba5620c
- Ops: https://docs.google.com/document/d/1dtkvwWGknReIqhA2wkGQSr1Exmel4Q00bttT3rXQBQE/edit?ts=5ba56205
Scrum of Scrums
- Greg to copy to etherpad after meeting: https://etherpad.wikimedia.org/p/Scrum-of-Scrums
This week
Release Engineering
- Blocked by:
- Blocking:
- Updates:
- Train Health:
- Log Health:
- T204871 web request took longer than 60 seconds and timed out (copy to callouts)
- Code Health:
- Creating communication channels (Phabricator https://phabricator.wikimedia.org/tag/code-health-metrics/, IRC, mailing list)
Last week
Release Engineering
- Blocked by:
- [WMCS] Increased quotas for vcpu and memory in integration project: https://phabricator.wikimedia.org/T204373
- Blocking:
- Updates:
- Train Health: no train last week due to DC switchover, train continues this week
- Log Health:
- Code Health:
- Code Health Metrics Working Group Kickoff last week
- Code Health Metrics Working Group meeting this week - further discuss/define the workgroup's scope and next steps
Train status and happenings
1.32.0-wmf.22 went well. Antoine wrote a quick summary at the end of the task, with thanks to the people involved: https://phabricator.wikimedia.org/T191068#4604040
Things potentially worth attention:
- New but not blocking T204871: Promoting group1 to 1.32.0-wmf.22 caused a flood of "web request took longer than 60 seconds and timed out" errors
- A wikiversions.json update (and probably any scap action) causes a flood of request timeouts, which resolves by itself. The timeouts were previously NOT enforced, so we have probably always had the issue and it is only showing up now. To be investigated.
- For next train: the timeouts can be ignored for the following 3 or 4 minutes. See task for details.
- Worked around T204907: Scap is checking canary servers in dormant instead of active-dc
- scap dsh groups were still referencing EQIAD servers, making the canary check useless. Antoine changed them to codfw hosts. A better solution is needed to switch them automatically based on the active datacenter. Maybe conftool/etc can help.
- Known T204961: ORES requests for wikidatawiki models=damaging end up with HTTP request timed out
- When wikis change versions, ORES seems to have trouble handling the new requests. There are a few HTTP timeouts when reaching the ORES service. Amir stepped in immediately and asked on Friday whether that was UBN worthy, but Antoine said it could wait for Monday SWAT.
- thcipriani: maybe make a simple timeline/incident report for this? (e.g. https://wikitech.wikimedia.org/wiki/Incident_documentation/20180821-Train )
- ACTION: Antoine to do this
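Sketch for the T204907 note above (choosing canaries from the active datacenter instead of a hard-coded list). This is only an illustration of the idea, not how scap or conftool work: the host names, the `CANARIES_BY_DC` mapping and the `canaries_for` helper are all hypothetical, and how the active datacenter is actually discovered is left out.

```python
from typing import Dict, List

# Hypothetical inventory: datacenter name -> canary hosts.
# Host names are made up for illustration only.
CANARIES_BY_DC: Dict[str, List[str]] = {
    "eqiad": ["mw-canary1001.example", "mw-canary1002.example"],
    "codfw": ["mw-canary2001.example", "mw-canary2002.example"],
}


def canaries_for(active_dc: str,
                 inventory: Dict[str, List[str]] = CANARIES_BY_DC) -> List[str]:
    """Return the canary hosts of the active datacenter.

    Raises ValueError when the datacenter has no canaries defined, so a
    deploy cannot silently run its checks against zero (or dormant) hosts.
    """
    hosts = inventory.get(active_dc, [])
    if not hosts:
        raise ValueError(f"no canary hosts defined for datacenter {active_dc!r}")
    return list(hosts)


if __name__ == "__main__":
    # During the switchover the active datacenter was codfw.
    print(canaries_for("codfw"))
```

The point of failing loudly is that a canary check against the dormant datacenter passes trivially, which is what made the check useless here.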
Past week status updates
- All of it in table form: https://www.mediawiki.org/wiki/Wikimedia_Release_Engineering_Team/Goals/201718Q4
Quarterly Goals for Q1
Pipeline: Move verify stage from Minikube to CI k8s namespace in production context
- some movement for next quarter stuff -- zotero-v2/node10js images
Code Health
- T199253 - Investigate and propose record of origin (ROO) for deployed code (currently Developers/Maintainers page)
- On track to have first pass proposal defined
- Perform existing Stewardship review process for Q1 cycle.
- T199254 - Add test evaluation to post mortem review process.
- Review existing e2e test coverage.
- Define prioritization scheme.
- Prioritize e2e testing gaps.
- T199257 - make current unit testing coverage more visible by reporting out to Engineering Management.
- Will have first pass Code Health Newsletter (which will include coverage info) by the end of the week.
- T199259 - Platform and Search Platform teams are using TDM PoC
- T199262 - Identify key Tech Debt areas
- T199263 - Put in place Tech Debt management process for PEP
- T199261 - Define base Code Health metric set.
- Working group met last week as well, have base tasks defined, and have started defining some metric candidates.
Developer Productivity
- Make a hire to create the capacity needed for this program.
- Write and share a survey to measure developer satisfaction and areas for investment. - task T197635
- hiring
- survey?
Other work
Selenium
- Q1 goals task: T198389 Q1 Selenium framework improvements
- T179188 Video recording for Selenium tests in Node.js - Antoine and Željko disagree on if it's done :) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933
- T199133 Find top 15 target projects that could use Selenium tests to prevent incidents
- Review existing e2e test coverage - done
- Define prioritization scheme - doing
- Prioritize e2e testing gaps - next
Gerrit
Phabricator
- Task types work: https://phabricator.wikimedia.org/T93499
- Blog post about task types: https://phabricator.wikimedia.org/phame/post/view/116/an_introduction_to_task_types_in_phabricator/
Jenkins
- Timo is writing a wikitech-l newsletter and including a section about our recent CI work (disk space issues, consolidation of instances, etc.). He wants to link out to a more substantial post from us. This would need to be done by Tuesday. :)
- (Covered. See Production Excellence section under Team Business)
QA
- Had QA SIG meeting last week. Spoke with Elena to see if additional discussions about QA career paths took place in Audiences. None so far.
SCAP
- Scap REAL canary patch: https://phabricator.wikimedia.org/D1114
- thcipriani: accepted! land at will.
- The "rebuildLocalisationCache.php takes 40 minutes" task is complete
- Took 1m 7s with no changes to process, so real runs will be slower than that, but should still be much, much faster than before
Standup!
Antoine
Did train, a bit of Quibble and CI config. Train went well!
- What I plan to do this week
- What I'm blocked on
- Other?
Dan
- What I plan to do this week
- Continuing my crusade of collecting Jenkins build duration stats
- Blubberoid Swagger/OpenAPI spec
- Development plan
- What I'm blocked on
- Understanding prometheus and/or best way to aggregate statsd buckets
- Review of my change to service-checker
- Other?
- Anyone feel like reviewing?:
- Blubberoid unit test
- Remove support for `sharedvolume` in Blubber
- thcipriani: will do some review on these :)
Greg
- What I plan to do this week
- interviewing
- doing a SWAT today :)
- "finalize" y'all's development plans
- ping Deb on when to start planning out our Offsite - delay this
- review of onboarding docs again (steal some good stuff from Discovery Team's) (thcipriani: https://wikitech.wikimedia.org/wiki/Ops_Onboarding ops has good stuff to steal, too :))
- production excellence blog review
- Pipeline presentation outlining?
- What I'm blocked on
- Other?
Jean-Rene
- What I plan to do this week
- wrap up Q1 Goals
- Dev plan
- What I'm blocked on
- Other?
Mukunda
- What I plan to do this week
- Finish development plan
- Scap swat https://phabricator.wikimedia.org/T196411
- Working with Chase on custom "security issue" task type
- Some other things and stuff
- Get feedback on Dev Productivity survey
- What I'm blocked on
- Other?
Tyler
- What I plan to do this week
- Development plan convo
- CoC footer patch
- keyholder code review
- What I'm blocked on
- Other?
- zotero-v2 followup as needed
- scap workboard cleanup as there's time
Zeljko
- What I plan to do this week
- T191069 1.32.0-wmf.23 deployment blockers
- T198389 Q1 Selenium framework improvements
- T179188 Video recording for Selenium tests in Node.js - Antoine and Željko disagree on if it's done :) https://gerrit.wikimedia.org/r/c/mediawiki/core/+/422933
- T199133 Find top 15 target projects that could use Selenium tests to prevent incidents
- Review existing e2e test coverage - done
- Define prioritization scheme - doing
- Prioritize e2e testing gaps - next
- What I'm blocked on
- Other?
Grooming
Team Kanban Board Review and Triage
- Closed and touched in the last 7 days
- No update for 4 weeks
- No update for 3 weeks
- No update for 2 weeks
- No update for 1 week
- All Open
- Review To Triage column of #releng
Once / month-ish review of backlog(s)
- releng Review To Triage column of #releng
- releng-kanban Review unassigned in kanban
- releng-kanban Review 'backlog' column of -kanban
- releng-next - Review for things we need to put on our kanban backlog
- releng-backlog - oh my, the huge backlog of things...