Wikimedia Performance Team/Sprints
This page is obsolete. It is being retained for archival purposes. It may document extensions or features that are obsolete and/or no longer supported. Do not rely on the information here being up-to-date. The Wikimedia Performance Team was disbanded with the WMF re-org that happened in July 2023.
2023
Outreach:
- Complete refresh of the frontend and backend guidelines and best practices (Timo, Peter, Aaron)
- Blog about the completion of the Multi-DC project (Aaron, Timo)
Insights:
- Frontend Synthetic: Move synthetic tests from AWS to bare metal (Peter) — In Q2 we evaluated in-house and external suppliers. We ended up choosing Hetzner. The server is available and accounted for in our budget.
- Frontend Synthetic: Reliably measure how fast a Wikipedia article would be without JavaScript (Peter)
- Frontend RUM: Add Long Tasks metrics to Navigation Timing (Barakat)
- Frontend RUM: Decommission coal and coal-web (Timo)
- Frontend RUM: Migrate navtiming processor from Graphite to Prometheus (Peter, Timo) — The python-prometheus client became a bottleneck in our non-parallelized setup. We reduced cardinality to resolve this.
- Backend: Profile time spent per component/extension in MW entry points (and visualise in Grafana) (Aaron)
- Backend: Increase retention of ArcLamp SVGs to 2 years (Timo)
- Backend: Add per-request flamegraph option to WikimediaDebug (Tim, Timo)
- Blog post: Flame graphs arrive in WikimediaDebug!
Improvement:
- ResourceLoader: Implement support for Source Maps (Tim, Timo)
- ResourceLoader: Implement continuous verification of MediaWiki core's foreign resources in WMF CI (Timo)
- ResourceLoader: Raise Grade A JavaScript requirement from ES5 (2009) to ES6 (2015) (Timo)
- Rdbms: Reduce complexity of LB and LBF (Amir, Aaron, Timo)
- Rdbms: Evaluate LoadMonitor connection weighing improvements (Aaron, Tim)
- Support Serve production traffic via Kubernetes (Timo)
Internal:
- Onboard one new team member (Aaron, Peter, Tim, Timo, Larissa).
- Support for various teams outside perf scope (Tim)
- Better Diffs: Wikidiff2 revise algorithm
- Async Fragments (officewiki document)
- (late arrival) IP Masking 2.0
Other goals that we considered but were postponed, cancelled, or incomplete:
2022
Insights:
- Frontend RUM: Migrate navtiming processor from Graphite to Prometheus (Peter, Timo) — Continue in 2023
- Frontend RUM: Expand navigation timing metrics to include modern user experience metrics (Peter, Timo)
- Frontend RUM: Update how we measure Layoutshift in Navigation Timing to reflect CLS metrics (Peter)
- Frontend Synthetic: Migrate synthetic tests infrastructure from AWS to bare metal (Larissa, Peter) — Moved to Q3 (Jan-Mar 2023)
- Frontend Synthetic: Bitbar: Add firefox capabilities (Peter)
- Backend: Understand the status of SLOs on Product side (Larissa) — Have been talking with Suman and Desiree trying to restart the discussions
- Backend: Cross-DC query Alerts (Aaron)
Improvement:
- Prepare MediaWiki for PHP 8.1 (Tim, Timo). — Done from our side. Waiting on SRE ServiceOps. They prioritize mediawiki-on-k8s until end of Q3, but will be able to tackle PHP 8.1 in the beginning of Q4 (April-June 2023)
- Rdbms: Better LoadBalancer connection pooling (Aaron)
- Research opportunities in static.php traffic to identify simpler and longer-lasting caching policies. Reduce backend traffic to static.php by more than 70%, and remove a custom WMF-specific endpoint in the process in favour of standard MediaWiki routes, which require less maintenance going forward. (T285232, T302465)
Other goals that we considered but were postponed, cancelled, or incomplete:
- Multi-DC BagOStuff interfaces (Aaron)
- Find someone to run user interviews (Larissa) — Both Desiree and Marshal cannot help us at this time. Marshal suggested I run a couple of interviews on my own first, but we currently don't have the bandwidth to come up with a solid interview script and do the necessary pre-work
2021
See also internal 2021-2022 roadmap and internal Jan-Mar 2022 achievements.
Outreach:
- Support product development by Inuka Team (Wikipedia Preview), Reading Web (NearbyPages, and RelatedArticles), CPT (WebAuthn), Design Systems Team (WVUI/Vue.js), and WMDE (Kartographer-revid)
- Participate in the SLO working group to help establish an SLO for MediaWiki Save Timing.
- Participate in W3C WebPerf WG, provide feedback to Chrome team on Google Web Vitals and Chrome bugs.
- Organise the Web Performance devroom for FOSDEM 2021 (recordings).
- Speak at the We Love Speed conference (recording).
- Organise four Web Perf Hero awards.
Insights:
- Migrate our device lab to BitBar.
- Evaluate and build proof-of-concept synthetic testing on bare metal instead of at AWS.
- Write runbooks for investigating RUM alerts, WPT alerts, and WPR alerts.
- Support to SRE Observability in developing a new Prometheus-compatible MW-Stats client library.
- On-going maintenance of WebPageTest, WebPageReplay, and Fresh-node.
Improvement:
- Multi-DC: Deploy MainStash DB and migrate away from Redis-based MainStash (T212129).
- Multi-DC: MariaDB-TLS tested and enabled for all wikis.
- Multi-DC: CDN routing logic written and deployed to Beta and Prod behind feature flag.
- ResourceLoader debug mode v2, reduce wait time on complex pages from ~1 minute to ~1 second.
- Guidance and code review for DBA-led normalization of "templatelinks" MediaWiki database table, to reduce storage pressure and improve query performance. (T299417)
- Support to SRE ServiceOps for MW-on-K8s project.
- Develop precache-based GlobalUserEdit API for CentralAuth, following an incident.
2020
See also internal 2020-2021 roadmap.
Outreach:
- Support product launch by Anti-Harassment Team (IPInfo extension), and CPT (API Portal skin, API Portal OAuth extension, Changes to OAuth ext).
- Support development kick-off of Abstract Wikipedia (WikiLambda) through early check-in and 1-month team residency/matrixing in both directions.
- Organise the first Web Performance conference at FOSDEM (blogpost, recordings).
- Organise the first Web Perf Hero award.
- Get published in the Web Performance Calendar (4x: Human performance metrics, Profiling PHP at scale, Future of Web Vitals from a non-Googler, Setting up a device lab).
- Enable teams to create their own production error dashboards in Logstash with a template, written guide, and video presentation.
Insights:
- Expand navtiming RUM metrics pipeline with new Layout Shift metric.
- Kobiton setup for our device lab, expand to include iOS in addition to Android.
- Explore BitBar for our device lab.
- Explore moving WPT/WPR infra away from AWS.
Improvement:
2019
See also 2019-20 Q1#Performance and internal 2019-2020 roadmap.
Outreach:
- Design and implement the AS Report, to expand and formalize collaborations to leverage our influence with browser vendors and ISPs. (Announcement on Techblog).
- Initiate and work on Wikimedia Foundation becoming an official W3C member organization. This expands the Performance Team's participation in web standards and moves us from an "invited expert" (individual) to a represented membership organisation. (Announcement on wikimediafoundation.org)
- Support product launches by Parsing Team (Parsoid-PHP launch), Editing Team (DiscussionTools launch), Growth Team (GrowthExperiments launch), and Inuka Team (Wikipedia KaiOS app launch).
- Support RelEng around establishing production error triage workflows and semi-automation thereof.
- Organise WMF-wide frontend web performance training.
- Provide performance expertise to Frontend Architecture Working Group (FAWG).
- Get published in the Web Performance Calendar (2x: Measuring LT and FID, Big questions on RUM)
Insights:
- Research, develop, and test new RUM metrics that better match user perception (T187299, Meta-Wiki, Rossi 2019 paper).
- Organise and oversee implementation of First Paint metric in WebKit for Apple Safari (blog post).
- Introduce automatic developer-facing performance metrics for specific chunks of MediaWiki code in core and extensions, powered by WANObjectCache (T197849).
- Add more RUM metrics to the navtiming pipeline, including instrumentation for First Input Delay (T332012).
- Participate in Chrome Origin trial for Element Timing and provide feedback on upcoming W3C standard (blog post).
- Release WikimediaDebug v2 (blog post).
- Create our own Mobile Device Lab.
- On-going first response to synthetic testing alerts, including investigating regressions after Chrome/Firefox releases and comms with upstream browser vendors.
- On-going maintenance of WebPageTest and WebPageReplay.
- On-going maintenance of XHGui, including dealing with MongoDB becoming non-free software by developing and upstreaming MySQL drivers for XHGui, and migrating our install from MongoDB to MySQL.
Improvements:
- PHP7 Transition: Finish the transition from HHVM and support SRE with instrumentation, sampling, and benchmarking.
- Multi-DC: Start work on MainStash DB.
- Faster MediaWiki backend startup time to reclaim PHP7 latency increase in certain areas. (T233886, T189966).
- Faster page load time, by reducing ResourceLoader startup cost (blog post).
- Guidance, CR and testing for new AbuseFilter parser (development by Daimona) to improve Save Timing (T156095).
2018
See also 2018-19 Q1, 2018-19 Q2, and internal 2018-2019 roadmap.
Insights:
- Annual Plans/FY2019/TEC1: Current levels of service are maintained and/or improved.
- Enhance performance testing infrastructure, including addition of Chrome Tracelog (T182510), and introduction of WebPageReplay+Browsertime (based on last year's research) to complement and eventually replace WebPageTest (T153360). Blog post: Performance testing in a controlled lab environment
- Introduce Excimer, a new sampling profiler for PHP 7 to replace HHVM Xenon (T176916). Includes creation of the new php-excimer extension (blog post).
- Implement new "Backend-Timing" metric on Apache PHP web servers, as first full measurement of MediaWiki latencies. Backed by Prometheus. (T131894)
- Migrate WebPageTest hosting from Windows to Linux (T165626)
- Expand synthetic testing to more non-English wikis.
- Introduce Fresnel, performance testing in MediaWiki CI jobs. (T133646).
- Review current research on performance perception (T165272, T187299). Essay: Perceived Performance (2018). Blog posts: Mobile web performance: the importance of the device, Machine learning: how to undersample the wrong way.
- Develop new "navtiming2" metric definitions, addressing what we learned since 2015, and enable use of stacked graphs (T104902).
- On-going maintenance of navtiming.py service, including migration to dedicated hardware, and support for failover to secondary datacenter.
Outreach:
- Measure performance from Asia both pre- and post- Singapore data center coming online (T169180, T168416), including a new navtiming capability for geographic oversampling (T169522). (blog post)
- Publish the first post in the Perf Matters at Wikipedia series.
- Get published in the Web Performance Calendar (5x: Magic numbers, Comparing HAR, Measuring Wikipedia, Why perf matters, AVIF).
Improvement:
- Annual Plans/FY2019/TEC1: Improve MediaWiki availability and reduce read-only impact from data center switchovers.
- Multi-DC: Develop integration and support for Mcrouter service in MediaWiki's WANObjectCache, support SRE's rollout of mcrouter service. (T198239)
- Annual Plans/FY2019/TEC4: PHP7 Migration: Guide the work and support other teams.
- Introduce support for packageFiles to ResourceLoader (T133462).
- Introduce support for WebP compression format to Thumbor.
- Reduce page load time by refactoring the startup module to need only one roundtrip instead of two, effectively loading jQuery in parallel outside the critical path. (T192623).
2017
See also Annual Plan/2017-2018#Technology, 2017-18 Q3, 2017-18 Q4, and internal 2017-2018 roadmap.
Outreach:
- Publish in the Web Performance Calendar (Automate performance regression alerts).
Insights:
- Program 1. Availability, performance, and maintenance.
- All production sites and services maintain current levels of availability or better.
- Maintain a comprehensive toolset to measure the performance of our platforms.
- Research reverse proxy technologies with the objective of obtaining more stable metrics from the synthetic testing infrastructure, increasing confidence and reducing the minimum regression size for detection. Evaluated Mahimahi, WebPageReplay, and mitmproxy; selected WebPageReplay. Deployed WebPageReplay+Browsertime to complement and eventually replace WebPageTest (T153360).
- Implement a performance alerting system atop Grafana. Establish it as a practice for other teams to follow. Two teams used it in the first year. T153169
- Develop new "navtiming2" metric definitions, addressing what we learned since 2015, and enable use of stacked graphs (T104902, blog post).
Improvement:
- Support for HHVM-PHP7 migration and upgrade, including development of php-excimer (T176916, blog post)
- Support regular data center switchovers, including development of EtcdConfig in MediaWiki core (T156924, T160178)
- Expand support in Thumbor to private wikis. Thumbor service replaces MediaWiki ImageHandler (3-part blog post series).
- Program 8. Progress towards multi-datacenter support (wikitech:Performance/Multi-DC MediaWiki).
- Faster Wikipedia time-to-logo. (blog post, T100999)
- Faster edit save timing. (blog post)
- Faster page load time. Reduce load time on 3G-Slow connections by one whole second, from 14s to 13s. T164299#3572231
- Phase out "mediawiki.legacy.wikibits" module to reduce page view cost. T122755
- Migrate MediaWiki core and all deployed extensions to jQuery 3, multi-month cross-team effort. T124742
2016
See also Perf Matters at Wikipedia in 2016 (Blog post), and Annual Plan/2016-2017 Program 4: Improve site performance.
Insights:
- Enhance performance testing infrastructure, including speeding up the infrastructure to achieve hourly testing instead of every 3 hours (T151197), and adding new metrics for DOM size (T159362).
Improvement:
- Help develop Thumbor as a service to replace MediaWiki FileHandler in production (3-part blog post series).
- Help guide and prepare for HTTP/2 roll out to Wikimedia CDN (blog post).
- Progress towards multi-datacenter support (wikitech:Performance/Multi-DC MediaWiki).