GitLab/Initialization/GitLab Implementation Plan
GitLab is a version control and CI system. It is a web application written in Ruby on Rails that uses PostgreSQL and Redis as databases, plus shared storage for the source code repositories. There is a licensed Enterprise Edition (EE) and a free Community Edition (CE). We will be talking about GitLab CE, except where noted.
We currently use Gerrit/Zuul/Jenkins in a similar function. Usage is: 12 git operations per second, X repositories, 800 GB of repository data, thousands of users with about 250 active per month, and about 30 CI VMs consuming 450,000 CI minutes/month (one VM provides about 43,800 minutes in an average month, so roughly 11 VMs are busy at any given point in time). The VMs run in Wikimedia Cloud, and we would prefer to relocate them to a private installation.
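As a sanity check on the concurrency figure, the arithmetic can be written out as a short Python sketch (the 450,000 minutes/month is the figure cited above; everything else is derived):

    # How many CI VMs are busy at any given time?
    minutes_per_month = 365 * 24 * 60 / 12        # ~43,800 minutes in an average month
    ci_minutes_per_month = 450_000                # measured CI usage from above
    concurrent_vms = ci_minutes_per_month / minutes_per_month
    print(f"{concurrent_vms:.1f} VMs busy on average")  # ~10.3, i.e. about 11 servers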
The current Gerrit/Zuul/Jenkins system has a limited high availability setup with warm standbys in an alternate datacenter that could be activated within hours. The CI VMs, however, run on WMCS, which is only available in the eqiad datacenter. We have not tested a failover of these systems.
GitLab can be installed in a number of ways: via an “omnibus” package, via Docker images, via community packages, from source, or on Kubernetes via Helm. The omnibus package and the Docker images are monolithic installations that contain all software components. After installation, one uses a GitLab-provided control tool (gitlab-ctl) and config file (/etc/gitlab/gitlab.rb) to turn off a component and migrate it out of the omnibus setup - for example the PostgreSQL database. GitLab then runs Chef internally to reconfigure itself; this is how GitLab abstracts system administration across different operating systems. On Kubernetes the situation is different, since Helm serves as the single package manager and administration utility: GitLab ships as individual Helm charts, and Helm is used to configure each component.
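As an illustration of the omnibus workflow, migrating the bundled PostgreSQL out of the monolith would look roughly like the following configuration excerpt (a sketch: the host name and password are placeholders, and the exact settings should be checked against the omnibus documentation):

    # /etc/gitlab/gitlab.rb - disable the bundled PostgreSQL, point at an external one
    postgresql['enable'] = false
    gitlab_rails['db_adapter'] = 'postgresql'
    gitlab_rails['db_host'] = 'db1.example.wmnet'    # placeholder external host
    gitlab_rails['db_password'] = '...'
    # afterwards, "sudo gitlab-ctl reconfigure" runs the embedded Chef client
    # to apply the change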
GitLab's own SaaS offering (gitlab.com) runs on Kubernetes under Google Kubernetes Engine on the Google Cloud Platform.
For WMF we should run GitLab under Kubernetes, with dedicated servers for PostgreSQL to provide high availability and a dedicated server for Redis. Repository storage should be implemented through object storage via WMF Swift.
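For illustration, this architecture would translate into Helm values roughly like the excerpt below (a sketch: the key names follow the upstream gitlab/gitlab chart, but hosts and the secret name are placeholders, and the details would need verification against the chart documentation):

    # values.yaml excerpt: disable bundled services, use external/dedicated ones
    postgresql:
      install: false                    # dedicated PostgreSQL servers instead
    redis:
      install: false                    # dedicated Redis server instead
    global:
      psql:
        host: db1.example.wmnet         # placeholder
      redis:
        host: redis1.example.wmnet      # placeholder
      minio:
        enabled: false                  # no bundled MinIO; Swift provides object storage
      appConfig:
        object_store:
          enabled: true
          connection:
            secret: gitlab-object-storage   # Kubernetes secret holding Swift credentials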
Running GitLab under Kubernetes is aligned with the current direction in SRE, whereas running in the other modes would require either reverse engineering the omnibus setup for puppetization or accepting the recommended manual omnibus installation with its monthly upgrade procedure.
Kubernetes would also provide a good platform for adjusting performance. Applying GitLab's sizing guidelines to our use case yields rather large requirements, which does not match the experience of other open-source organizations (Debian, KDE) that run smaller installations. Our base spec of 12 git operations per second corresponds to a 6,000-user system in GitLab's terms (GitLab assumes 20 API rps, 2 Web rps and 2 Git rps per 1,000 users, and publishes sizing guidelines for 1,000, 5,000 and 10,000-user systems). The discrepancy is worth following up with GitLab directly.
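The arithmetic behind the 6,000-user figure, as a sketch using the per-1,000-user ratios cited above:

    # GitLab sizes reference architectures by user count, assuming per 1,000 users:
    # 20 API rps, 2 Web rps, 2 Git rps. Working backwards from our measured git load:
    git_rps_per_1000_users = 2
    measured_git_rps = 12                 # our base spec: 12 git operations per second
    implied_users = measured_git_rps / git_rps_per_1000_users * 1000
    print(implied_users)                  # 6000.0 - between the 5,000 and 10,000 tiers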
The Kubernetes cluster needs to be reasonably modern and needs an ingress controller installed. The configuration currently under development (Q2/Q3) in Service Ops should work. The cluster can be dedicated to GitLab and its runners. Under the current system the “runners” occupy about 30 small VMs at Wikimedia Cloud, and usage is fairly high (450,000 minutes/month). Under GitLab, these VMs would become containers on Kubernetes.
One SRE is required to set up the Kubernetes cluster and configure high availability for the PostgreSQL database. The SRE will work on installing, configuring, documenting and testing the GitLab infrastructure, and GitLab itself where it is involved, for example in backup and recovery:
- Kubernetes persistent storage
- PostgreSQL configuration and pgbouncer connection pool setup
- PostgreSQL high availability, testing failover to standby databases and creation of new standbys
  - Includes GitLab impact testing and instructions for recovery
- PostgreSQL backup and recovery testing
  - Outreach to Data Persistence for best practices and integration possibilities
- Redis failure and recovery
  - Includes GitLab impact testing and instructions for recovery
- OpenStack Swift setup
  - Outreach to Data Persistence for best practices
- Backup and recovery
  - Consistent backup and recovery
    - Are PostgreSQL and Swift in sync? (see the sketch after this list)
- Monitoring through Prometheus and alerting conditions
  - Research on typical GitLab/PostgreSQL/Redis trouble indicators
- Logging to WMF Logstash
  - Research on typical GitLab/PostgreSQL/Redis trouble indicators
- CI/CD monitoring and logging trouble indicators
- Kubernetes replica numbers, CPU and memory usage
- GitLab software upgrade with database schema change/migration
- GitLab software upgrade
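For the "are PostgreSQL and Swift in sync" item above, a consistency check could be scripted along these lines (a hypothetical sketch: the table, column and container names are assumptions for illustration, credentials are elided, and it uses psycopg2 and python-swiftclient):

    # Sketch: verify that every upload recorded in PostgreSQL exists in Swift.
    import psycopg2
    from swiftclient.client import Connection

    pg = psycopg2.connect("dbname=gitlabhq_production")      # credentials elided
    swift = Connection(authurl="https://swift.example.wmnet/auth/v1.0",
                       user="gitlab", key="...")             # placeholders

    _, objects = swift.get_container("gitlab-uploads")       # assumed container name
    in_swift = {obj["name"] for obj in objects}

    with pg.cursor() as cur:
        cur.execute("SELECT path FROM uploads")              # assumed schema
        missing = [row[0] for row in cur if row[0] not in in_swift]

    print(f"{len(missing)} uploads recorded in PostgreSQL but missing from Swift")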
The installation will use WMF's Puppet for a reproducible setup. An experienced WMF SRE is needed to work on this project, puppetize in line with our norms, and talk to the various SRE groups (ServiceOps, Data Persistence, Observability).
An outside consultant can work on many of the above steps, but would have to spend time learning WMF-style technology (the separation between our environments and Puppet, for example) and would still need a WMF SRE for the final production setup.
A consultant working without a WMF SRE can quickly build a test installation that provides the above items following industry best practices, minimizing the need for WMF SRE interaction. The output would be a reproducible setup with an industry-standard Kubernetes, dedicated servers for PostgreSQL HA and Redis, and a running GitLab ready to be used by the Release Engineering team. The setup would be fully documented, with repeatable test procedures for HA, backup and recovery, various fault scenarios (failure of web servers, runners, etc.) and the software upgrades involved. Necessary infrastructure: 10 VMs with 8 GB each (3 Kubernetes control plane, 3 Kubernetes workers for runners, 2 PostgreSQL, 1 Redis, 1 automation). To let the consultant work unencumbered, an environment with local root is needed: a WMCS project would work, as would Ganeti with local root, or a cloud provider.
A final integration project is then needed to puppetize the setup WMF-style, hook it into WMF monitoring and logging (Prometheus and Logstash), configure alerting, and run final tests of HA, backup and recovery using the procedures described and exercised by the consultant. The integration project would require an experienced WMF SRE plus the consultant.