User:ABorrero (WMF)/Notes/Onboarding notes
Timeline
A timeline of how the onboarding process went.
week 1
Basically following what was planned at Wikimedia_Cloud_Services_team/Onboarding_Arturo. Lots of paperwork. Lots of meetings using Google Hangouts. Lots of new stuff, technologies, names and people. This was overwhelming. Try to be patient.
Registering for at least 4 wikis and creating a profile in each of them:
- https://meta.wikimedia.org <-- Community & movement Wiki
- https://wikitech.wikimedia.org <-- Cloud team Wiki, CloudVPS frontend
- https://mediawiki.org <-- General Wiki about technology at WMF
- https://office.wikimedia.org <-- WMF intranet
Important meetings:
- WMCS weekly team meeting
- TechOPs weekly meeting
- Quarter goals meetings
- Meetings with Chase to sync and learn
- 1:1 meetings with Bryan (my manager)
- Meetings with other people for several other things (like GPG key signing)
Setting up accounts and access for other services:
- Webmail, calendar, etc <-- Google services actually
- https://phabricator.wikimedia.org <-- tasks, tickets and projects management
- https://gerrit.wikimedia.org <-- code review
- pwstore <-- internal tool for password management
- SSH keys <-- to authenticate to SSH servers
- IRC channels <-- probably better to use https://irccloud.com
- Mailing lists <-- several WMF mailing lists
Important learnings this week:
- Infra
- WMF projects, organization and structure
Got my first task assigned: https://phabricator.wikimedia.org/T179024
week 2
Follow-up with meetings and learnings.
Continue with task: https://phabricator.wikimedia.org/T179024 <-- closed
Create these wiki notes.
Created a CloudVPS project and a virtual machine inside:
ssh aborrero-test-vm1.aborrero-test.eqiad.wmflabs
week 3
- Play with puppet-compiler and puppet-standalone (testing the unattended upgrades patches)
- Cultural orientation meetings
- Unattended upgrades https://phabricator.wikimedia.org/T177920 https://phabricator.wikimedia.org/T180254
- Wiki replicas https://phabricator.wikimedia.org/T173647
week 4
TODO:
Done:
- document puppet-compiler and puppet-standalone learnings https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler
- wiki replicas https://phabricator.wikimedia.org/T173647
Docs for wiki-replicas automation:
Infra
Cloud Services has 2 main projects:
- CloudVPS (Openstack)
- Toolforge
Also, there are several other important things:
- Puppet deployment
- Networking: management networks, physical network, bastions
- Datacenters and physical deployments
- NFS servers for shared storage and data
CloudVPS
This is the main hosting infra for the Wikimedia movement, both for internal use and for volunteers and anyone who adds value to our movement. It is basically an old OpenStack deployment. Work is ongoing to move to OpenStack Liberty.
The wikitech frontend is a MediaWiki extension to perform tasks that nowadays can be done via Horizon.
There should be docs both for external users and for us (admins), for example:
workflow 1: server lists
For listing the instances of a project:
- enter labcontrol1001.wikimedia.org
- get root. source /root/novaenv.sh
- run, for example:
OS_TENANT_ID=tools openstack server list
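A minimal end-to-end sketch of the above (sudo -i is just one way to get root; adapt as needed):
ssh labcontrol1001.wikimedia.org
sudo -i
source /root/novaenv.sh
OS_TENANT_ID=tools openstack server list   # list the instances of the "tools" project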
workflow 2: quotas
About viewing and managing quotas:
root@labcontrol1001:~# source /root/novaenv.sh
root@labcontrol1001:~# openstack quota show aborrero-test
+----------------------+---------------+
| Field                | Value         |
+----------------------+---------------+
| cores                | 8             |
| fixed-ips            | 200           |
| floating_ips         | 0             |
| injected-file-size   | 10240         |
| injected-files       | 5             |
| injected-path-size   | 255           |
| instances            | 8             |
| key-pairs            | 100           |
| project              | aborrero-test |
| properties           | 128           |
| ram                  | 16384         |
| secgroup-rules       | 20            |
| secgroups            | 10            |
| server_group_members | 10            |
| server_groups        | 10            |
+----------------------+---------------+
Upstream docs: https://docs.openstack.org/nova/pike/admin/quotas.html
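For the "managing" part, quotas can also be changed from the same environment; a hedged example (the value is made up):
openstack quota set --cores 16 aborrero-test   # example: raise the core quota of the aborrero-test project
openstack quota show aborrero-test             # verify the change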
workflow 3: wiki db replicas
If a new wiki is deployed in production, we should create a replica so Cloud VPS users can work with that database instead of the production one. We replicate the database but offer just SQL views of the data, without private data.
Steps:
- DBAs setup the database and sanitize private data
- we run maintain-views and maintain-meta_p on labsdb servers
- we run wikireplica_dns
- check with the sql command that it works
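A sketch of the final check from a Toolforge bastion ("foowiki" is a placeholder for the new wiki's dbname; the exact sql wrapper invocation may differ):
sql foowiki                                            # connect to the new replica via the sql wrapper
SHOW TABLES;                                           -- inside the session: views created by maintain-views
SELECT * FROM meta_p.wiki WHERE dbname = 'foowiki';    -- metadata populated by maintain-meta_p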
More docs and examples:
- https://phabricator.wikimedia.org/T173647
- https://wikitech.wikimedia.org/wiki/Add_a_wiki#Cloud_Services
- https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replica_DNS
current deployment
All servers are in the same subnet.
- labvirtXXXX <-- servers for openstack virtualization, compute
- labnetXXXX <-- servers implementing nova-network
- labdbXXXX <-- servers hosting wiki database replicas (without private data)
- labservicesXXXX <-- DNS servers
Toolforge
System deployed inside CloudVPS (OpenStack) as the tools tenant.
It runs 2 job backends: gridengine and kubernetes.
Two tools related projects maintained in part by the Cloud Services team are quarry and paws. (Quarry is actually not hosted in Toolforge currently. It has its own project.)
Composition and naming scheme
The tools cluster is composed of:
- tools-worker* <-- kubernetes nodes
- tools-exec* <-- gridengine exec nodes
- 2 etcd clusters (1 kubernetes datastore for state, 1 flannel network overlay)
The kubernetes cluster has a flat network topology allowing each node (i.e. worker) to connect directly to every other node. This is done using flannel.
Managing exec nodes
In case some operations require it (like testing a patch or doing maintenance), tools-exec* nodes can be depooled/repooled.
- Jump to login.tools.wmflabs.org.
- Leave a message in the Server Admin Log: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL (on IRC: !log tools depool node X for whatever)
- Run exec-manage depool tools-exec*.tools.eqiad.wmflabs
- Wait for jobs to end: exec-manage status tools-exec*.tools.eqiad.wmflabs.
- Jump to the node and use it. Beware of puppet running every 30 minutes; it may overwrite your files.
- Once finished, back to login.tools.wmflabs.org and run exec-manage repool tools-exec*.tools.eqiad.wmflabs and leave another SAL message.
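The whole cycle condensed (the node name is just an example):
# on login.tools.wmflabs.org, after the !log message:
exec-manage depool tools-exec-1401.tools.eqiad.wmflabs
exec-manage status tools-exec-1401.tools.eqiad.wmflabs   # repeat until running jobs have finished
# ... jump to the node and do the maintenance ...
exec-manage repool tools-exec-1401.tools.eqiad.wmflabs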
Managing worker nodes
In case some operations require it (like testing a patch or doing maintenance), tools-worker* nodes can be cordoned/uncordoned.
- Jump to tools-k8s-master-01.tools.eqiad.wmflabs.
- Leave a message in the Server Admin Log: https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL (on IRC: !log tools cordon node X for whatever)
- Run kubectl cordon tools-worker*.tools.eqiad.wmflabs
- Review status: kubectl get nodes. Drain if necessary: kubectl drain tools-worker*.tools.eqiad.wmflabs
- Jump to the node and use it. Beware of puppet running every 30 minutes; it may overwrite your files.
- Once finished, run kubectl uncordon tools-worker*.tools.eqiad.wmflabs and leave another SAL message. Review status again.
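The whole cycle condensed (tools-worker-1001 is just an example node):
# on tools-k8s-master-01, after the !log message:
kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
kubectl get nodes                                        # the node should show SchedulingDisabled
kubectl drain tools-worker-1001.tools.eqiad.wmflabs      # only if pods must be evicted
# ... jump to the node and do the maintenance ...
kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
kubectl get nodes                                        # the node should be Ready and schedulable again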
To know which pods are scheduled in which nodes, run:
aborrero@tools-k8s-master-01:~$ sudo kubectl get pods --all-namespaces -o wide | grep tools-worker-1001
grantmetrics grantmetrics-1330309696-9rzri 1/1 Running 0 16d 192.168.168.2 tools-worker-1001.tools.eqiad.wmflabs
lziad p4wikibot-657229038-22rxw 1/1 Running 0 13d 192.168.168.5 tools-worker-1001.tools.eqiad.wmflabs
openstack-browser openstack-browser-148894442-vhs63 1/1 Running 0 6d 192.168.168.6 tools-worker-1001.tools.eqiad.wmflabs
versions versions-1535803801-j8v7s 1/1 Running 0 22d 192.168.168.4 tools-worker-1001.tools.eqiad.wmflabs
Access
- SSH bastions: login.tools.wmflabs.org
- Web interface:
Puppet
The puppet deployment is used for almost everything related to bare infrastructure.
There are several puppet repositories, the main one being operations/puppet.git.
Main documentation: https://wikitech.wikimedia.org/wiki/Puppet_coding
workflows
Description of several workflows.
generic patching workflow
- Set up SSH keys, gerrit and phabricator, LDAP groups
- Clone repository, for example:
git clone ssh://aborrero@gerrit.wikimedia.org:29418/operations/puppet.git
- Set up git-review https://www.mediawiki.org/wiki/Gerrit/git-review
- Develop patch, test it somewhere
- Push patch and await review. Update patch and push again if required.
- In gerrit, use Verified+2 and Submit buttons.
- Jump to puppetmaster1001.eqiad.wmnet and run sudo puppet-merge.
If the patch affects the tools project, then additionally:
- If required, jump to tools-clushmaster-01.eqiad.wmflabs and run clush -w @all 'sudo puppet agent --test'
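A sketch of the local side of the patch cycle, assuming git-review is already set up (branch name and commit message are made-up examples):
cd puppet
git checkout -b T179024-example-change           # topic branch, named after the task here as an example
# ... edit manifests/modules and test ...
git commit -a -m "profile::example: fix something (T179024)"
git review                                       # uploads the change to gerrit for review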
Advanced patching
There are 2 main approaches:
- Setting a puppet standalone master/agent to test patches and how they affect the final machine. https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster
- Running puppet-compile by hand to see final generated changes before deploying. https://phabricator.wikimedia.org/T97081#3681225
testing a patch
In order to test a patch, you need a real machine at hand.
In the tools project, get a tools-exec* node and depool/repool it (see specific docs in the tools section).
Other tests may require to compile the puppet catalog by hand before deploying it to agents.
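For the standalone puppetmaster approach, once the test instance points at the standalone master and the patch is cherry-picked there (see the Help:Standalone_puppetmaster docs), a run can be forced on the instance with:
sudo puppet agent --test --noop   # dry run: show what would change
sudo puppet agent --test          # apply the catalog for real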
physical servers
Physical servers are installed using Puppet as well.
We use a combination of DHCP+PXE+Debian installer preseed to get them installed automatically.
In case a server needs to be reached via ILO, there are specific docs for this: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation
deployment
Some bits about the puppet deployment. Every project has its own puppet master.
For example:
- integration project: integration-puppetmaster01.integration.eqiad.wmflabs
- tools project: tools-puppetmaster-01.tools.eqiad.wmflabs
Each puppet master knows the facts for the servers/instances in its project.
DNS
There is a git repository for DNS: operations/dns.git. The workflow is similar to the one followed for operations/puppet.git (i.e. gerrit review and so on).
Namespaces and schemes:
- *.<dc>.wmnet <-- physical private network, not directly accessible from the public internet.
- *.wmflabs.org <-- public VLANs, accessible from the public internet, proxied by nginx or similar. Things inside OpenStack: instances, projects and so on. This will eventually be renamed to wmcloud at some point in the future.
- *.<dc>.wmflabs <-- virtual network inside openstack. Private network.
- *.wikimedia.org <-- general production
Example naming:
- silver.eqiad.wmnet <-- private name
- silver.wikimedia.org <-- public accessible name
- login.tools.wmflabs.org <-- access proxy (bastion) for the Toolforge Cloud VPS project.
- vm1.aborrero-test.eqiad.wmflabs <-- private address for vm1 inside the aborrero-test Cloud VPS project in eqiad. Private address which requires SSH proxy/bastion.
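A quick way to check these names (the second one only resolves from inside the Cloud VPS network):
dig +short login.tools.wmflabs.org
dig +short vm1.aborrero-test.eqiad.wmflabs   # internal name, needs the Cloud VPS resolvers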
NFS
NFS servers are used to store shared data.
There are 2 main servers right now:
- labstore-secondary (actually, the primary)
- labstore1003
Cloud VPS and Tools both use the NFS backends.
Building blocks
These are 2-node clusters using DRBD+LVM and a floating IP (using proxy ARP). They use manual failover to avoid split-brain situations.
Each node has a quota to prevent users from overloading the servers. These quotas are tc controls (like a QoS). In the past, overloading a server made the whole NFS infra rather slow, which meant clients could not access data.
Data in NFS
Several kinds of data are usually stored in the NFS backends:
- home directories
- scratch spaces
- wiki dumps (read only)
- project specific data
Networking
Some bits about the WMF networks.
SSH bastions
We use bastion hosts as gateways to jump to backend servers. This is done by SSH proxying and requires a specific config in ~/.ssh/config.
Info: https://wikitech.wikimedia.org/wiki/Production_shell_access
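A hedged sketch of what that config can look like (user and bastion hostname are examples; ProxyJump needs OpenSSH >= 7.3; the page above has the recommended setup):
# example ~/.ssh/config fragment: jump through a bastion to reach Cloud VPS instances
Host *.eqiad.wmflabs
    User aborrero
    ProxyJump bastion.wmflabs.org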
Datacenters
Usually machines and services are spread across several datacenters.
Naming scheme is usually:
- machine1001.wikimedia.org <-- datacenter 1
- machine2001.wikimedia.org <-- datacenter 2
- machine3001.wikimedia.org <-- datacenter 3
- machine4001.wikimedia.org <-- datacenter 4
L2/L3 design bits
No perimeter firewalls; host-based firewalls on each server. Subnets per rack row.
Monitoring
The metrics stack is Grafana/Graphite/Diamond; Nagios is used for alerts.
Links:
Diamond collectors run on every machine and send metrics to the Graphite server.
Wiki replicas
Main article: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Wiki_Replicas
To know more about the databases in general and how they share data: