diff --git a/tsa/howto/ganeti.mdwn b/tsa/howto/ganeti.mdwn
index d3cd174c95b1775e78564627e43139aa94e10237..3e70a48210de0c44e82143902ced0d3136767c4e 100644
--- a/tsa/howto/ganeti.mdwn
+++ b/tsa/howto/ganeti.mdwn
@@ -34,6 +34,8 @@
 disks (with [[drbd]]). Instances are normally assigned two nodes: a
 *primary* and a *secondary*: the *primary* is where the virtual
 machine actually runs and the *secondary* acts as a hot failover.
+See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html).
+
 ## Adding a new instance
 
 This command creates a new guest, or "instance" in Ganeti's
@@ -95,6 +97,34 @@
 Also set reverse DNS for both IPv4 and IPv6 in [Hetzner's Robot](https://robot.y
 
 Then follow [[new-machine]].
 
+## Modifying an instance
+
+It's possible to change the IP, CPU, or memory allocation of an instance
+using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command:
+
+    gnt-instance modify -B vcpus=2 test1.torproject.org
+    gnt-instance modify -B memory=4g test1.torproject.org
+    gnt-instance reboot test1.torproject.org
+
+IP address changes require a full stop of the instance and manual
+changes to the `/etc/network/interfaces*` files:
+
+    gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
+    gnt-instance stop test1.torproject.org
+    gnt-instance start test1.torproject.org
+    gnt-instance console test1.torproject.org
+
+The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size
+of the underlying device:
+
+    gnt-instance grow-disk test1.torproject.org 0 16g
+    gnt-instance reboot test1.torproject.org
+
+The number `0` in this context indicates the first disk of the
+instance.
 Then the filesystem needs to be resized inside the VM:
+
+    ssh root@test1.torproject.org resize2fs /dev/sda1
+
 ## Destroying an instance
 
 This totally deletes the instance, including all mirrors and
@@ -116,6 +146,75 @@
 memory, and compare it with the node's capacity:
 
     watch -n5 -d 'gnt-instance list -o pnode,name,be/vcpus,be/memory,status,disk_template | sort; echo; gnt-node list'
 
+The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's
+enough space on secondaries to account for the failure of a
+node. Healthy output looks like this:
+
+    root@fsn-node-01:~# gnt-cluster verify
+    Submitted jobs 48030, 48031
+    Waiting for job 48030 ...
+    Fri Jan 17 20:05:42 2020 * Verifying cluster config
+    Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
+    Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
+    Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
+    Waiting for job 48031 ...
+    Fri Jan 17 20:05:42 2020 * Verifying group 'default'
+    Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
+    Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
+    Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
+    Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
+    Fri Jan 17 20:05:45 2020 * Verifying node status
+    Fri Jan 17 20:05:45 2020 * Verifying instance status
+    Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
+    Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
+    Fri Jan 17 20:05:45 2020 * Other Notes
+    Fri Jan 17 20:05:45 2020 * Hooks Results
+
+A sick node would say something like this instead:
+
+    Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
+    Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail
+
+See the [Ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example.
+
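Since problems in `gnt-cluster verify` output are prefixed with `- ERROR:`, a monitoring wrapper only needs to filter for those lines. Here is a minimal sketch of such a wrapper; the captured sample stands in for a real `gnt-cluster verify` run (which must happen on the master), and the variable names are illustrative:

```shell
#!/bin/sh
# Hypothetical check: surface only the ERROR lines from a verify run.
# The sample below stands in for: verify_output=$(gnt-cluster verify)
verify_output='Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail'

# Collect the ERROR lines (empty if the cluster is healthy).
errors=$(printf '%s\n' "$verify_output" | grep -- '- ERROR:' || true)

# Print a summary only when something is wrong, so the output stays
# quiet on healthy clusters (cron-friendly).
if [ -n "$errors" ]; then
    printf 'cluster verify reported problems:\n%s\n' "$errors"
fi
```

A wrapper like this could feed a Nagios-style check, though the exact integration is left open here.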
+## Moving instances and failover
+
+Ganeti is smart about assigning instances to nodes. There's also a
+command (`hbal`) to automatically rebalance the cluster (see
+below). If for some reason hbal doesn't do what you want or you need
+to move things around for other reasons, here are a few commands that
+might be handy.
+
+Make an instance switch to using its secondary:
+
+    gnt-instance migrate test1.torproject.org
+
+Make all instances on a node switch to their secondaries:
+
+    gnt-node migrate test1.torproject.org
+
+The `migrate` command does a "live" migration, which should avoid any
+downtime. It might be preferable to actually shut down the machine
+instead (for example if we want to reboot because of a security
+upgrade). Or we might not be able to live-migrate because the node is
+down. In this case, we do a
+[failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance):
+
+    gnt-instance failover test1.torproject.org
+
+The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given
+node altogether, in case of an emergency:
+
+    gnt-node evacuate -I . fsn-node-02.torproject.org
+
+Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to
+hard-recover from a completely crashed node:
+
+    gnt-node failover fsn-node-02.torproject.org
+
+Note that you might need the `--ignore-consistency` flag if the
+node is unresponsive.
+
 ## Rebooting
 
 Those hosts need special care, as we can accomplish zero-downtime
@@ -175,6 +274,44 @@
 cluster. Here's an example run on a small cluster:
 
     web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
     web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
 
+In the above example, you should notice that the `web-fsn` instances both
+ended up on the same node.
That's because the balancer did not know
+that they should be distributed. A special configuration was done,
+below, to avoid that problem in the future. But as a workaround,
+instances can also be moved by hand and the cluster re-balanced.
+
+## Redundant instance distribution
+
+Some instances are redundant across the cluster and should *not* end up
+on the same node. Good examples are the `web-fsn-01` and `web-fsn-02`
+instances which, in theory, would serve similar traffic. If they end
+up on the same node, it might flood the network on that machine or at
+least defeat the purpose of having redundant machines.
+
+The way to ensure they get distributed properly by the balancing
+algorithm is to "tag" them. For the web nodes, for example, this was
+performed on the master:
+
+    gnt-instance add-tags web-fsn-01.torproject.org web-fsn
+    gnt-instance add-tags web-fsn-02.torproject.org web-fsn
+    gnt-cluster add-tags htools:iextags:web-fsn
+
+This tells Ganeti that `web-fsn` is an "exclusion tag" and the
+optimizer will not try to schedule instances with those tags on the
+same node.
+
+To see which tags are present, use:
+
+    # gnt-cluster list-tags
+    htools:iextags:web-fsn
+
+You can also find which objects are assigned a tag with:
+
+    # gnt-cluster search-tags web-fsn
+    /cluster htools:iextags:web-fsn
+    /instances/web-fsn-01.torproject.org web-fsn
+    /instances/web-fsn-02.torproject.org web-fsn
+
 ## Adding and removing addresses on instances
 
 Say you created an instance but forgot to assign an extra
@@ -182,10 +319,66 @@
 IP. You can still do so with:
 
     gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
 
+## Importing foreign VMs
+
+We do not have documentation on how to do those imports yet, but
+Riseup has [a section in their documentation](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) about this that we
+might want to take a look at.
The Ganeti manual also has a (very
+short) section on [importing foreign instances](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances).
+
 ## Pager playbook
 
+### I/O overload
+
+In case of excessive I/O, it might be worth looking into which machine
+is the cause. The [[drbd]] page explains how to map a DRBD device to a
+VM. You can also find which logical volume is backing an instance (and
+vice versa) with this command:
+
+    lvs -o+tags
+
+This will list all logical volumes and their associated tags. If you
+already know which logical volume you're looking for, you can address
+it directly:
+
+    root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
+    LV Tags
+    originstname+bacula-director-01.torproject.org
+
+### Node failures
+
+Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
+one machine disappears, the cluster should be able to recover by
+failing instances over to other nodes. This is currently done
+manually; see the migrate section above.
+
+This could eventually be automated if such situations occur more
+often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
+Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
+manual.
+
+### Other troubleshooting
+
+Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
+master failover, which we haven't tested. There's also upstream
+documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
+master failover scenario.
+
+The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
+problems.
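The `originstname+` tags shown in the I/O overload section above follow a fixed pattern, so the LV-to-instance mapping can be scripted. A minimal sketch (the sample line and the `instance_map` variable name are illustrative; in real use the input would come from `lvs --noheadings -o lv_name,lv_tags`):

```shell
#!/bin/sh
# Hypothetical helper: map each Ganeti-backed logical volume to its
# instance by stripping the "originstname+" prefix from the LV tag.
# The sample stands in for: lvs --noheadings -o lv_name,lv_tags
lvs_output='4091b668-1177-41ac-9310-1eac45b46620.disk2_data originstname+bacula-director-01.torproject.org'

# Field 1 is the LV name, field 2 the tag; drop the prefix and print
# an "instance -> logical volume" line per LV.
instance_map=$(printf '%s\n' "$lvs_output" | awk '{
    sub(/^originstname\+/, "", $2)
    print $2, "->", $1
}')
printf '%s\n' "$instance_map"
```

Run across all nodes, something like this would quickly show which VM owns the logical volume that `iostat` or the [[drbd]] mapping points at.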
+
 ## Disaster recovery
 
+If things get completely out of hand and the cluster becomes too
+unreliable for service, the only solution is to rebuild another one
+elsewhere. Since Ganeti 2.2, there is a [move-instance](http://docs.ganeti.org/ganeti/2.15/html/move-instance.html) command to
+move instances between clusters that can be used for that purpose.
+
+If Ganeti is completely destroyed and its APIs don't work anymore, the
+last resort is to restore all virtual machines from
+[[backup]]. Hopefully, this should not happen except in the case of a
+catastrophic data loss bug in Ganeti or [[drbd]].
+
 # Reference
 
 ## Installation
 
@@ -195,13 +388,15 @@
 IP. You can still do so with:
 
 - To create a new box, follow [[new-machine-hetzner-robot]] but change
   the following settings:
 
-  * Server: [PX62-NVMe](https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER)
+  * Server: [PX62-NVMe][]
   * Location: `FSN1`
   * Operating system: Rescue
   * Additional drives: 2x10TB
   * Add in the comment form that the server needs to be in the same
     datacenter as the other machines (FSN1-DC13, but double-check)
 
+[PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER
+
 - Make sure all nodes have the same LVM setup and the same network
   setup. They want openvswitch. Cf. host `fsn-node-01`'s
   /etc/network/interfaces.
 - Prepare all the nodes by configuring them in puppet. They should be
   in the class `roles::ganeti::fsn` if they
@@ -279,17 +474,12 @@
 New nodes can be provisioned within a week or two, depending on budget
 and hardware availability.
 ## Design
 
-<!-- how this is built -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
 Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines
 hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's
 dedicated server hosting service. All machines use the same hardware
 to avoid problems with live migration. That is currently a customized
 build of the
-[PX62-NVMe](https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER) line.
+[PX62-NVMe][] line.
 
 ### Network layout
 
@@ -352,7 +542,7 @@ with:
 
     x86 Skylake-Server-IBRS Intel Xeon Processor (Skylake, IBRS)
     [...]
 
-The current PX62 line is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
+The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
 micro-architecture. The closest matching family would be
 `Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
 Note that newer QEMU releases (4.2, currently in unstable) have more
@@ -419,25 +609,58 @@
 There is no issue tracker specifically for this project, [File][] or
 
 ## Overview
 
-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
+The project of creating a Ganeti cluster for Tor appeared in the
+summer of 2019. The machines were delivered by Hetzner in July 2019
+and set up by weasel by the end of the month.
 
 ## Goals
 
-<!-- include bugs to be fixed -->
+
+The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
+textile, unifolium, macrum, kvm4 and kvm5).
 ### Must have
 
+ * arbitrary virtual machine provisioning
+ * redundant setup
+ * automated VM installation
+ * replacement of existing infrastructure
+
 ### Nice to have
 
+ * fully configured in Puppet
+ * full high availability with automatic failover
+ * extra capacity for new projects
+
 ### Non-Goals
 
+ * Docker or "container" provisioning - we consider this out of scope
+   for now
+ * self-provisioning by end-users: TPA remains in control of
+   provisioning
+
 ## Approvals required
 
-<!-- for example, legal, "vegas", accounting, current maintainer -->
+
+A budget was proposed by weasel in May 2019 and approved by Vegas in
+June. An extension to the budget was approved in January 2020 by
+Vegas.
 
 ## Proposed Solution
 
+Set up a Ganeti cluster of two machines with a Hetzner vSwitch backend.
+
 ## Cost
 
+The design based on the [PX62 line][PX62-NVMe] has the following monthly cost
+structure:
+
+ * per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
+ * IPv4 space: 35.29EUR (/27)
+ * IPv6 space: 8.40EUR (/64)
+ * bandwidth cost: 1EUR/TB (currently 38EUR)
+
+At three servers, that adds up to around 435EUR/mth. Up-to-date costs
+are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.
+
 ## Alternatives considered
 
 <!-- include benchmarks and procedure if relevant -->