[Ganeti](http://ganeti.org/) is software designed to facilitate the management of virtual machines (KVM or Xen). It helps you move virtual machine instances from one node to another, create an instance with DRBD replication on another node and do the live migration from one to another, etc. [[_TOC_]] # Tutorial ## Listing virtual machines (instances) This will show the running guests, known as "instances": gnt-instance list ## Accessing serial console Our instances provide a serial console, starting in GRUB. To access it, run gnt-instance console test01.torproject.org To exit, use `^]` -- that is, Control-<Closing Bracket>. # How-to ## Glossary In Ganeti, we use the following terms: * **node**: a physical machine * **instance**: a virtual machine * **master**: the *node* on which we issue Ganeti commands and that supervises all the other nodes Nodes are interconnected through a private network that is used to communicate commands and synchronise disks (with [howto/drbd](howto/drbd)). Instances are normally assigned two nodes: a *primary* and a *secondary*: the *primary* is where the virtual machine actually runs and the *secondary* acts as a hot failover. See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html). ## Adding a new instance This command creates a new guest, or "instance" in Ganeti's vocabulary, with 10G root, 2G swap, 20G spare on SSD, 800G on HDD, 8GB RAM and 2 CPU cores: gnt-instance add \ -o debootstrap+bullseye \ -t drbd --no-wait-for-sync \ --net 0:ip=pool,network=gnt-fsn13-02 \ --no-ip-check \ --no-name-check \ --disk 0:size=10G \ --disk 1:size=2G,name=swap \ --disk 2:size=20G \ --disk 3:size=800G,vg=vg_ganeti_hdd \ --backend-parameters memory=8g,vcpus=2 \ test-01.torproject.org ### What that does This configures the following: * redundant disks in a DRBD mirror; use `-t plain` instead of `-t drbd` for tests, as that avoids syncing disks and will speed things up considerably (even with `--no-wait-for-sync` there are some operations that block on synced mirrors). Only one node should then be provided as the argument for `--node`. * three partitions: one on the default VG (SSD), one on another (HDD) and a swap file on the default VG; if you don't specify a swap device, a 512MB swapfile is created in `/swapfile`. TODO: configure disk 2 and 3 automatically in installer. (`/var` and `/srv`?) * 8GB of RAM with 2 virtual CPUs * an IP allocated from the public gnt-fsn pool: `gnt-instance add` will print the IPv4 address it picked to stdout. The IPv6 address can be found in `/var/log/ganeti/os/` on the primary node of the instance, see below. * with the `test-01.torproject.org` hostname ### Next steps To find the root password, SSH host key fingerprints, and the IPv6 address, run this **on the node where the instance was created**, for example: egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname) We copy root's authorized keys into the new instance, so you should be able to log in with your token. You will be required to change the root password immediately. Pick something nice and document it in `tor-passwords`. Also set reverse DNS for both IPv4 and IPv6 in [Hetzner's Robot](https://robot.your-server.de/) (check under servers -> vSwitch -> IPs) or in our own reverse zone files (if delegated). Then follow [howto/new-machine](howto/new-machine).
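Before moving on, it can be worth a quick check that the new instance is actually up and reachable on its console; a minimal sketch, reusing the commands from the tutorial above and the example `test-01` name:

```
# confirm the instance exists and is running
gnt-instance list test-01.torproject.org

# attach to the serial console to watch it boot; exit with ^]
gnt-instance console test-01.torproject.org
```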
### Known issues * **allocator failures**: Note that you may need to use the `--node` parameter to pick which nodes you want the instance to end up on, otherwise Ganeti will choose for you (and may fail). Use, for example, `--node fsn-node-01:fsn-node-02` to use `node-01` as primary and `node-02` as secondary. The allocator can sometimes fail if it is upset about something in the cluster, for example: Can't find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2 This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). If this problem occurs, it might be worth [rebalancing the cluster](#rebalancing-a-cluster). * **ping failure**: there is a bug in `ganeti-instance-debootstrap` which misconfigures `ping` (among other things), see [bug 31781](https://bugs.torproject.org/31781). It's currently patched in our version of the Debian package, but that patch might disappear if Debian upgrades the package without [shipping our patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538). Note that this was fixed in Debian bullseye and later. ### Other examples This is a typical server creation in the `gnt-chi` cluster: gnt-instance add \ -o debootstrap+bullseye \ -t drbd --no-wait-for-sync \ --net 0:ip=pool,network=gnt-chi-01 \ --no-ip-check \ --no-name-check \ --disk 0:size=10G \ --disk 1:size=2G,name=swap \ --disk 2:size=20G \ --backend-parameters memory=8g,vcpus=2 \ test-01.torproject.org A simple test machine, with only 1GB of RAM and 1 CPU, without DRBD, in the FSN cluster: gnt-instance add \ -o debootstrap+bullseye \ -t plain --no-wait-for-sync \ --net 0:ip=pool,network=gnt-fsn13-02 \ --no-ip-check \ --no-name-check \ --disk 0:size=10G \ --disk 1:size=2G,name=swap \ --backend-parameters memory=1g,vcpus=1 \ test-01.torproject.org Do not forget to follow the [next steps](#next-steps), above. ### iSCSI integration To create a VM with iSCSI backing, a disk must first be created on the SAN, then adopted in a VM, which needs to be *reinstalled* on top of that. This is typically how large disks are provisioned in the `gnt-chi` cluster, in the [Cymru POP](howto/new-machine-cymru). The following instructions assume you are on a node with an [iSCSI initiator properly setup](howto/new-machine-cymru#iscsi-initiator-setup), and the [SAN cluster management tools setup](howto/new-machine-cymru#san-management-tools-setup). It also assumes you are familiar with the `SMcli` tool; see the [storage servers documentation](howto/new-machine-cymru#storage-servers) for an introduction on that. 1. create a dedicated disk group and virtual disk on the SAN, assign it to the host group and propagate the multipath config across the cluster nodes: /usr/local/sbin/tpo-create-san-disks --san chi-node-03 --name test-01 --capacity 500 2. confirm that multipath works; it should look something like this: root@chi-node-01:~# multipath -ll test-01 (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw |-+- policy='round-robin 0' prio=6 status=active | |- 11:0:0:4 sdi 8:128 active ready running | |- 12:0:0:4 sdj 8:144 active ready running | `- 9:0:0:4 sdh 8:112 active ready running `-+- policy='round-robin 0' prio=1 status=enabled |- 10:0:0:4 sdk 8:160 active ghost running |- 7:0:0:4 sdl 8:176 active ghost running `- 8:0:0:4 sdm 8:192 active ghost running root@chi-node-01:~# 3.
adopt the disk in Ganeti: gnt-instance add \ -n chi-node-01.torproject.org \ -o debootstrap+bullseye \ -t blockdev --no-wait-for-sync \ --net 0:ip=pool,network=gnt-chi-01 \ --no-ip-check \ --no-name-check \ --disk 0:adopt=/dev/disk/by-id/dm-name-test-01 \ --backend-parameters memory=8g,vcpus=2 \ test-01.torproject.org NOTE: the actual node must be manually picked because the `hail` allocator doesn't seem to know about block devices. NOTE: mixing DRBD and iSCSI volumes on a single instance is not supported. 4. at this point, the VM probably doesn't boot, because for some reason the `gnt-instance-debootstrap` doesn't fire when disks are adopted. so you need to reinstall the machine, which involves stopping it first: gnt-instance shutdown --timeout=0 test-01 gnt-instance reinstall test-01 HACK one: the current installer fails on weird partionning errors, see [upstream bug 13](https://github.com/ganeti/instance-debootstrap/issues/13). We applied [this patch](https://github.com/ganeti/instance-debootstrap/commit/e0df6b1fd25dc3e111851ae42872df0a757ac4a9) as a workaround to avoid failures when the installer attempts to partition the virtual disk. From here on, follow the [next steps](#next-steps) above. TODO: This would ideally be automated by an external storage provider, see the [storage reference for more information](#storage). ### Troubleshooting If a Ganeti instance install fails, it will show the end of the install log, for example: ``` Thu Aug 26 14:11:09 2021 - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org Thu Aug 26 14:11:09 2021 - INFO: NIC/0 inherits netparams ['br0', 'bridged', ''] Thu Aug 26 14:11:09 2021 - INFO: Chose IP 38.229.82.29 from network gnt-chi-01 Thu Aug 26 14:11:10 2021 * creating instance disks... Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config Thu Aug 26 14:12:58 2021 adding disks to cluster config Thu Aug 26 14:13:00 2021 * checking mirrors status Thu Aug 26 14:13:01 2021 - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated) Thu Aug 26 14:13:01 2021 - INFO: - device disk/2: 0.60% done, 55m 26s remaining (estimated) Thu Aug 26 14:13:01 2021 * checking mirrors status Thu Aug 26 14:13:02 2021 - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated) Thu Aug 26 14:13:02 2021 - INFO: - device disk/2: 0.60% done, 52m 13s remaining (estimated) Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS Thu Aug 26 14:13:03 2021 * running the instance OS create scripts... Thu Aug 26 14:16:31 2021 * resuming disk sync Failure: command execution error: Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file: Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ... Setting up openssh-server (1:7.9p1-10+deb10u2) ... Creating SSH2 RSA key; this may take some time ... 2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA) Creating SSH2 ED25519 key; this may take some time ... 256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519) Created symlink /etc/systemd/system/sshd.service \xe2\x86\x92 /lib/systemd/system/ssh.service. Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service \xe2\x86\x92 /lib/systemd/system/ssh.service. invoke-rc.d: could not determine current runlevel Setting up ssh (1:7.9p1-10+deb10u2) ... 
Processing triggers for systemd (241-7~deb10u8) ... Processing triggers for libc-bin (2.28-10) ... Errors were encountered while processing: linux-image-4.19.0-17-amd64 E: Sub-process /usr/bin/dpkg returned an error code (1) run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100 Using disk /dev/drbd4 as swap... Setting up swapspace version 1, size = 2 GiB (2147479552 bytes) no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4 Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: ) root@chi-node-01:~# ``` Here, the failure that tripped the install is: ``` Errors were encountered while processing: linux-image-4.19.0-17-amd64 E: Sub-process /usr/bin/dpkg returned an error code (1) ``` But the actual error is higher up, and we need to go look at the logs on the server for this. In this case, in `chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log`, we can find the real problem: ``` Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ... /etc/kernel/postinst.d/initramfs-tools: update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64 W: Couldn't identify type of root file system for fsck hook /etc/kernel/postinst.d/zz-update-grub: /usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?). run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1 dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure): installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1 ``` In this case, oddly enough, even though Ganeti thought the install had failed, the machine can actually start: ``` gnt-instance start tb-pkgstage-01.torproject.org ``` ... and after a while, we can even get a console: ``` gnt-instance console tb-pkgstage-01.torproject.org ``` And in *that* case, the procedure can just continue from here on: reset the root password, and just make sure you finish the install: ``` apt install linux-image-amd64 ``` In the above case, the `sources-list` post-install hook was buggy: it wasn't mounting `/dev` and friends before launching the upgrades, which was causing issues when a kernel upgrade was queued. And *if* you are debugging an installer and by mistake end up with half-open filesystems and stray DRBD devices, do take a look at the [LVM](howto/lvm) and [DRBD documentation](howto/drbd). ## Modifying an instance ### CPU, memory changes It's possible to change the IP, CPU, or memory allocation of an instance using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command: gnt-instance modify -B vcpus=4 test1.torproject.org gnt-instance modify -B memory=8g test1.torproject.org gnt-instance reboot test1.torproject.org ### IP address change IP address changes require a full stop of the instance and manual changes to the `/etc/network/interfaces*` files: gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org gnt-instance stop test1.torproject.org gnt-instance start test1.torproject.org gnt-instance console test1.torproject.org ### Resizing disks The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size of the underlying device: gnt-instance grow-disk --absolute test1.torproject.org 0 16g gnt-instance reboot test1.torproject.org The number `0` in this context indicates the first disk of the instance.
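To double-check which index corresponds to which disk before growing anything, the disk list in `gnt-instance info` can help; a rough sketch (the exact output format varies between Ganeti versions):

```
# show the disks attached to the instance, with their index, size and name
gnt-instance info test1.torproject.org | grep -A 2 'disk/'
```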
The amount specified is the final disk size (because of the `--absolute` flag). In the above example, the final disk size will be 16GB. To *add* space to the existing disk, remove the `--absolute` flag: gnt-instance grow-disk test1.torproject.org 0 16g gnt-instance reboot test1.torproject.org In the above example, 16GB will be **ADDED** to the disk. Be careful with resizes, because it's not possible to revert such a change: `grow-disk` does not support shrinking disks. The only way to revert the change is by exporting / importing the instance. Note that the reboot above will impose a downtime. See [upstream bug 28](https://github.com/ganeti/ganeti/issues/28) about improving that. Then the filesystem needs to be resized inside the VM: ssh root@test1.torproject.org #### Resizing under LVM Use `pvs` to display information about the physical volumes: root@cupani:~# pvs PV VG Fmt Attr PSize PFree /dev/sdc vg_test lvm2 a-- <8.00g 1020.00m Resize the physical volume to take up the new space: pvresize /dev/sdc Use `lvs` to display information about logical volumes: # lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert var-opt vg_test-01 -wi-ao---- <10.00g test-backup vg_test-01_hdd -wi-ao---- <20.00g Use `lvextend` to add space to the volume: lvextend -l '+100%FREE' vg_test-01/var-opt Finally, resize the filesystem: resize2fs /dev/vg_test-01/var-opt See also the [LVM howto](howto/lvm). #### Resizing without LVM, no partitions If there's no LVM inside the VM (a more common configuration nowadays), the above procedure will obviously not work. If this is a secondary disk (e.g. `/dev/sdc`), there is a good chance the filesystem was created directly on the device and that you do not need to repartition the drive. This is an example of a good configuration if we want to resize `sdc`: ``` root@bacula-director-01:~# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 10G 0 disk └─sda1 8:1 0 10G 0 part / sdb 8:16 0 2G 0 disk [SWAP] sdc 8:32 0 250G 0 disk /srv ``` Note that if we needed to resize `sda`, we'd have to follow the other procedure, in the next section. If we check the free disk space on the device, we will notice it has not changed yet: ``` # df -h /srv Filesystem Size Used Avail Use% Mounted on /dev/sdc 196G 160G 27G 86% /srv ``` The resize is then simply: ``` # resize2fs /dev/sdc resize2fs 1.44.5 (15-Dec-2018) Filesystem at /dev/sdc is mounted on /srv; on-line resizing required old_desc_blocks = 25, new_desc_blocks = 32 The filesystem on /dev/sdc is now 65536000 (4k) blocks long. ``` Read on for the most complicated scenario. #### Resizing without LVM, with partitions If the filesystem to resize is not *directly* on the device, you will need to resize the partition manually, which can be done using `sfdisk`. In the following example we have a `sda1` partition that we want to extend from 20G to 40G to fill up the free space on `/dev/sda`. Here is what the partition layout looks like before the resize: ``` # lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 40G 0 disk └─sda1 8:1 0 20G 0 part / sdb 8:16 0 4G 0 disk [SWAP] ``` We use `sfdisk` to resize the partition to take up all available space, in this case, with the magic: echo ", +" | sfdisk -N 1 --no-act /dev/sda Note the `--no-act` here, which you'll need to remove to actually make the change; the above is just a preview to make sure you will do the right thing.
Here's a working example: ``` # echo ", +" | sfdisk -N 1 --no-reread /dev/sda Disk /dev/sda: 40 GiB, 42949672960 bytes, 83886080 sectors Disk model: QEMU HARDDISK Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x00000000 Old situation: Device Boot Start End Sectors Size Id Type /dev/sda1 * 2048 41943039 41940992 20G 83 Linux /dev/sda1: New situation: Disklabel type: dos Disk identifier: 0x00000000 Device Boot Start End Sectors Size Id Type /dev/sda1 * 2048 83886079 83884032 40G 83 Linux The partition table has been altered. Calling ioctl() to re-read partition table. Re-reading the partition table failed.: Device or resource busy The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8). Syncing disks. ``` Note that the partition table wasn't updated: ``` # lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 40G 0 disk └─sda1 8:1 0 20G 0 part / sdb 8:16 0 4G 0 disk [SWAP] ``` So we need to reboot: ``` reboot ``` Note: a previous version of this guide was using `fdisk` instead, but that guide was destroying and recreating the partition, which seemed too error-prone. The above procedure is more annoying (because of the reboot below) but should be less dangerous. TODO: next time, test with `--force` instead of `--no-reread` to see if we still need a reboot. Now we check the partitions again: ``` # lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 40G 0 disk └─sda1 8:1 0 40G 0 part / sdb 8:16 0 4G 0 disk [SWAP] ``` If we check the free space on the device, we will notice it has not changed yet: ``` # df -h / Filesystem Size Used Avail Use% Mounted on /dev/sda1 20G 16G 2.8G 86% / ``` We need to resize it: ``` # resize2fs /dev/sda1 resize2fs 1.44.5 (15-Dec-2018) Filesystem at /dev/sda1 is mounted on /; on-line resizing required old_desc_blocks = 2, new_desc_blocks = 3 The filesystem on /dev/sda1 is now 10485504 (4k) blocks long. ``` The resize is now complete. #### Resizing an iSCSI LUN All the above procedures detail the normal use case where disks are hosted as "plain" files or with the DRBD backend. However, some instances (most notably in the gnt-chi cluster) have their storage backed by an iSCSI SAN. Growing a disk hosted on a SAN like the Dell PowerVault MD3200i involves several steps beginning with resizing the LUN itself. In the example below, we're going to grow the disk associated with the `tb-build-03` instance. > It should be noted that the instance was setup in a peculiar way: it > has one LUN per partition, instead of one big LUN partitioned > correctly. The instructions below therefore mention a LUN named > `tb-build-03-srv`, but normally there should be a single LUN named > after the hostname of the machine, in this case it should have been > named simply `tb-build-03`. 
First, we identify how much space is available on the virtual disks' diskGroup: # SMcli -n chi-san-01 -c "show allVirtualDisks summary;" STANDARD VIRTUAL DISKS SUMMARY Number of standard virtual disks: 5 Name Thin Provisioned Status Capacity Accessible by Source tb-build-03-srv No Optimal 700.000 GB Host Group gnt-chi Disk Group 5 This shows that `tb-build-03-srv` is hosted on Disk Group "5": # SMcli -n chi-san-01 -c "show diskGroup [5];" DETAILS Name: 5 Status: Optimal Capacity: 1,852.026 GB Current owner: RAID Controller Module in slot 1 Data Service (DS) Attributes RAID level: 5 Physical Disk media type: Physical Disk Physical Disk interface type: Serial Attached SCSI (SAS) Enclosure loss protection: No Secure Capable: No Secure: No Total Virtual Disks: 1 Standard virtual disks: 1 Repository virtual disks: 0 Free Capacity: 1,152.026 GB Associated physical disks - present (in piece order) Total physical disks present: 3 Enclosure Slot 0 6 1 11 0 7 `Free Capacity` indicates about 1.1 TB of free space available. So we can go ahead with the actual resize: # SMcli -n chi-san-01 -p $PASSWORD -c "set virtualdisk [\"tb-build-03-srv\"] addCapacity=100GB;" Next, we need to make all nodes in the cluster rescan the iSCSI LUNs and have `multipathd` resize the device node. This is accomplished by running this command on the primary node (e.g. `chi-node-01`): # gnt-cluster command "iscsiadm -m node --rescan; multipathd -v3 -k\"resize map tb-build-srv\"" The success of this step can be validated by looking at the output of `lsblk`: the device nodes associated with the LUN should now display the new size. The output should be identical across the cluster nodes. In order for Ganeti/QEMU to make this extra space available to the instance, a reboot must be performed from outside the instance. Then the normal resize procedure can happen inside the virtual machine; see [resizing under LVM](#resizing-under-lvm), [resizing without LVM, no partitions](#resizing-without-lvm-no-partitions), or [Resizing without LVM, with partitions](#resizing-without-lvm-with-partitions), depending on the situation. ### Removing an iSCSI LUN Use this procedure to remove a virtual disk from one of the iSCSI SANs. First, we'll need to gather some information about the disk to remove: * Which SAN is hosting the disk * What LUN is assigned to the disk * The WWID of both the SAN and the virtual disk /usr/local/sbin/tpo-show-san-disks SMcli -n chi-san-03 -S -quick -c "show storageArray summary;" | grep "Storage array world-wide identifier" cat /etc/multipath/conf.d/test-01.conf Second, remove the multipath config and reload: gnt-cluster command rm /etc/multipath/conf.d/test-01.conf gnt-cluster command "multipath -r ; multipath -w {disk-wwid} ; multipath -r" Then, remove the iSCSI device nodes. Running `iscsiadm --rescan` does not remove LUNs which have been deleted from the SAN. Be very careful with this command: it will delete device nodes without prejudice and cause data corruption if they are still in use! gnt-cluster command "find /dev/disk/by-path/ -name \*{san-wwid}-lun-{lun} -exec readlink {} \; | cut -d/ -f3 | while read -d $'\n' n; do echo 1 > /sys/block/\$n/device/delete; done" Finally, the disk group can be deleted from the SAN (all the virtual disks it contains will be deleted): SMcli -n chi-san-03 -p $SAN_PASSWORD -S -quick -c "delete diskGroup [<disk-group-number>];" ### Adding disks A disk can be added to an instance with the `modify` command as well.
This, for example, will add a 100GB disk to the `test1` instance on the `vg_ganeti_hdd` volume group, which is backed by "slow" rotating disks: gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org gnt-instance reboot test1.torproject.org ### Changing disk type Say you have a test instance that was created with a `plain` disk template, but you actually want it in production with a `drbd` disk template. Switching to `drbd` is easy: gnt-instance shutdown test-01 gnt-instance modify -t drbd test-01 gnt-instance start test-01 The second command will use the allocator to find a secondary node. If that fails, you can assign a node manually with `-n`. You can also switch back to `plain`, although you should generally never do that. See also the [upstream procedure](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#conversion-of-an-instance-s-disk-type) and [design document](https://docs.ganeti.org/docs/ganeti/3.0/html/design-disk-conversion.html). ### Detaching a disk If you need to remove a volume from an instance without destroying data, it's possible to detach it. First, you must identify the disk's UUID using `gnt-instance info`, then: gnt-instance modify --disk <uuid>:detach test-01 ### Adding a network interface on the rfc1918 vlan We have a VLAN that VMs without public addresses sit on. Its VLAN ID is 4002 and it's backed by Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic". Note that traffic on this VLAN will travel in the clear between nodes. To add an instance to this VLAN, give it a second network interface using: gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org ## Destroying an instance This totally deletes the instance, including all mirrors and everything; be very careful with it: gnt-instance remove test01.torproject.org ## Getting information Information about an instance can be found in the rather verbose `gnt-instance info`: root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org - Instance name: tb-build-02.torproject.org UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b Serial number: 5 Creation time: 2020-12-15 14:06:41 Modification time: 2020-12-15 14:07:31 State: configured to be up, actual state is up Nodes: - primary: fsn-node-03.torproject.org group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e) - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e) Operating system: debootstrap+buster A quicker command shows the primary and secondary nodes for a given instance: gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes An equivalent command will show the primary and secondary for *all* instances, on top of extra information (like the CPU count, memory and disk usage): gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort It can be useful to run this in a loop to see changes: watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort' ## Disk operations (DRBD) Instances should be set up using the DRBD backend, in which case you should probably take a look at [howto/drbd](howto/drbd) if you have problems with that. Ganeti handles most of the logic there so that should generally not be necessary.
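If you do need to poke at DRBD directly, a couple of read-only checks can help map DRBD minors to instances before diving into the [howto/drbd](howto/drbd) page; a sketch, assuming the DRBD 8.x module shipped with Debian:

```
# on a node: connection and sync state of all DRBD minors
cat /proc/drbd

# map the DRBD minors in use on a node back to their instances
gnt-node list-drbd fsn-node-01.torproject.org
```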
## Evaluating cluster capacity This will list instances repeatedly, but also show their assigned memory, and compare it with the node's capacity: gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort && echo && gnt-node list The latter does not show disk usage for secondary volume groups (see [upstream issue 1379](https://github.com/ganeti/ganeti/issues/1379)), for a complete picture of disk usage, use: gnt-node list-storage The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's enough space on secondaries to account for the failure of a node. Healthy output looks like this: root@fsn-node-01:~# gnt-cluster verify Submitted jobs 48030, 48031 Waiting for job 48030 ... Fri Jan 17 20:05:42 2020 * Verifying cluster config Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group Waiting for job 48031 ... Fri Jan 17 20:05:42 2020 * Verifying group 'default' Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes) Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes) Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes) Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency Fri Jan 17 20:05:45 2020 * Verifying node status Fri Jan 17 20:05:45 2020 * Verifying instance status Fri Jan 17 20:05:45 2020 * Verifying orphan volumes Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy Fri Jan 17 20:05:45 2020 * Other Notes Fri Jan 17 20:05:45 2020 * Hooks Results A sick node would have said something like this instead: Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail See the [ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example Also note the `hspace -L` command, which can tell you how many instances can be created in a given cluster. It uses the "standard" instance template defined in the cluster (which we haven't configured yet). ## Moving instances and failover Ganeti is smart about assigning instances to nodes. There's also a command (`hbal`) to automatically rebalance the cluster (see below). If for some reason `hbal` doesn’t do what you want or you need to move things around for other reasons, here are a few commands that might be handy. Make an instance switch to using it's secondary: gnt-instance migrate test1.torproject.org Make all instances on a node switch to their secondaries: gnt-node migrate test1.torproject.org The `migrate` commands does a "live" migrate which should avoid any downtime during the migration. It might be preferable to actually shutdown the machine for some reason (for example if we actually want to reboot because of a security upgrade). Or we might not be able to live-migrate because the node is down. In this case, we do a [failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance) gnt-instance failover test1.torproject.org The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given node altogether, in case of an emergency: gnt-node evacuate -I . 
fsn-node-02.torproject.org Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to hard-recover from a completely crashed node: gnt-node failover fsn-node-02.torproject.org Note that you might need the `--ignore-consistency` flag if the node is unresponsive. ## Importing external libvirt instances Assumptions: * `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`) * `SPARE_NODE`: a ganeti node with free space (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated * `MASTER_NODE`: the master ganeti node (e.g. `fsn-node-01.torproject.org`) * `KVM_HOST`: the machine which we migrate the `INSTANCE` from * the `INSTANCE` has only `root` and `swap` partitions * the `SPARE_NODE` has space in `/srv/` to host all the virtual machines to import, to check, use: fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' <f | awk '{s+=$1} END {print s}' You will very likely need to create a `/srv` big enough for this, for example: lvcreate -L 300G vg_ganeti -n srv-tmp && mkfs /dev/vg_ganeti/srv-tmp && mount /dev/vg_ganeti/srv-tmp /srv Import procedure: 1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating cluster capacity" above, when in doubt) and find on which KVM HOST the INSTANCE lives 2. copy the disks, without downtime: ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST 3. copy the disks again, this time suspending the machine: ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt 4. renumber the host: ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE 5. test services by changing your `/etc/hosts`, possibly warning service admins: > Subject: $INSTANCE IP address change planned for Ganeti migration > > I will soon migrate this virtual machine to the new ganeti cluster. this > will involve an IP address change which might affect the service. > > Please let me know if there are any problems you can think of. in > particular, do let me know if any internal (inside the server) or external > (outside the server) services hardcodes the IP address of the virtual > machine. > > A test instance has been setup. You can test the service by > adding the following to your /etc/hosts: > > 116.202.120.182 $INSTANCE > 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE 6. destroy test instance: gnt-instance remove $INSTANCE 7. lower TTLs to 5 minutes. this procedure varies a lot according to the service, but generally if all DNS entries are `CNAME`s pointing to the main machine domain name, the TTL can be lowered by adding a `dnsTTL` entry in the LDAP entry for this host. For example, this sets the TTL to 5 minutes: dnsTTL: 300 Then to make the changes immediate, you need the following commands: ssh root@alberti.torproject.org sudo -u sshdist ud-generate && ssh root@nevii.torproject.org ud-replicate Warning: if you migrate one of the hosts ud-ldap depends on, this can fail and not only the TTL will not update, but it might also fail to update the IP address in the below procedure. See [ticket 33766](https://bugs.torproject.org/33766) for details. 8. 
shutdown original instance and redo migration as in step 3 and 4: fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' && ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt && ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE 9. final test procedure TODO: establish host-level test procedure and run it here. 10. switch to DRBD, still on the Ganeti MASTER NODE: gnt-instance stop $INSTANCE && gnt-instance modify -t drbd $INSTANCE && gnt-instance failover -f $INSTANCE && gnt-instance start $INSTANCE The above can sometimes fail if the allocator is upset about something in the cluster, for example: Can's find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2 This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). To work around the allocator, you can specify a secondary node directly: gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE && gnt-instance failover -f $INSTANCE && gnt-instance start $INSTANCE TODO: move into fabric, maybe in a `libvirt-import-live` or `post-libvirt-import` job that would also do the renumbering below 11. change IP address in the following locations: * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!). Also drop the dnsTTL attribute while you're at it. * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli) * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii) * nagios (don't forget to change the parent) * reverse DNS (upstream web UI, e.g. Hetzner Robot) * grep for the host's IP address on itself: grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv * grep for the host's IP on *all* hosts: cumin-all-puppet cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc' TODO: move those jobs into fabric 12. retire old instance (only a tiny part of [howto/retire-a-host](howto/retire-a-host)): ./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST 12. update the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) to remove the machine from the KVM host 13. warn users about the migration, for example: > To: tor-project@lists.torproject.org > Subject: cupani AKA git-rw IP address changed > > The main git server, cupani, is the machine you connect to when you push > or pull git repositories over ssh to git-rw.torproject.org. That > machines has been migrated to the new Ganeti cluster. > > This required an IP address change from: > > 78.47.38.228 2a01:4f8:211:6e8:0:823:4:1 > > to: > > 116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 > > DNS has been updated and preliminary tests show that everything is > mostly working. You *will* get a warning about the IP address change > when connecting over SSH, which will go away after the first > connection. > > Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts. > > That is normal. The SSH fingerprints of the host did *not* change. > > Please do report any other anomaly using the normal channels: > > https://gitlab.torproject.org/tpo/tpa/team/-/wikis/support > > The service was unavailable for about an hour during the migration. 
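Once the migration is done, a quick sanity check that the instance ended up with the expected disk template, state and nodes can look like this (assuming `$INSTANCE` is still set from the steps above):

```
# confirm the instance is running on DRBD with a primary and a secondary node
gnt-instance list -o name,disk_template,status,pnode,snodes $INSTANCE
```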
## Importing external libvirt instances, manual This procedure is now easier to accomplish with the Fabric tools written especially for this purpose. Use the above procedure instead. This is kept for historical reference. Assumptions: * `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`) * `SPARE_NODE`: a ganeti node with free space (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated * `MASTER_NODE`: the master ganeti node (e.g. `fsn-node-01.torproject.org`) * `KVM_HOST`: the machine which we migrate the `INSTANCE` from * the `INSTANCE` has only `root` and `swap` partitions Import procedure: 1. pick a viable SPARE NODE to import the instance (see "evaluating cluster capacity" above, when in doubt), login to the three servers, setting the proper environment everywhere, for example: MASTER_NODE=fsn-node-01.torproject.org SPARE_NODE=fsn-node-03.torproject.org KVM_HOST=kvm1.torproject.org INSTANCE=test.torproject.org 2. establish VM specs, on the KVM HOST: * disk space in GiB: for disk in /srv/vmstore/$INSTANCE/*; do printf "$disk: " echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l done * number of CPU cores: sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml * memory, assuming from KiB to GiB: echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l TODO: make sure the memory line is in KiB and that the number makes sense. * on the INSTANCE, find the swap device UUID so we can recreate it later: blkid -t TYPE=swap -s UUID -o value 3. setup a copy channel, on the SPARE NODE: ssh-agent bash ssh-add /etc/ssh/ssh_host_ed25519_key cat /etc/ssh/ssh_host_ed25519_key.pub on the KVM HOST: echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root 4. copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE: rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true Note: it's possible there is not enough room in `/srv`: in the base Ganeti installs, everything is in the same root partition (`/`) which will fill up if the instance is (say) over ~30GiB. In that case, create a filesystem in `/srv`: (mkdir /root/srv && mv /srv/* /root/srv true) || true && lvcreate -L 200G vg_ganeti -n srv && mkfs /dev/vg_ganeti/srv && echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab && mount /srv && ( mv /root/srv/* ; rmdir /root/srv ) This partition can be reclaimed once the VM migrations are completed, as it needlessly takes up space on the node. 5. on the SPARE NODE, create and initialize a logical volume with the predetermined size: lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm Note how we assume two disks above, but the instance might have a different configuration that would require changing the above. The above, common, configuration is to have an LVM disk separate from the "root" disk, the former being on a HDD, but the HDD is sometimes completely omitted and sizes can differ. 
Sometimes it might be worth using `pv` to get progress on long transfers: qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k TODO: ideally, the above procedure (and many steps below as well) would be automatically deduced from the disk listing established in the first step. 6. on the MASTER NODE, create the instance, adopting the LV: gnt-instance add -t plain \ -n fsn-node-03 \ --disk 0:adopt=$INSTANCE-root \ --disk 1:adopt=$INSTANCE-swap \ --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \ --backend-parameters memory=2g,vcpus=2 \ --net 0:ip=pool,network=gnt-fsn \ --no-name-check \ --no-ip-check \ -o debootstrap+default \ $INSTANCE 7. cross your fingers and watch the party: gnt-instance console $INSTANCE 9. IP address change on new instance: edit `/etc/hosts` and `/etc/network/interfaces` by hand and add the IPv4 and IPv6 addresses. The IPv4 configuration can be found in: gnt-instance info $INSTANCE The latter can be guessed by concatenating `2a01:4f8:fff0:4f::` and the IPv6 link-local address without the `fe80::` prefix. For example, a link-local address of `fe80::266:37ff:fe65:870f/64` should yield the following configuration: iface eth0 inet6 static accept_ra 0 address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64 gateway 2a01:4f8:fff0:4f::1 TODO: reuse `gnt-debian-interfaces` from the ganeti puppet module script here? 10. functional tests: change your `/etc/hosts` to point to the new server and see if everything still kind of works 11. shut down the original instance 12. resync and reconvert image, on the Ganeti MASTER NODE: gnt-instance stop $INSTANCE on the Ganeti node: rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ && qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root && rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ && qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm 13. switch to DRBD, still on the Ganeti MASTER NODE: gnt-instance modify -t drbd $INSTANCE gnt-instance failover $INSTANCE gnt-instance startup $INSTANCE 14. redo the IP address change in `/etc/network/interfaces` and `/etc/hosts` 15. final functional test 16. change IP address in the following locations: * nagios (don't forget to change the parent) * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!) * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli) * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii) * reverse DNS (upstream web UI, e.g. Hetzner Robot) 17. decommission the old instance ([howto/retire-a-host](howto/retire-a-host)) ### Troubleshooting * if boot takes a long time and you see a message like this on the console: [ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s) ... which is generally followed by: [DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04. [DEPEND] Dependency failed for Swap. it means the swap device UUID wasn't set up properly, and does not match the one provided in `/etc/fstab`. That is probably because you missed the `mkswap -U` step documented above. ### References * [Upstream docs](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances) have the canonical incantation: gnt-instance add -t plain -n HOME_NODE ...
--disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME * [DSA docs](https://dsa.debian.org/howto/install-ganeti/) also use disk adoption and have a procedure to migrate to DRBD * [Riseup docs](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) suggest creating a VM without installing, shutting down and then syncing Ganeti [supports importing and exporting](http://docs.ganeti.org/ganeti/2.15/html/design-ovf-support.html?highlight=qcow) from the [Open Virtualization Format](https://en.wikipedia.org/wiki/Open_Virtualization_Format) (OVF), but unfortunately it [doesn't seem libvirt supports *exporting* to OVF](https://forums.centos.org/viewtopic.php?t=49231). There's a [virt-convert](http://manpages.debian.org/virt-convert) tool which can *import* OVF, but not the reverse. The [libguestfs](http://www.libguestfs.org/) library also has a [converter](http://www.libguestfs.org/virt-v2v.1.html) but it also doesn't support exporting to OVF or anything Ganeti can load directly. So people have written [their own conversion tools](https://virtuallyhyper.com/2013/06/migrate-from-libvirt-kvm-to-virtualbox/) or [their own conversion procedure](https://scienceofficersblog.blogspot.com/2014/04/using-cloud-images-with-ganeti.html). Ganeti also supports [file-backed instances](http://docs.ganeti.org/ganeti/2.15/html/design-file-based-storage.html) but "adoption" is specifically designed for logical volumes, so it doesn't work for our use case. ## Rebooting Ganeti nodes need special care, as we can accomplish zero-downtime reboots on those machines. The `reboot` script in `tsa-misc` takes care of the special steps involved (which is basically to empty a node before rebooting it). Such a reboot should be run interactively. ### Full fleet reboot This command will reboot the entire Ganeti fleet, including the hosted VMs; use this when (for example) you have kernel upgrades to deploy everywhere: ./reboot --skip-ganeti-empty -v --reason 'qemu flagged in needrestart' \ -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \ chi-node-1{0,1}.torproject.org \ fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org This is long and rather disruptive. Notifications should be posted on IRC, in `#tor-project`, as instances are rebooted. It can take about a day to complete a full fleet-wide reboot. ### Node-only reboot In certain cases (Open vSwitch restarts, for example), only the nodes need a reboot, and not the instances. In that case, you want to reboot the nodes, but before that, migrate the instances off each node and then migrate them back when done. This incantation should do so: ./reboot --ganeti-migrate-back -v --reason 'Open vSwitch upgrade' \ -H fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org This should cause no user-visible disruption. ### Instance-only restarts An alternative procedure should be used if only the `ganeti.service` requires a restart. This happens when a QEMU dependency has been upgraded, for example `libxml` or OpenSSL. This will only migrate the VMs without rebooting the hosts: ./reboot --ganeti-migrate-back --kind=cancel -v --reason 'qemu flagged in needrestart' \ -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \ chi-node-1{0,1}.torproject.org \ fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org This should cause no user-visible disruption. ## Rebalancing a cluster After a reboot or a downtime, all instances might end up on the same node. This is normally handled by the reboot script, but it might be desirable to do this by hand if there was a crash or another special condition.
This can be easily corrected with this command, which will spread instances around the cluster to balance it: hbal -L -C -v -p The above will show the proposed solution, with the state of the cluster before, and after (`-p`) and the commands to get there (`-C`). To actually execute the commands, you can copy-paste those commands. An alternative is to pass the `-X` argument, to tell `hbal` to actually issue the commands itself: hbal -L -C -v -p -X This will automatically move the instances around and rebalance the cluster. Here's an example run on a small cluster: root@fsn-node-01:~# gnt-instance list Instance Hypervisor OS Primary_node Status Memory loghost01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 2.0G onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 12.0G static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G root@fsn-node-01:~# hbal -L -X Loaded 2 nodes, 5 instances Group size 2 nodes, 5 instances Selected node group: default Initial check done: 0 bad nodes, 0 bad instances. Initial score: 8.45007519 Trying to minimize the CV... 1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 4.98124611 a=f 2. loghost01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 1.78271883 a=f Cluster score improved from 8.45007519 to 1.78271883 Solution length=2 Got job IDs 16345 Got job IDs 16346 root@fsn-node-01:~# gnt-instance list Instance Hypervisor OS Primary_node Status Memory loghost01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 2.0G onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 12.0G static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G In the above example, you should notice that the `web-fsn` instances both ended up on the same node. That's because the balancer did not know that they should be distributed. A special configuration was done, below, to avoid that problem in the future. But as a workaround, instances can also be moved by hand and the cluster re-balanced. Also notice that `-X` does not show the job output, use `ganeti-watch-jobs` for that, in another terminal. See the [job inspection](#job-inspection) section for more details on that. ### Redundant instances distribution Some instances are redundant across the cluster and should *not* end up on the same node. A good example are the `web-fsn-01` and `web-fsn-02` instances which, in theory, would serve similar traffic. If they end up on the same node, it might flood the network on that machine or at least defeats the purpose of having redundant machines. The way to ensure they get distributed properly by the balancing algorithm is to "tag" them. For the web nodes, for example, this was performed on the master: gnt-cluster add-tags htools:iextags:service gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn This tells Ganeti that `web-fsn` is an "exclusion tag" and the optimizer will not try to schedule instances with those tags on the same node. 
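The same pattern should apply to any other redundant pair; for example, with a hypothetical `foo-01`/`foo-02` service (the cluster-level `htools:iextags:service` tag above only needs to be added once):

```
# tag both instances with the same service: value so hbal keeps them apart
gnt-instance add-tags foo-01.torproject.org service:foo
gnt-instance add-tags foo-02.torproject.org service:foo
```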
To see which tags are present, use: # gnt-cluster list-tags htools:iextags:service You can also find which objects are assigned a given tag with: # gnt-cluster search-tags service /cluster htools:iextags:service /instances/web-fsn-01.torproject.org service:web-fsn /instances/web-fsn-02.torproject.org service:web-fsn IMPORTANT: a previous version of this article mistakenly indicated that a new cluster-level tag had to be created for each service. That method did *not* work. The [hbal manpage](http://docs.ganeti.org/ganeti/current/man/hbal.html#exclusion-tags) explicitly mentions that the cluster-level tag is a *prefix* that can be used to create *multiple* such tags. This configuration also happens to be simpler and easier to use... ### HDD migration restrictions Cluster balancing works well unless there are inconsistencies between how nodes are configured. In our case, some nodes have HDDs (Hard Disk Drives, AKA spinning rust) and others do not. Therefore, it's not possible to move an instance from a node with a disk allocated on the HDD to a node that does not have such a disk. Yet somehow the allocator is not smart enough to tell, and you will get the following error when doing an automatic rebalancing; one of the migrations fails and stops the cluster balance: Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd In this case, it is trying to migrate the `gitlab-02` server from `fsn-node-01` (which has an HDD) to `fsn-node-07` (which hasn't), which naturally fails. This is a known limitation of the Ganeti code. There has been a [draft design document for multiple storage unit support](http://docs.ganeti.org/ganeti/master/html/design-multi-storage-htools.html) since 2015, but it has [never been implemented](https://github.com/ganeti/ganeti/issues/865). There have been multiple issues reported upstream on the subject: * [208: Bad behaviour when multiple volume groups exists on nodes](https://github.com/ganeti/ganeti/issues/208) * [1199: unable to mark storage as unavailable for allocation](https://github.com/ganeti/ganeti/issues/1199) * [1240: Disk space check with multiple VGs is broken](https://github.com/ganeti/ganeti/issues/1240) * [1379: Support for displaying/handling multiple volume groups](https://github.com/ganeti/ganeti/issues/1379) Unfortunately, there are no known workarounds for this, at least not that fix the `hbal` command. It *is* possible to exclude the faulty migration from the pool of possible moves, however, for example in the above case: hbal -L -v -C -P --exclude-instances gitlab-02.torproject.org It's also possible to use the `--no-disk-moves` option to avoid disk move operations altogether. Both workarounds obviously do not correctly balance the cluster... Note that we have also tried to use `htools:migration` tags to work around that issue, but [those do not work for secondary instances](https://github.com/ganeti/ganeti/issues/1497). For this we would need to set up [node groups](http://docs.ganeti.org/ganeti/current/html/man-gnt-group.html) instead. A good trick is to look at the solution proposed by `hbal`: Trying to minimize the CV... 1. tbb-nightlies-master fsn-node-01:fsn-node-02 => fsn-node-04:fsn-node-02 6.12095251 a=f r:fsn-node-04 f 2.
bacula-director-01 fsn-node-01:fsn-node-03 => fsn-node-03:fsn-node-01 4.56735007 a=f 3. staticiforme fsn-node-02:fsn-node-04 => fsn-node-02:fsn-node-01 3.99398707 a=r:fsn-node-01 4. cache01 fsn-node-07:fsn-node-05 => fsn-node-07:fsn-node-01 3.55940346 a=r:fsn-node-01 5. vineale fsn-node-05:fsn-node-06 => fsn-node-05:fsn-node-01 3.18480313 a=r:fsn-node-01 6. pauli fsn-node-06:fsn-node-07 => fsn-node-06:fsn-node-01 2.84263128 a=r:fsn-node-01 7. neriniflorum fsn-node-05:fsn-node-02 => fsn-node-05:fsn-node-01 2.59000393 a=r:fsn-node-01 8. static-master-fsn fsn-node-01:fsn-node-02 => fsn-node-02:fsn-node-01 2.47345604 a=f 9. polyanthum fsn-node-02:fsn-node-07 => fsn-node-07:fsn-node-02 2.47257956 a=f 10. forrestii fsn-node-07:fsn-node-06 => fsn-node-06:fsn-node-07 2.45119245 a=f Cluster score improved from 8.92360196 to 2.45119245 Look at the last column. The `a=` field shows what "action" will be taken. An `f` is a failover (or "migrate"), and an `r:` is a `replace-disks`, with the new secondary after the colon (`:`). In the above case, the proposed solution is correct: no secondary node is in the range of nodes that lack HDDs (`fsn-node-0[5-7]`). If one of the disk replaces hits one of the nodes without an HDD, that is when you use `--exclude-instances` to find a better solution. A typical exclude is: hbal -L -v -C -P --exclude-instances=bacula-director-01,tbb-nightlies-master,eugeni,winklerianum,woronowii,rouyi,loghost01,materculae,gayi,weissii Another option is to specifically look for instances that do not have an HDD and migrate only those. In my situation, `gnt-cluster verify` was complaining that `fsn-node-02` was full, so I looked for all the instances on that node and found the ones which didn't have an HDD: gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \ | sort | grep 'fsn-node-02' | awk '{print $3}' | \ while read instance ; do printf "checking $instance: " if gnt-instance info $instance | grep -q hdd ; then echo "HAS HDD" else echo "NO HDD" fi done Then you can manually `migrate -f` (to fail over to the secondary) and `replace-disks -n` (to find another secondary) the instances that *can* be migrated out of the first four machines (which have HDDs) to the remaining ones (which do not). Look at the memory usage in `gnt-node list` to pick the best node. In general, if a given node in the first four is overloaded, a good trick is to look for one that can be failed over, with, for example: gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]' ... or, for a particular node (say fsn-node-04): gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]' The instances listed there would be ones that can be migrated to their secondary to give `fsn-node-04` some breathing room. ## Adding and removing addresses on instances Say you created an instance but forgot to assign an extra IP it needs. You can still do so with: gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org ## Job inspection Sometimes it can be useful to look at the active jobs. It might be, for example, that another user has queued a bunch of jobs in another terminal which you do not have access to, or some automated process did (Nagios, for example, runs `gnt-cluster verify` once in a while). Ganeti has this concept of "jobs" which can provide information about those.
The command `gnt-job list` will show the entire job history, and `gnt-job list --running` will show running jobs. `gnt-job watch` can be used to watch a specific job.

We have a wrapper called `ganeti-watch-jobs` which automatically shows the output of whatever job is currently running and exits when all jobs complete. This is particularly useful while [rebalancing the cluster](#rebalancing-a-cluster) as `hbal -X` does not show the job output...

## Open vSwitch crash course and debugging

[Open vSwitch](https://www.openvswitch.org/) is used in the `gnt-fsn` cluster to connect the multiple machines with each other through [Hetzner's "vswitch"](https://wiki.hetzner.de/index.php/Vswitch/en) system. You will typically not need to deal with Open vSwitch, as Ganeti takes care of configuring the network on instance creation and migration. But if you believe there might be a problem with it, you can consider reading the following:

 * [Documentation portal](https://docs.openvswitch.org/en/latest/)
 * [Tutorials](https://docs.openvswitch.org/en/latest/tutorials/index.html)
 * [Debugging Open vSwitch slides](https://www.openvswitch.org/support/slides/OVS-Debugging-110414.pdf)

## Accessing the QEMU control ports

There is a magic warp zone on the node where an instance is running:

```
nc -U /var/run/ganeti/kvm-hypervisor/ctrl/$INSTANCE.monitor
```

This drops you in the [QEMU monitor](https://people.redhat.com/pbonzini/qemu-test-doc/_build/html/topics/pcsys_005fmonitor.html) which can do all sorts of things including adding/removing devices, saving/restoring the VM state, pausing/resuming the VM, taking screenshots, etc.

There are many sockets in the `ctrl` directory, including:

 * `.serial`: the instance's serial port
 * `.monitor`: the QEMU monitor control port
 * `.qmp`: the same, but with a JSON interface that I can't figure out (the `-qmp` argument to `qemu`)
 * `.kvmd`: same as the above?

## Pager playbook

### I/O overload

In case of excessive I/O, it might be worth looking into which machine is the cause. The [howto/drbd](howto/drbd) page explains how to map a DRBD device to a VM. You can also find which logical volume is backing an instance (and vice versa) with this command:

    lvs -o+tags

This will list all logical volumes and their associated tags. If you already know which logical volume you're looking for, you can address it directly:

    root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
      LV Tags
      originstname+bacula-director-01.torproject.org

### Node failure

Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only one machine disappears, the cluster should be able to recover by failing over instances to the other nodes. This is currently done manually, however.

WARNING: the following procedure should be considered a LAST RESORT. In the vast majority of cases, it is simpler and less risky to just restart the node using a remote power cycle to restore the service than to risk the split brain scenario which this procedure can cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from actually returns on its own without you being able to stop it, it may start those DRBD disks and virtual machines, and you *may* end up in a split brain scenario. Normally, the node asks the master which VMs to start, so it should be safe to fail over from a node that is NOT the master, but make sure the rest of the cluster is healthy before going ahead with this procedure.
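Before following the procedure below, a quick pre-flight check can help confirm that the rest of the cluster is indeed healthy. This is a minimal sketch, not part of the formal procedure; `fsn-node-07` is simply the example failed node used below:

    # confirm which node believes it is the master (run this on the master)
    gnt-cluster getmaster

    # full consistency check: DRBD state, N+1 memory, orphan volumes, etc.
    gnt-cluster verify

    # list which instances have the failed node as primary or secondary
    gnt-instance list -o name,pnode,snodes,status | grep fsn-node-07

If `gnt-cluster verify` already reports problems on *other* nodes, think twice before adding a failover to the mix.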
If, say, `fsn-node-07` completely fails and you need to restore service to the virtual machines running on that server, you can fail over to the secondaries. Before you do, however, you need to be completely confident it is not still running in parallel, which could lead to a "split brain" scenario. For that, just cut the power to the machine using out of band management (e.g. on Hetzner, power down the machine through the Hetzner Robot; on Cymru, use the iDRAC to cut the power to the main board).

Once the machine is powered down, instruct Ganeti to stop using it altogether:

    gnt-node modify --offline=yes fsn-node-07

Then, once the machine is offline and Ganeti also agrees, switch all the instances on that node to their secondaries:

    gnt-node failover fsn-node-07.torproject.org

It's possible that you need `--ignore-consistency` but this has caused trouble in the past (see [40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for example: they explicitly say they never needed the flag. Note that it will still try to connect to the failed node to shut down the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed server is repaired and restarts, it will contact the master to ask for instances to start. Since the instances have been migrated to other machines, none will be started and there *should* not be any inconsistencies.

Once the machine is up and running and you are confident you do not have a split brain scenario, you can re-add the machine to the cluster with:

    gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster because you now have an empty node which could be reused (hopefully). It might, obviously, be worth exploring the root cause of the failure, however, before re-adding the machine to the cluster.

Recoveries could eventually be automated if such situations occur more often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin manual.

### Master node failure

A master node failure is a special case, as you may not have access to the node to run Ganeti commands. The [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) has good documentation on this, but we also include scenarios specific to our use cases, to make sure this is also available offline.

There are two different scenarios that might require a master failover:

 1. the master is *expected* to fail or go down for maintenance (looming HDD failure, planned maintenance) and we want to retain availability
 2. the master has completely failed (motherboard fried, power failure, etc)

The key difference between scenario 1 and 2 here is that in scenario 1, the master is *still* available.

#### Scenario 1: preventive maintenance

This is the best case scenario, as the master is still available. In that case, it should simply be a matter of doing the `master-failover` command and marking the old master as offline.
On the machine you want to elect as the new master: gnt-cluster master-failover gnt-node modify --offline yes OLDMASTER.torproject.org When the old master is available again, re-add it to the cluster with: gnt-node add --readd OLDMASTER.torproject.org Note that it *should* be safe to boot the old master normally, as long as it doesn't think it's the master before reboot. That is because it's the master which tells nodes which VMs to start on boot. You can check that by running this on the OLDMASTER: gnt-cluster getmaster It should return the *NEW* master. Here's an example of a routine failover performed on `fsn-node-01`, the nominal master of the `gnt-fsn` cluster, falling over to a secondary master (we picked `fsn-node-02` here) in prevision for a disk replacement: root@fsn-node-02:~# gnt-cluster master-failover root@fsn-node-02:~# gnt-cluster getmaster fsn-node-02.torproject.org root@fsn-node-02:~# gnt-node modify --offline yes fsn-node-01.torproject.org Tue Jun 21 14:30:56 2022 Failed to stop KVM daemon on node 'fsn-node-01.torproject.org': Node is marked offline Modified node fsn-node-01.torproject.org - master_candidate -> False - offline -> True And indeed, `fsn-node-01` now thinks it's not the master anymore: root@fsn-node-01:~# gnt-cluster getmaster fsn-node-02.torproject.org And this is how the node was recovered, after a reboot, on the new master: root@fsn-node-02:~# gnt-node add --readd fsn-node-01.torproject.org 2022-06-21 16:43:52,666: The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies. Tue Jun 21 16:43:54 2022 - INFO: Readding a node, the offline/drained flags were reset Tue Jun 21 16:43:54 2022 - INFO: Node will be a master candidate And to promote it back, on the old master: root@fsn-node-01:~# gnt-cluster master-failover root@fsn-node-01:~# And both nodes agree on who the master is: root@fsn-node-01:~# gnt-cluster getmaster fsn-node-01.torproject.org root@fsn-node-02:~# gnt-cluster getmaster fsn-node-01.torproject.org Now is a good time to verify the cluster too: gnt-cluster verify That's pretty much it! See [tpo/tpa/team#40805](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/40805) for the rest of that incident. #### Scenario 2: complete master node failure In this scenario, the master node is *completely* unavailable. In this case, the [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) should be followed pretty much to the letter. WARNING: if you follow this procedure and skip step 1, you will probably end up with a split brain scenario (recovery documented below). So make absolutely sure the old master is *REALLY* unavailable before moving ahead with this. The procedure is, at the time of writing (WARNING: UNTESTED): 1. Make sure that the original failed master won't start again while a new master is present, preferably by physically shutting down the node. 2. To upgrade one of the master candidates to the master, issue the following command on the machine you intend to be the new master: gnt-cluster master-failover 3. Offline the old master so the new master doesn't try to communicate with it. Issue the following command: gnt-node modify --offline yes oldmaster 4. If there were any DRBD instances on the old master node, they can be failed over by issuing the following commands: gnt-node evacuate -s oldmaster gnt-node evacuate -p oldmaster 5. 
Any plain instances on the old master need to be recreated again. If the old master becomes available again, re-add it to the cluster with: gnt-node add --readd OLDMASTER.torproject.org The above procedure is UNTESTED. See also the [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails) for further ideas. ### Split brain recovery A split brain occurred during a partial failure, failover, then unexpected recovery of `fsn-node-07` ([issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). It might occur in other scenarios, but this section documents that specific one. Hopefully the recovery will be similar in other scenarios. The split brain was the result of an operator running this command to failover the instances running on the node: gnt-node failover --ignore-consistency fsn-node-07.torproject.org The symptom of the split brain is that the VM is running on two machines. You will see that in `gnt-cluster verify`: Thu Apr 22 01:28:04 2021 * Verifying node status Thu Apr 22 01:28:04 2021 - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org In the above, the verification finds an instance running on an unexpected server (the old primary). Disks will be in a similar "degraded" state, according to `gnt-cluster verify`: Thu Apr 22 01:28:04 2021 * Verifying instance status Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' We can also see that symptom on an individual instance: root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org - Instance name: onionbalance-01.torproject.org [...] Disks: - disk/0: drbd, size 10.0G access mode: rw nodeA: fsn-node-05.torproject.org, minor=29 nodeB: fsn-node-07.torproject.org, minor=26 port: 11031 on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED* on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED* [...] 
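To confirm the extent of the split brain, it can also help to check which nodes are actually running a `qemu` process for a given instance. This is a minimal sketch, not part of the original procedure: it assumes the instance name appears on the KVM process command line (it normally does, via the `-name` option), and `onionbalance-01.torproject.org` is just the example instance from above:

    # look for a running qemu process matching the instance name on every node
    gnt-cluster command "pgrep -a -f onionbalance-01.torproject.org || true"

An instance showing up on more than one node here is the split brain we are about to resolve.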
The first (optional) thing to do in a split brain scenario is to stop the damage done by running instances: stop all the instances running in parallel, on both the previous and new primaries:

    gnt-instance stop $INSTANCES

Then on `fsn-node-07` just use `kill(1)` to shut down the `qemu` processes running the VMs directly. Now the instances should all be shut down and no further changes that could possibly be lost will be made on the VMs.

(This step is optional because you can also skip straight to the hard decision below, while leaving the instances running. But that adds pressure to you, and we don't want to do that to your poor brain right now.)

That will leave you time to make a more important decision: which node will be authoritative (which will keep running as primary) and which one will "lose" (and will have its instances destroyed)? There's no easy right or wrong answer for this: it's a judgement call. In any case, there might already have been data loss: for as long as both nodes were available and the VMs running on both, data registered on one of the nodes during the split brain will be lost when we destroy the state on the "losing" node.

If you have picked the previous primary as the "new" primary, you will need to *first* revert the failover and flip the instances back to the previous primary:

    for instance in $INSTANCES; do
        gnt-instance failover $instance
    done

When that is done, or if you have picked the "new" primary (the one the instances were originally failed over to) as the official one: you need to fix the disks' state. For this, flip to a "plain" disk (i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring the disk, and reallocate a new disk in the right place. Assuming all instances are stopped, this should do it:

    for instance in $INSTANCES ; do
        gnt-instance modify -t plain $instance
        gnt-instance modify -t drbd --no-wait-for-sync $instance
        gnt-instance start $instance
        gnt-instance console $instance
    done

Then the instances should be back up on a single machine and the split brain scenario resolved. Note that this means the other side of the DRBD mirror is destroyed in the procedure: that is the step that drops the data which was sent to the wrong side of the "split brain". Once everything is back to normal, it might be a good idea to rebalance the cluster.

References:

 * the `-t plain` hack comes from [this post on the Ganeti list](https://groups.google.com/g/ganeti/c/l8www_IcFFI)
 * [this procedure](https://blkperl.github.io/split-brain-ganeti.html) suggests using `replace-disks -n` which also works, but requires us to pick the secondary by hand each time, which is annoying
 * [this procedure](https://www.ipserverone.info/knowledge-base/how-to-fix-drbd-recovery-from-split-brain/) has instructions on how to recover at the DRBD level directly, but we have not needed those instructions so far

### Bridge configuration failures

If you get the following error while trying to bring up the bridge:

    root@chi-node-02:~# ifup br0
    add bridge failed: Package not installed
    run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
    ifup: failed to bring up br0

... it might be that the bridge cannot find a way to load the kernel module, because kernel module loading has been disabled. Reboot with the `/etc/no_modules_disabled` file present:

    touch /etc/no_modules_disabled
    reboot

It might be that the machine took too long to boot because it's not in mandos and the operator took too long to enter the LUKS passphrase.
Re-enable the machine with this command on mandos: mandos-ctl --enable chi-node-02.torproject ### Cleaning up orphan disks Sometimes `gnt-cluster verify` will give this warning, particularly after a failed rebalance: * Verifying orphan volumes - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown This can happen when an instance was partially migrated to a node (in this case `fsn-node-06`) but the migration failed because (for example) there was no HDD on the target node. The fix here is simply to remove the logical volumes on the target node: ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data ### Cleaning up ghost disks Under certain circumstances, you might end up with "ghost" disks, for example: Tue Oct 4 13:24:07 2022 - ERROR: cluster : ghost disk 'ed225e68-83af-40f7-8d8c-cf7e46adad54' in temporary DRBD map It's unclear how this happens, but in this specific case it is believed the problem occurred because a disk failed to add to an instance being resized. It's *possible* this is a situation similar to the one above, in which case you must first find *where* the ghost disk is, with something like: gnt-cluster command 'lvs --noheadings' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54' If this finds a device, you can remove it as normal: ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/ed225e68-83af-40f7-8d8c-cf7e46adad54.disk1_data ... but in this case, the DRBD map is *not* associated with a logical volume. You can also check the `dmsetup` output for a match as well: gnt-cluster command 'dmsetup ls' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54' According to [this discussion](https://groups.google.com/g/ganeti/c/s5qoh26T1yA), it's possible that restarting ganeti on all nodes might clear out the issue: gnt-cluster command 'service ganeti restart' If *all* the "ghost" disks mentioned are not actually found anywhere in the cluster, either in the device mapper or logical volumes, it might just be stray data leftover in the data file. So it *looks* like the proper way to do this is to *remove* the temporary file where this data is stored: gnt-cluster command 'grep ed225e68-83af-40f7-8d8c-cf7e46adad54 /var/lib/ganeti/tempres.data' ssh ... service ganeti stop ssh ... rm /var/lib/ganeti/tempres.data ssh ... service ganeti start gnt-cluster verify That solution was proposed in [this discussion](https://groups.google.com/g/ganeti/c/SMR3yNek3Js). Anarcat toured the Ganeti source code and found that the `ComputeDRBDMap` function, in the Haskell codebase, basically just sucks the data out of that `tempres.data` JSON file, and dumps it into the Python side of things. Then the Python code looks for those disks in its internal disk list and compares. 
It's pretty unlikely that the warning would happen with the disks still being around, therefore. ### Fixing inconsistent disks Sometimes `gnt-cluster verify` will give this error: WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok' ... or worse: ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)> The fix for both is to run: gnt-instance activate-disks materculae.torproject.org This will make sure disks are correctly setup for the instance. If you have a lot of those warnings, pipe the output into this filter, for example: gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' | sed 's/.*instance//;s/:.*//' | sort -u | while read instance; do gnt-instance activate-disks $instance done If you see an error like this: DRBD CRITICAL: Device 28 WFConnection UpToDate, Device 3 WFConnection UpToDate, Device 31 WFConnection UpToDate, Device 4 WFConnection UpToDate In this case, it's warning that the node has device 4, 28, and 31 in `WFConnection` state, which is incorrect. This might not be detected by Ganeti and therefore requires some hand-holding. This is documented in the [resyncing disks section of out DRBD documentation](howto/drbd#resyncing-disks). Like in the above scenario, the solution is basically to run `activate-disks` on the affected instances. ### Not enough memory for failovers Another error that `gnt-cluster verify` can give you is, for example: - ERROR: node fsn-node-04.torproject.org: not enough memory to accomodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available) The solution is to [rebalance the cluster](#rebalancing-a-cluster). ### Can't assemble device after creation It's possible that Ganeti fails to create an instance with this error: Thu Jan 14 20:01:00 2021 - WARNING: Device creation failed Failure: command execution error: Can't create block device <DRBD8(hosts=d1b54252-dd81-479b-a9dc-2ab1568659fa/0-3aa32c9d-c0a7-44bb-832d-851710d04765/0, port=11005, backend=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=10240m)> on node chi-node-03.torproject.org for instance build-x86-13.torproject.org: Can't assemble device after creation, unusual event: drbd0: timeout while configuring network In this case, the problem was that `chi-node-03` had an incorrect `secondary_ip` set. The immediate fix was to correctly set the secondary address of the node: gnt-node modify --secondary-ip=172.30.130.3 chi-node-03.torproject.org Then `gnt-cluster verify` was complaining about the leftover DRBD device: - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device). ### SSH key verification failures Ganeti uses SSH to launch arbitrary commands (as root!) on other nodes. 
It does this using a funky command, from `node-daemon.log`:

    ssh -oEscapeChar=none -oHashKnownHosts=no \
        -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
        -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
        -oConnectTimeout=10 -oHostKeyAlias=chignt.torproject.org -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 \
        root@chi-node-03.torproject.org

This has caused us some problems in the Ganeti buster to bullseye upgrade, possibly because of changes in host verification routines in OpenSSH. The problem was documented in [issue 1608 upstream](https://github.com/ganeti/ganeti/issues/1608) and [tpo/tpa/team#40383](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40383).

A workaround is to synchronize Ganeti's `known_hosts` file:

    grep 'chi-node-0[0-9]' /etc/ssh/ssh_known_hosts | grep -v 'initramfs' | grep ssh-rsa | sed 's/[^ ]* /chignt.torproject.org /' >> /var/lib/ganeti/known_hosts

Note that the above assumes a cluster of fewer than 10 nodes.

### Other troubleshooting

The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common problems. See also the [common issues page](https://github.com/ganeti/ganeti/wiki/Common-Issues) in the Ganeti wiki.

Look into logs on the relevant nodes (particularly `/var/log/ganeti/node-daemon.log`, which shows all commands run by ganeti) when you have problems.

### Migrating a VM between clusters

The [export/import](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#export-import) mechanism can also be used to export and import VMs one at a time, if only a subset of the cluster needs to be evacuated.

Note that this procedure is still a work in progress. A simulation was performed in [tpo/tpa/team#40917](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40917), and a proper procedure might vary from this significantly. In particular, there are some optimizations possible through things like [zerofree](https://tracker.debian.org/pkg/zerofree) and compression...

1. find nodes to host the exported VM on the source cluster and the target cluster; it needs enough disk space in `/var/lib/ganeti/export` to keep a copy of a snapshot of the VM:

        df -h /var/lib/ganeti/export

2. have the right kernel modules loaded, which might require a reboot of the source node:

        modprobe dm_snapshot

3. on the master of the source Ganeti cluster, export the VM to the source node; also use `--noshutdown` if you cannot afford to have downtime on the VM *and* you are ready to lose data accumulated after the snapshot:

        gnt-backup export -n fsn-node-01.torproject.org test-01.torproject.org

    WARNING: this step is currently not working if there's a second disk (or swap device? to be confirmed), see [this upstream issue for details](https://github.com/ganeti/instance-debootstrap/issues/18). For now we're deploying the "nocloud" export/import mechanisms through Puppet to work around that problem, which means the whole disk is copied (as opposed to only the used parts).

4. copy the VM snapshot from the source node to a node in the target cluster:

        rsync -a /var/lib/ganeti/export/test-01.torproject.org/ root@chi-node-02.torproject.org:/var/lib/ganeti/export/test-01.torproject.org/

5. on the master of the target Ganeti cluster, import the VM:

        gnt-backup import -n chi-node-08:chi-node-07 --src-node=chi-node-02.torproject.org --src-dir=/var/lib/ganeti/export/test-01.torproject.org/ test-01.torproject.org

6. enter the restored server console to change the IP address:

        gnt-instance console test-01.torproject.org
7. if everything looks good, change the IP in LDAP

8. destroy the old VM

### Mass migrating instances to a new cluster

The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command can do this. TODO: document mass cluster migrations.

### Reboot procedures

If you get this email in Nagios:

    Subject: ** PROBLEM Service Alert: chi-node-01/needrestart is WARNING **

... and in the detailed results, you see:

    WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none
    Services:
    - ganeti.service

You can try to make `needrestart` fix Ganeti by hand:

    root@chi-node-01:~# needrestart
    Scanning processes...
    Scanning candidates...
    Scanning processor microcode...
    Scanning linux images...
    Running kernel seems to be up-to-date.
    The processor microcode seems to be up-to-date.
    Restarting services...
     systemctl restart ganeti.service
    No containers need to be restarted.
    No user sessions are running outdated binaries.
    root@chi-node-01:~#

... but it's actually likely this didn't fix anything. A rerun will yield the same result. That is likely because the virtual machines, running inside a `qemu` process, need a restart. This can be fixed by rebooting the entire host, if it needs a reboot, or, if it doesn't, just migrating the VMs around. See the [Ganeti reboot procedures](#rebooting) for how to proceed from here on. This is likely a case of an [Instance-only restart](#instance-only-restarts).

### Slow disk sync after rebooting/Broken migrate-back

After rebooting a node with high-traffic instances, the node's disks may take several minutes to sync. While the disks are syncing, the `reboot` script's `--ganeti-migrate-back` option can fail: ``` Wed Aug 10 21:48:22 2022 Migrating instance onionbalance-02.torproject.org Wed Aug 10 21:48:22 2022 * checking disk consistency between source and target Wed Aug 10 21:48:23 2022 - WARNING: Can't find disk on node chi-node-08.torproject.org Failure: command execution error: Disk 0 is degraded or not fully synchronized on target node, aborting migration unexpected exception during reboot: [<UnexpectedExit: cmd='gnt-instance migrate -f onionbalance-02.torproject.org' exited=1>] Encountered a bad command exit code!
Command: 'gnt-instance migrate -f onionbalance-02.torproject.org' ``` When this happens, `gnt-cluter verify` may show a large amount of errors for node status and instance status ``` Wed Aug 10 21:49:37 2022 * Verifying node status Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 0 of disk 1e713d4e-344c-4c39-9286-cb47bcaa8da3 (attached in instance 'probetelemetry-01.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 1 of disk 1948dcb7-b281-4ad3-a2e4-cdaf3fa159a0 (attached in instance 'probetelemetry-01.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 2 of disk 25986a9f-3c32-4f11-b546-71d432b1848f (attached in instance 'probetelemetry-01.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 3 of disk 7f3a5ef1-b522-4726-96cf-010d57436dd5 (attached in instance 'static-gitlab-shim.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 4 of disk bfd77fb0-b8ec-44dc-97ad-fd65d6c45850 (attached in instance 'static-gitlab-shim.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 5 of disk c1828d0a-87c5-49db-8abb-ee00ccabcb73 (attached in instance 'static-gitlab-shim.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 8 of disk 1f3f4f1e-0dfa-4443-aabf-0f3b4c7d2dc4 (attached in instance 'onionbalance-02.torproject.org') is not active Wed Aug 10 21:49:37 2022 - ERROR: node chi-node-08.torproject.org: drbd minor 9 of disk bbd5b2e9-8dbb-42f4-9c10-ef0df7f59b85 (attached in instance 'onionbalance-02.torproject.org') is not active Wed Aug 10 21:49:37 2022 * Verifying instance status Wed Aug 10 21:49:37 2022 - WARNING: instance static-gitlab-shim.torproject.org: disk/0 on chi-node-04.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - WARNING: instance static-gitlab-shim.torproject.org: disk/1 on chi-node-04.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - WARNING: instance static-gitlab-shim.torproject.org: disk/2 on chi-node-04.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/3-3aa32c9d-c0a7-44bb-832d-851710d04765/8, port=11040, backend=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)> Wed Aug 10 21:49:37 2022 - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/4-3aa32c9d-c0a7-44bb-832d-851710d04765/11, port=11041, backend=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)> Wed Aug 10 21:49:37 2022 - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: 
Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/5-3aa32c9d-c0a7-44bb-832d-851710d04765/12, port=11042, backend=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_data, visible as /dev/, size=20480m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=20480m)> Wed Aug 10 21:49:37 2022 - WARNING: instance probetelemetry-01.torproject.org: disk/0 on chi-node-06.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - WARNING: instance probetelemetry-01.torproject.org: disk/1 on chi-node-06.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - WARNING: instance probetelemetry-01.torproject.org: disk/2 on chi-node-06.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/3-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/0, port=11035, backend=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)> Wed Aug 10 21:49:37 2022 - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/4-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/1, port=11036, backend=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)> Wed Aug 10 21:49:37 2022 - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/5-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/2, port=11037, backend=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_data, visible as /dev/, size=51200m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=51200m)> Wed Aug 10 21:49:37 2022 - WARNING: instance onionbalance-02.torproject.org: disk/0 on chi-node-09.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - WARNING: instance onionbalance-02.torproject.org: disk/1 on chi-node-09.torproject.org is degraded; local disk state is 'ok' Wed Aug 10 21:49:37 2022 - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/8-86e465ce-60df-4a6f-be17-c6abb33eaf88/4, port=11022, backend=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)> Wed Aug 10 21:49:37 2022 - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device 
<DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/9-86e465ce-60df-4a6f-be17-c6abb33eaf88/5, port=11021, backend=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)> ``` This is usually a false alarm, and the warnings and errors will disappear in a few minutes when the disk finishes syncing. Re-check `gnt-cluster verify` every few minutes, and manually migrate the instances back when the errors disappear. If such an error persists, consider telling Ganeti to "re-seat" the disks (so to speak) with, for example: gnt-instance activate-disks onionbalance-02.torproject.org ## Disaster recovery If things get completely out of hand and the cluster becomes too unreliable for service, the only solution is to rebuild another one elsewhere. Since Ganeti 2.2, there is a [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command to move instances between cluster that can be used for that purpose. See the [mass migration procedure](#mass-migrating-instances-to-a-new-cluster) above. The [export/import](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#export-import) mechanism can also be used to export and import VMs one at a time, if only a subset of the cluster needs to be evacuated. See the [migrating a VM between clusters](#migrating-a-vm-between-clusters) procedure above. If Ganeti is completely destroyed and its APIs don't work anymore, the last resort is to restore all virtual machines from [howto/backup](howto/backup). Hopefully, this should not happen except in the case of a catastrophic data loss bug in Ganeti or [howto/drbd](howto/drbd). # Reference ## Installation Ganeti is typically installed as part of the [bare bones machine installation process](howto/new-machine), typically as part of the "post-install configuration" procedure, once the machine is fully installed and configured. Typically, we add a new *node* to an existing *cluster*. Below are cluster-specific procedures to add a new *node* to each existing cluster, alongside the configuration of the cluster as it was done at the time (and how it could be used to rebuild a cluster from scratch). Make sure you use the procedure specific to the cluster you are working on. Note that this is *not* about installing virtual machines (VMs) *inside* a Ganeti cluster: for that you want to look at the [new instance procedure](#adding-a-new-instance). ### New gnt-fsn node 1. To create a new box, follow [howto/new-machine-hetzner-robot](howto/new-machine-hetzner-robot) but change the following settings: * Server: [PX62-NVMe][] * Location: `FSN1` * Operating system: Rescue * Additional drives: 2x10TB HDD (update: starting from fsn-node-05, we are *not* ordering additional drives to save on costs, see [ticket 33083](https://bugs.torproject.org/33083) for rationale) * Add in the comment form that the server needs to be in the same datacenter as the other machines (FSN1-DC13, but double-check) [PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER 2. follow the [howto/new-machine](howto/new-machine) post-install configuration 3. Add the server to the two `vSwitch` systems in [Hetzner Robot web UI](https://robot.your-server.de/vswitch) 4. install openvswitch and allow modules to be loaded: touch /etc/no_modules_disabled reboot apt install openvswitch-switch 5. 
Allocate a private IP address in the `30.172.in-addr.arpa` zone (and the `torproject.org` zone) for the node, in the `admin/dns/domains.git` repository

6. copy over the `/etc/network/interfaces` from another ganeti node, changing the `address` and `gateway` fields to match the local entry.

7. knock on wood, cross your fingers, pet a cat, help your local book store, and reboot:

        reboot

8. Prepare all the nodes by configuring them in Puppet, by adding the class `roles::ganeti::fsn` to the node

9. Re-disable module loading:

        rm /etc/no_modules_disabled

10. run puppet across the ganeti cluster to ensure ipsec tunnels are up:

        cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'

11. reboot again:

        reboot

12. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.135.2 \
          --no-ssh-key-check \
          --no-node-setup \
          fsn-node-02.torproject.org

    If this is an entirely new cluster, you need a different procedure, see [the cluster initialization procedure](#gnt-fsn-cluster-initialization) instead.

13. make sure everything is great in the cluster:

        gnt-cluster verify

If that takes a long time and eventually fails with errors like:

    ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out\'r\n

... that is because the [howto/ipsec](howto/ipsec) tunnels between the nodes are failing. Make sure Puppet has run across the cluster (step 10 above) and see [howto/ipsec](howto/ipsec) for further diagnostics. For example, the above would be fixed with:

    ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
    ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"

### gnt-fsn cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup of the first Ganeti node when the `gnt-fsn` cluster was set up:

    gnt-cluster init \
        --master-netdev vlan-gntbe \
        --vg-name vg_ganeti \
        --secondary-ip 172.30.135.1 \
        --enabled-hypervisors kvm \
        --nic-parameters mode=openvswitch,link=br0,vlan=4000 \
        --mac-prefix 00:66:37 \
        --no-ssh-init \
        --no-etc-hosts \
        fsngnt.torproject.org

The above assumes that `fsngnt` is already in DNS. See the [MAC address prefix selection](#mac-address-prefix-selection) section for information on how the `--mac-prefix` argument was selected.

Then the following extra configuration was performed:

    gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
    gnt-cluster modify -H kvm:kernel_path=,initrd_path=
    gnt-cluster modify -H kvm:security_model=pool
    gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
    gnt-cluster modify -H kvm:disk_cache=none
    gnt-cluster modify -H kvm:disk_discard=unmap
    gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
    gnt-cluster modify -H kvm:disk_type=scsi-hd
    gnt-cluster modify -H kvm:migration_bandwidth=950
    gnt-cluster modify -H kvm:migration_downtime=500
    gnt-cluster modify -H kvm:migration_caps=postcopy-ram
    gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
    gnt-cluster modify --uid-pool 4000-4019

The [network configuration](#network-configuration) (below) must also be performed for the address blocks reserved in the cluster.

### New gnt-chi node

1. to create a new box, follow the [cymru new-machine howto](howto/new-machine-cymru)

2. follow the [howto/new-machine](howto/new-machine) post-install configuration
3. Allocate a private IP address in the `30.172.in-addr.arpa` zone for the node, in the `admin/dns/domains.git` repository

4. add the private IP address to the eth1 interface, for example in `/etc/network/interfaces.d/eth1`:

        auto eth1
        iface eth1 inet static
            address 172.30.130.5/24

    This IP must be allocated in the reverse DNS zone file (`30.172.in-addr.arpa`) and the `torproject.org` zone file in the `dns/domains.git` repository.

5. enable the interface:

        ifup eth1

6. set up a bridge on the public interface, replacing the `eth0` blocks with something like:

        auto eth0
        iface eth0 inet manual

        auto br0
        iface br0 inet static
            address 38.229.82.104/24
            gateway 38.229.82.1
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0

        # IPv6 configuration
        iface br0 inet6 static
            accept_ra 0
            address 2604:8800:5000:82:baca:3aff:fe5d:8774/64
            gateway 2604:8800:5000:82::1

7. allow modules to be loaded, cross your fingers that you didn't screw up the network configuration above, and reboot:

        touch /etc/no_modules_disabled
        reboot

8. configure the node in Puppet by adding it to the `roles::ganeti::chi` class, and run Puppet on the new node:

        puppet agent -t

9. re-disable module loading:

        rm /etc/no_modules_disabled

10. run puppet across the ganeti cluster to ensure firewalls are correctly configured:

        cumin -p 0 'C:roles::ganeti::chi' 'puppet agent -t'

11. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.130.5 \
          --no-ssh-key-check \
          --no-node-setup \
          chi-node-05.torproject.org

    If this is an entirely new cluster, you need a different procedure, see [the cluster initialization procedure](#gnt-fsn-cluster-initialization) instead.

12. make sure everything is great in the cluster:

        gnt-cluster verify

If the last step fails with SSH errors, you may need to re-synchronise the SSH `known_hosts` file, see [SSH key verification failures](#ssh-key-verification-failures).

### gnt-chi cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup of the first Ganeti node when the `gnt-chi` cluster was set up:

    gnt-cluster init \
        --master-netdev eth1 \
        --nic-parameters link=br0 \
        --vg-name vg_ganeti \
        --secondary-ip 172.30.130.1 \
        --enabled-hypervisors kvm \
        --mac-prefix 06:66:38 \
        --no-ssh-init \
        --no-etc-hosts \
        chignt.torproject.org

The above assumes that `chignt` is already in DNS. See the [MAC address prefix selection](#mac-address-prefix-selection) section for information on how the `--mac-prefix` argument was selected.
Then the following extra configuration was performed: ``` gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap gnt-cluster modify -H kvm:kernel_path=,initrd_path= gnt-cluster modify -H kvm:security_model=pool gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000' gnt-cluster modify -H kvm:disk_cache=none gnt-cluster modify -H kvm:disk_discard=unmap gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci gnt-cluster modify -H kvm:disk_type=scsi-hd gnt-cluster modify -H kvm:migration_bandwidth=950 gnt-cluster modify -H kvm:migration_downtime=500 gnt-cluster modify -H kvm:migration_caps=postcopy-ram gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0' gnt-cluster modify --uid-pool 4000-4019 ``` The upper limit for CPU count and memory size were doubled, to 16 and 64G, respectively, with: ``` gnt-cluster modify --ipolicy-bounds-specs \ max:cpu-count=16,disk-count=16,disk-size=1048576,\ memory-size=65536,nic-count=8,spindle-use=12\ /min:cpu-count=1,disk-count=1,disk-size=1024,\ memory-size=128,nic-count=1,spindle-use=1 ``` NOTE: watch out for whitespace here. The [original source](https://johnny85v.wordpress.com/2016/06/13/ganeti-commands/) for this command had too much whitespace, which fails with: Failure: unknown/wrong parameter name 'Missing value for key '' in option --ipolicy-bounds-specs' The disk templates also had to be modified to account for iSCSI devices: gnt-cluster modify --enabled-disk-templates drbd,plain,blockdev gnt-cluster modify --ipolicy-disk-templates drbd,plain,blockdev The [network configuration](#network-configuration) (below) must also be performed for the address blocks reserved in the cluster. This is the actual initial configuration performed: gnt-network add --network 38.229.82.0/24 --gateway 38.229.82.1 --network6 2604:8800:5000:82::/64 --gateway6 2604:8800:5000:82::1 gnt-chi-01 gnt-network connect --nic-parameters=link=br0 gnt-chi-01 default The following IPs were reserved: gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01 The first two are for the gateway, but the rest is temporary and might be reclaimed eventually. ### Network configuration IP allocation is managed by Ganeti through the `gnt-network(8)` system. Say we have `192.0.2.0/24` reserved for the cluster, with the host IP `192.0.2.100` and the gateway on `192.0.2.1`. You will create this network with: gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network If there's also IPv6, it would look something like this: gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network Note: the actual name of the network (`example-network`) above, should follow the convention established in [doc/naming-scheme](doc/naming-scheme). Then we associate the new network to the default node group: gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default The arguments to `--nic-parameters` come from the values configured in the cluster, above. The current values can be found with `gnt-cluster info`. 
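To pull just those values out of the fairly verbose `gnt-cluster info` output, something like this minimal sketch can be used (the grep pattern assumes the standard text output and may need adjusting):

    gnt-cluster info | grep -i -A 3 'nic parameters'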
For example, the second ganeti network block was assigned with the following commands: gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02 gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default IP addresses can be reserved with the `--reserved-ips` argument to the modify command, for example: gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01 gnt-chi-01 Note that the gateway and nodes IP addresses are automatically reserved, this is for hosts outside of the cluster. The network name must follow the [naming convention](doc/naming-scheme). ## SLA As long as the cluster is not over capacity, it should be able to survive the loss of a node in the cluster unattended. Justified machines can be provisionned within a few business days without problems. New nodes can be provisioned within a week or two, depending on budget and hardware availability. ## Design Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting service. All machines use the same hardware to avoid problems with live migration. That is currently a customized build of the [PX62-NVMe][] line. ### Network layout Machines are interconnected over a [vSwitch](https://wiki.hetzner.de/index.php/Vswitch/en), a "virtual layer 2 network" probably implemented using [Software-defined Networking](https://en.wikipedia.org/wiki/Software-defined_networking) (SDN) on top of Hetzner's network. The details of that implementation do not matter much to us, since we do not trust the network and run an IPsec layer on top of the vswitch. We communicate with the `vSwitch` through [Open vSwitch](https://en.wikipedia.org/wiki/Open_vSwitch) (OVS), which is (currently manually) configured on each node of the cluster. There are two distinct IPsec networks: * `gnt-fsn-public`: the public network, which maps to the `fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS network, and the `gnt-fsn` network pool in Ganeti. it provides public IP addresses and routing across the network. instances get IP allocated in this network. * `gnt-fsn-be`: the private ganeti network which maps to the `fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS network. it has no matching `gnt-network` component and IP addresses are allocated manually in the 172.30.135.0/24 network through DNS. it provides internal routing for Ganeti commands and [howto/drbd](howto/drbd) storage mirroring. ### MAC address prefix selection The MAC address prefix for the gnt-fsn cluster (`00:66:37:...`) seems to have been picked arbitrarily. While it does not conflict with a known existing prefix, it could eventually be issued to a manufacturer and reused, possibly leading to a MAC address clash. The closest is currently Huawei: $ grep ^0066 /var/lib/ieee-data/oui.txt 00664B (base 16) HUAWEI TECHNOLOGIES CO.,LTD Such a clash is fairly improbable, because that new manufacturer would need to show up on the local network as well. 
Still, new clusters SHOULD use a different MAC address prefix in a [locally administered address](https://en.wikipedia.org/wiki/MAC_address#Universal_vs._local) (LAA) space, which "are distinguished by setting the second-least-significant bit of the first octet of the address". In other words, the MAC address must have 2, 6, A or E as its second [quad](https://en.wikipedia.org/wiki/Nibble), so it must look like one of those:

    x2 - xx - xx - xx - xx - xx
    x6 - xx - xx - xx - xx - xx
    xA - xx - xx - xx - xx - xx
    xE - xx - xx - xx - xx - xx

We used `06:66:38` in the gnt-chi cluster for that reason. We picked the `06:66` prefix to resemble the existing `00:66` prefix used in `gnt-fsn` but varied the last quad (from `:37` to `:38`) to make them slightly more different-looking.

Obviously, it's unlikely the MAC addresses will be compared across clusters in the short term. But it's technically possible a MAC bridge could be established if an exotic VPN bridge gets established between the two networks in the future, so it's good to have some difference.

### Hardware variations

We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)) but in the past DSA had problems live-migrating (it wouldn't immediately fail but there were "issues" after). So we might need to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover) instead of migrate between those parts of the cluster.

There are also doubts that the Linux kernel supports those shiny new processors at all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix) for example, so it might be worth waiting a little before switching to that new platform, even if it's cheaper.

See the cluster configuration section below for a larger discussion of CPU emulation.

### CPU emulation

Note that we might want to tweak the `cpu_type` parameter. By default, it emulates a lot of processing that can be delegated to the host CPU instead. If we use `kvm:cpu_type=host`, then each node will tailor the emulation system to the CPU on the node. But that might make the live migration more brittle: VMs or processes can crash after a live migrate because of a slightly different configuration (microcode, CPU, kernel and QEMU versions all play a role). So we need to find the lowest common denominator in CPU families. The list of available families supported by QEMU varies between releases, but is visible with:

    # qemu-system-x86_64 -cpu help
    Available CPUs:
    x86 486
    x86 Broadwell             Intel Core Processor (Broadwell)
    [...]
    x86 Skylake-Client        Intel Core Processor (Skylake)
    x86 Skylake-Client-IBRS   Intel Core Processor (Skylake, IBRS)
    x86 Skylake-Server        Intel Xeon Processor (Skylake)
    x86 Skylake-Server-IBRS   Intel Xeon Processor (Skylake, IBRS)
    [...]

The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel micro-architecture. The closest matching family would be `Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support). Note that newer QEMU releases (4.2, currently in unstable) have more supported features.

In that context, of course, supporting different CPU manufacturers (say AMD vs Intel) is impractical: they will have totally different families that are not compatible with each other.
This will break live migration, which can trigger crashes and problems in the migrated virtual machines. If there are problems live-migrating between machines, it is still possible to "failover" (`gnt-instance failover` instead of `migrate`) which shuts off the machine, fails over disks, and starts it on the other side. That's not such a big problem: we often need to reboot the guests when we reboot the hosts anyways. But it does complicate our work.

Of course, it's also possible that live migrates work fine if *no* `cpu_type` at all is specified in the cluster, but that needs to be verified.

Nodes could also be [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a subset of nodes.

References:

 * <https://dsa.debian.org/howto/install-ganeti/>
 * <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>

### Installer

The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install instances. It is configured through Puppet with the [shared ganeti module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much as possible. The installer will:

 1. set up grub to respond on the serial console
 2. set up and log a random root password
 3. make sure SSH is installed and log the public keys and fingerprints
 4. set up swap if a labeled partition is present, or a 512MB swapfile otherwise
 5. set up basic static networking through `/etc/network/interfaces.d`

We have custom configurations on top of that to:

 1. add a few base packages
 2. do our own custom SSH configuration
 3. fix the hostname to be a FQDN
 4. add a line to `/etc/hosts`
 5. add a tmpfs

There is work underway to refactor and automate the install better, see [ticket 31239](https://bugs.torproject.org/31239) for details.

### Storage

TODO: document how DRBD works in general, and how it's set up here in particular. See also the [DRBD documentation](howto/drbd).

The Cymru PoP has an iSCSI cluster for large filesystem storage. Ideally, this would be automated inside Ganeti; some quick links:

 * [search for iSCSI in the ganeti-devel mailing list](https://www.mail-archive.com/search?l=ganeti-devel@googlegroups.com&q=iscsi&submit.x=0&submit.y=0)
 * in particular a [discussion of integrating SANs into ganeti](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/P7JU_0YGn9s) seems to say "just do it manually" (paraphrasing) and [this discussion has an actual implementation](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/kkXFDgvg2rY), [gnt-storage-eql](https://github.com/atta/gnt-storage-eql)
 * it could be implemented as an [external storage provider](https://github.com/ganeti/ganeti/wiki/External-Storage-Providers), see the [documentation](http://docs.ganeti.org/ganeti/2.10/html/design-shared-storage.html)
 * the DSA docs are in two parts: [iscsi](https://dsa.debian.org/howto/iscsi/) and [export-iscsi](https://dsa.debian.org/howto/export-iscsi/)
 * someone made a [Kubernetes provisioner](https://github.com/nmaupu/dell-provisioner) for our hardware which could provide sample code

For now, iSCSI volumes are manually created and passed to new virtual machines.

## Issues

There is no issue tracker specifically for this project. [File][] or [search][] for issues in the [team issue tracker][search] with the ~Ganeti label.

[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Ganeti

Upstream Ganeti, of course, has its own
[issue tracker on GitHub](https://github.com/ganeti/ganeti/issues).

## Monitoring and testing

<!-- TODO: describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->

## Logs and metrics

Ganeti logs a significant amount of information in `/var/log/ganeti/`.
Those logs are of particular interest:

* `node-daemon.log`: all low-level commands and HTTP requests on the
  node daemon, including, for example, LVM and DRBD commands
* `os/*$hostname*.log`: installation log for machine `$hostname`

Ganeti does not currently expose performance metrics that Prometheus
could scrape, but that would be an interesting feature to add.

## Other documentation

* [Ganeti](http://www.ganeti.org/)
* [Ganeti documentation home](http://docs.ganeti.org/)
* [Main manual](http://docs.ganeti.org/ganeti/master/html/)
* [Manual pages](http://docs.ganeti.org/ganeti/master/man/)
* [Wiki](https://github.com/ganeti/ganeti/wiki)
* [Issues](https://github.com/ganeti/ganeti/issues)
* [Google group](https://groups.google.com/forum/#!forum/ganeti)
* [Wikimedia foundation documentation](https://wikitech.wikimedia.org/wiki/Ganeti)
* [Riseup documentation](https://we.riseup.net/riseup+tech/ganeti)
* [DSA](https://dsa.debian.org/howto/install-ganeti/)
* [OSUOSL wiki](https://wiki.osuosl.org/ganeti/)

# Discussion

## Overview

The project of creating a Ganeti cluster for Tor came up in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and set up by weasel by the end of the month.

## Goals

The goal was to replace the aging group of KVM servers (`kvm[1-5]`,
AKA `textile`, `unifolium`, `macrum`, `kvm4` and `kvm5`).

### Must have

* arbitrary virtual machine provisioning
* redundant setup
* automated VM installation
* replacement of existing infrastructure

### Nice to have

* fully configured in Puppet
* full high availability with automatic failover
* extra capacity for new projects

### Non-Goals

* Docker or "container" provisioning - we consider this out of scope for now
* self-provisioning by end-users: TPA remains in control of provisioning

## Approvals required

A budget was proposed by weasel in May 2019 and approved by Vegas in
June. An extension to the budget was approved in January 2020 by
Vegas.

## Proposed Solution

Set up a Ganeti cluster of two machines with a Hetzner vSwitch
backend.

## Cost

The design based on the [PX62 line][PX62-NVMe] has the following
monthly cost structure:

* per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
* IPv4 space: 35.29EUR (/27)
* IPv6 space: 8.40EUR (/64)
* bandwidth cost: 1EUR/TB (currently 38EUR)

At three servers, that adds up to around 435EUR/mth
(3 × 118EUR + 35.29EUR + 8.40EUR + 38EUR ≈ 435EUR). Up-to-date costs
are available in the
[Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395)
spreadsheet.

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->

Note that the instance install is also possible
[through FAI, see the Ganeti wiki for examples](https://github.com/ganeti/ganeti/wiki/System-template-with-FAI).

There are GUIs for Ganeti that we are not using, but could, if we
want to grant more users access:

* [Ganeti Web manager](https://ganeti-webmgr.readthedocs.io/) is a
  "Django based web frontend for managing Ganeti virtualization clusters.
  Since Ganeti only provides a command-line interface, Ganeti Web
  Manager’s goal is to provide a user friendly web interface to Ganeti
  via Ganeti’s Remote API. On top of Ganeti it provides a permission
  system for managing access to clusters and virtual machines, an in
  browser VNC console, and vm state and resource visualizations"
* [Synnefo](https://www.synnefo.org/) is a "complete open source cloud
  stack written in Python that provides Compute, Network, Image, Volume
  and Storage services, similar to the ones offered by AWS. Synnefo
  manages multiple Ganeti clusters at the backend for handling of
  low-level VM operations and uses Archipelago to unify cloud storage.
  To boost 3rd-party compatibility, Synnefo exposes the OpenStack APIs
  to users."
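
Frontends like these rely on Ganeti's Remote API (RAPI). As a minimal
sketch of what that API looks like, the following read-only queries
should work against a cluster master, assuming the `ganeti-rapi`
daemon is running with its defaults (port 5080, read-only queries
allowed without authentication); the master hostname below is
hypothetical:

    # list all instances known to the cluster (-k skips verification of the
    # self-signed RAPI certificate)
    curl -k https://ganeti-master.example.org:5080/2/instances
    # fetch the details of a single instance
    curl -k https://ganeti-master.example.org:5080/2/instances/test-01.torproject.org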