[Ganeti](http://ganeti.org/) is software designed to facilitate the management of virtual machines (KVM or Xen). It helps you move virtual machine instances from one node to another, create an instance with DRBD replication on another node and do the live migration from one to another, etc.

[[_TOC_]]

# Tutorial

## Listing virtual machines (instances)

This will show the running guests, known as "instances":

    gnt-instance list

## Accessing serial console

Our instances do serial console, starting in grub. To access it, run

    gnt-instance console test01.torproject.org

To exit, use `^]` -- that is, Control-<Closing Bracket>.

# How-to

## Glossary

In Ganeti, we use the following terms:

 * **node**: a physical machine that hosts instances
 * **instance**: a virtual machine
 * **master**: the *node* on which we issue Ganeti commands and that supervises all the other nodes

Nodes are interconnected through a private network that is used to communicate commands and synchronise disks (with [howto/drbd](howto/drbd)). Instances are normally assigned two nodes: a *primary* and a *secondary*: the *primary* is where the virtual machine actually runs and the *secondary* acts as a hot failover.

See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html).

## Adding a new instance

This command creates a new guest, or "instance" in Ganeti's vocabulary, with 10G root, 2G swap, 20G spare on SSD, 800G on HDD, 8GB RAM and 2 CPU cores:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=20G \
      --disk 3:size=800G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

### What that does

This configures the following:

 * redundant disks in a DRBD mirror; use `-t plain` instead of `-t drbd` for tests as that avoids syncing of disks and will speed things up considerably (even with `--no-wait-for-sync` there are some operations that block on synced mirrors). Only one node should be provided as the argument for `--node` then.
 * three partitions: one on the default VG (SSD), one on another (HDD) and a swap file on the default VG. If you don't specify a swap device, a 512MB swapfile is created in `/swapfile`. TODO: configure disk 2 and 3 automatically in installer. (`/var` and `/srv`?)
 * 8GB of RAM with 2 virtual CPUs
 * an IP allocated from the public gnt-fsn pool: `gnt-instance add` will print the IPv4 address it picked to stdout. The IPv6 address can be found in `/var/log/ganeti/os/` on the primary node of the instance, see below.
 * with the `test-01.torproject.org` hostname

### Next steps

To find the root password, SSH host key fingerprints, and the IPv6 address, run this **on the node where the instance was created**, for example:

    egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)

We copy root's authorized keys into the new instance, so you should be able to log in with your token. You will be required to change the root password immediately. Pick something nice and document it in `tor-passwords`.

Also set reverse DNS for both IPv4 and IPv6 in [Hetzner's Robot](https://robot.your-server.de/) (check under Servers -> vSwitch -> IPs) or in our own reverse zone files (if delegated).

Then follow [howto/new-machine](howto/new-machine).
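For convenience, that `egrep` lookup can be wrapped in a small helper script run on the primary node. This is only a sketch (the `tpo-instance-info` name is made up for illustration), assuming the `ganeti-instance-debootstrap` log location shown above:

    #!/bin/sh
    # tpo-instance-info: show the root password, SSH host key fingerprints and
    # IPv6 address from the most recent instance install on this node
    # (hypothetical helper, equivalent to the egrep one-liner above)
    set -eu

    # most recent OS install log written during instance creation
    log=$(ls -tr /var/log/ganeti/os/* | tail -1)
    echo "inspecting $log" >&2

    # drop the node's own host keys, keep the instance's details
    egrep 'root password|configured eth0 with|SHA256' "$log" | grep -v "$(hostname)"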
### Known issues

 * **allocator failures**: Note that you may need to use the `--node` parameter to pick on which nodes you want the instance to end up, otherwise Ganeti will choose for you (and may fail). Use, for example, `--node fsn-node-01:fsn-node-02` to use `node-01` as primary and `node-02` as secondary. The allocator can sometimes fail if it is upset about something in the cluster, for example:

        Can't find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

   This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). If this problem occurs, it might be worth [rebalancing the cluster](#rebalancing-a-cluster).

 * **ping failure**: there is a bug in `ganeti-instance-debootstrap` which misconfigures `ping` (among other things), see [bug 31781](https://bugs.torproject.org/31781). It's currently patched in our version of the Debian package, but that patch might disappear if Debian upgrades the package without [shipping our patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538). Note that this was fixed in Debian bullseye and later.

### Other examples

This is the same without the HDD partition, in the `gnt-chi` cluster:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-chi-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=20G \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

A simple test machine, with only 1GB of RAM and 1 CPU, without DRBD, in the FSN cluster:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t plain --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --backend-parameters memory=1g,vcpus=1 \
      test-01.torproject.org

Do not forget to follow the [next steps](#next-steps), above.

### iSCSI integration

To create a VM with iSCSI backing, a disk must first be created on the SAN, then adopted in a VM, which needs to be *reinstalled* on top of that. This is typically how large disks are provisioned in the `gnt-chi` cluster, in the [Cymru POP](howto/new-machine-cymru).

The following instructions assume you are on a node with an [iSCSI initiator properly set up](howto/new-machine-cymru#iscsi-initiator-setup), and the [SAN cluster management tools set up](howto/new-machine-cymru#san-management-tools-setup). It also assumes you are familiar with the `SMcli` tool; see the [storage servers documentation](howto/new-machine-cymru#storage-servers) for an introduction.

This assumes you are creating a 500GB VM, partitioned on the Linux host, *not* on the iSCSI volume. TODO: change those instructions to create one volume per partition, so that those can be resized more easily.

The following is how `tb-build-03` was set up.

 1. create the disk on the SAN and assign it to the host group:

        puppet agent --disable "creating a SAN disk"
        $EDITOR /usr/local/sbin/tpo-create-san-disks
        /usr/local/sbin/tpo-create-san-disks
        puppet agent --enable

    WARNING: the above script needs to be edited before it does the right thing. It will show the LUN numbers in use below. This, obviously, is not ideal, and should be replaced by a Ganeti external storage provider.

    NOTE: the `logicalUnitNumber` here must be an increment from the previous highest LUN. See also the [disk creation instructions](howto/new-machine-cymru#creating-a-disk) for a discussion.

 2. configure the disk on all Ganeti nodes, in Puppet's `profile::ganeti::chi` class:

        iscsi::multipath::alias { 'web-chi-03':
          wwid => '36782bcb00063c6a500000d67603f7abf',
        }

 3. propagate the magic to all nodes in the cluster:

        gnt-cluster command "puppet agent -t ; iscsiadm -m node --rescan ; multipath -r"

 4. confirm that multipath works; it should look something like this:

        root@chi-node-01:~# multipath -ll
        web-chi-03-srv (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi
        size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw
        |-+- policy='round-robin 0' prio=6 status=active
        | |- 11:0:0:4 sdi 8:128 active ready running
        | |- 12:0:0:4 sdj 8:144 active ready running
        | `- 9:0:0:4  sdh 8:112 active ready running
        `-+- policy='round-robin 0' prio=1 status=enabled
          |- 10:0:0:4 sdk 8:160 active ghost running
          |- 7:0:0:4  sdl 8:176 active ghost running
          `- 8:0:0:4  sdm 8:192 active ghost running
        root@chi-node-01:~#

    and the device `/dev/mapper/web-chi-03` should exist.

 5. adopt the disks in Ganeti:

        gnt-instance add \
          -n chi-node-04.torproject.org \
          -o debootstrap+bullseye \
          -t blockdev --no-wait-for-sync \
          --net 0:ip=pool,network=gnt-chi-01 \
          --no-ip-check \
          --no-name-check \
          --disk 0:adopt=/dev/disk/by-id/dm-name-tb-build-03-root \
          --disk 1:adopt=/dev/disk/by-id/dm-name-tb-build-03-swap,name=swap \
          --disk 2:adopt=/dev/disk/by-id/dm-name-tb-build-03-srv \
          --backend-parameters memory=16g,vcpus=8 \
          tb-build-03.torproject.org

    NOTE: the actual node must be manually picked because the `hail` allocator doesn't seem to know about block devices.

 6. at this point, the VM probably doesn't boot, because for some reason `gnt-instance-debootstrap` doesn't fire when disks are adopted. So you need to reinstall the machine, which involves stopping it first:

        gnt-instance shutdown --timeout=0 tb-build-03
        gnt-instance reinstall tb-build-03

    HACK: the current installer fails on weird partitioning errors, see [upstream bug 13](https://github.com/ganeti/instance-debootstrap/issues/13). We applied [patch 14](https://github.com/ganeti/instance-debootstrap/pull/14) on `chi-node-04` and sent it upstream for review before committing to maintaining this in Debian or elsewhere. It should be tested on other installs beforehand as well.

From here on, follow the [next steps](#next-steps) above.

TODO: This would ideally be automated by an external storage provider, see the [storage reference for more information](#storage).

### Troubleshooting

If a Ganeti instance install fails, it will show the end of the install log, for example:

```
Thu Aug 26 14:11:09 2021 - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org
Thu Aug 26 14:11:09 2021 - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Thu Aug 26 14:11:09 2021 - INFO: Chose IP 38.229.82.29 from network gnt-chi-01
Thu Aug 26 14:11:10 2021 * creating instance disks...
Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config
Thu Aug 26 14:12:58 2021 adding disks to cluster config
Thu Aug 26 14:13:00 2021 * checking mirrors status
Thu Aug 26 14:13:01 2021 - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated)
Thu Aug 26 14:13:01 2021 - INFO: - device disk/2: 0.60% done, 55m 26s remaining (estimated)
Thu Aug 26 14:13:01 2021 * checking mirrors status
Thu Aug 26 14:13:02 2021 - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated)
Thu Aug 26 14:13:02 2021 - INFO: - device disk/2: 0.60% done, 52m 13s remaining (estimated)
Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS
Thu Aug 26 14:13:03 2021 * running the instance OS create scripts...
Thu Aug 26 14:16:31 2021 * resuming disk sync
Failure: command execution error:
Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file:
Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ...
Setting up openssh-server (1:7.9p1-10+deb10u2) ...
Creating SSH2 RSA key; this may take some time ...
2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519)
Created symlink /etc/systemd/system/sshd.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
invoke-rc.d: could not determine current runlevel
Setting up ssh (1:7.9p1-10+deb10u2) ...
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for libc-bin (2.28-10) ...
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100
Using disk /dev/drbd4 as swap...
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4
Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: )
root@chi-node-01:~#
```

Here the failure that tripped the install is:

```
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

But the actual error is higher up, and we need to go look at the logs on the server for this. In this case, in `chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log`, we can find the real problem:

```
Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: Couldn't identify type of root file system for fsck hook
/etc/kernel/postinst.d/zz-update-grub:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?).
run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1
dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure):
 installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1
```

In this case, oddly enough, even though Ganeti thought the install had failed, the machine can actually start:

```
gnt-instance start tb-pkgstage-01.torproject.org
```

...
and after a while, we can even get a console:

```
gnt-instance console tb-pkgstage-01.torproject.org
```

And in *that* case, the procedure can just continue from here on: reset the root password, and just make sure you finish the install:

```
apt install linux-image-amd64
```

In the above case, the `sources-list` post-install hook was buggy: it wasn't mounting `/dev` and friends before launching the upgrades, which was causing issues when a kernel upgrade was queued.

And *if* you are debugging an installer and by mistake end up with half-open filesystems and stray DRBD devices, do take a look at the [LVM](howto/lvm) and [DRBD documentation](howto/drbd).

## Modifying an instance

### CPU, memory changes

It's possible to change the IP, CPU, or memory allocation of an instance using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command:

    gnt-instance modify -B vcpus=4 test1.torproject.org
    gnt-instance modify -B memory=8g test1.torproject.org
    gnt-instance reboot test1.torproject.org

### IP address change

IP address changes require a full stop of the instance and manual changes to the `/etc/network/interfaces*` files:

    gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
    gnt-instance stop test1.torproject.org
    gnt-instance start test1.torproject.org
    gnt-instance console test1.torproject.org

### Resizing disks

The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size of the underlying device:

    gnt-instance grow-disk --absolute test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

The number `0` in this context indicates the first disk of the instance. The amount specified is the final disk size (because of the `--absolute` flag). In the above example, the final disk size will be 16GB.

To *add* space to the existing disk, remove the `--absolute` flag:

    gnt-instance grow-disk test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

In the above example, 16GB will be **ADDED** to the disk. Be careful with resizes, because it's not possible to revert such a change: `grow-disk` does *not* support shrinking disks. The only way to revert the change is by exporting / importing the instance.

Note that the reboot above will impose a downtime. See [upstream bug 28](https://github.com/ganeti/ganeti/issues/28) about improving that.

Then the filesystem needs to be resized inside the VM:

    ssh root@test1.torproject.org

#### Resizing under LVM

Use `pvs` to display information about the physical volumes:

    root@cupani:~# pvs
      PV         VG      Fmt  Attr PSize  PFree
      /dev/sdc   vg_test lvm2 a--  <8.00g 1020.00m

Resize the physical volume to take up the new space:

    pvresize /dev/sdc

Use `lvs` to display information about logical volumes:

    # lvs
      LV          VG             Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
      var-opt     vg_test-01     -wi-ao---- <10.00g
      test-backup vg_test-01_hdd -wi-ao---- <20.00g

Use `lvextend` to add space to the volume:

    lvextend -l '+100%FREE' vg_test-01/var-opt

Finally resize the filesystem:

    resize2fs /dev/vg_test-01/var-opt

See also the [LVM howto](howto/lvm).

#### Resizing without LVM, no partitions

If there's no LVM inside the VM (a more common configuration nowadays), the above procedure will obviously not work. If this is a secondary disk (e.g. `/dev/sdc`) there is a good chance the filesystem was created directly on it (without a partition table) and that you do not need to repartition the drive.
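To confirm you are in that situation, you can ask `lsblk` for the filesystem type of the device itself. This is a minimal check, assuming the secondary disk is `/dev/sdc` as in the example below:

    # if this prints a filesystem type (e.g. ext4), the filesystem sits
    # directly on the disk and there is no partition table to worry about
    lsblk --noheadings --output FSTYPE /dev/sdc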
This is an example of a good configuration if we want to resize `sdc`: ``` root@bacula-director-01:~# lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 10G 0 disk └─sda1 8:1 0 10G 0 part / sdb 8:16 0 2G 0 disk [SWAP] sdc 8:32 0 250G 0 disk /srv ``` Note that if we would need to resize `sda`, we'd have to follow the other procedure, in the next section. If we check the free disk space on the device we will notice it has not changed yet: ``` # df -h /srv Filesystem Size Used Avail Use% Mounted on /dev/sdc 196G 160G 27G 86% /srv ``` The resize is then simply: ``` # resize2fs /dev/sdc resize2fs 1.44.5 (15-Dec-2018) Filesystem at /dev/sdc is mounted on /srv; on-line resizing required old_desc_blocks = 25, new_desc_blocks = 32 The filesystem on /dev/sdc is now 65536000 (4k) blocks long. ``` Read on for the most complicated scenario. #### Resizing without LVM, with partitions If the filesystem to resize is not *directly* on the device, you will need to resize the partition manually, which can be done using fdisk. In the following example we have a `sda1` partition that we want to extend from 10G to 20G to fill up the free space on `/dev/sda`. Here is what the partition layout looks like before the resize: ``` # lsblk -a NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT fd0 2:0 1 4K 0 disk sda 8:0 0 20G 0 disk └─sda1 8:1 0 10G 0 part / sdb 8:16 0 2G 0 disk [SWAP] sdc 8:32 0 40G 0 disk /srv ``` We use fdisk on the device: ``` # fdisk /dev/sda Welcome to fdisk (util-linux 2.33.1). Changes will remain in memory only, until you decide to write them. Be careful before using the write command. Command (m for help): p # prints the partition table Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors Disk model: QEMU HARDDISK Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x73ab5f76 Device Boot Start End Sectors Size Id Type /dev/sda1 * 2048 20971519 20969472 10G 83 Linux # note the starting sector for later ``` Now we delete the partition. Note that the data will not be deleted, only the partition table will be altered: ``` Command (m for help): d Selected partition 1 Partition 1 has been deleted. Command (m for help): p Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors Disk model: QEMU HARDDISK Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0x73ab5f76 ``` Now we create the new partition to take up the whole space: ``` Command (m for help): n Partition type p primary (0 primary, 0 extended, 4 free) e extended (container for logical partitions) Select (default p): p Partition number (1-4, default 1): 1 First sector (2048-41943039, default 2048): 2048 # this is the starting sector from above. Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-41943039, default 41943039): 41943039 Created a new partition 1 of type 'Linux' and of size 20 GiB. Partition #1 contains a ext4 signature. Do you want to remove the signature? 
[Y]es/[N]o: n # we want to keep the previous signature

Command (m for help): p
Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x73ab5f76

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1        2048 41943039 41940992  20G 83 Linux

Command (m for help): w
The partition table has been altered.
Syncing disks.
```

Now we check the partitions:

```
# lsblk -a
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0      2:0    1   4K  0 disk
sda      8:0    0  20G  0 disk
└─sda1   8:1    0  20G  0 part /
sdb      8:16   0   2G  0 disk [SWAP]
sdc      8:32   0  40G  0 disk /srv
```

If we check the free disk space on the device we will notice it has not changed yet:

```
# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       9.8G  8.5G  874M  91% /
```

We need to resize it:

```
# resize2fs /dev/sda1
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sda1 is mounted on /; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 3
The filesystem on /dev/sda1 is now 5242624 (4k) blocks long.
```

The resize is now complete.

#### Resizing an iSCSI LUN

Growing a disk hosted on an iSCSI SAN like the Dell PowerVault MD3200i involves several steps, beginning with resizing the LUN itself.

First, we identify how much space is available on the virtual disk's diskGroup:

    # SMcli -n chi-san-01 -c "show allVirtualDisks summary;"

    STANDARD VIRTUAL DISKS SUMMARY
    Number of standard virtual disks: 5

    Name            Thin Provisioned  Status   Capacity    Accessible by       Source
    example-01-srv  No                Optimal  700.000 GB  Host Group gnt-chi  Disk Group 5

This shows that `example-01-srv` is hosted on Disk Group "5":

    # SMcli -n chi-san-01 -c "show diskGroup [5];"

    DETAILS

    Name: 5
    Status: Optimal
    Capacity: 1,852.026 GB
    Current owner: RAID Controller Module in slot 1

    Data Service (DS) Attributes

    RAID level: 5
    Physical Disk media type: Physical Disk
    Physical Disk interface type: Serial Attached SCSI (SAS)
    Enclosure loss protection: No
    Secure Capable: No
    Secure: No

    Total Virtual Disks: 1
    Standard virtual disks: 1
    Repository virtual disks: 0
    Free Capacity: 1,152.026 GB

    Associated physical disks - present (in piece order)
    Total physical disks present: 3

    Enclosure Slot
    0         6
    1         11
    0         7

`Free Capacity` indicates about 1.5 TB of free space available. So we can go ahead with the actual resize:

    # SMcli -n chi-san-01 -p $PASSWORD -c "set virtualdisk [\"example-01-srv\"] addCapacity=100GB;"

Next, on the instance's primary node we need to tell `iscsiadm` to rescan the LUN. To do this we first need to learn the iSCSI `targetname` to run the rescan command against.

    # multipath -ll

This shows which device nodes (e.g. `sdw`) are associated with the volume we need to resize. There are usually six such nodes for each iSCSI LUN, and they will be listed under the same "Target:" header in the output of the next command:

    # iscsiadm -m session -P 3 | grep -e ^Target -e 'Attached scsi disk' -e 'Current Portal'

To trigger the iSCSI rescan:

    # iscsiadm -m node --targetname iqn.foo.org.example -R

The success of this step can be validated by looking at the output of `lsblk`: the device nodes associated with the LUN should now display the new size.

Next, we need to also *kick* `multipathd` to make it rescan the iSCSI LUN. The volume name used here must correspond to the volume name in the output of `multipath -ll`.
    # multipathd -v3 -k"resize map example-01-srv"

Another look at the output of `multipath -ll` should confirm the volume now reflects the new size of the underlying iSCSI LUN.

In order for Ganeti/QEMU to make this extra space available to the instance, a reboot must be performed from outside the instance. Depending on whether LVM volumes or partitions are used within the VM, there could be extra steps required before running the `resize2fs` command. See the instructions above for details on how to resize those bits.

### Adding disks

A disk can be added to an instance with the `modify` command as well. This, for example, will add a 100GB disk to the `test1` instance on the `vg_ganeti_hdd` volume group, which is "slow" rotating disks:

    gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
    gnt-instance reboot test1.torproject.org

### Changing disk type

Say you have a test instance that was created with a `plain` disk template, but you actually want it in production, with a `drbd` disk template. Switching to `drbd` is easy:

    gnt-instance shutdown test-01
    gnt-instance modify -t drbd test-01
    gnt-instance start test-01

The second command will use the allocator to find a secondary node. If that fails, you can assign a node manually with `-n`. You can also switch back to `plain`, although you should generally never do that.

See also the [upstream procedure](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#conversion-of-an-instance-s-disk-type) and [design document](https://docs.ganeti.org/docs/ganeti/3.0/html/design-disk-conversion.html).

### Adding a network interface on the rfc1918 vlan

We have a vlan on which VMs without public addresses sit. Its vlan id is 4002 and it's backed by the Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic". Note that traffic on this vlan will travel in the clear between nodes.
To add an instance to this vlan, give it a second network interface using:

    gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org

## Destroying an instance

This totally deletes the instance, including all mirrors and everything; be very careful with it:

    gnt-instance remove test01.torproject.org

## Getting information

Information about an instance can be found in the rather verbose `gnt-instance info`:

    root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
    - Instance name: tb-build-02.torproject.org
      UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
      Serial number: 5
      Creation time: 2020-12-15 14:06:41
      Modification time: 2020-12-15 14:07:31
      State: configured to be up, actual state is up
      Nodes:
        - primary: fsn-node-03.torproject.org
          group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
        - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
      Operating system: debootstrap+buster

A quicker command shows just the primary and secondary for a given instance:

    gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes

An equivalent command will show the primary and secondary for *all* instances, along with extra information (like the CPU count, memory and disk usage):

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort

It can be useful to run this in a loop to see changes:

    watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'

## Disk operations (DRBD)

Instances should be set up using the DRBD backend, in which case you should probably take a look at [howto/drbd](howto/drbd) if you have problems with that. Ganeti handles most of the logic there so that should generally not be necessary.

## Evaluating cluster capacity

This will list instances along with their assigned memory, and compare it with the nodes' capacity:

    gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort && echo && gnt-node list

The latter does not show disk usage for secondary volume groups (see [upstream issue 1379](https://github.com/ganeti/ganeti/issues/1379)); for a complete picture of disk usage, use:

    gnt-node list-storage

The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's enough space on secondaries to account for the failure of a node. Healthy output looks like this:

    root@fsn-node-01:~# gnt-cluster verify
    Submitted jobs 48030, 48031
    Waiting for job 48030 ...
    Fri Jan 17 20:05:42 2020 * Verifying cluster config
    Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
    Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
    Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
    Waiting for job 48031 ...
    Fri Jan 17 20:05:42 2020 * Verifying group 'default'
    Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
    Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
    Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
    Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
    Fri Jan 17 20:05:45 2020 * Verifying node status
    Fri Jan 17 20:05:45 2020 * Verifying instance status
    Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
    Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
    Fri Jan 17 20:05:45 2020 * Other Notes
    Fri Jan 17 20:05:45 2020 * Hooks Results

A sick node would have said something like this instead:

    Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail

See the [ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example.

Also note the `hspace -L` command, which can tell you how many instances can be created in a given cluster. It uses the "standard" instance template defined in the cluster (which we haven't configured yet).

## Moving instances and failover

Ganeti is smart about assigning instances to nodes. There's also a command (`hbal`) to automatically rebalance the cluster (see below). If for some reason `hbal` doesn't do what you want or you need to move things around for other reasons, here are a few commands that might be handy.

Make an instance switch to using its secondary:

    gnt-instance migrate test1.torproject.org

Make all instances on a node switch to their secondaries:

    gnt-node migrate fsn-node-02.torproject.org

The `migrate` command does a "live" migration which should avoid any downtime. It might be preferable to actually shut down the machine for some reason (for example if we want to reboot because of a security upgrade). Or we might not be able to live-migrate because the node is down. In this case, we do a [failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance):

    gnt-instance failover test1.torproject.org

The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given node altogether, in case of an emergency:

    gnt-node evacuate -I . fsn-node-02.torproject.org

Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to hard-recover from a completely crashed node:

    gnt-node failover fsn-node-02.torproject.org

Note that you might need the `--ignore-consistency` flag if the node is unresponsive.

## Importing external libvirt instances

Assumptions:

 * `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`)
 * `SPARE_NODE`: a ganeti node with free space (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated
 * `MASTER_NODE`: the master ganeti node (e.g.
`fsn-node-01.torproject.org`) * `KVM_HOST`: the machine which we migrate the `INSTANCE` from * the `INSTANCE` has only `root` and `swap` partitions * the `SPARE_NODE` has space in `/srv/` to host all the virtual machines to import, to check, use: fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' <f | awk '{s+=$1} END {print s}' You will very likely need to create a `/srv` big enough for this, for example: lvcreate -L 300G vg_ganeti -n srv-tmp && mkfs /dev/vg_ganeti/srv-tmp && mount /dev/vg_ganeti/srv-tmp /srv Import procedure: 1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating cluster capacity" above, when in doubt) and find on which KVM HOST the INSTANCE lives 2. copy the disks, without downtime: ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST 3. copy the disks again, this time suspending the machine: ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt 4. renumber the host: ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE 5. test services by changing your `/etc/hosts`, possibly warning service admins: > Subject: $INSTANCE IP address change planned for Ganeti migration > > I will soon migrate this virtual machine to the new ganeti cluster. this > will involve an IP address change which might affect the service. > > Please let me know if there are any problems you can think of. in > particular, do let me know if any internal (inside the server) or external > (outside the server) services hardcodes the IP address of the virtual > machine. > > A test instance has been setup. You can test the service by > adding the following to your /etc/hosts: > > 116.202.120.182 $INSTANCE > 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE 6. destroy test instance: gnt-instance remove $INSTANCE 7. lower TTLs to 5 minutes. this procedure varies a lot according to the service, but generally if all DNS entries are `CNAME`s pointing to the main machine domain name, the TTL can be lowered by adding a `dnsTTL` entry in the LDAP entry for this host. For example, this sets the TTL to 5 minutes: dnsTTL: 300 Then to make the changes immediate, you need the following commands: ssh root@alberti.torproject.org sudo -u sshdist ud-generate && ssh root@nevii.torproject.org ud-replicate Warning: if you migrate one of the hosts ud-ldap depends on, this can fail and not only the TTL will not update, but it might also fail to update the IP address in the below procedure. See [ticket 33766](https://bugs.torproject.org/33766) for details. 8. shutdown original instance and redo migration as in step 3 and 4: fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' && ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt && ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE 9. final test procedure TODO: establish host-level test procedure and run it here. 10. 
switch to DRBD, still on the Ganeti MASTER NODE:

        gnt-instance stop $INSTANCE &&
        gnt-instance modify -t drbd $INSTANCE &&
        gnt-instance failover -f $INSTANCE &&
        gnt-instance start $INSTANCE

    The above can sometimes fail if the allocator is upset about something in the cluster, for example:

        Can't find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

    This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). To work around the allocator, you can specify a secondary node directly:

        gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
        gnt-instance failover -f $INSTANCE &&
        gnt-instance start $INSTANCE

    TODO: move into fabric, maybe in a `libvirt-import-live` or `post-libvirt-import` job that would also do the renumbering below.

11. change IP address in the following locations:

    * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!). Also drop the dnsTTL attribute while you're at it.
    * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
    * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
    * nagios (don't forget to change the parent)
    * reverse DNS (upstream web UI, e.g. Hetzner Robot)
    * grep for the host's IP address on itself:

          grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
          grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv

    * grep for the host's IP on *all* hosts:

          cumin-all-puppet
          cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'

    TODO: move those jobs into fabric

12. retire old instance (only a tiny part of [howto/retire-a-host](howto/retire-a-host)):

        ./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST

13. update the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) to remove the machine from the KVM host

14. warn users about the migration, for example:

    > To: tor-project@lists.torproject.org
    > Subject: cupani AKA git-rw IP address changed
    >
    > The main git server, cupani, is the machine you connect to when you push
    > or pull git repositories over ssh to git-rw.torproject.org. That
    > machine has been migrated to the new Ganeti cluster.
    >
    > This required an IP address change from:
    >
    >     78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
    >
    > to:
    >
    >     116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2
    >
    > DNS has been updated and preliminary tests show that everything is
    > mostly working. You *will* get a warning about the IP address change
    > when connecting over SSH, which will go away after the first
    > connection.
    >
    >     Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.
    >
    > That is normal. The SSH fingerprints of the host did *not* change.
    >
    > Please do report any other anomaly using the normal channels:
    >
    > https://gitlab.torproject.org/tpo/tpa/team/-/wikis/support
    >
    > The service was unavailable for about an hour during the migration.

## Importing external libvirt instances, manual

This procedure is now easier to accomplish with the Fabric tools written especially for this purpose. Use the above procedure instead. This is kept for historical reference.

Assumptions:

 * `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`)
 * `SPARE_NODE`: a ganeti node with free space (e.g.
`fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated * `MASTER_NODE`: the master ganeti node (e.g. `fsn-node-01.torproject.org`) * `KVM_HOST`: the machine which we migrate the `INSTANCE` from * the `INSTANCE` has only `root` and `swap` partitions Import procedure: 1. pick a viable SPARE NODE to import the instance (see "evaluating cluster capacity" above, when in doubt), login to the three servers, setting the proper environment everywhere, for example: MASTER_NODE=fsn-node-01.torproject.org SPARE_NODE=fsn-node-03.torproject.org KVM_HOST=kvm1.torproject.org INSTANCE=test.torproject.org 2. establish VM specs, on the KVM HOST: * disk space in GiB: for disk in /srv/vmstore/$INSTANCE/*; do printf "$disk: " echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l done * number of CPU cores: sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml * memory, assuming from KiB to GiB: echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l TODO: make sure the memory line is in KiB and that the number makes sense. * on the INSTANCE, find the swap device UUID so we can recreate it later: blkid -t TYPE=swap -s UUID -o value 3. setup a copy channel, on the SPARE NODE: ssh-agent bash ssh-add /etc/ssh/ssh_host_ed25519_key cat /etc/ssh/ssh_host_ed25519_key.pub on the KVM HOST: echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root 4. copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE: rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true Note: it's possible there is not enough room in `/srv`: in the base Ganeti installs, everything is in the same root partition (`/`) which will fill up if the instance is (say) over ~30GiB. In that case, create a filesystem in `/srv`: (mkdir /root/srv && mv /srv/* /root/srv true) || true && lvcreate -L 200G vg_ganeti -n srv && mkfs /dev/vg_ganeti/srv && echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab && mount /srv && ( mv /root/srv/* ; rmdir /root/srv ) This partition can be reclaimed once the VM migrations are completed, as it needlessly takes up space on the node. 5. on the SPARE NODE, create and initialize a logical volume with the predetermined size: lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm Note how we assume two disks above, but the instance might have a different configuration that would require changing the above. The above, common, configuration is to have an LVM disk separate from the "root" disk, the former being on a HDD, but the HDD is sometimes completely omitted and sizes can differ. Sometimes it might be worth using pv to get progress on long transfers: qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k TODO: ideally, the above procedure (and many steps below as well) would be automatically deduced from the disk listing established in the first step. 6. 
on the MASTER NODE, create the instance, adopting the LV:

        gnt-instance add -t plain \
          -n fsn-node-03 \
          --disk 0:adopt=$INSTANCE-root \
          --disk 1:adopt=$INSTANCE-swap \
          --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
          --backend-parameters memory=2g,vcpus=2 \
          --net 0:ip=pool,network=gnt-fsn \
          --no-name-check \
          --no-ip-check \
          -o debootstrap+default \
          $INSTANCE

7. cross your fingers and watch the party:

        gnt-instance console $INSTANCE

8. IP address change on new instance: edit `/etc/hosts` and `/etc/network/interfaces` by hand and add the IPv4 and IPv6 addresses. The IPv4 configuration can be found in:

        gnt-instance info $INSTANCE

    The IPv6 address can be guessed by concatenating `2a01:4f8:fff0:4f::` and the IPv6 link-local address without the `fe80::` prefix. For example: a link-local address of `fe80::266:37ff:fe65:870f/64` should yield the following configuration:

        iface eth0 inet6 static
            accept_ra 0
            address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
            gateway 2a01:4f8:fff0:4f::1

    TODO: reuse `gnt-debian-interfaces` from the ganeti puppet module script here?

9. functional tests: change your `/etc/hosts` to point to the new server and see if everything still kind of works

10. shut down original instance

11. resync and reconvert image, on the Ganeti MASTER NODE:

        gnt-instance stop $INSTANCE

    on the Ganeti node:

        rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
        qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root &&
        rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
        qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm

12. switch to DRBD, still on the Ganeti MASTER NODE:

        gnt-instance modify -t drbd $INSTANCE
        gnt-instance failover $INSTANCE
        gnt-instance startup $INSTANCE

13. redo IP address change in `/etc/network/interfaces` and `/etc/hosts`

14. final functional test

15. change IP address in the following locations:

    * nagios (don't forget to change the parent)
    * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
    * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
    * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
    * reverse DNS (upstream web UI, e.g. Hetzner Robot)

16. decommission old instance ([howto/retire-a-host](howto/retire-a-host))

### Troubleshooting

 * if boot takes a long time and you see a message like this on the console:

        [ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)

   ... which is generally followed by:

        [DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
        [DEPEND] Dependency failed for Swap.

   it means the swap device UUID wasn't set up properly, and does not match the one provided in `/etc/fstab`. That is probably because you missed the `mkswap -U` step documented above.

### References

 * [Upstream docs](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances) have the canonical incantation:

        gnt-instance add -t plain -n HOME_NODE ...
          --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME

 * [DSA docs](https://dsa.debian.org/howto/install-ganeti/) also use disk adoption and have a procedure to migrate to DRBD
 * [Riseup docs](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) suggest creating a VM without installing, shutting down and then syncing

Ganeti [supports importing and exporting](http://docs.ganeti.org/ganeti/2.15/html/design-ovf-support.html?highlight=qcow) from the [Open Virtualization Format](https://en.wikipedia.org/wiki/Open_Virtualization_Format) (OVF), but unfortunately it [doesn't seem libvirt supports *exporting* to OVF](https://forums.centos.org/viewtopic.php?t=49231). There's a [virt-convert](http://manpages.debian.org/virt-convert) tool which can *import* OVF, but not the reverse. The [libguestfs](http://www.libguestfs.org/) library also has a [converter](http://www.libguestfs.org/virt-v2v.1.html) but it also doesn't support exporting to OVF or anything Ganeti can load directly.

So people have written [their own conversion tools](https://virtuallyhyper.com/2013/06/migrate-from-libvirt-kvm-to-virtualbox/) or [their own conversion procedure](https://scienceofficersblog.blogspot.com/2014/04/using-cloud-images-with-ganeti.html).

Ganeti also supports [file-backed instances](http://docs.ganeti.org/ganeti/2.15/html/design-file-based-storage.html) but "adoption" is specifically designed for logical volumes, so it doesn't work for our use case.

## Rebooting

Ganeti nodes need special care, as we can accomplish zero-downtime reboots on those machines. The `reboot` script in `tsa-misc` takes care of the special steps involved (which is basically to empty a node before rebooting it).

Such a reboot should be run interactively, inside a `tmux` or `screen` session, and takes over 15 minutes to complete right now, but depends on the size of the cluster (in terms of core memory usage).

Once the reboot is completed, all instances might end up on a single machine, and the cluster might need to be rebalanced, see below. (Note: the update script should eventually do that, see [ticket 33406](https://bugs.torproject.org/33406).)

## Rebalancing a cluster

After a reboot or a downtime, all instances might end up on the same node. This is normally handled by the reboot script, but it might be desirable to do this by hand if there was a crash or another special condition.

This can be easily corrected with this command, which will spread instances around the cluster to balance it:

    hbal -L -C -v -P

The above will show the proposed solution, with the state of the cluster before and after (`-P`), and the commands to get there (`-C`). To actually execute the commands, you can copy-paste them. An alternative is to pass the `-X` argument, to tell `hbal` to actually issue the commands itself:

    hbal -L -C -v -P -X

This will automatically move the instances around and rebalance the cluster.
Here's an example run on a small cluster:

    root@fsn-node-01:~# gnt-instance list
    Instance                          Hypervisor OS                 Primary_node               Status  Memory
    loghost01.torproject.org          kvm        debootstrap+buster fsn-node-02.torproject.org running   2.0G
    onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-02.torproject.org running  12.0G
    static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
    web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    root@fsn-node-01:~# hbal -L -X
    Loaded 2 nodes, 5 instances
    Group size 2 nodes, 5 instances
    Selected node group: default
    Initial check done: 0 bad nodes, 0 bad instances.
    Initial score: 8.45007519
    Trying to minimize the CV...
        1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   4.98124611 a=f
        2. loghost01          fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   1.78271883 a=f
    Cluster score improved from 8.45007519 to 1.78271883
    Solution length=2
    Got job IDs 16345
    Got job IDs 16346
    root@fsn-node-01:~# gnt-instance list
    Instance                          Hypervisor OS                 Primary_node               Status  Memory
    loghost01.torproject.org          kvm        debootstrap+buster fsn-node-01.torproject.org running   2.0G
    onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-01.torproject.org running  12.0G
    static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
    web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G

In the above example, you should notice that the `web-fsn` instances both ended up on the same node. That's because the balancer did not know that they should be distributed. A special configuration was done, below, to avoid that problem in the future. But as a workaround, instances can also be moved by hand and the cluster re-balanced.

Also notice that `-X` does not show the job output; use `ganeti-watch-jobs` for that, in another terminal. See the [job inspection](#job-inspection) section for more details on that.

### Redundant instances distribution

Some instances are redundant across the cluster and should *not* end up on the same node. A good example is the `web-fsn-01` and `web-fsn-02` pair of instances which, in theory, would serve similar traffic. If they end up on the same node, it might flood the network on that machine or at least defeat the purpose of having redundant machines.

The way to ensure they get distributed properly by the balancing algorithm is to "tag" them. For the web nodes, for example, this was performed on the master:

    gnt-cluster add-tags htools:iextags:service
    gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn
    gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn

This tells Ganeti that `web-fsn` is an "exclusion tag" and the optimizer will not try to schedule instances with those tags on the same node.

To see which tags are present, use:

    # gnt-cluster list-tags
    htools:iextags:service

You can also find which instances are assigned to a tag with:

    # gnt-cluster search-tags service
    /cluster htools:iextags:service
    /instances/web-fsn-01.torproject.org service:web-fsn
    /instances/web-fsn-02.torproject.org service:web-fsn

IMPORTANT: a previous version of this article mistakenly indicated that a new cluster-level tag had to be created for each service. That method did *not* work.
The [hbal manpage](http://docs.ganeti.org/ganeti/current/man/hbal.html#exclusion-tags) explicitly mentions that the cluster-level tag is a *prefix* that can be used to create *multiple* such tags. This configuration also happens to be simpler and easier to use...

### HDD migration restrictions

Cluster balancing works well until there are inconsistencies between how nodes are configured. In our case, some nodes have HDDs (Hard Disk Drives, AKA spinning rust) and others do not. Therefore, it's not possible to move an instance from a node with a disk allocated on the HDD to a node that does not have such a disk. Yet somehow the allocator is not smart enough to tell, and you will get the following error when doing an automatic rebalancing, as one of the migrations fails and stops the cluster balance:

    Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd

In this case, it is trying to migrate the `gitlab-02` server from `fsn-node-01` (which has an HDD) to `fsn-node-07` (which hasn't), which naturally fails.

This is a known limitation of the Ganeti code. There has been a [draft design document for multiple storage unit support](http://docs.ganeti.org/ganeti/master/html/design-multi-storage-htools.html) since 2015, but it has [never been implemented](https://github.com/ganeti/ganeti/issues/865). There have been multiple issues reported upstream on the subject:

 * [208: Bad behaviour when multiple volume groups exists on nodes](https://github.com/ganeti/ganeti/issues/208)
 * [1199: unable to mark storage as unavailable for allocation](https://github.com/ganeti/ganeti/issues/1199)
 * [1240: Disk space check with multiple VGs is broken](https://github.com/ganeti/ganeti/issues/1240)
 * [1379: Support for displaying/handling multiple volume groups](https://github.com/ganeti/ganeti/issues/1379)

Unfortunately, there are no known workarounds for this, at least not that fix the `hbal` command. It *is* possible to exclude the faulty migration from the pool of possible moves, however, for example in the above case:

    hbal -L -v -C -P --exclude-instances gitlab-02.torproject.org

It's also possible to use the `--no-disk-moves` option to avoid disk move operations altogether. Both workarounds obviously do not correctly balance the cluster...

Note that we have also tried to use `htools:migration` tags to work around that issue, but [those do not work for secondary instances](https://github.com/ganeti/ganeti/issues/1497). For this we would need to set up [node groups](http://docs.ganeti.org/ganeti/current/html/man-gnt-group.html) instead.

A good trick is to look at the solution proposed by `hbal`:

    Trying to minimize the CV...
        1. tbb-nightlies-master fsn-node-01:fsn-node-02 => fsn-node-04:fsn-node-02   6.12095251 a=f r:fsn-node-04 f
        2. bacula-director-01   fsn-node-01:fsn-node-03 => fsn-node-03:fsn-node-01   4.56735007 a=f
        3. staticiforme         fsn-node-02:fsn-node-04 => fsn-node-02:fsn-node-01   3.99398707 a=r:fsn-node-01
        4. cache01              fsn-node-07:fsn-node-05 => fsn-node-07:fsn-node-01   3.55940346 a=r:fsn-node-01
        5. vineale              fsn-node-05:fsn-node-06 => fsn-node-05:fsn-node-01   3.18480313 a=r:fsn-node-01
        6. pauli                fsn-node-06:fsn-node-07 => fsn-node-06:fsn-node-01   2.84263128 a=r:fsn-node-01
        7. neriniflorum         fsn-node-05:fsn-node-02 => fsn-node-05:fsn-node-01   2.59000393 a=r:fsn-node-01
        8. static-master-fsn    fsn-node-01:fsn-node-02 => fsn-node-02:fsn-node-01   2.47345604 a=f
        9. polyanthum           fsn-node-02:fsn-node-07 => fsn-node-07:fsn-node-02   2.47257956 a=f
       10. forrestii            fsn-node-07:fsn-node-06 => fsn-node-06:fsn-node-07   2.45119245 a=f
    Cluster score improved from 8.92360196 to 2.45119245

Look at the last column. The `a=` field shows what "action" will be taken. An `f` is a failover (or "migrate"), and an `r:` is a `replace-disks`, with the new secondary after the colon (`:`).

In the above case, the proposed solution is correct: no secondary node is in the range of nodes that lack HDDs (`fsn-node-0[5-7]`). If one of the disk replaces hits one of the nodes without a HDD, that's when you use `--exclude-instances` to find a better solution. A typical exclude is:

    hbal -L -v -C -P --exclude-instances=bacula-director-01,tbb-nightlies-master,eugeni,winklerianum,woronowii,rouyi,loghost01,materculae,gayi,weissii

Another option is to specifically look for instances that do not have a HDD and migrate only those. In my situation, `gnt-cluster verify` was complaining that `fsn-node-02` was full, so I looked for all the instances on that node and found the ones which didn't have a HDD:

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
      | sort | grep 'fsn-node-02' | awk '{print $3}' | \
      while read instance ; do
        printf "checking $instance: "
        if gnt-instance info $instance | grep -q hdd ; then
          echo "HAS HDD"
        else
          echo "NO HDD"
        fi
      done

Then you can manually `migrate -f` (to fail over to the secondary) and `replace-disks -n` (to find another secondary) the instances that *can* be migrated out of the first four machines (which have HDDs) to the last three (which do not). Look at the memory usage in `gnt-node list` to pick the best node.

In general, if a given node in the first four is overloaded, a good trick is to look for one that can be failed over, with, for example:

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'

... or, for a particular node (say fsn-node-04):

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'

The instances listed there would be ones that can be migrated to their secondary to give `fsn-node-04` some breathing room.

## Adding and removing addresses on instances

Say you created an instance but forgot to assign an extra IP. You can still do so with:

    gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org

## Job inspection

Sometimes it can be useful to look at the active jobs. It might be, for example, that another user has queued a bunch of jobs in another terminal which you do not have access to, or some automated process did (Nagios, for example, runs `gnt-cluster verify` once in a while). Ganeti has this concept of "jobs" which can provide information about those.

The command `gnt-job list` will show the entire job history, and `gnt-job list --running` will show running jobs. `gnt-job watch` can be used to watch a specific job.

We have a wrapper called `ganeti-watch-jobs` which automatically shows the output of whatever job is currently running and exits when all jobs complete. This is particularly useful while [rebalancing the cluster](#rebalancing-a-cluster) as `hbal -X` does not show the job output...
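If you find yourself on a node where the `ganeti-watch-jobs` wrapper is not installed, a rough equivalent can be improvised from the standard `gnt-job` commands. This is only a sketch, not the actual wrapper (the exact `gnt-job list` output fields may vary between Ganeti versions):

    #!/bin/sh
    # poor man's ganeti-watch-jobs: follow whatever job is currently running,
    # and exit once the job queue is empty (sketch, not the actual wrapper)
    while true; do
        # pick the first running job, if any
        job=$(gnt-job list --running --no-headers -o id | head -1)
        if [ -z "$job" ]; then
            echo "no running jobs, exiting"
            break
        fi
        # follow that job's output until it completes
        gnt-job watch "$job"
    done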
## Open vSwitch crash course and debugging

[Open vSwitch](https://www.openvswitch.org/) is used in the `gnt-fsn` cluster to connect the machines with each other through [Hetzner's "vswitch"](https://wiki.hetzner.de/index.php/Vswitch/en) system.

You will typically not need to deal with Open vSwitch, as Ganeti takes care of configuring the network on instance creation and migration. But if you believe there might be a problem with it, you can consider reading the following:

 * [Documentation portal](https://docs.openvswitch.org/en/latest/)
 * [Tutorials](https://docs.openvswitch.org/en/latest/tutorials/index.html)
 * [Debugging Open vSwitch slides](https://www.openvswitch.org/support/slides/OVS-Debugging-110414.pdf)

## Accessing the QEMU control ports

There is a magic warp zone on the node where an instance is running:

```
nc -U /var/run/ganeti/kvm-hypervisor/ctrl/$INSTANCE.monitor
```

This drops you into the [QEMU monitor](https://people.redhat.com/pbonzini/qemu-test-doc/_build/html/topics/pcsys_005fmonitor.html) which can do all sorts of things including adding/removing devices, saving/restoring the VM state, pausing/resuming the VM, taking screenshots, etc.

There are many sockets in the `ctrl` directory, including:

 * `.serial`: the instance's serial port
 * `.monitor`: the QEMU monitor control port
 * `.qmp`: the same, but with a JSON interface that I can't figure out (the `-qmp` argument to `qemu`)
 * `.kvmd`: same as the above?

## Pager playbook

### I/O overload

In case of excessive I/O, it might be worth looking into which machine is at fault. The [howto/drbd](howto/drbd) page explains how to map a DRBD device to a VM. You can also find which logical volume is backing an instance (and vice versa) with this command:

    lvs -o+tags

This will list all logical volumes and their associated tags. If you already know which logical volume you're looking for, you can address it directly:

    root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
      LV Tags
      originstname+bacula-director-01.torproject.org

### Node failure

Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only one machine disappears, the cluster should be able to recover by failing instances over to other nodes. This is currently done manually, however.

WARNING: the following procedure should be considered a LAST RESORT. In the vast majority of cases, it is simpler and less risky to just restart the node using a remote power cycle to restore the service than to risk the split brain scenario this procedure can cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from actually returns on its own without you being able to stop it, it may still be running those DRBD disks and virtual machines, and you *may* end up in a split brain scenario.

If, say, `fsn-node-07` completely fails and you need to restore service to the virtual machines running on that server, you can fail over to the secondaries. Before you do, however, you need to be completely confident it is not still running in parallel, which could lead to a "split brain" scenario. For that, just cut the power to the machine using out of band management (e.g. on Hetzner, power down the machine through the Hetzner Robot; on Cymru, use the iDRAC to cut the power to the main board).
Once the machine is powered down, instruct Ganeti to stop using it altogether:

    gnt-node modify --offline=yes fsn-node-07

Then, once the machine is offline and Ganeti also agrees, switch all the instances on that node to their secondaries:

    gnt-node failover fsn-node-07.torproject.org

It's possible that you need `--ignore-consistency`, but this has caused trouble in the past (see [40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for example: they explicitly say they never needed the flag. Note that it will still try to connect to the failed node to shut down the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed server is repaired and restarts, it will contact the master to ask for instances to start. Since the instances have been migrated to other machines, none will be started and there *should* not be any inconsistencies.

Once the machine is up and running and you are confident you do not have a split brain scenario, you can re-add the machine to the cluster with:

    gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster because you now have an empty node which could be reused (hopefully). It might, obviously, be worth exploring the root cause of the failure before re-adding the machine to the cluster.

Recoveries could eventually be automated if such situations occur more often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin manual.

### Master node failure

A master node failure is a special case, as you do not have access to the node to run Ganeti commands. We have not established our own procedure for this yet, see:

 * [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
 * [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)

TODO: expand documentation on master node failure recovery.

### Split brain recovery

A split brain occurred during a partial failure, failover, then unexpected recovery of `fsn-node-07` ([issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). It might occur in other scenarios, but this section documents that specific one. Hopefully the recovery will be similar in other scenarios.

The split brain was the result of an operator running this command to fail over the instances running on the node:

    gnt-node failover --ignore-consistency fsn-node-07.torproject.org

The symptom of the split brain is that the VM is running on two machines at once.
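One rough way to confirm that for a given instance is to look for its QEMU process on both the old and the new primary (a sketch only; the node and instance names below are placeholders):

    for node in fsn-node-06 fsn-node-07; do
        echo "== $node =="
        # Ganeti starts the guest as a qemu process named after the instance
        ssh "$node.torproject.org" "pgrep -af qemu | grep palmeri || echo 'not running here'"
    done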
You will see that in `gnt-cluster verify`: Thu Apr 22 01:28:04 2021 * Verifying node status Thu Apr 22 01:28:04 2021 - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org Thu Apr 22 01:28:04 2021 - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org In the above, the verification finds an instance running on an unexpected server (the old primary). Disks will be in a similar "degraded" state, according to `gnt-cluster verify`: Thu Apr 22 01:28:04 2021 * Verifying instance status Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok' We can also see that symptom on an individual instance: root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org - Instance name: onionbalance-01.torproject.org [...] Disks: - disk/0: drbd, size 10.0G access mode: rw nodeA: fsn-node-05.torproject.org, minor=29 nodeB: fsn-node-07.torproject.org, minor=26 port: 11031 on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED* on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED* [...] The first (optional) thing to do in a split brain scenario is to stop the damage made by running instances: stop all the instances running in parallel, on both the previous and new primaries: gnt-instance stop $INSTANCES Then on `fsn-node-07` just use `kill(1)` to shutdown the `qemu` processes running the VMs directly. Now the instances should all be shutdown and no further changes will be done on the VM that could possibly be lost. (This step is optional because you can also skip straight to the hard decision below, while leaving the instances running. But that adds pressure to you, and we don't want to do that to your poor brain right now.) That will leave you time to make a more important decision: which node will be authoritative (which will keep running as primary) and which one will "lose" (and will have its instances destroyed)? There's no easy good or wrong answer for this: it's a judgement call. 
In any case, there might already have been data loss: for as long as both nodes were available and the VMs were running on both, data written on one of the nodes during the split brain will be lost when we destroy the state on the "losing" node.

If you have picked the previous primary as the "new" primary, you will need to *first* revert the failover and flip the instances back to the previous primary:

    for instance in $INSTANCES; do
        gnt-instance failover $instance
    done

When that is done, or if you have picked the "new" primary (the one the instances were originally failed over to) as the official one, you need to fix the disks' state. For this, flip to a "plain" disk (i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring the disk and reallocate a new disk in the right place. Assuming all instances are stopped, this should do it:

    for instance in $INSTANCES ; do
        gnt-instance modify -t plain $instance
        gnt-instance modify -t drbd --no-wait-for-sync $instance
        gnt-instance start $instance
        gnt-instance console $instance
    done

Then each instance should be back up on a single machine and the split brain scenario resolved. Note that this means the other side of the DRBD mirror is destroyed in the procedure; that is the step that drops the data which was sent to the wrong side of the "split brain".

Once everything is back to normal, it might be a good idea to rebalance the cluster.

References:

 * the `-t plain` hack comes from [this post on the Ganeti list](https://groups.google.com/g/ganeti/c/l8www_IcFFI)
 * [this procedure](https://blkperl.github.io/split-brain-ganeti.html) suggests using `replace-disks -n` which also works, but requires us to pick the secondary by hand each time, which is annoying
 * [this procedure](https://www.ipserverone.info/knowledge-base/how-to-fix-drbd-recovery-from-split-brain/) has instructions on how to recover at the DRBD level directly, but we have not needed those instructions so far

### Bridge configuration failures

If you get the following error while trying to bring up the bridge:

    root@chi-node-02:~# ifup br0
    add bridge failed: Package not installed
    run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
    ifup: failed to bring up br0

... it might be that the bridge scripts cannot load the required kernel module, because kernel module loading has been disabled. Reboot with the `/etc/no_modules_disabled` file present:

    touch /etc/no_modules_disabled
    reboot

It might be that the machine took too long to boot because it's not in mandos and the operator took too long to enter the LUKS passphrase. Re-enable the machine with this command on mandos:

    mandos-ctl --enable chi-node-02.torproject

### Cleaning up orphan disks

Sometimes `gnt-cluster verify` will give this warning, particularly after a failed rebalance:

    * Verifying orphan volumes
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown

This can happen when an instance was partially migrated to a node (in this case `fsn-node-06`) but the migration failed because (for example) there was no HDD on the target node.
The fix here is simply to remove the logical volumes on the target node: ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data ### Fixing inconsistent disks Sometimes `gnt-cluster verify` will give this error: WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok' ... or worse: ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)> The fix for both is to run: gnt-instance activate-disks materculae.torproject.org This will make sure disks are correctly setup for the instance. If you have a lot of those warnings, pipe the output into this filter, for example: gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' | sed 's/.*instance//;s/:.*//' | sort -u | while read instance; do gnt-instance activate-disks $instance done ### Not enough memory for failovers Another error that `gnt-cluster verify` can give you is, for example: - ERROR: node fsn-node-04.torproject.org: not enough memory to accomodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available) The solution is to [rebalance the cluster](#rebalancing-a-cluster). ### Can't assemble device after creation It's possible that Ganeti fails to create an instance with this error: Thu Jan 14 20:01:00 2021 - WARNING: Device creation failed Failure: command execution error: Can't create block device <DRBD8(hosts=d1b54252-dd81-479b-a9dc-2ab1568659fa/0-3aa32c9d-c0a7-44bb-832d-851710d04765/0, port=11005, backend=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=10240m)> on node chi-node-03.torproject.org for instance build-x86-13.torproject.org: Can't assemble device after creation, unusual event: drbd0: timeout while configuring network In this case, the problem was that `chi-node-03` had an incorrect `secondary_ip` set. The immediate fix was to correctly set the secondary address of the node: gnt-node modify --secondary-ip=172.30.130.3 chi-node-03.torproject.org Then `gnt-cluster verify` was complaining about the leftover DRBD device: - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device). ### SSH key verification failures Ganeti uses SSH to launch arbitrary commands (as root!) on other nodes. 
It does this using a funky command, as seen in `node-daemon.log`:

    ssh -oEscapeChar=none -oHashKnownHosts=no \
        -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
        -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
        -oConnectTimeout=10 -oHostKeyAlias=chignt.torproject.org -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 \
        root@chi-node-03.torproject.org

This has caused us some problems in the Ganeti buster to bullseye upgrade, possibly because of changes in host verification routines in OpenSSH. The problem was documented in [issue 1608 upstream](https://github.com/ganeti/ganeti/issues/1608) and [tpo/tpa/team#40383](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40383).

A workaround is to synchronize Ganeti's `known_hosts` file:

    grep 'chi-node-0[0-9]' /etc/ssh/ssh_known_hosts | grep -v 'initramfs' | grep ssh-rsa | sed 's/[^ ]* /chignt.torproject.org /' >> /var/lib/ganeti/known_hosts

Note that the above assumes a cluster with fewer than 10 nodes (because of the `chi-node-0[0-9]` pattern).

### Other troubleshooting

The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common problems. See also the [common issues page](https://github.com/ganeti/ganeti/wiki/Common-Issues) in the Ganeti wiki.

Look into logs on the relevant nodes (particularly `/var/log/ganeti/node-daemon.log`, which shows all commands run by Ganeti) when you have problems.

## Disaster recovery

If things get completely out of hand and the cluster becomes too unreliable for service, the only solution is to rebuild another one elsewhere. Since Ganeti 2.2, there is a [move-instance](http://docs.ganeti.org/ganeti/2.15/html/move-instance.html) command to move instances between clusters that can be used for that purpose.

If Ganeti is completely destroyed and its APIs don't work anymore, the last resort is to restore all virtual machines from [howto/backup](howto/backup). Hopefully, this should not happen except in the case of a catastrophic data loss bug in Ganeti or [howto/drbd](howto/drbd).

# Reference

## Installation

Ganeti is typically installed as part of the [bare bones machine installation process](howto/new-machine), usually during the "post-install configuration" procedure, once the machine is fully installed and configured.

Typically, we add a new *node* to an existing *cluster*. Below are cluster-specific procedures to add a new *node* to each existing cluster, alongside the configuration of the cluster as it was done at the time (and how it could be used to rebuild a cluster from scratch). Make sure you use the procedure specific to the cluster you are working on.

Note that this is *not* about installing virtual machines (VMs) *inside* a Ganeti cluster: for that you want to look at the [new instance procedure](#adding-a-new-instance).

### New gnt-fsn node

1. To create a new box, follow [howto/new-machine-hetzner-robot](howto/new-machine-hetzner-robot) but change the following settings:

    * Server: [PX62-NVMe][]
    * Location: `FSN1`
    * Operating system: Rescue
    * Additional drives: 2x10TB HDD (update: starting from fsn-node-05, we are *not* ordering additional drives to save on costs, see [ticket 33083](https://bugs.torproject.org/33083) for rationale)
    * Add in the comment form that the server needs to be in the same datacenter as the other machines (FSN1-DC13, but double-check)

    [PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER

2. follow the [howto/new-machine](howto/new-machine) post-install configuration
3. Add the server to the two `vSwitch` systems in the [Hetzner Robot web UI](https://robot.your-server.de/vswitch)

4. install openvswitch and allow modules to be loaded:

        touch /etc/no_modules_disabled
        reboot
        apt install openvswitch-switch

5. Allocate a private IP address in the `30.172.in-addr.arpa` zone (and the `torproject.org` zone) for the node, in the `admin/dns/domains.git` repository

6. copy over the `/etc/network/interfaces` from another ganeti node, changing the `address` and `gateway` fields to match the local entry.

7. knock on wood, cross your fingers, pet a cat, help your local book store, and reboot:

        reboot

8. prepare all the nodes by configuring them in Puppet, adding the class `roles::ganeti::fsn` to the node

9. re-disable module loading:

        rm /etc/no_modules_disabled

10. run puppet across the ganeti cluster to ensure the ipsec tunnels are up:

        cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'

11. reboot again:

        reboot

12. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.135.2 \
          --no-ssh-key-check \
          --no-node-setup \
          fsn-node-02.torproject.org

    If this is an entirely new cluster, you need a different procedure, see [the cluster initialization procedure](#gnt-fsn-cluster-initialization) instead.

13. make sure everything is great in the cluster:

        gnt-cluster verify

    If that takes a long time and eventually fails with errors like:

        ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out\r\n

    ... that is because the [howto/ipsec](howto/ipsec) tunnels between the nodes are failing. Make sure Puppet has run across the cluster (step 10 above) and see [howto/ipsec](howto/ipsec) for further diagnostics. For example, the above would be fixed with:

        ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
        ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"

### gnt-fsn cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup of the first Ganeti node when the `gnt-fsn` cluster was set up:

    gnt-cluster init \
      --master-netdev vlan-gntbe \
      --vg-name vg_ganeti \
      --secondary-ip 172.30.135.1 \
      --enabled-hypervisors kvm \
      --nic-parameters mode=openvswitch,link=br0,vlan=4000 \
      --mac-prefix 00:66:37 \
      --no-ssh-init \
      --no-etc-hosts \
      fsngnt.torproject.org

The above assumes that `fsngnt` is already in DNS. See the [MAC address prefix selection](#mac-address-prefix-selection) section for information on how the `--mac-prefix` argument was selected.

Then the following extra configuration was performed:

    gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
    gnt-cluster modify -H kvm:kernel_path=,initrd_path=
    gnt-cluster modify -H kvm:security_model=pool
    gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
    gnt-cluster modify -H kvm:disk_cache=none
    gnt-cluster modify -H kvm:disk_discard=unmap
    gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
    gnt-cluster modify -H kvm:disk_type=scsi-hd
    gnt-cluster modify -H kvm:migration_bandwidth=950
    gnt-cluster modify -H kvm:migration_downtime=500
    gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
    gnt-cluster modify --uid-pool 4000-4019

The [network configuration](#network-configuration) (below) must also be performed for the address blocks reserved in the cluster.
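After the initial `gnt-cluster init` and the batch of `gnt-cluster modify` calls above, a quick sanity check (not a required step, just a habit worth having) is to review the resulting configuration and run the cluster verification:

    gnt-cluster info     # review the hypervisor, disk and policy parameters just set
    gnt-cluster verify   # run the cluster-wide consistency checks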
### New gnt-chi node

1. to create a new box, follow the [cymru new-machine howto](howto/new-machine-cymru)

2. follow the [howto/new-machine](howto/new-machine) post-install configuration

3. allocate a private IP address in the `30.172.in-addr.arpa` zone for the node, in the `admin/dns/domains.git` repository

4. add the private IP address to the eth1 interface, for example in `/etc/network/interfaces.d/eth1`:

        auto eth1
        iface eth1 inet static
            address 172.30.130.5/24

    This IP must be allocated in the reverse DNS zone file (`30.172.in-addr.arpa`) and the `torproject.org` zone file in the `dns/domains.git` repository.

5. enable the interface:

        ifup eth1

6. set up a bridge on the public interface, replacing the `eth0` blocks with something like:

        auto eth0
        iface eth0 inet manual

        auto br0
        iface br0 inet static
            address 38.229.82.104/24
            gateway 38.229.82.1
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0

        # IPv6 configuration
        iface br0 inet6 static
            accept_ra 0
            address 2604:8800:5000:82:baca:3aff:fe5d:8774/64
            gateway 2604:8800:5000:82::1

7. allow modules to be loaded, cross your fingers that you didn't screw up the network configuration above, and reboot:

        touch /etc/no_modules_disabled
        reboot

8. configure the node in Puppet by adding it to the `roles::ganeti::chi` class, and run Puppet on the new node:

        puppet agent -t

9. re-disable module loading:

        rm /etc/no_modules_disabled

10. run puppet across the ganeti cluster to ensure firewalls are correctly configured:

        cumin -p 0 'C:roles::ganeti::chi' 'puppet agent -t'

11. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.130.5 \
          --no-ssh-key-check \
          --no-node-setup \
          chi-node-05.torproject.org

    If this is an entirely new cluster, you need a different procedure, see [the cluster initialization procedure](#gnt-chi-cluster-initialization) instead.

12. make sure everything is great in the cluster:

        gnt-cluster verify

    If the last step fails with SSH errors, you may need to re-synchronise the SSH `known_hosts` file, see [SSH key verification failures](#ssh-key-verification-failures).

### gnt-chi cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup of the first Ganeti node when the `gnt-chi` cluster was set up:

    gnt-cluster init \
      --master-netdev eth1 \
      --nic-parameters link=br0 \
      --vg-name vg_ganeti \
      --secondary-ip 172.30.130.1 \
      --enabled-hypervisors kvm \
      --mac-prefix 06:66:38 \
      --no-ssh-init \
      --no-etc-hosts \
      chignt.torproject.org

The above assumes that `chignt` is already in DNS. See the [MAC address prefix selection](#mac-address-prefix-selection) section for information on how the `--mac-prefix` argument was selected.
Then the following extra configuration was performed: ``` gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap gnt-cluster modify -H kvm:kernel_path=,initrd_path= gnt-cluster modify -H kvm:security_model=pool gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000' gnt-cluster modify -H kvm:disk_cache=none gnt-cluster modify -H kvm:disk_discard=unmap gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci gnt-cluster modify -H kvm:disk_type=scsi-hd gnt-cluster modify -H kvm:migration_bandwidth=950 gnt-cluster modify -H kvm:migration_downtime=500 gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0' gnt-cluster modify --uid-pool 4000-4019 ``` The upper limit for CPU count and memory size were doubled, to 16 and 64G, respectively, with: ``` gnt-cluster modify --ipolicy-bounds-specs \ max:cpu-count=16,disk-count=16,disk-size=1048576,\ memory-size=65536,nic-count=8,spindle-use=12\ /min:cpu-count=1,disk-count=1,disk-size=1024,\ memory-size=128,nic-count=1,spindle-use=1 ``` NOTE: watch out for whitespace here. The [original source](https://johnny85v.wordpress.com/2016/06/13/ganeti-commands/) for this command had too much whitespace, which fails with: Failure: unknown/wrong parameter name 'Missing value for key '' in option --ipolicy-bounds-specs' The disk templates also had to be modified to account for iSCSI devices: gnt-cluster modify --enabled-disk-templates drbd,plain,blockdev gnt-cluster modify --ipolicy-disk-templates drbd,plain,blockdev The [network configuration](#network-configuration) (below) must also be performed for the address blocks reserved in the cluster. This is the actual initial configuration performed: gnt-network add --network 38.229.82.0/24 --gateway 38.229.82.1 --network6 2604:8800:5000:82::/64 --gateway6 2604:8800:5000:82::1 gnt-chi-01 gnt-network connect --nic-parameters=link=br0 gnt-chi-01 default The following IPs were reserved: gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01 The first two are for the gateway, but the rest is temporary and might be reclaimed eventually. ### Network configuration IP allocation is managed by Ganeti through the `gnt-network(8)` system. Say we have `192.0.2.0/24` reserved for the cluster, with the host IP `192.0.2.100` and the gateway on `192.0.2.1`. You will create this network with: gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network If there's also IPv6, it would look something like this: gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network Note: the actual name of the network (`example-network`) above, should follow the convention established in [doc/naming-scheme](doc/naming-scheme). Then we associate the new network to the default node group: gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default The arguments to `--nic-parameters` come from the values configured in the cluster, above. The current values can be found with `gnt-cluster info`. 
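A quick way to extract just the NIC-related defaults (a rough sketch; the exact section names in the `gnt-cluster info` output may vary between Ganeti versions):

    gnt-cluster info | grep -i -A4 'nic parameters'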
For example, the second ganeti network block was assigned with the following commands:

    gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02
    gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default

IP addresses can be reserved with the `--add-reserved-ips` argument to the modify command, for example:

    gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01

Note that the gateway and node IP addresses are automatically reserved; this list is for hosts outside of the cluster.

The network name must follow the [naming convention](doc/naming-scheme).

## SLA

As long as the cluster is not over capacity, it should be able to survive the loss of a node in the cluster unattended. Justified machines can be provisioned within a few business days without problems. New nodes can be provisioned within a week or two, depending on budget and hardware availability.

## Design

Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting service. All machines use the same hardware to avoid problems with live migration. That is currently a customized build of the [PX62-NVMe][] line.

### Network layout

Machines are interconnected over a [vSwitch](https://wiki.hetzner.de/index.php/Vswitch/en), a "virtual layer 2 network" probably implemented using [Software-defined Networking](https://en.wikipedia.org/wiki/Software-defined_networking) (SDN) on top of Hetzner's network. The details of that implementation do not matter much to us, since we do not trust the network and run an IPsec layer on top of the vswitch. We communicate with the `vSwitch` through [Open vSwitch](https://en.wikipedia.org/wiki/Open_vSwitch) (OVS), which is (currently manually) configured on each node of the cluster.

There are two distinct IPsec networks:

 * `gnt-fsn-public`: the public network, which maps to the `fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS network, and the `gnt-fsn` network pool in Ganeti. It provides public IP addresses and routing across the network. Instances get IPs allocated in this network.
 * `gnt-fsn-be`: the private Ganeti network which maps to the `fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS network. It has no matching `gnt-network` component and IP addresses are allocated manually in the 172.30.135.0/24 network through DNS. It provides internal routing for Ganeti commands and [howto/drbd](howto/drbd) storage mirroring.

### MAC address prefix selection

The MAC address prefix for the gnt-fsn cluster (`00:66:37:...`) seems to have been picked arbitrarily. While it does not conflict with a known existing prefix, it could eventually be issued to a manufacturer and reused, possibly leading to a MAC address clash. The closest is currently Huawei:

    $ grep ^0066 /var/lib/ieee-data/oui.txt
    00664B     (base 16)		HUAWEI TECHNOLOGIES CO.,LTD

Such a clash is fairly improbable, because that new manufacturer would need to show up on the local network as well.
Still, new clusters SHOULD use a different MAC address prefix in the [locally administered address](https://en.wikipedia.org/wiki/MAC_address#Universal_vs._local) (LAA) space, where addresses "are distinguished by setting the second-least-significant bit of the first octet of the address". In other words, the MAC address must have 2, 6, A or E as its second [nibble](https://en.wikipedia.org/wiki/Nibble), that is, it must look like one of those:

    x2 - xx - xx - xx - xx - xx
    x6 - xx - xx - xx - xx - xx
    xA - xx - xx - xx - xx - xx
    xE - xx - xx - xx - xx - xx

We used `06:66:38` in the gnt-chi cluster for that reason. We picked the `06:66` prefix to resemble the existing `00:66` prefix used in `gnt-fsn` but varied the last octet of the prefix (from `:37` to `:38`) to make them slightly more different-looking. Obviously, it's unlikely the MAC addresses will be compared across clusters in the short term. But a MAC bridge could technically appear if an exotic VPN bridge gets set up between the two networks in the future, so it's good to have some difference.

### Hardware variations

We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)) but in the past DSA had problems live-migrating (it wouldn't immediately fail but there were "issues" after). So we might need to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover) instead of migrate between those parts of the cluster. There are also doubts that the Linux kernel supports those shiny new processors at all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix) for example, so it might be worth waiting a little before switching to that new platform, even if it's cheaper.

See the CPU emulation section below for a larger discussion of this problem.

### CPU emulation

Note that we might want to tweak the `cpu_type` parameter. By default, it emulates a lot of processing that can be delegated to the host CPU instead. If we use `kvm:cpu_type=host`, then each node will tailor the emulation system to the CPU on the node. But that might make live migration more brittle: VMs or processes can crash after a live migration because of a slightly different configuration (microcode, CPU, kernel and QEMU versions all play a role). So we need to find the lowest common denominator in CPU families.

The list of available families supported by QEMU varies between releases, but is visible with:

    # qemu-system-x86_64 -cpu help
    Available CPUs:
    x86 486
    x86 Broadwell             Intel Core Processor (Broadwell)
    [...]
    x86 Skylake-Client        Intel Core Processor (Skylake)
    x86 Skylake-Client-IBRS   Intel Core Processor (Skylake, IBRS)
    x86 Skylake-Server        Intel Xeon Processor (Skylake)
    x86 Skylake-Server-IBRS   Intel Xeon Processor (Skylake, IBRS)
    [...]

The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel micro-architecture. The closest matching family would be `Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support). Note that newer QEMU releases (4.2, currently in unstable) have more supported features.

In that context, of course, supporting different CPU manufacturers (say AMD vs Intel) is impractical: they will have totally different families that are not compatible with each other.
This will break live migration, which can trigger crashes and problems in the migrated virtual machines. If there are problems live-migrating between machines, it is still possible to "failover" (`gnt-instance failover` instead of `migrate`) which shuts off the machine, fails over the disks, and starts it on the other side. That's not such a big problem: we often need to reboot the guests when we reboot the hosts anyway. But it does complicate our work. Of course, it's also possible that live migrations work fine if *no* `cpu_type` at all is specified in the cluster, but that needs to be verified.

Nodes could also be [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a subset of nodes.

References:

 * <https://dsa.debian.org/howto/install-ganeti/>
 * <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>

### Installer

The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install instances. It is configured through Puppet with the [shared ganeti module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much as possible. The installer will:

 1. set up GRUB to respond on the serial console
 2. set up and log a random root password
 3. make sure SSH is installed and log the public keys and fingerprints
 4. set up swap if a labeled partition is present, or a 512MB swapfile otherwise
 5. set up basic static networking through `/etc/network/interfaces.d`

We have custom configurations on top of that to:

 1. add a few base packages
 2. do our own custom SSH configuration
 3. fix the hostname to be an FQDN
 4. add a line to `/etc/hosts`
 5. add a tmpfs

There is work underway to refactor and automate the install better, see [ticket 31239](https://bugs.torproject.org/31239) for details.

### Storage

TODO: document how DRBD works in general, and how it's set up here in particular. See also the [DRBD documentation](howto/drbd).

The Cymru PoP has an iSCSI cluster for large filesystem storage. Ideally, this would be automated inside Ganeti; some quick links:

 * [search for iSCSI in the ganeti-devel mailing list](https://www.mail-archive.com/search?l=ganeti-devel@googlegroups.com&q=iscsi&submit.x=0&submit.y=0)
 * in particular a [discussion of integrating SANs into ganeti](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/P7JU_0YGn9s) seems to say "just do it manually" (paraphrasing) and [this discussion has an actual implementation](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/kkXFDgvg2rY), [gnt-storage-eql](https://github.com/atta/gnt-storage-eql)
 * it could be implemented as an [external storage provider](https://github.com/ganeti/ganeti/wiki/External-Storage-Providers), see the [documentation](http://docs.ganeti.org/ganeti/2.10/html/design-shared-storage.html)
 * the DSA docs are in two parts: [iscsi](https://dsa.debian.org/howto/iscsi/) and [export-iscsi](https://dsa.debian.org/howto/export-iscsi/)
 * someone made a [Kubernetes provisioner](https://github.com/nmaupu/dell-provisioner) for our hardware which could provide sample code

For now, iSCSI volumes are manually created and passed to new virtual machines.

## Issues

There is no issue tracker specifically for this project. [File][] or [search][] for issues in the [team issue tracker][search] component.
[File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

Ganeti, of course, has its own [issue tracker on GitHub](https://github.com/ganeti/ganeti/issues).

## Monitoring and testing

<!-- TODO: describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->

## Logs and metrics

Ganeti logs a significant amount of information in `/var/log/ganeti/`. Those logs are of particular interest:

 * `node-daemon.log`: all low-level commands and HTTP requests on the node daemon, includes, for example, LVM and DRBD commands
 * `os/*$hostname*.log`: installation log for machine `$hostname`

It does not expose performance metrics that are digested by Prometheus right now, but that would be an interesting feature to add.

## Other documentation

 * [Ganeti](http://www.ganeti.org/)
 * [Ganeti documentation home](http://docs.ganeti.org/)
 * [Main manual](http://docs.ganeti.org/ganeti/master/html/)
 * [Manual pages](http://docs.ganeti.org/ganeti/master/man/)
 * [Wiki](https://github.com/ganeti/ganeti/wiki)
 * [Issues](https://github.com/ganeti/ganeti/issues)
 * [Google group](https://groups.google.com/forum/#!forum/ganeti)
 * [Wikimedia foundation documentation](https://wikitech.wikimedia.org/wiki/Ganeti)
 * [Riseup documentation](https://we.riseup.net/riseup+tech/ganeti)
 * [DSA](https://dsa.debian.org/howto/install-ganeti/)
 * [OSUOSL wiki](https://wiki.osuosl.org/ganeti/)

# Discussion

## Overview

The project of creating a Ganeti cluster for Tor appeared in the summer of 2019. The machines were delivered by Hetzner in July 2019 and set up by weasel by the end of the month.

## Goals

The goal was to replace the aging group of KVM servers (`kvm[1-5]`, AKA `textile`, `unifolium`, `macrum`, `kvm4` and `kvm5`).

### Must have

 * arbitrary virtual machine provisioning
 * redundant setup
 * automated VM installation
 * replacement of existing infrastructure

### Nice to have

 * fully configured in Puppet
 * full high availability with automatic failover
 * extra capacity for new projects

### Non-Goals

 * Docker or "container" provisioning - we consider this out of scope for now
 * self-provisioning by end-users: TPA remains in control of provisioning

## Approvals required

A budget was proposed by weasel in May 2019 and approved by Vegas in June. An extension to the budget was approved in January 2020 by Vegas.

## Proposed Solution

Set up a Ganeti cluster of two machines with a Hetzner vSwitch backend.

## Cost

The design based on the [PX62 line][PX62-NVMe] has the following monthly cost structure:

 * per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
 * IPv4 space: 35.29EUR (/27)
 * IPv6 space: 8.40EUR (/64)
 * bandwidth cost: 1EUR/TB (currently 38EUR)

At three servers, that adds up to around 435EUR/mth. Up to date costs are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->

Note that instances can also be installed [through FAI, see the Ganeti wiki for examples](https://github.com/ganeti/ganeti/wiki/System-template-with-FAI).

There are GUIs for Ganeti that we are not using, but could, if we want to grant more users access:

 * [Ganeti Web manager](https://ganeti-webmgr.readthedocs.io/) is a "Django based web frontend for managing Ganeti virtualization clusters.
Since Ganeti only provides a command-line interface, Ganeti Web Manager’s goal is to provide a user friendly web interface to Ganeti via Ganeti’s Remote API. On top of Ganeti it provides a permission system for managing access to clusters and virtual machines, an in browser VNC console, and vm state and resource visualizations" * [Synnefo](https://www.synnefo.org/) is a "complete open source cloud stack written in Python that provides Compute, Network, Image, Volume and Storage services, similar to the ones offered by AWS. Synnefo manages multiple Ganeti clusters at the backend for handling of low-level VM operations and uses Archipelago to unify cloud storage. To boost 3rd-party compatibility, Synnefo exposes the OpenStack APIs to users."