Ganeti is software designed to facilitate the management of virtual machines (KVM or Xen). It helps you move virtual machine instances from one node to another, create an instance with DRBD replication on another node, live-migrate instances between nodes, and so on.
- Tutorial
- How-to
- Glossary
- Adding a new instance
- Modifying an instance
- Destroying an instance
- Getting information
- Disk operations (DRBD)
- Evaluating cluster capacity
- Moving instances and failover
- Importing external libvirt instances
- Importing external libvirt instances, manual
- Rebooting
- Rebalancing a cluster
- Adding and removing addresses on instances
- Job inspection
- Open vSwitch crash course and debugging
- Pager playbook
- Disaster recovery
- Reference
- Discussion
Tutorial
Listing virtual machines (instances)
This will show the running guests, known as "instances":
gnt-instance list
Accessing serial console
Our instances provide a serial console, starting in GRUB. To access it, run:
gnt-instance console test01.torproject.org
To exit, use ^] -- that is, Control-<Closing Bracket>.
How-to
Glossary
In Ganeti, we use the following terms:
- node: a physical machine
- instance: a virtual machine
- master: the node on which we issue Ganeti commands and which supervises all the other nodes
Nodes are interconnected through a private network that is used to communicate commands and synchronise disks (see howto/drbd). Instances are normally assigned two nodes: a primary and a secondary: the primary is where the virtual machine actually runs and the secondary acts as a hot failover.
See also the more extensive glossary in the Ganeti documentation.
Adding a new instance
This command creates a new guest, or "instance" in Ganeti's vocabulary, with a 10G root, 2G swap, 20G spare on SSD, 800G on HDD, 8GB of RAM and 2 CPU cores:
gnt-instance add \
-o debootstrap+bullseye \
-t drbd --no-wait-for-sync \
--net 0:ip=pool,network=gnt-fsn13-02 \
--no-ip-check \
--no-name-check \
--disk 0:size=10G \
--disk 1:size=2G,name=swap \
--disk 2:size=20G \
--disk 3:size=800G,vg=vg_ganeti_hdd \
--backend-parameters memory=8g,vcpus=2 \
test-01.torproject.org
What that does
This configures the following:
- redundant disks in a DRBD mirror; use -t plain instead of -t drbd for tests as that avoids syncing of disks and will speed things up considerably (even with --no-wait-for-sync there are some operations that block on synced mirrors). Only one node should be provided as the argument for --node then.
- three partitions: one on the default VG (SSD), one on another (HDD) and a swap file on the default VG. If you don't specify a swap device, a 512MB swapfile is created in /swapfile. TODO: configure disk 2 and 3 automatically in installer. (/var and /srv?)
- 8GB of RAM with 2 virtual CPUs
- an IP allocated from the public gnt-fsn pool: gnt-instance add will print the IPv4 address it picked to stdout. The IPv6 address can be found in /var/log/ganeti/os/ on the primary node of the instance, see below.
- the test-01.torproject.org hostname
Next steps
To find the root password, ssh host key fingerprints, and the IPv6 address, run this on the node where the instance was created, for example:
egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)
We copy root's authorized keys into the new instance, so you should be able to
log in with your token. You will be required to change the root password immediately.
Pick something nice and document it in tor-passwords.
Also set reverse DNS for both IPv4 and IPv6 in Hetzner's Robot (check under Servers -> vSwitch -> IPs) or in our own reverse zone files (if delegated).
Then follow howto/new-machine.
Known issues
- allocator failures: note that you may need to use the --node parameter to pick on which machines you want the machine to end up, otherwise Ganeti will choose for you (and may fail). Use, for example, --node fsn-node-01:fsn-node-02 to use node-01 as primary and node-02 as secondary. The allocator can sometimes fail if it is upset about something in the cluster, for example:

      Can's find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

  This situation is covered by ticket 33785. If this problem occurs, it might be worth rebalancing the cluster.

- ping failure: there is a bug in ganeti-instance-debootstrap which misconfigures ping (among other things), see bug 31781. It's currently patched in our version of the Debian package, but that patch might disappear if Debian upgrades the package without shipping our patch. Note that this was fixed in Debian bullseye and later.
Other examples
This is the same without the HDD partition, in the gnt-chi cluster:
gnt-instance add \
-o debootstrap+bullseye \
-t drbd --no-wait-for-sync \
--net 0:ip=pool,network=gnt-chi-01 \
--no-ip-check \
--no-name-check \
--disk 0:size=10G \
--disk 1:size=2G,name=swap \
--disk 2:size=20G \
--backend-parameters memory=8g,vcpus=2 \
test-01.torproject.org
A simple test machine, with 1GB of RAM and 1 CPU, without DRBD, in the FSN cluster:
gnt-instance add \
-o debootstrap+bullseye \
-t plain --no-wait-for-sync \
--net 0:ip=pool,network=gnt-fsn13-02 \
--no-ip-check \
--no-name-check \
--disk 0:size=10G \
--disk 1:size=2G,name=swap \
--backend-parameters memory=1g,vcpus=1 \
test-01.torproject.org
Do not forget to follow the next steps, above.
iSCSI integration
To create a VM with iSCSI backing, a disk must first be created on the SAN, then adopted in a VM, which needs to be reinstalled on top of that. This is typically how large disks are provisioned in the gnt-chi cluster, in the Cymru POP.
The following instructions assume you are on a node with an iSCSI initiator properly set up, and with the SAN cluster management tools installed. It also assumes you are familiar with the SMcli tool; see the storage servers documentation for an introduction.
This assumes you are creating a 500GB VM, partitioned on the Linux
host, not on the iSCSI volume. TODO: change those instructions to
create one volume per partition, so that those can be resized more
easily. The following is how tb-build-03 was setup.
- create the disk on the SAN and assign it to the host group:

      puppet agent --disable "creating a SAN disk"
      $EDITOR /usr/local/sbin/tpo-create-san-disks
      /usr/local/sbin/tpo-create-san-disks
      puppet agent --enable

  WARNING: the above script needs to be edited before it does the right thing. It will show the LUN numbers in use below. This, obviously, is not ideal, and should be replaced by a Ganeti external storage provider.

  NOTE: the logicalUnitNumber here must be an increment from the previous highest LUN. See also the disk creation instructions for a discussion.

- configure the disk on all Ganeti nodes, in Puppet's profile::ganeti::chi class:

      iscsi::multipath::alias { 'web-chi-03':
        wwid => '36782bcb00063c6a500000d67603f7abf',
      }

- propagate the magic to all nodes in the cluster:

      gnt-cluster command "puppet agent -t ; iscsiadm -m node --rescan ; multipath -r"

- confirm that multipath works, it should look something like this:

      root@chi-node-01:~# multipath -ll
      web-chi-03-srv (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi
      size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw
      |-+- policy='round-robin 0' prio=6 status=active
      | |- 11:0:0:4 sdi 8:128 active ready running
      | |- 12:0:0:4 sdj 8:144 active ready running
      | `- 9:0:0:4  sdh 8:112 active ready running
      `-+- policy='round-robin 0' prio=1 status=enabled
        |- 10:0:0:4 sdk 8:160 active ghost running
        |- 7:0:0:4  sdl 8:176 active ghost running
        `- 8:0:0:4  sdm 8:192 active ghost running
      root@chi-node-01:~#

  and the device /dev/mapper/web-chi-03 should exist.

- adopt the disks in Ganeti:

      gnt-instance add \
        -n chi-node-04.torproject.org \
        -o debootstrap+bullseye \
        -t blockdev --no-wait-for-sync \
        --net 0:ip=pool,network=gnt-chi-01 \
        --no-ip-check \
        --no-name-check \
        --disk 0:adopt=/dev/disk/by-id/dm-name-tb-build-03-root \
        --disk 1:adopt=/dev/disk/by-id/dm-name-tb-build-03-swap,name=swap \
        --disk 2:adopt=/dev/disk/by-id/dm-name-tb-build-03-srv \
        --backend-parameters memory=16g,vcpus=8 \
        tb-build-03.torproject.org

  NOTE: the actual node must be manually picked because the hail allocator doesn't seem to know about block devices.

- at this point, the VM probably doesn't boot, because for some reason gnt-instance-debootstrap doesn't fire when disks are adopted, so you need to reinstall the machine, which involves stopping it first:

      gnt-instance shutdown --timeout=0 tb-build-03
      gnt-instance reinstall tb-build-03

  HACK: the current installer fails on weird partitioning errors, see upstream bug 13. We applied patch 14 on chi-node-04 and sent it upstream for review before committing to maintaining this in Debian or elsewhere. It should be tested on other installs beforehand as well.
From here on, follow the next steps above.
TODO: This would ideally be automated by an external storage provider, see the storage reference for more information.
Troubleshooting
If a Ganeti instance install fails, it will show the end of the install log, for example:
Thu Aug 26 14:11:09 2021 - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org
Thu Aug 26 14:11:09 2021 - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Thu Aug 26 14:11:09 2021 - INFO: Chose IP 38.229.82.29 from network gnt-chi-01
Thu Aug 26 14:11:10 2021 * creating instance disks...
Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config
Thu Aug 26 14:12:58 2021 adding disks to cluster config
Thu Aug 26 14:13:00 2021 * checking mirrors status
Thu Aug 26 14:13:01 2021 - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated)
Thu Aug 26 14:13:01 2021 - INFO: - device disk/2: 0.60% done, 55m 26s remaining (estimated)
Thu Aug 26 14:13:01 2021 * checking mirrors status
Thu Aug 26 14:13:02 2021 - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated)
Thu Aug 26 14:13:02 2021 - INFO: - device disk/2: 0.60% done, 52m 13s remaining (estimated)
Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS
Thu Aug 26 14:13:03 2021 * running the instance OS create scripts...
Thu Aug 26 14:16:31 2021 * resuming disk sync
Failure: command execution error:
Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file:
Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ...
Setting up openssh-server (1:7.9p1-10+deb10u2) ...
Creating SSH2 RSA key; this may take some time ...
2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519)
Created symlink /etc/systemd/system/sshd.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
invoke-rc.d: could not determine current runlevel
Setting up ssh (1:7.9p1-10+deb10u2) ...
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for libc-bin (2.28-10) ...
Errors were encountered while processing:
linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100
Using disk /dev/drbd4 as swap...
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4
Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: )
root@chi-node-01:~#
Here, the failure that tripped the install is:
Errors were encountered while processing:
linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
But the actual error is higher up, and we need to go look at the logs
on the server for this, in this case in
chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log,
we can find the real problem:
Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: Couldn't identify type of root file system for fsck hook
/etc/kernel/postinst.d/zz-update-grub:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?).
run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1
dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure):
installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1
In this case, oddly enough, even though Ganeti thought the install had failed, the machine can actually start:
gnt-instance start tb-pkgstage-01.torproject.org
... and after a while, we can even get a console:
gnt-instance console tb-pkgstage-01.torproject.org
And in that case, the procedure can simply continue from here: reset the root password, and make sure you finish the install:
apt install linux-image-amd64
In the above case, the sources-list post-install hook was buggy: it
wasn't mounting /dev and friends before launching the upgrades,
which was causing issues when a kernel upgrade was queued.
And if you are debugging an installer and by mistake end up with half-open filesystems and stray DRBD devices, do take a look at the LVM and DRBD documentation.
Modifying an instance
CPU, memory changes
It's possible to change the IP, CPU, or memory allocation of an instance using the gnt-instance modify command:
gnt-instance modify -B vcpus=4 test1.torproject.org
gnt-instance modify -B memory=8g test1.torproject.org
gnt-instance reboot test1.torproject.org
IP address change
IP address changes require a full stop and will require manual changes
to the /etc/network/interfaces* files:
gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
gnt-instance stop test1.torproject.org
gnt-instance start test1.torproject.org
gnt-instance console test1.torproject.org
Resizing disks
The gnt-instance grow-disk command can be used to change the size of the underlying device:
gnt-instance grow-disk --absolute test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org
The number 0 in this context indicates the first disk of the
instance. The amount specified is the final disk size (because of the
--absolute flag). In the above example, the final disk size will be
16GB. To add space to the existing disk, remove the --absolute
flag:
gnt-instance grow-disk test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org
In the above example, 16GB will be ADDED to the disk. Be careful
with resizes, because it's not possible to revert such a change:
grow-disk does not support shrinking disks. The only way to revert the
change is by exporting / importing the instance.
Then the filesystem needs to be resized inside the VM:
ssh root@test1.torproject.org
Resizing under LVM
Use pvs to display information about the physical volumes:
root@cupani:~# pvs
PV VG Fmt Attr PSize PFree
/dev/sdc vg_test lvm2 a-- <8.00g 1020.00m
Resize the physical volume to take up the new space:
pvresize /dev/sdc
Use lvs to display information about logical volumes:
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
var-opt vg_test-01 -wi-ao---- <10.00g
test-backup vg_test-01_hdd -wi-ao---- <20.00g
Use lvextend to add space to the volume:
lvextend -l '+100%FREE' vg_test-01/var-opt
Finally resize the filesystem:
resize2fs /dev/vg_test-01/var-opt
See also the LVM howto.
Resizing without LVM
If there's no LVM inside the VM (a more common configuration nowadays), the above procedure will obviously not work.
You might need to resize the partition manually, which can be done
using fdisk. In the following example we have a sda1 partition that
we want to extend from 10G to 20G to fill up the free space on
/dev/sda. Here is what the partition layout looks like before the resize:
# lsblk -a
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0 2:0 1 4K 0 disk
sda 8:0 0 20G 0 disk
└─sda1 8:1 0 10G 0 part /
sdb 8:16 0 2G 0 disk [SWAP]
sdc 8:32 0 40G 0 disk /srv
If sdc is the resized disk, the kernel might not have noticed the
size change, and you might need to kick it. There might be easier
ways (see the rescan sketch below), but a reboot will certainly do it:
reboot
In that case, the partition is already resized, so you do not
need to go through the fdisk process below and can jump straight to the
last resize2fs step.
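As an alternative to rebooting, the kernel can sometimes be told to rescan the disk through sysfs. This is a sketch, assuming the disk is SCSI-attached (sd* devices backed by a SCSI controller); it does not apply to virtio-blk disks:

    # ask the kernel to re-read the size of sdc (SCSI-attached disks only)
    echo 1 > /sys/class/block/sdc/device/rescan
    # then check that lsblk reports the new size
    lsblk /dev/sdc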
We use fdisk on the device:
# fdisk /dev/sda
Welcome to fdisk (util-linux 2.33.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Command (m for help): p # prints the partition table
Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x73ab5f76
Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 20971519 20969472 10G 83 Linux # note the starting sector for later
Now we delete the partition. Note that the data will not be deleted, only the partition table will be altered:
Command (m for help): d
Selected partition 1
Partition 1 has been deleted.
Command (m for help): p
Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x73ab5f76
Now we create the new partition to take up the whole space:
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1): 1
First sector (2048-41943039, default 2048): 2048 # this is the starting sector from above.
Last sector, +/-sectors or +/-size{K,M,G,T,P} (2048-41943039, default 41943039): 41943039
Created a new partition 1 of type 'Linux' and of size 20 GiB.
Partition #1 contains a ext4 signature.
Do you want to remove the signature? [Y]es/[N]o: n # we want to keep the previous signature
Command (m for help): p
Disk /dev/sda: 20 GiB, 21474836480 bytes, 41943040 sectors
Disk model: QEMU HARDDISK
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x73ab5f76
Device Boot Start End Sectors Size Id Type
/dev/sda1 2048 41943039 41940992 20G 83 Linux
Command (m for help): w
The partition table has been altered.
Syncing disks.
Now we check the partitions:
# lsblk -a
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0 2:0 1 4K 0 disk
sda 8:0 0 20G 0 disk
└─sda1 8:1 0 20G 0 part /
sdb 8:16 0 2G 0 disk [SWAP]
sdc 8:32 0 40G 0 disk /srv
If we check the free disk space on the device we will notice it has not changed yet:
# df -h /
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 9.8G 8.5G 874M 91% /
We need to resize it:
# resize2fs /dev/sda1
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sda1 is mounted on /; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 3
The filesystem on /dev/sda1 is now 5242624 (4k) blocks long.
The resize is now complete.
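The interactive fdisk dance above can also be collapsed into a single command with growpart, if the cloud-guest-utils package happens to be available (an alternative sketch, not part of our standard procedure):

    apt install cloud-guest-utils   # provides growpart
    growpart /dev/sda 1             # grow partition 1 to fill the disk
    resize2fs /dev/sda1             # then grow the filesystem as above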
Adding disks
A disk can be added to an instance with the modify command as
well. This, for example, will add a 100GB disk to the test1 instance
on the vg_ganeti_hdd volume group, which is made of "slow" rotating disks:
gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
gnt-instance reboot test1.torproject.org
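Inside the instance, the new disk then needs a filesystem and a mount point. A minimal sketch, where the device name (/dev/sdd) and mount point (/srv/hdd) are assumptions; check lsblk first:

    mkfs.ext4 /dev/sdd    # device name depends on existing disks, check lsblk
    echo '/dev/sdd /srv/hdd ext4 rw,noatime,errors=remount-ro 0 2' >> /etc/fstab
    mkdir -p /srv/hdd && mount /srv/hdd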
Changing disk type
Say you have a test instance that was created with a plain disk
template, but you actually want it in production with a drbd disk
template. Switching to drbd is easy:
gnt-instance shutdown test-01
gnt-instance modify -t drbd test-01
gnt-instance start test-01
The second command will use the allocator to find a secondary node. If
that fails, you can assign a node manually with -n.
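For example, this picks the secondary by hand (fsn-node-04 is just an illustration; pick a node with enough free memory and disk):

    gnt-instance modify -t drbd -n fsn-node-04.torproject.org test-01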
You can also switch back to plain, although you should generally
never do that.
See also the upstream procedure and design document.
Adding a network interface on the rfc1918 vlan
We have a vlan on which some VMs without public addresses sit. Its vlanid is 4002 and it's backed by Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic". Note that traffic on this vlan travels in the clear between nodes.
To add an instance to this vlan, give it a second network interface using:
gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org
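The new interface then needs to be configured by hand inside the guest. A sketch, assuming the NIC shows up as eth1; the address below is made up, use whatever addressing scheme is actually in use on that vlan:

    # /etc/network/interfaces.d/eth1 -- hypothetical RFC1918 address
    auto eth1
    iface eth1 inet static
        address 172.30.131.10/24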
Destroying an instance
This totally deletes the instance, including all mirrors and everything, be very careful with it:
gnt-instance remove test01.torproject.org
Getting information
Information about an instance can be found in the rather verbose
output of gnt-instance info:
root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
- Instance name: tb-build-02.torproject.org
UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
Serial number: 5
Creation time: 2020-12-15 14:06:41
Modification time: 2020-12-15 14:07:31
State: configured to be up, actual state is up
Nodes:
- primary: fsn-node-03.torproject.org
group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
- secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
Operating system: debootstrap+buster
A quicker command shows just the primary and secondary nodes for a given instance:
gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes
An equivalent command will show the primary and secondary for all instances, along with extra information (like the CPU count, memory and disk usage):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort
It can be useful to run this in a loop to see changes:
watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'
Disk operations (DRBD)
Instances should be set up using the DRBD backend, in which case you should probably take a look at howto/drbd if you have problems with that. Ganeti handles most of the logic there, so that should generally not be necessary.
Evaluating cluster capacity
This will list instances repeatedly, but also show their assigned memory, and compare it with the node's capacity:
gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
echo &&
gnt-node list
The latter does not show disk usage for secondary volume groups (see upstream issue 1379); for a complete picture of disk usage, use:
gnt-node list-storage
The gnt-cluster verify command will also check to see if there's enough space on secondaries to account for the failure of a node. Healthy output looks like this:
root@fsn-node-01:~# gnt-cluster verify
Submitted jobs 48030, 48031
Waiting for job 48030 ...
Fri Jan 17 20:05:42 2020 * Verifying cluster config
Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
Waiting for job 48031 ...
Fri Jan 17 20:05:42 2020 * Verifying group 'default'
Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
Fri Jan 17 20:05:45 2020 * Verifying node status
Fri Jan 17 20:05:45 2020 * Verifying instance status
Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
Fri Jan 17 20:05:45 2020 * Other Notes
Fri Jan 17 20:05:45 2020 * Hooks Results
A sick node would have said something like this instead:
Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail
See the Ganeti manual for a more extensive example.
Also note the hspace -L command, which can tell you how many
instances can be created in a given cluster. It uses the "standard"
instance template defined in the cluster (which we haven't configured
yet).
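According to the hspace manpage, an instance spec can also be passed on the command line instead of relying on the cluster-wide standard spec. A sketch, untested on our clusters (disk and RAM sizes in MiB; verify the flag against the installed version):

    # how many 10G-disk, 4G-RAM, 2-vCPU instances would still fit?
    hspace -L --standard-alloc=10240,4096,2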
Moving instances and failover
Ganeti is smart about assigning instances to nodes. There's also a
command (hbal) to automatically rebalance the cluster (see
below). If for some reason hbal doesn’t do what you want or you need
to move things around for other reasons, here are a few commands that
might be handy.
Make an instance switch to using its secondary:
gnt-instance migrate test1.torproject.org
Make all instances on a node switch to their secondaries:
gnt-node migrate test1.torproject.org
The migrate commands do a "live" migration which should avoid any
downtime. It might be preferable to actually shut down the machine
for some reason (for example if we actually want to reboot because of
a security upgrade), or we might not be able to live-migrate because
the node is down. In this case, we do a failover:
gnt-instance failover test1.torproject.org
The gnt-node evacuate command can also be used to "empty" a given node altogether, in case of an emergency:
gnt-node evacuate -I . fsn-node-02.torproject.org
Similarly, the gnt-node failover command can be used to hard-recover from a completely crashed node:
gnt-node failover fsn-node-02.torproject.org
Note that you might need the --ignore-consistency flag if the
node is unresponsive.
Importing external libvirt instances
Assumptions:
- INSTANCE: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. chiwui.torproject.org)
- SPARE_NODE: a Ganeti node with free space (e.g. fsn-node-03.torproject.org) where the INSTANCE will be migrated
- MASTER_NODE: the master Ganeti node (e.g. fsn-node-01.torproject.org)
- KVM_HOST: the machine which we migrate the INSTANCE from
- the INSTANCE has only root and swap partitions
- the SPARE_NODE has space in /srv/ to host all the virtual machines to import; to check, use:

      fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' <f | awk '{s+=$1} END {print s}'

  You will very likely need to create a /srv big enough for this, for example:

      lvcreate -L 300G vg_ganeti -n srv-tmp &&
      mkfs /dev/vg_ganeti/srv-tmp &&
      mount /dev/vg_ganeti/srv-tmp /srv
Import procedure:
- pick a viable SPARE NODE to import the INSTANCE (see "evaluating cluster capacity" above, when in doubt) and find on which KVM HOST the INSTANCE lives

- copy the disks, without downtime:

      ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST

- copy the disks again, this time suspending the machine:

      ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt

- renumber the host:

      ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE

- test services by changing your /etc/hosts, possibly warning service admins:

      Subject: $INSTANCE IP address change planned for Ganeti migration

      I will soon migrate this virtual machine to the new Ganeti cluster. This
      will involve an IP address change which might affect the service.

      Please let me know if there are any problems you can think of. In
      particular, do let me know if any internal (inside the server) or
      external (outside the server) services hardcode the IP address of the
      virtual machine.

      A test instance has been set up. You can test the service by adding the
      following to your /etc/hosts:

      116.202.120.182 $INSTANCE
      2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE

- destroy the test instance:

      gnt-instance remove $INSTANCE

- lower TTLs to 5 minutes. This procedure varies a lot according to the service, but generally, if all DNS entries are CNAMEs pointing to the main machine domain name, the TTL can be lowered by adding a dnsTTL entry in the LDAP entry for this host. For example, this sets the TTL to 5 minutes:

      dnsTTL: 300

  Then, to make the changes immediate, you need the following commands:

      ssh root@alberti.torproject.org sudo -u sshdist ud-generate &&
      ssh root@nevii.torproject.org ud-replicate

  Warning: if you migrate one of the hosts ud-ldap depends on, this can fail, and not only will the TTL not update, it might also fail to update the IP address in the below procedure. See ticket 33766 for details.

- shut down the original instance and redo the migration as in steps 3 and 4:

      fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' &&
      ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt &&
      ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE

- final test procedure

  TODO: establish host-level test procedure and run it here.

- switch to DRBD, still on the Ganeti MASTER NODE:

      gnt-instance stop $INSTANCE &&
      gnt-instance modify -t drbd $INSTANCE &&
      gnt-instance failover -f $INSTANCE &&
      gnt-instance start $INSTANCE

  The above can sometimes fail if the allocator is upset about something in the cluster, for example:

      Can's find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

  This situation is covered by ticket 33785. To work around the allocator, you can specify a secondary node directly:

      gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
      gnt-instance failover -f $INSTANCE &&
      gnt-instance start $INSTANCE

  TODO: move into fabric, maybe in a libvirt-import-live or post-libvirt-import job that would also do the renumbering below

- change the IP address in the following locations:

  - LDAP (ipHostNumber field, but also change the physicalHost and l fields!). Also drop the dnsTTL attribute while you're at it.
  - Puppet (grep in tor-puppet source, run puppet agent -t; ud-replicate on pauli)
  - DNS (grep in tor-dns source, puppet agent -t; ud-replicate on nevii)
  - nagios (don't forget to change the parent)
  - reverse DNS (upstream web UI, e.g. Hetzner Robot)
  - grep for the host's IP address on itself:

        grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
        grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv

  - grep for the host's IP on all hosts:

        cumin-all-puppet
        cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'

  TODO: move those jobs into fabric

- retire the old instance (only a tiny part of howto/retire-a-host):

      ./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST

- update the Nextcloud spreadsheet to remove the machine from the KVM host
- warn users about the migration, for example:

      To: tor-project@lists.torproject.org
      Subject: cupani AKA git-rw IP address changed

      The main git server, cupani, is the machine you connect to when you
      push or pull git repositories over ssh to git-rw.torproject.org. That
      machine has been migrated to the new Ganeti cluster.

      This required an IP address change from:

          78.47.38.228 2a01:4f8:211:6e8:0:823:4:1

      to:

          116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2

      DNS has been updated and preliminary tests show that everything is
      mostly working. You will get a warning about the IP address change
      when connecting over SSH, which will go away after the first
      connection:

          Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.

      That is normal. The SSH fingerprints of the host did not change.

      Please do report any other anomaly using the normal channels:

      https://gitlab.torproject.org/tpo/tpa/team/-/wikis/support

      The service was unavailable for about an hour during the migration.
Importing external libvirt instances, manual
This procedure is now easier to accomplish with the Fabric tools written especially for this purpose. Use the above procedure instead. This is kept for historical reference.
Assumptions:
-
INSTANCE: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g.chiwui.torproject.org) -
SPARE_NODE: a ganeti node with free space (e.g.fsn-node-03.torproject.org) where theINSTANCEwill be migrated -
MASTER_NODE: the master ganeti node (e.g.fsn-node-01.torproject.org) -
KVM_HOST: the machine which we migrate theINSTANCEfrom - the
INSTANCEhas onlyrootandswappartitions
Import procedure:
- pick a viable SPARE NODE to import the instance (see "evaluating cluster capacity" above, when in doubt), login to the three servers, setting the proper environment everywhere, for example:

      MASTER_NODE=fsn-node-01.torproject.org
      SPARE_NODE=fsn-node-03.torproject.org
      KVM_HOST=kvm1.torproject.org
      INSTANCE=test.torproject.org

- establish VM specs, on the KVM HOST:

  - disk space in GiB:

        for disk in /srv/vmstore/$INSTANCE/*; do
            printf "$disk: "
            echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
        done

  - number of CPU cores:

        sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml

  - memory, converting from KiB to GiB:

        echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l

    TODO: make sure the memory line is in KiB and that the number makes sense.

  - on the INSTANCE, find the swap device UUID so we can recreate it later:

        blkid -t TYPE=swap -s UUID -o value

- setup a copy channel, on the SPARE NODE:

      ssh-agent bash
      ssh-add /etc/ssh/ssh_host_ed25519_key
      cat /etc/ssh/ssh_host_ed25519_key.pub

  on the KVM HOST:

      echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root

- copy the .qcow file(s) over, from the KVM HOST to the SPARE NODE:

      rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
      rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true

  Note: it's possible there is not enough room in /srv: in the base Ganeti installs, everything is in the same root partition (/) which will fill up if the instance is (say) over ~30GiB. In that case, create a filesystem in /srv:

      (mkdir /root/srv && mv /srv/* /root/srv true) || true &&
      lvcreate -L 200G vg_ganeti -n srv &&
      mkfs /dev/vg_ganeti/srv &&
      echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
      mount /srv &&
      ( mv /root/srv/* ; rmdir /root/srv )

  This partition can be reclaimed once the VM migrations are completed, as it needlessly takes up space on the node.

- on the SPARE NODE, create and initialize logical volumes with the predetermined sizes:

      lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
      mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
      lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
      qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root
      lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd
      qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm

  Note how we assume two disks above, but the instance might have a different configuration that would require changing the above. The above, common, configuration is to have an LVM disk separate from the "root" disk, the former being on a HDD, but the HDD is sometimes completely omitted and sizes can differ.

  Sometimes it might be worth using pv to get progress on long transfers:

      qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
      pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k

  TODO: ideally, the above procedure (and many steps below as well) would be automatically deduced from the disk listing established in the first step.

- on the MASTER NODE, create the instance, adopting the LVs:

      gnt-instance add -t plain \
        -n fsn-node-03 \
        --disk 0:adopt=$INSTANCE-root \
        --disk 1:adopt=$INSTANCE-swap \
        --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
        --backend-parameters memory=2g,vcpus=2 \
        --net 0:ip=pool,network=gnt-fsn \
        --no-name-check \
        --no-ip-check \
        -o debootstrap+default \
        $INSTANCE

- cross your fingers and watch the party:

      gnt-instance console $INSTANCE

- IP address change on the new instance: edit /etc/hosts and /etc/network/interfaces by hand and add the IPv4 and IPv6 addresses. The IPv4 configuration can be found in:

      gnt-instance info $INSTANCE

  The IPv6 address can be guessed by concatenating 2a01:4f8:fff0:4f:: and the IPv6 link-local address without the fe80:: prefix. For example, a link-local address of fe80::266:37ff:fe65:870f/64 should yield the following configuration:

      iface eth0 inet6 static
          accept_ra 0
          address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
          gateway 2a01:4f8:fff0:4f::1

  TODO: reuse gnt-debian-interfaces from the ganeti puppet module script here?

- functional tests: change your /etc/hosts to point to the new server and see if everything still kind of works

- shutdown original instance

- resync and reconvert image, on the Ganeti MASTER NODE:

      gnt-instance stop $INSTANCE

  on the Ganeti node:

      rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
      qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root &&
      rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
      qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm

- switch to DRBD, still on the Ganeti MASTER NODE:

      gnt-instance modify -t drbd $INSTANCE
      gnt-instance failover $INSTANCE
      gnt-instance startup $INSTANCE

- redo the IP address change in /etc/network/interfaces and /etc/hosts

- final functional test

- change the IP address in the following locations:

  - nagios (don't forget to change the parent)
  - LDAP (ipHostNumber field, but also change the physicalHost and l fields!)
  - Puppet (grep in tor-puppet source, run puppet agent -t; ud-replicate on pauli)
  - DNS (grep in tor-dns source, puppet agent -t; ud-replicate on nevii)
  - reverse DNS (upstream web UI, e.g. Hetzner Robot)

- decommission the old instance (howto/retire-a-host)
Troubleshooting
- if boot takes a long time and you see a message like this on the console:

      [ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)

  ... which is generally followed by:

      [DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
      [DEPEND] Dependency failed for Swap.

  it means the swap device UUID wasn't set up properly and does not match the one provided in /etc/fstab. That is probably because you missed the mkswap -U step documented above.
References
- Upstream docs have the canonical incantation:

      gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME

- DSA docs also use disk adoption and have a procedure to migrate to DRBD
- Riseup docs suggest creating a VM without installing, shutting down and then syncing

Ganeti supports importing and exporting from the Open Virtualization Format (OVF), but unfortunately libvirt doesn't seem to support exporting to OVF. There's a virt-convert tool which can import OVF, but not the reverse. The libguestfs library also has a converter, but it doesn't support exporting to OVF or anything Ganeti can load directly either.
So people have written their own conversion tools or their own conversion procedures.
Ganeti also supports file-backed instances, but "adoption" is specifically designed for logical volumes, so it doesn't work for our use case.
Rebooting
Those hosts need special care, as we can accomplish zero-downtime
reboots on those machines. The reboot script in tsa-misc takes
care of the special steps involved (which is basically to empty a
node before rebooting it).
Such a reboot should be run interactively, inside a tmux or screen
session, and currently takes over 15 minutes to complete, depending
on the size of the cluster (in terms of core memory usage).
Once the reboot is completed, all instances might end up on a single machine, and the cluster might need to be rebalanced, see below. (Note: the update script should eventually do that, see ticket 33406).
Rebalancing a cluster
After a reboot or a downtime, all instances might end up on the same machine. This is normally handled by the reboot script, but it might be desirable to do this by hand if there was a crash or another special condition.
This can be easily corrected with this command, which will spread instances around the cluster to balance it:
hbal -L -C -v -P
The above will show the proposed solution, with the state of the
cluster before, and after (-P) and the commands to get there
(-C). To actually execute the commands, you can copy-paste those
commands. An alternative is to pass the -X argument, to tell hbal
to actually issue the commands itself:
hbal -L -C -v -P -X
This will automatically move the instances around and rebalance the cluster. Here's an example run on a small cluster:
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
root@fsn-node-01:~# hbal -L -X
Loaded 2 nodes, 5 instances
Group size 2 nodes, 5 instances
Selected node group: default
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 8.45007519
Trying to minimize the CV...
1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 4.98124611 a=f
2. loghost01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 1.78271883 a=f
Cluster score improved from 8.45007519 to 1.78271883
Solution length=2
Got job IDs 16345
Got job IDs 16346
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
In the above example, you should notice that the web-fsn instances both
ended up on the same node. That's because the balancer did not know
that they should be distributed. A special configuration was done,
below, to avoid that problem in the future. But as a workaround,
instances can also be moved by hand and the cluster re-balanced.
Also notice that -X does not show the job output, use
ganeti-watch-jobs for that, in another terminal. See the job
inspection section for more details on that.
Redundant instances distribution
Some instances are redundant across the cluster and should not end up
on the same node. A good example is the web-fsn-01 and web-fsn-02
pair which, in theory, serve similar traffic. If they end
up on the same node, it might flood the network on that machine or at
least defeat the purpose of having redundant machines.
The way to ensure they get distributed properly by the balancing algorithm is to "tag" them. For the web nodes, for example, this was performed on the master:
gnt-cluster add-tags htools:iextags:service
gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn
gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn
This tells Ganeti that web-fsn is an "exclusion tag" and the
optimizer will not try to schedule instances with those tags on the
same node.
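For example, to also spread the onionoo backends under the same prefix (assuming we want that; the cluster-level exclusion tag above only needs to be added once):

    gnt-instance add-tags onionoo-backend-01.torproject.org service:onionoo-backend
    gnt-instance add-tags onionoo-backend-02.torproject.org service:onionoo-backend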
To see which tags are present, use:
# gnt-cluster list-tags
htools:iextags:service
You can also find which nodes are assigned to a tag with:
# gnt-cluster search-tags service
/cluster htools:iextags:service
/instances/web-fsn-01.torproject.org service:web-fsn
/instances/web-fsn-02.torproject.org service:web-fsn
IMPORTANT: a previous version of this article mistakenly indicated that a new cluster-level tag had to be created for each service. That method did not work. The hbal manpage explicitly mentions that the cluster-level tag is a prefix that can be used to create multiple such tags. This configuration also happens to be simpler and easier to use.
HDD migration restrictions
Cluster balancing works well until there are inconsistencies between how nodes are configured. In our case, some nodes have HDDs (Hard Disk Drives, AKA spinning rust) and others do not. Therefore, it's not possible to move an instance from a node with a disk allocated on the HDD to a node that does not have such a disk.
Yet somehow the allocator is not smart enough to tell, and you will get the following error when doing an automatic rebalancing:
one of the migrate failed and stopped the cluster balance: Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd
In this case, it is trying to migrate the gitlab-02 server from
fsn-node-01 (which has an HDD) to fsn-node-07 (which hasn't),
which naturally fails. This is a known limitation of the Ganeti
code. There has been a draft design document for multiple storage
unit support since 2015, but it has never been
implemented. Multiple issues have been reported upstream on
the subject:
- 208: Bad behaviour when multiple volume groups exists on nodes
- 1199: unable to mark storage as unavailable for allocation
- 1240: Disk space check with multiple VGs is broken
- 1379: Support for displaying/handling multiple volume groups
Unfortunately, there are no known workarounds for this, at least not
that fix the hbal command. It is possible to exclude the faulty
migration from the pool of possible moves, however, for example in the
above case:
hbal -L -v -C -P --exclude-instances gitlab-02.torproject.org
It's also possible to use the --no-disk-moves option to avoid disk
move operations altogether.
Both workarounds obviously do not correctly balance the
cluster... Note that we have also tried to use htools:migration tags
to work around that issue, but those do not work for secondary
instances. For this we would need to set up node groups
instead.
A good trick is to look at the solution proposed by hbal:
Trying to minimize the CV...
1. tbb-nightlies-master fsn-node-01:fsn-node-02 => fsn-node-04:fsn-node-02 6.12095251 a=f r:fsn-node-04 f
2. bacula-director-01 fsn-node-01:fsn-node-03 => fsn-node-03:fsn-node-01 4.56735007 a=f
3. staticiforme fsn-node-02:fsn-node-04 => fsn-node-02:fsn-node-01 3.99398707 a=r:fsn-node-01
4. cache01 fsn-node-07:fsn-node-05 => fsn-node-07:fsn-node-01 3.55940346 a=r:fsn-node-01
5. vineale fsn-node-05:fsn-node-06 => fsn-node-05:fsn-node-01 3.18480313 a=r:fsn-node-01
6. pauli fsn-node-06:fsn-node-07 => fsn-node-06:fsn-node-01 2.84263128 a=r:fsn-node-01
7. neriniflorum fsn-node-05:fsn-node-02 => fsn-node-05:fsn-node-01 2.59000393 a=r:fsn-node-01
8. static-master-fsn fsn-node-01:fsn-node-02 => fsn-node-02:fsn-node-01 2.47345604 a=f
9. polyanthum fsn-node-02:fsn-node-07 => fsn-node-07:fsn-node-02 2.47257956 a=f
10. forrestii fsn-node-07:fsn-node-06 => fsn-node-06:fsn-node-07 2.45119245 a=f
Cluster score improved from 8.92360196 to 2.45119245
Look at the last column. The a= field shows what "action" will be
taken. An f is a failover (or "migrate"), and r: is a
replace-disks, with the new secondary after the colon (:). In
the above case, the proposed solution is correct: no secondary node is
in the range of nodes that lack HDDs (fsn-node-0[5-7]). If one of
the disk replaces hits one of the nodes without a HDD, that's when
you use --exclude-instances to find a better solution. A typical
exclude is:
hbal -L -v -C -P --exclude-instance=bacula-director-01,tbb-nightlies-master,eugeni,winklerianum,woronowii,rouyi,loghost01,materculae,gayi,weissii
Another option is to specifically look for instances that do not have
a HDD and migrate only those. In my situation, gnt-cluster verify
was complaining that fsn-node-02 was full, so I looked for all the
instances on that node and found the ones which didn't have a HDD:
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
| sort | grep 'fsn-node-02' | awk '{print $3}' | \
while read instance ; do
printf "checking $instance: "
if gnt-instance info $instance | grep -q hdd ; then
echo "HAS HDD"
else
echo "NO HDD"
fi
done
Then you can manually migrate -f (to fail over to the secondary) and
replace-disks -n (to find another secondary) the instances that
can be migrated out of the four first machines (which have HDDs) to
the last three (which do not). Look at the memory usage in gnt-node list to pick the best node.
In general, if a given node in the first four is overloaded, a good trick is to look for one that can be failed over, with, for example:
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'
... or, for a particular node (say fsn-node-04):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'
The instances listed there would be ones that can be migrated to their
secondary to give fsn-node-04 some breathing room.
Adding and removing addresses on instances
Say you created an instance but forgot to assign an extra IP. You can still do so with:
gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
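Removing works through the same mechanism, by interface index. A sketch, assuming the address to drop sits on the second interface; check gnt-instance info to find the right index first:

    gnt-instance modify --net 1:remove test01.torproject.org
    gnt-instance reboot test01.torproject.org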
Job inspection
Sometimes it can be useful to look at the active jobs. It might be,
for example, that another user has queued a bunch of jobs in another
terminal which you do not have access to, or some automated process
did (Nagios, for example, runs gnt-cluster verify once in a
while). Ganeti has this concept of "jobs" which can provide
information about those.
The command gnt-job list will show the entire job history, and
gnt-job list --running will show running jobs. gnt-job watch can
be used to watch a specific job.
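For example (the job ID is just an illustration, take one from the list output):

    gnt-job list                # full job history
    gnt-job list --running      # only currently running jobs
    gnt-job watch 48030         # follow the output of a specific job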
We have a wrapper called ganeti-watch-jobs which automatically shows
the output of whatever job is currently running and exits when all
jobs complete. This is particularly useful while rebalancing the
cluster as hbal -X does not show the job output...
Open vSwitch crash course and debugging
Open vSwitch is used in the gnt-fsn cluster to connect the multiple
machines with each other through Hetzner's "vswitch" system.
You will typically not need to deal with Open vSwitch, as Ganeti takes care of configuring the network on instance creation and migration. But if you believe there might be a problem with it, you can start with the following.
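A few basic inspection commands, using standard Open vSwitch tooling (nothing here is specific to our setup; br0 is the bridge name used on our nodes):

    ovs-vsctl show              # overview of bridges, ports and VLAN tags
    ovs-vsctl list-br           # list configured bridges
    ovs-appctl fdb/show br0     # MAC learning table of the br0 bridge
    ovs-ofctl dump-flows br0    # OpenFlow flow table of the br0 bridge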
Pager playbook
I/O overload
In case of excessive I/O, it might be worth looking into which machine is the cause. The howto/drbd page explains how to map a DRBD device to a VM. You can also find which logical volume is backing an instance (and vice versa) with this command:
lvs -o+tags
This will list all logical volumes and their associated tags. If you already know which logical volume you're looking for, you can address it directly:
root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
LV Tags
originstname+bacula-director-01.torproject.org
Node failure
Ganeti clusters are designed to be self-healing. As long as only one machine disappears, the cluster should be able to recover by failing instances over to other nodes. This is currently done manually, however.
WARNING: the following procedure should be considered a LAST RESORT. In the vast majority of cases, it is simpler and less risky to just restart the node using a remote power cycle to restore the service than to risk the split brain scenario which this procedure can cause when not followed properly.
WARNING, AGAIN: if for some reason the node you are failing over from actually returns on its own without you being able to stop it, it may still be running those DRBD disks and virtual machines, and you may end up in a split brain scenario.
If, say, fsn-node-07 completely fails and you need to restore
service to the virtual machines running on that server, you can
failover to the secondaries. Before you do, however, you need to be
completely confident it is not still running in parallel, which could
lead to a "split brain" scenario. For that, just cut the power to the
machine using out of band management (e.g. on Hetzner, power down the
machine through the Hetzner Robot, on Cymru, use the iDRAC to cut the
power to the main board).
Once the machine is powered down, instruct Ganeti to stop using it altogether:
gnt-node modify --offline=yes fsn-node-07
Then, once the machine is offline and Ganeti also agrees, switch all the instances on that node to their secondaries:
gnt-node failover fsn-node-07.torproject.org
It's possible that you need --ignore-consistency, but this has caused
trouble in the past (see 40229). In any case, it is not used at
the WMF, for example: they explicitly say they never needed the
flag.
Note that it will still try to connect to the failed node to shutdown the DRBD devices, as a last resort.
Recovering from the failure should be automatic: once the failed server is repaired and restarts, it will contact the master to ask for instances to start. Since the instances have been migrated away, none will be started and there should not be any inconsistencies.
Once the machine is up and running and you are confident you do not have a split brain scenario, you can re-add the machine to the cluster with:
gnt-node add --readd fsn-node-07.torproject.org
Once that is done, rebalance the cluster because you now have an empty node which could be reused (hopefully). It might, obviously, be worth exploring the root cause of the failure before readding the machine to the cluster.
Recoveries could eventually be automated if such situations occur more often, by scheduling a harep cron job, which isn't enabled in Debian by default. See also the autorepair section of the admin manual.
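If we ever automate this, it would presumably look something like the following cron job. This is a sketch only: harep is the htools repair tool, and the schedule and log path are made up:

    # /etc/cron.d/ganeti-harep -- hypothetical, not currently deployed
    */30 * * * * root harep -L >> /var/log/ganeti/harep.log 2>&1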
Master node failure
A master node failure is a special case, as you do not have access to the node to run Ganeti commands. We have not established our own procedure for this yet.
TODO: expand documentation on master node failure recovery.
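The standard upstream mechanism is to promote one of the master candidates with gnt-cluster master-failover; a sketch, to be validated when we write the actual procedure:

    # run on the master candidate that should take over
    gnt-cluster master-failover
    # if a majority of nodes cannot be reached, --no-voting may be
    # needed, but it is dangerous: it bypasses split-brain protection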
Split brain recovery
A split brain occurred during a partial failure, failover, then
unexpected recovery of fsn-node-07 (issue 40229). It might
occur in other scenarios, but this section documents that specific
one. Hopefully the recovery will be similar in other scenarios.
The split brain was the result of an operator running this command to failover the instances running on the node:
gnt-node failover --ignore-consistency fsn-node-07.torproject.org
The symptom of the split brain is that the VM is running on two
machines. You will see that in gnt-cluster verify:
Thu Apr 22 01:28:04 2021 * Verifying node status
Thu Apr 22 01:28:04 2021 - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021 - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021 - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021 - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021 - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org
Thu Apr 22 01:28:04 2021 - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org
In the above, the verification finds an instance running on an
unexpected server (the old primary). Disks will be in a similar
"degraded" state, according to gnt-cluster verify:
Thu Apr 22 01:28:04 2021 * Verifying instance status
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
We can also see that symptom on an individual instance:
root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org
- Instance name: onionbalance-01.torproject.org
[...]
Disks:
- disk/0: drbd, size 10.0G
access mode: rw
nodeA: fsn-node-05.torproject.org, minor=29
nodeB: fsn-node-07.torproject.org, minor=26
port: 11031
on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED*
on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED*
[...]
The first (optional) step in a split brain scenario is to stop further damage from the running instances: stop all the instances running in parallel, on both the previous and new primaries:
gnt-instance stop $INSTANCES
Then, on fsn-node-07, use kill(1) to shut down the qemu processes running the VMs directly. Now all the instances should be shut down, and no further changes that could be lost will be made on the VMs.
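One way to find those processes, assuming the qemu command line contains the instance name (the Ganeti KVM hypervisor normally passes it via the -name flag; palmeri is just an example here):
# list qemu processes for the instance on the "losing" node
pgrep -a -f 'qemu.*palmeri.torproject.org'
# terminate them; escalate to SIGKILL only if they do not exit
pkill -f 'qemu.*palmeri.torproject.org'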
(This step is optional because you can also skip straight to the hard decision below, while leaving the instances running. But that adds pressure to you, and we don't want to do that to your poor brain right now.)
That will leave you time to make a more important decision: which node will be authoritative (which will keep running as primary) and which one will "lose" (and have its instances destroyed)? There is no easy right or wrong answer here: it's a judgement call. In any case, there might already have been data loss: for as long as both nodes were up with the VMs running on both, data written on the "losing" node during the split brain will be lost when we destroy its state.
If you have picked the previous primary as the "new" primary, you will need to first revert the failover and flip the instances back to the previous primary:
for instance in $INSTANCES; do
gnt-instance failover $instance
done
When that is done, or if you have picked the "new" primary (the one the instances were originally failed over to) as the official one, you need to fix the disks' state. For this, flip the instance to a "plain" disk (i.e. turn off DRBD), then turn DRBD back on. This stops mirroring the disk, then allocates a fresh mirror in the right place. Assuming all instances are stopped, this should do it:
for instance in $INSTANCES ; do
gnt-instance modify -t plain $instance
gnt-instance modify -t drbd --no-wait-for-sync $instance
gnt-instance start $instance
gnt-instance console $instance
done
Then the machines should be back up on a single node and the split brain resolved. Note that the other side of the DRBD mirror is destroyed in this procedure: that is the step that drops the data which was written to the wrong side of the "split brain".
Once everything is back to normal, it might be a good idea to rebalance the cluster.
References:
- the `-t plain` hack comes from this post on the Ganeti list
- this procedure suggests using `replace-disks -n`, which also works, but requires us to pick the secondary by hand each time, which is annoying
- this procedure has instructions on how to recover at the DRBD level directly, but we have not needed those so far
Bridge configuration failures
If you get the following error while trying to bring up the bridge:
root@chi-node-02:~# ifup br0
add bridge failed: Package not installed
run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
ifup: failed to bring up br0
... it might be that the bridge scripts cannot load the required kernel module, because kernel module loading has been disabled. Reboot with the /etc/no_modules_disabled file present:
touch /etc/no_modules_disabled
reboot
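Once the machine is back up, a quick sanity check (a sketch; the exact module list depends on the configuration):
# "Package not installed" usually means the bridge module was missing
modprobe bridge
lsmod | grep -e bridge -e openvswitch
ifup br0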
It might also be that the machine took too long to boot because it was disabled in Mandos and the operator took too long to enter the LUKS passphrase. Re-enable the machine with this command on the Mandos server:
mandos-ctl --enable chi-node-02.torproject
Cleaning up orphan disks
Sometimes gnt-cluster verify will give this warning, particularly
after a failed rebalance:
* Verifying orphan volumes
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown
This can happen when an instance was partially migrated to a node (in
this case fsn-node-06) but the migration failed because (for
example) there was no HDD on the target node. The fix here is simply
to remove the logical volumes on the target node:
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data
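With many orphans, the node/volume pairs can be extracted from the verify output instead of being copied by hand; a sketch, assuming the exact output format shown above:
gnt-cluster verify |
  awk '/volume .* is unknown/ {gsub(":", "", $4); print $4, $6}' |
  while read node vol; do
      ssh "$node" -tt lvremove "$vol"
  done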
Fixing inconsistent disks
Sometimes gnt-cluster verify will give this error:
WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok'
... or worse:
ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)>
The fix for both is to run:
gnt-instance activate-disks materculae.torproject.org
This will make sure disks are correctly setup for the instance.
If you have a lot of those warnings, pipe the output into this filter, for example:
gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' |
sed 's/.*instance//;s/:.*//' |
sort -u |
while read instance; do
gnt-instance activate-disks $instance
done
Not enough memory for failovers
Another error that gnt-cluster verify can give you is, for example:
- ERROR: node fsn-node-04.torproject.org: not enough memory to accomodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available)
The solution is to rebalance the cluster.
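Rebalancing has its own section in this documentation; as a quick reminder, it is driven by the htools balancer, roughly:
# show the moves hbal would make; -L talks to the local Ganeti daemons
hbal -L
# actually submit the rebalancing jobs
hbal -L -X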
Can't assemble device after creation
It's possible that Ganeti fails to create an instance with this error:
Thu Jan 14 20:01:00 2021 - WARNING: Device creation failed
Failure: command execution error:
Can't create block device <DRBD8(hosts=d1b54252-dd81-479b-a9dc-2ab1568659fa/0-3aa32c9d-c0a7-44bb-832d-851710d04765/0, port=11005, backend=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=10240m)> on node chi-node-03.torproject.org for instance build-x86-13.torproject.org: Can't assemble device after creation, unusual event: drbd0: timeout while configuring network
In this case, the problem was that chi-node-03 had an incorrect
secondary_ip set. The immediate fix was to correctly set the
secondary address of the node:
gnt-node modify --secondary-ip=172.30.130.3 chi-node-03.torproject.org
Then gnt-cluster verify was complaining about the leftover DRBD
device:
- ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use
For this, see DRBD: deleting a stray device.
SSH key verification failures
Ganeti uses SSH to launch arbitrary commands (as root!) on other
nodes. It does this using a funky command, from node-daemon.log:
ssh -oEscapeChar=none -oHashKnownHosts=no \
    -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
    -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
    -oConnectTimeout=10 -oHostKeyAlias=chignt.torproject.org \
    -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 \
    root@chi-node-03.torproject.org
This has caused us some problems in the Ganeti buster to bullseye upgrade, possibly because of changes in host verification routines in OpenSSH. The problem was documented in issue 1608 upstream and tpo/tpa/team#40383.
A workaround is to synchronize Ganeti's known_hosts file:
grep 'chi-node-0[0-9]' /etc/ssh/ssh_known_hosts | grep -v 'initramfs' | grep ssh-rsa | sed 's/[^ ]* /chignt.torproject.org /' >> /var/lib/ganeti/known_hosts
Note that the above assumes a cluster with fewer than 10 nodes.
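To confirm the fix, the verification can be reproduced by hand with the same options Ganeti uses (trimmed down from the command above):
ssh -oHostKeyAlias=chignt.torproject.org \
    -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
    -oUserKnownHostsFile=/dev/null \
    -oStrictHostKeyChecking=yes -oBatchMode=yes \
    root@chi-node-03.torproject.org true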
Other troubleshooting
The walkthrough also has a few recipes to resolve common problems.
See also the common issues page in the Ganeti wiki.
Look into the logs on the relevant nodes (particularly /var/log/ganeti/node-daemon.log, which shows all commands run by Ganeti) when you have problems.
Disaster recovery
If things get completely out of hand and the cluster becomes too unreliable for service, the only solution is to rebuild another one elsewhere. Since Ganeti 2.2, there is a move-instance command to move instances between clusters that can be used for that purpose.
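A minimal sketch of such a move, assuming RAPI credentials and certificates are already set up on both clusters (see move-instance(1) for the required options; the names here are illustrative):
# move test-01 from the gnt-fsn cluster to the gnt-chi cluster
move-instance fsngnt.torproject.org chignt.torproject.org test-01.torproject.org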
If Ganeti is completely destroyed and its APIs don't work anymore, the last resort is to restore all virtual machines from howto/backup. Hopefully, this should not happen except in the case of a catastrophic data loss bug in Ganeti or howto/drbd.
Reference
Installation
Ganeti is installed as part of the bare bones machine installation process, typically in the "post-install configuration" procedure, once the machine is fully installed and configured.
Typically, we add a new node to an existing cluster. Below are cluster-specific procedures to add a new node to each existing cluster, alongside the configuration of the cluster as it was done at the time (and how it could be used to rebuild a cluster from scratch).
Make sure you use the procedure specific to the cluster you are working on.
Note that this is not about installing virtual machines (VMs) inside a Ganeti cluster: for that you want to look at the new instance procedure.
New gnt-fsn node
1. To create a new box, follow howto/new-machine-hetzner-robot but change the following settings:

    - Server: PX62-NVMe
    - Location: FSN1
    - Operating system: Rescue
    - Additional drives: 2x10TB HDD (update: starting from fsn-node-05, we are not ordering additional drives to save on costs, see ticket 33083 for rationale)
    - Add in the comment form that the server needs to be in the same datacenter as the other machines (FSN1-DC13, but double-check)

2. Follow the howto/new-machine post-install configuration.

3. Add the server to the two vSwitch systems in the Hetzner Robot web UI.

4. Install openvswitch and allow modules to be loaded:

        touch /etc/no_modules_disabled
        reboot
        apt install openvswitch-switch

5. Allocate a private IP address in the 30.172.in-addr.arpa zone (and the torproject.org zone) for the node, in the admin/dns/domains.git repository.

6. Copy over the /etc/network/interfaces from another Ganeti node, changing the address and gateway fields to match the local entry.

7. Knock on wood, cross your fingers, pet a cat, help your local book store, and reboot:

        reboot

8. Prepare all the nodes by configuring them in Puppet, by adding the class roles::ganeti::fsn to the node.

9. Re-disable module loading:

        rm /etc/no_modules_disabled

10. Run Puppet across the Ganeti cluster to ensure IPsec tunnels are up:

        cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'

11. Reboot again:

        reboot

12. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.135.2 \
          --no-ssh-key-check \
          --no-node-setup \
          fsn-node-02.torproject.org

    If this is an entirely new cluster, you need a different procedure: see the cluster initialization procedure instead.

13. Make sure everything is great in the cluster:

        gnt-cluster verify

    If that takes a long time and eventually fails with errors like:

        ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out

    ... that is because the howto/ipsec tunnels between the nodes are failing. Make sure Puppet has run across the cluster (step 10 above) and see howto/ipsec for further diagnostics. For example, the above would be fixed with:

        ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
        ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"
gnt-fsn cluster initialization
This procedure replaces the gnt-node add step in the initial setup
of the first Ganeti node when the gnt-fsn cluster was setup:
gnt-cluster init \
--master-netdev vlan-gntbe \
--vg-name vg_ganeti \
--secondary-ip 172.30.135.1 \
--enabled-hypervisors kvm \
--nic-parameters mode=openvswitch,link=br0,vlan=4000 \
--mac-prefix 00:66:37 \
--no-ssh-init \
--no-etc-hosts \
fsngnt.torproject.org
The above assumes that fsngnt is already in DNS. See the MAC
address prefix selection section for information on how the
--mac-prefix argument was selected.
Then the following extra configuration was performed:
gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify --uid-pool 4000-4019
The network configuration (below) must also be performed for the address blocks reserved in the cluster.
New gnt-chi node
1. To create a new box, follow the cymru new-machine howto.

2. Follow the howto/new-machine post-install configuration.

3. Allocate a private IP address in the 30.172.in-addr.arpa zone for the node, in the admin/dns/domains.git repository.

4. Add the private IP address to the eth1 interface, for example in /etc/network/interfaces.d/eth1:

        auto eth1
        iface eth1 inet static
            address 172.30.130.5/24

    This IP must be allocated in the reverse DNS zone file (30.172.in-addr.arpa) and the torproject.org zone file in the dns/domains.git repository.

5. Enable the interface:

        ifup eth1

6. Setup a bridge on the public interface, replacing the eth0 blocks with something like:

        auto eth0
        iface eth0 inet manual

        auto br0
        iface br0 inet static
            address 38.229.82.104/24
            gateway 38.229.82.1
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0

        # IPv6 configuration
        iface br0 inet6 static
            accept_ra 0
            address 2604:8800:5000:82:baca:3aff:fe5d:8774/64
            gateway 2604:8800:5000:82::1

7. Allow modules to be loaded, cross your fingers that you didn't screw up the network configuration above, and reboot:

        touch /etc/no_modules_disabled
        reboot

8. Configure the node in Puppet by adding it to the roles::ganeti::chi class, and run Puppet on the new node:

        puppet agent -t

9. Re-disable module loading:

        rm /etc/no_modules_disabled

10. Run Puppet across the Ganeti cluster to ensure firewalls are correctly configured:

        cumin -p 0 'C:roles::ganeti::chi' 'puppet agent -t'

11. Then the node is ready to be added to the cluster, by running this on the master node:

        gnt-node add \
          --secondary-ip 172.30.130.5 \
          --no-ssh-key-check \
          --no-node-setup \
          chi-node-05.torproject.org

    If this is an entirely new cluster, you need a different procedure: see the cluster initialization procedure instead.

12. Make sure everything is great in the cluster:

        gnt-cluster verify
If the last step fails with SSH errors, you may need to re-synchronise
the SSH known_hosts file, see SSH key verification failures.
gnt-chi cluster initialization
This procedure replaces the gnt-node add step in the initial setup
of the first Ganeti node when the gnt-chi cluster was setup:
gnt-cluster init \
--master-netdev eth1 \
--nic-parameters link=br0 \
--vg-name vg_ganeti \
--secondary-ip 172.30.130.1 \
--enabled-hypervisors kvm \
--mac-prefix 06:66:38 \
--no-ssh-init \
--no-etc-hosts \
chignt.torproject.org
The above assumes that chignt is already in DNS. See the MAC
address prefix selection section for information on how the
--mac-prefix argument was selected.
Then the following extra configuration was performed:
gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify --uid-pool 4000-4019
The upper limit for CPU count and memory size were doubled, to 16 and 64G, respectively, with:
gnt-cluster modify --ipolicy-bounds-specs \
max:cpu-count=16,disk-count=16,disk-size=1048576,\
memory-size=65536,nic-count=8,spindle-use=12\
/min:cpu-count=1,disk-count=1,disk-size=1024,\
memory-size=128,nic-count=1,spindle-use=1
NOTE: watch out for whitespace here. The original source for this command had too much whitespace, which fails with:
Failure: unknown/wrong parameter name 'Missing value for key '' in option --ipolicy-bounds-specs'
The disk templates also had to be modified to account for iSCSI devices:
gnt-cluster modify --enabled-disk-templates drbd,plain,blockdev
gnt-cluster modify --ipolicy-disk-templates drbd,plain,blockdev
The network configuration (below) must also be performed for the address blocks reserved in the cluster. This is the actual initial configuration performed:
gnt-network add --network 38.229.82.0/24 --gateway 38.229.82.1 --network6 2604:8800:5000:82::/64 --gateway6 2604:8800:5000:82::1 gnt-chi-01
gnt-network connect --nic-parameters=link=br0 gnt-chi-01 default
The following IPs were reserved:
gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01
The first two are for the gateway, but the rest are temporary reservations and might be reclaimed eventually.
Network configuration
IP allocation is managed by Ganeti through the gnt-network(8)
system. Say we have 192.0.2.0/24 reserved for the cluster, with
the host IP 192.0.2.100 and the gateway on 192.0.2.1. You will
create this network with:
gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network
If there's also IPv6, it would look something like this:
gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network
Note: the actual name of the network (example-network above) should follow the convention established in doc/naming-scheme.
Then we associate the new network to the default node group:
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default
The arguments to --nic-parameters come from the values configured in
the cluster, above. The current values can be found with gnt-cluster info.
For example, the second ganeti network block was assigned with the following commands:
gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default
IP addresses can be reserved with the --reserved-ips argument to the
modify command, for example:
gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01
Note that the gateway and node IP addresses are automatically reserved; this mechanism is for hosts outside of the cluster.
The network name must follow the naming convention.
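The resulting configuration and address reservations can be audited with gnt-network's list and info commands, for example:
# list all networks known to the cluster
gnt-network list
# show one network in detail, including reserved and allocated addresses
gnt-network info gnt-fsn13-02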
SLA
As long as the cluster is not over capacity, it should be able to survive the loss of a node in the cluster unattended.
Justified machines can be provisioned within a few business days without problems.
New nodes can be provisioned within a week or two, depending on budget and hardware availability.
Design
Our first Ganeti cluster (gnt-fsn) is made of multiple machines
hosted with Hetzner Robot, Hetzner's dedicated server hosting
service. All machines use the same hardware to avoid problems with
live migration. That is currently a customized build of the
PX62-NVMe line.
Network layout
Machines are interconnected over a vSwitch, a "virtual layer 2
network" probably implemented using Software-defined Networking
(SDN) on top of Hetzner's network. The details of that implementation
do not matter much to us, since we do not trust the network and run an
IPsec layer on top of the vswitch. We communicate with the vSwitch
through Open vSwitch (OVS), which is (currently manually)
configured on each node of the cluster.
There are two distinct IPsec networks:
- `gnt-fsn-public`: the public network, which maps to the `fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS network, and the `gnt-fsn` network pool in Ganeti. It provides public IP addresses and routing across the network; instances get IPs allocated in this network.
- `gnt-fsn-be`: the private Ganeti network, which maps to the `fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS network. It has no matching `gnt-network` component; IP addresses are allocated manually in the 172.30.135.0/24 network through DNS. It provides internal routing for Ganeti commands and howto/drbd storage mirroring.
MAC address prefix selection
The MAC address prefix for the gnt-fsn cluster (00:66:37:...) seems
to have been picked arbitrarily. While it does not conflict with a
known existing prefix, it could eventually be issued to a manufacturer
and reused, possibly leading to a MAC address clash. The closest is
currently Huawei:
$ grep ^0066 /var/lib/ieee-data/oui.txt
00664B (base 16) HUAWEI TECHNOLOGIES CO.,LTD
Such a clash is fairly improbable, because that new manufacturer would need to show up on the local network as well. Still, new clusters SHOULD use a different MAC address prefix in the locally administered address (LAA) space, which is "distinguished by setting the second-least-significant bit of the first octet of the address". In other words, the MAC address must have 2, 6, A or E as its second hex digit, i.e. it must look like one of these:
x2 - xx - xx - xx - xx - xx
x6 - xx - xx - xx - xx - xx
xA - xx - xx - xx - xx - xx
xE - xx - xx - xx - xx - xx
We used 06:66:38 in the gnt-chi cluster for that reason. We picked the 06:66 prefix to resemble the existing 00:66 prefix used in gnt-fsn, but varied the last quad (from :37 to :38) to make them slightly more different-looking.
Obviously, it's unlikely the MAC addresses will be compared across clusters in the short term. But it's technically possible the two networks could get bridged at layer 2 if some exotic VPN setup is established between them in the future, so it's good to have some difference.
Hardware variations
We considered experimenting with the new AX line (AX51-NVMe) but in the past DSA had problems live-migrating (it wouldn't immediately fail but there were "issues" after). So we might need to failover instead of migrate between those parts of the cluster. There are also doubts that the Linux kernel supports those shiny new processors at all: similar processors had trouble booting before Linux 5.5 for example, so it might be worth waiting a little before switching to that new platform, even if it's cheaper. See the cluster configuration section below for a larger discussion of CPU emulation.
CPU emulation
Note that we might want to tweak the cpu_type parameter. By default,
it emulates a lot of processing that can be delegated to the host CPU
instead. If we use kvm:cpu_type=host, then each node will tailor the
emulation system to the CPU on the node. But that might make the live
migration more brittle: VMs or processes can crash after a live
migrate because of a slightly different configuration (microcode, CPU,
kernel and QEMU versions all play a role). So we need to find the
lowest common denominator in CPU families. The list of available
families supported by QEMU varies between releases, but is visible
with:
# qemu-system-x86_64 -cpu help
Available CPUs:
x86 486
x86 Broadwell Intel Core Processor (Broadwell)
[...]
x86 Skylake-Client Intel Core Processor (Skylake)
x86 Skylake-Client-IBRS Intel Core Processor (Skylake, IBRS)
x86 Skylake-Server Intel Xeon Processor (Skylake)
x86 Skylake-Server-IBRS Intel Xeon Processor (Skylake, IBRS)
[...]
The current PX62 line is based on the Coffee Lake Intel
micro-architecture. The closest matching family would be
Skylake-Server or Skylake-Server-IBRS, according to wikichip.
Note that newer QEMU releases (4.2, currently in unstable) have more
supported features.
In that context, of course, supporting different CPU manufacturers (say AMD vs Intel) is impractical: they will have totally different families that are not compatible with each other. This will break live migration, which can trigger crashes and problems in the migrated virtual machines.
If there are problems live-migrating between machines, it is still
possible to "failover" (gnt-instance failover instead of migrate)
which shuts off the machine, fails over disks, and starts it on the
other side. That's not such a big problem: we often need to reboot
the guests when we reboot the hosts anyways. But it does complicate
our work. Of course, it's also possible that live migrates work fine
if no cpu_type at all is specified in the cluster, but that needs
to be verified.
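If a common family is pinned, it would look something like this; a sketch, assuming Skylake-Server really is the lowest common denominator (verify with qemu-system-x86_64 -cpu help on every node first):
# pin the emulated CPU family cluster-wide
gnt-cluster modify -H kvm:cpu_type=Skylake-Server
# or experiment on a single instance before committing the whole cluster
gnt-instance modify -H cpu_type=Skylake-Server test-01.torproject.org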
Nodes could also be grouped to limit (automated) live migration to a subset of nodes.
References:
- https://dsa.debian.org/howto/install-ganeti/
- https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86
Installer
The ganeti-instance-debootstrap package is used to install instances. It is configured through Puppet with the shared ganeti module, which deploys a few hooks to automate the install as much as possible. The installer will:
- setup grub to respond on the serial console
- setup and log a random root password
- make sure SSH is installed and log the public keys and fingerprints
- setup swap if a labeled partition is present, or a 512MB swapfile otherwise
- setup basic static networking through /etc/network/interfaces.d
We have custom configurations on top of that to:
- add a few base packages
- do our own custom SSH configuration
- fix the hostname to be a FQDN
- add a line to /etc/hosts
- add a tmpfs
There is work underway to refactor and automate the install better, see ticket 31239 for details.
Storage
TODO: document how DRBD works in general, and how it's setup here in particular.
See also the DRBD documentation.
The Cymru PoP has an iSCSI cluster for large filesystem storage. Ideally, this would be automated inside Ganeti; some quick links:
- search for iSCSI in the ganeti-devel mailing list
- in particular a discussion of integrating SANs into ganeti seems to say "just do it manually" (paraphrasing) and this discussion has an actual implementation, gnt-storage-eql
- it could be implemented as an external storage provider, see the documentation
- the DSA docs are in two parts: iscsi and export-iscsi
- someone made a Kubernetes provisioner for our hardware which could provide sample code
For now, iSCSI volumes are manually created and passed to new virtual machines.
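A sketch of how such a volume might be attached at instance creation time, assuming the iSCSI device appears as a local block device on the node and that the blockdev disk template (enabled above) is used with disk adoption; the device path is hypothetical:
gnt-instance add \
  -t blockdev \
  --disk 0:adopt=/dev/disk/by-id/scsi-EXAMPLE \
  --net 0:ip=pool,network=gnt-chi-01 \
  --no-ip-check \
  --no-name-check \
  --no-install \
  --backend-parameters memory=8g,vcpus=2 \
  test-01.torproject.org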
Issues
There is no issue tracker specifically for this project; file or search for issues in the team issue tracker component.
Ganeti has of course its own issue tracker on GitHub.
Monitoring and testing
Logs and metrics
Ganeti logs a significant amount of information in /var/log/ganeti/. Those logs are of particular interest:
- node-daemon.log: all low-level commands and HTTP requests on the node daemon, including, for example, LVM and DRBD commands
- os/*$hostname*.log: installation log for machine $hostname
Ganeti does not currently expose performance metrics that can be digested by Prometheus, but that would be an interesting feature to add.
Other documentation
Discussion
Overview
The project of creating a Ganeti cluster for Tor emerged in the summer of 2019. The machines were delivered by Hetzner in July 2019 and set up by weasel by the end of the month.
Goals
The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
textile, unifolium, macrum, kvm4 and kvm5).
Must have
- arbitrary virtual machine provisioning
- redundant setup
- automated VM installation
- replacement of existing infrastructure
Nice to have
- fully configured in Puppet
- full high availability with automatic failover
- extra capacity for new projects
Non-Goals
- Docker or "container" provisioning - we consider this out of scope for now
- self-provisioning by end-users: TPA remains in control of provisioning
Approvals required
A budget was proposed by weasel in May 2019 and approved by Vegas in June. An extension to the budget was approved in January 2020 by Vegas.
Proposed Solution
Setup a Ganeti cluster of two machines with a Hetzner vSwitch backend.
Cost
The design based on the PX62 line has the following monthly cost structure:
- per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
- IPv4 space: 35.29EUR (/27)
- IPv6 space: 8.40EUR (/64)
- bandwidth cost: 1EUR/TB (currently 38EUR)
At three servers, that adds up to around 435EUR/mth. Up to date costs are available in the Tor VM hosts.xlsx spreadsheet.
Alternatives considered
Note that instance installs are also possible through FAI, see the Ganeti wiki for examples.
There are GUIs for Ganeti that we are not using, but could, if we want to grant more users access:
- Ganeti Web manager is a "Django based web frontend for managing Ganeti virtualization clusters. Since Ganeti only provides a command-line interface, Ganeti Web Manager’s goal is to provide a user friendly web interface to Ganeti via Ganeti’s Remote API. On top of Ganeti it provides a permission system for managing access to clusters and virtual machines, an in browser VNC console, and vm state and resource visualizations"
- Synnefo is a "complete open source cloud stack written in Python that provides Compute, Network, Image, Volume and Storage services, similar to the ones offered by AWS. Synnefo manages multiple Ganeti clusters at the backend for handling of low-level VM operations and uses Archipelago to unify cloud storage. To boost 3rd-party compatibility, Synnefo exposes the OpenStack APIs to users."