Ganeti is software that manages clusters of virtual machines (KVM or Xen). It can move virtual machine instances from one node to another, create an instance with DRBD replication on a secondary node, live-migrate instances between nodes, and so on.
- Tutorial
- How-to
- Glossary
- Adding a new instance
- Modifying an instance
- Destroying an instance
- Getting information
- Disk operations (DRBD)
- Evaluating cluster capacity
- Moving instances and failover
- Importing external libvirt instances
- Importing external libvirt instances, manual
- Rebooting
- Rebalancing a cluster
- Adding and removing addresses on instances
- Job inspection
- Pager playbook
- Disaster recovery
- Reference
- Discussion
Tutorial
Listing virtual machines (instances)
This will show the running guests, known as "instances":
gnt-instance list
Accessing serial console
Our instances provide a serial console, starting in GRUB. To access it, run:
gnt-instance console test01.torproject.org
To exit, use ^] -- that is, Control-<Closing Bracket>.
How-to
Glossary
In Ganeti, a physical machine is called a node and a virtual machine is an instance. One node is elected to be the master, and that is where all commands should be run from. Nodes are interconnected through a private network that is used to communicate commands and synchronise disks (with howto/drbd). Instances are normally assigned two nodes: a primary and a secondary. The primary is where the virtual machine actually runs and the secondary acts as a hot failover.
See also the more extensive glossary in the Ganeti documentation.
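For example, to list the nodes in the cluster and see which one is currently the master, run (on any node):

gnt-node list
gnt-cluster getmaster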
Adding a new instance
This command creates a new guest, or "instance" in Ganeti's vocabulary, with a 10G root partition, 2G swap, 20G spare on SSD, 800G on HDD, 8GB of RAM and 2 CPU cores:
gnt-instance add \
-o debootstrap+buster \
-t drbd --no-wait-for-sync \
--net 0:ip=pool,network=gnt-fsn \
--no-ip-check \
--no-name-check \
--disk 0:size=10G \
--disk 1:size=2G,name=swap \
--disk 2:size=20G \
--disk 3:size=800G,vg=vg_ganeti_hdd \
--backend-parameters memory=8g,vcpus=2 \
test-01.torproject.org
This is the same without the HDD partition, in the gnt-chi
cluster:
gnt-instance add \
-o debootstrap+buster \
-t drbd --no-wait-for-sync \
--net 0:ip=pool,network=gnt-chi-01 \
--no-ip-check \
--no-name-check \
--disk 0:size=10G \
--disk 1:size=2G,name=swap \
--disk 2:size=20G \
--backend-parameters memory=8g,vcpus=2 \
test-01.torproject.org
What that does
This configures the following:
- redundant disks in a DRBD mirror; use `-t plain` instead of `-t drbd` for tests, as that avoids syncing of disks and will speed things up considerably (even with `--no-wait-for-sync` there are some operations that block on synced mirrors). Only one node should be provided as the argument for `--node` then.
- three partitions: one on the default VG (SSD), one on another (HDD) and a swap file on the default VG. If you don't specify a swap device, a 512MB swapfile is created in `/swapfile`. TODO: configure disk 2 and 3 automatically in installer. (`/var` and `/srv`?)
- 8GB of RAM with 2 virtual CPUs
- an IP allocated from the public gnt-fsn pool: `gnt-instance add` will print the IPv4 address it picked to stdout. The IPv6 address can be found in `/var/log/ganeti/os/` on the primary node of the instance, see below.
- with the `test-01.torproject.org` hostname
Next steps
To find the root password, ssh host key fingerprints, and the IPv6 address, run this on the node where the instance was created, for example:
egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)
We copy root's authorized keys into the new instance, so you should be able to
log in with your token. You will be required to change the root password immediately.
Pick something nice and document it in `tor-passwords`.
Also set reverse DNS for both IPv4 and IPv6 in Hetzner's Robot (check under Servers -> vSwitch -> IPs) or in our own reverse zone files (if delegated).
Then follow howto/new-machine.
Known issues
- usrmerge: that procedure creates a machine with usrmerge! See bug 34115 before proceeding.
- allocator failures: note that you may need to use the `--node` parameter to pick on which machines you want the machine to end up, otherwise Ganeti will choose for you (and may fail). Use, for example, `--node fsn-node-01:fsn-node-02` to use `node-01` as primary and `node-02` as secondary. The allocator can sometimes fail if it is upset about something in the cluster, for example:

      Can't find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

  This situation is covered by ticket 33785. If this problem occurs, it might be worth rebalancing the cluster.
- ping failure: there is a bug in `ganeti-instance-debootstrap` which misconfigures `ping` (among other things), see bug 31781. It's currently patched in our version of the Debian package, but that patch might disappear if Debian upgrades the package without shipping our patch.
Modifying an instance
CPU, memory changes
It's possible to change the IP, CPU, or memory allocation of an instance using the gnt-instance modify command:
gnt-instance modify -B vcpus=2 test1.torproject.org
gnt-instance modify -B memory=4g test1.torproject.org
gnt-instance reboot test1.torproject.org
IP address change
IP address changes require a full stop of the instance and manual changes to the `/etc/network/interfaces*` files inside it:
gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
gnt-instance stop test1.torproject.org
gnt-instance start test1.torproject.org
gnt-instance console test1.torproject.org
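Inside the instance, the networking configuration then needs to match the new address. A minimal sketch of the relevant `/etc/network/interfaces` stanza, with purely illustrative values (the actual prefix length and gateway depend on the network the instance is attached to):

# illustrative values only: use the address set by gnt-instance modify above,
# and the prefix/gateway of the actual network
iface eth0 inet static
    address 116.202.120.175/27
    gateway 116.202.120.161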
Resizing disks
The gnt-instance grow-disk command can be used to change the size of the underlying device:
gnt-instance grow-disk --absolute test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org
The number `0` in this context indicates the first disk of the instance. The amount specified is the final disk size (because of the `--absolute` flag). In the above example, the final disk size will be 16GB. To add space to the existing disk, remove the `--absolute` flag:
gnt-instance grow-disk test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org
In the above example, 16GB will be ADDED to the disk. Be careful with resizes, because it's not possible to revert such a change: `grow-disk` does not support shrinking disks. The only way to revert the change is by exporting / importing the instance.
Then the filesystem needs to be resized inside the VM:
ssh root@test1.torproject.org
Use `pvs` to display information about the physical volumes:
root@cupani:~# pvs
PV VG Fmt Attr PSize PFree
/dev/sdc vg_test lvm2 a-- <8.00g 1020.00m
Resize the physical volume to take up the new space:
pvresize /dev/sdc
Use `lvs` to display information about logical volumes:
# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
var-opt vg_test-01 -wi-ao---- <10.00g
test-backup vg_test-01_hdd -wi-ao---- <20.00g
Use lvextend to add space to the volume:
lvextend -l '+100%FREE' vg_test-01/var-opt
Finally resize the filesystem:
resize2fs /dev/vg_test-01/var-opt
See also the LVM howto.
Adding disks
A disk can be added to an instance with the `modify` command as well. This, for example, will add a 100GB disk to the `test1` instance on the `vg_ganeti_hdd` volume group, which is "slow" rotating disks:
gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
gnt-instance reboot test1.torproject.org
Adding a network interface on the rfc1918 vlan
We have a VLAN that some VMs without public addresses sit on. Its VLAN ID is 4002 and it's backed by the Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic". Note that traffic on this VLAN travels in the clear between nodes.
To add an instance to this VLAN, give it a second network interface using:
gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org
Destroying an instance
This completely deletes the instance, including all its disks and mirrors; be very careful with it:
gnt-instance remove test01.torproject.org
Getting information
Information about an instance can be found in the output of the rather verbose `gnt-instance info`:
root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
- Instance name: tb-build-02.torproject.org
UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
Serial number: 5
Creation time: 2020-12-15 14:06:41
Modification time: 2020-12-15 14:07:31
State: configured to be up, actual state is up
Nodes:
- primary: fsn-node-03.torproject.org
group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
- secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
Operating system: debootstrap+buster
A quicker command shows just the primary and secondary nodes for a given instance:
gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes
An equivalent command shows the primary and secondary for all instances, along with extra information (like the CPU count, memory and disk usage):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort
It can be useful to run this in a loop to see changes:
watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'
Disk operations (DRBD)
Instances should be set up using the DRBD backend, in which case you should probably take a look at howto/drbd if you have problems with the underlying devices. Ganeti handles most of the DRBD logic itself, so that should generally not be necessary.
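If you do suspect a DRBD-level problem, a quick way to see the state of all DRBD devices on a given node is to inspect /proc/drbd on that node:

# run on the Ganeti node itself; shows the connection and sync state of every DRBD device
cat /proc/drbd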
Evaluating cluster capacity
This will list instances repeatedly, but also show their assigned memory, and compare it with the node's capacity:
gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
echo &&
gnt-node list
The latter does not show disk usage for secondary volume groups (see upstream issue 1379); for a complete picture of disk usage, use:
gnt-node list-storage
The gnt-cluster verify command will also check to see if there's enough space on secondaries to account for the failure of a node. Healthy output looks like this:
root@fsn-node-01:~# gnt-cluster verify
Submitted jobs 48030, 48031
Waiting for job 48030 ...
Fri Jan 17 20:05:42 2020 * Verifying cluster config
Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
Waiting for job 48031 ...
Fri Jan 17 20:05:42 2020 * Verifying group 'default'
Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
Fri Jan 17 20:05:45 2020 * Verifying node status
Fri Jan 17 20:05:45 2020 * Verifying instance status
Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
Fri Jan 17 20:05:45 2020 * Other Notes
Fri Jan 17 20:05:45 2020 * Hooks Results
A sick node would have said something like this instead:
Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail
See the Ganeti manual for a more extensive example.
Also note the `hspace -L` command, which can tell you how many instances can be created in a given cluster. It uses the "standard" instance template defined in the cluster (which we haven't configured yet).
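For example, running this on the master node gives a capacity estimate based on that standard instance template:

hspace -L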
Moving instances and failover
Ganeti is smart about assigning instances to nodes. There's also a command (`hbal`) to automatically rebalance the cluster (see below). If for some reason `hbal` doesn't do what you want or you need to move things around for other reasons, here are a few commands that might be handy.
Make an instance switch to using its secondary:
gnt-instance migrate test1.torproject.org
Make all instances on a node switch to their secondaries:
gnt-node migrate test1.torproject.org
The `migrate` command does a "live" migration, which should avoid any downtime. It might be preferable to actually shut down the machine for some reason (for example if we actually want to reboot because of a security upgrade), or we might not be able to live-migrate because the node is down. In this case, we do a failover:
gnt-instance failover test1.torproject.org
The gnt-node evacuate command can also be used to "empty" a given node altogether, in case of an emergency:
gnt-node evacuate -I . fsn-node-02.torproject.org
Similarly, the gnt-node failover command can be used to hard-recover from a completely crashed node:
gnt-node failover fsn-node-02.torproject.org
Note that you might need the `--ignore-consistency` flag if the node is unresponsive.
Importing external libvirt instances
Assumptions:
- `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`)
- `SPARE_NODE`: a Ganeti node with free space (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated
- `MASTER_NODE`: the master Ganeti node (e.g. `fsn-node-01.torproject.org`)
- `KVM_HOST`: the machine which we migrate the `INSTANCE` from
- the `INSTANCE` has only `root` and `swap` partitions
- the `SPARE_NODE` has space in `/srv/` to host all the virtual machines to import; to check, use:

      fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' | awk '{s+=$1} END {print s}'

  You will very likely need to create a `/srv` big enough for this, for example:

      lvcreate -L 300G vg_ganeti -n srv-tmp &&
      mkfs /dev/vg_ganeti/srv-tmp &&
      mount /dev/vg_ganeti/srv-tmp /srv
Import procedure:
- pick a viable SPARE NODE to import the INSTANCE (see "evaluating cluster capacity" above, when in doubt) and find on which KVM HOST the INSTANCE lives
- copy the disks, without downtime:
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST
- copy the disks again, this time suspending the machine:
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt
- renumber the host:
./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
- test services by changing your `/etc/hosts`, possibly warning service admins:

Subject: $INSTANCE IP address change planned for Ganeti migration
I will soon migrate this virtual machine to the new ganeti cluster. this will involve an IP address change which might affect the service.
Please let me know if there are any problems you can think of. in particular, do let me know if any internal (inside the server) or external (outside the server) services hardcodes the IP address of the virtual machine.
A test instance has been setup. You can test the service by adding the following to your /etc/hosts:
116.202.120.182 $INSTANCE
2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE
- destroy the test instance:
gnt-instance remove $INSTANCE
- lower TTLs to 5 minutes. This procedure varies a lot according to the service, but generally if all DNS entries are `CNAME`s pointing to the main machine domain name, the TTL can be lowered by adding a `dnsTTL` entry in the LDAP entry for this host. For example, this sets the TTL to 5 minutes:

dnsTTL: 300
Then to make the changes immediate, you need the following commands:
ssh root@alberti.torproject.org sudo -u sshdist ud-generate && ssh root@nevii.torproject.org ud-replicate
Warning: if you migrate one of the hosts ud-ldap depends on, this can fail: not only will the TTL not update, but the IP address might also fail to update in the procedure below. See ticket 33766 for details.
- shut down the original instance and redo the migration as in steps 3 and 4:
fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' && ./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt && ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
- final test procedure
TODO: establish host-level test procedure and run it here.
- switch to DRBD, still on the Ganeti MASTER NODE:
gnt-instance stop $INSTANCE && gnt-instance modify -t drbd $INSTANCE && gnt-instance failover -f $INSTANCE && gnt-instance start $INSTANCE
The above can sometimes fail if the allocator is upset about something in the cluster, for example:
Can't find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2
This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). To work around the
allocator, you can specify a secondary node directly:
gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
gnt-instance failover -f $INSTANCE &&
gnt-instance start $INSTANCE
TODO: move into fabric, maybe in a `libvirt-import-live` or
`post-libvirt-import` job that would also do the renumbering below
- change the IP address in the following locations:
  - LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!). Also drop the `dnsTTL` attribute while you're at it.
  - Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
  - DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
  - nagios (don't forget to change the parent)
  - reverse DNS (upstream web UI, e.g. Hetzner Robot)
  - grep for the host's IP address on itself:

        grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
        grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv

  - grep for the host's IP on all hosts:

        cumin-all-puppet
        cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'

  TODO: move those jobs into fabric
- retire the old instance (only a tiny part of howto/retire-a-host):
./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST
- update the Nextcloud spreadsheet to remove the machine from the KVM host
- warn users about the migration, for example:

To: tor-project@lists.torproject.org
Subject: cupani AKA git-rw IP address changed
The main git server, cupani, is the machine you connect to when you push or pull git repositories over ssh to git-rw.torproject.org. That machine has been migrated to the new Ganeti cluster.
This required an IP address change from:
78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
to:
116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2
DNS has been updated and preliminary tests show that everything is mostly working. You will get a warning about the IP address change when connecting over SSH, which will go away after the first connection.
Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.
That is normal. The SSH fingerprints of the host did not change.
Please do report any other anomaly using the normal channels:
https://gitlab.torproject.org/anarcat/wikitest/-/wikis/doc/how-to-get-help/
The service was unavailable for about an hour during the migration.
Importing external libvirt instances, manual
This procedure is now easier to accomplish with the Fabric tools written especially for this purpose. Use the above procedure instead. This is kept for historical reference.
Assumptions:
- `INSTANCE`: name of the instance being migrated, the "old" one being outside the cluster and the "new" one being the one created inside the cluster (e.g. `chiwui.torproject.org`)
- `SPARE_NODE`: a Ganeti node with free space (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be migrated
- `MASTER_NODE`: the master Ganeti node (e.g. `fsn-node-01.torproject.org`)
- `KVM_HOST`: the machine which we migrate the `INSTANCE` from
- the `INSTANCE` has only `root` and `swap` partitions
Import procedure:
- pick a viable SPARE NODE to import the instance (see "evaluating cluster capacity" above, when in doubt), log into the three servers, setting the proper environment everywhere, for example:
MASTER_NODE=fsn-node-01.torproject.org SPARE_NODE=fsn-node-03.torproject.org KVM_HOST=kvm1.torproject.org INSTANCE=test.torproject.org
- establish VM specs, on the KVM HOST:
  - disk space in GiB:

        for disk in /srv/vmstore/$INSTANCE/*; do
            printf "$disk: "
            echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
        done
  - number of CPU cores:

        sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml
  - memory, assuming from KiB to GiB:

        echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l
TODO: make sure the memory line is in KiB and that the number makes sense.
- on the INSTANCE, find the swap device UUID so we can recreate it later:
blkid -t TYPE=swap -s UUID -o value
- set up a copy channel, on the SPARE NODE:
ssh-agent bash
ssh-add /etc/ssh/ssh_host_ed25519_key
cat /etc/ssh/ssh_host_ed25519_key.pub
on the KVM HOST:
echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root
- copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE:

rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true
Note: it's possible there is not enough room in `/srv`: in the base Ganeti installs, everything is in the same root partition (`/`) which will fill up if the instance is (say) over ~30GiB. In that case, create a filesystem in `/srv`:

(mkdir /root/srv && mv /srv/* /root/srv true) || true &&
lvcreate -L 200G vg_ganeti -n srv &&
mkfs /dev/vg_ganeti/srv &&
echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
mount /srv &&
( mv /root/srv/* ; rmdir /root/srv )
This partition can be reclaimed once the VM migrations are completed, as it needlessly takes up space on the node.
- on the SPARE NODE, create and initialize the logical volumes with the predetermined sizes:

lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root
lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
Note how we assume two disks above, but the instance might have a different configuration that would require changing the above. The above, common, configuration is to have an LVM disk separate from the "root" disk, the former being on a HDD, but the HDD is sometimes completely omitted and sizes can differ.
Sometimes it might be worth using pv to get progress on long transfers:
qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k
TODO: ideally, the above procedure (and many steps below as well) would be automatically deduced from the disk listing established in the first step.
- on the MASTER NODE, create the instance, adopting the LVs:

gnt-instance add -t plain \
  -n fsn-node-03 \
  --disk 0:adopt=$INSTANCE-root \
  --disk 1:adopt=$INSTANCE-swap \
  --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
  --backend-parameters memory=2g,vcpus=2 \
  --net 0:ip=pool,network=gnt-fsn \
  --no-name-check \
  --no-ip-check \
  -o debootstrap+default \
  $INSTANCE
- cross your fingers and watch the party:
gnt-instance console $INSTANCE
- IP address change on the new instance: edit `/etc/hosts` and `/etc/network/interfaces` by hand and add the IPv4 and IPv6 addresses. The IPv4 configuration can be found in:

gnt-instance info $INSTANCE
The IPv6 address can be guessed by concatenating the `2a01:4f8:fff0:4f::` prefix with the interface identifier of the IPv6 link-local address (everything after `fe80::`). For example, a link-local address of `fe80::266:37ff:fe65:870f/64` should yield the following configuration:

iface eth0 inet6 static
    accept_ra 0
    address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
    gateway 2a01:4f8:fff0:4f::1
TODO: reuse `gnt-debian-interfaces` from the ganeti puppet module script here?

- functional tests: change your `/etc/hosts` to point to the new server and see if everything still kind of works
- shut down the original instance
- resync and reconvert the image, on the Ganeti MASTER NODE:
gnt-instance stop $INSTANCE
on the Ganeti node:
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root &&
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
- switch to DRBD, still on the Ganeti MASTER NODE:

gnt-instance modify -t drbd $INSTANCE
gnt-instance failover $INSTANCE
gnt-instance startup $INSTANCE
- redo the IP address change in `/etc/network/interfaces` and `/etc/hosts`
- final functional test
- change the IP address in the following locations:
  - nagios (don't forget to change the parent)
  - LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
  - Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
  - DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
  - reverse DNS (upstream web UI, e.g. Hetzner Robot)
- decommission the old instance (howto/retire-a-host)
Troubleshooting
- if boot takes a long time and you see a message like this on the console:
[ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)
... which is generally followed by:
[DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04. [DEPEND] Dependency failed for Swap.
it means the swap device UUID wasn't set up properly and does not match the one provided in `/etc/fstab`. That is probably because you missed the `mkswap -U` step documented above; see the sketch below.
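A minimal sketch of the fix, run on the node holding the disks, reusing the swap UUID recorded earlier (`$SWAP_UUID`) and the volume naming used in this procedure, then rebooting the instance:

# recreate the swap signature with the UUID that the instance's /etc/fstab expects
mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap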
References
- Upstream docs have the canonical incantation:
gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME
- DSA docs also use disk adoption and have a procedure to migrate to DRBD
- Riseup docs suggest creating a VM without installing, shutting down and then syncing
Ganeti supports importing and exporting from the Open Virtualization Format (OVF), but unfortunately it doesn't seem libvirt supports exporting to OVF. There's a virt-convert tool which can import OVF, but not the reverse. The libguestfs library also has a converter but it also doesn't support exporting to OVF or anything Ganeti can load directly.
So people have written their own conversion tools or their own conversion procedure.
Ganeti also supports file-backed instances but "adoption" is specifically designed for logical volumes, so it doesn't work for our use case.
Rebooting
Ganeti nodes need special care, as we can accomplish zero-downtime reboots on those machines. The `reboot` script in `tsa-misc` takes care of the special steps involved (which is basically to empty a node of its instances before rebooting it).
Such a reboot should be run interactively, inside a `tmux` or `screen` session. It currently takes over 15 minutes to complete, depending on the size of the cluster (in terms of core memory usage).
Once the reboot is completed, all instances might end up on a single machine, and the cluster might need to be rebalanced, see below. (Note: the update script should eventually do that, see ticket 33406).
Rebalancing a cluster
After a reboot or a downtime, all instances might end up on the same machine. This is normally handled by the reboot script, but it might be desirable to do this by hand if there was a crash or another special condition.
This can be easily corrected with this command, which will spread instances around the cluster to balance it:
hbal -L -C -v -X
This will automatically move the instances around and rebalance the cluster. Here's an example run on a small cluster:
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
root@fsn-node-01:~# hbal -L -X
Loaded 2 nodes, 5 instances
Group size 2 nodes, 5 instances
Selected node group: default
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 8.45007519
Trying to minimize the CV...
1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 4.98124611 a=f
2. loghost01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 1.78271883 a=f
Cluster score improved from 8.45007519 to 1.78271883
Solution length=2
Got job IDs 16345
Got job IDs 16346
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
In the above example, you should notice that the `web-fsn` instances both ended up on the same node. That's because the balancer did not know that they should be distributed. A special configuration was done, below, to avoid that problem in the future. But as a workaround, instances can also be moved by hand and the cluster re-balanced.
Also notice that `-X` does not show the job output; use `ganeti-watch-jobs` for that, in another terminal. See the job inspection section for more details on that.
Redundant instances distribution
Some instances are redundant across the cluster and should not end up on the same node. A good example is the pair of `web-fsn-01` and `web-fsn-02` instances which, in theory, serve similar traffic. If they end up on the same node, it might flood the network on that machine or at least defeat the purpose of having redundant machines.
The way to ensure they get distributed properly by the balancing algorithm is to "tag" them. For the web nodes, for example, this was performed on the master:
gnt-cluster add-tags htools:iextags:service
gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn
gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn
This tells Ganeti that `service` is an "exclusion tag" prefix, and the optimizer will not try to schedule instances sharing the same `service:*` tag (here `service:web-fsn`) on the same node.
To see which tags are present, use:
# gnt-cluster list-tags
htools:iextags:service
You can also find which objects are assigned a given tag with:
# gnt-cluster search-tags service
/cluster htools:iextags:service
/instances/web-fsn-01.torproject.org service:web-fsn
/instances/web-fsn-02.torproject.org service:web-fsn
IMPORTANT: a previous version of this article mistakenly indicated that a new cluster-level tag had to be created for each service. That method did not work. The hbal manpage explicitly mentions that the cluster-level tag is a prefix that can be used to create multiple such tags. This configuration also happens to be simpler and easier to use...
HDD migration restrictions
Cluster balancing works well as long as nodes are configured consistently. In our case, some nodes have HDDs (Hard Disk Drives, AKA spinning rust) and others do not. Therefore, it's not possible to move an instance from a node with a disk allocated on the HDD to a node that does not have such a disk.
Yet somehow the allocator is not smart enough to tell, and you will get the following error when doing an automatic rebalancing:
one of the migrate failed and stopped the cluster balance: Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd
In this case, it is trying to migrate the `gitlab-02` server from `fsn-node-01` (which has an HDD) to `fsn-node-07` (which hasn't), which naturally fails. This is a known limitation of the Ganeti code. There has been a draft design document for multiple storage unit support since 2015, but it has never been implemented. There have been multiple issues reported upstream on the subject:
- 208: Bad behaviour when multiple volume groups exists on nodes
- 1199: unable to mark storage as unavailable for allocation
- 1240: Disk space check with multiple VGs is broken
- 1379: Support for displaying/handling multiple volume groups
Unfortunately, there are no known workarounds for this, at least none that fix the `hbal` command. It is possible, however, to exclude the faulty migration from the pool of possible moves, for example in the above case:
hbal -L -v --exclude-instances gitlab-02.torproject.org
It's also possible to use the `--no-disk-moves` option to avoid disk move operations altogether.
Both workarounds obviously do not correctly balance the cluster... Note that we have also tried to use `htools:migration` tags to work around that issue, but those do not work for secondary instances. For this we would need to set up node groups instead.
Another option is to specifically look for instances that do not have a HDD and migrate only those. In my situation, `gnt-cluster verify` was complaining that `fsn-node-02` was full, so I looked for all the instances on that node and found the ones which didn't have a HDD:
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
| sort | grep 'fsn-node-02' | awk '{print $3}' | \
while read instance ; do
printf "checking $instance: "
if gnt-instance info $instance | grep -q hdd ; then
echo "HAS HDD"
else
echo "NO HDD"
fi
done
Then you can manually `migrate -f` (to fail over to the secondary) and `replace-disks -n` (to pick another secondary) the instances that can be migrated out of the first four machines (which have HDDs) to the last three (which do not), as sketched below. Look at the memory usage in `gnt-node list` to pick the best node.
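A hypothetical sequence for one such instance (the instance and node names here are illustrative):

# live-migrate the instance onto its current secondary
gnt-instance migrate -f test1.torproject.org
# then pick a new secondary on a node without HDDs
gnt-instance replace-disks -n fsn-node-06.torproject.org test1.torproject.org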
In general, if a given node in the first four is overloaded, a good trick is to look for one that can be failed over, with, for example:
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'
... or, for a particular node (say fsn-node-04):
gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'
The instances listed there would be ones that can be migrated to their secondary to give `fsn-node-04` some breathing room.
Adding and removing addresses on instances
Say you created an instance but forgot to assign an extra IP. You can still do so with:
gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
Job inspection
Sometimes it can be useful to look at the active jobs. It might be, for example, that another user has queued a bunch of jobs in another terminal which you do not have access to, or some automated process did (Nagios, for example, runs `gnt-cluster verify` once in a while). Ganeti has this concept of "jobs" which can provide information about those.
The command `gnt-job list` will show the entire job history, `gnt-job list --running` will show running jobs, and `gnt-job watch` can be used to watch a specific job.
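For example (the job ID here is just an illustration, taken from the verify run above):

gnt-job list
gnt-job list --running
gnt-job watch 48030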
We have a wrapper called `ganeti-watch-jobs` which automatically shows the output of whatever job is currently running and exits when all jobs complete. This is particularly useful while rebalancing the cluster, as `hbal -X` does not show the job output...
Pager playbook
I/O overload
In case of excessive I/O, it might be worth looking into which instance is causing it. The howto/drbd page explains how to map a DRBD device to a VM. You can also find which logical volume is backing an instance (and vice versa) with this command:
lvs -o+tags
This will list all logical volumes and their associated tags. If you already know which logical volume you're looking for, you can address it directly:
root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
LV Tags
originstname+bacula-director-01.torproject.org
Node failures
Ganeti clusters are designed to be self-healing. As long as only one machine disappears, the cluster should be able to recover by failing instances over to other nodes. This is currently done manually, see the migrate section above.
This could eventually be automated if such situations occur more often, by scheduling a harep cron job, which isn't enabled in Debian by default. See also the autorepair section of the admin manual.
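A minimal sketch of what such a cron job could run on the master node, assuming the default Luxi socket (harep analyses the cluster and submits repair jobs for instances tagged for auto-repair):

# Ganeti auto-repair tool, talking to the local master daemon
harep -L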
Bridge configuration failures
If you get the following error while trying to bring up the bridge:
root@chi-node-02:~# ifup br0
add bridge failed: Package not installed
run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
ifup: failed to bring up br0
... it might be that the bridge scripts cannot load the required kernel module, because kernel module loading has been disabled. Reboot with the `/etc/no_modules_disabled` file present:
touch /etc/no_modules_disabled
reboot
It might be that the machine took too long to boot because it's not in mandos and the operator took too long to enter the LUKS passphrase. Re-enable the machine with this command on mandos:
mandos-ctl --enable chi-node-02.torproject
Cleaning up orphan disks
Sometimes `gnt-cluster verify` will give this warning, particularly after a failed rebalance:
* Verifying orphan volumes
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
- WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown
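Assuming you have confirmed (e.g. with `gnt-instance info`) that those volumes are no longer referenced by any instance, a sketch of the cleanup, run on the node named in the warning:

# remove the leftover data and metadata volumes; the names come from the warning above
lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data
lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta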