[Ganeti](http://ganeti.org/) is software designed to facilitate the management of
virtual machines (KVM or Xen). It helps you move virtual machine
instances from one node to another, create an instance with DRBD
replication on another node, live-migrate instances between nodes,
etc.

[[_TOC_]]

# Tutorial

## Listing virtual machines (instances)

This will show the running guests, known as "instances":

    gnt-instance list

## Accessing serial console

Our instances provide a serial console, starting in GRUB.  To access it, run

    gnt-instance console test01.torproject.org

To exit, use `^]` -- that is, Control-<Closing Bracket>.

# How-to

## Glossary

In Ganeti, we use the following terms:

 * **node**: a physical machine
 * **instance**: a virtual machine hosted on a node
 * **master**: the *node* on which we issue Ganeti commands and
   which supervises all the other nodes

Nodes are interconnected through a private network that is used to
communicate commands and synchronise disks (with
[howto/drbd](howto/drbd)). Instances are normally assigned two nodes:
a *primary* and a *secondary*: the *primary* is where the virtual
machine actually runs and the *secondary* acts as a hot failover.

See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html).
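
To see the nodes in a cluster and which one is currently the master
(where most commands in this page must be run), a quick check is:

    gnt-node list
    gnt-cluster getmaster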

## Adding a new instance

This command creates a new guest, or "instance" in Ganeti's
vocabulary, with a 10G root partition, 2G swap, 20G spare on SSD, 800G
on HDD, 8GB of RAM and 2 CPU cores:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=20G \
      --disk 3:size=800G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

### What that does

This configures the following:

 * redundant disks in a DRBD mirror; use `-t plain` instead of `-t drbd` for
   tests, as that avoids syncing the disks and will speed things up considerably
   (even with `--no-wait-for-sync` there are some operations that block on
   synced mirrors).  Only one node should be provided as the argument for
   `--node` then.
 * three partitions: one on the default VG (SSD), one on another (HDD)
   and a swap device on the default VG. If you don't specify a swap device,
   a 512MB swapfile is created in `/swapfile`. TODO: configure disks 2
   and 3 automatically in the installer. (`/var` and `/srv`?)
 * 8GB of RAM with 2 virtual CPUs
 * an IP allocated from the public gnt-fsn pool:
   `gnt-instance add` will print the IPv4 address it picked to stdout.  The
   IPv6 address can be found in `/var/log/ganeti/os/` on the primary node
   of the instance, see below.
 * with the `test-01.torproject.org` hostname

### Next steps

To find the root password, ssh host key fingerprints, and the IPv6
address, run this **on the node where the instance was created**, for
example:

    egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)

We copy root's authorized keys into the new instance, so you should be able to
log in with your token.  You will be required to change the root password immediately.
Pick something nice and document it in `tor-passwords`.

Also set reverse DNS for both IPv4 and IPv6 in [hetzner's robot](https://robot.your-server.de/)
(check under servers -> vSwitch -> IPs) or in our own reverse zone
files (if delegated).

Then follow [howto/new-machine](howto/new-machine).

### Known issues

 * **allocator failures**: Note that you may need to use the `--node`
   parameter to pick on which machines you want the instance to end up,
   otherwise Ganeti will choose for you (and may fail). Use, for
   example, `--node fsn-node-01:fsn-node-02` to use `node-01` as
   primary and `node-02` as secondary. The allocator can sometimes
   fail if it is upset about something in the cluster, for
   example:
    
        Can't find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

   This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). If this problem
   occurs, it might be worth [rebalancing the cluster](#rebalancing-a-cluster).

 * **ping failure**: there is a bug in `ganeti-instance-debootstrap`
   which misconfigures `ping` (among other things), see [bug
   31781](https://bugs.torproject.org/31781). It's currently patched in our version of the Debian
   package, but that patch might disappear if Debian upgrades the
   package without [shipping our patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538). Note that this was fixed
   in Debian bullseye and later.

### Other examples

This is a typical server creation in the `gnt-chi` cluster:

    gnt-instance add \
      -o debootstrap+bullseye \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-chi-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=2G,name=swap \
      --disk 2:size=20G \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

A simple test machine, with only 1G of RAM, 1 CPU, a 10G root disk and
2G of swap, without DRBD, in the FSN cluster:

    gnt-instance add \
          -o debootstrap+bullseye \
          -t plain --no-wait-for-sync \
          --net 0:ip=pool,network=gnt-fsn13-02 \
          --no-ip-check \
          --no-name-check \
          --disk 0:size=10G \
          --disk 1:size=2G,name=swap \
          --backend-parameters memory=1g,vcpus=1 \
          test-01.torproject.org

Do not forget to follow the [next steps](#next-steps), above.

### iSCSI integration

To create a VM with iSCSI backing, a disk must first be created on the
SAN, then adopted in a VM, which needs to be *reinstalled* on top of
that. This is typically how large disks are provisioned in the
`gnt-chi` cluster, in the [Cymru POP](howto/new-machine-cymru).

The following instructions assume you are on a node with an [iSCSI
initiator properly set up](howto/new-machine-cymru#iscsi-initiator-setup), and the [SAN cluster management tools
set up](howto/new-machine-cymru#san-management-tools-setup). They also assume you are familiar with the `SMcli` tool; see
the [storage servers documentation](howto/new-machine-cymru#storage-servers) for an introduction on that.

 1. create a dedicated disk group and virtual disk on the SAN, assign it to the
    host group and propagate the multipath config across the cluster nodes:

        /usr/local/sbin/tpo-create-san-disks --san chi-node-03 --name test-01 --capacity 500

 2. confirm that multipath works; it should look something like this:

        root@chi-node-01:~# multipath -ll
        test-01 (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi
        size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw
        |-+- policy='round-robin 0' prio=6 status=active
        | |- 11:0:0:4 sdi 8:128 active ready running
        | |- 12:0:0:4 sdj 8:144 active ready running
        | `- 9:0:0:4  sdh 8:112 active ready running
        `-+- policy='round-robin 0' prio=1 status=enabled
          |- 10:0:0:4 sdk 8:160 active ghost running
          |- 7:0:0:4  sdl 8:176 active ghost running
          `- 8:0:0:4  sdm 8:192 active ghost running
        root@chi-node-01:~#

 3. adopt the disk in Ganeti:

        gnt-instance add \
              -n chi-node-01.torproject.org \
              -o debootstrap+bullseye \
              -t blockdev --no-wait-for-sync \
              --net 0:ip=pool,network=gnt-chi-01 \
              --no-ip-check \
              --no-name-check \
              --disk 0:adopt=/dev/disk/by-id/dm-name-test-01 \
              --backend-parameters memory=8g,vcpus=2 \
              test-01.torproject.org

    NOTE: the actual node must be manually picked because the `hail`
    allocator doesn't seem to know about block devices.

    NOTE: mixing DRBD and iSCSI volumes on a single instance is not supported.

 4. at this point, the VM probably doesn't boot, because for some
    reason `gnt-instance-debootstrap` doesn't fire when disks are
    adopted, so you need to reinstall the machine, which involves
    stopping it first:

        gnt-instance shutdown --timeout=0 test-01
        gnt-instance reinstall test-01

    HACK: the current installer fails on weird partitioning errors, see
    [upstream bug 13](https://github.com/ganeti/instance-debootstrap/issues/13).
    We applied [this patch](https://github.com/ganeti/instance-debootstrap/commit/e0df6b1fd25dc3e111851ae42872df0a757ac4a9)
    as a workaround to avoid failures when the installer attempts to partition
    the virtual disk.

From here on, follow the [next steps](#next-steps) above.

TODO: This would ideally be automated by an external storage provider,
see the [storage reference for more information](#storage).

### Troubleshooting

If a Ganeti instance install fails, it will show the end of the
install log, for example:

```
Thu Aug 26 14:11:09 2021  - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org
Thu Aug 26 14:11:09 2021  - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Thu Aug 26 14:11:09 2021  - INFO: Chose IP 38.229.82.29 from network gnt-chi-01
Thu Aug 26 14:11:10 2021 * creating instance disks...
Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config
Thu Aug 26 14:12:58 2021 adding disks to cluster config
Thu Aug 26 14:13:00 2021 * checking mirrors status
Thu Aug 26 14:13:01 2021  - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated)
Thu Aug 26 14:13:01 2021  - INFO: - device disk/2:  0.60% done, 55m 26s remaining (estimated)
Thu Aug 26 14:13:01 2021 * checking mirrors status
Thu Aug 26 14:13:02 2021  - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated)
Thu Aug 26 14:13:02 2021  - INFO: - device disk/2:  0.60% done, 52m 13s remaining (estimated)
Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS
Thu Aug 26 14:13:03 2021 * running the instance OS create scripts...
Thu Aug 26 14:16:31 2021 * resuming disk sync
Failure: command execution error:
Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file:
Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ...
Setting up openssh-server (1:7.9p1-10+deb10u2) ...
Creating SSH2 RSA key; this may take some time ...
2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519)
Created symlink /etc/systemd/system/sshd.service → /lib/systemd/system/ssh.service.
Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service → /lib/systemd/system/ssh.service.
invoke-rc.d: could not determine current runlevel
Setting up ssh (1:7.9p1-10+deb10u2) ...
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for libc-bin (2.28-10) ...
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100
Using disk /dev/drbd4 as swap...
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4
Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: )
root@chi-node-01:~#
```

Here, the failure that tripped the install is:

```
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

But the actual error is higher up, and we need to go look at the logs
on the server for this. In this case, in
`chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log`,
we can find the real problem:

```
Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: Couldn't identify type of root file system for fsck hook
/etc/kernel/postinst.d/zz-update-grub:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?).
run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1
dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure):
 installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1
```

In this case, oddly enough, even though Ganeti thought the install had
failed, the machine can actually start:

```
gnt-instance start tb-pkgstage-01.torproject.org
```

... and after a while, we can even get a console:

```
gnt-instance console tb-pkgstage-01.torproject.org
```

And in *that* case, the procedure can just continue from here on:
reset the root password, and just make sure you finish the install:

```
apt install linux-image-amd64
```

In the above case, the `sources-list` post-install hook was buggy: it
wasn't mounting `/dev` and friends before launching the upgrades,
which was causing issues when a kernel upgrade was queued.

And *if* you are debugging an installer and by mistake end up with
half-open filesystems and stray DRBD devices, do take a look at the
[LVM](howto/lvm) and [DRBD documentation](howto/drbd).

## Modifying an instance

### CPU, memory changes

It's possible to change the IP, CPU, or memory allocation of an instance
using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command:

    gnt-instance modify -B vcpus=4 test1.torproject.org
    gnt-instance modify -B memory=8g test1.torproject.org
    gnt-instance reboot test1.torproject.org
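
To confirm the new values took effect, the backend parameters can be
listed with the same output fields used elsewhere in this page (a
quick check, nothing more):

    gnt-instance list -o name,be/vcpus,be/memory test1.torproject.org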

### IP address change

IP address changes require a full stop and will require manual changes
to the `/etc/network/interfaces*` files:

    gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
    gnt-instance stop test1.torproject.org
    gnt-instance start test1.torproject.org
    gnt-instance console test1.torproject.org
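
Inside the instance, the `/etc/network/interfaces` stanza then needs
to be updated to match the new address. A rough sketch, with
placeholder values to adapt (an IPv6 stanza, if any, follows the same
pattern):

    iface eth0 inet static
        address <new-IPv4-address>/<prefix-length>
        gateway <gateway>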

### Resizing disks

The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size
of the underlying device:

    gnt-instance grow-disk --absolute test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

The number `0` in this context indicates the first disk of the
instance.  The amount specified is the final disk size (because of the
`--absolute` flag). In the above example, the final disk size will be
16GB. To *add* space to the existing disk, remove the `--absolute`
flag:

    gnt-instance grow-disk test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

In the above example, 16GB will be **ADDED** to the disk. Be careful
with resizes, because it's not possible to revert such a change:
`grow-disk` does *not* support shrinking disks. The only way to revert
the change is by exporting / importing the instance.

Note the reboot, above, will impose a downtime. See [upstream bug
28](https://github.com/ganeti/ganeti/issues/28) about improving that.

Then the filesystem needs to be resized inside the VM:

    ssh root@test1.torproject.org 

#### Resizing under LVM

Use `pvs` to display information about the physical volumes:

    root@cupani:~# pvs
    PV         VG        Fmt  Attr PSize   PFree   
    /dev/sdc   vg_test   lvm2 a--  <8.00g  1020.00m

Resize the physical volume to take up the new space:

    pvresize /dev/sdc

Use `lvs` to display information about logical volumes:

    # lvs
    LV            VG               Attr       LSize    Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
    var-opt       vg_test-01     -wi-ao---- <10.00g                                                    
    test-backup vg_test-01_hdd   -wi-ao---- <20.00g            

Use `lvextend` to add space to the volume:

    lvextend -l '+100%FREE' vg_test-01/var-opt

Finally resize the filesystem:

    resize2fs /dev/vg_test-01/var-opt

See also the [LVM howto](howto/lvm).

#### Resizing without LVM, no partitions

If there's no LVM inside the VM (a more common configuration
nowadays), the above procedure will obviously not work. If this is a
secondary disk (e.g. `/dev/sdc`), there is a good chance the
filesystem was created directly on the device, without a partition
table, so you do not need to repartition the drive. This is an example
of a good configuration if we want to resize `sdc`:

```
root@bacula-director-01:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0      2:0    1    4K  0 disk 
sda      8:0    0   10G  0 disk 
└─sda1   8:1    0   10G  0 part /
sdb      8:16   0    2G  0 disk [SWAP]
sdc      8:32   0  250G  0 disk /srv
```

Note that if we needed to resize `sda`, we would have to follow the
other procedure, described in the next section.

If we check the free disk space on the device we will notice it has
not changed yet:

```
# df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        196G  160G   27G  86% /srv
```

The resize is then simply:

```
# resize2fs /dev/sdc
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sdc is mounted on /srv; on-line resizing required
old_desc_blocks = 25, new_desc_blocks = 32
The filesystem on /dev/sdc is now 65536000 (4k) blocks long.
```

Read on for the most complicated scenario.

#### Resizing without LVM, with partitions

If the filesystem to resize is not *directly* on the device, you will
need to resize the partition manually, which can be done using
`sfdisk`. In the following example we have a `sda1` partition that we
want to extend from 20G to 40G to fill up the free space on
`/dev/sda`. Here is what the partition layout looks like before the
resize:

```
# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]
```

We use `sfdisk` to resize the partition to take up all available
space, in this case, with the magic:

    echo ", +" | sfdisk -N 1 --no-act /dev/sda

Note the `--no-act` flag here, which you'll need to remove to actually
make the change; the above is just a preview to make sure you will do
the right thing. Here's a working example:

```
# echo ", +" | sfdisk -N 1 --no-reread /dev/sda
Disk /dev/sda: 40 GiB, 42949672960 bytes, 83886080 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Old situation:

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 41943039 41940992  20G 83 Linux

/dev/sda1: 
New situation:
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 83886079 83884032  40G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.
```

Note that the kernel still uses the old partition table:

```
# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]
```

So we need to reboot:

```
reboot
```

Note: a previous version of this guide was using `fdisk` instead, but
that guide was destroying and recreating the partition, which seemed
too error-prone. The above procedure is more annoying (because of the
reboot) but should be less dangerous.

TODO: next time, test with `--force` instead of `--no-reread` to see
if we still need a reboot.

Now we check the partitions again:

```
# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0      2:0    1   4K  0 disk 
sda      8:0    0  40G  0 disk 
└─sda1   8:1    0  40G  0 part /
sdb      8:16   0   4G  0 disk [SWAP]
```

If we check the free space on the device, we will notice it has not
changed yet:

```
# df -h  /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   16G  2.8G  86% /
```

We need to resize it:

```
# resize2fs /dev/sda1
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sda1 is mounted on /; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 3
The filesystem on /dev/sda1 is now 10485504 (4k) blocks long.
```

The resize is now complete.

#### Resizing an iSCSI LUN

All the above procedures detail the normal use case where disks are
hosted with the "plain" (LVM) or DRBD backends. However, some
instances (most notably in the gnt-chi cluster) have their storage
backed by an iSCSI SAN.

Growing a disk hosted on a SAN like the Dell PowerVault MD3200i
involves several steps beginning with resizing the LUN itself. In the
example below, we're going to grow the disk associated with the
`tb-build-03` instance. 

> It should be noted that the instance was set up in a peculiar way: it
> has one LUN per partition, instead of one big LUN partitioned
> correctly. The instructions below therefore mention a LUN named
> `tb-build-03-srv`, but normally there should be a single LUN named
> after the hostname of the machine, in this case it should have been
> named simply `tb-build-03`.

First, we identify how much space is available in the disk group hosting the virtual disk:

    # SMcli -n chi-san-01 -c "show allVirtualDisks summary;"

	STANDARD VIRTUAL DISKS SUMMARY
	Number of standard virtual disks: 5

	Name                Thin Provisioned     Status     Capacity     Accessible by       Source
	tb-build-03-srv     No                   Optimal    700.000 GB   Host Group gnt-chi  Disk Group 5

This shows that `tb-build-03-srv` is hosted on Disk Group "5":

    # SMcli -n chi-san-01 -c "show diskGroup [5];"

    DETAILS

       Name:              5

          Status:         Optimal
          Capacity:       1,852.026 GB
          Current owner:  RAID Controller Module in slot 1

          Data Service (DS) Attributes

             RAID level:                    5
             Physical Disk media type:      Physical Disk
             Physical Disk interface type:  Serial Attached SCSI (SAS)
             Enclosure loss protection:     No
             Secure Capable:                No
             Secure:                        No


          Total Virtual Disks:          1
             Standard virtual disks:    1
             Repository virtual disks:  0
             Free Capacity:             1,152.026 GB

          Associated physical disks - present (in piece order)
          Total physical disks present: 3

             Enclosure     Slot
             0             6
             1             11
             0             7

`Free Capacity` indicates about 1.1 TB of free space available, so we can go
ahead with the actual resize:

    # SMcli -n chi-san-01 -p $PASSWORD -c "set virtualdisk [\"tb-build-03-srv\"] addCapacity=100GB;"

Next, we need to make all nodes in the cluster rescan the iSCSI LUNs and have
`multipathd` resize the device node. This is accomplished by running this command
on the primary node (e.g. `chi-node-01`):

    # gnt-cluster command "iscsiadm -m node --rescan; multipathd -v3 -k\"resize map tb-build-srv\""

The success of this step can be validated by looking at the output of `lsblk`:
the device nodes associated with the LUN should now display the new size. The
output should be identical across the cluster nodes.

In order for ganeti/qemu to make this extra space available to the instance, a
reboot must be performed from outside the instance.
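
In practice that means rebooting through Ganeti rather than from
inside the guest, for example (instance name taken from the example
above):

    gnt-instance reboot tb-build-03.torproject.org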

Then the normal resize procedure can happen inside the virtual
machine, see [resizing under LVM](#resizing-under-lvm), [resizing without LVM, no
partitions](#resizing-without-lvm-no-partitions), or [Resizing without LVM, with partitions](#resizing-without-lvm-with-partitions),
depending on the situation.

### Removing an iSCSI LUN

Use this procedure to remove a virtual disk from one of the iSCSI SANs.

First, we'll need to gather some information about the disk to remove:

 * which SAN is hosting the disk
 * which LUN is assigned to the disk
 * the WWID of both the SAN and the virtual disk

These can be found with:

    /usr/local/sbin/tpo-show-san-disks
    SMcli -n chi-san-03 -S -quick -c "show storageArray summary;" | grep "Storage array world-wide identifier"
    cat /etc/multipath/conf.d/test-01.conf

Second, remove the multipath config and reload:

    gnt-cluster command rm /etc/multipath/conf.d/test-01.conf
    gnt-cluster command "multipath -r ; multipath -w {disk-wwid} ; multipath -r"

Then, remove the iSCSI device nodes. Running `iscsiadm --rescan` does not remove
LUNs which have been deleted from the SAN.

Be very careful with this command: it will delete device nodes without prejudice
and cause data corruption if they are still in use!

    gnt-cluster command "find /dev/disk/by-path/ -name \*{san-wwid}-lun-{lun} -exec readlink {} \; | cut -d/ -f3 | while read -d $'\n' n; do echo 1 > /sys/block/\$n/device/delete; done"

Finally, the disk group can be deleted from the SAN (all the virtual disks it
contains will be deleted):

    SMcli -n chi-san-03 -p $SAN_PASSWORD -S -quick -c "delete diskGroup [<disk-group-number>];"

### Adding disks

A disk can be added to an instance with the `modify` command as
well. This, for example, will add a 100GB disk to the `test1` instance
on the `vg_ganeti_hdd` volume group, which is made of "slow" rotating disks:

    gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
    gnt-instance reboot test1.torproject.org
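
The new volume then needs a filesystem and a mount point inside the
instance. A rough sketch, assuming the new disk shows up as `/dev/sdd`
(check with `lsblk` first) and should be mounted on `/srv/example`
(both names are placeholders):

    mkfs.ext4 /dev/sdd
    mkdir -p /srv/example
    echo "/dev/sdd /srv/example ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab
    mount /srv/example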

### Changing disk type

Say you have a test instance that was created with a `plain` disk
template, but you actually want it in production, with a `drbd` disk
template. Switching to `drbd` is easy:

    gnt-instance shutdown test-01
    gnt-instance modify -t drbd test-01
    gnt-instance start test-01

The second command will use the allocator to find a secondary node. If
that fails, you can assign a node manually with `-n`.
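
For example, to force `fsn-node-04` as the secondary (the same form is
used in the import procedure later in this page):

    gnt-instance modify -t drbd -n fsn-node-04.torproject.org test-01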

You can also switch back to `plain`, although you should generally
never do that.

See also the [upstream procedure](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#conversion-of-an-instance-s-disk-type) and [design document](https://docs.ganeti.org/docs/ganeti/3.0/html/design-disk-conversion.html).

### Detaching a disk

If you need to remove a volume from an instance without destroying data, it's
possible to detach it. First, you must identify the disk's uuid using
`gnt-instance info`, then:

    gnt-instance modify --disk <uuid>:detach test-01
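
A rough way to spot the disk UUIDs in that output (assuming the `info`
output lists a `uuid` field per disk, as recent Ganeti versions do):

    gnt-instance info test-01 | grep -i uuid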

### Adding a network interface on the rfc1918 vlan

We have a vlan on which some VMs without public addresses sit.
Its vlan id is 4002 and it's backed by Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic".
Note that traffic on this vlan travels in the clear between nodes.

To add an instance to this vlan, give it a second network interface using:

    gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org
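
The new interface then needs to be configured inside the instance,
typically by hand in `/etc/network/interfaces`. A rough sketch, with a
placeholder RFC1918 address to adapt:

    auto eth1
    iface eth1 inet static
        address <rfc1918-address>/24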

## Destroying an instance

This totally deletes the instance, including all mirrors and
everything; be very careful with it:

    gnt-instance remove test01.torproject.org

## Getting information

Information about an instance can be found in the rather verbose
`gnt-instance info` output:

    root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
    - Instance name: tb-build-02.torproject.org
      UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
      Serial number: 5
      Creation time: 2020-12-15 14:06:41
      Modification time: 2020-12-15 14:07:31
      State: configured to be up, actual state is up
      Nodes: 
        - primary: fsn-node-03.torproject.org
          group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
        - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
      Operating system: debootstrap+buster

A quick way to show just the primary and secondary nodes for a given
instance:

    gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes

An equivalent command will show the primary and secondary for *all*
instances, on top of extra information (like the CPU count, memory and
disk usage):

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort

It can be useful to run this in a loop to see changes:

    watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'

## Disk operations (DRBD)

Instances should be set up using the DRBD backend, in which case you
should probably take a look at [howto/drbd](howto/drbd) if you have problems with
that. Ganeti handles most of the logic there, so that should generally
not be necessary.
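
A quick sanity check of the cluster's disks can be run from the
master; it reports (and tries to re-activate) degraded DRBD mirrors:

    gnt-cluster verify-disks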

## Evaluating cluster capacity

This will list instances once again, but also show their assigned
memory, and compare it with the nodes' capacity:

    gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
    echo &&
    gnt-node list

The latter does not show disk usage for secondary volume groups (see
[upstream issue 1379](https://github.com/ganeti/ganeti/issues/1379)); for a complete picture of disk usage, use:

    gnt-node list-storage

The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's
enough space on secondaries to account for the failure of a
node. Healthy output looks like this:

    root@fsn-node-01:~# gnt-cluster verify
    Submitted jobs 48030, 48031
    Waiting for job 48030 ...
    Fri Jan 17 20:05:42 2020 * Verifying cluster config
    Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
    Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
    Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
    Waiting for job 48031 ...
    Fri Jan 17 20:05:42 2020 * Verifying group 'default'
    Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
    Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
    Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
    Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
    Fri Jan 17 20:05:45 2020 * Verifying node status
    Fri Jan 17 20:05:45 2020 * Verifying instance status
    Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
    Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
    Fri Jan 17 20:05:45 2020 * Other Notes
    Fri Jan 17 20:05:45 2020 * Hooks Results

A sick node would have said something like this instead:

    Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail

See the [ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example.

Also note the `hspace -L` command, which can tell you how many
instances can be created in a given cluster. It uses the "standard"
instance template defined in the cluster (which we haven't configured
yet).

## Moving instances and failover

Ganeti is smart about assigning instances to nodes. There's also a
command (`hbal`) to automatically rebalance the cluster (see
below). If for some reason `hbal` doesn’t do what you want or you need
to move things around for other reasons, here are a few commands that
might be handy.

Make an instance switch to using its secondary:

    gnt-instance migrate test1.torproject.org

Make all instances on a node switch to their secondaries:

    gnt-node migrate fsn-node-02.torproject.org

The `migrate` commands do a "live" migration, which should avoid any
downtime. It might be preferable to actually shut down the machine
for some reason (for example if we actually want to reboot because of
a security upgrade). Or we might not be able to live-migrate because
the node is down. In those cases, we do a
[failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance):

    gnt-instance failover test1.torproject.org

The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given
node altogether, in case of an emergency:

    gnt-node evacuate -I . fsn-node-02.torproject.org

Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to
hard-recover from a completely crashed node:

    gnt-node failover fsn-node-02.torproject.org

Note that you might need the `--ignore-consistency` flag if the
node is unresponsive.
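
For example:

    gnt-node failover --ignore-consistency fsn-node-02.torproject.org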

## Importing external libvirt instances

Assumptions:

 * `INSTANCE`: name of the instance being migrated, the "old" one
   being outside the cluster and the "new" one being the one created
   inside the cluster (e.g. `chiwui.torproject.org`)
 * `SPARE_NODE`: a ganeti node with free space
   (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
   migrated
 * `MASTER_NODE`: the master ganeti node
   (e.g. `fsn-node-01.torproject.org`)
 * `KVM_HOST`: the machine which we migrate the `INSTANCE` from
 * the `INSTANCE` has only `root` and `swap` partitions
 * the `SPARE_NODE` has space in `/srv/` to host all the virtual
   machines to import, to check, use:

        fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' | awk '{s+=$1} END {print s}'

   You will very likely need to create a `/srv` big enough for this,
   for example:

        lvcreate -L 300G vg_ganeti -n srv-tmp &&
        mkfs /dev/vg_ganeti/srv-tmp &&
        mount /dev/vg_ganeti/srv-tmp /srv

Import procedure:

 1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating
    cluster capacity" above, when in doubt) and find on which KVM HOST
    the INSTANCE lives

 2. copy the disks, without downtime:
 
        ./ganeti -v -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST

 3. copy the disks again, this time suspending the machine:

        ./ganeti -v -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt

 4. renumber the host:
    
        ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE

 5. test services by changing your `/etc/hosts`, possibly warning
    service admins:

    > Subject: $INSTANCE IP address change planned for Ganeti migration
    >
    > I will soon migrate this virtual machine to the new ganeti cluster. this
    > will involve an IP address change which might affect the service.
    >
    > Please let me know if there are any problems you can think of. in
    > particular, do let me know if any internal (inside the server) or external
    > (outside the server) services hardcodes the IP address of the virtual
    > machine.
    >
    > A test instance has been setup. You can test the service by
    > adding the following to your /etc/hosts:
    >
    >     116.202.120.182 $INSTANCE
    >     2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE

 6. destroy test instance:
 
        gnt-instance remove $INSTANCE
 
 7. lower TTLs to 5 minutes. this procedure varies a lot according to
    the service, but generally if all DNS entries are `CNAME`s
    pointing to the main machine domain name, the TTL can be lowered
    by adding a `dnsTTL` entry in the LDAP entry for this host. For
    example, this sets the TTL to 5 minutes:
    
        dnsTTL: 300

    Then to make the changes immediate, you need the following
    commands:
    
        ssh root@alberti.torproject.org sudo -u sshdist ud-generate &&
        ssh root@nevii.torproject.org ud-replicate
 
    Warning: if you migrate one of the hosts ud-ldap depends on, this
    can fail and not only the TTL will not update, but it might also
    fail to update the IP address in the below procedure. See [ticket
    33766](https://bugs.torproject.org/33766) for
    details.
 
 8. shutdown original instance and redo migration as in step 3 and 4:
 
        fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' &&
        ./ganeti -v -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt &&
        ./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE

 9. final test procedure

    TODO: establish host-level test procedure and run it here.

 10. switch to DRBD, still on the Ganeti MASTER NODE:

         gnt-instance stop $INSTANCE &&
         gnt-instance modify -t drbd $INSTANCE &&
         gnt-instance failover -f $INSTANCE &&
         gnt-instance start $INSTANCE

     The above can sometimes fail if the allocator is upset about
     something in the cluster, for example:
     
         Can't find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

     This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). To work around the
     allocator, you can specify a secondary node directly:
     
         gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
         gnt-instance failover -f $INSTANCE &&
         gnt-instance start $INSTANCE

     TODO: move into fabric, maybe in a `libvirt-import-live` or
     `post-libvirt-import` job that would also do the renumbering below

 11. change IP address in the following locations:

     * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!).  Also drop the dnsTTL attribute while you're at it.
     * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
     * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
     * nagios (don't forget to change the parent)
     * reverse DNS (upstream web UI, e.g. Hetzner Robot)
     * grep for the host's IP address on itself:

            grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
            grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv

     * grep for the host's IP on *all* hosts:

            cumin-all-puppet
            cumin-all 'grep -r -e 78.47.38.227  -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'

     TODO: move those jobs into fabric

 12. retire old instance (only a tiny part of [howto/retire-a-host](howto/retire-a-host)):
 
         ./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST

 13. update the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) to remove the machine from
     the KVM host

 14. warn users about the migration, for example:
 
> To: tor-project@lists.torproject.org
> Subject: cupani AKA git-rw IP address changed
> 
> The main git server, cupani, is the machine you connect to when you push
> or pull git repositories over ssh to git-rw.torproject.org. That
> machine has been migrated to the new Ganeti cluster.
> 
> This required an IP address change from:
> 
>     78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
> 
> to:
> 
>     116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2
> 
> DNS has been updated and preliminary tests show that everything is
> mostly working. You *will* get a warning about the IP address change
> when connecting over SSH, which will go away after the first
> connection. 
>
>     Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.
>
> That is normal. The SSH fingerprints of the host did *not* change.
> 
> Please do report any other anomaly using the normal channels:
> 
> https://gitlab.torproject.org/tpo/tpa/team/-/wikis/support
> 
> The service was unavailable for about an hour during the migration.

## Importing external libvirt instances, manual

This procedure is now easier to accomplish with the Fabric tools
written especially for this purpose. Use the above procedure
instead. This is kept for historical reference.

Assumptions:

 * `INSTANCE`: name of the instance being migrated, the "old" one
   being outside the cluster and the "new" one being the one created
   inside the cluster (e.g. `chiwui.torproject.org`)
 * `SPARE_NODE`: a ganeti node with free space
   (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
   migrated
 * `MASTER_NODE`: the master ganeti node
   (e.g. `fsn-node-01.torproject.org`)
 * `KVM_HOST`: the machine which we migrate the `INSTANCE` from
 * the `INSTANCE` has only `root` and `swap` partitions

Import procedure:

 1. pick a viable SPARE NODE to import the instance (see "evaluating
    cluster capacity" above, when in doubt), login to the three
    servers, setting the proper environment everywhere, for example:
    
        MASTER_NODE=fsn-node-01.torproject.org
        SPARE_NODE=fsn-node-03.torproject.org
        KVM_HOST=kvm1.torproject.org
        INSTANCE=test.torproject.org

 2. establish VM specs, on the KVM HOST:
 
    * disk space in GiB:
    
          for disk in /srv/vmstore/$INSTANCE/*; do
              printf "$disk: "
              echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
          done

    * number of CPU cores:

          sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml

    * memory, assuming the value is in KiB, converted to GiB:

          echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l

      TODO: make sure the memory line is in KiB and that the number
      makes sense.

    * on the INSTANCE, find the swap device UUID so we can recreate it later:

          blkid -t TYPE=swap -s UUID -o value

 3. setup a copy channel, on the SPARE NODE:
 
        ssh-agent bash
        ssh-add /etc/ssh/ssh_host_ed25519_key
        cat /etc/ssh/ssh_host_ed25519_key.pub

    on the KVM HOST:
    
        echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root

 4. copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE:
 
        rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
        rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true

    Note: it's possible there is not enough room in `/srv`: in the
    base Ganeti installs, everything is in the same root partition
    (`/`) which will fill up if the instance is (say) over ~30GiB. In
    that case, create a filesystem in `/srv`:

        (mkdir /root/srv && mv /srv/* /root/srv) || true &&
        lvcreate -L 200G vg_ganeti -n srv &&
        mkfs /dev/vg_ganeti/srv &&
        echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
        mount /srv &&
        ( mv /root/srv/* ; rmdir /root/srv )

    This partition can be reclaimed once the VM migrations are
    completed, as it needlessly takes up space on the node.

 5. on the SPARE NODE, create and initialize a logical volume with the predetermined size:
 
        lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
        mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
        lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
        qemu-img convert /srv/$INSTANCE-root  -O raw /dev/vg_ganeti/$INSTANCE-root
        lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd
        qemu-img convert /srv/$INSTANCE-lvm  -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm

    Note how we assume two disks above, but the instance might have a
    different configuration that would require changing the above. The
    common configuration above has an "LVM" disk separate from the
    "root" disk, the former being on a HDD, but the HDD is sometimes
    completely omitted and sizes can differ.
    
    Sometimes it might be worth using `pv` to get progress on long
    transfers:
    
        qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
        pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k

    TODO: ideally, the above procedure (and many steps below as well)
    would be automatically deduced from the disk listing established
    in the first step.

 6. on the MASTER NODE, create the instance, adopting the LV:
 
        gnt-instance add -t plain \
            -n fsn-node-03 \
            --disk 0:adopt=$INSTANCE-root \
            --disk 1:adopt=$INSTANCE-swap \
            --disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
            --backend-parameters memory=2g,vcpus=2 \
            --net 0:ip=pool,network=gnt-fsn \
            --no-name-check \
            --no-ip-check \
            -o debootstrap+default \
            $INSTANCE

 7. cross your fingers and watch the party:
 
        gnt-instance console $INSTANCE

 9. IP address change on new instance:

      edit `/etc/hosts` and `/etc/network/interfaces` by hand and add
      the IPv4 and IPv6 addresses. The IPv4 configuration can be found
      with:

          gnt-instance info $INSTANCE

      The IPv6 address can be guessed by concatenating
      `2a01:4f8:fff0:4f::` and the IPv6 link-local address without the
      `fe80::` prefix. For example: a link-local address of
      `fe80::266:37ff:fe65:870f/64` should yield
      the following configuration:
      
          iface eth0 inet6 static
              accept_ra 0
              address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
              gateway 2a01:4f8:fff0:4f::1

      TODO: reuse `gnt-debian-interfaces` from the ganeti puppet
      module script here?

 10. functional tests: change your `/etc/hosts` to point to the new
     server and see if everything still kind of works

 11. shutdown original instance

 12. resync and reconvert image, on the Ganeti MASTER NODE:
 
         gnt-instance stop $INSTANCE

     on the Ganeti node:

         rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
         qemu-img convert /srv/$INSTANCE-root  -O raw /dev/vg_ganeti/$INSTANCE-root &&
         rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
         qemu-img convert /srv/$INSTANCE-lvm  -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm

 13. switch to DRBD, still on the Ganeti MASTER NODE:

         gnt-instance modify -t drbd $INSTANCE
         gnt-instance failover $INSTANCE
         gnt-instance startup $INSTANCE

 14. redo IP address change in `/etc/network/interfaces` and `/etc/hosts`

 15. final functional test

 16. change IP address in the following locations:

     * nagios (don't forget to change the parent)
     * LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
     * Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
     * DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
     * reverse DNS (upstream web UI, e.g. Hetzner Robot)

 17. decommission old instance ([howto/retire-a-host](howto/retire-a-host))

### Troubleshooting

 * if boot takes a long time and you see a message like this on the console:
 
        [  *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)

   ... which is generally followed by:
   
        [DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
        [DEPEND] Dependency failed for Swap.

   it means the swap device UUID wasn't set up properly, and does not
   match the one provided in `/etc/fstab`. That is probably because
   you missed the `mkswap -U` step documented above.
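
   If that happened, it can be repaired after the fact by re-running
   `mkswap` with the UUID that `/etc/fstab` expects (a rough sketch;
   the swap device name is a placeholder, check it with `lsblk`
   first):

        grep swap /etc/fstab
        mkswap --uuid <uuid-from-fstab> /dev/sdb
        swapon -a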

### References

 * [Upstream docs](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances) have the canonical incantation:

        gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME

 * [DSA docs](https://dsa.debian.org/howto/install-ganeti/) also use disk adoption and have a procedure to
   migrate to DRBD

 * [Riseup docs](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) suggest creating a VM without installing, shutting
   down and then syncing

Ganeti [supports importing and exporting](http://docs.ganeti.org/ganeti/2.15/html/design-ovf-support.html?highlight=qcow) from the [Open
Virtualization Format](https://en.wikipedia.org/wiki/Open_Virtualization_Format) (OVF), but unfortunately [libvirt doesn't seem
to support *exporting* to OVF](https://forums.centos.org/viewtopic.php?t=49231). There's a [virt-convert](http://manpages.debian.org/virt-convert)
tool which can *import* OVF, but not the reverse. The [libguestfs](http://www.libguestfs.org/)
library also has a [converter](http://www.libguestfs.org/virt-v2v.1.html) but it also doesn't support
exporting to OVF or anything Ganeti can load directly.

So people have written [their own conversion tools](https://virtuallyhyper.com/2013/06/migrate-from-libvirt-kvm-to-virtualbox/) or [their own
conversion procedure](https://scienceofficersblog.blogspot.com/2014/04/using-cloud-images-with-ganeti.html).

Ganeti also supports [file-backed instances](http://docs.ganeti.org/ganeti/2.15/html/design-file-based-storage.html) but "adoption" is
specifically designed for logical volumes, so it doesn't work for our
use case.

## Rebooting

Those hosts need special care, as we can accomplish zero-downtime
reboots on those machines. The `reboot` script in `tsa-misc` takes
care of the special steps involved (which is basically to empty a
node before rebooting it).

Such a reboot should be run interactively.

### Full fleet reboot

This command will reboot the entire Ganeti fleet, including the
hosted VMs; use this when (for example) you have kernel upgrades to
deploy everywhere:

    ./reboot --skip-ganeti-empty -v --reason 'qemu flagged in needrestart' \
        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
           chi-node-1{0,1}.torproject.org \
           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org

This is long and rather disruptive. Notifications should be posted on
IRC, in `#tor-project`, as instances are rebooted.

It can take about a day to complete a full fleet-wide reboot.

### Node-only reboot

In certain cases (Open vSwitch restarts, for example), only the nodes
need a reboot, and not the instances. In that case, you want to reboot
each node but, before that, migrate the instances off the node and then
migrate them back when done. This incantation should do so:

    ./reboot --ganeti-migrate-back -v --reason 'Open vSwitch upgrade' \
        -H fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org

This should cause no user-visible disruption.

### Instance-only restarts

An alternative procedure should be used if only the `ganeti.service`
requires a restart. This happens when a Qemu dependency has been
upgraded, for example `libxml` or OpenSSL.

This will only migrate the VMs without rebooting the hosts:

    ./reboot --ganeti-migrate-back --kind=cancel -v --reason 'qemu flagged in needrestart' \
        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
           chi-node-1{0,1}.torproject.org \
           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org

This should cause no user-visible disruption.

## Rebalancing a cluster

After a reboot or a downtime, all instances might end up on the same
node. This is normally handled by the reboot script, but it might
be desirable to do this by hand if there was a crash or another
special condition.

This can be easily corrected with this command, which will spread
instances around the cluster to balance it:

    hbal -L -C -v -p

The above will show the proposed solution, with the state of the
cluster before and after (`-p`), and the commands to get there
(`-C`). To actually execute the commands, you can copy-paste them. An
alternative is to pass the `-X` argument, to tell `hbal`
to actually issue the commands itself:

    hbal -L -C -v -p -X

This will automatically move the instances around and rebalance the
cluster. Here's an example run on a small cluster:

    root@fsn-node-01:~# gnt-instance list
    Instance                          Hypervisor OS                 Primary_node               Status  Memory
    loghost01.torproject.org          kvm        debootstrap+buster fsn-node-02.torproject.org running   2.0G
    onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-02.torproject.org running  12.0G
    static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
    web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    root@fsn-node-01:~# hbal -L -X
    Loaded 2 nodes, 5 instances
    Group size 2 nodes, 5 instances
    Selected node group: default
    Initial check done: 0 bad nodes, 0 bad instances.
    Initial score: 8.45007519
    Trying to minimize the CV...
        1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   4.98124611 a=f
        2. loghost01          fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02   1.78271883 a=f
    Cluster score improved from 8.45007519 to 1.78271883
    Solution length=2
    Got job IDs 16345
    Got job IDs 16346
    root@fsn-node-01:~# gnt-instance list
    Instance                          Hypervisor OS                 Primary_node               Status  Memory
    loghost01.torproject.org          kvm        debootstrap+buster fsn-node-01.torproject.org running   2.0G
    onionoo-backend-01.torproject.org kvm        debootstrap+buster fsn-node-01.torproject.org running  12.0G
    static-master-fsn.torproject.org  kvm        debootstrap+buster fsn-node-02.torproject.org running   8.0G
    web-fsn-01.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G
    web-fsn-02.torproject.org         kvm        debootstrap+buster fsn-node-02.torproject.org running   4.0G

In the above example, you should notice that the `web-fsn` instances both
ended up on the same node. That's because the balancer did not know
that they should be distributed. A special configuration was done,
below, to avoid that problem in the future. But as a workaround,
instances can also be moved by hand and the cluster re-balanced.

Also notice that `-X` does not show the job output; use
`ganeti-watch-jobs` for that, in another terminal. See the [job
inspection](#job-inspection) section for more details on that.

### Redundant instances distribution

Some instances are redundant across the cluster and should *not* end up
on the same node. A good example are the `web-fsn-01` and `web-fsn-02`
instances which, in theory, would serve similar traffic. If they end
up on the same node, it might flood the network on that machine or at
least defeat the purpose of having redundant machines.

The way to ensure they get distributed properly by the balancing
algorithm is to "tag" them. For the web nodes, for example, this was
performed on the master:

    gnt-cluster add-tags htools:iextags:service
    gnt-instance add-tags web-fsn-01.torproject.org service:web-fsn
    gnt-instance add-tags web-fsn-02.torproject.org service:web-fsn

This tells Ganeti that `web-fsn` is an "exclusion tag" and the
optimizer will not try to schedule instances with those tags on the
same node.

To see which tags are present, use:

    # gnt-cluster list-tags
    htools:iextags:service

You can also find which instances are assigned to a tag with:

    # gnt-cluster search-tags service
    /cluster htools:iextags:service
    /instances/web-fsn-01.torproject.org service:web-fsn
    /instances/web-fsn-02.torproject.org service:web-fsn

IMPORTANT: a previous version of this article mistakenly indicated
that a new cluster-level tag had to be created for each service. That
method did *not* work. The [hbal manpage](http://docs.ganeti.org/ganeti/current/man/hbal.html#exclusion-tags) explicitly mentions that
the cluster-level tag is a *prefix* that can be used to create
*multiple* such tags. This configuration also happens to be simpler
and easier to use...

### HDD migration restrictions

Cluster balancing works well until there are inconsistencies between
how nodes are configured. In our case, some nodes have HDDs (Hard Disk
Drives, AKA spinning rust) and others do not. Therefore, it's not
possible to move an instance from a node with a disk allocated on the
HDD to a node that does not have such a disk.

Yet somehow the allocator is not smart enough to tell, and you will
get the following error when doing an automatic rebalancing:

    one of the migrate failed and stopped the cluster balance: Can't create block device: Can't create block device <LogicalVolume(/dev/vg_ganeti_hdd/98d30e7d-0a47-4a7d-aeed-6301645d8469.disk3_data, visible as /dev/, size=102400m)> on node fsn-node-07.torproject.org for instance gitlab-02.torproject.org: Can't create block device: Can't compute PV info for vg vg_ganeti_hdd

In this case, it is trying to migrate the `gitlab-02` server from
`fsn-node-01` (which has an HDD) to `fsn-node-07` (which doesn't),
which naturally fails. This is a known limitation of the Ganeti
code. There has been a [draft design document for multiple storage
unit support](http://docs.ganeti.org/ganeti/master/html/design-multi-storage-htools.html) since 2015, but it has [never been
implemented](https://github.com/ganeti/ganeti/issues/865). Multiple issues have been reported upstream on
the subject:

 * [208: Bad behaviour when multiple volume groups exists on nodes](https://github.com/ganeti/ganeti/issues/208)
 * [1199: unable to mark storage as unavailable for allocation](https://github.com/ganeti/ganeti/issues/1199)
 * [1240: Disk space check with multiple VGs is broken](https://github.com/ganeti/ganeti/issues/1240)
 * [1379: Support for displaying/handling multiple volume groups](https://github.com/ganeti/ganeti/issues/1379)

Unfortunately, there are no known workarounds for this, at least not
that fix the `hbal` command. It *is* possible to exclude the faulty
migration from the pool of possible moves, however, for example in the
above case:

    hbal -L -v -C -P --exclude-instances gitlab-02.torproject.org

It's also possible to use the `--no-disk-moves` option to avoid disk
move operations altogether.

Both workarounds obviously do not correctly balance the
cluster... Note that we have also tried to use `htools:migration` tags
to work around that issue, but [those do not work for secondary
instances](https://github.com/ganeti/ganeti/issues/1497). For this we would need to set up [node groups](http://docs.ganeti.org/ganeti/current/html/man-gnt-group.html)
instead.

A good trick is to look at the solution proposed by `hbal`:

    Trying to minimize the CV...
        1. tbb-nightlies-master fsn-node-01:fsn-node-02 => fsn-node-04:fsn-node-02   6.12095251 a=f r:fsn-node-04 f
        2. bacula-director-01   fsn-node-01:fsn-node-03 => fsn-node-03:fsn-node-01   4.56735007 a=f
        3. staticiforme         fsn-node-02:fsn-node-04 => fsn-node-02:fsn-node-01   3.99398707 a=r:fsn-node-01
        4. cache01              fsn-node-07:fsn-node-05 => fsn-node-07:fsn-node-01   3.55940346 a=r:fsn-node-01
        5. vineale              fsn-node-05:fsn-node-06 => fsn-node-05:fsn-node-01   3.18480313 a=r:fsn-node-01
        6. pauli                fsn-node-06:fsn-node-07 => fsn-node-06:fsn-node-01   2.84263128 a=r:fsn-node-01
        7. neriniflorum         fsn-node-05:fsn-node-02 => fsn-node-05:fsn-node-01   2.59000393 a=r:fsn-node-01
        8. static-master-fsn    fsn-node-01:fsn-node-02 => fsn-node-02:fsn-node-01   2.47345604 a=f
        9. polyanthum           fsn-node-02:fsn-node-07 => fsn-node-07:fsn-node-02   2.47257956 a=f
       10. forrestii            fsn-node-07:fsn-node-06 => fsn-node-06:fsn-node-07   2.45119245 a=f
    Cluster score improved from 8.92360196 to 2.45119245

Look at the last column. The `a=` field shows what "action" will be
taken. An `f` is a failover (or "migrate"), and an `r:` is a
`replace-disks`, with the new secondary after the colon (`:`). In
the above case, the proposed solution is correct: no new secondary is
in the range of nodes that lack HDDs (`fsn-node-0[5-7]`). If one of
the disk replacements lands on a node without an HDD, that is when
you use `--exclude-instances` to find a better solution. A typical
exclude is:

    hbal -L -v -C -P --exclude-instance=bacula-director-01,tbb-nightlies-master,eugeni,winklerianum,woronowii,rouyi,loghost01,materculae,gayi,weissii

Another option is to specifically look for instances that do not have
a HDD and migrate only those. In my situation, `gnt-cluster verify`
was complaining that `fsn-node-02` was full, so I looked for all the
instances on that node and found the ones which didn't have a HDD:

    gnt-instance list -o  pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status \
      | sort | grep 'fsn-node-02' | awk '{print $3}' | \
      while read instance ; do
        printf "checking $instance: "
        if gnt-instance info $instance | grep -q hdd ; then
          echo "HAS HDD"
        else
          echo "NO HDD"
        fi
      done

Then you can manually `migrate -f` (to move an instance to its
secondary) and `replace-disks -n` (to pick a new secondary) the
instances that *can* be migrated out of the first four machines
(which have HDDs) to the last three (which do not). Look at the
memory usage in `gnt-node list` to pick the best node.
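
As a sketch, with a hypothetical instance name and `fsn-node-05`
picked as the new secondary:

    # live-migrate the instance to its current secondary
    gnt-instance migrate -f example-instance.torproject.org
    # then allocate a new secondary on a node without HDDs
    gnt-instance replace-disks -n fsn-node-05.torproject.org example-instance.torproject.org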

In general, if a given node in the first four is overloaded, a good
trick is to look for one that can be failed over, with, for example:

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep '^fsn-node-0[1234]' | grep 'fsn-node-0[5678]'

... or, for a particular node (say fsn-node-04):

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort | grep ^fsn-node-04 | grep 'fsn-node-0[5678]'

The instances listed there would be ones that can be migrated to their
secondary to give `fsn-node-04` some breathing room.

## Adding and removing addresses on instances

Say you created an instance but forgot to assign an extra IP
address. You can still do so with:

    gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
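
Conversely, an extra NIC can be dropped again; a sketch, assuming you
want to remove the last network interface of the same instance (the
change generally takes effect at the next restart):

    gnt-instance modify --net -1:remove test01.torproject.org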

## Job inspection

Sometimes it can be useful to look at the active jobs. It might be,
for example, that another user has queued a bunch of jobs in another
terminal which you do not have access to, or some automated process
did (Nagios, for example, runs `gnt-cluster verify` once in a
while). Ganeti has this concept of "jobs" which can provide
information about those.

The command `gnt-job list` will show the entire job history, and
`gnt-job list --running` will show running jobs. `gnt-job watch` can
be used to watch a specific job.
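
For example, to get a quick overview of what is currently running and
then follow a specific job (the job ID here is hypothetical):

    gnt-job list --running -o id,status,summary
    gnt-job watch 424242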

We have a wrapper called `ganeti-watch-jobs` which automatically shows
the output of whatever job is currently running and exits when all
jobs complete. This is particularly useful while [rebalancing the
cluster](#rebalancing-a-cluster) as `hbal -X` does not show the job output...

## Open vSwitch crash course and debugging

[Open vSwitch](https://www.openvswitch.org/) is used in the `gnt-fsn` cluster to connect the multiple
machines with each other through [Hetzner's "vswitch"](https://wiki.hetzner.de/index.php/Vswitch/en) system.

You will typically not need to deal with Open vSwitch, as Ganeti takes
care of configuring the network on instance creation and
migration. But if you believe there might be a problem with it, you
can consider reading the following:

 * [Documentation portal](https://docs.openvswitch.org/en/latest/)
 * [Tutorials](https://docs.openvswitch.org/en/latest/tutorials/index.html)
 * [Debugging Open vSwitch slides](https://www.openvswitch.org/support/slides/OVS-Debugging-110414.pdf)

## Accessing the QEMU control ports

There is a magic warp zone on the node where an instance is running:

```
nc -U /var/run/ganeti/kvm-hypervisor/ctrl/$INSTANCE.monitor
```

This drops you in the [QEMU monitor](https://people.redhat.com/pbonzini/qemu-test-doc/_build/html/topics/pcsys_005fmonitor.html) which can do all sorts of
things including adding/removing devices, save/restore the VM state,
pause/resume the VM, do screenshots, etc.
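
For example, a few read-only monitor commands that should be safe to
try once connected (the instance name is hypothetical):

```
# connect to the monitor socket of an instance running on this node
nc -U /var/run/ganeti/kvm-hypervisor/ctrl/test-01.torproject.org.monitor
# then, at the (qemu) prompt, for example:
#   info status   # is the VM running or paused?
#   info block    # list the block devices and their state
#   info network  # show the network devices
```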

There are many sockets in the `ctrl` directory, including:

 * `.serial`: the instance's serial port
 * `.monitor`: the QEMU monitor control port
 * `.qmp`: the same, but with a JSON interface that I can't figure out
   (the `-qmp` argument to `qemu`)
 * `.kvmd`: same as the above?

## Pager playbook

### I/O overload

In case of excessive I/O, it might be worth looking into which machine
is at fault. The [howto/drbd](howto/drbd) page explains how to map a DRBD device to a
VM. You can also find which logical volume is backing an instance (and
vice versa) with this command:

    lvs -o+tags

This will list all logical volumes and their associated tags. If you
already know which logical volume you're looking for, you can address
it directly:

    root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
      LV Tags
      originstname+bacula-director-01.torproject.org
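
The reverse mapping can also be queried through Ganeti itself; a
sketch, with a hypothetical node name:

    gnt-node volumes -o node,phys,instance fsn-node-01.torproject.org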

### Node failure

Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by
failing instances over to other nodes. This is currently done manually, however.

WARNING: the following procedure should be considered a LAST
RESORT. In the vast majority of cases, it is simpler and less risky to
just restart the node using a remote power cycle to restore the
service than to risk the split brain scenario which this procedure can
cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from
actually returns on its own without you being able to stop it, it
may bring those DRBD disks and virtual machines back up, and you *may* end
up in a split brain scenario. Normally, the node asks the master
which VMs to start, so it should be safe to fail over from a node that
is NOT the master, but make sure the rest of the cluster is healthy
before going ahead with this procedure.

If, say, `fsn-node-07` completely fails and you need to restore
service to the virtual machines running on that server, you can
failover to the secondaries. Before you do, however, you need to be
completely confident it is not still running in parallel, which could
lead to a "split brain" scenario. For that, just cut the power to the
machine using out of band management (e.g. on Hetzner, power down the
machine through the Hetzner Robot, on Cymru, use the iDRAC to cut the
power to the main board).

Once the machine is powered down, instruct Ganeti to stop using it
altogether:

    gnt-node modify --offline=yes fsn-node-07

Then, once the machine is offline and Ganeti also agrees, switch all
the instances on that node to their secondaries:

    gnt-node failover fsn-node-07.torproject.org

It's possible that you need `--ignore-consistency` but this has caused
trouble in the past (see [40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). In any case, it is [not used at
the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for example: they explicitly say they never needed the
flag.

Note that it will still try to connect to the failed node to shutdown
the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed
server is repaired and restarts, it will contact the master to ask for
instances to start. Since the instances have been migrated, none will
be started and there *should* not be any inconsistencies.

Once the machine is up and running and you are confident you do not
have a split brain scenario, you can re-add the machine to the cluster
with:

    gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster because you now have an empty
node which could be reused (hopefully). It might, obviously, be worth
exploring the root cause of the failure before re-adding the machine
to the cluster.

Recoveries could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
Debian by default. See also the [autorepair](http://docs.ganeti.org/docs/ganeti/2.15/html/admin.html#autorepair) section of the admin
manual.

### Master node failure

A master node failure is a special case, as you may not have access to
the node to run Ganeti commands. The [Ganeti wiki master failover
procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) has good documentation on this, but we also include
scenarios specific to our use cases, to make sure this is also
available offline.

There are two different scenarios that might require a master
failover:

 1. the master is *expected* to fail or go down for maintenance
    (looming HDD failure, planned maintenance) and we want to retain
    availability

 2. the master has completely failed (motherboard fried, power failure,
    etc)

The key difference between scenario 1 and 2 here is that in scenario
1, the master is *still* available.

#### Scenario 1: preventive maintenance

This is the best case scenario, as the master is still available. In
that case, it should simply be a matter of doing the `master-failover`
command and marking the old master as offline. 

On the machine you want to elect as the new master:

    gnt-cluster master-failover
    gnt-node modify --offline yes OLDMASTER.torproject.org

When the old master is available again, re-add it to the cluster with:

    gnt-node add --readd OLDMASTER.torproject.org

Note that it *should* be safe to boot the old master normally, as long
as it doesn't think it's the master before reboot. That is because
it's the master which tells nodes which VMs to start on boot. You can
check that by running this on the OLDMASTER:

    gnt-cluster getmaster

It should return the *NEW* master.

Here's an example of a routine failover performed on `fsn-node-01`,
the nominal master of the `gnt-fsn` cluster, failing over to a
secondary master (we picked `fsn-node-02` here) in preparation for a
disk replacement:

    root@fsn-node-02:~# gnt-cluster master-failover
    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-02.torproject.org
    root@fsn-node-02:~# gnt-node modify --offline yes fsn-node-01.torproject.org
    Tue Jun 21 14:30:56 2022 Failed to stop KVM daemon on node 'fsn-node-01.torproject.org': Node is marked offline
    Modified node fsn-node-01.torproject.org
     - master_candidate -> False
     - offline -> True

And indeed, `fsn-node-01` now thinks it's not the master anymore:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-02.torproject.org

And this is how the node was recovered, after a reboot, on the new
master:

    root@fsn-node-02:~# gnt-node add --readd fsn-node-01.torproject.org
    2022-06-21 16:43:52,666: The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies.
    Tue Jun 21 16:43:54 2022  - INFO: Readding a node, the offline/drained flags were reset
    Tue Jun 21 16:43:54 2022  - INFO: Node will be a master candidate

And to promote it back, on the old master:

    root@fsn-node-01:~# gnt-cluster master-failover
    root@fsn-node-01:~# 

And both nodes agree on who the master is:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

Now is a good time to verify the cluster too:

    gnt-cluster verify

That's pretty much it! See [tpo/tpa/team#40805](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/40805) for the rest of
that incident.

#### Scenario 2: complete master node failure

In this scenario, the master node is *completely* unavailable. In this
case, the [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) should be
followed pretty much to the letter.

WARNING: if you follow this procedure and skip step 1, you will
probably end up with a split brain scenario (recovery documented
below). So make absolutely sure the old master is *REALLY* unavailable
before moving ahead with this.

The procedure is, at the time of writing (WARNING: UNTESTED):

 1. Make sure that the original failed master won't start again while
    a new master is present, preferably by physically shutting down
    the node.

 2. To upgrade one of the master candidates to the master, issue the
    following command on the machine you intend to be the new master:

        gnt-cluster master-failover

 3. Offline the old master so the new master doesn't try to
    communicate with it. Issue the following command:

        gnt-node modify --offline yes oldmaster

 4. If there were any DRBD instances on the old master node, they can
    be failed over by issuing the following commands:

        gnt-node evacuate -s oldmaster
        gnt-node evacuate -p oldmaster

 5. Any plain instances on the old master need to be recreated again.

If the old master becomes available again, re-add it to the cluster
with:

    gnt-node add --readd OLDMASTER.torproject.org

The above procedure is UNTESTED. See also the [Riseup master failover
procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails) for further ideas.

### Split brain recovery

A split brain occurred during a partial failure, failover, then
unexpected recovery of `fsn-node-07` ([issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). It might
occur in other scenarios, but this section documents that specific
one. Hopefully the recovery will be similar in other scenarios.

The split brain was the result of an operator running this command to
failover the instances running on the node:

    gnt-node failover --ignore-consistency fsn-node-07.torproject.org

The symptom of the split brain is that the VM is running on two
machines. You will see that in `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying node status
    Thu Apr 22 01:28:04 2021   - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021   - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021   - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021   - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021   - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021   - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org

In the above, the verification finds an instance running on an
unexpected server (the old primary). Disks will be in a similar
"degraded" state, according to `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying instance status
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021   - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'

We can also see that symptom on an individual instance:

    root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org
    - Instance name: onionbalance-01.torproject.org
    [...]
      Disks: 
        - disk/0: drbd, size 10.0G
          access mode: rw
          nodeA: fsn-node-05.torproject.org, minor=29
          nodeB: fsn-node-07.torproject.org, minor=26
          port: 11031
          on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED*
          on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED*
    [...]

The first (optional) thing to do in a split brain scenario is to stop the damage
done by the running instances: stop all the instances running in parallel,
on both the previous and new primaries:

    gnt-instance stop $INSTANCES

Then on `fsn-node-07` just use `kill(1)` to shut down the `qemu`
processes running the VMs directly. Now the instances should all be
shut down and no further changes will be made on the VMs that could
possibly be lost.
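
A sketch for finding those stray processes, assuming Ganeti's usual
practice of putting the instance name on the `qemu` command line (the
instance name and PID are hypothetical):

    # locate the stray qemu process for the instance
    pgrep -a -f onionbalance-01.torproject.org
    # then terminate it, using the PID printed above
    kill 12345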

(This step is optional because you can also skip straight to the hard
decision below, while leaving the instances running. But that adds
pressure to you, and we don't want to do that to your poor brain right
now.)

That will leave you time to make a more important decision: which node
will be authoritative (which will keep running as primary) and which
one will "lose" (and will have its instances destroyed)? There's no
easy right or wrong answer for this: it's a judgement call. In any
case, there may already have been data loss: since both nodes were
available and the VMs were running on both, data written on one of
the nodes during the split brain will be lost when we destroy the
state on the "losing" node.

If you have picked the previous primary as the "new" primary, you will
need to *first* revert the failover and flip the instances back to the
previous primary:

    for instance in $INSTANCES; do
        gnt-instance failover $instance
    done

When that is done, or if you have picked the "new" primary (the one
the instances were originally failed over to) as the official one: you
need to fix the disks' state. For this, flip to a "plain" disk
(i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring
the disk, and reallocate a new disk in the right place. Assuming all
instances are stopped, this should do it:

    for instance in $INSTANCES ; do
      gnt-instance modify -t plain $instance
      gnt-instance modify -t drbd --no-wait-for-sync $instance
      gnt-instance start $instance
      gnt-instance console $instance
    done

Then each instance should be back up on a single node and the split
brain scenario resolved. Note that this means the other side of the
DRBD mirror is destroyed in the procedure; that is the step that
drops the data which was written to the wrong side of the "split
brain".

Once everything is back to normal, it might be a good idea to
rebalance the cluster.

References:

 * the `-t plain` hack comes from [this post on the Ganeti list](https://groups.google.com/g/ganeti/c/l8www_IcFFI)
 * [this procedure](https://blkperl.github.io/split-brain-ganeti.html) suggests using `replace-disks -n` which also
   works, but requires us to pick the secondary by hand each time,
   which is annoying
 * [this procedure](https://www.ipserverone.info/knowledge-base/how-to-fix-drbd-recovery-from-split-brain/) has instructions on how to recover at the DRBD
   level directly, but we have not needed those instructions so far

### Bridge configuration failures

If you get the following error while trying to bring up the bridge:

    root@chi-node-02:~# ifup br0
    add bridge failed: Package not installed
    run-parts: /etc/network/if-pre-up.d/bridge exited with return code 1
    ifup: failed to bring up br0

... it might be that the bridge scripts cannot load the required
kernel module, because kernel module loading has been disabled. Reboot with
the `/etc/no_modules_disabled` file present:

    touch /etc/no_modules_disabled
    reboot

It might be that the machine took too long to boot because it's not in
mandos and the operator took too long to enter the LUKS
passphrase. Re-enable the machine with this command on mandos:

    mandos-ctl --enable chi-node-02.torproject

### Cleaning up orphan disks

Sometimes `gnt-cluster verify` will give this warning, particularly
after a failed rebalance:

    * Verifying orphan volumes
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta is unknown
       - WARNING: node fsn-node-06.torproject.org: volume vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data is unknown

This can happen when an instance was partially migrated to a node (in
this case `fsn-node-06`) but the migration failed because (for
example) there was no HDD on the target node. The fix here is simply
to remove the logical volumes on the target node:

    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_meta
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/27dd3687-8953-447e-8632-adf4aa4e11b6.disk0_data
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_meta
    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/abf0eeac-55a0-4ccc-b8a0-adb0d8d67cf7.disk1_data

### Cleaning up ghost disks

Under certain circumstances, you might end up with "ghost" disks, for
example:

    Tue Oct  4 13:24:07 2022   - ERROR: cluster : ghost disk 'ed225e68-83af-40f7-8d8c-cf7e46adad54' in temporary DRBD map

It's unclear how this happens, but in this specific case it is
believed the problem occurred because a disk failed to be added to an
instance being resized.

It's *possible* this is a situation similar to the one above, in which
case you must first find *where* the ghost disk is, with something
like:

    gnt-cluster command 'lvs --noheadings' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

If this finds a device, you can remove it as normal:

    ssh fsn-node-06.torproject.org -tt lvremove vg_ganeti/ed225e68-83af-40f7-8d8c-cf7e46adad54.disk1_data

... but in this case, the DRBD map is *not* associated with a logical
volume. You can also check the `dmsetup` output for a match:

    gnt-cluster command 'dmsetup ls' | grep 'ed225e68-83af-40f7-8d8c-cf7e46adad54'

According to [this discussion](https://groups.google.com/g/ganeti/c/s5qoh26T1yA), it's possible that restarting
ganeti on all nodes might clear out the issue:

    gnt-cluster command 'service ganeti restart'

If *all* the "ghost" disks mentioned are not actually found anywhere
in the cluster, either in the device mapper or logical volumes, it
might just be stray data leftover in the data file.

So it *looks* like the proper way to do this is to *remove* the
temporary file where this data is stored:

    gnt-cluster command  'grep ed225e68-83af-40f7-8d8c-cf7e46adad54 /var/lib/ganeti/tempres.data'
    ssh ... service ganeti stop
    ssh ... rm /var/lib/ganeti/tempres.data
    ssh ... service ganeti start
    gnt-cluster verify

That solution was proposed in [this discussion](https://groups.google.com/g/ganeti/c/SMR3yNek3Js). Anarcat toured the
Ganeti source code and found that the `ComputeDRBDMap` function, in
the Haskell codebase, basically just sucks the data out of that
`tempres.data` JSON file, and dumps it into the Python side of
things. Then the Python code looks for those disks in its internal
disk list and compares. It's pretty unlikely that the warning would
happen with the disks still being around, therefore.

### Fixing inconsistent disks

Sometimes `gnt-cluster verify` will give this error:

    WARNING: instance materculae.torproject.org: disk/0 on fsn-node-02.torproject.org is degraded; local disk state is 'ok'

... or worse:

    ERROR: instance materculae.torproject.org: couldn't retrieve status for disk/2 on fsn-node-03.torproject.org: Can't find device <DRBD8(hosts=46cce2d9-ddff-4450-a2d6-b2237427aa3c/10-053e482a-c9f9-49a1-984d-50ae5b4563e6/22, port=11177, backend=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/486d3e6d-e503-4d61-a8d9-31720c7291bd.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=10240m)>

The fix for both is to run:

    gnt-instance activate-disks materculae.torproject.org

This will make sure disks are correctly set up for the instance.

If you have a lot of those warnings, pipe the output into this filter,
for example:

    gnt-cluster verify | grep -e 'WARNING: instance' -e 'ERROR: instance' |
      sed 's/.*instance//;s/:.*//' |
      sort -u |
      while read instance; do
        gnt-instance activate-disks $instance
      done

If you see an error like this:

    DRBD CRITICAL: Device 28 WFConnection UpToDate, Device 3 WFConnection UpToDate, Device 31 WFConnection UpToDate, Device 4 WFConnection UpToDate

In this case, it's warning that the node has devices 3, 4, 28, and 31 in
`WFConnection` state, which is incorrect. This might not be detected
by Ganeti and therefore requires some hand-holding. This is documented
in the [resyncing disks section of our DRBD documentation](howto/drbd#resyncing-disks). Like in
the above scenario, the solution is basically to run `activate-disks`
on the affected instances.

### Not enough memory for failovers

Another error that `gnt-cluster verify` can give you is, for example:

    - ERROR: node fsn-node-04.torproject.org: not enough memory to accomodate instance failovers should node fsn-node-03.torproject.org fail (16384MiB needed, 10724MiB available)

The solution is to [rebalance the cluster](#rebalancing-a-cluster).
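
As a reminder, that usually means a dry run first, then an execution,
as detailed in that section:

    hbal -L -C -P   # preview the proposed moves
    hbal -L -X      # actually submit the jobs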

### Can't assemble device after creation

It's possible that Ganeti fails to create an instance with this error:

    Thu Jan 14 20:01:00 2021  - WARNING: Device creation failed
    Failure: command execution error:
    Can't create block device <DRBD8(hosts=d1b54252-dd81-479b-a9dc-2ab1568659fa/0-3aa32c9d-c0a7-44bb-832d-851710d04765/0, port=11005, backend=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_data, not visible, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3f60a066-c957-4a86-9fae-65525fe3f3c7.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=10240m)> on node chi-node-03.torproject.org for instance build-x86-13.torproject.org: Can't assemble device after creation, unusual event: drbd0: timeout while configuring network

In this case, the problem was that `chi-node-03` had an incorrect
`secondary_ip` set. The immediate fix was to correctly set the
secondary address of the node:

    gnt-node modify --secondary-ip=172.30.130.3 chi-node-03.torproject.org

Then `gnt-cluster verify` was complaining about the leftover DRBD
device:

       - ERROR: node chi-node-03.torproject.org: unallocated drbd minor 0 is in use

For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device).

### SSH key verification failures

Ganeti uses SSH to launch arbitrary commands (as root!) on other
nodes. It does this using a funky command, from `node-daemon.log`:

    ssh -oEscapeChar=none -oHashKnownHosts=no \
      -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
      -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
      -oConnectTimeout=10 -oHostKeyAlias=chignt.torproject.org \
      -oPort=22 -oBatchMode=yes -oStrictHostKeyChecking=yes -4 \
      root@chi-node-03.torproject.org

This has caused us some problems in the Ganeti buster to bullseye
upgrade, possibly because of changes in host verification routines in
OpenSSH. The problem was documented in [issue 1608 upstream](https://github.com/ganeti/ganeti/issues/1608) and
[tpo/tpa/team#40383](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40383).

A workaround is to synchronize Ganeti's `known_hosts` file:

    grep 'chi-node-0[0-9]' /etc/ssh/ssh_known_hosts | grep -v 'initramfs' | grep ssh-rsa | sed 's/[^ ]* /chignt.torproject.org /' >> /var/lib/ganeti/known_hosts

Note that the above assumes a cluster with fewer than 10 nodes.
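
To confirm the workaround took, you can attempt the same kind of SSH
connection Ganeti performs, adapted from the command above (the node
name is an example):

    ssh -oHostKeyAlias=chignt.torproject.org \
      -oGlobalKnownHostsFile=/var/lib/ganeti/known_hosts \
      -oUserKnownHostsFile=/dev/null -oCheckHostIp=no \
      -oBatchMode=yes -oStrictHostKeyChecking=yes \
      root@chi-node-03.torproject.org true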

### Other troubleshooting

The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
problems.

See also the [common issues page](https://github.com/ganeti/ganeti/wiki/Common-Issues) in the Ganeti wiki.

Look into logs on the relevant nodes (particularly
`/var/log/ganeti/node-daemon.log`, which shows all commands run by
Ganeti) when you have problems.

### Migrating a VM between clusters

The [export/import](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#export-import) mechanism can also be used to export and import
VMs one at a time, if only a subset of the cluster needs to be
evacuated.

Note that this procedure is still a work in progress. A simulation was
performed in [tpo/tpa/team#40917](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40917), a proper procedure might vary
from this significantly. In particular, there are some optimizations
possible through things like [zerofree](https://tracker.debian.org/pkg/zerofree) and compression...

 1. find nodes to host the exported VM on the source cluster and the
    target cluster; it needs enough disk space in `/var/lib/ganeti/export` to
    keep a copy of a snapshot of the VM:

        df -h /var/lib/ganeti/export

 2. have the right kernel modules loaded, which might require a
    reboot of the source node:

        modprobe dm_snapshot

 3. on the master of the source Ganeti cluster, export the VM to the
    source node, also use `--noshutdown` if you cannot afford to have
    downtime on the VM *and* you are ready to lose data accumulated
    after the snapshot:

        gnt-backup export -n fsn-node-01.torproject.org test-01.torproject.org

    WARNING: this step currently does not work if there's a second
    disk (or swap device? to be confirmed), see [this upstream issue
    for details](https://github.com/ganeti/instance-debootstrap/issues/18). For now we're deploying the "nocloud"
    export/import mechanisms through Puppet to work around that problem,
    which means the whole disk is copied (as opposed to only the used
    parts)

 4. copy the VM snapshot from the source node to node in the target
    cluster:

        rsync -a /var/lib/ganeti/export/test-01.torproject.org/ root@chi-node-02.torproject.org:/var/lib/ganeti/export/test-01.torproject.org/

 5. on the master of the target Ganeti cluster, import the VM:

        gnt-backup import -n chi-node-08:chi-node-07 --src-node=chi-node-02.torproject.org --src-dir=/var/lib/ganeti/export/test-01.torproject.org/ test-01.torproject.org

 6. enter the restored server console to change the IP address:
 
        gnt-instance console test-01.torproject.org
        
 7. if everything looks well, change the IP in LDAP

 8. destroy the old VM
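
Throughout the procedure above, the exports known to each cluster can
be listed, and the snapshot removed once the import has been verified;
a sketch, with a hypothetical instance name:

    gnt-backup list
    gnt-backup remove test-01.torproject.org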

### Mass migrating instances to a new cluster

The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command can do this.

TODO: document mass cluster migrations.
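
A rough, untested sketch of what a single-instance move might look
like, based on the upstream `move-instance` documentation (the
credential files and user name below are placeholders, and both
clusters need RAPI access configured):

    move-instance \
      --src-ca-file=src-rapi.pem --src-username=move --src-password-file=src-password \
      --dest-ca-file=dest-rapi.pem --dest-username=move --dest-password-file=dest-password \
      fsngnt.torproject.org chignt.torproject.org test-01.torproject.org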

### Reboot procedures

If you get this email in Nagios:

    Subject: ** PROBLEM Service Alert: chi-node-01/needrestart is WARNING **

... and in the detailed results, you see:

    WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none
    Services:
    - ganeti.service

You can try to make `needrestart` fix Ganeti by hand:

    root@chi-node-01:~# needrestart
    Scanning processes...
    Scanning candidates...
    Scanning processor microcode...
    Scanning linux images...

    Running kernel seems to be up-to-date.

    The processor microcode seems to be up-to-date.

    Restarting services...
     systemctl restart ganeti.service

    No containers need to be restarted.

    No user sessions are running outdated binaries.
    root@chi-node-01:~#

... but it's actually likely this didn't fix anything. A rerun will
yield the same result.

That is likely because the virtual machines, running inside a `qemu`
process, need a restart. This can be fixed by rebooting the entire
host, if it needs a reboot, or, if it doesn't, just migrating the VMs
around.

See the [Ganeti reboot procedures](#rebooting) for how to proceed from
here on. This is likely a case of an [Instance-only restart](#instance-only-restarts).

### Slow disk sync after rebooting/Broken migrate-back

After rebooting a node with high-traffic instances, the node's disks may take several minutes to sync. While the disks are syncing, the `reboot` script's `--ganeti-migrate-back` option can fail:

```
Wed Aug 10 21:48:22 2022 Migrating instance onionbalance-02.torproject.org
Wed Aug 10 21:48:22 2022 * checking disk consistency between source and target
Wed Aug 10 21:48:23 2022  - WARNING: Can't find disk on node chi-node-08.torproject.org
Failure: command execution error:
Disk 0 is degraded or not fully synchronized on target node, aborting migration
unexpected exception during reboot: [<UnexpectedExit: cmd='gnt-instance migrate -f onionbalance-02.torproject.org' exited=1>] Encountered a bad command exit code!

Command: 'gnt-instance migrate -f onionbalance-02.torproject.org'
```

When this happens, `gnt-cluster verify` may show a large number of errors for node status and instance status:

```
Wed Aug 10 21:49:37 2022 * Verifying node status
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 0 of disk 1e713d4e-344c-4c39-9286-cb47bcaa8da3 (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 1 of disk 1948dcb7-b281-4ad3-a2e4-cdaf3fa159a0 (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 2 of disk 25986a9f-3c32-4f11-b546-71d432b1848f (attached in instance 'probetelemetry-01.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 3 of disk 7f3a5ef1-b522-4726-96cf-010d57436dd5 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 4 of disk bfd77fb0-b8ec-44dc-97ad-fd65d6c45850 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 5 of disk c1828d0a-87c5-49db-8abb-ee00ccabcb73 (attached in instance 'static-gitlab-shim.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 8 of disk 1f3f4f1e-0dfa-4443-aabf-0f3b4c7d2dc4 (attached in instance 'onionbalance-02.torproject.org') is not active
Wed Aug 10 21:49:37 2022   - ERROR: node chi-node-08.torproject.org: drbd minor 9 of disk bbd5b2e9-8dbb-42f4-9c10-ef0df7f59b85 (attached in instance 'onionbalance-02.torproject.org') is not active
Wed Aug 10 21:49:37 2022 * Verifying instance status
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/0 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/1 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance static-gitlab-shim.torproject.org: disk/2 on chi-node-04.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/3-3aa32c9d-c0a7-44bb-832d-851710d04765/8, port=11040, backend=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/b1913b02-14f4-4c0e-9d78-970bd34f5291.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/4-3aa32c9d-c0a7-44bb-832d-851710d04765/11, port=11041, backend=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5fc54069-ee70-499a-9987-8201a604ee77.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance static-gitlab-shim.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/5-3aa32c9d-c0a7-44bb-832d-851710d04765/12, port=11042, backend=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_data, visible as /dev/, size=20480m)>, metadev=<LogicalVolume(/dev/vg_ganeti/5d092bcf-d229-47cd-bb2b-04dfe241fb68.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=20480m)>
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/0 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/1 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance probetelemetry-01.torproject.org: disk/2 on chi-node-06.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/3-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/0, port=11035, backend=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/4b699f8a-ebde-4680-bfda-4e1a2e191b8f.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/4-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/1, port=11036, backend=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/e5f56f72-1492-4596-8957-ce442ef0fcd5.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance probetelemetry-01.torproject.org: couldn't retrieve status for disk/2 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=e2efd223-53e1-44f4-b84d-38f6eb26dcbb/5-0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/2, port=11037, backend=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_data, visible as /dev/, size=51200m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ee280ecd-78cb-46c6-aca4-db23a0ae1454.disk2_meta, visible as /dev/, size=128m)>, visible as /dev/disk/2, size=51200m)>
Wed Aug 10 21:49:37 2022   - WARNING: instance onionbalance-02.torproject.org: disk/0 on chi-node-09.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - WARNING: instance onionbalance-02.torproject.org: disk/1 on chi-node-09.torproject.org is degraded; local disk state is 'ok'
Wed Aug 10 21:49:37 2022   - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/0 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/8-86e465ce-60df-4a6f-be17-c6abb33eaf88/4, port=11022, backend=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_data, visible as /dev/, size=10240m)>, metadev=<LogicalVolume(/dev/vg_ganeti/3b0e4300-d4c1-4b7c-970a-f20b2214dab5.disk0_meta, visible as /dev/, size=128m)>, visible as /dev/disk/0, size=10240m)>
Wed Aug 10 21:49:37 2022   - ERROR: instance onionbalance-02.torproject.org: couldn't retrieve status for disk/1 on chi-node-08.torproject.org: Can't find device <DRBD8(hosts=0d8b8663-e2bd-42e7-9e8d-e4502fa621b8/9-86e465ce-60df-4a6f-be17-c6abb33eaf88/5, port=11021, backend=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_data, visible as /dev/, size=2048m)>, metadev=<LogicalVolume(/dev/vg_ganeti/ec75f295-1e09-46df-b2c2-4fa24f064401.disk1_meta, visible as /dev/, size=128m)>, visible as /dev/disk/1, size=2048m)>
```

This is usually a false alarm, and the warnings and errors will disappear in a few minutes when the disk finishes syncing. Re-check `gnt-cluster verify` every few minutes, and manually migrate the instances back when the errors disappear.

If such an error persists, consider telling Ganeti to "re-seat" the
disks (so to speak) with, for example:

    gnt-instance activate-disks onionbalance-02.torproject.org

## Disaster recovery

If things get completely out of hand and the cluster becomes too
unreliable for service, the only solution is to rebuild another one
elsewhere. Since Ganeti 2.2, there is a [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.html) command to
move instances between clusters that can be used for that purpose. See
the [mass migration procedure](#mass-migrating-instances-to-a-new-cluster) above.

The [export/import](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#export-import) mechanism can also be used to export and import
VMs one at a time, if only a subset of the cluster needs to be
evacuated. See the [migrating a VM between clusters](#migrating-a-vm-between-clusters) procedure above.

If Ganeti is completely destroyed and its APIs don't work anymore, the
last resort is to restore all virtual machines from
[howto/backup](howto/backup). Hopefully, this should not happen except in the case of a
catastrophic data loss bug in Ganeti or [howto/drbd](howto/drbd).

# Reference

## Installation

Ganeti is typically installed as part of the [bare bones machine
installation process](howto/new-machine), specifically during the
"post-install configuration" procedure, once the machine is fully
installed and configured.

Typically, we add a new *node* to an existing *cluster*. Below are
cluster-specific procedures to add a new *node* to each existing
cluster, alongside the configuration of the cluster as it was done at
the time (and how it could be used to rebuild a cluster from scratch).

Make sure you use the procedure specific to the cluster you are
working on.

Note that this is *not* about installing virtual machines (VMs)
*inside* a Ganeti cluster: for that you want to look at the [new
instance procedure](#adding-a-new-instance).

### New gnt-fsn node

 1. To create a new box, follow [howto/new-machine-hetzner-robot](howto/new-machine-hetzner-robot) but change
    the following settings:

    * Server: [PX62-NVMe][]
    * Location: `FSN1`
    * Operating system: Rescue
    * Additional drives: 2x10TB HDD (update: starting from fsn-node-05,
      we are *not* ordering additional drives to save on costs, see
      [ticket 33083](https://bugs.torproject.org/33083) for rationale)
    * Add in the comment form that the server needs to be in the same
      datacenter as the other machines (FSN1-DC13, but double-check)

 [PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER

 2. follow the [howto/new-machine](howto/new-machine) post-install configuration

 3. Add the server to the two `vSwitch` systems in [Hetzner Robot web
    UI](https://robot.your-server.de/vswitch)

 4. install openvswitch and allow modules to be loaded:

        touch /etc/no_modules_disabled
        reboot
        apt install openvswitch-switch

 5. Allocate a private IP address in the `30.172.in-addr.arpa` zone
    (and the `torproject.org` zone) for the node, in the
    `admin/dns/domains.git` repository

 6. copy over the `/etc/network/interfaces` from another ganeti node,
    changing the `address` and `gateway` fields to match the local
    entry.

 7. knock on wood, cross your fingers, pet a cat, help your local
    book store, and reboot:
  
         reboot

 8. Prepare all the nodes by configuring them in Puppet, by adding the
    class `roles::ganeti::fsn` to the node

 9. Re-disable module loading:

        rm /etc/no_modules_disabled

 10. run puppet across the ganeti cluster to ensure ipsec tunnels are
     up:

         cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'

 11. reboot again:
 
         reboot

 12. Then the node is ready to be added to the cluster, by running
     this on the master node:

         gnt-node add \
          --secondary-ip 172.30.135.2 \
          --no-ssh-key-check \
          --no-node-setup \
          fsn-node-02.torproject.org

     If this is an entirely new cluster, you need a different
     procedure, see [the cluster initialization procedure](#gnt-fsn-cluster-initialization) instead.

 13. make sure everything is great in the cluster:

         gnt-cluster verify

     If that takes a long time and eventually fails with errors like:

         ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out\'r\n

     ... that is because the [howto/ipsec](howto/ipsec) tunnels between the nodes are
     failing. Make sure Puppet has run across the cluster (step 10
     above) and see [howto/ipsec](howto/ipsec) for further diagnostics. For example,
     the above would be fixed with:

         ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
         ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"

### gnt-fsn cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup
of the first Ganeti node when the `gnt-fsn` cluster was setup:

    gnt-cluster init \
        --master-netdev vlan-gntbe \
        --vg-name vg_ganeti \
        --secondary-ip 172.30.135.1 \
        --enabled-hypervisors kvm \
        --nic-parameters mode=openvswitch,link=br0,vlan=4000 \
        --mac-prefix 00:66:37 \
        --no-ssh-init \
        --no-etc-hosts \
        fsngnt.torproject.org

The above assumes that `fsngnt` is already in DNS. See the [MAC
address prefix selection](#mac-address-prefix-selection) section for information on how the
`--mac-prefix` argument was selected.

Then the following extra configuration was performed:

    gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
    gnt-cluster modify -H kvm:kernel_path=,initrd_path=
    gnt-cluster modify -H kvm:security_model=pool
    gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
    gnt-cluster modify -H kvm:disk_cache=none
    gnt-cluster modify -H kvm:disk_discard=unmap
    gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
    gnt-cluster modify -H kvm:disk_type=scsi-hd
    gnt-cluster modify -H kvm:migration_bandwidth=950
    gnt-cluster modify -H kvm:migration_downtime=500
    gnt-cluster modify -H kvm:migration_caps=postcopy-ram
    gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
    gnt-cluster modify --uid-pool 4000-4019

The [network configuration](#network-configuration) (below) must also be performed for the
address blocks reserved in the cluster.

### New gnt-chi node

 1. to create a new box, follow the [cymru new-machine howto](howto/new-machine-cymru)

 2. follow the [howto/new-machine](howto/new-machine) post-install configuration

 3. Allocate a private IP address in the `30.172.in-addr.arpa` zone for
    the node, in the `admin/dns/domains.git` repository

 4. add the private IP address to the eth1 interface, for example in
    `/etc/network/interfaces.d/eth1`:

        auto eth1
        iface eth1 inet static
            address 172.30.130.5/24

    This IP must be allocated in the reverse DNS zone file
    (`30.172.in-addr.arpa`) and the `torproject.org` zone file in
    the `dns/domains.git` repository.

 5. enable the interface:
 
        ifup eth1

 6. setup a bridge on the public interface, replacing the `eth0` blocks
    with something like:
    
        auto eth0
        iface eth0 inet manual

        auto br0
        iface br0 inet static
            address 38.229.82.104/24
            gateway 38.229.82.1
            bridge_ports eth0
            bridge_stp off
            bridge_fd 0

        # IPv6 configuration
        iface br0 inet6 static
            accept_ra 0
            address 2604:8800:5000:82:baca:3aff:fe5d:8774/64
            gateway 2604:8800:5000:82::1

 6. allow modules to be loaded, cross your fingers that you didn't
    screw up the network configuration above, and reboot:
 
        touch /etc/no_modules_disabled
        reboot

 7. configure the node in Puppet by adding it to the
    `roles::ganeti::chi` class, and run Puppet on the new node:
    
        puppet agent -t

 8. re-disable module loading:
 
         rm /etc/no_modules_disabled

 9. run puppet across the ganeti cluster to ensure firewalls are
    correctly configured:

         cumin -p 0 'C:roles::ganeti::chi' 'puppet agent -t'

 10. Then the node is ready to be added to the cluster, by running
     this on the master node:

         gnt-node add \
          --secondary-ip 172.30.130.5 \
          --no-ssh-key-check \
          --no-node-setup \
          chi-node-05.torproject.org

    If this is an entirely new cluster, you need a different
    procedure, see [the cluster initialization procedure](#gnt-fsn-cluster-initialization) instead.

 11. make sure everything is great in the cluster:

         gnt-cluster verify

If the last step fails with SSH errors, you may need to re-synchronise
the SSH `known_hosts` file, see [SSH key verification failures](#ssh-key-verification-failures).

### gnt-chi cluster initialization

This procedure replaces the `gnt-node add` step in the initial setup
of the first Ganeti node when the `gnt-chi` cluster was setup:

    gnt-cluster init \
        --master-netdev eth1 \
        --nic-parameters link=br0 \
        --vg-name vg_ganeti \
        --secondary-ip 172.30.130.1 \
        --enabled-hypervisors kvm \
        --mac-prefix 06:66:38 \
        --no-ssh-init \
        --no-etc-hosts \
        chignt.torproject.org
    
The above assumes that `chignt` is already in DNS. See the [MAC
address prefix selection](#mac-address-prefix-selection) section for information on how the
`--mac-prefix` argument was selected.

Then the following extra configuration was performed:

```
gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
gnt-cluster modify -H kvm:migration_caps=postcopy-ram
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify --uid-pool 4000-4019
```

The upper limit for CPU count and memory size were doubled, to 16 and
64G, respectively, with:

```
gnt-cluster modify --ipolicy-bounds-specs \
max:cpu-count=16,disk-count=16,disk-size=1048576,\
memory-size=65536,nic-count=8,spindle-use=12\
/min:cpu-count=1,disk-count=1,disk-size=1024,\
memory-size=128,nic-count=1,spindle-use=1
```

NOTE: watch out for whitespace here. The [original source](https://johnny85v.wordpress.com/2016/06/13/ganeti-commands/) for this
command had too much whitespace, which fails with:

    Failure: unknown/wrong parameter name 'Missing value for key '' in option --ipolicy-bounds-specs'

The disk templates also had to be modified to account for iSCSI
devices:

    gnt-cluster modify --enabled-disk-templates drbd,plain,blockdev
    gnt-cluster modify --ipolicy-disk-templates drbd,plain,blockdev

The [network configuration](#network-configuration) (below) must also be performed for the
address blocks reserved in the cluster. This is the actual initial
configuration performed:

    gnt-network add --network 38.229.82.0/24 --gateway 38.229.82.1 --network6 2604:8800:5000:82::/64 --gateway6 2604:8800:5000:82::1 gnt-chi-01
    gnt-network connect --nic-parameters=link=br0 gnt-chi-01 default

The following IPs were reserved:

    gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01

The first two are for the gateway, but the rest is temporary and might
be reclaimed eventually.

### Network configuration

IP allocation is managed by Ganeti through the `gnt-network(8)`
system. Say we have `192.0.2.0/24` reserved for the cluster, with
the host IP `192.0.2.100` and the gateway on `192.0.2.1`. You will
create this network with:

    gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network

If there's also IPv6, it would look something like this:

    gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network

Note: the actual name of the network (`example-network` above) should
follow the convention established in [doc/naming-scheme](doc/naming-scheme).

Then we associate the new network to the default node group:

    gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default

The arguments to `--nic-parameters` come from the values configured in
the cluster, above. The current values can be found with `gnt-cluster
info`.

For example, the second Ganeti network block was assigned with the
following commands:

    gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02
    gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default

IP addresses can be reserved with the `--reserved-ips` argument to the
modify command, for example:

    gnt-network modify --add-reserved-ips=38.229.82.2,38.229.82.3,38.229.82.4,38.229.82.5,38.229.82.6,38.229.82.7,38.229.82.8,38.229.82.9,38.229.82.10,38.229.82.11,38.229.82.12,38.229.82.13,38.229.82.14,38.229.82.15,38.229.82.16,38.229.82.17,38.229.82.18,38.229.82.19 gnt-chi-01

Note that the gateway and node IP addresses are automatically
reserved; the `--add-reserved-ips` list is for hosts outside of the cluster.
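
To review the result, `gnt-network` can list the defined networks and
show per-network details such as reservations and free addresses, for
example:

    gnt-network list
    gnt-network info gnt-fsn13-02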

The network name must follow the [naming convention](doc/naming-scheme).

## SLA

As long as the cluster is not over capacity, it should be able to
survive the loss of a single node unattended.

Justified virtual machine requests can be provisioned within a few
business days without problems.

New nodes can be provisioned within a week or two, depending on budget
and hardware availability.

## Design

Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines
hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting
service. All machines use the same hardware to avoid problems with
live migration. That is currently a customized build of the
[PX62-NVMe][] line.

### Network layout

Machines are interconnected over a [vSwitch](https://wiki.hetzner.de/index.php/Vswitch/en), a "virtual layer 2
network" probably implemented using [Software-defined Networking](https://en.wikipedia.org/wiki/Software-defined_networking)
(SDN) on top of Hetzner's network. The details of that implementation
do not matter much to us, since we do not trust the network and run an
IPsec layer on top of the vswitch. We communicate with the `vSwitch`
through [Open vSwitch](https://en.wikipedia.org/wiki/Open_vSwitch) (OVS), which is (currently manually)
configured on each node of the cluster.
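
The resulting bridge and port layout on a given node can be inspected
with the standard Open vSwitch tooling, for example:

    # overview of all OVS bridges and their ports on this node
    ovs-vsctl show
    # just the bridge names
    ovs-vsctl list-br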

There are two distinct IPsec networks:

 * `gnt-fsn-public`: the public network, which maps to the
   `fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS
   network, and the `gnt-fsn` network pool in Ganeti. It provides
   public IP addresses and routing across the network. Instances get
   their IPs allocated from this network.

 * `gnt-fsn-be`: the private Ganeti network, which maps to the
   `fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS
   network. It has no matching `gnt-network` component; IP
   addresses are allocated manually in the 172.30.135.0/24 network
   through DNS. It provides internal routing for Ganeti commands and
   [howto/drbd](howto/drbd) storage mirroring.

### MAC address prefix selection

The MAC address prefix for the gnt-fsn cluster (`00:66:37:...`) seems
to have been picked arbitrarily. While it does not conflict with a
known existing prefix, it could eventually be issued to a manufacturer
and reused, possibly leading to a MAC address clash. The closest is
currently Huawei:

    $ grep ^0066 /var/lib/ieee-data/oui.txt
    00664B     (base 16)		HUAWEI TECHNOLOGIES CO.,LTD

Such a clash is fairly improbable, because that new manufacturer would
need to show up on the local network as well. Still, new clusters
SHOULD use a different MAC address prefix in the [locally administered
address](https://en.wikipedia.org/wiki/MAC_address#Universal_vs._local) (LAA) space; such addresses "are distinguished by setting the
second-least-significant bit of the first octet of the address". In
other words, the MAC address must have 2, 6, A or E as its second
[nibble](https://en.wikipedia.org/wiki/Nibble) (hex digit), so it must look like one of these:

    x2 - xx - xx - xx - xx - xx
    x6 - xx - xx - xx - xx - xx
    xA - xx - xx - xx - xx - xx
    xE - xx - xx - xx - xx - xx

We used `06:66:38` in the gnt-chi cluster for that reason. We picked
the `06:66` prefix to resemble the existing `00:66` prefix used in
`gnt-fsn`, but varied the last octet of the prefix (from `:37` to
`:38`) to make them slightly more different-looking.
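
A quick way to double-check that a candidate prefix sits in the LAA
space is to test that bit of the first octet directly, for example with
this small shell snippet (illustrative only, not part of our tooling):

    first_octet=0x06   # first octet of the candidate prefix, e.g. 06:66:38
    if (( first_octet & 0x02 )); then
        echo "locally administered (LAA)"
    else
        echo "universally administered"
    fi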

Obviously, it's unlikely the MAC addresses will be compared across
clusters in the short term. But a layer 2 bridge between the two
networks could technically appear in the future (say through some
exotic VPN setup), so it's good to have some difference.

### Hardware variations

We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)), but
in the past DSA had problems live-migrating (migrations wouldn't
immediately fail, but there were "issues" afterwards). So we might need
to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover)
instead of migrate between those parts of the cluster. There are also
doubts that the Linux kernel supports those shiny new processors at
all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix), for
example, so it might be worth waiting a little before switching to
that new platform, even if it's cheaper. See the CPU emulation section
below for a larger discussion.

### CPU emulation

Note that we might want to tweak the `cpu_type` parameter. By default,
it emulates a lot of processor features that could be delegated to the
host CPU instead. If we use `kvm:cpu_type=host`, then each node will
tailor the emulation system to the CPU on the node. But that might make
live migration more brittle: VMs or processes can crash after a live
migration because of a slightly different configuration (microcode, CPU,
kernel and QEMU versions all play a role). So we need to find the
lowest common denominator in CPU families. The list of available
families supported by QEMU varies between releases, but is visible
with:

    # qemu-system-x86_64 -cpu help
    Available CPUs:
    x86 486
    x86 Broadwell             Intel Core Processor (Broadwell)
    [...]
    x86 Skylake-Client        Intel Core Processor (Skylake)
    x86 Skylake-Client-IBRS   Intel Core Processor (Skylake, IBRS)
    x86 Skylake-Server        Intel Xeon Processor (Skylake)
    x86 Skylake-Server-IBRS   Intel Xeon Processor (Skylake, IBRS)
    [...]

The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
micro-architecture. The closest matching family would be
`Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
Note that newer QEMU releases (4.2, currently in unstable) have more
supported features.
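
If we did settle on a family cluster-wide, the change would presumably
be a single hypervisor parameter, along these lines (untested sketch;
`Skylake-Server` is just the candidate discussed above):

    gnt-cluster modify -H kvm:cpu_type=Skylake-Server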

In that context, of course, supporting different CPU manufacturers
(say AMD vs Intel) is impractical: they will have totally different
families that are not compatible with each other. This will break live
migration, which can trigger crashes and problems in the migrated
virtual machines.

If there are problems live-migrating between machines, it is still
possible to "failover" (`gnt-instance failover` instead of `migrate`)
which shuts off the machine, fails over disks, and starts it on the
other side. That's not such a big problem: we often need to reboot
the guests when we reboot the hosts anyway. But it does complicate
our work. Of course, it's also possible that live migrations work fine
if *no* `cpu_type` at all is specified in the cluster, but that needs
to be verified.
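
For reference, the failover mentioned above looks like this
(hypothetical instance name):

    gnt-instance failover test-01.torproject.org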

Nodes could also be [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a
subset of nodes.
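
A rough sketch of what that would look like (hypothetical group and
node names):

    gnt-group add newer-hardware
    gnt-group assign-nodes newer-hardware fsn-node-09.torproject.org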

References:

 * <https://dsa.debian.org/howto/install-ganeti/>
 * <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>

### Installer

The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install
instances. It is configured through Puppet with the [shared ganeti
module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much
as possible. The installer will:

 1. set up GRUB to respond on the serial console
 2. set up and log a random root password
 3. make sure SSH is installed and log the public keys and
    fingerprints
 4. set up swap if a labeled partition is present, or a 512MB swapfile
    otherwise
 5. set up basic static networking through `/etc/network/interfaces.d`

We have custom configurations on top of that to:

 1. add a few base packages
 2. do our own custom SSH configuration
 3. fix the hostname to be a FQDN
 4. add a line to `/etc/hosts`
 5. add a tmpfs
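
As a rough illustration of the hook mechanism (a hypothetical hook, not
one of the hooks actually deployed by Puppet), a hook is an executable
script dropped in `/etc/ganeti/instance-debootstrap/hooks/` which, to
our understanding, receives the mounted instance filesystem in the
`TARGET` environment variable:

    #!/bin/sh
    # hypothetical hook: drop a marker file into the new instance
    set -e
    # TARGET is the mount point of the instance's root filesystem
    echo "installed by ganeti-instance-debootstrap" > "$TARGET/etc/install-marker"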

There is work underway to refactor and automate the install better,
see [ticket 31239](https://bugs.torproject.org/31239) for details.

### Storage

TODO: document how DRBD works in general, and how it's setup here in
particular.

See also the [DRBD documentation](howto/drbd).

The Cymru PoP has an iSCSI cluster for large filesystem
storage. Ideally, this would be automated inside Ganeti; some quick
links:

 * [search for iSCSI in the ganeti-devel mailing list](https://www.mail-archive.com/search?l=ganeti-devel@googlegroups.com&q=iscsi&submit.x=0&submit.y=0)
 * in particular a [discussion of integrating SANs into ganeti](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/P7JU_0YGn9s)
   seems to say "just do it manually" (paraphrasing) and [this
   discussion has an actual implementation](https://groups.google.com/forum/m/?_escaped_fragment_=topic/ganeti/kkXFDgvg2rY), [gnt-storage-eql](https://github.com/atta/gnt-storage-eql)
 * it could be implemented as an [external storage provider](https://github.com/ganeti/ganeti/wiki/External-Storage-Providers), see
   the [documentation](http://docs.ganeti.org/ganeti/2.10/html/design-shared-storage.html)
 * the DSA docs are in two parts: [iscsi](https://dsa.debian.org/howto/iscsi/) and [export-iscsi](https://dsa.debian.org/howto/export-iscsi/)
 * someone made a [Kubernetes provisioner](https://github.com/nmaupu/dell-provisioner) for our hardware, which
   could provide sample code

For now, iSCSI volumes are manually created and passed to new virtual
machines. 
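
A very rough sketch of what adopting such a pre-existing device could
look like, using the `blockdev` disk template enabled earlier (untested;
the device path and instance name are made up):

    gnt-instance add \
      -o debootstrap+bullseye \
      -t blockdev \
      --disk 0:adopt=/dev/disk/by-id/scsi-EXAMPLE-LUN \
      --no-ip-check \
      --no-name-check \
      example-iscsi-01.torproject.org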

## Issues

There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search] with the
~Ganeti label.

 [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
 [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues?label_name%5B%5D=Ganeti

Upstream Ganeti has of course its own [issue tracker on GitHub](https://github.com/ganeti/ganeti/issues).

## Monitoring and testing

<!-- TODO: describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->

## Logs and metrics

Ganeti logs a significant amount of information in
`/var/log/ganeti/`. These logs are of particular interest:

 * `node-daemon.log`: all low-level commands and HTTP requests on the
   node daemon, includes, for example, LVM and DRBD commands
 * `os/*$hostname*.log`: installation log for machine `$hostname`
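
Job-level information (what each `gnt-*` command actually submitted,
and how far it got) is also available through the job queue, for
example (`1234` being whatever job ID `gnt-job list` reports):

    gnt-job list
    gnt-job info 1234
    gnt-job watch 1234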

Ganeti does not currently expose performance metrics that could be
digested by Prometheus, but that would be an interesting feature to add.

## Other documentation

 * [Ganeti](http://www.ganeti.org/)
   * [Ganeti documentation home](http://docs.ganeti.org/)
   * [Main manual](http://docs.ganeti.org/ganeti/master/html/)
   * [Manual pages](http://docs.ganeti.org/ganeti/master/man/)
   * [Wiki](https://github.com/ganeti/ganeti/wiki)
   * [Issues](https://github.com/ganeti/ganeti/issues)
   * [Google group](https://groups.google.com/forum/#!forum/ganeti)
 * [Wikimedia foundation documentation](https://wikitech.wikimedia.org/wiki/Ganeti)
 * [Riseup documentation](https://we.riseup.net/riseup+tech/ganeti)
 * [DSA](https://dsa.debian.org/howto/install-ganeti/)
 * [OSUOSL wiki](https://wiki.osuosl.org/ganeti/)

# Discussion

## Overview

The project of creating a Ganeti cluster for Tor appeared in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and set up by weasel by the end of the month.

## Goals

The goal was to replace the aging group of KVM servers (`kvm[1-5]`, AKA
`textile`, `unifolium`, `macrum`, `kvm4` and `kvm5`).

### Must have

 * arbitrary virtual machine provisioning
 * redundant setup
 * automated VM installation
 * replacement of existing infrastructure

### Nice to have

 * fully configured in Puppet
 * full high availability with automatic failover
 * extra capacity for new projects

### Non-Goals

 * Docker or "container" provisioning - we consider this out of scope
   for now
 * self-provisioning by end-users: TPA remains in control of
   provisioning

## Approvals required

A budget was proposed by weasel in May 2019 and approved by Vegas in
June. An extension to the budget was approved in January 2020 by
Vegas.

## Proposed Solution

Setup a Ganeti cluster of two machines with a Hetzner vSwitch backend.

## Cost

The design based on the [PX62 line][PX62-NVMe] has the following monthly cost
structure:

 * per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
 * IPv4 space: 35.29EUR (/27)
 * IPv6 space: 8.40EUR (/64)
 * bandwidth cost: 1EUR/TB (currently 38EUR)

At three servers, that adds up to around 435EUR/month. Up-to-date costs
are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->

Note that instance installation is also possible [through FAI, see the
Ganeti wiki for examples](https://github.com/ganeti/ganeti/wiki/System-template-with-FAI).

There are GUIs for Ganeti that we are not using, but could adopt if we
want to grant more users access:

 * [Ganeti Web manager](https://ganeti-webmgr.readthedocs.io/) is a
   "Django based web frontend for managing Ganeti virtualization
   clusters. Since Ganeti only provides a command-line interface,
   Ganeti Web Manager’s goal is to provide a user friendly web
   interface to Ganeti via Ganeti’s Remote API. On top of Ganeti it
   provides a permission system for managing access to clusters and
   virtual machines, an in browser VNC console, and vm state and
   resource visualizations"
 * [Synnefo](https://www.synnefo.org/) is a "complete open source
   cloud stack written in Python that provides Compute, Network,
   Image, Volume and Storage services, similar to the ones offered by
   AWS. Synnefo manages multiple Ganeti clusters at the backend for
   handling of low-level VM operations and uses Archipelago to unify
   cloud storage. To boost 3rd-party compatibility, Synnefo exposes
   the OpenStack APIs to users."