[Ganeti](http://ganeti.org/) is software designed to facilitate the management of
virtual machines (KVM or Xen). It helps you move virtual machine
instances from one node to another, create an instance with DRBD
replication on another node, do live migrations from one node to
another, and so on.

[[_TOC_]]
## Listing virtual machines (instances)
This will show the running guests, known as "instances":

    gnt-instance list

## Accessing serial console
Our instances do serial console, starting in GRUB. To access it, run

    gnt-instance console test01.torproject.org

To exit, use `^]` -- that is, Control-<Closing Bracket>.

# How-to

## Glossary

In Ganeti, we use the following terms:

 * **node**: a physical machine
 * **instance**: a virtual machine
 * **master**: the *node* on which we issue Ganeti commands and
   which supervises all the other nodes

Nodes are interconnected through a private network that is used to
communicate commands and synchronise disks (with
[howto/drbd](howto/drbd)). Instances are normally assigned two nodes:
a *primary* and a *secondary*: the *primary* is where the virtual
machine actually runs and the *secondary* acts as a hot failover.
See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/docs/ganeti/3.0/html/glossary.html).
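
For example, to see which node is primary and which is secondary for
each instance (the `pnode` and `snodes` output fields are the same ones
used by the listing commands further down this page):

    gnt-instance list -o name,pnode,snodes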

## Adding a new instance

This command creates a new guest, or "instance" in Ganeti's
vocabulary, with 10G root, 512M swap, 20G spare on SSD, 800G on HDD,
8GB of RAM and 2 virtual CPUs:

    gnt-instance add \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --disk 1:size=20G \
      --disk 2:size=800G,vg=vg_ganeti_hdd \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

This configures the following:

 * redundant disks in a DRBD mirror
 * two additional partitions: one on the default VG (SSD), one on another
   (HDD). A 512MB swapfile is created in `/swapfile`. TODO: configure disks 2
   and 3 automatically in the installer (`/var` and `/srv`?)
 * 8GB of RAM with 2 virtual CPUs
 * an IP allocated from the public gnt-fsn pool:
   `gnt-instance add` will print the IPv4 address it picked to stdout.  The
   IPv6 address can be found in `/var/log/ganeti/os/` on the primary node
   of the instance, see below.
 * with the `test-01.torproject.org` hostname

To find the root password, SSH host key fingerprints, and the IPv6
address, run this **on the node where the instance was created**, for
example:

    egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)

We copy root's authorized keys into the new instance, so you should be able to
log in with your token.  You will be required to change the root password immediately.
Pick something nice and document it in `tor-passwords`.

Also set the reverse DNS for both IPv4 and IPv6 in [Hetzner's robot](https://robot.your-server.de/)
(check under servers -> vSwitch -> IPs) or in our own reverse zone
files (if delegated).

Then follow [howto/new-machine](howto/new-machine).

### Known issues

 * **allocator failures**: Note that you may need to use the `--node`
   parameter to pick which machines you want the instance to end up on,
   otherwise Ganeti will choose for you (and may fail). Use, for
   example, `--node fsn-node-01:fsn-node-02` to use `node-01` as
   primary and `node-02` as secondary (see the example after this
   list). The allocator can sometimes fail if it is upset about
   something in the cluster, for example:

        Can's find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2

   This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). If this problem
   occurs, it might be worth [rebalancing the cluster](#rebalancing-a-cluster).

   The following dashboards can help you choose the least busy nodes:

   - CPU usage: [gnt-dal](https://grafana.torproject.org/d/gex9eLcWz/cpu-usage?orgId=1&refresh=1m&var-class=role::ganeti::dal&var-node=All&var-show_cpu_count=or), [gnt-fsn](https://grafana.torproject.org/d/gex9eLcWz/cpu-usage?orgId=1&refresh=1m&var-class=role::ganeti::fsn&var-node=All&var-show_cpu_count=or)
   - Memory usage: [gnt-dal](https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?from=now-24h&to=now&var-class=role::ganeti::dal&var-node=All&var-show_cpu_count=or&orgId=1&refresh=1m), [gnt-fsn](https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?from=now-24h&to=now&var-class=role::ganeti::fsn&var-node=All&var-show_cpu_count=or&orgId=1&refresh=1m)
   - LVM Disk usage: [gnt-dal](https://grafana.torproject.org/d/f7887271-1a77-4138-ad16-28be8b0ad0ab/lvm-disk-usage?orgId=1&var-class=role::ganeti::dal&var-vg_name=All&var-instance=All), [gnt-fsn](https://grafana.torproject.org/d/f7887271-1a77-4138-ad16-28be8b0ad0ab/lvm-disk-usage?orgId=1&var-class=role%3A%3Aganeti%3A%3Afsn&var-vg_name=All&var-instance=All)

 * **ping failure**: there is a bug in `ganeti-instance-debootstrap`
   which misconfigures `ping` (among other things), see [bug
   31781](https://bugs.torproject.org/31781). It's currently patched in our version of the Debian
   package, but that patch might disappear if Debian upgrades the
   package without [shipping our patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538). Note that this was fixed
   in Debian bullseye and later.

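As an example of pinning the nodes explicitly (a sketch based on the
FSN creation command above, with hypothetical node names; the
allocator is bypassed when nodes are given):

    gnt-instance add \
      -t drbd --no-wait-for-sync \
      --node fsn-node-01.torproject.org:fsn-node-02.torproject.org \
      --net 0:ip=pool,network=gnt-fsn13-02 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org
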
### Other examples

#### Dallas cluster

This is a typical server creation in the `gnt-dal` cluster:

    gnt-instance add \
      -t drbd --no-wait-for-sync \
      --net 0:ip=pool,network=gnt-dal-01 \
      --no-ip-check \
      --no-name-check \
      --disk 0:size=10G \
      --backend-parameters memory=8g,vcpus=2 \
      test-01.torproject.org

Do not forget to follow the [next steps](#next-steps), above.

#### No DRBD, test machine

A simple test machine, with only 10G of disk, 1G of RAM, and 1 CPU, without
DRBD, in the FSN cluster:

    gnt-instance add \
          -t plain --no-wait-for-sync \
          --net 0:ip=pool,network=gnt-fsn13-02 \
          --no-ip-check \
          --no-name-check \
          --disk 0:size=10G \
          --backend-parameters memory=1g,vcpus=1 \
          test-01.torproject.org

Do not forget to follow the [next steps](#next-steps), above.

Don't be afraid to create `plain` machines: they can be easily
converted to `drbd` (with `gnt-instance modify -t drbd`) and the
node's disks are already in RAID-1. What you lose is:

 - High availability during node reboots
 - Faster disaster recovery in case of a node failure

What you gain is:

 - Improved performance
 - Less (2x!) disk usage

### iSCSI integration

To create a VM with iSCSI backing, a disk must first be created on the
SAN, then adopted in a VM, which needs to be *reinstalled* on top of
that. This is typically how large disks were provisioned in the (now
defunct) `gnt-chi` cluster, in the [Cymru POP](howto/new-machine-cymru).

The following instructions assume you are on a node with an [iSCSI
initiator properly set up](howto/new-machine-cymru#iscsi-initiator-setup), and the [SAN cluster management tools
set up](howto/new-machine-cymru#san-management-tools-setup). They also assume you are familiar with the `SMcli` tool; see
the [storage servers documentation](howto/new-machine-cymru#storage-servers) for an introduction.

 1. create a dedicated disk group and virtual disk on the SAN, assign it to the
    host group and propagate the multipath config across the cluster nodes:

        /usr/local/sbin/tpo-create-san-disks --san chi-node-03 --name test-01 --capacity 500
 2. confirm that multipath works; it should look something like this:

        root@chi-node-01:~# multipath -ll
        test-01 (36782bcb00063c6a500000d67603f7abf) dm-20 DELL,MD32xxi
        size=500G features='5 queue_if_no_path pg_init_retries 50 queue_mode mq' hwhandler='1 rdac' wp=rw
        |-+- policy='round-robin 0' prio=6 status=active
        | |- 11:0:0:4 sdi 8:128 active ready running
        | |- 12:0:0:4 sdj 8:144 active ready running
        | `- 9:0:0:4  sdh 8:112 active ready running
        `-+- policy='round-robin 0' prio=1 status=enabled
          |- 10:0:0:4 sdk 8:160 active ghost running
          |- 7:0:0:4  sdl 8:176 active ghost running
          `- 8:0:0:4  sdm 8:192 active ghost running
        root@chi-node-01:~#

 3. create the instance, adopting the SAN-backed disk:

        gnt-instance add \
              -n chi-node-01.torproject.org \
              -t blockdev --no-wait-for-sync \
              --net 0:ip=pool,network=gnt-chi-01 \
              --no-ip-check \
              --no-name-check \
              --disk 0:adopt=/dev/disk/by-id/dm-name-test-01 \
              --backend-parameters memory=8g,vcpus=2 \
              test-01.torproject.org

    NOTE: the actual node must be manually picked because the `hail`
    allocator doesn't seem to know about block devices.

    NOTE: mixing DRBD and iSCSI volumes on a single instance is not supported.

 4. at this point, the VM probably doesn't boot, because for some
    reason `gnt-instance-debootstrap` doesn't fire when disks are
    adopted, so you need to reinstall the machine, which involves
    stopping it first:

        gnt-instance shutdown --timeout=0 test-01
        gnt-instance reinstall test-01

    HACK one: the current installer fails on weird partitioning errors, see
    [upstream bug 13](https://github.com/ganeti/instance-debootstrap/issues/13).
    We applied [this patch](https://github.com/ganeti/instance-debootstrap/commit/e0df6b1fd25dc3e111851ae42872df0a757ac4a9)
    as a workaround to avoid failures when the installer attempts to partition
    the virtual disk.

From here on, follow the [next steps](#next-steps) above.

TODO: This would ideally be automated by an external storage provider,
see the [storage reference for more information](#storage).

### Troubleshooting

If a Ganeti instance install fails, it will show the end of the
install log, for example:

```
Thu Aug 26 14:11:09 2021  - INFO: Selected nodes for instance tb-pkgstage-01.torproject.org via iallocator hail: chi-node-02.torproject.org, chi-node-01.torproject.org
Thu Aug 26 14:11:09 2021  - INFO: NIC/0 inherits netparams ['br0', 'bridged', '']
Thu Aug 26 14:11:09 2021  - INFO: Chose IP 38.229.82.29 from network gnt-chi-01
Thu Aug 26 14:11:10 2021 * creating instance disks...
Thu Aug 26 14:12:58 2021 adding instance tb-pkgstage-01.torproject.org to cluster config
Thu Aug 26 14:12:58 2021 adding disks to cluster config
Thu Aug 26 14:13:00 2021 * checking mirrors status
Thu Aug 26 14:13:01 2021  - INFO: - device disk/0: 30.90% done, 3m 32s remaining (estimated)
Thu Aug 26 14:13:01 2021  - INFO: - device disk/2:  0.60% done, 55m 26s remaining (estimated)
Thu Aug 26 14:13:01 2021 * checking mirrors status
Thu Aug 26 14:13:02 2021  - INFO: - device disk/0: 31.20% done, 3m 40s remaining (estimated)
Thu Aug 26 14:13:02 2021  - INFO: - device disk/2:  0.60% done, 52m 13s remaining (estimated)
Thu Aug 26 14:13:02 2021 * pausing disk sync to install instance OS
Thu Aug 26 14:13:03 2021 * running the instance OS create scripts...
Thu Aug 26 14:16:31 2021 * resuming disk sync
Failure: command execution error:
Could not add os for instance tb-pkgstage-01.torproject.org on node chi-node-02.torproject.org: OS create script failed (exited with exit code 1), last lines in the log file:
Setting up openssh-sftp-server (1:7.9p1-10+deb10u2) ...
Setting up openssh-server (1:7.9p1-10+deb10u2) ...
Creating SSH2 RSA key; this may take some time ...
2048 SHA256:ZTeMxYSUDTkhUUeOpDWpbuOzEAzOaehIHW/lJarOIQo root@chi-node-02 (RSA)
Creating SSH2 ED25519 key; this may take some time ...
256 SHA256:MWKeA8vJKkEG4TW+FbG2AkupiuyFFyoVWNVwO2WG0wg root@chi-node-02 (ED25519)
Created symlink /etc/systemd/system/sshd.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
Created symlink /etc/systemd/system/multi-user.target.wants/ssh.service \xe2\x86\x92 /lib/systemd/system/ssh.service.
invoke-rc.d: could not determine current runlevel
Setting up ssh (1:7.9p1-10+deb10u2) ...
Processing triggers for systemd (241-7~deb10u8) ...
Processing triggers for libc-bin (2.28-10) ...
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
run-parts: /etc/ganeti/instance-debootstrap/hooks/ssh exited with return code 100
Using disk /dev/drbd4 as swap...
Setting up swapspace version 1, size = 2 GiB (2147479552 bytes)
no label, UUID=96111754-c57d-43f2-83d0-8e1c8b4688b4
Not using disk 2 (/dev/drbd5) because it is not named 'swap' (name: )
root@chi-node-01:~#
```

Here, the failure that tripped up the install is:

```
Errors were encountered while processing:
 linux-image-4.19.0-17-amd64
E: Sub-process /usr/bin/dpkg returned an error code (1)
```

But the actual error is higher up, and we need to look at the logs on
the server for this. In this case, in
`chi-node-02:/var/log/ganeti/os/add-debootstrap+buster-tb-pkgstage-01.torproject.org-2021-08-26_14_13_04.log`,
we can find the real problem:

```
Setting up linux-image-4.19.0-17-amd64 (4.19.194-3) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-4.19.0-17-amd64
W: Couldn't identify type of root file system for fsck hook
/etc/kernel/postinst.d/zz-update-grub:
/usr/sbin/grub-probe: error: cannot find a device for / (is /dev mounted?).
run-parts: /etc/kernel/postinst.d/zz-update-grub exited with return code 1
dpkg: error processing package linux-image-4.19.0-17-amd64 (--configure):
 installed linux-image-4.19.0-17-amd64 package post-installation script subprocess returned error exit status 1
```

In this case, oddly enough, even though Ganeti thought the install had
failed, the machine can actually start:

```
gnt-instance start tb-pkgstage-01.torproject.org
```

... and after a while, we can even get a console:

```
gnt-instance console tb-pkgstage-01.torproject.org
```

And in *that* case, the procedure can continue from here on: reset the
root password, and make sure you finish the install:

```
apt install linux-image-amd64
```

In the above case, the `sources-list` post-install hook was buggy: it
wasn't mounting `/dev` and friends before launching the upgrades,
which was causing issues when a kernel upgrade was queued.

And *if* you are debugging an installer and by mistake end up with
half-open filesystems and stray DRBD devices, do take a look at the
[LVM](howto/lvm) and [DRBD documentation](howto/drbd).

## Modifying an instance

### CPU, memory changes

It's possible to change the IP, CPU, or memory allocation of an instance
using the [gnt-instance modify](http://docs.ganeti.org/docs/ganeti/3.0/html/man-gnt-instance.html#modify) command:

    gnt-instance modify -B vcpus=4,memory=8g test1.torproject.org
    gnt-instance reboot test1.torproject.org

Note that this can be more easily done with a Fabric task which will
handle wall warnings, delays, silences and so on, using the standard
reboot procedures:

    fab -H idle-fsn-01.torproject.org ganeti.modify vcpus=4,memory=8g

If you get a cryptic failure (TODO: add sample output) about policy
being violated while you're *not* actually violating the stated
policy, it's possible this VM was *already* violating the policy and
the changes *you* proposed are okay.

In that case (and only in that case!) it's okay to bypass the policy
with `--ignore-ipolicy`. Otherwise, discuss this with a fellow
sysadmin, and see if that VM really needs that many resources or if
the policies need to be changed.
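
For example, a minimal sketch of bypassing the policy (reusing the
modify command above with the `--ignore-ipolicy` flag):

    gnt-instance modify --ignore-ipolicy -B vcpus=4,memory=8g test1.torproject.org
    gnt-instance reboot test1.torproject.org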

IP address changes require a full stop of the instance and manual
changes to the `/etc/network/interfaces*` files:

    gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
    gnt-instance stop test1.torproject.org

The renumbering can be done with [Fabric](howto/fabric), with:

    ./ganeti -H test1.torproject.org renumber-instance --ganeti-node $PRIMARY_NODE

Note that the `$PRIMARY_NODE` must be passed here, not the "master"!

Alternatively, it can be done by hand:

    gnt-instance start test1.torproject.org
    gnt-instance console test1.torproject.org

### Resizing disks

The [gnt-instance grow-disk](http://docs.ganeti.org/docs/ganeti/3.0/html/man-gnt-instance.html#grow-disk) command can be used to change the size
of the underlying device:

    gnt-instance grow-disk --absolute test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

The number `0` in this context indicates the first disk of the
instance.  The amount specified is the final disk size (because of the
`--absolute` flag). In the above example, the final disk size will be
16GB. To *add* space to the existing disk, remove the `--absolute`
flag:

    gnt-instance grow-disk test1.torproject.org 0 16g
    gnt-instance reboot test1.torproject.org

In the above example, 16GB will be **ADDED** to the disk. Be careful
with resizes, because it's not possible to revert such a change:
`grow-disk` does *not* support shrinking disks. The only way to revert
the change is by exporting / importing the instance.

Note that the reboot above will impose a downtime. See [upstream bug
28](https://github.com/ganeti/ganeti/issues/28) about improving that.

Then the filesystem needs to be resized inside the VM.

#### Resizing under LVM

    ssh root@test1.torproject.org 

Use `pvs` to display information about the physical volumes:

    root@cupani:~# pvs
    PV         VG        Fmt  Attr PSize   PFree   
    /dev/sdc   vg_test   lvm2 a--  <8.00g  1020.00m

Resize the physical volume to take up the new space:

    pvresize /dev/sdc

Use `lvs` to display information about logical volumes:

    # lvs
    LV          VG              Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
    var-opt     vg_test-01      -wi-ao---- <10.00g
    test-backup vg_test-01_hdd  -wi-ao---- <20.00g

Use `lvextend` to add space to the volume:

    lvextend -l '+100%FREE' vg_test-01/var-opt

Finally resize the filesystem:

    resize2fs /dev/vg_test-01/var-opt

See also the [LVM howto](howto/lvm), particularly if the `lvextend`
step fails with:

```
  Unable to resize logical volumes of cache type.
```

#### Resizing without LVM, no partitions

If there's no LVM inside the VM (a more common configuration
nowadays), the above procedure will obviously not work. If this is a
secondary disk (e.g. `/dev/sdc`), there is a good chance the filesystem
was created directly on the device and that you do not need to
repartition the drive. This is an example of a good configuration if we
want to resize `sdc`:

```
root@bacula-director-01:~# lsblk 
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
fd0      2:0    1    4K  0 disk 
sda      8:0    0   10G  0 disk 
└─sda1   8:1    0   10G  0 part /
sdb      8:16   0    2G  0 disk [SWAP]
sdc      8:32   0  250G  0 disk /srv
```

Note that if we needed to resize `sda` instead, we'd have to follow the
other procedure, described in the next section.

If we check the free disk space on the device we will notice it has
not changed yet:

```
# df -h /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc        196G  160G   27G  86% /srv
```

The resize is then simply:

```
# resize2fs /dev/sdc
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sdc is mounted on /srv; on-line resizing required
old_desc_blocks = 25, new_desc_blocks = 32
The filesystem on /dev/sdc is now 65536000 (4k) blocks long.
```

Note that for XFS filesystems, the above command is simply:

```
xfs_growfs /dev/sdc
```

Read on for the most complicated scenario.

#### Resizing without LVM, with partitions

If the filesystem to resize is not *directly* on the device, you will
need to resize the partition manually, which can be done using
`sfdisk`. In the following example we have a `sda1` partition that we
want to extend from 20G to 40G to fill up the free space on
`/dev/sda`. Here is what the partition layout looks like before the
resize:

```
# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]
```

We use `sfdisk` to resize the partition to take up all available
space, in this case with the following magic:

    echo ", +" | sfdisk -N 1 --no-act /dev/sda

Note the `--no-act` here: the above is just a preview to make sure you
will do the right thing. To actually make the change, replace it with
`--no-reread`:

    echo ", +" | sfdisk -N 1 --no-reread /dev/sda

TODO: next time, test with `--force` instead of `--no-reread` to see
if we still need a reboot.

Here's a working example:

```
# echo ", +" | sfdisk -N 1 --no-reread /dev/sda
Disk /dev/sda: 40 GiB, 42949672960 bytes, 83886080 sectors
Disk model: QEMU HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Old situation:

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 41943039 41940992  20G 83 Linux

/dev/sda1: 
New situation:
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start      End  Sectors Size Id Type
/dev/sda1  *     2048 83886079 83884032  40G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Re-reading the partition table failed.: Device or resource busy
The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or kpartx(8).
Syncing disks.
```

Note that the partition table wasn't updated:

```
# lsblk
NAME    MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0       2:0    1   4K  0 disk 
sda       8:0    0  40G  0 disk 
└─sda1    8:1    0  20G  0 part /
sdb       8:16   0   4G  0 disk [SWAP]
```

So we need to reboot:

    reboot

Note: a previous version of this guide was using `fdisk` instead, but
that guide was destroying and recreating the partition, which seemed
too error-prone. The above procedure is more annoying (because of the
reboot) but should be less dangerous.

Now we check the partitions again:

```
# lsblk
NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
fd0      2:0    1   4K  0 disk 
sda      8:0    0  40G  0 disk 
└─sda1   8:1    0  40G  0 part /
sdb      8:16   0   4G  0 disk [SWAP]
```

If we check the free space on the device, we will notice it has not
changed yet:
```
# df -h  /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   16G  2.8G  86% /
```

We need to resize it:

```
# resize2fs /dev/sda1
resize2fs 1.44.5 (15-Dec-2018)
Filesystem at /dev/sda1 is mounted on /; on-line resizing required
old_desc_blocks = 2, new_desc_blocks = 3
The filesystem on /dev/sda1 is now 10485504 (4k) blocks long.
```

The resize is now complete.
Hiro's avatar
Hiro committed

All the above procedures detail the normal use case where disks are
hosted with the "plain" (LVM) or DRBD backends. However, some
instances (most notably in the now defunct `gnt-chi` cluster) had their
storage backed by an iSCSI SAN.

Growing a disk hosted on a SAN like the Dell PowerVault MD3200i
involves several steps beginning with resizing the LUN itself. In the
example below, we're going to grow the disk associated with the
`tb-build-03` instance. 

> It should be noted that the instance was set up in a peculiar way: it
> has one LUN per partition, instead of one big LUN partitioned
> correctly. The instructions below therefore mention a LUN named
> `tb-build-03-srv`, but normally there should be a single LUN named
> after the hostname of the machine, in this case it should have been
> named simply `tb-build-03`.

First, we identify how much space is available on the virtual disks' diskGroup:

    # SMcli -n chi-san-01 -c "show allVirtualDisks summary;"

	STANDARD VIRTUAL DISKS SUMMARY
	Number of standard virtual disks: 5

	Name                Thin Provisioned     Status     Capacity     Accessible by       Source
	tb-build-03-srv     No                   Optimal    700.000 GB   Host Group gnt-chi  Disk Group 5

This shows that `tb-build-03-srv` is hosted on Disk Group "5":

    # SMcli -n chi-san-01 -c "show diskGroup [5];"

    DETAILS

       Name:              5

          Status:         Optimal
          Capacity:       1,852.026 GB
          Current owner:  RAID Controller Module in slot 1

          Data Service (DS) Attributes

             RAID level:                    5
             Physical Disk media type:      Physical Disk
             Physical Disk interface type:  Serial Attached SCSI (SAS)
             Enclosure loss protection:     No
             Secure Capable:                No
             Secure:                        No


          Total Virtual Disks:          1
             Standard virtual disks:    1
             Repository virtual disks:  0
             Free Capacity:             1,152.026 GB

          Associated physical disks - present (in piece order)
          Total physical disks present: 3

             Enclosure     Slot
             0             6
             1             11
             0             7

`Free Capacity` indicates about 1.5 TB of free space available. So we can go
ahead with the actual resize:

    # SMcli -n chi-san-01 -p $PASSWORD -c "set virtualdisk [\"tb-build-03-srv\"] addCapacity=100GB;"

Next, we need to make all nodes in the cluster rescan the iSCSI LUNs and have
`multipathd` resize the device node. This is accomplished by running this command
on the primary node (e.g. `chi-node-01`):

    # gnt-cluster command "iscsiadm -m node --rescan; multipathd -v3 -k\"resize map tb-build-srv\""

The success of this step can be validated by looking at the output of `lsblk`:
the device nodes associated with the LUN should now display the new size. The
output should be identical across the cluster nodes.
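
For example, to compare the sizes across all nodes at once (a sketch;
the multipath alias is assumed to match the disk name used above):

    gnt-cluster command "lsblk /dev/mapper/tb-build-03-srv"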

In order for ganeti/qemu to make this extra space available to the instance, a
reboot must be performed from outside the instance.
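
For example, from the cluster master (a sketch, assuming the instance
name used in this example):

    gnt-instance reboot tb-build-03.torproject.org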

Then the normal resize procedure can happen inside the virtual
machine, see [resizing under LVM](#resizing-under-lvm), [resizing without LVM, no
partitions](#resizing-without-lvm-no-partitions), or [Resizing without LVM, with partitions](#resizing-without-lvm-with-partitions),
depending on the situation.

### Removing an iSCSI LUN

Use this procedure to remove a virtual disk from one of the iSCSI SANs.

First, we'll need to gather some information about the disk to remove:

 * Which SAN is hosting the disk
 * What LUN is assigned to the disk
 * The WWID of both the SAN and the virtual disk

These can be found with:

    /usr/local/sbin/tpo-show-san-disks
    SMcli -n chi-san-03 -S -quick -c "show storageArray summary;" | grep "Storage array world-wide identifier"
    cat /etc/multipath/conf.d/test-01.conf

Second, remove the multipath config and reload:

    gnt-cluster command rm /etc/multipath/conf.d/test-01.conf
    gnt-cluster command "multipath -r ; multipath -w {disk-wwid} ; multipath -r"

Then, remove the iSCSI device nodes. Running `iscsiadm --rescan` does not remove
LUNs which have been deleted from the SAN.

Be very careful with this command: it will delete device nodes without
prejudice and cause data corruption if they are still in use!

    gnt-cluster command "find /dev/disk/by-path/ -name \*{san-wwid}-lun-{lun} -exec readlink {} \; | cut -d/ -f3 | while read -d $'\n' n; do echo 1 > /sys/block/\$n/device/delete; done"

Finally, the disk group can be deleted from the SAN (all the virtual disks it
contains will be deleted):

    SMcli -n chi-san-03 -p $SAN_PASSWORD -S -quick -c "delete diskGroup [<disk-group-number>];"

### Adding disks

A disk can be added to an instance with the `modify` command as
well. This, for example, will add a 100GB disk to the `test1` instance
on the `vg_ganeti_hdd` volume group, which is "slow" rotating disks:
    gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
    gnt-instance reboot test1.torproject.org

### Changing disk type

Say you have a test instance that was created with a `plain` disk
template, but you actually want it in production with a `drbd` disk
template. Switching to `drbd` is easy:

    gnt-instance shutdown test-01
    gnt-instance modify -t drbd test-01
    gnt-instance start test-01

The second command will use the allocator to find a secondary node. If
that fails, you can assign a node manually with `-n`.
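
For example (a sketch, with a hypothetical secondary node):

    gnt-instance modify -t drbd -n fsn-node-02.torproject.org test-01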

You can also switch back to `plain`, although you should generally
never do that.

See also the [upstream procedure](https://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#conversion-of-an-instance-s-disk-type) and [design document](https://docs.ganeti.org/docs/ganeti/3.0/html/design-disk-conversion.html).

### Removing or detaching a disk

If you need to destroy a volume from an instance, you can use the
`remove` flag to the `gnt-instance modify` command. First, you must
identify the disk's uuid using `gnt-instance info`, then:

    gnt-instance modify --disk <uuid>:remove test-01

But maybe you just want to detach the disk without destroying its
data; for this, use the `detach` keyword:

    gnt-instance modify --disk <uuid>:detach test-01

### Adding a network interface on the rfc1918 vlan

We have a vlan that some VMs without public addresses sit on.
Its vlan ID is 4002 and it's backed by the Hetzner vSwitch #11973 "fsn-gnt-rfc1918-traffic".
Note that traffic on this vlan will travel in the clear between nodes.

To add an instance to this vlan, give it a second network interface using:

    gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org

## Destroying an instance

This totally deletes the instance, including all mirrors and
everything; be very careful with it:

    gnt-instance remove test01.torproject.org

## Getting information

Information about an instance can be found in the rather verbose
`gnt-instance info`:

    root@fsn-node-01:~# gnt-instance info tb-build-02.torproject.org
    - Instance name: tb-build-02.torproject.org
      UUID: 8e9f3ca6-204f-4b6c-8e3e-6a8fda137c9b
      Serial number: 5
      Creation time: 2020-12-15 14:06:41
      Modification time: 2020-12-15 14:07:31
      State: configured to be up, actual state is up
      Nodes: 
        - primary: fsn-node-03.torproject.org
          group: default (UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
        - secondaries: fsn-node-04.torproject.org (group default, group UUID 8c32fd09-dc4c-4237-9dd2-3da3dfd3189e)
      Operating system: debootstrap+buster

A quick command which shows the primary and secondary nodes for a
given instance:

    gnt-instance info tb-build-02.torproject.org | grep -A 3 Nodes

An equivalent command will show the primary and secondary for *all*
instances, along with extra information (like the CPU count, memory and
disk usage):

    gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort

It can be useful to run this in a loop to see changes:

    watch -n5 -d 'gnt-instance list -o pnode,snodes,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort'

## Disk operations (DRBD)

Instances should be set up using the DRBD backend, in which case you
should probably take a look at [howto/drbd](howto/drbd) if you have problems with
that. Ganeti handles most of the logic there so that should generally
not be necessary.

### Identifying volumes of an instance

As noted above, Ganeti handles most of the complexity around managing DRBD and
LVM volumes. Sometimes, though, it might be interesting to know which volume is
associated with which instance, especially for confirming an operation before
deleting a stray device.

Ganeti keeps that information handy. On the cluster master you can extract
information about all volumes on all nodes:

    gnt-node volumes

If you're already connected to one node, you can check which LVM volumes
correspond to which instance:

    lvs -o+tags

## Evaluating cluster capacity

This will list instances repeatedly, but also show their assigned
memory, and compare it with the node's capacity:

    gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort &&
    echo &&
    gnt-node list

The latter does not show disk usage for secondary volume groups (see
[upstream issue 1379](https://github.com/ganeti/ganeti/issues/1379)); for a complete picture of disk usage, use:

    gnt-node list-storage

The [gnt-cluster verify](http://docs.ganeti.org/docs/ganeti/3.0/html/man-gnt-cluster.html#verify) command will also check to see if there's
enough space on secondaries to account for the failure of a
node. Healthy output looks like this:

    root@fsn-node-01:~# gnt-cluster verify
    Submitted jobs 48030, 48031
    Waiting for job 48030 ...
    Fri Jan 17 20:05:42 2020 * Verifying cluster config
    Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
    Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
    Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
    Waiting for job 48031 ...
    Fri Jan 17 20:05:42 2020 * Verifying group 'default'
    Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
    Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
    Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
    Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
    Fri Jan 17 20:05:45 2020 * Verifying node status
    Fri Jan 17 20:05:45 2020 * Verifying instance status
    Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
    Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
    Fri Jan 17 20:05:45 2020 * Other Notes
    Fri Jan 17 20:05:45 2020 * Hooks Results

A sick node would have said something like this instead:

    Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
    Mon Oct 26 18:59:37 2009   - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail

See the [ganeti manual](http://docs.ganeti.org/docs/ganeti/3.0/html/walkthrough.html#n-1-errors) for a more extensive example.

Also note the `hspace -L` command, which can tell you how many
instances can be created in a given cluster. It uses the "standard"
instance template defined in the cluster (which we haven't configured
yet).
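
A minimal invocation, on the cluster master (a sketch; the resulting
estimate depends on the cluster's instance policy):

    hspace -L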

## Moving instances and failover

Ganeti is smart about assigning instances to nodes. There's also a
command (`hbal`) to automatically rebalance the cluster (see
below). If for some reason `hbal` doesn’t do what you want or you need
to move things around for other reasons, here are a few commands that
might be handy.

Make an instance switch to using its secondary:

    gnt-instance migrate test1.torproject.org

Make all instances on a node switch to their secondaries:

    gnt-node migrate fsn-node-01.torproject.org

The `migrate` commands do a "live" migration which should avoid any
downtime. It might be preferable to actually shut down the machine for
some reason (for example if we actually want to reboot because of a
security upgrade). Or we might not be able to live-migrate because the
node is down. In this case, we do a
[failover](http://docs.ganeti.org/docs/ganeti/3.0/html/admin.html#failing-over-an-instance):

    gnt-instance failover test1.torproject.org

The [gnt-node evacuate](http://docs.ganeti.org/docs/ganeti/3.0/html/man-gnt-node.html#evacuate) command can also be used to "empty" a given
node altogether, in case of an emergency:

    gnt-node evacuate -I . fsn-node-02.torproject.org

Similarly, the [gnt-node failover](http://docs.ganeti.org/docs/ganeti/3.0/html/man-gnt-node.html#failover) command can be used to
hard-recover from a completely crashed node:

    gnt-node failover fsn-node-02.torproject.org

Note that you might need the `--ignore-consistency` flag if the
node is unresponsive.

## Importing external libvirt instances

Assumptions:

 * `INSTANCE`: name of the instance being migrated, the "old" one
   being outside the cluster and the "new" one being the one created
   inside the cluster (e.g. `chiwui.torproject.org`)
 * `SPARE_NODE`: a ganeti node with free space
   (e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
   migrated
 * `MASTER_NODE`: the master ganeti node
   (e.g. `fsn-node-01.torproject.org`)
 * `KVM_HOST`: the machine which we migrate the `INSTANCE` from
 * the `INSTANCE` has only `root` and `swap` partitions
 * the `SPARE_NODE` has space in `/srv/` to host all the virtual
   machines to import, to check, use:

        fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' | awk '{s+=$1} END {print s}'

   You will very likely need to create a `/srv` big enough for this,
   for example:

        lvcreate -L 300G vg_ganeti -n srv-tmp &&
        mkfs /dev/vg_ganeti/srv-tmp &&
        mount /dev/vg_ganeti/srv-tmp /srv

Import procedure:

 1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating
    cluster capacity" above, when in doubt) and find on which KVM HOST
    the INSTANCE lives

 2. copy the disks, without downtime:
 
        ./ganeti -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST

 3. copy the disks again, this time suspending the machine:

        ./ganeti -H $INSTANCE libvirt-import  --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt

 4. renumber the host:

        ./ganeti -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE

 5. test services by changing your `/etc/hosts`, possibly warning
    service admins:

    > Subject: $INSTANCE IP address change planned for Ganeti migration
    >
    > I will soon migrate this virtual machine to the new ganeti cluster. this
    > will involve an IP address change which might affect the service.
    >
    > Please let me know if there are any problems you can think of. in
    > particular, do let me know if any internal (inside the server) or external
    > (outside the server) services hardcodes the IP address of the virtual
    > machine.
    >
    > A test instance has been setup. You can test the service by
    > adding the following to your /etc/hosts:
    >
    >     116.202.120.182 $INSTANCE
    >     2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE

 6. destroy test instance:

        gnt-instance remove $INSTANCE