|
|
[Ganeti](http://ganeti.org/) is software designed to facilitate the management of
|
|
|
virtual machines (KVM or Xen). It helps you move virtual machine
|
|
|
instances from one node to another, create an instance with DRBD
|
|
|
replication on another node and do the live migration from one to
|
|
|
another, etc.
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
# Tutorial
|
|
|
|
|
|
## Listing virtual machines (instances)
|
|
|
|
|
|
This will show the running guests, known as "instances":
|
|
|
|
|
|
gnt-instance list
|
|
|
|
|
|
## Accessing serial console
|
|
|
|
|
|
Our instances do serial console, starting in grub. To access it, run
|
|
|
|
|
|
gnt-instance console test01.torproject.org
|
|
|
|
|
|
To exit, use `^]` -- that is, Control-<Closing Bracket>.
|
|
|
|
|
|
# How-to
|
|
|
|
|
|
## Glossary
|
|
|
|
|
|
In Ganeti, a physical machine is called a *node* and a virtual machine
|
|
|
is an *instance*. A node is elected to be the *master* where all
|
|
|
commands should be ran from. Nodes are interconnected through a
|
|
|
private network that is used to communicate commands and synchronise
|
|
|
disks (with [howto/drbd](howto/drbd)). Instances are normally assigned two nodes: a
|
|
|
*primary* and a *secondary*: the *primary* is where the virtual
|
|
|
machine actually runs and th *secondary* acts as a hot failover.
|
|
|
|
|
|
See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html).
|
|
|
|
|
|
## Adding a new instance
|
|
|
|
|
|
WARNING: this creates a machine with usrmerge! See [bug 34115](https://bugs.torproject.org/34115)
|
|
|
before proceeding.
|
|
|
|
|
|
This command creates a new guest, or "instance" in Ganeti's
|
|
|
vocabulary:
|
|
|
|
|
|
gnt-instance add \
|
|
|
-o debootstrap+buster \
|
|
|
-t drbd --no-wait-for-sync \
|
|
|
--net 0:ip=pool,network=gnt-fsn \
|
|
|
--no-ip-check \
|
|
|
--no-name-check \
|
|
|
--disk 0:size=10G \
|
|
|
--disk 1:size=2G,name=swap \
|
|
|
--disk 2:size=20G \
|
|
|
--disk 3:size=800G,vg=vg_ganeti_hdd \
|
|
|
--backend-parameters memory=8g,vcpus=2 \
|
|
|
test-01.torproject.org
|
|
|
|
|
|
WARNING: there is a bug in `ganeti-instance-debootstrap` which
|
|
|
misconfigures `ping` (among other things), see [bug #31781](https://bugs.torproject.org/31781). It's
|
|
|
currently patched in our version of the Debian package, but that patch
|
|
|
might disappear if Debian upgrade the package without [shipping our
|
|
|
patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538).
|
|
|
|
|
|
This configures the following:
|
|
|
|
|
|
* redundant disks in a DRBD mirror, use `-t plain` instead of `-t drbd` for
|
|
|
tests as that avoids syncing of disks and will speed things up considerably
|
|
|
(even with `--no-wait-for-sync` there are some operations that block on
|
|
|
synced mirrors). Only one node should be provided as the argument for
|
|
|
`--node` then.
|
|
|
* three partitions: one on the default VG (SSD), one on another (HDD)
|
|
|
and a swap file on the default VG, if you don't specify a swap device,
|
|
|
a 512MB swapfile is created in `/swapfile`. TODO: configure disk 2
|
|
|
and 3 automatically in installer. (`/var` and `/srv`?)
|
|
|
* 8GB of RAM with 2 virtual CPUs
|
|
|
* an IP allocated from the public gnt-fsn pool:
|
|
|
`gnt-instance add` will print the IPv4 address it picked to stdout. The
|
|
|
IPv6 address can be found in `/var/log/ganeti/os/` on the primary node
|
|
|
of the instance, see below.
|
|
|
* with the `test-01.torproject.org` hostname
|
|
|
|
|
|
To find the root password, ssh host key fingerprints, and the IPv6
|
|
|
address, run this **on the node where the instance was created**, for
|
|
|
example:
|
|
|
|
|
|
egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)
|
|
|
|
|
|
Note that you need to use the `--node` parameter to pick on which
|
|
|
machines you want the machine to end up, otherwise Ganeti will choose
|
|
|
for you. Use, for example, `--node fsn-node-01:fsn-node-02` to use
|
|
|
`node-01` as primary and `node-02` as secondary. It might be better to
|
|
|
let the Ganeti allocator do its job since it will, eventually do this
|
|
|
during cluster rebalancing. The allocator can sometimes fail if the
|
|
|
allocator is upset about something in the cluster, for example:
|
|
|
|
|
|
Can's find primary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2
|
|
|
|
|
|
This situation is covered by [ticket 33785](https://bugs.torproject.org/33785).
|
|
|
|
|
|
We copy root's authorized keys into the new instance, so you should be able to
|
|
|
log in with your token. You will be required to change the root password immediately.
|
|
|
Pick something nice and document it in `tor-passwords`.
|
|
|
|
|
|
Also set reverse DNS for both IPv4 and IPv6 in [hetzner's robot](https://robot.your-server.de/).
|
|
|
(Chek under servers -> vSwitch -> IPs)
|
|
|
|
|
|
Then follow [howto/new-machine](howto/new-machine).
|
|
|
|
|
|
## Modifying an instance
|
|
|
|
|
|
### CPU, memory changes
|
|
|
|
|
|
It's possible to change the IP, CPU, or memory allocation of an instance
|
|
|
using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command:
|
|
|
|
|
|
gnt-instance modify -B vcpus=2 test1.torproject.org
|
|
|
gnt-instance modify -B memory=4g test1.torproject.org
|
|
|
gnt-instance reboot test1.torproject.org
|
|
|
|
|
|
### IP address change
|
|
|
|
|
|
IP address changes require a full stop and will require manual changes
|
|
|
to the `/etc/network/interfaces*` files:
|
|
|
|
|
|
gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
|
|
|
gnt-instance stop test1.torproject.org
|
|
|
gnt-instance start test1.torproject.org
|
|
|
gnt-instance console test1.torproject.org
|
|
|
|
|
|
### Resizing disks
|
|
|
|
|
|
The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size
|
|
|
of the underlying device:
|
|
|
|
|
|
gnt-instance grow-disk test1.torproject.org 0 16g
|
|
|
gnt-instance reboot test1.torproject.org
|
|
|
|
|
|
The number `0` in this context, indicates the first disk of the
|
|
|
instance. Then the filesystem needs to be resized inside the VM:
|
|
|
|
|
|
ssh root@test1.torproject.org resize2fs /dev/sda1
|
|
|
|
|
|
### Adding disks
|
|
|
|
|
|
A disk can be added to an instance with the `modify` command as
|
|
|
well. This, for example, will add a 100GB disk to the `test1` instance
|
|
|
on teh `vg_ganeti_hdd` volume group, which is "slow" rotating disks:
|
|
|
|
|
|
gnt-instance modify --disk add:size=100g,vg=vg_ganeti_hdd test1.torproject.org
|
|
|
gnt-instance reboot test1.torproject.org
|
|
|
|
|
|
### Adding a network interface on the rfc1918 vlan
|
|
|
|
|
|
We have a vlan that some VMs that do not have public addresses sit on.
|
|
|
Its vlanid is 4002 and its backed by Hetzner vswitch vSwitch #11973 "fsn-gnt-rfc1918-traffic".
|
|
|
Note that traffic on this vlan will travel in the clear between nodes.
|
|
|
|
|
|
To add an instance to this vlan, give it a second network interface using
|
|
|
|
|
|
gnt-instance modify --net add:link=br0,vlan=4002,mode=openvswitch test1.torproject.org
|
|
|
|
|
|
## Destroying an instance
|
|
|
|
|
|
This totally deletes the instance, including all mirrors and
|
|
|
everything, be very careful with it:
|
|
|
|
|
|
gnt-instance remove test01.torproject.org
|
|
|
|
|
|
## Disk operations (DRBD)
|
|
|
|
|
|
Instances should be setup using the DRBD backend, in which case you
|
|
|
should probably take a look at [howto/drbd](howto/drbd) if you have problems with
|
|
|
that. Ganeti handles most of the logic there so that should generally
|
|
|
not be necessary.
|
|
|
|
|
|
## Evaluating cluster capacity
|
|
|
|
|
|
This will list instances repeatedly, but also show their assigned
|
|
|
memory, and compare it with the node's capacity:
|
|
|
|
|
|
watch -n5 -d 'gnt-instance list -o pnode,name,be/vcpus,be/memory,disk_usage,disk_template,status | sort; echo; gnt-node list'
|
|
|
|
|
|
The latter does not show disk usage for secondary volume groups, for a
|
|
|
complete picture of disk usage, use:
|
|
|
|
|
|
gnt-node list-storage
|
|
|
|
|
|
The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's
|
|
|
enough space on secondaries to account for the failure of a
|
|
|
node. Healthy output looks like this:
|
|
|
|
|
|
root@fsn-node-01:~# gnt-cluster verify
|
|
|
Submitted jobs 48030, 48031
|
|
|
Waiting for job 48030 ...
|
|
|
Fri Jan 17 20:05:42 2020 * Verifying cluster config
|
|
|
Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
|
|
|
Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
|
|
|
Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
|
|
|
Waiting for job 48031 ...
|
|
|
Fri Jan 17 20:05:42 2020 * Verifying group 'default'
|
|
|
Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
|
|
|
Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
|
|
|
Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
|
|
|
Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
|
|
|
Fri Jan 17 20:05:45 2020 * Verifying node status
|
|
|
Fri Jan 17 20:05:45 2020 * Verifying instance status
|
|
|
Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
|
|
|
Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
|
|
|
Fri Jan 17 20:05:45 2020 * Other Notes
|
|
|
Fri Jan 17 20:05:45 2020 * Hooks Results
|
|
|
|
|
|
A sick node would have said something like this instead:
|
|
|
|
|
|
Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
|
|
|
Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail
|
|
|
|
|
|
See the [ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example
|
|
|
|
|
|
Also note the `hspace -L` command, which can tell you how many
|
|
|
instances can be created in a given cluster. It uses the "standard"
|
|
|
instance template defined in the cluster (which we haven't configured
|
|
|
yet).
|
|
|
|
|
|
## Moving instances and failover
|
|
|
|
|
|
Ganeti is smart about assigning instances to nodes. There's also a
|
|
|
command (`hbal`) to automatically rebalance the cluster (see
|
|
|
below). If for some reason hbal doesn’t do what you want or you need
|
|
|
to move things around for other reasons, here are a few commands that
|
|
|
might be handy.
|
|
|
|
|
|
Make an instance switch to using it's secondary:
|
|
|
|
|
|
gnt-instance migrate test1.torproject.org
|
|
|
|
|
|
Make all instances on a node switch to their secondaries:
|
|
|
|
|
|
gnt-node migrate test1.torproject.org
|
|
|
|
|
|
The `migrate` commands does a "live" migrate which should avoid any
|
|
|
downtime during the migration. It might be preferable to actually
|
|
|
shutdown the machine for some reason (for example if we actually want
|
|
|
to reboot because of a security upgrade). Or we might not be able to
|
|
|
live-migrate because the node is down. In this case, we do a
|
|
|
[failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance)
|
|
|
|
|
|
gnt-instance failover test1.torproject.org
|
|
|
|
|
|
The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given
|
|
|
node altogether, in case of an emergency:
|
|
|
|
|
|
gnt-node evacuate -I . fsn-node-02.torproject.org
|
|
|
|
|
|
Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to
|
|
|
hard-recover from a completely crashed node:
|
|
|
|
|
|
gnt-node failover fsn-node-02.torproject.org
|
|
|
|
|
|
Note that you might need the `--ignore-consistency` flag if the
|
|
|
node is unresponsive.
|
|
|
|
|
|
## Importing external instances
|
|
|
|
|
|
Assumptions:
|
|
|
|
|
|
* `INSTANCE`: name of the instance being migrated, the "old" one
|
|
|
being outside the cluster and the "new" one being the one created
|
|
|
inside the cluster (e.g. `chiwui.torproject.org`)
|
|
|
* `SPARE_NODE`: a ganeti node with free space
|
|
|
(e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
|
|
|
migrated
|
|
|
* `MASTER_NODE`: the master ganeti node
|
|
|
(e.g. `fsn-node-01.torproject.org`)
|
|
|
* `KVM_HOST`: the machine which we migrate the `INSTANCE` from
|
|
|
* the `INSTANCE` has only `root` and `swap` partitions
|
|
|
* the `SPARE_NODE` has space in `/srv/` to host all the virtual
|
|
|
machines to import, to check, use:
|
|
|
|
|
|
fab -H crm-ext-01.torproject.org,crm-int-01.torproject.org,forrestii.torproject.org,nevii.torproject.org,rude.torproject.org,troodi.torproject.org,vineale.torproject.org libvirt.du -p kvm3.torproject.org | sed '/-swap$/d;s/ .*$//' <f | awk '{s+=$1} END {print s}'
|
|
|
|
|
|
You will very likely need to create a `/srv` big enough for this,
|
|
|
for example:
|
|
|
|
|
|
lvcreate -L 300G vg_ganeti -n srv-tmp &&
|
|
|
mkfs /dev/vg_ganeti/srv-tmp &&
|
|
|
mount /dev/vg_ganeti/srv-tmp /srv
|
|
|
|
|
|
Import procedure:
|
|
|
|
|
|
1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating
|
|
|
cluster capacity" above, when in doubt) and find on which KVM HOST
|
|
|
the INSTANCE lives
|
|
|
|
|
|
2. copy the disks, without downtime:
|
|
|
|
|
|
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST
|
|
|
|
|
|
3. copy the disks again, this time suspending the machine:
|
|
|
|
|
|
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt
|
|
|
|
|
|
4. renumber the host:
|
|
|
|
|
|
./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
|
|
|
|
|
|
5. test services by changing your `/etc/hosts`, possibly warning
|
|
|
service admins:
|
|
|
|
|
|
> Subject: $INSTANCE IP address change planned for Ganeti migration
|
|
|
>
|
|
|
> I will soon migrate this virtual machine to the new ganeti cluster. this
|
|
|
> will involve an IP address change which might affect the service.
|
|
|
>
|
|
|
> Please let me know if there are any problems you can think of. in
|
|
|
> particular, do let me know if any internal (inside the server) or external
|
|
|
> (outside the server) services hardcodes the IP address of the virtual
|
|
|
> machine.
|
|
|
>
|
|
|
> A test instance has been setup. You can test the service by
|
|
|
> adding the following to your /etc/hosts:
|
|
|
>
|
|
|
> 116.202.120.182 $INSTANCE
|
|
|
> 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2 $INSTANCE
|
|
|
|
|
|
6. destroy test instance:
|
|
|
|
|
|
gnt-instance remove $INSTANCE
|
|
|
|
|
|
7. lower TTLs to 5 minutes. this procedure varies a lot according to
|
|
|
the service, but generally if all DNS entries are `CNAME`s
|
|
|
pointing to the main machine domain name, the TTL can be lowered
|
|
|
by adding a `dnsTTL` entry in the LDAP entry for this host. For
|
|
|
example, this sets the TTL to 5 minutes:
|
|
|
|
|
|
dnsTTL: 300
|
|
|
|
|
|
Then to make the changes immediate, you need the following
|
|
|
commands:
|
|
|
|
|
|
ssh root@alberti.torproject.org sudo -u sshdist ud-generate &&
|
|
|
ssh root@nevii.torproject.org ud-replicate
|
|
|
|
|
|
Warning: if you migrate one of the hosts ud-ldap depends on, this
|
|
|
can fail and not only the TTL will not update, but it might also
|
|
|
fail to update the IP address in the below procedure. See [ticket
|
|
|
33766](https://bugs.torproject.org/33766) for
|
|
|
details.
|
|
|
|
|
|
8. shutdown original instance and redo migration as in step 3 and 4:
|
|
|
|
|
|
fab -H $INSTANCE reboot.halt-and-wait --delay-shutdown 60 --reason='migrating to new server' &&
|
|
|
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt &&
|
|
|
./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
|
|
|
|
|
|
9. final test procedure
|
|
|
|
|
|
TODO: establish host-level test procedure and run it here.
|
|
|
|
|
|
10. switch to DRBD, still on the Ganeti MASTER NODE:
|
|
|
|
|
|
gnt-instance stop $INSTANCE &&
|
|
|
gnt-instance modify -t drbd $INSTANCE &&
|
|
|
gnt-instance failover -f $INSTANCE &&
|
|
|
gnt-instance start $INSTANCE
|
|
|
|
|
|
The above can sometimes fail if the allocator is upset about
|
|
|
something in the cluster, for example:
|
|
|
|
|
|
Can's find secondary node using iallocator hail: Request failed: No valid allocation solutions, failure reasons: FailMem: 2, FailN1: 2
|
|
|
|
|
|
This situation is covered by [ticket 33785](https://bugs.torproject.org/33785). To work around the
|
|
|
allocator, you can specify a secondary node directly:
|
|
|
|
|
|
gnt-instance modify -t drbd -n fsn-node-04.torproject.org $INSTANCE &&
|
|
|
gnt-instance failover -f $INSTANCE &&
|
|
|
gnt-instance start $INSTANCE
|
|
|
|
|
|
TODO: move into fabric, maybe in a `libvirt-import-live` or
|
|
|
`post-libvirt-import` job that would also do the renumbering below
|
|
|
|
|
|
11. change IP address in the following locations:
|
|
|
|
|
|
* LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!). Also drop the dnsTTL attribute while you're at it.
|
|
|
* Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
|
|
|
* DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
|
|
|
* nagios (don't forget to change the parent)
|
|
|
* reverse DNS (upstream web UI, e.g. Hetzner Robot)
|
|
|
* grep for the host's IP address on itself:
|
|
|
|
|
|
grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
|
|
|
grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv
|
|
|
|
|
|
* grep for the host's IP on *all* hosts:
|
|
|
|
|
|
cumin-all-puppet
|
|
|
cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'
|
|
|
|
|
|
TODO: move those jobs into fabric
|
|
|
|
|
|
12. retire old instance (only a tiny part of [howto/retire-a-host](howto/retire-a-host)):
|
|
|
|
|
|
./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST
|
|
|
|
|
|
12. update the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) to change where the machine
|
|
|
lives
|
|
|
|
|
|
13. warn users about the migration, for example:
|
|
|
|
|
|
> To: tor-project@lists.torproject.org
|
|
|
> Subject: cupani AKA git-rw IP address changed
|
|
|
>
|
|
|
> The main git server, cupani, is the machine you connect to when you push
|
|
|
> or pull git repositories over ssh to git-rw.torproject.org. That
|
|
|
> machines has been migrated to the new Ganeti cluster.
|
|
|
>
|
|
|
> This required an IP address change from:
|
|
|
>
|
|
|
> 78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
|
|
|
>
|
|
|
> to:
|
|
|
>
|
|
|
> 116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2
|
|
|
>
|
|
|
> DNS has been updated and preliminary tests show that everything is
|
|
|
> mostly working. You *will* get a warning about the IP address change
|
|
|
> when connecting over SSH, which will go away after the first
|
|
|
> connection.
|
|
|
>
|
|
|
> Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.
|
|
|
>
|
|
|
> That is normal. The SSH fingerprints of the host did *not* change.
|
|
|
>
|
|
|
> Please do report any other anomaly using the normal channels:
|
|
|
>
|
|
|
> https://gitlab.torproject.org/anarcat/wikitest/-/wikis/doc/how-to-get-help/
|
|
|
>
|
|
|
> The service was unavailable for about an hour during the migration.
|
|
|
|
|
|
## Importing external instances, manual
|
|
|
|
|
|
This procedure is now easier to accomplish with the Fabric tools
|
|
|
written especially for this purpose. Use the above procedure
|
|
|
instead. This is kept for historical reference.
|
|
|
|
|
|
Assumptions:
|
|
|
|
|
|
* `INSTANCE`: name of the instance being migrated, the "old" one
|
|
|
being outside the cluster and the "new" one being the one created
|
|
|
inside the cluster (e.g. `chiwui.torproject.org`)
|
|
|
* `SPARE_NODE`: a ganeti node with free space
|
|
|
(e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
|
|
|
migrated
|
|
|
* `MASTER_NODE`: the master ganeti node
|
|
|
(e.g. `fsn-node-01.torproject.org`)
|
|
|
* `KVM_HOST`: the machine which we migrate the `INSTANCE` from
|
|
|
* the `INSTANCE` has only `root` and `swap` partitions
|
|
|
|
|
|
Import procedure:
|
|
|
|
|
|
1. pick a viable SPARE NODE to import the instance (see "evaluating
|
|
|
cluster capacity" above, when in doubt), login to the three
|
|
|
servers, setting the proper environment everywhere, for example:
|
|
|
|
|
|
MASTER_NODE=fsn-node-01.torproject.org
|
|
|
SPARE_NODE=fsn-node-03.torproject.org
|
|
|
KVM_HOST=kvm1.torproject.org
|
|
|
INSTANCE=test.torproject.org
|
|
|
|
|
|
2. establish VM specs, on the KVM HOST:
|
|
|
|
|
|
* disk space in GiB:
|
|
|
|
|
|
for disk in /srv/vmstore/$INSTANCE/*; do
|
|
|
printf "$disk: "
|
|
|
echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
|
|
|
done
|
|
|
|
|
|
* number of CPU cores:
|
|
|
|
|
|
sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml
|
|
|
|
|
|
* memory, assuming from KiB to GiB:
|
|
|
|
|
|
echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l
|
|
|
|
|
|
TODO: make sure the memory line is in KiB and that the number
|
|
|
makes sense.
|
|
|
|
|
|
* on the INSTANCE, find the swap device UUID so we can recreate it later:
|
|
|
|
|
|
blkid -t TYPE=swap -s UUID -o value
|
|
|
|
|
|
3. setup a copy channel, on the SPARE NODE:
|
|
|
|
|
|
ssh-agent bash
|
|
|
ssh-add /etc/ssh/ssh_host_ed25519_key
|
|
|
cat /etc/ssh/ssh_host_ed25519_key.pub
|
|
|
|
|
|
on the KVM HOST:
|
|
|
|
|
|
echo "$KEY_FROM_SPARE_NODE" >> /etc/ssh/userkeys/root
|
|
|
|
|
|
4. copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE:
|
|
|
|
|
|
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
|
|
|
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true
|
|
|
|
|
|
Note: it's possible there is not enough room in `/srv`: in the
|
|
|
base Ganeti installs, everything is in the same root partition
|
|
|
(`/`) which will fill up if the instance is (say) over ~30GiB. In
|
|
|
that case, create a filesystem in `/srv`:
|
|
|
|
|
|
(mkdir /root/srv && mv /srv/* /root/srv true) || true &&
|
|
|
lvcreate -L 200G vg_ganeti -n srv &&
|
|
|
mkfs /dev/vg_ganeti/srv &&
|
|
|
echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
|
|
|
mount /srv &&
|
|
|
( mv /root/srv/* ; rmdir /root/srv )
|
|
|
|
|
|
This partition can be reclaimed once the VM migrations are
|
|
|
completed, as it needlessly takes up space on the node.
|
|
|
|
|
|
5. on the SPARE NODE, create and initialize a logical volume with the predetermined size:
|
|
|
|
|
|
lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
|
|
|
mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
|
|
|
lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
|
|
|
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root
|
|
|
lvcreate -L 40GiB -n $INSTANCE-lvm vg_ganeti_hdd
|
|
|
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
|
|
|
|
|
|
Note how we assume two disks above, but the instance might have a
|
|
|
different configuration that would require changing the above. The
|
|
|
above, common, configuration is to have an LVM disk separate from
|
|
|
the "root" disk, the former being on a HDD, but the HDD is
|
|
|
sometimes completely omitted and sizes can differ.
|
|
|
|
|
|
Sometimes it might be worth using pv to get progress on long
|
|
|
transfers:
|
|
|
|
|
|
qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
|
|
|
pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k
|
|
|
|
|
|
TODO: ideally, the above procedure (and many steps below as well)
|
|
|
would be automatically deduced from the disk listing established
|
|
|
in the first step.
|
|
|
|
|
|
6. on the MASTER NODE, create the instance, adopting the LV:
|
|
|
|
|
|
gnt-instance add -t plain \
|
|
|
-n fsn-node-03 \
|
|
|
--disk 0:adopt=$INSTANCE-root \
|
|
|
--disk 1:adopt=$INSTANCE-swap \
|
|
|
--disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
|
|
|
--backend-parameters memory=2g,vcpus=2 \
|
|
|
--net 0:ip=pool,network=gnt-fsn \
|
|
|
--no-name-check \
|
|
|
--no-ip-check \
|
|
|
-o debootstrap+default \
|
|
|
$INSTANCE
|
|
|
|
|
|
7. cross your fingers and watch the party:
|
|
|
|
|
|
gnt-instance console $INSTANCE
|
|
|
|
|
|
9. IP address change on new instance:
|
|
|
|
|
|
edit `/etc/hosts` and `/etc/network/interfaces` by hand and add
|
|
|
IPv4 and IPv6 ip. IPv4 configuration can be found in:
|
|
|
|
|
|
gnt-instance show $INSTANCE
|
|
|
|
|
|
Latter can be guessed by concatenating `2a01:4f8:fff0:4f::` and
|
|
|
the IPv6 local local address without `fe80::`. For example: a
|
|
|
link local address of `fe80::266:37ff:fe65:870f/64` should yield
|
|
|
the following configuration:
|
|
|
|
|
|
iface eth0 inet6 static
|
|
|
accept_ra 0
|
|
|
address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
|
|
|
gateway 2a01:4f8:fff0:4f::1
|
|
|
|
|
|
TODO: reuse `gnt-debian-interfaces` from the ganeti puppet
|
|
|
module script here?
|
|
|
|
|
|
10. functional tests: change your `/etc/hosts` to point to the new
|
|
|
server and see if everything still kind of works
|
|
|
|
|
|
11. shutdown original instance
|
|
|
|
|
|
12. resync and reconvert image, on the Ganeti MASTER NODE:
|
|
|
|
|
|
gnt-instance stop $INSTANCE
|
|
|
|
|
|
on the Ganeti node:
|
|
|
|
|
|
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
|
|
|
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root &&
|
|
|
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
|
|
|
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
|
|
|
|
|
|
13. switch to DRBD, still on the Ganeti MASTER NODE:
|
|
|
|
|
|
gnt-instance modify -t drbd $INSTANCE
|
|
|
gnt-instance failover $INSTANCE
|
|
|
gnt-instance startup $INSTANCE
|
|
|
|
|
|
14. redo IP adress change in `/etc/network/interfaces` and `/etc/hosts`
|
|
|
|
|
|
15. final functional test
|
|
|
|
|
|
16. change IP address in the following locations:
|
|
|
|
|
|
* nagios (don't forget to change the parent)
|
|
|
* LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
|
|
|
* Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
|
|
|
* DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
|
|
|
* reverse DNS (upstream web UI, e.g. Hetzner Robot)
|
|
|
|
|
|
17. decomission old instance ([howto/retire-a-host](howto/retire-a-host))
|
|
|
|
|
|
### Troubleshooting
|
|
|
|
|
|
* if boot takes a long time and you see a message like this on the console:
|
|
|
|
|
|
[ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)
|
|
|
|
|
|
... which is generally followed by:
|
|
|
|
|
|
[DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
|
|
|
[DEPEND] Dependency failed for Swap.
|
|
|
|
|
|
it means the swap device UUID wasn't setup properly, and does not
|
|
|
match the one provided in `/etc/fstab`. That is probably because
|
|
|
you missed the `mkswap -U` step documented above.
|
|
|
|
|
|
### References
|
|
|
|
|
|
* [Upstream docs](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances) have the canonical incantation:
|
|
|
|
|
|
gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME
|
|
|
|
|
|
* [DSA docs](https://dsa.debian.org/howto/install-ganeti/) also use disk adoption and have a procedure to
|
|
|
migrate to DRBD
|
|
|
|
|
|
* [Riseup docs](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) suggest creating a VM without installing, shutting
|
|
|
down and then syncing
|
|
|
|
|
|
Ganeti [supports importing and exporting](http://docs.ganeti.org/ganeti/2.15/html/design-ovf-support.html?highlight=qcow) from the [Open
|
|
|
Virtualization Format](https://en.wikipedia.org/wiki/Open_Virtualization_Format) (OVF), but unfortunately it [doesn't seem
|
|
|
libvirt supports *exporting* to OVF](https://forums.centos.org/viewtopic.php?t=49231). There's a [virt-convert](http://manpages.debian.org/virt-convert)
|
|
|
tool which can *import* OVF, but not the reverse. The [libguestfs](http://www.libguestfs.org/)
|
|
|
library also has a [converter](http://www.libguestfs.org/virt-v2v.1.html) but it also doesn't support
|
|
|
exporting to OVF or anything Ganeti can load directly.
|
|
|
|
|
|
So people have written [their own conversion tools](https://virtuallyhyper.com/2013/06/migrate-from-libvirt-kvm-to-virtualbox/) or [their own
|
|
|
conversion procedure](https://scienceofficersblog.blogspot.com/2014/04/using-cloud-images-with-ganeti.html).
|
|
|
|
|
|
Ganeti also supports [file-backed instances](http://docs.ganeti.org/ganeti/2.15/html/design-file-based-storage.html) but "adoption" is
|
|
|
specifically designed for logical volumes, so it doesn't work for our
|
|
|
use case.
|
|
|
|
|
|
## Rebooting
|
|
|
|
|
|
Those hosts need special care, as we can accomplish zero-downtime
|
|
|
reboots on those machines. The `reboot` script in `tsa-misc` takes
|
|
|
care of the special steps involved (which is basically to empty a
|
|
|
node before rebooting it).
|
|
|
|
|
|
Such a reboot should be ran interactively, inside a `tmux` or `screen`
|
|
|
session, and takes over 15 minutes to complete right now, but depends
|
|
|
on the size of the cluster (in terms of core memory usage).
|
|
|
|
|
|
Once the reboot is completed, all instances might end up on a single
|
|
|
machine, and the cluster might need to be rebalanced, see
|
|
|
below. (Note: the update script should eventually do that, see [ticket
|
|
|
33406](https://bugs.torproject.org/33406)).
|
|
|
|
|
|
## Rebalancing a cluster
|
|
|
|
|
|
After a reboot or a downtime, all nodes might end up on the same
|
|
|
machine. This is normally handled by the reboot script, but it might
|
|
|
be desirable to do this by hand if there was a crash or another
|
|
|
special condition.
|
|
|
|
|
|
This can be easily corrected with this command, which will spread
|
|
|
instances around the cluster to balance it:
|
|
|
|
|
|
hbal -L -C -v -X
|
|
|
|
|
|
This will automatically move the instances around and rebalance the
|
|
|
cluster. Here's an example run on a small cluster:
|
|
|
|
|
|
root@fsn-node-01:~# gnt-instance list
|
|
|
Instance Hypervisor OS Primary_node Status Memory
|
|
|
loghost01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 2.0G
|
|
|
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 12.0G
|
|
|
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
|
|
|
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
|
|
|
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
|
|
|
root@fsn-node-01:~# hbal -L -X
|
|
|
Loaded 2 nodes, 5 instances
|
|
|
Group size 2 nodes, 5 instances
|
|
|
Selected node group: default
|
|
|
Initial check done: 0 bad nodes, 0 bad instances.
|
|
|
Initial score: 8.45007519
|
|
|
Trying to minimize the CV...
|
|
|
1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 4.98124611 a=f
|
|
|
2. loghost01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 1.78271883 a=f
|
|
|
Cluster score improved from 8.45007519 to 1.78271883
|
|
|
Solution length=2
|
|
|
Got job IDs 16345
|
|
|
Got job IDs 16346
|
|
|
root@fsn-node-01:~# gnt-instance list
|
|
|
Instance Hypervisor OS Primary_node Status Memory
|
|
|
loghost01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 2.0G
|
|
|
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 12.0G
|
|
|
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
|
|
|
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
|
|
|
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
|
|
|
|
|
|
In the above example, you should notice that the `web-fsn` instances both
|
|
|
ended up on the same node. That's because the balancer did not know
|
|
|
that they should be distributed. A special configuration was done,
|
|
|
below, to avoid that problem in the future. But as a workaround,
|
|
|
instances can also be moved by hand and the cluster re-balanced.
|
|
|
|
|
|
## Redundant instances distribution
|
|
|
|
|
|
Some instances are redundant across the cluster and should *not* end up
|
|
|
on the same node. A good example are the `web-fsn-01` and `web-fsn-02`
|
|
|
instances which, in theory, would serve similar traffic. If they end
|
|
|
up on the same node, it might flood the network on that machine or at
|
|
|
least defeats the purpose of having redundant machines.
|
|
|
|
|
|
The way to ensure they get distributed properly by the balancing
|
|
|
algorithm is to "tag" them. For the web nodes, for example, this was
|
|
|
performed on the master:
|
|
|
|
|
|
gnt-instance add-tags web-fsn-01.torproject.org web-fsn
|
|
|
gnt-instance add-tags web-fsn-02.torproject.org web-fsn
|
|
|
gnt-cluster add-tags htools:iextags:web-fsn
|
|
|
|
|
|
This tells Ganeti that `web-fsn` is an "exclusion tag" and the
|
|
|
optimizer will not try to schedule instances with those tags on the
|
|
|
same node.
|
|
|
|
|
|
To see which tags are present, use:
|
|
|
|
|
|
# gnt-cluster list-tags
|
|
|
htools:iextags:web-fsn
|
|
|
|
|
|
You can also find which nodes are assigned to a tag with:
|
|
|
|
|
|
# gnt-cluster search-tags web-fsn
|
|
|
/cluster htools:iextags:web-fsn
|
|
|
/instances/web-fsn-01.torproject.org web-fsn
|
|
|
/instances/web-fsn-02.torproject.org web-fsn
|
|
|
|
|
|
## Adding and removing addresses on instances
|
|
|
|
|
|
Say you created an instance but forgot to need to assign an extra
|
|
|
IP. You can still do so with:
|
|
|
|
|
|
gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
|
|
|
|
|
|
## Pager playbook
|
|
|
|
|
|
### I/O overload
|
|
|
|
|
|
In case of excessive I/O, it might be worth looking into which machine
|
|
|
is in cause. The [howto/drbd](howto/drbd) page explains how to map a DRBD device to a
|
|
|
VM. You can also find which logical volume is backing an instance (and
|
|
|
vice versa) with this command:
|
|
|
|
|
|
lvs -o+tags
|
|
|
|
|
|
This will list all logical volumes and their associated tags. If you
|
|
|
already know which logical volume you're looking for, you can address
|
|
|
it directly:
|
|
|
|
|
|
root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
|
|
|
LV Tags
|
|
|
originstname+bacula-director-01.torproject.org
|
|
|
|
|
|
### Node failures
|
|
|
|
|
|
Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
|
|
|
one machine disappears, the cluster should be able to recover by
|
|
|
failing over other nodes. This is currently done manually, see the
|
|
|
migrate section above.
|
|
|
|
|
|
This could eventually be automated if such situations occur more
|
|
|
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
|
|
|
Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
|
|
|
manual.
|
|
|
|
|
|
### Other troubleshooting
|
|
|
|
|
|
Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
|
|
|
master failover, which we haven't tested. There's also upstream
|
|
|
documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
|
|
|
master failover scenario.
|
|
|
|
|
|
The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
|
|
|
problems.
|
|
|
|
|
|
## Disaster recovery
|
|
|
|
|
|
If things get completely out of hand and the cluster becomes too
|
|
|
unreliable for service, the only solution is to rebuild another one
|
|
|
elsewhere. Since Ganeti 2.2, there is a [move-instance](http://docs.ganeti.org/ganeti/2.15/html/move-instance.html) command to
|
|
|
move instances between cluster that can be used for that purpose.
|
|
|
|
|
|
If Ganeti is completely destroyed and its APIs don't work anymore, the
|
|
|
last resort is to restore all virtual machines from
|
|
|
[howto/backup](howto/backup). Hopefully, this should not happen except in the case of a
|
|
|
catastrophic data loss bug in Ganeti or [howto/drbd](howto/drbd).
|
|
|
|
|
|
# Reference
|
|
|
|
|
|
## Installation
|
|
|
|
|
|
### New node
|
|
|
|
|
|
1. To create a new box, follow [howto/new-machine-hetzner-robot](howto/new-machine-hetzner-robot) but change
|
|
|
the following settings:
|
|
|
|
|
|
* Server: [PX62-NVMe][]
|
|
|
* Location: `FSN1`
|
|
|
* Operating system: Rescue
|
|
|
* Additional drives: 2x10TB HDD (update: starting from fsn-node-05,
|
|
|
we are *not* ordering additional drives to save on costs, see
|
|
|
[ticket 33083](https://bugs.torproject.org/33083) for rationale)
|
|
|
* Add in the comment form that the server needs to be in the same
|
|
|
datacenter as the other machines (FSN1-DC13, but double-check)
|
|
|
|
|
|
[PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER
|
|
|
|
|
|
2. follow the [howto/new-machine](howto/new-machine) post-install configuration
|
|
|
|
|
|
3. Add the server to the two `vSwitch` systems in [Hetzner Robot web
|
|
|
UI](https://robot.your-server.de/vswitch)
|
|
|
|
|
|
4. install openvswitch and allow modules to be loaded:
|
|
|
|
|
|
touch /etc/no_modules_disabled
|
|
|
reboot
|
|
|
apt install openvswitch-switch
|
|
|
|
|
|
5. Allocate a private IP address in the `30.172.in-addr.arpa` zone for
|
|
|
the node.
|
|
|
|
|
|
6. copy over the `/etc/network/interfaces` from another ganeti node,
|
|
|
changing the `address` and `gateway` fields to match the local
|
|
|
entry.
|
|
|
|
|
|
7. knock on wood, cross your fingers, pet a cat, help your local
|
|
|
book store, and reboot:
|
|
|
|
|
|
reboot
|
|
|
|
|
|
8. Prepare all the nodes by configuring them in puppet. They should
|
|
|
be in the class `roles::ganeti::fsn` if they are part of the fsn
|
|
|
cluster.
|
|
|
|
|
|
9. Re-enable modules disabling:
|
|
|
|
|
|
rm /etc/no_modules_disabled
|
|
|
|
|
|
10. run puppet across the ganeti cluster to ensure ipsec tunnels are
|
|
|
up:
|
|
|
|
|
|
cumin -p 0 'C:roles::ganeti::fsn' 'puppet agent -t'
|
|
|
|
|
|
11. reboot again:
|
|
|
|
|
|
reboot
|
|
|
|
|
|
12. Then the node is ready to be added to the cluster, by running
|
|
|
this on the master node:
|
|
|
|
|
|
gnt-node add \
|
|
|
--secondary-ip 172.30.135.2 \
|
|
|
--no-ssh-key-check \
|
|
|
--no-node-setup \
|
|
|
fsn-node-02.torproject.org
|
|
|
|
|
|
If this is an entirely new cluster, you need a different procedure:
|
|
|
|
|
|
gnt-cluster init \
|
|
|
--master-netdev vlan-gntbe \
|
|
|
--vg-name vg_ganeti \
|
|
|
--secondary-ip 172.30.135.1 \
|
|
|
--enabled-hypervisors kvm \
|
|
|
--nic-parameters link=br0,vlan=4000 \
|
|
|
--mac-prefix 00:66:37 \
|
|
|
--no-ssh-init \
|
|
|
--no-etc-hosts \
|
|
|
fsngnt.torproject.org
|
|
|
|
|
|
The above assumes that `fsngnt` is already in DNS.
|
|
|
|
|
|
13. make sure everything is great in the cluster:
|
|
|
|
|
|
gnt-cluster verify
|
|
|
|
|
|
If that takes a long time and eventually fails with erors like:
|
|
|
|
|
|
ERROR: node fsn-node-03.torproject.org: ssh communication with node 'fsn-node-06.torproject.org': ssh problem: ssh: connect to host fsn-node-06.torproject.org port 22: Connection timed out\'r\n
|
|
|
|
|
|
... that is because the [howto/ipsec](howto/ipsec) tunnels between the nodes are
|
|
|
failing. Make sure Puppet has run across the cluster (step 10
|
|
|
above) and see [howto/ipsec](howto/ipsec) for further diagnostics. For example,
|
|
|
the above would be fixed with:
|
|
|
|
|
|
ssh fsn-node-03.torproject.org "puppet agent -t; service ipsec reload"
|
|
|
ssh fsn-node-06.torproject.org "puppet agent -t; service ipsec reload; ipsec up gnt-fsn-be::fsn-node-03"
|
|
|
|
|
|
### cluster config
|
|
|
|
|
|
These could probably be merged into the cluster init, but just to document what has been done:
|
|
|
|
|
|
gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
|
|
|
gnt-cluster modify -H kvm:kernel_path=,initrd_path=,
|
|
|
gnt-cluster modify -H kvm:security_model=pool
|
|
|
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
|
|
|
gnt-cluster modify -H kvm:disk_cache=none
|
|
|
gnt-cluster modify -H kvm:disk_discard=unmap
|
|
|
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
|
|
|
gnt-cluster modify -H kvm:disk_type=scsi-hd
|
|
|
gnt-cluster modify --uid-pool 4000-4019
|
|
|
gnt-cluster modify --nic-parameters mode=openvswitch,link=br0,vlan=4000
|
|
|
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
|
|
|
gnt-cluster modify -H kvm:migration_bandwidth=950
|
|
|
gnt-cluster modify -H kvm:migration_downtime=500
|
|
|
|
|
|
### Network configuration
|
|
|
|
|
|
IP allocation is managed by Ganeti through the `gnt-network(8)`
|
|
|
system. Say we have `192.0.2.0/24` reserved for the cluster, with
|
|
|
the host IP `192.0.2.100`` and the gateway on `192.0.2.1`. You will
|
|
|
create this network with:
|
|
|
|
|
|
gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 example-network
|
|
|
|
|
|
If there's also IPv6, it would look something like this:
|
|
|
|
|
|
gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network
|
|
|
|
|
|
Note: the actual name of the network (`example-network`) above, should
|
|
|
follow the convention established in [doc/naming-scheme](doc/naming-scheme).
|
|
|
|
|
|
Then we associate the new network to the default node group:
|
|
|
|
|
|
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default
|
|
|
|
|
|
The arguments to `--nic-parameters` come from the values configured in
|
|
|
the cluster, above. The current values can be found with `gnt-cluster
|
|
|
info`.
|
|
|
|
|
|
For example, the second ganeti network block was assigned with the
|
|
|
following commands:
|
|
|
|
|
|
gnt-network add --network 49.12.57.128/27 --gateway 49.12.57.129 gnt-fsn13-02
|
|
|
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch gnt-fsn13-02 default
|
|
|
|
|
|
## SLA
|
|
|
|
|
|
As long as the cluster is not over capacity, it should be able to
|
|
|
survive the loss of a node in the cluster unattended.
|
|
|
|
|
|
Justified machines can be provisionned within a few business days
|
|
|
without problems.
|
|
|
|
|
|
New nodes can be provisioned within a week or two, depending on budget
|
|
|
and hardware availability.
|
|
|
|
|
|
## Design
|
|
|
|
|
|
Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines
|
|
|
hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting
|
|
|
service. All machines use the same hardware to avoid problems with
|
|
|
live migration. That is currently a customized build of the
|
|
|
[PX62-NVMe][] line.
|
|
|
|
|
|
### Network layout
|
|
|
|
|
|
Machines are interconnected over a [vSwitch](https://wiki.hetzner.de/index.php/Vswitch/en), a "virtual layer 2
|
|
|
network" probably implemented using [Software-defined Networking](https://en.wikipedia.org/wiki/Software-defined_networking)
|
|
|
(SDN) on top of Hetzner's network. The details of that implementation
|
|
|
do not matter much to us, since we do not trust the network and run an
|
|
|
IPsec layer on top of the vswitch. We communicate with the `vSwitch`
|
|
|
through [Open vSwitch](https://en.wikipedia.org/wiki/Open_vSwitch) (OVS), which is (currently manually)
|
|
|
configured on each node of the cluster.
|
|
|
|
|
|
There are two distinct IPsec networks:
|
|
|
|
|
|
* `gnt-fsn-public`: the public network, which maps to the
|
|
|
`fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS
|
|
|
network, and the `gnt-fsn` network pool in Ganeti. it provides
|
|
|
public IP addresses and routing across the network. instances get
|
|
|
IP allocated in this network.
|
|
|
|
|
|
* `gnt-fsn-be`: the private ganeti network which maps to the
|
|
|
`fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS
|
|
|
network. it has no matching `gnt-network` component and IP
|
|
|
addresses are allocated manually in the 172.30.135.0/24 network
|
|
|
through DNS. it provides internal routing for Ganeti commands and
|
|
|
[howto/drbd](howto/drbd) storage mirroring.
|
|
|
|
|
|
### Hardware variations
|
|
|
|
|
|
We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)) but
|
|
|
in the past DSA had problems live-migrating (it wouldn't immediately
|
|
|
fail but there were "issues" after). So we might need to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover)
|
|
|
instead of migrate between those parts of the cluster. There are also
|
|
|
doubts that the Linux kernel supports those shiny new processors at
|
|
|
all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix) for
|
|
|
example, so it might be worth waiting a little before switching to
|
|
|
that new platform, even if it's cheaper. See the cluster configuration
|
|
|
section below for a larger discussion of CPU emulation.
|
|
|
|
|
|
### CPU emulation
|
|
|
|
|
|
Note that we might want to tweak the `cpu_type` parameter. By default,
|
|
|
it emulates a lot of processing that can be delegated to the host CPU
|
|
|
instead. If we use `kvm:cpu_type=host`, then each node will tailor the
|
|
|
emulation system to the CPU on the node. But that might make the live
|
|
|
migration more brittle: VMs or processes can crash after a live
|
|
|
migrate because of a slightly different configuration (microcode, CPU,
|
|
|
kernel and QEMU versions all play a role). So we need to find the
|
|
|
lowest common demoninator in CPU families. The list of available
|
|
|
families supported by QEMU varies between releases, but is visible
|
|
|
with:
|
|
|
|
|
|
# qemu-system-x86_64 -cpu help
|
|
|
Available CPUs:
|
|
|
x86 486
|
|
|
x86 Broadwell Intel Core Processor (Broadwell)
|
|
|
[...]
|
|
|
x86 Skylake-Client Intel Core Processor (Skylake)
|
|
|
x86 Skylake-Client-IBRS Intel Core Processor (Skylake, IBRS)
|
|
|
x86 Skylake-Server Intel Xeon Processor (Skylake)
|
|
|
x86 Skylake-Server-IBRS Intel Xeon Processor (Skylake, IBRS)
|
|
|
[...]
|
|
|
|
|
|
The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
|
|
|
micro-architecture. The closest matching family would be
|
|
|
`Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
|
|
|
Note that newer QEMU releases (4.2, currently in unstable) have more
|
|
|
supported features.
|
|
|
|
|
|
In that context, of course, supporting different CPU manufacturers
|
|
|
(say AMD vs Intel) is impractical: they will have totally different
|
|
|
families that are not compatible with each other. This will break live
|
|
|
migration, which can trigger crashes and problems in the migrated
|
|
|
virtual machines.
|
|
|
|
|
|
If there are problems live-migrating between machines, it is still
|
|
|
possible to "failover" (`gnt-instance failover` instead of `migrate`)
|
|
|
which shuts off the machine, fails over disks, and starts it on the
|
|
|
other side. That's not such of a big problem: we often need to reboot
|
|
|
the guests when we reboot the hosts anyways. But it does complicate
|
|
|
our work. Of course, it's also possible that live migrates work fine
|
|
|
if *no* `cpu_type` at all is specified in the cluster, but that needs
|
|
|
to be verified.
|
|
|
|
|
|
Nodes could also [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a
|
|
|
subset of nodes.
|
|
|
|
|
|
References:
|
|
|
|
|
|
* <https://dsa.debian.org/howto/install-ganeti/>
|
|
|
* <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>
|
|
|
|
|
|
### Installer
|
|
|
|
|
|
The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install
|
|
|
instances. It is configured through Puppet with the [shared ganeti
|
|
|
module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much
|
|
|
as possible. The installer will:
|
|
|
|
|
|
1. setup grub to respond on the serial console
|
|
|
2. setup and log a random root password
|
|
|
3. make sure SSH is installed and log the public keys and
|
|
|
fingerprints
|
|
|
4. setup swap if a labeled partition is present, or a 512MB swapfile
|
|
|
otherwise
|
|
|
5. setup basic static networking through `/etc/network/interfaces.d`
|
|
|
|
|
|
We have custom configurations on top of that to:
|
|
|
|
|
|
1. add a few base packages
|
|
|
2. do our own custom SSH configuration
|
|
|
3. fix the hostname to be a FQDN
|
|
|
4. add a line to `/etc/hosts`
|
|
|
5. add a tmpfs
|
|
|
|
|
|
There is work underway to refactor and automate the install better,
|
|
|
see [ticket 31239](https://bugs.torproject.org/31239) for details.
|
|
|
|
|
|
## Issues
|
|
|
|
|
|
There is no issue tracker specifically for this project, [File][] or
|
|
|
[search][] for issues in the [generic internal services][search] component.
|
|
|
|
|
|
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
|
|
|
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
|
|
|
|
|
|
# Discussion
|
|
|
|
|
|
## Overview
|
|
|
|
|
|
The project of creating a Ganeti cluster for Tor has appeared in the
|
|
|
summer of 2019. The machines were delivered by Hetzner in July 2019
|
|
|
and setup by weasel by the end of the month.
|
|
|
|
|
|
## Goals
|
|
|
|
|
|
The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
|
|
|
textile, unifolium, macrum, kvm4 and kvm5).
|
|
|
|
|
|
### Must have
|
|
|
|
|
|
* arbitrary virtual machine provisionning
|
|
|
* redundant setup
|
|
|
* automated VM installation
|
|
|
* replacement of existing infrastructure
|
|
|
|
|
|
### Nice to have
|
|
|
|
|
|
* fully configured in Puppet
|
|
|
* full high availability with automatic failover
|
|
|
* extra capacity for new projects
|
|
|
|
|
|
### Non-Goals
|
|
|
|
|
|
* Docker or "container" provisionning - we consider this out of scope
|
|
|
for now
|
|
|
* self-provisionning by end-users: TPA remains in control of
|
|
|
provisionning
|
|
|
|
|
|
## Approvals required
|
|
|
|
|
|
A budget was proposed by weasel in may 2019 and approved by Vegas in
|
|
|
June. An extension to the budget was approved in january 2020 by
|
|
|
Vegas.
|
|
|
|
|
|
## Proposed Solution
|
|
|
|
|
|
Setup a Ganeti cluster of two machines with a Hetzner vSwitch backend.
|
|
|
|
|
|
## Cost
|
|
|
|
|
|
The design based on the [PX62 line][PX62-NVMe] has the following monthly cost
|
|
|
structure:
|
|
|
|
|
|
* per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
|
|
|
* IPv4 space: 35.29EUR (/27)
|
|
|
* IPv6 space: 8.40EUR (/64)
|
|
|
* bandwidth cost: 1EUR/TB (currently 38EUR)
|
|
|
|
|
|
At three servers, that adds up to around 435EUR/mth. Up to date costs
|
|
|
are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.
|
|
|
|
|
|
## Alternatives considered
|
|
|
|
|
|
<!-- include benchmarks and procedure if relevant --> |