[Ganeti](http://ganeti.org/) is software designed to facilitate the management of
virtual machines (KVM or Xen). It helps you move virtual machine
instances from one node to another, create an instance with DRBD
replication on another node, do live migrations between nodes, and so
on.
This will show the running guests, known as "instances":
gnt-instance list
Our instances provide a serial console, starting in GRUB. To access it, run:
gnt-instance console test01.torproject.org
To exit, use `^]` -- that is, Control-<Closing Bracket>.
## Glossary
In Ganeti, a physical machine is called a *node* and a virtual machine
is an *instance*. A node is elected to be the *master* where all
commands should be run from. Nodes are interconnected through a
private network that is used to communicate commands and synchronise
disks (with [[drbd]]). Instances are normally assigned two nodes: a
*primary* and a *secondary*: the *primary* is where the virtual
machine actually runs and the *secondary* acts as a hot failover.
See also the more extensive [glossary in the Ganeti documentation](http://docs.ganeti.org/ganeti/2.15/html/glossary.html).
This command creates a new guest, or "instance" in Ganeti's
vocabulary:
gnt-instance add \
-o debootstrap+buster \
-t drbd --no-wait-for-sync \
--net 0:ip=pool,network=gnt-fsn \
--no-ip-check \
--no-name-check \
--disk 1:size=2G,name=swap \
--disk 2:size=20G \
--disk 3:size=800G,vg=vg_ganeti_hdd \
--backend-parameters memory=8g,vcpus=2 \
test01.torproject.org
WARNING: there is a bug in `ganeti-instance-debootstrap` which
misconfigures `ping` (among other things), see [bug #31781](https://bugs.torproject.org/31781). It's
currently patched in our version of the Debian package, but that patch
might disappear if Debian upgrades the package without [shipping our
patch](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944538).
This configures the following:
* redundant disks in a DRBD mirror; use `-t plain` instead of `-t drbd` for
tests, as that avoids syncing the disks and speeds things up considerably
(even with `--no-wait-for-sync`, some operations block on
synced mirrors). Only one node should then be provided as the argument
to `--node`.
* three partitions: one on the default VG (SSD), one on another VG (HDD),
and a swap device on the default VG. If you don't specify a swap device,
a 512MB swapfile is created in `/swapfile` instead. TODO: configure disks 2
and 3 automatically in the installer. (`/var` and `/srv`?)
* an IP allocated from the public gnt-fsn pool:
`gnt-instance add` will print the IPv4 address it picked to stdout. The
IPv6 address can be found in `/var/log/ganeti/os/` on the primary node
of the instance, see below.
To find the root password, ssh host key fingerprints, and the IPv6
address, run this **on the node where the instance was created**, for
example:
egrep 'root password|configured eth0 with|SHA256' $(ls -tr /var/log/ganeti/os/* | tail -1) | grep -v $(hostname)
Note that you need to use the `--node` parameter to pick which
machines the instance will end up on, otherwise Ganeti will choose
for you. Use, for example, `--node fsn-node-01:fsn-node-02` to use
`node-01` as primary and `node-02` as secondary. It might be better to
let the Ganeti allocator do its job, since it will eventually do this
during cluster rebalancing anyway.
We copy root's authorized keys into the new instance, so you should be able to
log in with your token. You will be required to change the root password immediately.
Pick something nice and document it in `tor-passwords`.
Also set reverse DNS for both IPv4 and IPv6 in [hetzner's robot](https://robot.your-server.de/).
(Check under Servers -> vSwitch -> IPs.)
Then follow [[new-machine]].
## Modifying an instance
It's possible to change the IP, CPU, or memory allocation of an instance
using the [gnt-instance modify](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#modify) command:
gnt-instance modify -B vcpus=2 test1.torproject.org
gnt-instance modify -B memory=4g test1.torproject.org
gnt-instance reboot test1.torproject.org
IP address changes require a full stop and will require manual changes
to the `/etc/network/interfaces*` files:
gnt-instance modify --net 0:modify,ip=116.202.120.175 test1.torproject.org
gnt-instance stop test1.torproject.org
gnt-instance start test1.torproject.org
gnt-instance console test1.torproject.org
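To double-check the result, the instance's NICs and backend parameters
can be reviewed at any time with a read-only command:

    gnt-instance info test1.torproject.org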
The [gnt-instance grow-disk](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#grow-disk) command can be used to change the size
of the underlying device:
gnt-instance grow-disk test1.torproject.org 0 16g
gnt-instance reboot test1.torproject.org
The number `0` in this context indicates the first disk of the
instance. Then the filesystem needs to be resized inside the VM:
ssh root@test1.torproject.org resize2fs /dev/sda1
This completely deletes the instance, including all mirrors and
everything; be very careful with it:
gnt-instance remove test01.torproject.org
Instances should be set up using the DRBD backend, in which case you
should probably take a look at [[drbd]] if you have problems with
that. Ganeti handles most of the logic there so that should generally
not be necessary.
## Evaluating cluster capacity
This will list instances repeatedly, but also show their assigned
memory, and compare it with the node's capacity:
watch -n5 -d 'gnt-instance list -o pnode,name,be/vcpus,be/memory,status,disk_template | sort; echo; gnt-node list'
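For a rough estimate of how many more standard-sized instances would
fit in the cluster, the htools suite also ships `hspace` (a sketch:
run on the master node, it uses the cluster's instance policy for
sizing):

    hspace -L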
The [gnt-cluster verify](http://docs.ganeti.org/ganeti/2.15/man/gnt-cluster.html#verify) command will also check to see if there's
enough space on secondaries to account for the failure of a
node. Healthy output looks like this:
root@fsn-node-01:~# gnt-cluster verify
Submitted jobs 48030, 48031
Waiting for job 48030 ...
Fri Jan 17 20:05:42 2020 * Verifying cluster config
Fri Jan 17 20:05:42 2020 * Verifying cluster certificate files
Fri Jan 17 20:05:42 2020 * Verifying hypervisor parameters
Fri Jan 17 20:05:42 2020 * Verifying all nodes belong to an existing group
Waiting for job 48031 ...
Fri Jan 17 20:05:42 2020 * Verifying group 'default'
Fri Jan 17 20:05:42 2020 * Gathering data (2 nodes)
Fri Jan 17 20:05:42 2020 * Gathering information about nodes (2 nodes)
Fri Jan 17 20:05:45 2020 * Gathering disk information (2 nodes)
Fri Jan 17 20:05:45 2020 * Verifying configuration file consistency
Fri Jan 17 20:05:45 2020 * Verifying node status
Fri Jan 17 20:05:45 2020 * Verifying instance status
Fri Jan 17 20:05:45 2020 * Verifying orphan volumes
Fri Jan 17 20:05:45 2020 * Verifying N+1 Memory redundancy
Fri Jan 17 20:05:45 2020 * Other Notes
Fri Jan 17 20:05:45 2020 * Hooks Results
A sick node would have said something like this instead:
Mon Oct 26 18:59:37 2009 * Verifying N+1 Memory redundancy
Mon Oct 26 18:59:37 2009 - ERROR: node node2: not enough memory to accommodate instance failovers should node node1 fail
See the [ganeti manual](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html#n-1-errors) for a more extensive example.
Ganeti is smart about assigning instances to nodes. There's also a
command (`hbal`) to automatically rebalance the cluster (see
below). If for some reason hbal doesn’t do what you want or you need
to move things around for other reasons, here are a few commands that
might be handy.
Make an instance switch to using its secondary:
gnt-instance migrate test1.torproject.org
Make all instances on a node switch to their secondaries:
gnt-node migrate test1.torproject.org
The `migrate` command does a "live" migration, which should avoid any
downtime. It might be preferable to actually shut down the machine
for some reason (for example if we actually want to reboot because of
a security upgrade). Or we might not be able to live-migrate because
the node is down. In this case, we do a
[failover](http://docs.ganeti.org/ganeti/2.15/html/admin.html#failing-over-an-instance):
gnt-instance failover test1.torproject.org
The [gnt-node evacuate](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#evacuate) command can also be used to "empty" a given
node altogether, in case of an emergency:
gnt-node evacuate -I . fsn-node-02.torproject.org
Similarly, the [gnt-node failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-node.html#failover) command can be used to
hard-recover from a completely crashed node:
gnt-node failover fsn-node-02.torproject.org
Note that you might need the `--ignore-consistency` flag if the
node is unresponsive.
Assumptions:
* `INSTANCE`: name of the instance being migrated, the "old" one
being outside the cluster and the "new" one being the one created
inside the cluster (e.g. `chiwui.torproject.org`)
* `SPARE_NODE`: a ganeti node with free space
(e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
migrated
* `MASTER_NODE`: the master ganeti node
(e.g. `fsn-node-01.torproject.org`)
* `KVM_HOST`: the machine which we migrate the `INSTANCE` from
* the `INSTANCE` has only `root` and `swap` partitions
Import procedure:
1. pick a viable SPARE NODE to import the INSTANCE (see "evaluating
cluster capacity" above, when in doubt) and find on which KVM HOST
the INSTANCE lives
2. copy the disks, without downtime:
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST
3. copy the disks again, this time suspending the machine:
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --suspend --adopt
4. renumber the new instance:
./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
5. test services by changing your `/etc/hosts`
6. shut down the original instance, resync the disks one last time, and renumber again:
ssh $INSTANCE shutdown -h +1 'migrating to new server'
./ganeti -v -H $INSTANCE libvirt-import --ganeti-node $SPARE_NODE --libvirt-host $KVM_HOST --adopt
./ganeti -v -H $INSTANCE renumber-instance --ganeti-node $SPARE_NODE
TODO: establish host-level test procedure and run it here.
10. switch to DRBD, still on the Ganeti MASTER NODE:
gnt-instance stop $INSTANCE &&
gnt-instance modify -t drbd $INSTANCE &&
gnt-instance failover $INSTANCE &&
gnt-instance start $INSTANCE
TODO: move into fabric, maybe in a `libvirt-import-live` or
`post-libvirt-import` job that would also do the renumbering below
11. change IP address in the following locations:
* LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
* Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
* DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
* nagios (don't forget to change the parent)
* reverse DNS (upstream web UI, e.g. Hetzner Robot)
* grep for the host's IP address on itself:
grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc
grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /srv
* grep for the host's IP on *all* hosts:
cumin-all-puppet
cumin-all 'grep -r -e 78.47.38.227 -e 2a01:4f8:fff0:4f:266:37ff:fe77:1ad8 /etc'
TODO: move those jobs into fabric
12. retire old instance (only a tiny part of [[retire-a-host]]):
./retire -H $INSTANCE retire-instance --parent-host $KVM_HOST
13. warn users about the migration, for example:
> To: tor-project@lists.torproject.org
> Subject: cupani AKA git-rw IP address changed
>
> The main git server, cupani, is the machine you connect to when you push
> or pull git repositories over ssh to git-rw.torproject.org. That
> machine has been migrated to the new Ganeti cluster.
>
> This required an IP address change from:
>
> 78.47.38.228 2a01:4f8:211:6e8:0:823:4:1
>
> to:
>
> 116.202.120.182 2a01:4f8:fff0:4f:266:37ff:fe32:cfb2
>
> DNS has been updated and preliminary tests show that everything is
> mostly working. You *will* get a warning about the IP address change
> when connecting over SSH, which will go away after the first
> connection.
>
> Warning: Permanently added the ED25519 host key for IP address '116.202.120.182' to the list of known hosts.
>
> That is normal. The SSH fingerprints of the host did *not* change.
>
> Please do report any other anomaly using the normal channels:
>
> https://help.torproject.org/tsa/doc/how-to-get-help/
>
> The service was unavailable for about an hour during the migration.
The following manual procedure has been superseded by the Fabric tools
used above; use the above procedure instead. This one is kept only for
historical reference.
Assumptions:
* `INSTANCE`: name of the instance being migrated, the "old" one
being outside the cluster and the "new" one being the one created
inside the cluster (e.g. `chiwui.torproject.org`)
* `SPARE_NODE`: a ganeti node with free space
(e.g. `fsn-node-03.torproject.org`) where the `INSTANCE` will be
migrated
* `MASTER_NODE`: the master ganeti node
(e.g. `fsn-node-01.torproject.org`)
* `KVM_HOST`: the machine which we migrate the `INSTANCE` from
* the `INSTANCE` has only `root` and `swap` partitions
1. pick a viable SPARE NODE to import the instance (see "evaluating
cluster capacity" above, when in doubt), login to the three
servers, setting the proper environment everywhere, for example:
MASTER_NODE=fsn-node-01.torproject.org
SPARE_NODE=fsn-node-03.torproject.org
KVM_HOST=kvm1.torproject.org
INSTANCE=test.torproject.org
echo "$(qemu-img info --output=json $disk | jq '."virtual-size"') / 1024 / 1024 / 1024" | bc -l
sed -n '/<vcpu/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml
* memory, assuming from KiB to GiB:
echo "$(sed -n '/<memory/{s/[^>]*>//;s/<.*//;p}' < /etc/libvirt/qemu/$INSTANCE.xml) /1024 /1024" | bc -l
TODO: make sure the memory line is in KiB and that the number
makes sense.
* on the INSTANCE, find the swap device UUID so we can recreate it later:
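For example, with `blkid` (assuming a single swap device; any
equivalent command works):

    SWAP_UUID=$(blkid -t TYPE=swap -s UUID -o value)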
3. setup a copy channel, on the SPARE NODE:
ssh-agent bash
ssh-add /etc/ssh/ssh_host_ed25519_key
cat /etc/ssh/ssh_host_ed25519_key.pub
then authorize that public key for `root` on the KVM HOST, so that the
`rsync` in the next step can pull the disk images over SSH
4. copy the `.qcow` file(s) over, from the KVM HOST to the SPARE NODE:
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ || true
Note: it's possible there is not enough room in `/srv`: in the
base Ganeti installs, everything is in the same root partition
(`/`) which will fill up if the instance is (say) over ~30GiB. In
that case, create a filesystem in `/srv`:
(mkdir /root/srv && mv /srv/* /root/srv) || true &&
lvcreate -L 200G vg_ganeti -n srv &&
mkfs /dev/vg_ganeti/srv &&
echo "/dev/vg_ganeti/srv /srv ext4 rw,noatime,errors=remount-ro 0 2" >> /etc/fstab &&
mount /srv &&
( mv /root/srv/* ; rmdir /root/srv )
This partition can be reclaimed once the VM migrations are
completed, as it needlessly takes up space on the node.
5. on the SPARE NODE, create and initialize a logical volume with the predetermined size:
lvcreate -L 4GiB -n $INSTANCE-swap vg_ganeti
mkswap --uuid $SWAP_UUID /dev/vg_ganeti/$INSTANCE-swap
lvcreate -L 20GiB -n $INSTANCE-root vg_ganeti
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
Note how we assume two disks above, but the instance might have a
different configuration that would require changing the above. The
above, common, configuration is to have an LVM disk separate from
the "root" disk, the former being on a HDD, but the HDD is
sometimes completely omitted and sizes can differ.
Sometimes it might be worth using pv to get progress on long
transfers:
qemu-img convert /srv/$INSTANCE-lvm -O raw /srv/$INSTANCE-lvm.raw
pv /srv/$INSTANCE-lvm.raw | dd of=/dev/vg_ganeti_hdd/$INSTANCE-lvm bs=4k
TODO: ideally, the above procedure (and many steps below as well)
would be automatically deduced from the disk listing established
in the first step.
6. on the MASTER NODE, create the instance, adopting the LV:
gnt-instance add -t plain \
-n fsn-node-03 \
--disk 0:adopt=$INSTANCE-root \
--disk 1:adopt=$INSTANCE-swap \
--disk 2:adopt=$INSTANCE-lvm,vg=vg_ganeti_hdd \
--backend-parameters memory=2g,vcpus=2 \
--net 0:ip=pool,network=gnt-fsn \
--no-name-check \
--no-ip-check \
-o debootstrap+default \
$INSTANCE
7. cross your fingers and watch the party:
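One way to do that is to watch the creation job and then attach to the
new instance's serial console once it starts (a sketch, using
variables from the steps above):

    gnt-job list
    gnt-instance console $INSTANCE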
9. IP address change on new instance:
edit `/etc/hosts` and `/etc/network/interfaces` by hand to add the
IPv4 and IPv6 addresses. The IPv4 address is the one Ganeti allocated
at creation time (it is printed by `gnt-instance add` and also logged
under `/var/log/ganeti/os/` on the primary node). The IPv6 address can
be guessed by concatenating `2a01:4f8:fff0:4f::` and the instance's
link-local address without the `fe80::` prefix. For example, a
link-local address of `fe80::266:37ff:fe65:870f/64` should yield the
following configuration:
iface eth0 inet6 static
accept_ra 0
address 2a01:4f8:fff0:4f:266:37ff:fe65:870f/64
gateway 2a01:4f8:fff0:4f::1
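A quick way to derive that address, run on the new instance itself (a
sketch; assumes the interface is named `eth0` and the `gnt-fsn` prefix
shown above):

    ip -6 addr show dev eth0 scope link \
        | awk '/inet6/ {print $2}' \
        | sed 's/^fe80::/2a01:4f8:fff0:4f:/'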
TODO: reuse `gnt-debian-interfaces` from the ganeti puppet
module script here?
10. functional tests: change your `/etc/hosts` to point to the new
server and see if everything still kind of works
11. shutdown original instance
12. resync and reconvert image, on the Ganeti MASTER NODE:
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-root /srv/ &&
qemu-img convert /srv/$INSTANCE-root -O raw /dev/vg_ganeti/$INSTANCE-root &&
rsync -P $KVM_HOST:/srv/vmstore/$INSTANCE/$INSTANCE-lvm /srv/ &&
qemu-img convert /srv/$INSTANCE-lvm -O raw /dev/vg_ganeti_hdd/$INSTANCE-lvm
13. switch to DRBD, still on the Ganeti MASTER NODE:
gnt-instance modify -t drbd $INSTANCE
gnt-instance failover $INSTANCE
gnt-instance startup $INSTANCE
14. redo the IP address change in `/etc/network/interfaces` and `/etc/hosts`
15. final functional test
16. change IP address in the following locations:
* nagios (don't forget to change the parent)
* LDAP (`ipHostNumber` field, but also change the `physicalHost` and `l` fields!)
* Puppet (grep in tor-puppet source, run `puppet agent -t; ud-replicate` on pauli)
* DNS (grep in tor-dns source, `puppet agent -t; ud-replicate` on nevii)
* reverse DNS (upstream web UI, e.g. Hetzner Robot)
17. decommission the old instance ([[retire-a-host]])
### Troubleshooting
* if boot takes a long time and you see a message like this on the console:
[ *** ] A start job is running for dev-disk-by\x2duuid-484b5...26s / 1min 30s)
... which is generally followed by:
[DEPEND] Dependency failed for /dev/disk/by-…6f4b5-f334-4173-8491-9353d4f94e04.
[DEPEND] Dependency failed for Swap.
it means the swap device UUID wasn't set up properly, and does not
match the one provided in `/etc/fstab`. That is probably because
you missed the `mkswap --uuid` step documented above.
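If the instance was already created like this, one way to repair it is
to recreate the swap area with the UUID that `/etc/fstab` expects (a
sketch: `$SWAP_UUID` is the UUID recorded earlier, and the device path
must be adjusted to the actual swap disk):

    mkswap --uuid $SWAP_UUID /dev/sdb
    swapon -a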
### References
* [Upstream docs](http://docs.ganeti.org/ganeti/2.15/html/admin.html#import-of-foreign-instances) have the canonical incantation:
gnt-instance add -t plain -n HOME_NODE ... --disk 0:adopt=lv_name[,vg=vg_name] INSTANCE_NAME
* [DSA docs](https://dsa.debian.org/howto/install-ganeti/) also use disk adoption and have a procedure to
migrate to DRBD
* [Riseup docs](https://we.riseup.net/riseup+tech/ganeti#move-an-instance-from-one-cluster-to-another-from-) suggest creating a VM without installing, shutting
down and then syncing
Ganeti [supports importing and exporting](http://docs.ganeti.org/ganeti/2.15/html/design-ovf-support.html?highlight=qcow) from the [Open
Virtualization Format](https://en.wikipedia.org/wiki/Open_Virtualization_Format) (OVF), but unfortunately it [doesn't seem
libvirt supports *exporting* to OVF](https://forums.centos.org/viewtopic.php?t=49231). There's a [virt-convert](http://manpages.debian.org/virt-convert)
tool which can *import* OVF, but not the reverse. The [libguestfs](http://www.libguestfs.org/)
library also has a [converter](http://www.libguestfs.org/virt-v2v.1.html) but it also doesn't support
exporting to OVF or anything Ganeti can load directly.
So people have written [their own conversion tools](https://virtuallyhyper.com/2013/06/migrate-from-libvirt-kvm-to-virtualbox/) or [their own
conversion procedure](https://scienceofficersblog.blogspot.com/2014/04/using-cloud-images-with-ganeti.html).
Ganeti also supports [file-backed instances](http://docs.ganeti.org/ganeti/2.15/html/design-file-based-storage.html) but "adoption" is
specifically designed for logical volumes, so it doesn't work for our
use case.
Those hosts need special care, as we can accomplish zero-downtime
reboots on those machines. The `reboot` script in `tsa-misc` takes
care of the special steps involved (which is basically to empty a
node before rebooting it).
Such a reboot should be run interactively, inside a `tmux` or `screen`
session. It currently takes over 15 minutes to complete, depending
on the size of the cluster (in terms of core memory usage).
Once the reboot is completed, all instances might end up on a single
machine, and the cluster might need to be rebalanced, see
below. (Note: the update script should eventually do that, see [ticket
#33406](https://trac.torproject.org/projects/tor/ticket/33406)).
After a reboot or a downtime, all instances might end up on the same
machine. This is normally handled by the reboot script, but it might
be desirable to do this by hand if there was a crash or another
special condition.
This can be easily corrected with this command, which will spread
instances around the cluster to balance it:
hbal -L -C -v -X
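Note that without the `-X` flag, `hbal` only prints the moves it would
make, which is a useful dry run:

    hbal -L -C -v

With `-X`, as above, the moves are actually submitted as Ganeti jobs.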
This will automatically move the instances around and rebalance the
cluster. Here's an example run on a small cluster:
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
root@fsn-node-01:~# hbal -L -X
Loaded 2 nodes, 5 instances
Group size 2 nodes, 5 instances
Selected node group: default
Initial check done: 0 bad nodes, 0 bad instances.
Initial score: 8.45007519
Trying to minimize the CV...
1. onionoo-backend-01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 4.98124611 a=f
2. loghost01 fsn-node-02:fsn-node-01 => fsn-node-01:fsn-node-02 1.78271883 a=f
Cluster score improved from 8.45007519 to 1.78271883
Solution length=2
Got job IDs 16345
Got job IDs 16346
root@fsn-node-01:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
loghost01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 2.0G
onionoo-backend-01.torproject.org kvm debootstrap+buster fsn-node-01.torproject.org running 12.0G
static-master-fsn.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 8.0G
web-fsn-01.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
web-fsn-02.torproject.org kvm debootstrap+buster fsn-node-02.torproject.org running 4.0G
In the above example, you should notice that the `web-fsn` instances both
ended up on the same node. That's because the balancer did not know
that they should be distributed. A special configuration was done,
below, to avoid that problem in the future. But as a workaround,
instances can also be moved by hand and the cluster re-balanced.
## Redundant instances distribution
Some instances are redundant across the cluster and should *not* end up
on the same node. A good example are the `web-fsn-01` and `web-fsn-02`
instances which, in theory, would serve similar traffic. If they end
up on the same node, it might flood the network on that machine or at
least defeat the purpose of having redundant machines.
The way to ensure they get distributed properly by the balancing
algorithm is to "tag" them. For the web nodes, for example, this was
performed on the master:
gnt-instance add-tags web-fsn-01.torproject.org web-fsn
gnt-instance add-tags web-fsn-02.torproject.org web-fsn
gnt-cluster add-tags htools:iextags:web-fsn
This tells Ganeti that `web-fsn` is an "exclusion tag" and the
optimizer will not try to schedule instances with those tags on the
same node.
To see which tags are present, use:
# gnt-cluster list-tags
htools:iextags:web-fsn
You can also find which nodes are assigned to a tag with:
# gnt-cluster search-tags web-fsn
/cluster htools:iextags:web-fsn
/instances/web-fsn-01.torproject.org web-fsn
/instances/web-fsn-02.torproject.org web-fsn
## Adding and removing addresses on instances
Say you created an instance but forgot to assign an extra
IP address. You can still do so with:
gnt-instance modify --net -1:add,ip=116.202.120.174,network=gnt-fsn test01.torproject.org
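The reverse operation should also be possible, to drop a NIC that is no
longer needed (a sketch: double-check the NIC index with `gnt-instance
info` first, and note that, as with other IP changes, the instance
needs a stop/start for the guest to notice):

    gnt-instance modify --net 1:remove test01.torproject.org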
### I/O overload
In case of excessive I/O, it might be worth finding out which instance
is causing it. The [[drbd]] page explains how to map a DRBD device to a
VM. You can also find which logical volume is backing an instance (and
vice versa) with this command:
lvs -o+tags
This will list all logical volumes and their associated tags. If you
already know which logical volume you're looking for, you can address
it directly:
root@fsn-node-01:~# lvs -o tags /dev/vg_ganeti_hdd/4091b668-1177-41ac-9310-1eac45b46620.disk2_data
LV Tags
originstname+bacula-director-01.torproject.org
### Node failures
Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by
failing instances over to other nodes. This is currently done manually, see the
migrate section above.
This could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
manual.
### Other troubleshooting
Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
master failover, which we haven't tested. There's also upstream
documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
master failover scenario.
The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
problems.
If things get completely out of hand and the cluster becomes too
unreliable for service, the only solution is to rebuild another one
elsewhere. Since Ganeti 2.2, there is a [move-instance](http://docs.ganeti.org/ganeti/2.15/html/move-instance.html) command to
move instances between clusters that can be used for that purpose.
If Ganeti is completely destroyed and its APIs don't work anymore, the
last resort is to restore all virtual machines from
[[backup]]. Hopefully, this should not happen except in the case of a
catastrophic data loss bug in Ganeti or [[drbd]].
# Reference
## Installation
- To create a new box, follow [[new-machine-hetzner-robot]] but change
the following settings:
* Location: `FSN1`
* Operating system: Rescue
* Additional drives: 2x10TB HDD (update: starting from fsn-node-05,
we are *not* ordering additional drives to save on costs, see
[ticket 33083](https://trac.torproject.org/projects/tor/ticket/33083) for rationale)
* Add in the comment form that the server needs to be in the same
datacenter as the other machines (FSN1-DC13, but double-check)
[PX62-NVMe]: https://www.hetzner.com/dedicated-rootserver/px62-nvme?country=OTHER
1. Add the server to the two `vSwitch` systems in [Hetzner Robot web UI](https://robot.your-server.de/vswitch)
2. install openvswitch and allow modules to be loaded
touch /etc/no_modules_disabled
reboot
apt install openvswitch-switch
3. Allocate a private IP address in the `30.172.in-addr.arpa` zone for
the node.
4. copy over the `/etc/network/interfaces` from another ganeti node,
changing the `address` and `gateway` fields to match the local
entry.
5. knock on wood, cross your fingers, pet a cat, help your local
book store, and reboot:
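    reboot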
- Prepare all the nodes by configuring them in puppet. They should be
in the class `roles::ganeti::fsn` if they are part of the fsn
cluster.
- Re-enable the disabling of module loading:
rm /etc/no_modules_disabled
- reboot again
Then the node is ready to be added to the cluster, by running this on
the master node:
gnt-node add \
--secondary-ip 172.30.135.2 \
--no-ssh-key-check \
--no-node-setup \
fsn-node-02.torproject.org
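Once the new node has been added, it should show up, along with its
disk and memory capacity, in:

    gnt-node list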
If this is an entirely new cluster, you need a different procedure:
gnt-cluster init \
--master-netdev vlan-gntbe \
--vg-name vg_ganeti \
--secondary-ip 172.30.135.1 \
--enabled-hypervisors kvm \
--nic-parameters link=br0,vlan=4000 \
--mac-prefix 00:66:37 \
--no-ssh-init \
--no-etc-hosts \
fsngnt.torproject.org
The above assumes that `fsngnt` is already in DNS.
### cluster config
These could probably be merged into the cluster init, but just to document what has been done:
gnt-cluster modify --reserved-lvs vg_ganeti/root,vg_ganeti/swap
gnt-cluster modify -H kvm:kernel_path=,initrd_path=,
gnt-cluster modify -H kvm:security_model=pool
gnt-cluster modify -H kvm:kvm_extra='-device virtio-rng-pci\,bus=pci.0\,addr=0x1e\,max-bytes=1024\,period=1000'
gnt-cluster modify -H kvm:disk_cache=none
gnt-cluster modify -H kvm:disk_discard=unmap
gnt-cluster modify -H kvm:scsi_controller_type=virtio-scsi-pci
gnt-cluster modify -H kvm:disk_type=scsi-hd
gnt-cluster modify --uid-pool 4000-4019
gnt-cluster modify --nic-parameters mode=openvswitch,link=br0,vlan=4000
gnt-cluster modify -D drbd:c-plan-ahead=0,disk-custom='--c-plan-ahead 0'
gnt-cluster modify -H kvm:migration_bandwidth=950
gnt-cluster modify -H kvm:migration_downtime=500
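The resulting cluster and hypervisor parameters can be reviewed at any
time with:

    gnt-cluster info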
### Network configuration
IP allocation is managed by Ganeti through the `gnt-network(8)`
system. Say we have `192.0.2.0/24` reserved for the cluster, with
the host IP `192.0.2.100` and the gateway `192.0.2.1`. You will
create this network with:
gnt-network add --network 192.0.2.0/24 --gateway 192.0.2.1 --network6 2001:db8::/32 --gateway6 fe80::1 example-network
Then we associate the new network to the default node group:
gnt-network connect --nic-parameters=link=br0,vlan=4000,mode=openvswitch example-network default
The arguments to `--nic-parameters` come from the values configured in
the cluster, above. The current values can be found with `gnt-cluster
info`.
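The state of the address pool (reserved, allocated, and free
addresses) can be inspected with the matching `list` and `info`
commands, for example:

    gnt-network list
    gnt-network info example-network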
## SLA
As long as the cluster is not over capacity, it should be able to
survive the loss of a node in the cluster unattended.
Justified machines can be provisioned within a few business days
without problems.
New nodes can be provisioned within a week or two, depending on budget
and hardware availability.
## Design
Our first Ganeti cluster (`gnt-fsn`) is made of multiple machines
hosted with [Hetzner Robot](https://robot.your-server.de/), Hetzner's dedicated server hosting
service. All machines use the same hardware to avoid problems with
live migration. That is currently a customized build of the
[PX62 line][PX62-NVMe].
### Network layout
Machines are interconnected over a [vSwitch](https://wiki.hetzner.de/index.php/Vswitch/en), a "virtual layer 2
network" probably implemented using [Software-defined Networking](https://en.wikipedia.org/wiki/Software-defined_networking)
(SDN) on top of Hetzner's network. The details of that implementation
do not matter much to us, since we do not trust the network and run an
IPsec layer on top of the vswitch. We communicate with the `vSwitch`
through [Open vSwitch](https://en.wikipedia.org/wiki/Open_vSwitch) (OVS), which is (currently manually)
configured on each node of the cluster.
There are two distinct IPsec networks:
* `gnt-fsn-public`: the public network, which maps to the
`fsn-gnt-inet-vlan` vSwitch at Hetzner, the `vlan-gntinet` OVS
network, and the `gnt-fsn` network pool in Ganeti. it provides
public IP addresses and routing across the network. instances get
IP allocated in this network.
* `gnt-fsn-be`: the private ganeti network which maps to the
`fsn-gnt-backend-vlan` vSwitch at Hetzner and the `vlan-gntbe` OVS
network. it has no matching `gnt-network` component and IP
addresses are allocated manually in the 172.30.135.0/24 network
through DNS. it provides internal routing for Ganeti commands and
[[drbd]] storage mirroring.
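The resulting Open vSwitch configuration on a node (bridges, VLAN tags
and ports) can be inspected with:

    ovs-vsctl show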
### Hardware variations
We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)) but
in the past DSA had problems live-migrating (it wouldn't immediately
fail but there were "issues" after). So we might need to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover)
instead of migrate between those parts of the cluster. There are also
doubts that the Linux kernel supports those shiny new processors at
all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix) for
example, so it might be worth waiting a little before switching to
that new platform, even if it's cheaper. See the cluster configuration
section below for a larger discussion of CPU emulation.
### CPU emulation
Note that we might want to tweak the `cpu_type` parameter. By default,
it emulates a lot of processing that can be delegated to the host CPU
instead. If we use `kvm:cpu_type=host`, then each node will tailor the
emulation system to the CPU on the node. But that might make the live
migration more brittle: VMs or processes can crash after a live
migrate because of a slightly different configuration (microcode, CPU,
kernel and QEMU versions all play a role). So we need to find the
lowest common denominator in CPU families. The list of available
families supported by QEMU varies between releases, but is visible
with:
# qemu-system-x86_64 -cpu help
Available CPUs:
x86 486
x86 Broadwell Intel Core Processor (Broadwell)
[...]
x86 Skylake-Client Intel Core Processor (Skylake)
x86 Skylake-Client-IBRS Intel Core Processor (Skylake, IBRS)
x86 Skylake-Server Intel Xeon Processor (Skylake)
x86 Skylake-Server-IBRS Intel Xeon Processor (Skylake, IBRS)
[...]
The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
micro-architecture. The closest matching family would be
`Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
Note that newer QEMU releases (4.2, currently in unstable) have more
supported features.
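If we did settle on one of those families, pinning it cluster-wide
would be a single hypervisor parameter change (a sketch only: this is
not currently set, and live migration should be re-tested after such a
change):

    gnt-cluster modify -H kvm:cpu_type=Skylake-Server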
In that context, of course, supporting different CPU manufacturers
(say AMD vs Intel) is impractical: they will have totally different
families that are not compatible with each other. This will break live
migration, which can trigger crashes and problems in the migrated
virtual machines.
If there are problems live-migrating between machines, it is still
possible to "failover" (`gnt-instance failover` instead of `migrate`)
which shuts off the machine, fails over disks, and starts it on the
other side. That's not such a big problem: we often need to reboot
the guests when we reboot the hosts anyways. But it does complicate
our work. Of course, it's also possible that live migrates work fine
if *no* `cpu_type` at all is specified in the cluster, but that needs
to be verified.
Nodes could also be [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a
subset of nodes.
References:
* <https://dsa.debian.org/howto/install-ganeti/>
* <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>
The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install
instances. It is configured through Puppet with the [shared ganeti
module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much
as possible. The installer will:
1. setup grub to respond on the serial console
2. setup and log a random root password
3. make sure SSH is installed and log the public keys and
fingerprints
4. setup swap if a labeled partition is present, or a 512MB swapfile
otherwise
5. setup basic static networking through `/etc/network/interfaces.d`
Additional hooks then:

1. add a few base packages
2. do our own custom SSH configuration
3. fix the hostname to be a FQDN
4. add a line to `/etc/hosts`
5. add a tmpfs
There is work underway to refactor and automate the install better,
see [ticket 31239](https://trac.torproject.org/projects/tor/ticket/31239) for details.
There is no issue tracker specifically for this project; [File][] or
[search][] for issues in the [generic internal services][search] component.
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
# Discussion
## Overview
The project of creating a Ganeti cluster for Tor came up in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and set up by weasel by the end of the month.
The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
textile, unifolium, macrum, kvm4 and kvm5).
Goals included:

* arbitrary virtual machine provisioning
* redundant setup
* automated VM installation
* replacement of existing infrastructure
* fully configured in Puppet
* full high availability with automatic failover
* extra capacity for new projects