We considered experimenting with the new AX line ([AX51-NVMe](https://www.hetzner.com/dedicated-rootserver/ax51-nvme?country=OTHER)) but
in the past DSA had problems live-migrating (it wouldn't immediately
fail but there were "issues" after). So we might need to [failover](http://docs.ganeti.org/ganeti/2.15/man/gnt-instance.html#failover)
instead of migrate between those parts of the cluster. There are also
doubts that the Linux kernel supports those shiny new processors at
all: similar processors had [trouble booting before Linux 5.5](https://www.phoronix.com/scan.php?page=news_item&px=Threadripper-3000-MCE-5.5-Fix) for
example, so it might be worth waiting a little before switching to
that new platform, even if it's cheaper. See the cluster configuration
section below for a larger discussion of CPU emulation.
### CPU emulation
Note that we might want to tweak the `cpu_type` parameter. By default,
it emulates a lot of processing that can be delegated to the host CPU
instead. If we use `kvm:cpu_type=host`, then each node will tailor the
emulation system to the CPU on the node. But that might make the live
migration more brittle: VMs or processes can crash after a live
migrate because of a slightly different configuration (microcode, CPU,
kernel and QEMU versions all play a role). So we need to find the
lowest common denominator in CPU families. The list of available
families supported by QEMU varies between releases, but is visible
with:

    # qemu-system-x86_64 -cpu help
    Available CPUs:
    x86 486
    x86 Broadwell Intel Core Processor (Broadwell)
    [...]
    x86 Skylake-Client Intel Core Processor (Skylake)
    x86 Skylake-Client-IBRS Intel Core Processor (Skylake, IBRS)
    x86 Skylake-Server Intel Xeon Processor (Skylake)
    x86 Skylake-Server-IBRS Intel Xeon Processor (Skylake, IBRS)
    [...]

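
Since the model list varies between QEMU releases, one way to narrow down candidates is to intersect the model names reported on each node. The following is only a sketch: the two lists are hypothetical sample data, and the `awk` extraction in the comment assumes the `x86 <model> <description>` output format shown above. It only tells us which models both QEMU versions know about, not which ones the host CPUs can actually run.

```shell
# Hypothetical sample data: model names as two nodes might report them.
# On a real node, something like this could produce the list, assuming the
# "x86 <model> <description>" output format:
#   qemu-system-x86_64 -cpu help | awk '$1 == "x86" { print $2 }'
tmp=$(mktemp -d)
printf '%s\n' 486 Broadwell Skylake-Client Skylake-Server > "$tmp/node-a"
printf '%s\n' 486 Broadwell Skylake-Server Cascadelake-Server > "$tmp/node-b"

# comm(1) needs sorted input; -12 keeps only the lines present in both files.
sort -u "$tmp/node-a" > "$tmp/node-a.sorted"
sort -u "$tmp/node-b" > "$tmp/node-b.sorted"
common=$(comm -12 "$tmp/node-a.sorted" "$tmp/node-b.sorted")
rm -r "$tmp"

# Print the models known to both nodes, one per line.
echo "$common"
```
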
The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
micro-architecture. The closest matching family would be
`Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
Note that newer QEMU releases (4.2, currently in unstable) have more
supported features.
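
If we settle on one of those families, pinning it cluster-wide could look like the sketch below. It assumes `Skylake-Server-IBRS` is the right choice, which has not been verified here, and it uses a hypothetical `RUN` switch so that the commands are only printed by default. Running instances would presumably only pick up the new model after a full stop and start.

```shell
# RUN is a hypothetical switch: it defaults to "echo", so the commands below
# are only printed. Set RUN="" on the Ganeti master to actually execute them.
RUN="${RUN-echo}"

# Assumption: the family chosen after comparing all nodes in the cluster.
cpu_type="Skylake-Server-IBRS"

# Set the KVM hypervisor parameter for the whole cluster.
$RUN gnt-cluster modify -H "kvm:cpu_type=${cpu_type}"

# Review the cluster configuration, including hypervisor parameters.
$RUN gnt-cluster info
```
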
In that context, of course, supporting different CPU manufacturers
(say AMD vs Intel) is impractical: they will have totally different
families that are not compatible with each other. This will break live
migration, which can trigger crashes and problems in the migrated
virtual machines.
If there are problems live-migrating between machines, it is still
possible to "failover" (`gnt-instance failover` instead of `migrate`)
which shuts off the machine, fails over disks, and starts it on the
other side. That's not such a big problem: we often need to reboot
the guests when we reboot the hosts anyways. But it does complicate
our work. Of course, it's also possible that live migrates work fine
if *no* `cpu_type` at all is specified in the cluster, but that needs
to be verified.
Nodes could also be [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a
subset of nodes.
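
As a rough illustration of the two ways to move an instance and of node grouping, here is a sketch with made-up names: the instance, group and node names are hypothetical, and `RUN` is a hypothetical switch that makes the script only print the commands by default.

```shell
# RUN is a hypothetical switch: it defaults to "echo", so the commands below
# are only printed. Set RUN="" on the Ganeti master to actually execute them.
RUN="${RUN-echo}"

instance="test-01.example.org"   # hypothetical instance name
group="px62"                     # hypothetical node group name

# Live migration: the instance keeps running while it moves to its secondary.
$RUN gnt-instance migrate "$instance"

# Failover: the instance is shut down, then started on the secondary node.
$RUN gnt-instance failover "$instance"

# Put nodes with the same CPU family into their own group.
$RUN gnt-group add "$group"
$RUN gnt-group assign-nodes "$group" node-01.example.org node-02.example.org
```
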
References:
* <https://dsa.debian.org/howto/install-ganeti/>
* <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>
The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install
instances. It is configured through Puppet with the [shared ganeti
module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much
as possible. The installer will:
1. set up grub to respond on the serial console
2. set up and log a random root password
3. make sure SSH is installed and log the public keys and
   fingerprints
4. set up swap if a labeled partition is present, or a 512MB swapfile
   otherwise
5. set up basic static networking through `/etc/network/interfaces.d`
1. add a few base packages
2. do our own custom SSH configuration
3. fix the hostname to be a FQDN
4. add a line to `/etc/hosts`
5. add a tmpfs
There is work underway to refactor and automate the install better,
see [ticket 31239](https://trac.torproject.org/projects/tor/ticket/31239) for details.
There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [generic internal services][search] component.
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
# Discussion
## Overview
The project of creating a Ganeti cluster for Tor emerged in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and set up by weasel by the end of the month.
The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
textile, unifolium, macrum, kvm4 and kvm5).
* arbitrary virtual machine provisioning
* redundant setup
* automated VM installation
* replacement of existing infrastructure
* fully configured in Puppet
* full high availability with automatic failover
* extra capacity for new projects
* Docker or "container" provisioning - we consider this out of scope
  for now
* self-provisioning by end-users: TPA remains in control of
  provisioning
A budget was proposed by weasel in May 2019 and approved by Vegas in
June. An extension to the budget was approved in January 2020 by
Vegas.
## Proposed Solution
Set up a Ganeti cluster of two machines with a Hetzner vSwitch backend.
The design based on the [PX62 line][PX62-NVMe] has the following monthly cost
structure:
* per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
* IPv4 space: 35.29EUR (/27)
* IPv6 space: 8.40EUR (/64)
* bandwidth cost: 1EUR/TB (currently 38EUR)
At three servers, that adds up to around 435EUR/mth. Up-to-date costs
are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.
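
The monthly total can be checked with a quick calculation. This sketch only uses the figures listed above; the variable names are made up for the example.

```shell
servers=3        # number of PX62 servers
per_server=118   # EUR/month: 79 base + 39 for 2x10TB HDDs
ipv4=35.29       # EUR/month for the /27
ipv6=8.40        # EUR/month for the /64
bandwidth=38     # EUR/month at 1EUR/TB

# Sum the per-server cost and the shared network costs.
total=$(awk -v n="$servers" -v s="$per_server" -v v4="$ipv4" -v v6="$ipv6" -v bw="$bandwidth" \
    'BEGIN { printf "%.2f", n * s + v4 + v6 + bw }')
echo "monthly total: ${total} EUR"
```

With the values above this prints `monthly total: 435.69 EUR`, which matches the rounded figure of 435EUR/mth.
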
## Alternatives considered
<!-- include benchmarks and procedure if relevant -->