ganeti.mdwn

    x86 Broadwell             Intel Core Processor (Broadwell)
    [...]
    x86 Skylake-Client        Intel Core Processor (Skylake)
    x86 Skylake-Client-IBRS   Intel Core Processor (Skylake, IBRS)
    x86 Skylake-Server        Intel Xeon Processor (Skylake)
    x86 Skylake-Server-IBRS   Intel Xeon Processor (Skylake, IBRS)
    [...]

The current [PX62 line][PX62-NVMe] is based on the [Coffee Lake](https://en.wikipedia.org/wiki/Coffee_Lake) Intel
micro-architecture. The closest matching family would be
`Skylake-Server` or `Skylake-Server-IBRS`, [according to wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Compiler_support).
Note that newer QEMU releases (4.2, currently in unstable) have more
supported features.

In that context, of course, supporting different CPU manufacturers
(say AMD vs Intel) is impractical: they will have totally different
families that are not compatible with each other. This will break live
migration, which can trigger crashes and problems in the migrated
virtual machines.

If there are problems live-migrating between machines, it is still
possible to "failover" (`gnt-instance failover` instead of `migrate`)
which shuts off the machine, fails over disks, and starts it on the
other side. That's not such of a big problem: we often need to reboot
the guests when we reboot the hosts anyways. But it does complicate
our work. Of course, it's also possible that live migrates work fine
if *no* `cpu_type` at all is specified in the cluster, but that needs
to be verified.

Nodes could also [grouped](http://docs.ganeti.org/ganeti/2.15/man/gnt-group.html) to limit (automated) live migration to a
subset of nodes.

References:

 * <https://dsa.debian.org/howto/install-ganeti/>
 * <https://qemu.weilnetz.de/doc/qemu-doc.html#recommendations_005fcpu_005fmodels_005fx86>

### Installer

The [ganeti-instance-debootstrap](https://tracker.debian.org/pkg/ganeti-instance-debootstrap) package is used to install
instances. It is configured through Puppet with the [shared ganeti
module](https://forge.puppet.com/smash/ganeti), which deploys a few hooks to automate the install as much
as possible. The installer will:

 1. setup grub to respond on the serial console
 2. setup and log a random root password
 3. make sure SSH is installed and log the public keys and
    fingerprints
 4. setup swap if a labeled partition is present, or a 512MB swapfile
    otherwise
 5. setup basic static networking through `/etc/network/interfaces.d`

We have custom configurations on top of that to:

 1. add a few base packages
 2. do our own custom SSH configuration
 3. fix the hostname to be a FQDN
 4. add a line to `/etc/hosts`
 5. add a tmpfs

There is work underway to refactor and automate the install better,
see [ticket 31239](https://trac.torproject.org/projects/tor/ticket/31239) for details.

## Issues

There is no issue tracker specifically for this project, [File][] or
[search][] for issues in the [generic internal services][search] component.

 [File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
 [search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team

# Discussion

## Overview

The project of creating a Ganeti cluster for Tor has appeared in the
summer of 2019. The machines were delivered by Hetzner in July 2019
and setup by weasel by the end of the month.

## Goals

The goal was to replace the aging group of KVM servers (kvm[1-5], AKA
textile, unifolium, macrum, kvm4 and kvm5).

### Must have

 * arbitrary virtual machine provisionning
 * redundant setup
 * automated VM installation
 * replacement of existing infrastructure

### Nice to have

 * fully configured in Puppet
 * full high availability with automatic failover
 * extra capacity for new projects

### Non-Goals

 * Docker or "container" provisionning - we consider this out of scope
   for now
 * self-provisionning by end-users: TPA remains in control of
   provisionning

## Approvals required

A budget was proposed by weasel in may 2019 and approved by Vegas in
June. An extension to the budget was approved in january 2020 by
Vegas.

## Proposed Solution

Setup a Ganeti cluster of two machines with a Hetzner vSwitch backend.

## Cost

The design based on the [PX62 line][PX62-NVMe] has the following monthly cost
structure:

 * per server: 118EUR (79EUR + 39EUR for 2x10TB HDDs)
 * IPv4 space: 35.29EUR (/27)
 * IPv6 space: 8.40EUR (/64)
 * bandwidth cost: 1EUR/TB (currently 38EUR)

At three servers, that adds up to around 435EUR/mth. Up to date costs
are available in the [Tor VM hosts.xlsx](https://nc.torproject.net/apps/onlyoffice/5395) spreadsheet.

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->