[[_TOC_]]

# How to

## Burn-in

Before we even install the machine, we should do some sort of
stress-testing or [burn-in](https://en.wikipedia.org/wiki/Burn-in) so that we don't go through the lengthy
install process only to put faulty hardware into production.

This implies testing the various components to see if they can
sustain a moderate to high load. A tool like [stressant](https://stressant.readthedocs.io/) can be used
for that purpose, but a full procedure still needs to be established.

Example stressant run:

    apt install stressant
    stressant --email torproject-admin@torproject.org --overwrite --writeSize 10% --diskRuntime 120m --logfile fsn-node-04-sda.log --diskDevice /dev/sda

This will *wipe* parts of `/dev/sda`, so be careful. If instead you
want to test inside a directory, use this:

    stressant --email torproject-admin@torproject.org  --diskRuntime 120m --logfile fsn-node-05-home-test.log --directory /home/test --writeSize 1024M

Stressant is still in development and currently has serious
limitations (e.g. it tests only one disk at a time and has a clunky
UI), but it should be a good way to get started.
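
For example, to run the same disk test across several disks in
sequence, here is a minimal sketch (the device list is an assumption
to adapt per machine):

    # sequentially burn in several disks, since stressant only tests
    # one disk per run; this *wipes* parts of each device listed
    for disk in sda sdb; do
        stressant --email torproject-admin@torproject.org --overwrite \
            --writeSize 10% --diskRuntime 120m \
            --logfile "$(hostname)-$disk.log" \
            --diskDevice "/dev/$disk"
    done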

## Installation

This document assumes the machine is already installed with a Debian
operating system. We preferably install stable, or testing when a new
stable release is close. Here are the site-specific install procedures:

* [Hetzner Cloud](howto/new-machine-hetzner-cloud)
* [Hetzner Robot](howto/new-machine-hetzner-robot)
* [Ganeti](howto/ganeti) clusters:
  * new virtual machine: [new instance procedure](howto/ganeti#adding-a-new-instance)
  * new *nodes* (which host virtual machines): [new node
    procedure](howto/ganeti#installation), normally done as a post-install configuration
* Linaro and OSUOSL: [howto/openstack](howto/openstack)
* [Cymru](howto/new-machine-cymru)

The following sites are not documented yet:

 * sunet: possibly similar to Linaro's [howto/openstack](howto/openstack), each TPA admin has
   their own account there
 * eclips.is: our account is marked as "suspended", but oddly enough we
   have 200 credits, which would give us (roughly) 32GB of RAM and 8
   vCPUs (yearly? monthly? who knows). that said, it is (separately)
   used by the metrics team for onionperf

The following sites are deprecated:

 * [howto/KVM](howto/KVM)/libvirt (really at Hetzner) - replaced by Ganeti
 * scaleway - see [ticket 32920](https://bugs.torproject.org/32920)

## Post-install configuration

The post-install configuration mostly takes care of bootstrapping
Puppet and everything else follows from there. There are, however,
still some unrelated manual steps but those should eventually all be
automated (see [ticket #31239](https://bugs.torproject.org/31239) for details of that work).

### Pre-requisites

The procedure below assumes the following steps have already been
taken by the installer:

 0. partitions have been correctly set up, including some (>=1GB) swap
    space (or at least a swap file) and a `tmpfs` in `/tmp`

 1. a minimal Debian install with security updates has been booted
    (see also [ticket #31957](https://bugs.torproject.org/31957) for upgrade automation)

 2. a hostname has been set, picked from the [doc/naming-scheme](doc/naming-scheme)
    and the short hostname (e.g. `test`) resolves to a fully qualified
    domain name (e.g. `test.torproject.org`) in the `torproject.org`
    domain (i.e. `/etc/hosts` is correctly configured). this can be
    fixed with:

        fab -H root@38.229.82.108 host.rewrite-hosts chi-node-05.torproject.org 38.229.82.108

    WARNING: The short hostname (e.g. `foo` in `foo.example.com`) MUST
    NOT be longer than 21 characters, as that will crash the backup
    server because its label will be too long:
    
        Sep 24 17:14:45 bacula-director-01 bacula-dir[1467]: Config error: name torproject-static-gitlab-shim-source.torproject.org-full.${Year}-${Month:p/2/0/r}-${Day:p/2/0/r}_${Hour:p/2/0/r}:${Minute:p/2/0/r} length 130 too long, max is 127
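
    a quick way to check the length beforehand (a sketch, not an
    official check):

        # fail early if the short hostname exceeds the 21-character limit
        h=$(hostname --short)
        [ "${#h}" -le 21 ] || echo "short hostname '$h' is ${#h} chars, max 21"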

 3. a public IP address has been set and the host is available over
    SSH on that IP address. this can be fixed with:

        fab -H root@88.99.194.57 host.rewrite-interfaces 88.99.194.57 26 88.99.194.1 2a01:4f8:221:2193::2 64 fe80::1

    If the IPv6 address is not known, it might be guessable from the
    MAC address. Try this:
    
        ipv6calc --action prefixmac2ipv6 --in prefix+mac --out ipv6 $SUBNET $MAC

    ... where `$SUBNET` is the (known) subnet from the upstream
    provider and `$MAC` is the MAC address as found in `ip link show
    up`.
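
    for example, to extract the MAC address automatically (a sketch;
    the interface name `eth0` is an assumption):

        # derive the EUI-64 IPv6 address from the provider subnet and
        # the interface MAC address
        MAC=$(ip -o link show eth0 | grep -oP 'link/ether \K\S+')
        ipv6calc --action prefixmac2ipv6 --in prefix+mac --out ipv6 "$SUBNET" "$MAC"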

 4. ensure reverse DNS is set for the machine. this can be done either
    in the upstream configuration dashboard (e.g. Hetzner) or in our
    zone files, in the `dns/domains.git` repository.

    Pro tip: `dig -x` will show you an SOA record pointing at the
    authoritative DNS server for the relevant zone, and will even show
    you the right record to create. Since IPv6 records are
    particularly painful to create, you should use this all the time.

    For example, the IP addresses of `chi-node-01` are `38.229.82.104`
    and `2604:8800:5000:82:baca:3aff:fe5d:8774`, so the records to
    create are:

        $ dig -x 2604:8800:5000:82:baca:3aff:fe5d:8774 38.229.82.104
        [...]
        ;; QUESTION SECTION:
        ;4.7.7.8.d.5.e.f.f.f.a.3.a.c.a.b.2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa. IN PTR

        ;; AUTHORITY SECTION:
        2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa. 3552 IN SOA nevii.torproject.org. hostmaster.torproject.org. 2021020201 10800 3600 1814400 3601

        [...]

        ;; QUESTION SECTION:
        ;104.82.229.38.in-addr.arpa.	IN	PTR

        ;; AUTHORITY SECTION:
        82.229.38.in-addr.arpa.	2991	IN	SOA	ns1.cymru.com. noc.cymru.com. 2020110201 21600 3600 604800 7200

        [...]

    In this case, you should add this record to
    `82.229.38.in-addr.arpa.`:

        104.82.229.38.in-addr.arpa.	IN	PTR chi-node-01.torproject.org.

    And this to `2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa.`:
    
        4.7.7.8.d.5.e.f.f.f.a.3.a.c.a.b.2.8.0.0.0.0.0.5.0.0.8.8.4.0.6.2.ip6.arpa. IN PTR chi-node-01.torproject.org.

    Conversely, say you need to add an IP address for Hetzner
    (e.g. `88.198.8.180`); they will already have a dummy PTR
    allocated:
    
        180.8.198.88.in-addr.arpa. 86400 IN	PTR	static.88-198-8-180.clients.your-server.de.

    The `your-server.de` domain is owned by Hetzner, so you should
    update that record in their control panel.
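
    once the record is deployed, you can verify it with (a sketch):

        # should print chi-node-01.torproject.org.
        dig -x 38.229.82.104 +short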

 5. DNS works on the machine (i.e. `/etc/resolv.conf` is configured to
    talk to a working resolver, but not necessarily ours, which Puppet
    will handle)
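
    a quick way to verify (a sketch):

        # confirm the configured resolver actually answers queries;
        # deb.debian.org is an arbitrary well-known name
        getent hosts deb.debian.org || echo "resolver is broken"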

 6. a strong root password has been set in the password manager; for
    Ganeti instance installs, this implies resetting the password, as
    the password set by the installer was written to disk (TODO: move
    to trocla? [#33332](https://bugs.torproject.org/33332))
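
    one way to generate and reset it (a sketch; `pwgen` is an
    assumption, use whatever fits the password manager workflow):

        # generate a strong password, store it in the password
        # manager, then set it on the new machine
        pwgen --secure 32 1
        passwd root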

 7. `grub-pc/install_devices` debconf parameter is correctly set, to
    allow unattended upgrades of `grub-pc` to function. The command
    below can be used to bring up an interactive prompt in case it
    needs to be fixed:

         debconf-show grub-pc | grep -qoP "grub-pc/install_devices: \K.*" || dpkg-reconfigure grub-pc
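
    a non-interactive alternative (a sketch, assuming `/dev/sda` is
    the boot disk):

        # preseed the answer instead of using the interactive prompt
        echo "grub-pc grub-pc/install_devices multiselect /dev/sda" \
            | debconf-set-selections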

### Main procedure

All commands to be run as root unless otherwise noted.

IMPORTANT: make sure you follow the [pre-requisites checklist
above](#pre-requisites)! Some installers cover all of those steps, but most do not.

 1. if the machine is not inside a [ganeti](ganeti) cluster (which has its
    own inventory), allocate and document the machine in the
    [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395), and the [services page](service), if it's a
    new service

 2. clone the `tsa-misc` git repository on the machine:
 
        git clone https://git.torproject.org/admin/tsa-misc.git

    Make sure the repo matches a known good copy. You can check the
    current head with:
    
        git -C tsa-misc show-ref

    You can then make sure it matches your local copy with something
    like:
    
        git -C tsa-misc show-ref master | grep 4879545cda75c44a1cf9efcf377d2c7e45683ac9

    TODO: just ship the parts below as part of the installer so we
    don't need that checkout

 3. bootstrap puppet:

    * on the new machine, run `installer/puppet-bootstrap-client`
      from the `tsa-misc` git repo cloned earlier. copy-paste the
      generated checksum literally (including the filename) into the
      script waiting on the Puppet server (see the last step below).

    * This will tell you to add the host into LDAP; this should be
      done on the LDAP server (`db.torproject.org`), with:

          ldapvi -ZZ --encoding=ASCII --ldap-conf -h db.torproject.org -D "uid=$USER,ou=users,dc=torproject,dc=org"

      Make sure you review all fields, in particular `location` (`l`),
      `physicalHost`, `description` and `purpose` which do not have
      good defaults.

      See the [howto/upgrades](howto/upgrades) section for information about the
      `rebootPolicy` field. See also the [ldapvi manual](http://www.lichteblau.com/ldapvi/manual/) for more
      information.

    * This will *also* tell you to run the bootstrap script
      (`tpa-puppet-sign-client`) on the Puppet server (currently
      `pauli`), which will prompt you for the checksum generated by
      the client script above. It is necessary to run this script to
      unblock the firewall so the client can connect and generate its
      certificate.

 4. while Puppet is bootstrapping, you can add the node to
    [howto/nagios](howto/nagios), in `tor-nagios/config/nagios-master.cfg`
    (TODO: puppetize, in [ticket #32901](https://bugs.torproject.org/32901))

 5. ... and if the machine is handling mail, add it to [dnswl.org](https://www.dnswl.org/)
    (password in tor-passwords, `hosts-extra-info`)

 6. you will probably want to create a `/srv` filesystem to hold
    service files and data, unless this is a very minimal
    system. Typically, installers create the partition, but will
    *not* create the filesystem or configure it in `/etc/fstab`:

        mkfs -t ext4 -j /dev/sdc &&
        printf 'UUID=%s\t/srv\text4\tdefaults\t1\t2\n' $(blkid --match-tag UUID --output value /dev/sdc) >> /etc/fstab  &&
        mount /srv
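
    you can confirm the result with (a sketch):

        # verify the new filesystem is mounted where expected
        findmnt /srv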

 7. once everything is done, reboot the new machine to make sure
    *that* still works:

        reboot

At this point, the machine has a basic TPA setup. You will probably
need to assign it a "role" in Puppet to get it to do anything. 

# Reference

## Design

If you want to better understand the different installation
procedures, there is an install flowchart that was made with [Draw.io](https://draw.io).

![install.png](/howto/new-machine/install.png)

There are also per-site install graphs:

 * [install-hetzner-cloud.png](/howto/new-machine/install-hetzner-cloud.png)
 * [install-hetzner-robot.png](/howto/new-machine/install-hetzner-robot.png)
 * [install-ganeti.png](/howto/new-machine/install-ganeti.png)

To edit those graphics, head to the <https://draw.io> website (or
install their Electron desktop app) and load the [install.drawio](new-machine/install.drawio)
file.

Those diagrams were created as part of the redesign of the install
process, to better understand the various steps of the process and see
how they could be refactored. They should not be considered an
authoritative version of how the process should be followed. 

The text representation in this wiki remains the reference copy.

## Issues

Issues regarding installation of new machines are wide-ranging and do
not have a specific component.

The install system is manual and not completely documented for all
sites. It needs to be automated, which is discussed below and in
[ticket 31239: automate installs](https://bugs.torproject.org/31239).

A good example of the problems that can come up with variations in
the install process is [ticket 31781: ping fails as a regular user on
new VMs](https://bugs.torproject.org/31781).

# Discussion

This section discusses background and implementation details of
installation of machines in the project. It shouldn't be necessary for
day to day operation.

## Overview

The current install procedures work, but have only recently been
formalized, mostly because we rarely set up machines. We do expect,
however, to set up a significant number of machines in 2019, or at
least enough to warrant automating the install process better.

Automating installs is also critical according to Tom Limoncelli,
co-author of the [Practice of System and Network Administration](https://the-sysadmin-book.com/). In
his [Ops report card](http://opsreportcard.com/), [question 20](http://opsreportcard.com/section/20) explains:

> If OS installation is automated then all machines start out the
> same. Fighting entropy is difficult enough. If each machine is
> hand-crafted, it is impossible.
>
> If you install the OS manually, you are wasting your time twice:
> Once when doing the installation and again every time you debug an
> issue that would have been prevented by having consistently
> configured machines.
>
> If two people install OSs manually, half are wrong but you don't
> know which half. Both may claim they use the same procedure but I
> assure you they are not. Put each in a different room and have them
> write down their procedure. Now show each sysadmin the other
> person's list. There will be a fistfight.

In that context, it's critical to automate a reproducible install
process. This gives us a consistent platform that Puppet runs on top
of, with no manual configuration.

## Goals

The project of automating the install is documented in [ticket
31239][].

[ticket 31239]: https://bugs.torproject.org/31239

### Must have

 * unattended installation
 * reproducible results
 * post-installer configuration (i.e. not a full installer, see below)
 * support for running in our different environments (Hetzner Cloud,
   Robot, bare metal, Ganeti...)

### Nice to have

 * packaged in Debian
 * full installer support:
   * RAID, LUKS, etc filesystem configuration
   * debootstrap, users, etc

### Non-Goals

 * full configuration management stack - that's done by [howto/puppet](howto/puppet)

## Approvals required

TBD.

## Proposed Solution

The solution currently being explored is to assume the existence of a
rescue shell (SSH) of some sort and use [fabric](fabric) to deploy
everything on top of it, up to [puppet](puppet). Then everything should be
"puppetized" to remove manual configuration steps. See also [ticket
31239][] for a discussion of alternatives, which are also detailed
below.

## Cost

TBD.

## Alternatives considered

 * [Ansible](https://www.ansible.com/) - configuration management that duplicates [howto/puppet](howto/puppet)
   but which we may want to use to bootstrap machines instead of yet
   another custom thing that operators would need to learn.
 * [cloud-init](https://cloud-init.io/) - built into many cloud images (e.g. Amazon), can
   do [rudimentary filesystem setup](https://cloudinit.readthedocs.io/en/latest/topics/modules.html#disk-setup) (no RAID/LUKS/etc, but ext4
   and disk partitioning are okay), [config can be fetched over
   HTTPS](https://cloudinit.readthedocs.io/en/latest/topics/datasources/nocloud.html), assumes it runs on first boot, but could be coerced to
   run manually (e.g. `fgrep -r cloud-init /lib/systemd/ | grep
   Exec`), [ganeti-os-interface backend](https://github.com/neicnordic/ganeti-os-nocloud)
 * [cobbler](https://cobbler.github.io/) - takes care of PXE and boot, delegates to kickstart
   the autoinstall, more relevant to RPM-based distros
 * [curtin](https://launchpad.net/curtin) - "a "fast path" installer designed to install Ubuntu
   quickly.  It is blunt, brief, snappish, snippety and
   unceremonious." Ubuntu-specific, not in Debian, but has strong
   [partitioning support](https://curtin.readthedocs.io/en/latest/topics/storage.html) (ZFS, LVM, LUKS, etc). part
   of the larger [MAAS](https://maas.io/) project
 * [FAI](https://fai-project.org/) - built by a debian developer, used to build live images
   since buster, might require complex setup (e.g. an NFS server),
   [setup-storage(8)](https://manpages.debian.org/buster/fai-setup-storage/setup-storage.8.en.html) might be reusable on its own. uses Tar-based
   images created by FAI itself, requires network control or custom
   ISO boot, requires a "server" (the [fai-server](https://packages.debian.org/unstable/fai-server) package), not
   directly supported by Ganeti, although there are [hacks to make it
   work](https://github.com/ganeti/ganeti/wiki/System-template-with-FAI) and there is a [ganeti-os-interface backend now](https://github.com/glance-/ganeti-os-fai)
 * [himblick](https://github.com/himblick/himblick) has some interesting post-install configuration bits
   in Python, along with pyparted bridges
 * [list of debian setup tools](https://wiki.debian.org/SystemBuildTools), see also
   [AutomatedInstallation](https://wiki.debian.org/AutomatedInstallation)
 * [livewrapper](https://salsa.debian.org/enrico/live-wrapper) is also one of those installers, in a way
 * [vmdb2](https://vmdb2.liw.fi/) - a rewrite of vmdebootstrap, which uses a YAML file to
   describe a set of "steps" to take to install Debian, should work on
   VM images but also disks, no RAID support and a [significant number
   of bugs](https://bugs.debian.org/cgi-bin/pkgreport.cgi?repeatmerged=no&src=vmdb2) might affect reliability in production
 * [bdebstrap](https://github.com/bdrung/bdebstrap) - yet another one of those tools, built on top of
   mmdebstrap, YAML
 * [MAAS](https://maas.io/) - PXE-based, assumes network control which we don't have
   and has all sorts of features we don't want
 * [howto/puppet](howto/puppet) - Puppet could bootstrap itself, with `puppet apply` run
   from a clone of the git repo. could be extended as deep as we want.
 * [terraform](https://www.terraform.io/) - config management for the cloud kind of thing,
   supports Hetzner Cloud, but not <del>Hetzner Robot or</del> Ganeti
   (update: there is a [Hetzner robot plugin](https://registry.terraform.io/providers/mwudka/hetznerrobot/latest) now)

Unfortunately, I ruled out the official debian-installer because of
the complexity of the preseeding system and partman. It also wouldn't
work for installs on Hetzner Cloud or Ganeti.