retire-a-host.mdwn

# Decommissioning a host

 1. long before (weeks or months) the machine is decomissioned, make
    sure users are aware it will go away and of its replacement services
 2. remove the host from `tor-nagios/config/nagios-master.cfg`
 3. if applicable, stop the VM:

    * If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the
    primary service on the machine

    * If the machine is on ganeti: `gnt-instance remove $host`

 4. On KVM hosts, undefine the VM: `virsh undefine $host`

 5. wipe host data, possibly with a delay:

    * On some KVM hosts, remove the LVM logical volumes:

          echo 'lvremove -y vgname/lvname' | at now + 7 days

      Use `lvs` will list the logical volumes on the machine.

    * Other KVM hosts use file-backed storage:
    
          echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days

    * for a normal machine or a machine we do not own the parent host
      for, wipe the disks using the method described below

 6. remove it from ud-ldap: the host entry and any `@<host>` group memberships there might be as well as any `sudo` passwords users might have configured for that host
 7. if it has any associated records in `tor-dns/domains` or
    `auto-dns`, or upstream's reverse dns thing, remove it from there
    too. e.g.
    
        grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1
    
    ... and check upstream reverse DNS.
 8. on pauli: `read host ; puppet node clean $host.torproject.org && puppet node deactivate $host.torproject.org`
 9. grep the `tor-puppet` repo for the host (and maybe its IP addresses) and clean up; also look for files with hostname in their name
 10. clean host from `tor-passwords`
 11. remove any certs and backup keys from letsencrypt-domains and
     letsencrypt-domains/backup-keys git repositories that are no
     longer relevant:

        git -C letsencrypt-domains grep -e $host -e storm.torproject.org
        # remove entries found above
        git -C letsencrypt-domains commit
        git -C letsencrypt-domains push
        find letsencrypt-domains/backup-keys -name "$host.torproject.org" -o -name 'storm.torproject.org*' -delete
        git -C letsencrypt-domains/backup-keys commit
        git -C letsencrypt-domains/backup-keys push

     Also clean up the relevant files on the letsencrypt master
     (currently `nevii`), for example:

        ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
        ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
 12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
     (password in tor-passwords, `hosts-extra-info`) - consider that
     it can take a long time (weeks? months?) to be able to "re-add"
     an IP address in that service, so if that IP can eventually be
     reused, it might be better to keep it there in the short term
 13. schedule a removal of the host's backup, on the backup server
     (currently `bungei`):

        cd  /srv/backups/bacula/
        mv $host.torproject.org $host.torproject.org-OLD
        echo rm -rf /srv/backups/bacula/$host.torproject.org.OLD/ | at now + 30 days

 14. remove from the machine from this wiki (if present in
     documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395), and, if it's an
     entire service, the [services page](https://trac.torproject.org/projects/tor/wiki/org/operations/services)
 15. if it's a physical machine or a virtual host we don't control,
     schedule removal from racks or hosts with upstream

TODO: remove the client from the Bacula catalog, see <https://trac.torproject.org/projects/tor/ticket/30880>.

## Wiping disks

To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. We do this with the
`nwipe(1)` command, which should be installed before anything:

    apt install nwipe

If there's a RAID array, first wipe one of the disks by taking it
offline and writing garbage:

    mdadm --fail /dev/md0 /dev/sdb1 &&
    mdadm --remove /dev/md0 /dev/sdb1 &&
    mdadm --fail /dev/md1 /dev/sdb3 &&
    mdadm --remove /dev/md1 /dev/sdb3 &&
    : etc, for the other RAID elements in /proc/mdstat &&
    nwipe --autonuke --method=random --verify=off /dev/sdb

This will take a long time. Note that it will start a GUI which is
useful because it will give you timing estimates, which the
commandline version [does not provide](https://github.com/martijnvanbrummelen/nwipe/issues/196).

When you return:

 1. start a `screen` session with a static `busybox` as your `SHELL`
    that will survive disk wiping:

        # make sure /tmp is on a tmpfs first!
        cp -av /root /tmp/root
        mount -o bind /tmp/root /root
        cp /bin/busybox /tmp/root/sh
        export SHELL=/tmp/root/sh
        exec screen -s $SHELL

 2. kill all processes but the SSH daemon, your SSH connexion and
    shell. this will vary from machine to machine, but a good way is
    to list all processes with `systemctl status` and `systemctl stop`
    the services one by one. Hint: multiple services can be passed on
    the same `stop` command, for example:

        systemctl stop acpid atd bacula-fd bind9 cron dbus dbus.socket fail2ban haveged irqbalance libvirtd lvm2-lvmetad.service mdmonitor nagios-nrpe-server ntp openvswitch-switch postfix prometheus-bind-exporter prometheus-node-exporter smartd strongswan syslog-ng.service systemd-journald systemd-journald-audit.socket systemd-journald-dev-log.socket systemd-journald.socket systemd-logind.service systemd-udevd systemd-udevd systemd-udevd-control.socket systemd-udevd-control.socket systemd-udevd-kernel.socket systemd-udevd-kernel.socket ulogd2 unbound virtlogd virtlogd.socket

 3. disable swap:

        swapoff -a

 4. unmount everything that can be unmounted (except `/proc`):

        umount -a

 5. remount everything else readonly:

        mount -o remount,ro /

 6. sync disks:

        sync

 7. wipe the remaining disk (note the dangerous `-f`) and shutdown:

        nwipe --autonuke --method=random --verify=off /dev/sda ; \
        echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
        sleep 60 ; \
        echo o > /proc/sysrq-trigger