# Decommissioning a host

 1. long before the machine is decommissioned (weeks or months ahead),
    make sure users know it is going away and which services replace it
 1. remove the host from `tor-nagios/config/nagios-master.cfg`
 2. if applicable, stop the VM: `virsh destroy $host`, or at least
    stop the primary service on the machine
 3. if applicable, undefine the VM: `virsh undefine $host`
 4. wipe host data, possibly with a delay:
  
    * if applicable, remove the LVM logical volumes or virtual disk
      files:
      
          echo 'lvremove -y vgname/lvname' | at now + 7 days

    * for a physical machine, or a virtual machine whose parent host
      we do not control, wipe the disks using the method described below

 5. remove it from ud-ldap: the host entry, any `@<host>` group memberships, and any `sudo` passwords users might have configured for that host
 6. if it has any associated records in `tor-dns/domains`, `auto-dns`, or the upstream provider's reverse DNS, remove them from there too
 7. on pauli: `read host ; puppet node clean $host.torproject.org && puppet node deactivate $host.torproject.org`
 8. grep the `tor-puppet` repo for the host (and maybe its IP addresses) and clean up
 9. clean host from `tor-passwords`
 10. remove the machine from the [Nextcloud spreadsheet](https://nc.riseup.net/remote.php/webdav/tpa/Tor%20VM%20Hosts.xlsx)
 11. schedule a removal of the host's backup, on the backup server
     (currently `bungei`):

        echo rm -rf /srv/backups/bacula/$host/ | at now + 30 days

 12. if it's a physical machine or a virtual host we don't control,
     schedule removal from racks or hosts with upstream

TODO: remove the client from the Bacula catalog, see <https://trac.torproject.org/projects/tor/ticket/30880>.
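
Several of the steps above come down to finding leftover mentions of the host in a git checkout (`tor-puppet`, `tor-dns`, `tor-nagios`, `tor-passwords`). A minimal sketch of a helper for that, assuming local checkouts of those repositories; the function name `find_host_refs` and the example paths are hypothetical, not part of any existing tooling:

```shell
# find_host_refs HOST DIR...
# List the files under each DIR that still mention HOST.
# Hypothetical helper: the name and the checkout paths are assumptions.
find_host_refs() {
    host="$1"
    shift
    # -r: recurse, -I: skip binary files, -l: print file names only,
    # -F: treat HOST as a fixed string rather than a regular expression
    grep -rIlF --exclude-dir=.git -e "$host" -- "$@"
}
```

For example, `find_host_refs "$host" ~/src/tor-puppet ~/src/tor-dns` prints the files that still need cleaning up, and returns non-zero once nothing is left. Run it again with the host's IP addresses, since those will not match the hostname.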

## Wiping disks

To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. If there's a RAID
array, first wipe one of the disks by taking it offline and writing
garbage:

    mdadm --fail /dev/md0 /dev/sdb1 &&
    mdadm --remove /dev/md0 /dev/sdb1 &&
    mdadm --fail /dev/md1 /dev/sdb2 &&
    mdadm --remove /dev/md1 /dev/sdb2 &&
    : etc, for the other RAID elements (see /proc/mdstat) &&
    badblocks -w -s -v -p 2 /dev/sdb
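
The list of `mdadm` calls depends on the layout in `/proc/mdstat`. A rough sketch that prints (but does not run) the `--fail` and `--remove` pairs for one disk, assuming `sd`-style partition names such as `sdb1`; the function name `mdadm_detach_cmds` is made up for this example, and whole-disk array members are not handled:

```shell
# mdadm_detach_cmds DISK < /proc/mdstat
# Print the mdadm commands that pull every partition of DISK out of its
# RAID arrays. Nothing is executed; review the output first.
mdadm_detach_cmds() {
    awk -v disk="$1" '
        # Array lines look like: md0 : active raid1 sdb1[1] sda1[0]
        $1 ~ /^md/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                # Members carry a role number in brackets, e.g. sdb1[1]
                if ($i !~ /\[[0-9]+\]/) continue
                member = $i
                sub(/\[.*$/, "", member)
                # Assumption: partitions are named DISK plus digits
                if (member ~ ("^" disk "[0-9]+$")) {
                    printf "mdadm --fail /dev/%s /dev/%s\n", $1, member
                    printf "mdadm --remove /dev/%s /dev/%s\n", $1, member
                }
            }
        }
    '
}
```

Run `mdadm_detach_cmds sdb < /proc/mdstat`, check the output against the arrays you expect, then run those commands followed by the `badblocks` call shown above.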

This will take a long time. When you return:

 1. start a `screen` session with a static `busybox` as your `SHELL`
    that will survive disk wiping:

        mkdir /root/tmp
        mount -t tmpfs tmpfs /root/tmp
        cp /bin/busybox /root/tmp/sh
        export SHELL=/root/tmp/sh
        exec screen -s $SHELL

 2. kill all processes except the SSH daemon, your SSH connection and
    shell. This varies from machine to machine, but a good way is
    to list all processes with `systemctl status` and `systemctl stop`
    the services one by one. Hint: multiple services can be passed to
    the same `stop` command, for example:

        systemctl stop acpid atd bacula-fd bind9 cron ntp postfix prometheus-node-exporter prometheus-bind-exporter

 3. disable swap:

        swapoff -a

 4. unmount everything that can be unmounted (except `/proc`):

        umount -a

 5. remount everything else read-only:

        mount -o remount,ro /

 6. sync disks:

        sync

 7. wipe the remaining disk (note the dangerous `-f`) and shut down:

        badblocks -w -s -v -p 2 -f /dev/sda ; \
        echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
        sleep 60 ; \
        echo o > /proc/sysrq-trigger
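
Before the final wipe, it may be worth confirming that the unmount and remount steps took effect. This check is not part of the procedure above; it is a sketch of one way to list block devices that are still mounted read-write. It uses only shell builtins, so it should also work from the busybox shell, and the function name `rw_block_mounts` is hypothetical:

```shell
# rw_block_mounts < /proc/mounts
# Print "device mountpoint" for every /dev/* filesystem still mounted
# read-write. Pure shell: no external commands are needed.
rw_block_mounts() {
    while read -r dev mnt fstype opts rest; do
        # Skip pseudo-filesystems such as proc, sysfs and tmpfs
        case "$dev" in
            /dev/*) ;;
            *) continue ;;
        esac
        # Mount options are comma-separated; look for a bare "rw"
        case ",$opts," in
            *,rw,*) printf '%s %s\n' "$dev" "$mnt" ;;
        esac
    done
}
```

An empty result from `rw_block_mounts < /proc/mounts` suggests nothing on disk is still writable. Any line it prints names a filesystem to unmount or remount read-only first.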