# Decommissioning a host

Warning: this procedure is difficult to follow and error-prone. A new
procedure is being established in Fabric, below. It should still work,
provided you follow the warnings.

1. long before (weeks or months) the machine is decommissioned, make
   sure users are aware it will go away and of its replacement services

2. remove the host from `tor-nagios/config/nagios-master.cfg`

3. if applicable, stop the VM in advance:

   * If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the
     primary service on the machine

   * If the machine is on ganeti: `gnt-instance remove $host`

4. On KVM hosts, undefine the VM: `virsh undefine $host`

5. wipe host data, possibly with a delay:

   * On some KVM hosts, remove the LVM logical volumes:

         echo 'lvremove -y vgname/lvname' | at now + 7 days

     Running `lvs` will list the logical volumes on the machine.

   * Other KVM hosts use file-backed storage:

         echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days

   * for a normal machine, or a machine whose parent host we do not
     own, wipe the disks using the method described below

6. remove it from ud-ldap: the host entry and any `@<host>` group
   memberships there might be, as well as any `sudo` passwords users
   might have configured for that host

7. if it has any associated records in `tor-dns/domains` or
   `auto-dns`, or in upstream's reverse DNS interface, remove it from
   there too, e.g.

       grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1

   ... and check upstream reverse DNS.

8. on pauli: `read host ; puppet node clean $host.torproject.org &&
   puppet node deactivate $host.torproject.org`

   TODO: That procedure is incomplete, use the `retire.revoke-puppet`
   job in Fabric instead.

9. grep the `tor-puppet` repo for the host (and maybe its IP
   addresses) and clean up; also look for files with the hostname in
   their name

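The sweep can be rehearsed anywhere with a scratch directory standing in for the `tor-puppet` checkout; the host name and file layout below are made up for illustration:

```shell
# Scratch rehearsal of the tor-puppet sweep: grep for mentions of the
# host, and find files named after it. $repo stands in for a checkout.
host=build-x86-07
repo=$(mktemp -d)
mkdir -p "$repo/modules" "$repo/hiera"
echo "backup_host => '$host.torproject.org'" > "$repo/modules/site.pp"
touch "$repo/hiera/$host.torproject.org.yaml"
grep -rl "$host" "$repo"        # files that mention the host
find "$repo" -name "*$host*"    # files named after the host
rm -rf "$repo"
```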
10. clean host from `tor-passwords`

11. remove any certs and backup keys that are no longer relevant from
    the letsencrypt-domains and letsencrypt-domains/backup-keys git
    repositories:

        git -C letsencrypt-domains grep -e $host -e storm.torproject.org
        # remove entries found above
        git -C letsencrypt-domains commit -a
        git -C letsencrypt-domains push
        find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
        git -C letsencrypt-domains/backup-keys commit -a
        git -C letsencrypt-domains/backup-keys push

    Also clean up the relevant files on the letsencrypt master
    (currently `nevii`), for example:

        ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
        ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete

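One subtlety with `find`: `-delete` binds only to the test immediately before it, so `-name A -o -name B -delete` removes only the `B` matches. Grouping the tests with `\( ... \)` deletes both. A scratch-directory check (hypothetical file names):

```shell
d=$(mktemp -d)
touch "$d/a.example.org.key" "$d/b.example.org.key"
# without parentheses, -delete applies only to the second -name:
find "$d" -name 'a.*' -o -name 'b.*' -delete
ls "$d"          # a.example.org.key survives
# grouping the tests deletes both:
find "$d" \( -name 'a.*' -o -name 'b.*' \) -delete
ls "$d"          # now empty
rm -rf "$d"
```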
12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
    (password in tor-passwords, `hosts-extra-info`) - consider that
    it can take a long time (weeks? months?) to be able to "re-add"
    an IP address in that service, so if that IP can eventually be
    reused, it might be better to keep it there in the short term

13. schedule a removal of the host's backups, on the backup server
    (currently `bungei`):

        cd /srv/backups/bacula/
        mv $host.torproject.org $host.torproject.org-OLD
        echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days

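The rename and the delayed removal must use the exact same suffix, or the `at` job will silently delete nothing. A self-contained rehearsal in a scratch directory (a hypothetical host name, and plain `rm` in place of scheduling with at(1)):

```shell
host=example                 # hypothetical host name
backups=$(mktemp -d)         # stands in for /srv/backups/bacula
mkdir "$backups/$host.torproject.org"
mv "$backups/$host.torproject.org" "$backups/$host.torproject.org-OLD"
ls "$backups"                # example.torproject.org-OLD
rm -rf "$backups/$host.torproject.org-OLD"
ls "$backups"                # empty: suffixes matched, so it was removed
rmdir "$backups"
```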
14. remove the machine from this wiki (if present in
    documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395), and, if it's an
    entire service, the [services page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)

15. if it's a physical machine or a virtual host we don't control,
    schedule removal from racks or hosts with upstream

TODO: remove the client from the Bacula catalog, see <https://bugs.torproject.org/30880>.

## Wiping disks

To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. We do this with the
`nwipe(1)` command, which should be installed before anything else:

    apt install nwipe

Run in a `screen` session:

    screen

If there's a RAID array, first wipe one of the disks by taking it
offline and writing garbage:

    mdadm --fail /dev/md0 /dev/sdb1 &&
    mdadm --remove /dev/md0 /dev/sdb1 &&
    mdadm --fail /dev/md1 /dev/sdb2 &&
    mdadm --remove /dev/md1 /dev/sdb2 &&
    : etc, for the other RAID elements in /proc/mdstat &&
    nwipe --autonuke --method=random --verify=off /dev/sdb

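The per-array `mdadm` commands can also be derived from `/proc/mdstat` instead of typed by hand. A hedged sketch, run here against an inline sample (on the real host, point the awk at `/proc/mdstat` and review its output before executing anything):

```shell
# Generate mdadm --fail/--remove commands for every member of $disk,
# from an inline sample of /proc/mdstat (device lines start at field 5).
cat > /tmp/mdstat.sample <<'EOF'
md0 : active raid1 sda1[0] sdb1[1]
md1 : active raid1 sda2[0] sdb2[1]
EOF
awk -v disk=sdb '/^md/ {
    for (i = 5; i <= NF; i++)
        if ($i ~ "^" disk) {
            dev = $i; sub(/\[.*/, "", dev)
            printf "mdadm --fail /dev/%s /dev/%s && mdadm --remove /dev/%s /dev/%s &&\n", \
                   $1, dev, $1, dev
        }
}' /tmp/mdstat.sample
rm /tmp/mdstat.sample
```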
This will take a long time. Note that nwipe starts a GUI, which is
useful because it gives you timing estimates that the commandline
version [does not provide](https://github.com/martijnvanbrummelen/nwipe/issues/196).

WARNING: this procedure doesn't cover the case where the disk is an
SSD. See [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.3062&rep=rep1&type=pdf) for details on how classic data scrubbing
software might not work for SSDs. For now we use this:

    nwipe --autonuke --method=random --rounds=2 /dev/nvme1n1

When you return:

1. start a `screen` session with a static `busybox` as your `SHELL`
   that will survive disk wiping:

       # make sure /tmp is on a tmpfs first!
       cp -av /root /tmp/root &&
       mount -o bind /tmp/root /root &&
       cp /bin/busybox /tmp/root/sh &&
       export SHELL=/tmp/root/sh &&
       exec screen -s $SHELL

   TODO: the above eventually failed to make busybox survive the
   destruction, probably because it got evicted from RAM and couldn't
   be found in swap again (as *that* was destroyed too). We should
   try using [vmtouch](https://hoytech.com/vmtouch/) with something like `vmtouch -dl
   /tmp/root/sh` next time, although that is only [available in buster
   and later](https://tracker.debian.org/pkg/vmtouch).

2. kill all processes but the SSH daemon, your SSH connection and
   your shell. This will vary from machine to machine, but a good way
   is to list all processes with `systemctl status` and `systemctl
   stop` the services one by one. Hint: multiple services can be
   passed on the same `stop` command, for example:

       systemctl stop acpid acpid.socket acpid.path atd bacula-fd bind9 cron dbus dbus.socket fail2ban haveged irqbalance libvirtd lvm2-lvmetad.service lvm2-lvmetad.socket mdmonitor nagios-nrpe-server ntp openvswitch-switch postfix prometheus-bind-exporter prometheus-node-exporter smartd strongswan syslog-ng.service systemd-journald systemd-journald-audit.socket systemd-journald-dev-log.socket systemd-journald.socket systemd-logind.service systemd-udevd systemd-udevd-control.socket systemd-udevd-kernel.socket ulogd2 unbound virtlogd virtlogd.socket

3. disable swap:

       swapoff -a

4. unmount everything that can be unmounted (except `/proc`):

       umount -a

5. remount everything else readonly:

       mount -o remount,ro /

6. sync disks:

       sync

7. wipe the remaining disk and shutdown:

       nwipe --autonuke --method=random --verify=off /dev/sda ; \
       echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
       sleep 60 ; \
       echo o > /proc/sysrq-trigger

A few tricks that might help in an emergency, when nothing else works
in the shell:

* `cat PATH` can be approximated in pure shell with `while IFS= read
  -r line; do printf '%s\n' "$line"; done < PATH`
* `echo *` can be used as a rough approximation of `ls`

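A pure-shell stand-in for `cat`, and the `echo *` trick, can be rehearsed in any shell before you need them in anger:

```shell
tmp=$(mktemp -d)
printf 'hello\nworld\n' > "$tmp/f"
# cat without cat(1): read and reprint each line with builtins only
while IFS= read -r line; do printf '%s\n' "$line"; done < "$tmp/f"
# ls without ls(1): let the shell expand the glob
( cd "$tmp" && echo * )          # prints: f
rm -rf "$tmp"
```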
## Alternate, Fabric-based procedure

1. long before (weeks or months) the machine is decommissioned, make
   sure users are aware it will go away and of its replacement services

2. remove the host from `tor-nagios/config/nagios-master.cfg`

3. if applicable, stop the VM in advance:

   * If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the
     primary service on the machine

   * If the machine is on ganeti: `gnt-instance remove $host`

   TODO: move this into Fabric

4. after a delay, retire the host from its parent, backups and
   Puppet, for example:

       ./retire -v -H $INSTANCE retire-all --parent-host=$PARENT_HOST

   TODO: `$PARENT_HOST` should be some ganeti node
   (e.g. `fsn-node-01.torproject.org`) but could be auto-detected...

   TODO: cover physical machines

5. remove from LDAP with `ldapvi` (step 6 of the procedure above).
   TODO: add to Fabric, make sure you show the diff

6. do one huge power-grep over all our source code; for example, with
   unifolium that was:

       grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2

   TODO: extract those values from LDAP (e.g. purpose) and run the
   grep in Fabric

7. remove from tor-passwords (TODO: put in Fabric). Magic command
   (not great):

       for f in *; do
         if gpg -d < "$f" 2>/dev/null | grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2 ; then
           echo "match found in $f"
           ~/src/pwstore/pws ed "$f"
         fi
       done

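The loop can be dry-run with plain files standing in for the encrypted ones (no gpg involved; the file names and contents below are made up):

```shell
# Dry-run of the tor-passwords sweep: loop over files, report matches.
d=$(mktemp -d)
printf 'root password for kvm2.torproject.org: x\n' > "$d/hosts"
printf 'unrelated entry\n' > "$d/other"
cd "$d"
for f in *; do
  if grep -q -i -e kvm2.torproject.org "$f"; then
    echo "match found in $f"     # here you would run: pws ed "$f"
  fi
done
cd - >/dev/null
rm -rf "$d"
```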
8. remove from DNSwl

9. remove the machine from this wiki (if present in
   documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395), and, if it's an
   entire service, the [services page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)

10. if it's a physical machine or a virtual host we don't control,
    schedule removal from racks or hosts with upstream

11. remove from reverse DNS

TODO: remove the client from the Bacula catalog, see <https://bugs.torproject.org/30880>.