Decommissioning a host

Warning: this procedure is difficult to follow and error-prone. A new, Fabric-based procedure is being established (see "Alternate, fabric-based procedure" below). The manual procedure here should still work, provided you heed the warnings.

  1. long before (weeks or months) the machine is decommissioned, make sure users are aware that it will go away and know about its replacement services

  2. remove the host from tor-nagios/config/nagios-master.cfg

  3. if applicable, stop the VM in advance:

    • If the VM is on a KVM host: virsh shutdown $host, or at least stop the primary service on the machine

    • If the machine is on ganeti: gnt-instance stop $host

  4. On KVM hosts, undefine the VM: virsh undefine $host

  5. wipe host data, possibly with a delay:

    • On some KVM hosts, remove the LVM logical volumes:

      echo 'lvremove -y vgname/lvname' | at now + 7 days

      Use lvs to list the logical volumes on the machine.

    • Other KVM hosts use file-backed storage:

      echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days

    • On Ganeti hosts, remove the actual instance with a delay, from the Ganeti master:

      echo "gnt-instance remove $host" | at now + 7 days

    • for a normal machine or a machine whose parent host we do not own, wipe the disks using the method described in "Wiping disks" below

  6. remove it from ud-ldap: the host entry and any @<host> group memberships there might be as well as any sudo passwords users might have configured for that host

  7. if it has any associated records in tor-dns/domains or auto-dns, or upstream's reverse dns thing, remove it from there too. e.g.

    grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1

    ... and check upstream reverse DNS.

  8. on pauli, clean and deactivate the node in Puppet: read host ; puppet node clean $host.torproject.org && puppet node deactivate $host.torproject.org. TODO: this procedure is incomplete; use the retire.revoke-puppet job in Fabric instead.

  9. grep the tor-puppet repo for the host (and maybe its IP addresses) and clean up; also look for files with hostname in their name

  10. clean host from tor-passwords

  11. remove any certs and backup keys from letsencrypt-domains and letsencrypt-domains/backup-keys git repositories that are no longer relevant:

    git -C letsencrypt-domains grep -e $host -e storm.torproject.org
    # remove entries found above
    git -C letsencrypt-domains commit
    git -C letsencrypt-domains push
    find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
    git -C letsencrypt-domains/backup-keys commit
    git -C letsencrypt-domains/backup-keys push

    Also clean up the relevant files on the letsencrypt master (currently nevii), for example:

    ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
    ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete

  12. if the machine is handling mail, remove it from dnswl.org (password in tor-passwords, hosts-extra-info). Consider that it can take a long time (weeks? months?) to "re-add" an IP address to that service, so if that IP may eventually be reused, it might be better to keep it there in the short term

  13. schedule a removal of the host's backup, on the backup server (currently bungei):

    cd /srv/backups/bacula/
    mv $host.torproject.org $host.torproject.org-OLD
    echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days

  14. remove the machine from this wiki (if present in documentation), the Nextcloud spreadsheet (if it is not in ganeti), and, if it's an entire service, the services page

  15. if it's a physical machine or a virtual host we don't control, schedule removal from racks or hosts with upstream
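
For step 13, one way to keep the rename and the delayed deletion pointing at the same directory is to derive both paths from a single variable. The sketch below only prints the commands for review and runs nothing; the function name backup_removal_commands and its optional delay argument are illustrative assumptions, not part of any existing tooling.

```shell
# Hypothetical helper: print (do not run) the commands that park a host's
# backups under a -OLD name and schedule their deletion.
# Usage: backup_removal_commands HOST [DAYS]
backup_removal_commands() {
    host="$1"
    days="${2:-30}"                       # delay before deletion, 30 days as in step 13
    base=/srv/backups/bacula              # backup directory on the backup server
    old="$base/$host.torproject.org-OLD"  # single source of truth for the new name
    printf 'mv %s %s\n' "$base/$host.torproject.org" "$old"
    printf "echo 'rm -rf %s/' | at now + %s days\n" "$old" "$days"
}
```

Running backup_removal_commands gayi on the backup server prints the two commands, which can then be reviewed and pasted into a shell.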

TODO: remove the client from the Bacula catalog, see https://bugs.torproject.org/30880.

Wiping disks

To wipe disks on servers without a serial console or management interface, you need to be a little more creative. We do this with the nwipe(1) command, which should be installed before anything else:

apt install nwipe

Run the wipe inside a screen session:

screen

If there's a RAID array, first wipe one of the disks by taking it offline and writing garbage:

mdadm --fail /dev/md0 /dev/sdb1 &&
mdadm --remove /dev/md0 /dev/sdb1 &&
mdadm --fail /dev/md1 /dev/sdb2 &&
mdadm --remove /dev/md1 /dev/sdb2 &&
: etc, for the other RAID elements in /proc/mdstat &&
nwipe --autonuke --method=random --verify=off /dev/sdb

This will take a long time. Note that nwipe starts an interactive (ncurses) interface, which is useful because it shows timing estimates that the command-line version does not provide.
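
The ": etc" placeholder in the mdadm chain above stands for repeating the fail/remove pair for every array the disk belongs to. As a rough sketch, those pairs could be generated from /proc/mdstat instead of being typed by hand. The function name mdadm_detach_commands is hypothetical, and the parser assumes the usual mdstat layout where members appear as "sdb1[1]" on the array line. It only prints commands, so the output can be checked before anything is run.

```shell
# Hypothetical helper: print (do not run) the mdadm commands needed to pull
# every member belonging to one disk out of its RAID arrays.
# Reads /proc/mdstat-formatted text on stdin.
# Usage: mdadm_detach_commands sdb < /proc/mdstat
mdadm_detach_commands() {
    disk="$1"
    awk -v disk="$disk" '
        # Array lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        $2 == ":" && $1 ~ /^md/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*$/, "", member)   # strip the "[1]" or "[1](F)" suffix
                # keep the disk itself or its partitions (sdb, sdb1, nvme1n1p2)
                if (member ~ ("^" disk "(p?[0-9]+)?$")) {
                    printf "mdadm --fail /dev/%s /dev/%s\n", $1, member
                    printf "mdadm --remove /dev/%s /dev/%s\n", $1, member
                }
            }
        }
    '
}
```

The printed lines still need to be reviewed and joined with && before the nwipe call, as in the manual chain above.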

WARNING: this procedure doesn't cover the case where the disk is an SSD. See this paper for details on how classic data scrubbing software might not work for SSDs. For now we use this:

nwipe --autonuke --method=random --rounds=2 /dev/nvme1n1
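
Since the SSD case needs different handling, it can help to check what kind of device you are about to wipe. A minimal sketch, assuming the kernel's rotational flag in sysfs is trustworthy for the hardware at hand (virtual disks and hardware RAID controllers may report misleading values). The function name disk_kind and the SYSFS_ROOT override are illustrative assumptions.

```shell
# Hypothetical helper: report whether a block device is rotational (HDD)
# or not (SSD/NVMe), based on /sys/block/<dev>/queue/rotational.
# SYSFS_ROOT can be overridden to try the function against a fake tree.
# Usage: disk_kind sda
disk_kind() {
    dev="$1"
    flag_file="${SYSFS_ROOT:-/sys}/block/$dev/queue/rotational"
    if [ ! -r "$flag_file" ]; then
        echo unknown
        return 1
    fi
    case "$(cat "$flag_file")" in
        1) echo rotational ;;
        0) echo solid-state ;;
        *) echo unknown; return 1 ;;
    esac
}
```

For example, disk_kind nvme1n1 is expected to print solid-state on a machine with an NVMe drive of that name.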

When you return:

  1. start a screen session with a static busybox as your SHELL that will survive disk wiping:

    # make sure /tmp is on a tmpfs first!
    cp -av /root /tmp/root &&
    mount -o bind /tmp/root /root &&
    cp /bin/busybox /tmp/root/sh &&
    export SHELL=/tmp/root/sh &&
    exec screen -s $SHELL

    TODO: the above eventually failed to make busybox survive the destruction, probably because it got evicted from RAM and couldn't be found in swap again (as that was destroyed too). We should try using vmtouch with something like vmtouch -dl /tmp/root/sh next time, although that is only available in buster and later.

  2. kill all processes except the SSH daemon, your SSH connection and your shell. This varies from machine to machine, but a good approach is to list all processes with systemctl status and then systemctl stop the services one by one. Hint: multiple services can be passed to the same stop command, for example:

    systemctl stop acpid acpid.socket acpid.path atd bacula-fd bind9 cron dbus dbus.socket fail2ban haveged irqbalance libvirtd lvm2-lvmetad.service lvm2-lvmetad.socket mdmonitor nagios-nrpe-server ntp openvswitch-switch postfix prometheus-bind-exporter prometheus-node-exporter smartd strongswan syslog-ng.service systemd-journald systemd-journald-audit.socket systemd-journald-dev-log.socket systemd-journald.socket systemd-logind.service systemd-udevd systemd-udevd-control.socket systemd-udevd-kernel.socket ulogd2 unbound virtlogd virtlogd.socket

  3. disable swap:

    swapoff -a

  4. unmount everything that can be unmounted (except /proc):

    umount -a

  5. remount everything else readonly:

    mount -o remount,ro /

  6. sync disks:

    sync

  7. wipe the remaining disk and shutdown:

    nwipe --autonuke --method=random --verify=off /dev/sda ; \
    echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
    sleep 60 ; \
    echo o > /proc/sysrq-trigger
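
Step 1 depends on /tmp being a tmpfs, otherwise the busybox copy lives on the disk being wiped. A small pre-flight check along these lines could catch that before the point of no return. This is a sketch, not an established part of the procedure: the function name tmp_is_tmpfs is hypothetical, and it is deliberately conservative, failing if /tmp is not its own mount even when the root filesystem happens to be a tmpfs.

```shell
# Hypothetical pre-flight check: succeed only if /tmp is its own tmpfs mount.
# Reads a /proc/mounts-style table (fields: device, mountpoint, fstype, ...).
# Usage: tmp_is_tmpfs [MOUNTS_FILE]
tmp_is_tmpfs() {
    awk '
        $2 == "/tmp" { fstype = $3 }   # later entries shadow earlier ones
        END { exit (fstype == "tmpfs" ? 0 : 1) }
    ' "${1:-/proc/mounts}"
}
```

It could be used as a guard before copying busybox, for example: tmp_is_tmpfs || echo "/tmp is not a tmpfs, stop here".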

A few tricks that might work in an emergency, when nothing else works in the shell:

  • cat PATH can be approximated with bash builtins only: mapfile -t lines < PATH; printf '%s\n' "${lines[@]}"
  • echo * can be used as a rough approximation of ls
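
The same tricks can be wrapped into small functions that rely only on shell builtins, so they keep working once the external binaries have been wiped. The names bcat and bls are made up for this sketch; define them in the surviving shell before starting the final wipe.

```shell
# Builtin-only stand-ins for cat and ls, for when external binaries are gone.
# Only shell builtins (read, printf, [, continue) are used.

# Print a file line by line; a final line without a trailing newline is
# still printed, with a newline appended.
bcat() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}

# List the non-hidden entries of a directory (default: current directory).
bls() {
    for entry in "${1:-.}"/*; do
        [ -e "$entry" ] || continue   # skip the literal pattern left by an unmatched glob
        printf '%s\n' "${entry##*/}"
    done
}
```

For example, bcat /proc/mdstat and bls /dev should still work after /bin has been overwritten, as long as the shell itself stays in memory.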

Alternate, fabric-based procedure

  1. long before (weeks or months) the machine is decommissioned, make sure users are aware that it will go away and know about its replacement services

  2. remove the host from tor-nagios/config/nagios-master.cfg

  3. if applicable, stop the VM in advance:

    • If the VM is on a KVM host: virsh shutdown $host, or at least stop the primary service on the machine

    • If the machine is on ganeti: gnt-instance stop $host TODO: move this into Fabric

  4. after a delay, retire the host from its parent, backups and Puppet, for example:

    ./retire -v -H $INSTANCE retire-all --parent-host=$PARENT_HOST

    TODO: $PARENT_HOST should be some ganeti node (e.g. fsn-node-01.torproject.org) but could be auto-detected...

    TODO: cover physical machines

  5. remove from LDAP with ldapvi (step 6 of the manual procedure above) TODO: add to Fabric, make sure you show the diff

  6. do one huge power-grep over all our source code, for example with unifolium that was:

    grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org  -e unifolium.torproject.org -e unifolium -e kvm2

    TODO: extract those values from LDAP (e.g. purpose) and run the grep in Fabric

  7. remove from tor-passwords (TODO: put in fabric). magic command (not great):

    # decrypt each password file and open the ones that mention the host
    for f in *; do
        if gpg -d < "$f" 2>/dev/null | grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2 ; then
            echo "match found in $f"
            ~/src/pwstore/pws ed "$f"
        fi
    done

  8. remove from dnswl.org

  9. remove the machine from this wiki (if present in documentation), the Nextcloud spreadsheet (if it is not in ganeti), and, if it's an entire service, the services page

  10. if it's a physical machine or a virtual host we don't control, schedule removal from racks or hosts with upstream

  11. remove from reverse DNS

TODO: remove the client from the Bacula catalog, see https://bugs.torproject.org/30880.
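
The "power-grep" in step 6 can be wrapped in a small function so the identifier list is given once and matched literally. This is only a sketch of what the Fabric job might eventually do: the function name find_host_references is hypothetical, and -F is a deliberate change from the plain grep above, so that dots in IP addresses and hostnames are not treated as regex wildcards. It relies on grep -r and -f - (patterns read from standard input) as provided by GNU grep.

```shell
# Hypothetical helper: search a tree for any of a host's identifiers.
# Identifiers are matched as fixed strings (-F), so the dots in IP addresses
# and hostnames are not treated as regex wildcards.
# Usage: find_host_references DIR IDENTIFIER...
find_host_references() {
    dir="$1"
    shift
    # refuse to run without identifiers: an empty pattern would match every line
    [ "$#" -gt 0 ] || return 2
    # one identifier per line on stdin, read by grep as its pattern list
    printf '%s\n' "$@" | grep -rnH -F -f - "$dir"
}
```

With the unifolium example, that would be find_host_references . 148.251.180.115 2a01:4f8:211:6e8::2 kvm2.torproject.org unifolium.torproject.org unifolium kvm2, run from the directory holding the checked-out repositories.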