# Decommissioning a host
2. make sure users are aware it will go away and of its replacement services
3. retire the host from its parent, backups and Puppet, for example:

        fab -H $INSTANCE retire.retire-all --parent-host=$PARENT_HOST

   Copy the output of the script into the retirement ticket. The
   destruction delays can be raised for more sensitive hosts: the
   default is 7 days for disks and 30 days for backups; for sensitive
   hosts, use (say) 30 days for disks and 90 for backups.
TODO: `$PARENT_HOST` should be some ganeti node
(e.g. `fsn-node-01.torproject.org`) but could be auto-detected...
5. remove the host from LDAP with `ldapvi` (step 6 above) and
   copy-paste the removed entry into the ticket
6. do one huge power-grep and find over all our source code, for
   example with:

        grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
TODO: extract those values from LDAP (e.g. purpose) and run the
grep in Fabric
7. remove from tor-passwords (TODO: put in fabric). Magic command
   (not great):

        pass rm root/unifolium.torproject.org
        # look for traces of the host elsewhere
        for f in */*; do
            if gpg -d < "$f" 2>/dev/null | \
               grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
            then
                echo "$f"
            fi
        done
9. remove the machine from this wiki (if present in
   documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
   [ganeti](ganeti)), and, if it's an entire service, the [services
   page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
10. if it's a physical machine or a virtual host we don't control,
schedule removal from racks or hosts with upstream
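The TODO in step 6 (extract the identifiers and run the grep automatically) can be sketched in plain shell. This is only an illustration: the host name and addresses are the examples used in the text, and in practice those values would come from LDAP rather than being hardcoded:

```shell
# Build the power-grep expression list from a host's identifiers
# (sketch only; these values are the examples from the text and
# would normally be extracted from LDAP).
host=unifolium
fqdn=$host.torproject.org
addrs="148.251.180.115 2a01:4f8:211:6e8::2 kvm2.torproject.org kvm2"
args=""
for t in $addrs $fqdn $host; do
    args="$args -e $t"
done
# the resulting command, ready to run at the top of a source tree:
echo grep -nH -r$args
```

Running the generated command over each source repository then replaces the hand-typed grep in step 6.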
Equivalent retirement checklist to copy-paste in retirement tickets:
3. [ ] retire the host in fabric
4. [ ] remove from LDAP with `ldapvi`
5. [ ] power-grep
6. [ ] remove from tor-passwords
7. [ ] remove from DNSwl
8. [ ] remove from docs
9. [ ] remove from racks
10. [ ] remove from reverse DNS
## Wiping disks
To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. We do this with the
`nwipe(1)` command, which should be installed (`apt install nwipe`)
before anything else:
If there's a RAID array, first wipe one of the disks by taking it
offline and writing garbage:

    mdadm --fail /dev/md0 /dev/sdb1 &&
    mdadm --remove /dev/md0 /dev/sdb1 &&
    mdadm --fail /dev/md1 /dev/sdb2 &&
    mdadm --remove /dev/md1 /dev/sdb2 &&
    : etc, for the other RAID elements in /proc/mdstat &&
    nwipe --autonuke --method=random --verify=off /dev/sdb
This will take a long time. Note that it will start a GUI which is
useful because it will give you timing estimates, which the
command-line version [does not provide](https://github.com/martijnvanbrummelen/nwipe/issues/196).
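To know which partitions to `--fail` and `--remove`, the array members can be read out of `/proc/mdstat`. A minimal sketch, run here against a fabricated sample instead of the real file:

```shell
# List RAID member partitions from /proc/mdstat-style output
# (the sample text below is made up for illustration; on a real
# host, read /proc/mdstat instead).
mdstat='md0 : active raid1 sda1[0] sdb1[1]
md1 : active raid1 sda2[0] sdb2[1]'
members=$(printf '%s\n' "$mdstat" |
    grep -o '[a-z]\+[0-9]\+\[[0-9]\+\]' |
    sed 's/\[.*//')
echo "$members"
```

Each line of the output is one partition to fail and remove from its array before wiping the whole device.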
WARNING: this procedure doesn't cover the case where the disk is an
SSD. See [this paper][] for details on how classic data scrubbing
software might not work for SSDs. For now we use this:

    nwipe --autonuke --method=random --rounds=2 --verify=off /dev/nvme1n1
TODO: consider `hdparm` and the "secure erase" procedure for SSDs:

    hdparm --user-master u --security-set-pass Eins /dev/sdc
    time hdparm --user-master u --security-erase Eins /dev/sdc
See also the [stressant documentation](https://stressant.readthedocs.io/en/latest/usage.html#wiping-disks) about this.
1. start a `screen` session with a static `busybox` as your `SHELL`
   that will survive disk wiping:

        # make sure /tmp is on a tmpfs first!
        cp -av /root /tmp/root &&
        mount -o bind /tmp/root /root &&
        cp /bin/busybox /tmp/root/sh &&
        export SHELL=/tmp/root/sh &&
        screen
2. lock down busybox and screen in memory:

        vmtouch -dl /usr/bin/screen /bin/busybox /tmp/root/sh /usr/sbin/nwipe
TODO: the above aims at making busybox survive the destruction, so
that it's cached in RAM. It's unclear if that actually works,
because typically SSH is also busted and needs a lot more to
bootstrap, so we can't log back in if we lose the
console. Ideally, we'd run this in a serial console that would
have more reliable access... See also [vmtouch](https://hoytech.com/vmtouch/).
3. stop all services that might write to disk, except the ones needed
   for your shell. This will vary from machine to machine, but a good
   way is to list all processes with `systemctl status` and
   `systemctl stop` the services one by one. Hint: multiple services
   can be passed on the same `stop` command, for example:

        systemctl stop \
            acpid \
            acpid.path \
            acpid.socket \
            atd \
            bacula-fd \
            bind9 \
            cron \
            dbus \
            dbus.socket \
            fail2ban \
            ganeti \
            haveged \
            irqbalance \
            ipsec \
            iscsid \
            libvirtd \
            lvm2-lvmetad.service \
            lvm2-lvmetad.socket \
            mdmonitor \
            multipathd.service \
            multipathd.socket \
            ntp \
            openvswitch-switch \
            postfix \
            prometheus-bind-exporter \
            prometheus-node-exporter \
            smartd \
            strongswan \
            syslog-ng.service \
            systemd-journald \
            systemd-journald-audit.socket \
            systemd-journald-dev-log.socket \
            systemd-journald.socket \
            systemd-logind.service \
            systemd-udevd \
            systemd-udevd-control.socket \
            systemd-udevd-kernel.socket \
            timers.target \
            ulogd2 \
            unbound \
            virtlogd \
            virtlogd.socket
5. remount the root filesystem read-only:

        mount -o remount,ro /
6. sync disks:

        sync
7. wipe the disk, then power off (or, failing that, reboot) through
   sysrq:

        nwipe --autonuke --method=random --rounds=2 --verify=off /dev/noop ; \
        echo o > /proc/sysrq-trigger ; \
        sleep 60 ; \
        echo b > /proc/sysrq-trigger

   Note: as a safety precaution, the device above has been replaced
   by `noop`; it should be the real device (say `sda`) instead.
A few tricks that might work in an emergency, if nothing else works
in the shell:

* `cat PATH` can be approximated as `mapfile -c 1 -C "printf %s" < PATH`
  in bash (without `-c 1`, the callback only runs every 5000 lines)
* `echo *` can be used as a rough approximation of `ls`
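Both tricks can be rehearsed in a normal bash shell before they are needed; a quick demonstration (the file name is arbitrary):

```shell
# Demonstrate the emergency built-ins (bash required: mapfile is a
# bash builtin). Neither command forks an external binary, which is
# the point once the disks are gone.
printf 'hello emergency\n' > /tmp/demo.txt
# rough `cat`: mapfile invokes the callback with "index line" for
# each line read, and printf %s prints both back out
out=$(mapfile -c 1 -C "printf %s" < /tmp/demo.txt)
echo "$out"
# rough `ls`: pathname expansion does the directory listing
files=$(cd /tmp && echo *)
echo "$files"
```

Note that the `mapfile` output is prefixed with the line index and loses newlines, so it is only a rough approximation of `cat`.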
[this paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.3062&rep=rep1&type=pdf
## Deprecated manual procedure
Warning: this procedure is difficult to follow and error-prone. A new
procedure was established in Fabric, above. It should really just be
completely avoided.
2. make sure users are aware it will go away and of its replacement services
3. if applicable, stop the VM in advance:
* If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the
primary service on the machine
* If the machine is on ganeti: `gnt-instance stop $host`
4. On KVM hosts, undefine the VM: `virsh undefine $host`
5. wipe host data, possibly with a delay:
   * On some KVM hosts, remove the LVM logical volumes:

         echo 'lvremove -y vgname/lvname' | at now + 7 days

     `lvs` will list the logical volumes on the machine.
   * Other KVM hosts use file-backed storage:

         echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
   * On Ganeti hosts, remove the actual instance with a delay, from
     the Ganeti master:

         echo "gnt-instance remove $host" | at now + 7 days
* for a normal machine or a machine we do not own the parent host
for, wipe the disks using the method described below
6. remove it from LDAP: the host entry and any `@<host>` group
   memberships there might be, as well as any `sudo` passwords users
   might have configured for that host
7. if it has any associated records in `tor-dns/domains` or
   `auto-dns`, or upstream's reverse DNS, remove it from there too,
   e.g.:

        grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1

   ... and check upstream reverse DNS.
8. on the puppet server (`pauli`):

        read host
        puppet node clean $host.torproject.org &&
        puppet node deactivate $host.torproject.org

   TODO: That procedure is incomplete, use the `retire.revoke-puppet`
   job in fabric instead.
9. grep the `tor-puppet` repository for the host (and maybe its IP
addresses) and clean up; also look for files with hostname in
their name
10. clean host from `tor-passwords`
11. remove any certs and backup keys from `letsencrypt-domains.git` and
    `letsencrypt-domains/backup-keys.git` repositories that are no
    longer relevant:

        git -C letsencrypt-domains grep -e $host -e storm.torproject.org
        # remove entries found above
        git -C letsencrypt-domains commit
        git -C letsencrypt-domains push
        find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
        git -C letsencrypt-domains/backup-keys commit
        git -C letsencrypt-domains/backup-keys push
    Also clean up the relevant files on the letsencrypt master
    (currently `nevii`), for example:

        ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
        ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
(password in tor-passwords, `hosts-extra-info`) - consider that
it can take a long time (weeks? months?) to be able to "re-add"
an IP address in that service, so if that IP can eventually be
reused, it might be better to keep it there in the short term
13. schedule a removal of the host's backups, on the backup server
    (currently `bungei`):

        cd /srv/backups/bacula/
        mv $host.torproject.org $host.torproject.org-OLD
        echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
14. remove the machine from this wiki (if present in
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
[ganeti](ganeti)), and, if it's an entire service, the [services
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
15. if it's a physical machine or a virtual host we don't control,
schedule removal from racks or hosts with upstream
16. after the 30 days delay, retire the host from the Bacula catalog:
    on the director (currently `bacula-director-01`), run `bconsole`
    then:

        delete client=$INSTANCE-fd

    for example:

        delete client=archeotrichon.torproject.org-fd
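Several of the steps above rely on the same grace-period pattern: pipe the destructive command into `at(1)` instead of running it immediately, so there is a window to cancel with `atrm`. A generic sketch (the host name and path are examples from the text, and the final `at` line is left commented out so the sketch is safe to run anywhere):

```shell
# Delayed-destruction pattern used throughout this procedure:
# build the command, review it, then hand it to at(1) to run later.
host=gayi.torproject.org                 # example host name
cmd="rm -r /srv/vmstore/$host/"          # example destructive command
echo "$cmd"                              # review before queuing
# echo "$cmd" | at now + 7 days          # uncomment on the real server
```

While the command is queued, `atq` lists the pending job and `atrm <job>` aborts it, which is the whole point of the delay.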
<!-- sync this section with service/backup#retiring-a-client when -->