    Warning: this procedure is difficult to follow and error-prone. A new
    procedure is being established in Fabric (see below). This older
    procedure should still work, provided you follow the warnings.
    
    
     1. long before (weeks or months) the machine is decommissioned, make
        sure users are aware it will go away and know about its replacement
        services
    
     2. remove the host from `tor-nagios/config/nagios-master.cfg`
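
        For example, a rough sketch assuming a local checkout of the
        `tor-nagios` repository (the exact stanza to delete will vary):

             git -C tor-nagios grep -n $host config/nagios-master.cfg
             $EDITOR tor-nagios/config/nagios-master.cfg
             git -C tor-nagios commit -a -m "retire $host" && git -C tor-nagios push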
    
     3. if applicable, stop the VM in advance:
    
    
        * If the VM is on a KVM host: `virsh shutdown $host`, or at least
          stop the primary service on the machine

        * If the machine is on Ganeti: `gnt-instance stop $host`
    
     4. On KVM hosts, undefine the VM: `virsh undefine $host`
    
     5. wipe host data, possibly with a delay:
    
        * On some KVM hosts, remove the LVM logical volumes:
    
              echo 'lvremove -y vgname/lvname' | at now + 7 days
    
          Use `lvs` to list the logical volumes on the machine.
    
        * Other KVM hosts use file-backed storage:
        
              echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
    
        * On Ganeti hosts, remove the actual instance with a delay, from
          the Ganeti master:
    
    
              echo "gnt-instance remove $host" | at now + 7 days
    
        * For a normal machine, or a machine whose parent host we do not
          own, wipe the disks using the method described below
    
    
     6. remove it from ud-ldap: the host entry, any `@<host>` group
        memberships there might be, and any `sudo` passwords users might
        have configured for that host
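
        A rough sketch of that cleanup; the server name, DNs and attribute
        names below are illustrative and may not match the actual ud-ldap
        setup:

             # edit the directory interactively
             ldapvi --host db.torproject.org
             # in the editor, delete:
             #  - the host's entry (e.g. host=$host,ou=hosts,dc=torproject,dc=org)
             #  - any @$host group memberships on user entries
             #  - any sudoPassword attributes mentioning $host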
    
     7. if it has any associated records in `tor-dns/domains` or
        `auto-dns`, or in upstream's reverse DNS, remove it from there
        too, e.g.:
        
            grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1
        
        ... and check upstream reverse DNS.
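
        To double-check upstream reverse DNS, something like this should
        work (addresses taken from the example above):

             dig +short -x 78.47.38.230
             dig +short -x 2a01:4f8:211:6e8:0:823:6:1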
    
     8. on pauli: `read host ; puppet node clean $host.torproject.org &&
        puppet node deactivate $host.torproject.org`

        TODO: this procedure is incomplete; use the `retire.revoke-puppet`
        job in Fabric instead.
     9. grep the `tor-puppet` repo for the host (and maybe its IP
        addresses) and clean up; also look for files with the hostname in
        their name
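
        One way to do that sweep, assuming a local `tor-puppet` checkout
        (adjust the host name and addresses to the machine being retired):

             git -C tor-puppet grep -n -e $host -e $host.torproject.org
             find tor-puppet -name "*$host*"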
    
     10. clean host from `tor-passwords`
     11. remove any certs and backup keys that are no longer relevant from
         the letsencrypt-domains and letsencrypt-domains/backup-keys git
         repositories:
    
    
             git -C letsencrypt-domains grep -e $host -e storm.torproject.org
             # remove entries found above
             git -C letsencrypt-domains commit
             git -C letsencrypt-domains push
             find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
             git -C letsencrypt-domains/backup-keys commit
             git -C letsencrypt-domains/backup-keys push
    
         Also clean up the relevant files on the letsencrypt master
         (currently `nevii`), for example:
    
    
             ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
             ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
    
    
     12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
         (password in tor-passwords, `hosts-extra-info`). Consider that it
         can take a long time (weeks? months?) to re-add an IP address to
         that service, so if that IP can eventually be reused, it might be
         better to keep it there in the short term
    
     13. schedule a removal of the host's backup, on the backup server
         (currently `bungei`):
    
    
             cd /srv/backups/bacula/
             mv $host.torproject.org $host.torproject.org-OLD
             echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
    
    
     14. remove the machine from this wiki (if present in documentation),
         from the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
         [ganeti](ganeti)) and, if it's an entire service, from the [services
         page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
    
     15. if it's a physical machine or a virtual host we don't control,
         schedule removal from racks or hosts with upstream
    
    TODO: remove the client from the Bacula catalog, see <https://bugs.torproject.org/30880>.
    
    
    ## Wiping disks
    
    To wipe disks on servers without a serial console or management
    interface, you need to be a little more creative. We do this with the
    `nwipe(1)` command, which should be installed before anything:
    
        apt install nwipe
    
    
    Run everything below inside a `screen` session:
    
        screen
    
    
    If there's a RAID array, first wipe one of the disks by taking it
    offline and writing garbage:
    
    
        mdadm --fail /dev/md0 /dev/sdb1 &&
        mdadm --remove /dev/md0 /dev/sdb1 &&
    
        mdadm --fail /dev/md1 /dev/sdb2 &&
        mdadm --remove /dev/md1 /dev/sdb2 &&
    
        : etc, for the other RAID elements in /proc/mdstat &&
    
        nwipe --autonuke --method=random --verify=off /dev/sdb
    
    This will take a long time. Note that it will start a GUI which is
    useful because it will give you timing estimates, which the
    commandline version [does not provide](https://github.com/martijnvanbrummelen/nwipe/issues/196).
    
    WARNING: this procedure doesn't cover the case where the disk is an
    SSD. See [this paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.3062&rep=rep1&type=pdf) for details on how classic data scrubbing
    software might not work for SSDs. For now we use this:
    
        nwipe --autonuke --method=random --rounds=2 /dev/nvme1n1
    
    When you return:
    
     1. start a `screen` session with a static `busybox` as your `SHELL`
        that will survive disk wiping:
    
            # make sure /tmp is on a tmpfs first!
    
            cp -av /root /tmp/root &&
            mount -o bind /tmp/root /root &&
            cp /bin/busybox /tmp/root/sh &&
            export SHELL=/tmp/root/sh &&
    
            exec screen -s $SHELL
    
    
        TODO: the above eventually failed to make busybox survive the
        destruction, probably because it got evicted from RAM and couldn't
        be found in swap again (as *that* was destroyed too). We should
        try using [vmtouch](https://hoytech.com/vmtouch/) with something like `vmtouch -dl
        /tmp/root/sh` next time, although that is only [available in buster
        and later](https://tracker.debian.org/pkg/vmtouch).
    
    
     2. kill all processes but the SSH daemon, your SSH connection, and
        your shell. This will vary from machine to machine, but a good way
        is to list all processes with `systemctl status` and `systemctl
        stop` the services one by one. Hint: multiple services can be
        passed on the same `stop` command, for example:
    
    
            systemctl stop acpid acpid.socket acpid.path atd bacula-fd bind9 cron dbus dbus.socket fail2ban haveged irqbalance libvirtd lvm2-lvmetad.service lvm2-lvmetad.socket mdmonitor nagios-nrpe-server ntp openvswitch-switch postfix prometheus-bind-exporter prometheus-node-exporter smartd strongswan syslog-ng.service systemd-journald systemd-journald-audit.socket systemd-journald-dev-log.socket systemd-journald.socket systemd-logind.service systemd-udevd systemd-udevd-control.socket systemd-udevd-kernel.socket ulogd2 unbound virtlogd virtlogd.socket
    
    
     3. disable swap:
    
            swapoff -a
    
     4. unmount everything that can be unmounted (except `/proc`):
    
            umount -a
    
     5. remount everything else readonly:
    
            mount -o remount,ro /
    
     6. sync disks:
    
            sync
    
    
     7. wipe the remaining disk and shutdown:
    
            nwipe --autonuke --method=random --verify=off /dev/sda ; \
            echo "SHUTTING DOWN FOREVER IN ONE MINUTE" ; \
            sleep 60 ; \
            echo o > /proc/sysrq-trigger
    
    A few tricks that might work in an emergency, if nothing else works
    in the shell:
    
     * `cat PATH` can be expressed as `mapfile -C "printf %s"  < PATH` in
       bash
     * `echo *` can be used as a rough approximation of `ls`
    
    
    ## Alternate, fabric-based procedure
    
     1. long before (weeks or months) the machine is decommissioned, make
        sure users are aware it will go away and know about its replacement
        services
     2. remove the host from `tor-nagios/config/nagios-master.cfg`
     3. if applicable, stop the VM in advance:
    
        * If the VM is on a KVM host: `virsh shutdown $host`, or at least
          stop the primary service on the machine
    
        * If the machine is on Ganeti: `gnt-instance stop $host`

          TODO: move this into Fabric
     4. after a delay, retire the host from its parent, backups and
        Puppet, for example:
    
    
            ./retire -v -H $INSTANCE retire-all --parent-host=$PARENT_HOST
    
        TODO: `$PARENT_HOST` should be some ganeti node
        (e.g. `fsn-node-01.torproject.org`) but could be auto-detected...
    
        TODO: cover physical machines
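
        One way to find the parent node by hand (a hint for the
        auto-detection TODO above), run from the Ganeti master:

            gnt-instance list -o name,pnode $INSTANCE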
    
     5. remove from LDAP with `ldapvi` (see step 6 above). TODO: add to
        Fabric; make sure you show the diff
     6. do one huge power-grep over all our source code; for example, for
        unifolium that was:
    
            grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org  -e unifolium.torproject.org -e unifolium -e kvm2
    
        TODO: extract those values from LDAP (e.g. purpose) and run the
        grep in Fabric
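
        A possible way to pull those values out of LDAP to feed the grep;
        the server, bind options and attribute names here are illustrative
        and may need adjusting:

            # query the host entry for its addresses and purpose
            ldapsearch -x -H ldap://db.torproject.org -b ou=hosts,dc=torproject,dc=org \
                "(host=unifolium)" ipHostNumber purpose
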
     7. remove from tor-passwords (TODO: put in Fabric). Magic command
        (not great):
    
            for f in *; do
                if gpg -d < $f 2>/dev/null | grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2 ; then
                    echo match found in $f
                    ~/src/pwstore/pws ed $f
                fi
            done
    
     8. remove from DNSwl
    
    
     9. remove the machine from this wiki (if present in documentation),
        from the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
        [ganeti](ganeti)) and, if it's an entire service, from the [services
        page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
    
     10. if it's a physical machine or a virtual host we don't control,
         schedule removal from racks or hosts with upstream
    
    
     11. remove from reverse DNS
    
    
    TODO: remove the client from the Bacula catalog, see <https://bugs.torproject.org/30880>.