# Decommissioning a host
2. make sure users are aware it will go away and of its replacement services
3. retire the host from its parent, backups and Puppet, for example:

        fab -H $INSTANCE retire.retire-all --parent-host=$PARENT_HOST

   Copy the output of the script into the retirement ticket. The
   destruction delays can be raised for more sensitive hosts: the
   default is 7 days for disks and 30 days for backups; for sensitive
   hosts, use (say) 30 days for disks and 90 for backups.
TODO: `$PARENT_HOST` should be some ganeti node
(e.g. `fsn-node-01.torproject.org`) but could be auto-detected...
5. remove the host from LDAP with `ldapvi` (step 6 above) and
   copy-paste the removed entry into the ticket
6. do one huge power-grep and find over all our source code, for
   example with:

        grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
TODO: extract those values from LDAP (e.g. purpose) and run the
grep in Fabric
7. remove from tor-passwords (TODO: put in fabric). Magic command
   (not great):

        pass rm root/unifolium.torproject.org
        # look for traces of the host elsewhere
        for f in */*; do
            if gpg -d < "$f" 2>/dev/null | \
               grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
            then
                echo "$f"
            fi
        done
9. remove the machine from this wiki (if present in
   documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
   [ganeti](ganeti)), and, if it's an entire service, the [services
   page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
10. if it's a physical machine or a virtual host we don't control,
schedule removal from racks or hosts with upstream
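The TODO in step 6 (extract the identifiers and run the grep automatically) can be sketched in plain shell. This is only an illustration: the host name and addresses are the examples used in the text, and in practice those values would come from LDAP rather than being hardcoded:

```shell
# Build the power-grep expression list from a host's identifiers
# (sketch only; these values are the examples from the text and
# would normally be extracted from LDAP).
host=unifolium
fqdn=$host.torproject.org
addrs="148.251.180.115 2a01:4f8:211:6e8::2 kvm2.torproject.org kvm2"
args=""
for t in $addrs $fqdn $host; do
    args="$args -e $t"
done
# the resulting command, ready to run at the top of a source tree:
echo grep -nH -r$args
```

Running the generated command over each source repository then replaces the hand-typed grep in step 6.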
Equivalent retirement checklist to copy-paste in retirement tickets:
3. [ ] retire the host in fabric
4. [ ] remove from LDAP with `ldapvi`
5. [ ] power-grep
6. [ ] remove from tor-passwords
7. [ ] remove from DNSwl
8. [ ] remove from docs
9. [ ] remove from racks
10. [ ] remove from reverse DNS
## Wiping disks
To wipe disks on servers without a serial console or management
interface, you need to be a little more creative. We do this with the
`nwipe(1)` command, which should be installed (`apt install nwipe`)
before anything else:
If there's a RAID array, first wipe one of the disks by taking it
offline and writing garbage:

    mdadm --fail /dev/md0 /dev/sdb1 &&
    mdadm --remove /dev/md0 /dev/sdb1 &&
    mdadm --fail /dev/md1 /dev/sdb2 &&
    mdadm --remove /dev/md1 /dev/sdb2 &&
    : etc, for the other RAID elements in /proc/mdstat &&
    nwipe --autonuke --method=random --verify=off /dev/sdb
This will take a long time. Note that it will start a GUI which is
useful because it will give you timing estimates, which the
command-line version [does not provide](https://github.com/martijnvanbrummelen/nwipe/issues/196).
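To know which partitions to `--fail` and `--remove`, the array members can be read out of `/proc/mdstat`. A minimal sketch, run here against a fabricated sample instead of the real file:

```shell
# List RAID member partitions from /proc/mdstat-style output
# (the sample text below is made up for illustration; on a real
# host, read /proc/mdstat instead).
mdstat='md0 : active raid1 sda1[0] sdb1[1]
md1 : active raid1 sda2[0] sdb2[1]'
members=$(printf '%s\n' "$mdstat" |
    grep -o '[a-z]\+[0-9]\+\[[0-9]\+\]' |
    sed 's/\[.*//')
echo "$members"
```

Each line of the output is one partition to fail and remove from its array before wiping the whole device.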
WARNING: this procedure doesn't cover the case where the disk is an
SSD. See [this paper][] for details on how classic data scrubbing
software might not work for SSDs. For now we use this:

    nwipe --autonuke --method=random --rounds=2 --verify=off /dev/nvme1n1
TODO: consider `hdparm` and the "secure erase" procedure for SSDs:

    hdparm --user-master u --security-set-pass Eins /dev/sdc
    time hdparm --user-master u --security-erase Eins /dev/sdc
See also the [stressant documentation](https://stressant.readthedocs.io/en/latest/usage.html#wiping-disks) about this.
1. start a `screen` session with a static `busybox` as your `SHELL`
   that will survive disk wiping:

        # make sure /tmp is on a tmpfs first!
        cp -av /root /tmp/root &&
        mount -o bind /tmp/root /root &&
        cp /bin/busybox /tmp/root/sh &&
        export SHELL=/tmp/root/sh &&
        screen
2. lock down busybox and screen in memory:

        vmtouch -dl /usr/bin/screen /bin/busybox /tmp/root/sh /usr/sbin/nwipe
TODO: the above aims at making busybox survive the destruction, so
that it's cached in RAM. It's unclear if that actually works,
because typically SSH is also busted and needs a lot more to
bootstrap, so we can't log back in if we lose the
console. Ideally, we'd run this in a serial console that would
have more reliable access... See also [vmtouch](https://hoytech.com/vmtouch/).
3. stop all services that might write to disk, except the ones needed
   for your shell. This will vary from machine to machine, but a good
   way is to list all processes with `systemctl status` and
   `systemctl stop` the services one by one. Hint: multiple services
   can be passed on the same `stop` command, for example:

        systemctl stop \
            acpid \
            acpid.path \
            acpid.socket \
            atd \
            bacula-fd \
            bind9 \
            cron \
            dbus \
            dbus.socket \
            fail2ban \
            ganeti \
            haveged \
            irqbalance \
            ipsec \
            iscsid \
            libvirtd \
            lvm2-lvmetad.service \
            lvm2-lvmetad.socket \
            mdmonitor \
            multipathd.service \
            multipathd.socket \
            ntp \
            openvswitch-switch \
            postfix \
            prometheus-bind-exporter \
            prometheus-node-exporter \
            smartd \
            strongswan \
            syslog-ng.service \
            systemd-journald \
            systemd-journald-audit.socket \
            systemd-journald-dev-log.socket \
            systemd-journald.socket \
            systemd-logind.service \
            systemd-udevd \
            systemd-udevd-control.socket \
            systemd-udevd-kernel.socket \
            timers.target \
            ulogd2 \
            unbound \
            virtlogd \
            virtlogd.socket
5. remount the root filesystem read-only:

        mount -o remount,ro /
6. sync disks:

        sync
7. wipe the disk, then power off (or, failing that, reboot) through
   sysrq:

        nwipe --autonuke --method=random --rounds=2 --verify=off /dev/noop ; \
        echo o > /proc/sysrq-trigger ; \
        sleep 60 ; \
        echo b > /proc/sysrq-trigger

   Note: as a safety precaution, the device above has been replaced
   by `noop`; it should be the real device (say `sda`) instead.
A few tricks that might work in an emergency, if nothing else works
in the shell:

* `cat PATH` can be approximated as `mapfile -c 1 -C "printf %s" < PATH`
  in bash (without `-c 1`, the callback only runs every 5000 lines)
* `echo *` can be used as a rough approximation of `ls`
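Both tricks can be rehearsed in a normal bash shell before they are needed; a quick demonstration (the file name is arbitrary):

```shell
# Demonstrate the emergency built-ins (bash required: mapfile is a
# bash builtin). Neither command forks an external binary, which is
# the point once the disks are gone.
printf 'hello emergency\n' > /tmp/demo.txt
# rough `cat`: mapfile invokes the callback with "index line" for
# each line read, and printf %s prints both back out
out=$(mapfile -c 1 -C "printf %s" < /tmp/demo.txt)
echo "$out"
# rough `ls`: pathname expansion does the directory listing
files=$(cd /tmp && echo *)
echo "$files"
```

Note that the `mapfile` output is prefixed with the line index and loses newlines, so it is only a rough approximation of `cat`.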
[this paper]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.187.3062&rep=rep1&type=pdf
## Deprecated manual procedure
Warning: this procedure is difficult to follow and error-prone. A new
procedure was established in Fabric, above. It should really just be
completely avoided.
2. make sure users are aware it will go away and of its replacement services
3. if applicable, stop the VM in advance:
* If the VM is on a KVM host: `virsh shutdown $host`, or at least stop the
primary service on the machine
* If the machine is on ganeti: `gnt-instance stop $host`
4. On KVM hosts, undefine the VM: `virsh undefine $host`
5. wipe host data, possibly with a delay:
   * On some KVM hosts, remove the LVM logical volumes:

         echo 'lvremove -y vgname/lvname' | at now + 7 days

     `lvs` will list the logical volumes on the machine.
   * Other KVM hosts use file-backed storage:

         echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
   * On Ganeti hosts, remove the actual instance with a delay, from
     the Ganeti master:

         echo "gnt-instance remove $host" | at now + 7 days
* for a normal machine or a machine we do not own the parent host
for, wipe the disks using the method described below
6. remove it from LDAP: the host entry and any `@<host>` group
   memberships there might be, as well as any `sudo` passwords users
   might have configured for that host
7. if it has any associated records in `tor-dns/domains` or
   `auto-dns`, or upstream's reverse DNS, remove it from there too,
   e.g.:

        grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1

   ... and check upstream reverse DNS.
8. on the puppet server (`pauli`):

        read host
        puppet node clean $host.torproject.org &&
        puppet node deactivate $host.torproject.org

   TODO: That procedure is incomplete, use the `retire.revoke-puppet`
   job in fabric instead.
9. grep the `tor-puppet` repository for the host (and maybe its IP
addresses) and clean up; also look for files with hostname in
their name
10. clean host from `tor-passwords`
11. remove any certs and backup keys from `letsencrypt-domains.git` and
    `letsencrypt-domains/backup-keys.git` repositories that are no
    longer relevant:

        git -C letsencrypt-domains grep -e $host -e storm.torproject.org
        # remove entries found above
        git -C letsencrypt-domains commit
        git -C letsencrypt-domains push
        find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
        git -C letsencrypt-domains/backup-keys commit
        git -C letsencrypt-domains/backup-keys push
    Also clean up the relevant files on the letsencrypt master
    (currently `nevii`), for example:

        ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
        ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
(password in tor-passwords, `hosts-extra-info`) - consider that
it can take a long time (weeks? months?) to be able to "re-add"
an IP address in that service, so if that IP can eventually be
reused, it might be better to keep it there in the short term
13. schedule a removal of the host's backups, on the backup server
    (currently `bungei`):

        cd /srv/backups/bacula/
        mv $host.torproject.org $host.torproject.org-OLD
        echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
14. remove the machine from this wiki (if present in
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
[ganeti](ganeti)), and, if it's an entire service, the [services
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
15. if it's a physical machine or a virtual host we don't control,
schedule removal from racks or hosts with upstream
16. after the 30 days delay, retire the host from the Bacula catalog:
    on the director (currently `bacula-director-01`), run `bconsole`
    then:

        delete client=$INSTANCE-fd

    for example:

        delete client=archeotrichon.torproject.org-fd
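Several of the steps above rely on the same grace-period pattern: pipe the destructive command into `at(1)` instead of running it immediately, so there is a window to cancel with `atrm`. A generic sketch (the host name and path are examples from the text, and the final `at` line is left commented out so the sketch is safe to run anywhere):

```shell
# Delayed-destruction pattern used throughout this procedure:
# build the command, review it, then hand it to at(1) to run later.
host=gayi.torproject.org                 # example host name
cmd="rm -r /srv/vmstore/$host/"          # example destructive command
echo "$cmd"                              # review before queuing
# echo "$cmd" | at now + 7 days          # uncomment on the real server
```

While the command is queued, `atq` lists the pending job and `atrm <job>` aborts it, which is the whole point of the delay.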
<!-- sync this section with service/backup#retiring-a-client when -->