|
|
# Decommissioning a host
|
|
|
|
|
|
Warning: this procedure is difficult to follow and error-prone. A new
|
|
|
procedure is being established in Fabric, below. It should still work,
|
|
|
provided you follow the warnings.
|
|
|
|
|
|
1. long before (weeks or months) the machine is retired, make
|
|
|
sure users are aware it will go away and of its replacement services
|
|
|
2. remove the host from `tor-nagios/config/nagios-master.cfg`
|
... | ... | @@ -13,83 +9,83 @@ provided you follow the warnings. |
|
|
primary service on the machine
|
|
|
|
|
|
* If the machine is on ganeti: `gnt-instance stop $host`
|
|
|
TODO: move this into Fabric
|
|
|
4. after a delay, retire the host from its parent, backups and
|
|
|
Puppet, for example:
|
|
|
|
|
|
4. On KVM hosts, undefine the VM: `virsh undefine $host`
|
|
|
./retire -v -H $INSTANCE retire-all --parent-host=$PARENT_HOST
|
|
|
|
|
|
5. wipe host data, possibly with a delay:
|
|
|
TODO: `$PARENT_HOST` should be some ganeti node
|
|
|
(e.g. `fsn-node-01.torproject.org`) but could be auto-detected...
|
|
|
|
|
|
* On some KVM hosts, remove the LVM logical volumes:
|
|
|
TODO: cover physical machines
|
|
|
5. remove from LDAP with `ldapvi` (STEP 6 above) TODO: add to Fabric,
|
|
|
make sure you show the diff
|
|
|
6. do one huge power-grep over all our source code, for example with
|
|
|
unifolium that was:
|
|
|
|
|
|
echo 'lvremove -y vgname/lvname' | at now + 7 days
|
|
|
grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
|
|
|
|
|
|
Use `lvs` to list the logical volumes on the machine.
|
|
|
TODO: extract those values from LDAP (e.g. purpose) and run the
|
|
|
grep in Fabric
|
|
|
7. remove from tor-passwords (TODO: put in fabric). magic command
|
|
|
(not great):
|
|
|
|
|
|
* Other KVM hosts use file-backed storage:
|
|
|
|
|
|
echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
|
|
|
for f in *; do
|
|
|
if gpg -d < $f 2>/dev/null | grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2 ; then
|
|
|
echo match found in $f
|
|
|
~/src/pwstore/pws ed $f
|
|
|
fi
|
|
|
done
|
|
|
|
|
|
* On Ganeti hosts, remove the actual instance with a delay, from
|
|
|
the Ganeti master:
|
|
|
8. remove from DNSwl
|
|
|
|
|
|
echo "gnt-instance remove $host" | at now + 7 days
|
|
|
9. remove the machine from this wiki (if present in
|
|
|
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
|
|
|
[ganeti](ganeti)), and, if it's an entire service, the [services
|
|
|
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
|
|
|
10. if it's a physical machine or a virtual host we don't control,
|
|
|
schedule removal from racks or hosts with upstream
|
|
|
|
|
|
* for a normal machine or a machine we do not own the parent host
|
|
|
for, wipe the disks using the method described below
|
|
|
11. remove from reverse DNS
|
|
|
|
|
|
6. remove it from LDAP: the host entry and any `@<host>` group memberships there might be as well as any `sudo` passwords users might have configured for that host
|
|
|
7. if it has any associated records in `tor-dns/domains` or
|
|
|
`auto-dns`, or upstream's reverse DNS, remove it from there
|
|
|
too. e.g.
|
|
|
|
|
|
grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1
|
|
|
12. after 30 days delay, retire from Bacula catalog, on the director
|
|
|
(currently `bacula-director-01`), run `bconsole` then:
|
|
|
|
|
|
delete client=$INSTANCE-fd
|
|
|
|
|
|
... and check upstream reverse DNS.
|
|
|
8. on the puppet server (`pauli`): `read host ; puppet node clean $host.torproject.org &&
|
|
|
puppet node deactivate $host.torproject.org`
|
|
|
TODO: That procedure is incomplete, use the `retire.revoke-puppet`
|
|
|
job in fabric instead.
|
|
|
9. grep the `tor-puppet` repository for the host (and maybe its IP
|
|
|
addresses) and clean up; also look for files with hostname in
|
|
|
their name
|
|
|
10. clean host from `tor-passwords`
|
|
|
11. remove any certs and backup keys from `letsencrypt-domains.git` and
|
|
|
`letsencrypt-domains/backup-keys.git` repositories that are no
|
|
|
longer relevant:
|
|
|
|
|
|
git -C letsencrypt-domains grep -e $host -e storm.torproject.org
|
|
|
# remove entries found above
|
|
|
git -C letsencrypt-domains commit
|
|
|
git -C letsencrypt-domains push
|
|
|
find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
|
|
|
git -C letsencrypt-domains/backup-keys commit
|
|
|
git -C letsencrypt-domains/backup-keys push
|
|
|
|
|
|
Also clean up the relevant files on the letsencrypt master
|
|
|
(currently `nevii`), for example:
|
|
|
for example:
|
|
|
|
|
|
delete client=archeotrichon.torproject.org-fd
|
|
|
|
|
|
ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
|
|
|
ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
|
|
|
TODO: You might still need to run the `dbcheck.sql` script to clean
|
|
|
related resources, see [issue 40525](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525) for details.
|
|
|
|
|
|
<!-- sync this section with howto/backup#retiring-a-client when -->
|
|
|
<!-- changing -->
|
|
|
|
|
|
TODO: add to fabric
|
|
|
|
|
|
12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
|
|
|
(password in tor-passwords, `hosts-extra-info`) - consider that
|
|
|
it can take a long time (weeks? months?) to be able to "re-add"
|
|
|
an IP address in that service, so if that IP can eventually be
|
|
|
reused, it might be better to keep it there in the short term
|
|
|
13. schedule a removal of the host's backup, on the backup server
|
|
|
(currently `bungei`):
|
|
|
13. after 30 days delay, remove PostgreSQL backups on the storage
|
|
|
server (currently `/srv/backups/pg` on `bungei`), if relevant
|
|
|
|
|
|
cd /srv/backups/bacula/
|
|
|
mv $host.torproject.org $host.torproject.org-OLD
|
|
|
echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
|
|
|
TODO: add to fabric
|
|
|
|
|
|
14. remove the machine from this wiki (if present in
|
|
|
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
|
|
|
[ganeti](ganeti)), and, if it's an entire service, the [services
|
|
|
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
|
|
|
15. if it's a physical machine or a virtual host we don't control,
|
|
|
schedule removal from racks or hosts with upstream
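
The power-grep step above passes every known name and address of the host as a separate `-e` pattern. A minimal sketch of a wrapper for that step follows. The `power_grep` function name is hypothetical and not part of any existing tooling. It adds `-F` so that the dots in hostnames and IP addresses are matched literally, which the original command does not do.

```sh
# Hypothetical helper (not part of Fabric or any TPA tooling):
# search a directory tree for every identifier of a retired host.
# Usage: power_grep DIR IDENTIFIER...
power_grep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: power_grep DIR IDENTIFIER..." >&2
        return 2
    fi
    pg_dir=$1
    shift
    # Turn each identifier into a "-e IDENTIFIER" pair. The loop iterates
    # over the original arguments, appending a pair and dropping the
    # original argument from the front on each pass.
    for pg_id in "$@"; do
        set -- "$@" -e "$pg_id"
        shift
    done
    # -F treats patterns as fixed strings, so "." is not a wildcard.
    grep -nH -r -F "$@" -- "$pg_dir"
}
```

For the unifolium example, this could be called as `power_grep . 148.251.180.115 2a01:4f8:211:6e8::2 kvm2.torproject.org unifolium.torproject.org unifolium kvm2`.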
|
|
|
Equivalent retirement checklist to copy-paste in retirement tickets:
|
|
|
|
|
|
TODO: remove the client from the Bacula catalog, see <https://bugs.torproject.org/30880>.
|
|
|
1. [ ] announcement
|
|
|
2. [ ] nagios
|
|
|
3. [ ] stop the VM in advance:
|
|
|
4. [ ] retire the host in fabric
|
|
|
5. [ ] remove from LDAP with `ldapvi`
|
|
|
6. [ ] power-grep
|
|
|
7. [ ] remove from tor-passwords
|
|
|
8. [ ] remove from DNSwl
|
|
|
9. [ ] remove from docs
|
|
|
10. [ ] remove from racks
|
|
|
11. [ ] remove from reverse DNS
|
|
|
12. [ ] remove from bacula director
|
|
|
13. [ ] remove PostgreSQL backups
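
The tor-passwords step in the procedure decrypts each file and greps the cleartext for the host's identifiers. A minimal sketch of that loop as a reusable function follows. The function name `find_password_matches`, the pattern file and the `DECRYPT_CMD` override are all assumptions made for illustration. It only lists matching files and leaves the editing with `pws ed` to the operator.

```sh
# Hypothetical helper: print the name of every file whose decrypted
# content mentions one of the identifiers listed in PATTERN_FILE
# (one identifier per line, no empty lines, since an empty pattern
# matches everything).
# DECRYPT_CMD is an assumption added so the loop can be tried without
# GnuPG; it defaults to "gpg -d" reading from standard input.
# Usage: find_password_matches PATTERN_FILE FILE...
find_password_matches() {
    fpm_patterns=$1
    shift
    for fpm_file in "$@"; do
        # -i: case-insensitive, -F: fixed strings, -q: exit status only
        if ${DECRYPT_CMD:-gpg -d} < "$fpm_file" 2>/dev/null \
            | grep -q -i -F -f "$fpm_patterns"; then
            printf '%s\n' "$fpm_file"
        fi
    done
}
```

Each file name printed this way can then be opened with `~/src/pwstore/pws ed`, as in the original loop.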
|
|
|
|
|
|
## Wiping disks
|
|
|
|
... | ... | @@ -183,7 +179,11 @@ of an emergency: |
|
|
bash
|
|
|
* `echo *` can be used as a rough approximation of `ls`
|
|
|
|
|
|
## Alternate, fabric-based procedure
|
|
|
## Deprecated manual procedure
|
|
|
|
|
|
Warning: this procedure is difficult to follow and error-prone. A new
|
|
|
procedure was established in Fabric, above. It should really just be
|
|
|
completely avoided.
|
|
|
|
|
|
1. long before (weeks or months) the machine is retired, make
|
|
|
sure users are aware it will go away and of its replacement services
|
... | ... | @@ -194,80 +194,82 @@ of an emergency: |
|
|
primary service on the machine
|
|
|
|
|
|
* If the machine is on ganeti: `gnt-instance stop $host`
|
|
|
TODO: move this into Fabric
|
|
|
4. after a delay, retire the host from its parent, backups and
|
|
|
Puppet, for example:
|
|
|
|
|
|
./retire -v -H $INSTANCE retire-all --parent-host=$PARENT_HOST
|
|
|
4. On KVM hosts, undefine the VM: `virsh undefine $host`
|
|
|
|
|
|
TODO: `$PARENT_HOST` should be some ganeti node
|
|
|
(e.g. `fsn-node-01.torproject.org`) but could be auto-detected...
|
|
|
5. wipe host data, possibly with a delay:
|
|
|
|
|
|
TODO: cover physical machines
|
|
|
5. remove from LDAP with `ldapvi` (STEP 6 above) TODO: add to Fabric,
|
|
|
make sure you show the diff
|
|
|
6. do one huge power-grep over all our source code, for example with
|
|
|
unifolium that was:
|
|
|
* On some KVM hosts, remove the LVM logical volumes:
|
|
|
|
|
|
grep -nH -r -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2
|
|
|
echo 'lvremove -y vgname/lvname' | at now + 7 days
|
|
|
|
|
|
TODO: extract those values from LDAP (e.g. purpose) and run the
|
|
|
grep in Fabric
|
|
|
7. remove from tor-passwords (TODO: put in fabric). magic command
|
|
|
(not great):
|
|
|
Use `lvs` to list the logical volumes on the machine.
|
|
|
|
|
|
for f in *; do
|
|
|
if gpg -d < $f 2>/dev/null | grep -i -e 148.251.180.115 -e 2a01:4f8:211:6e8::2 -e kvm2.torproject.org -e unifolium.torproject.org -e unifolium -e kvm2 ; then
|
|
|
echo match found in $f
|
|
|
~/src/pwstore/pws ed $f
|
|
|
fi
|
|
|
done
|
|
|
* Other KVM hosts use file-backed storage:
|
|
|
|
|
|
echo 'rm -r /srv/vmstore/gayi.torproject.org/' | at now + 7 days
|
|
|
|
|
|
8. remove from DNSwl
|
|
|
* On Ganeti hosts, remove the actual instance with a delay, from
|
|
|
the Ganeti master:
|
|
|
|
|
|
9. remove the machine from this wiki (if present in
|
|
|
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
|
|
|
[ganeti](ganeti)), and, if it's an entire service, the [services
|
|
|
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
|
|
|
10. if it's a physical machine or a virtual host we don't control,
|
|
|
schedule removal from racks or hosts with upstream
|
|
|
echo "gnt-instance remove $host" | at now + 7 days
|
|
|
|
|
|
11. remove from reverse DNS
|
|
|
* for a normal machine or a machine we do not own the parent host
|
|
|
for, wipe the disks using the method described below
|
|
|
|
|
|
12. after 30 days delay, retire from Bacula catalog, on the director
|
|
|
(currently `bacula-director-01`), run `bconsole` then:
|
|
|
|
|
|
delete client=$INSTANCE-fd
|
|
|
6. remove it from LDAP: the host entry and any `@<host>` group memberships there might be as well as any `sudo` passwords users might have configured for that host
|
|
|
7. if it has any associated records in `tor-dns/domains` or
|
|
|
`auto-dns`, or upstream's reverse DNS, remove it from there
|
|
|
too. e.g.
|
|
|
|
|
|
for example:
|
|
|
|
|
|
delete client=archeotrichon.torproject.org-fd
|
|
|
grep -r -e build-x86-07 -e 78.47.38.230 -e 2a01:4f8:211:6e8:0:823:6:1
|
|
|
|
|
|
... and check upstream reverse DNS.
|
|
|
8. on the puppet server (`pauli`): `read host ; puppet node clean $host.torproject.org &&
|
|
|
puppet node deactivate $host.torproject.org`
|
|
|
TODO: That procedure is incomplete, use the `retire.revoke-puppet`
|
|
|
job in fabric instead.
|
|
|
9. grep the `tor-puppet` repository for the host (and maybe its IP
|
|
|
addresses) and clean up; also look for files with hostname in
|
|
|
their name
|
|
|
10. clean host from `tor-passwords`
|
|
|
11. remove any certs and backup keys from `letsencrypt-domains.git` and
|
|
|
`letsencrypt-domains/backup-keys.git` repositories that are no
|
|
|
longer relevant:
|
|
|
|
|
|
TODO: You might still need to run the `dbcheck.sql` script to clean
|
|
|
related resources, see [issue 40525](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40525) for details.
|
|
|
|
|
|
<!-- sync this section with howto/backup#retiring-a-client when -->
|
|
|
<!-- changing -->
|
|
|
|
|
|
TODO: add to fabric
|
|
|
git -C letsencrypt-domains grep -e $host -e storm.torproject.org
|
|
|
# remove entries found above
|
|
|
git -C letsencrypt-domains commit
|
|
|
git -C letsencrypt-domains push
|
|
|
find letsencrypt-domains/backup-keys \( -name "$host.torproject.org" -o -name 'storm.torproject.org*' \) -delete
|
|
|
git -C letsencrypt-domains/backup-keys commit
|
|
|
git -C letsencrypt-domains/backup-keys push
|
|
|
|
|
|
13. after 30 days delay, remove PostgreSQL backups on the storage
|
|
|
server (currently `/srv/backups/pg` on `bungei`), if relevant
|
|
|
Also clean up the relevant files on the letsencrypt master
|
|
|
(currently `nevii`), for example:
|
|
|
|
|
|
TODO: add to fabric
|
|
|
ssh nevii rm -rf /srv/letsencrypt.torproject.org/var/certs/storm.torproject.org
|
|
|
ssh nevii find /srv/letsencrypt.torproject.org/ -name 'storm.torproject.org.*' -delete
|
|
|
|
|
|
Equivalent retirement checklist to copy-paste in retirement tickets:
|
|
|
12. if the machine is handling mail, remove it from [dnswl.org](https://www.dnswl.org/)
|
|
|
(password in tor-passwords, `hosts-extra-info`) - consider that
|
|
|
it can take a long time (weeks? months?) to be able to "re-add"
|
|
|
an IP address in that service, so if that IP can eventually be
|
|
|
reused, it might be better to keep it there in the short term
|
|
|
13. schedule a removal of the host's backup, on the backup server
|
|
|
(currently `bungei`):
|
|
|
|
|
|
1. [ ] announcement
|
|
|
2. [ ] nagios
|
|
|
3. [ ] stop the VM in advance:
|
|
|
4. [ ] retire the host in fabric
|
|
|
5. [ ] remove from LDAP with `ldapvi`
|
|
|
6. [ ] power-grep
|
|
|
7. [ ] remove from tor-passwords
|
|
|
8. [ ] remove from DNSwl
|
|
|
9. [ ] remove from docs
|
|
|
10. [ ] remove from racks
|
|
|
11. [ ] remove from reverse DNS
|
|
|
12. [ ] remove from bacula director
|
|
|
13. [ ] remove PostgreSQL backups |
|
|
cd /srv/backups/bacula/
|
|
|
mv $host.torproject.org $host.torproject.org-OLD
|
|
|
echo rm -rf /srv/backups/bacula/$host.torproject.org-OLD/ | at now + 30 days
|
|
|
|
|
|
14. remove the machine from this wiki (if present in
|
|
|
documentation), the [Nextcloud spreadsheet](https://nc.torproject.net/apps/onlyoffice/5395) (if it is not in
|
|
|
[ganeti](ganeti)), and, if it's an entire service, the [services
|
|
|
page](https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/services)
|
|
|
15. if it's a physical machine or a virtual host we don't control,
|
|
|
schedule removal from racks or hosts with upstream
|
|
|
|
|
|
TODO: remove the client from the Bacula catalog, see
|
|
|
<https://bugs.torproject.org/30880>. Done by the Fabric procedure
|
|
|
above. |
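
The backup removal step of the manual procedure renames the host's Bacula directory and then queues its deletion with `at`. Both commands have to agree on the renamed path. A minimal sketch that derives the `rm` command from a single place follows. The `backup_removal_job` function name is hypothetical, and the `-OLD` suffix and the `/srv/backups/bacula` default are taken from the commands in the procedure.

```sh
# Hypothetical helper: print the command that deletes the renamed
# backup directory of a retired host, ready to be piped into at(1).
# Assumes the short host name contains no single quote.
# Usage: backup_removal_job HOST [BASE_DIR]
backup_removal_job() {
    brj_host=$1
    brj_base=${2:-/srv/backups/bacula}
    printf "rm -rf -- '%s'\n" "$brj_base/$brj_host.torproject.org-OLD/"
}

# Intended use on the backup server, after the rename:
#   cd /srv/backups/bacula/
#   mv "$host.torproject.org" "$host.torproject.org-OLD"
#   backup_removal_job "$host" | at now + 30 days
```

Printing the command instead of running it keeps the destructive part visible for review before it is queued.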