full upgrade procedure review authored by anarcat
I reviewed the entire page to make sure it flows correctly. I mostly
worked on splitting up the per-service procedures in different
sections instead of one long list, but also reworked the headings to
have better categories that will look better in the TOC.

I also warn about individual reboots because the more common use case
is fleet-wide, or at least cluster-wide, reboots.

See: #33406
 [[_TOC_]]
 # Major upgrades
 Major upgrades are done by hand, with a "cheat sheet" created for each
@@ -68,25 +67,18 @@ block certain upgrades. If you want to bypass that, use regular `apt`:
 cumin -b 10 '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
-## GitLab runner upgrades
-Every month or so GitLab publishes a update to the `gitlab-runner` apt
-package. The package is excluded from `unattended-upgrades` to avoid any
-risk of interrupting long-running CI jobs (eg. large shadow sims).
-The recommended procedure is to go through each CI machine one at a time,
-pause all the runners on that single machine, ensure no long-running
-shadow sims are being executed, and launch `apt upgrade`. If any regular
-CI jobs are running, systemd will wait up to one hour for them to end,
-then proceed with the package upgrade.
-## Restarting services by hand
-After upgrades, there's a Nagios check that might trigger and tell you
-that some services are running with outdated libraries. Normally,
-[needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care of restarting
-services, but it can't actually deal with everything. In Nagios, you
-will see a warning like:
+# Special cases and manual restarts
+The above covers all upgrades that are automatically applied, but some
+are blocked from automation and require manual intervention.
+Others do upgrade automatically, but require a manual
+restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
+of restarting services, but it can't actually deal with everything.
+There is a Nagios check that might trigger and tell you that some
+services are running with outdated libraries. You may see a warning
+like:
 [web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
@@ -102,61 +94,94 @@ run the check by hand:
 needrestart -v
-There are a few scenarios here:
-* `cron.service`: typically services that should run under `systemd
---user`, reboot the box or ask the service admin to restart their
-services
-* `cron.service`, special case: sometimes, userdir-ldap's
-`ud-replicate` leaves a multiplexing SSH process lying
-around. logging into the LDAP server (currently `alberti`) and
-killing all the `sshdist` process will clear those:
+Packages are blocked from upgrades when they cause significant
+breakage during an upgrade run, enough to cause an outage and/or
+require significant recovery work. This is done through Puppet, in the
+`profile::unattended_upgrades` class, in the `blacklist` setting.
+Packages can be unblocked if and only if:
+* the bug is confirmed as fixed in Debian
+* the fix is deployed on all servers and confirmed as working
+* we have good confidence that future upgrades will not break the
+system again
+This section documents how to do some of those upgrades and restarts
+by hand.
+## GitLab runner upgrades
+Every month or so GitLab publishes a update to the `gitlab-runner` apt
+package. The package is excluded from `unattended-upgrades` to avoid any
+risk of interrupting long-running CI jobs (eg. large shadow sims).
+The recommended procedure is to go through each CI machine one at a time,
+pause all the runners on that single machine, ensure no long-running
+shadow sims are being executed, and launch `apt upgrade`. If any regular
+CI jobs are running, systemd will wait up to one hour for them to end,
+then proceed with the package upgrade.
+## cron.service
+This is typically services that should be ran under `systemd --user`
+but instead are started with a `@reboot` cron job.
+For this kind of service, reboot the server or ask the service admin
+to restart their services themselves. Ideally, this service should be
+converted to a systemd unit, see [this documentation](doc/services).
+### ud-replicate special case
+Sometimes, userdir-ldap's `ud-replicate` leaves a multiplexing SSH
+process lying around and those show up as part of
+`cron.service`.
+Logging into the LDAP server (currently `alberti`) and killing all the
+`sshdist` process will clear those:
 pkill -u sshdist ssh
-* `ganeti.service`: typically this is an OpenSSL upgrade that affects
-qemu, and restarting ganeti (thankfully) doesn't restart VMs. to
-fix this, migrate all VMs to their secondaries and back, see
-[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
-restart](#instance-only-restarts) procedure.
-* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
-[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart,
-OVS, then migrate the machines back. It's actually easier to just
-treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure, see the
-[Ganeti reboot procedures](howto/ganeti#rebooting) instead.
+## Ganeti
+The `ganeti.service` warning is typically an OpenSSL upgrade that
+affects qemu, and restarting ganeti (thankfully) doesn't restart
+VMs. to Fix this, migrate all VMs to their secondaries and back, see
+[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only restart](#instance-only-restarts)
+procedure.
+## Open vSwitch
+This is generally the `openvswitch-switch` and `openvswitch-common`
+services, which are blocked from upgrades because of [bug 34185](https://bugs.torproject.org/34185)
+To upgrade manually, empty the server, restart, OVS, then migrate the
+machines back. It's actually easier to just treat this as a "[reboot
+the nodes only](howto/ganeti#node-only-reboot)" procedure, see the [Ganeti reboot procedures](howto/ganeti#rebooting)
+instead.
 Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
 Debian is marked as fixed, but will still need to be tested on our
 side first. Update: it hasn't been fixed.
-- **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
-well, so it is blocked. to upgrade, make sure the install device is
-defined, by running `dpkg-reconfigure grub-pc`. this issue might
-actually have been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
-Packages are blocked from upgrades when they cause significant
-breakage during an upgrade run, enough to cause an outage and/or
-require significant recovery work. This is done through Puppet, in the
-`profile::unattended_upgrades` class, in the `blacklist` setting.
-Packages can be unblocked if and only if:
-* the bug is confirmed as fixed in Debian
-* the fix is deployed on all servers and confirmed as working
-* we have good confidence that future upgrades will not break the
-system again
+## Grub
+`grub-pc` ([bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as well, so
+it is blocked. to upgrade, make sure the install device is defined, by
+running `dpkg-reconfigure grub-pc`. this issue might actually have
+been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
+## user@ services
 Services setup with the new systemd-based startup system documented in
-[doc/services](doc/services) can be restarted with:
+[doc/services](doc/services) may not automatically restart. They may be
+(manually) restarted with:
 systemctl restart user@1504.service
 There's a feature request ([bug #843778](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843778)) to implement support for
 those services directly in needrestart.
-## Kernel upgrades and reboots
+# Reboots
 Sometimes it is necessary to perform a reboot on the hosts, when the
 kernel is updated. Nagios will warn about this, with something like
@@ -164,7 +189,10 @@ this:
 WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
-### Rebooting guests
+TODO: the above is the old way, the needrestart check has a different
+output. document it above.
+## Rebooting a single host
 If this is only a virtual machine, and the only one affected, it can
 be rebooted directly. This can be done with the `tsa-misc` script
@@ -186,6 +214,29 @@ entered at boot time, either through the initramfs (if it has the
 the case for the `mandos-01` server itself, for example, as it
 currently can't unlock itself, naturally.
+## Batch rebooting multiple hosts
+IMPORTANT: before following this procedure, make sure that only a
+subset of the hosts need a restart. If *all* hosts need a reboot, it's
+likely going to be faster and easier to reboot the entire clusters at
+once, see the [Ganeti reboot procedures](howto/ganeti#reboot) instead.
+LDAP hosts have information about how they can be rebooted, in the
+`rebootPolicy` field. Here are what the various fields mean:
+* `justdoit` - can be rebooted any time, with a 10 minute delay,
+possibly in parallel
+* `rotation` - part of a cluster where each machine needs to be
+rebooted one at a time, with a 30 minute delay for DNS to update
+* `manual` - needs to be done by hand or with a special tool (fabric
+in case of ganeti, reboot-host in the case of KVM, nothing for
+windows boxes)
+Therefore, it's possible to selectively reboot some of those hosts in
+batches. Again, this is pretty rare: typically, you would either
+reboot only a single host or *all* hosts, in which case a cluster-wide
+reboot (with Ganeti, below) would be more appropriate.
 This routine should be able to reboot all hosts with a `rebootPolicy`
 defined to `justdoit` or `rotation`:
@@ -197,29 +248,17 @@ defined to `justdoit` or `rotation`:
 ## Rebooting KVM hosts
-The remaining is the "manual" procedure, the KVM hosts:
+The remaining is the "manual" procedure, which includes one KVM last:
 ./reboot-host moly.torproject.org
+... and Ganeti nodes, below.
 ## Rebooting Ganeti nodes
-See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
-procedure.
+See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this procedure.
 ## Remaining nodes
 The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
 might have been missed by the above procedure..
-### Generic upgrade routines
-LDAP hosts have information about how they can be rebooted, in the
-`rebootPolicy` field. Here are what the various fields mean:
-* `justdoit` - can be rebooted any time, with a 10 minute delay,
-possibly in parallel
-* `rotation` - part of a cluster where each machine needs to be
-rebooted one at a time, with a 30 minute delay for DNS to update
-* `manual` - needs to be done by hand or with a special tool (fabric
-in case of ganeti, reboot-host in the case of KVM, nothing for
-windows boxes)
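
The `rebootPolicy` values described in this change each imply a fixed delay, which can be sketched as a small shell function. This is only an illustration under stated assumptions: `reboot_delay_minutes` is a hypothetical name and is not part of this change, of `tsa-misc`, or of any existing tool; the delays are the ones given in the list, and the `manual` case is treated as an error because it has no automatic procedure.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): map a rebootPolicy value
# to the delay, in minutes, described in the rebootPolicy list.
reboot_delay_minutes() {
    case "$1" in
        justdoit) echo 10 ;;   # any time, 10 minute delay, possibly in parallel
        rotation) echo 30 ;;   # one machine at a time, 30 minutes for DNS to update
        manual)
            # no automatic delay: done by hand or with a special tool
            echo "manual: use fabric (Ganeti) or reboot-host (KVM)" >&2
            return 1
            ;;
        *)
            echo "unknown rebootPolicy: $1" >&2
            return 2
            ;;
    esac
}
```

For example, `reboot_delay_minutes rotation` prints `30`, while `reboot_delay_minutes manual` prints a reminder on stderr and returns a non-zero status so a calling script can skip that host.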