full upgrade procedure review authored by anarcat
I reviewed the entire page to make sure it flows correctly. I mostly
worked on splitting up the per-service procedures in different
sections instead of one long list, but also reworked the headings to
have better categories that will look better in the TOC.

I also warn about individual reboots because the more common use case
is fleet-wide, or at least cluster-wide, reboots.

See: #33406
[[_TOC_]]
# Major upgrades
Major upgrades are done by hand, with a "cheat sheet" created for each
......@@ -68,25 +67,18 @@ block certain upgrades. If you want to bypass that, use regular `apt`:
cumin -b 10 '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
## GitLab runner upgrades
Every month or so GitLab publishes an update to the `gitlab-runner` apt
package. The package is excluded from `unattended-upgrades` to avoid any
risk of interrupting long-running CI jobs (e.g. large shadow sims).
# Special cases and manual restarts
The recommended procedure is to go through each CI machine one at a time,
pause all the runners on that single machine, ensure no long-running
shadow sims are being executed, and launch `apt upgrade`. If any regular
CI jobs are running, systemd will wait up to one hour for them to end,
then proceed with the package upgrade.
The above covers all upgrades that are automatically applied, but some
are blocked from automation and require manual intervention.
## Restarting services by hand
Others do upgrade automatically, but require a manual
restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
of restarting services, but it can't actually deal with everything.
After upgrades, there's a Nagios check that might trigger and tell you
that some services are running with outdated libraries. Normally,
[needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care of restarting
services, but it can't actually deal with everything. In Nagios, you
will see a warning like:
There is a Nagios check that might trigger and tell you that some
services are running with outdated libraries. You may see a warning
like:
[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
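When triaging several such warnings, it can help to pull out just the number of services awaiting a restart. The following is a minimal sketch and not part of the existing tooling: the function name `needrestart_service_count` is hypothetical, and it assumes the status line keeps the `Services: N` shape shown in the sample warning.

```shell
#!/bin/sh
# Hypothetical helper (not part of needrestart or the Nagios check):
# print the number of services needing a restart, taken from a status
# line shaped like the sample warning. Prints nothing when the line
# has no numeric "Services: N" field (for example "Services: none").
needrestart_service_count() {
    printf '%s\n' "$1" | sed -n 's/.*Services: \([0-9][0-9]*\).*/\1/p'
}

status='[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none'
needrestart_service_count "$status"
```

For the sample line this prints `1`. Running `needrestart -v` on the host itself remains the way to see which services are affected.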
......@@ -102,61 +94,94 @@ run the check by hand:
needrestart -v
There are a few scenarios here:
Packages are blocked from upgrades when they cause significant
breakage during an upgrade run, enough to cause an outage and/or
require significant recovery work. This is done through Puppet, in the
`profile::unattended_upgrades` class, in the `blacklist` setting.
Packages can be unblocked if and only if:
* `cron.service`: typically services that should run under `systemd
--user`; reboot the box or ask the service admin to restart their
services
* the bug is confirmed as fixed in Debian
* the fix is deployed on all servers and confirmed as working
* we have good confidence that future upgrades will not break the
system again
* `cron.service`, special case: sometimes, userdir-ldap's
`ud-replicate` leaves a multiplexing SSH process lying
around. Logging into the LDAP server (currently `alberti`) and
killing all the `sshdist` processes will clear those:
This section documents how to do some of those upgrades and restarts
by hand.
## GitLab runner upgrades
Every month or so GitLab publishes an update to the `gitlab-runner` apt
package. The package is excluded from `unattended-upgrades` to avoid any
risk of interrupting long-running CI jobs (e.g. large shadow sims).
The recommended procedure is to go through each CI machine one at a time,
pause all the runners on that single machine, ensure no long-running
shadow sims are being executed, and launch `apt upgrade`. If any regular
CI jobs are running, systemd will wait up to one hour for them to end,
then proceed with the package upgrade.
## cron.service
These are typically services that should be run under `systemd --user`
but are instead started with a `@reboot` cron job.
For this kind of service, reboot the server or ask the service admin
to restart their services themselves. Ideally, such a service should be
converted to a systemd unit; see [this documentation](doc/services).
### ud-replicate special case
Sometimes, userdir-ldap's `ud-replicate` leaves a multiplexing SSH
process lying around and those show up as part of
`cron.service`.
Logging into the LDAP server (currently `alberti`) and killing all the
`sshdist` processes will clear those:
pkill -u sshdist ssh
* `ganeti.service`: typically this is an OpenSSL upgrade that affects
qemu, and restarting ganeti (thankfully) doesn't restart VMs. To
fix this, migrate all VMs to their secondaries and back; see
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
restart](#instance-only-restarts) procedure.
## Ganeti
The `ganeti.service` warning is typically an OpenSSL upgrade that
affects qemu, and restarting ganeti (thankfully) doesn't restart
VMs. To fix this, migrate all VMs to their secondaries and back; see
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only restart](#instance-only-restarts)
procedure.
## Open vSwitch
This is generally the `openvswitch-switch` and `openvswitch-common`
packages, which are blocked from upgrades because of [bug 34185](https://bugs.torproject.org/34185).
* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart
OVS, then migrate the machines back. It's actually easier to just
treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure; see the
[Ganeti reboot procedures](howto/ganeti#rebooting) instead.
To upgrade manually, empty the server, restart OVS, then migrate the
machines back. It's actually easier to just treat this as a "[reboot
the nodes only](howto/ganeti#node-only-reboot)" procedure; see the [Ganeti reboot procedures](howto/ganeti#rebooting)
instead.
Note that this might be fixed in Debian bullseye: [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
Debian is marked as fixed, but the fix will still need to be tested on
our side first. Update: it hasn't been fixed.
- **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
well, so it is blocked. To upgrade, make sure the install device is
defined by running `dpkg-reconfigure grub-pc`. This issue might
actually have been fixed in the package; see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
## Grub
Packages are blocked from upgrades when they cause significant
breakage during an upgrade run, enough to cause an outage and/or
require significant recovery work. This is done through Puppet, in the
`profile::unattended_upgrades` class, in the `blacklist` setting.
`grub-pc` ([bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as well, so
it is blocked. To upgrade, make sure the install device is defined by
running `dpkg-reconfigure grub-pc`. This issue might actually have
been fixed in the package; see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
Packages can be unblocked if and only if:
* the bug is confirmed as fixed in Debian
* the fix is deployed on all servers and confirmed as working
* we have good confidence that future upgrades will not break the
system again
## user@ services
Services set up with the new systemd-based startup system documented in
[doc/services](doc/services) can be restarted with:
[doc/services](doc/services) may not automatically restart. They may be
(manually) restarted with:
systemctl restart user@1504.service
There's a feature request ([bug #843778](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843778)) to implement support for
those services directly in needrestart.
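The number in `user@1504.service` is a numeric user ID: systemd names each per-user service manager `user@UID.service`. The following sketch derives the unit name from an account name instead of hard-coding the UID. The function name `user_unit` is hypothetical, and the account passed to it is only an illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the systemd unit name of the per-user
# service manager for a given account name. Returns non-zero if the
# account does not exist.
user_unit() {
    uid=$(id -u "$1") || return 1
    printf 'user@%s.service\n' "$uid"
}

# root always has UID 0, so this prints "user@0.service".
user_unit root
```

With this in place, the restart could be written as `systemctl restart "$(user_unit someuser)"`, where `someuser` stands for the account that owns the service.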
## Kernel upgrades and reboots
# Reboots
Sometimes it is necessary to perform a reboot on the hosts, when the
kernel is updated. Nagios will warn about this, with something like
......@@ -164,7 +189,10 @@ this:
WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
### Rebooting guests
TODO: the above is the old way; the needrestart check has a different
output. Document it above.
## Rebooting a single host
If this is only a virtual machine, and the only one affected, it can
be rebooted directly. This can be done with the `tsa-misc` script
......@@ -186,6 +214,29 @@ entered at boot time, either through the initramfs (if it has the
the case for the `mandos-01` server itself, for example, as it
currently can't unlock itself, naturally.
## Batch rebooting multiple hosts
IMPORTANT: before following this procedure, make sure that only a
subset of the hosts need a restart. If *all* hosts need a reboot, it's
likely going to be faster and easier to reboot the entire cluster at
once; see the [Ganeti reboot procedures](howto/ganeti#reboot) instead.
LDAP hosts have information about how they can be rebooted, in the
`rebootPolicy` field. Here is what the various values mean:
* `justdoit` - can be rebooted any time, with a 10 minute delay,
possibly in parallel
* `rotation` - part of a cluster where each machine needs to be
rebooted one at a time, with a 30 minute delay for DNS to update
* `manual` - needs to be done by hand or with a special tool (fabric
in the case of Ganeti, reboot-host in the case of KVM, nothing for
Windows boxes)
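The policy values above map naturally onto a small lookup. The following is an illustrative sketch only, not the actual reboot tooling: the function name `reboot_delay_minutes` is hypothetical, and the delays are simply the ones listed above.

```shell
#!/bin/sh
# Hypothetical helper: print the delay (in minutes) associated with a
# rebootPolicy value. "manual" hosts are refused, since they need to be
# handled by hand or with a dedicated tool.
reboot_delay_minutes() {
    case "$1" in
        justdoit) echo 10 ;;
        rotation) echo 30 ;;
        manual)
            echo "manual policy: use fabric (Ganeti) or reboot-host (KVM)" >&2
            return 1
            ;;
        *)
            echo "unknown rebootPolicy: $1" >&2
            return 2
            ;;
    esac
}

for policy in justdoit rotation; do
    echo "$policy: $(reboot_delay_minutes "$policy") minute delay"
done
```

A wrapper script could call this before scheduling each reboot and skip any host for which it returns non-zero. Note that `rotation` hosts also need to be rebooted one at a time, which this lookup does not enforce.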
Therefore, it's possible to selectively reboot some of those hosts in
batches. Again, this is pretty rare: typically, you would either
reboot only a single host or *all* hosts, in which case a cluster-wide
reboot (with Ganeti, below) would be more appropriate.
This routine should be able to reboot all hosts with a `rebootPolicy`
defined to `justdoit` or `rotation`:
......@@ -197,29 +248,17 @@ defined to `justdoit` or `rotation`:
## Rebooting KVM hosts
What remains is the "manual" procedure, for the KVM hosts:
What remains is the "manual" procedure, which covers one last KVM host:
./reboot-host moly.torproject.org
... and Ganeti nodes, below.
## Rebooting Ganeti nodes
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
procedure.
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this procedure.
## Remaining nodes
The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) page will show any remaining hosts
that might have been missed by the above procedure.
### Generic upgrade routines
LDAP hosts have information about how they can be rebooted, in the
`rebootPolicy` field. Here is what the various values mean:
* `justdoit` - can be rebooted any time, with a 10 minute delay,
possibly in parallel
* `rotation` - part of a cluster where each machine needs to be
rebooted one at a time, with a 30 minute delay for DNS to update
* `manual` - needs to be done by hand or with a special tool (fabric
in the case of Ganeti, reboot-host in the case of KVM, nothing for
Windows boxes)