I reviewed the entire page to make sure it flows correctly. I mostly worked on splitting the per-service procedures into separate sections instead of one long list, but also reworked the headings into better categories that will look better in the TOC. I also added a warning about individual reboots, because the more common use case is fleet-wide, or at least cluster-wide, reboots. See: tpo/tpa/team#33406

anarcat · 2607e1cd
--- a/howto/upgrades.md
+++ b/howto/upgrades.md
 [[_TOC_]]
 # Major upgrades
 Major upgrades are done by hand, with a "cheat sheet" created for each
@@ -68,25 +67,18 @@ block certain upgrades. If you want to bypass that, use regular `apt`:
    cumin -b 10  '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
-## GitLab runner upgrades
+# Special cases and manual restarts
-Every month or so GitLab publishes a update to the `gitlab-runner` apt
-package. The package is excluded from `unattended-upgrades` to avoid any
-risk of interrupting long-running CI jobs (eg. large shadow sims).
-The recommended procedure is to go through each CI machine one at a time,
+The above covers all upgrades that are automatically applied, but some
-pause all the runners on that single machine, ensure no long-running
+are blocked from automation and require manual intervention.
-shadow sims are being executed, and launch `apt upgrade`. If any regular
-CI jobs are running, systemd will wait up to one hour for them to end,
-then proceed with the package upgrade.
-## Restarting services by hand
+Others do upgrade automatically, but require a manual
+restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
+of restarting services, but it can't actually deal with everything. 
-After upgrades, there's a Nagios check that might trigger and tell you
+There is a Nagios check that might trigger and tell you that some
-that some services are running with outdated libraries. Normally,
+services are running with outdated libraries. You may see a warning
-[needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care of restarting
+like:
-services, but it can't actually deal with everything. In Nagios, you
-will see a warning like:
    [web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
@@ -102,40 +94,6 @@ run the check by hand:
    needrestart -v
-There are a few scenarios here:
- * `cron.service`: typically services that should run under `systemd
-   --user`, reboot the box or ask the service admin to restart their
-   services
- * `cron.service`, special case: sometimes, userdir-ldap's
-   `ud-replicate` leaves a multiplexing SSH process lying
-   around. logging into the LDAP server (currently `alberti`) and
-   killing all the `sshdist` process will clear those:
-        pkill -u sshdist ssh
- * `ganeti.service`: typically this is an OpenSSL upgrade that affects
-   qemu, and restarting ganeti (thankfully) doesn't restart VMs. to
-   fix this, migrate all VMs to their secondaries and back, see
-   [Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
-   restart](#instance-only-restarts) procedure.
- * **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
-   [bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart,
-   OVS, then migrate the machines back. It's actually easier to just
-   treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure, see the
-   [Ganeti reboot procedures](howto/ganeti#rebooting) instead.
-   Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
-   Debian is marked as fixed, but will still need to be tested on our
-   side first. Update: it hasn't been fixed.
- - **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
-   well, so it is blocked. to upgrade, make sure the install device is
-   defined, by running `dpkg-reconfigure grub-pc`. this issue might
-   actually have been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
 Packages are blocked from upgrades when they cause significant
 breakage during an upgrade run, enough to cause an outage and/or
 require significant recovery work. This is done through Puppet, in the
@@ -148,15 +106,82 @@ Packages can be unblocked if and only if:
 * we have good confidence that future upgrades will not break the
   system again
+This section documents how to do some of those upgrades and restarts
+by hand.
+## GitLab runner upgrades
+Every month or so GitLab publishes an update to the `gitlab-runner` apt
+package. The package is excluded from `unattended-upgrades` to avoid any
+risk of interrupting long-running CI jobs (e.g. large shadow sims).
+The recommended procedure is to go through each CI machine one at a time,
+pause all the runners on that single machine, ensure no long-running
+shadow sims are being executed, and launch `apt upgrade`. If any regular
+CI jobs are running, systemd will wait up to one hour for them to end,
+then proceed with the package upgrade.
+## cron.service
+These are typically services that should be run under `systemd --user`
+but instead are started with a `@reboot` cron job.
+For this kind of service, reboot the server or ask the service admin
+to restart their services themselves. Ideally, this service should be
+converted to a systemd unit, see [this documentation](doc/services).
+### ud-replicate special case
+Sometimes, userdir-ldap's `ud-replicate` leaves a multiplexing SSH
+process lying around and those show up as part of
+`cron.service`. 
+Logging into the LDAP server (currently `alberti`) and killing all the
+`sshdist` processes will clear those:
+   pkill -u sshdist ssh
+## Ganeti
+The `ganeti.service` warning is typically an OpenSSL upgrade that
+affects qemu, and restarting ganeti (thankfully) doesn't restart
+VMs. To fix this, migrate all VMs to their secondaries and back, see
+[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only restart](#instance-only-restarts)
+procedure.
+## Open vSwitch
+This is generally the `openvswitch-switch` and `openvswitch-common`
+packages, which are blocked from upgrades because of [bug 34185](https://bugs.torproject.org/34185).
+To upgrade manually, empty the server, restart OVS, then migrate the
+machines back. It's actually easier to just treat this as a "[reboot
+the nodes only](howto/ganeti#node-only-reboot)" procedure, see the [Ganeti reboot procedures](howto/ganeti#rebooting)
+instead.
+Note that this might be fixed in Debian bullseye: [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
+Debian is marked as fixed, but will still need to be tested on our
+side first. Update: it hasn't been fixed.
+## Grub
+`grub-pc` ([bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as well, so
+it is blocked. To upgrade, make sure the install device is defined, by
+running `dpkg-reconfigure grub-pc`. This issue might actually have
+been fixed in the package, see [issue 40185](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40185).
+## user@ services
 Services setup with the new systemd-based startup system documented in
-[doc/services](doc/services) can be restarted with:
+[doc/services](doc/services) may not automatically restart. They may be
+(manually) restarted with:
    systemctl restart user@1504.service
 There's a feature request ([bug #843778](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=843778)) to implement support for
 those services directly in needrestart.
-## Kernel upgrades and reboots
+# Reboots
 Sometimes it is necessary to perform a reboot on the hosts, when the
 kernel is updated. Nagios will warn about this, with something like
@@ -164,7 +189,10 @@ this:
    WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
-### Rebooting guests
+TODO: the above is the old way, the needrestart check has a different
+output. document it above.
+## Rebooting a single host
 If this is only a virtual machine, and the only one affected, it can
 be rebooted directly. This can be done with the `tsa-misc` script
@@ -186,6 +214,29 @@ entered at boot time, either through the initramfs (if it has the
 the case for the `mandos-01` server itself, for example, as it
 currently can't unlock itself, naturally.
+## Batch rebooting multiple hosts
+IMPORTANT: before following this procedure, make sure that only a
+subset of the hosts need a restart. If *all* hosts need a reboot, it's
+likely going to be faster and easier to reboot the entire clusters at
+once, see the [Ganeti reboot procedures](howto/ganeti#rebooting) instead.
+LDAP hosts have information about how they can be rebooted, in the
+`rebootPolicy` field. Here is what the various values mean:
+ * `justdoit` - can be rebooted any time, with a 10 minute delay,
+   possibly in parallel
+ * `rotation` - part of a cluster where each machine needs to be
+   rebooted one at a time, with a 30 minute delay for DNS to update
+ * `manual` - needs to be done by hand or with a special tool (fabric
+   in case of ganeti, reboot-host in the case of KVM, nothing for
+   windows boxes)
+Therefore, it's possible to selectively reboot some of those hosts in
+batches. Again, this is pretty rare: typically, you would either
+reboot only a single host or *all* hosts, in which case a cluster-wide
+reboot (with Ganeti, below) would be more appropriate.
 This routine should be able to reboot all hosts with a `rebootPolicy`
 defined to `justdoit` or `rotation`:
@@ -197,29 +248,17 @@ defined to `justdoit` or `rotation`:
 ## Rebooting KVM hosts
-The remaining is the "manual" procedure, the KVM hosts:
+What remains is the "manual" procedure, which includes one last KVM host:
    ./reboot-host moly.torproject.org
+... and Ganeti nodes, below.
 ## Rebooting Ganeti nodes
-See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
+See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this procedure.
-procedure.
 ## Remaining nodes
 The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
 might have been missed by the above procedure..
-### Generic upgrade routines
-LDAP hosts have information about how they can be rebooted, in the
-`rebootPolicy` field. Here are what the various fields mean:
- * `justdoit` - can be rebooted any time, with a 10 minute delay,
-   possibly in parallel
- * `rotation` - part of a cluster where each machine needs to be
-   rebooted one at a time, with a 30 minute delay for DNS to update
- * `manual` - needs to be done by hand or with a special tool (fabric
-   in case of ganeti, reboot-host in the case of KVM, nothing for
-   windows boxes)
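
The `rebootPolicy` values described in this change lend themselves to a small planning step. The following is only a sketch of that idea, not the reboot tooling from `tsa-misc`: the `plan_reboots` function, the `Step` type and the input format (a plain hostname-to-policy mapping) are all hypothetical. It assumes `justdoit` hosts can go in one parallel batch with a 10 minute delay, that `rotation` hosts are rebooted strictly one at a time (ignoring which cluster they belong to) with a 30 minute delay, and that anything else is left for the manual procedures.

```python
from dataclasses import dataclass

# Delays (in minutes) come from the policy descriptions in the documentation;
# the names and structure below are illustrative assumptions.
DELAY_MINUTES = {"justdoit": 10, "rotation": 30}


@dataclass(frozen=True)
class Step:
    hosts: tuple        # hosts rebooted together in this step
    delay_minutes: int  # delay associated with this step


def plan_reboots(policies):
    """Turn a {hostname: rebootPolicy} mapping into (steps, manual_hosts)."""
    steps = []

    # justdoit: can be rebooted any time, possibly in parallel.
    parallel = sorted(h for h, p in policies.items() if p == "justdoit")
    if parallel:
        steps.append(Step(tuple(parallel), DELAY_MINUTES["justdoit"]))

    # rotation: one machine at a time, leaving time for DNS to update.
    for host in sorted(h for h, p in policies.items() if p == "rotation"):
        steps.append(Step((host,), DELAY_MINUTES["rotation"]))

    # manual (and any unknown policy): never scheduled automatically.
    manual = sorted(h for h, p in policies.items() if p not in DELAY_MINUTES)
    return steps, manual
```

Calling `plan_reboots({"web-01": "justdoit", "dns-01": "rotation", "moly": "manual"})` (host names chosen for illustration) would return one parallel step, one single-host rotation step, and `moly` in the manual list. Treating unknown policies as manual is a deliberately conservative choice in this sketch.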