Add note about cases caused by ganeti instances

This covers both the easy and the long methods. The long method reduces the
amount of downtime for our VMs but takes longer to carry out, especially
because you have to figure out which instances use a 'plain' disk template and
handle them differently.

Others do upgrade automatically, but require a manual
restart. Normally, [needrestart](https://github.com/liske/needrestart) runs after upgrades and takes care
of restarting services, but it can't actually deal with everything.

Our alert in Alertmanager only shows a sum of how many hosts have pending
restarts. To check the entire fleet and simultaneously discover which hosts are
triggering the alert, run this command in [Fabric](howto/fabric):

    fab fleet.pending-restarts
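
If you would rather bypass Fabric and ask Prometheus directly which hosts are
behind the alert, a query along the following lines can work. This is only a
sketch: the metric name and the server URL are assumptions, so take the real
expression from the alert rule in the prometheus-alerts repository.

    # ASSUMPTIONS: the metric name and Prometheus URL below are placeholders.
    PROM_URL="https://prometheus.example.org"
    QUERY='needrestart_pending_restarts > 0'

    # Query the Prometheus HTTP API and print one "host value" line per result.
    curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
      | jq -r '.data.result[] | "\(.metric.instance) \(.value[1])"'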

If you cannot figure out why the warning happens, you might want to
run `needrestart` on a particular host by hand:

    needrestart -v
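
To only list the services that still need a restart, without restarting
anything, `needrestart` can also be run in batch, list-only mode (a quick
sketch; see the needrestart manual page for the exact output format):

    # -b: batch (machine-readable) output; -r l: list only, restart nothing
    needrestart -b -r l
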
Important notes:
1. Ganeti instance (VM) processes (kvm) might show up as running with an
   outdated library and `needrestart` will try to restart the `ganeti.service`
   unit, but that will not fix the issue. In this situation, you can reboot the
   whole node, which will cause a downtime for all instances on it.

   * An alternative that limits the downtime on instances, but takes longer to
     operate, is to issue a series of instance migrations to their secondaries
     and then back to their primaries (see the sketch after this list).
     However, some instances with disks of type 'plain' cannot be migrated and
     need to be rebooted instead, with
     `gnt-instance stop $instance && gnt-instance start $instance`
     on the cluster's main server (issuing a reboot from within the instance,
     e.g. with the `reboot` fabric script, might not stop the instance's KVM
     process on the Ganeti node, so it is not enough).

2. There's a false alarm that occurs regularly here because there's lag between
   `needrestart` running after upgrades (which is on a `dpkg` post-invoke hook)
   and the metrics updates (which are on a timer running daily and 2 minutes
   after boot).

   If a host is showing up in an alert and the above fabric task says:

       INFO: no host found requiring a restart

   It might be that the timer hasn't run recently enough; you can diagnose
   that with:

       systemctl status tpa-needrestart-prometheus-metrics.timer tpa-needrestart-prometheus-metrics.service

   And, normally, fix it with:

       systemctl start tpa-needrestart-prometheus-metrics.service

   See [issue `prometheus-alerts#20`](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/20)
   to get rid of that false positive.
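
The "longer" approach from note 1 above boils down to something like the
following, run on the Ganeti cluster's main server. This is a sketch under
assumptions: `$instance` names a single affected instance, the migrate commands
prompt for confirmation, and you should double-check the disk template before
acting.

    # Hypothetical variable: name of one instance whose KVM process is holding
    # outdated libraries open.
    instance=example-01

    # Check the disk template first: 'plain' instances cannot be migrated.
    gnt-instance list -o name,disk_template --no-headings "$instance"

    # DRBD instances: migrate to the secondary, then back to the primary.
    # Each migration starts a fresh KVM process on the target node, so the
    # downtime stays minimal.
    gnt-instance migrate "$instance"
    gnt-instance migrate "$instance"

    # 'plain' instances: a full stop/start is needed to get a fresh KVM
    # process (a reboot from inside the guest is not enough).
    gnt-instance stop "$instance" && gnt-instance start "$instance"
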
Packages are blocked from upgrades when they cause significant
breakage during an upgrade run, enough to cause an outage and/or