Changes

This merges a bunch of different procedures that had accumulated all over the place. It also adds new procedures that were flying around as copy-paste ideas in IRC channels. It should now be possible to copy-paste from the wiki instead, which is a slight improvement. See: #32920
anarcat · a2957440
--- a/howto/ganeti.md
+++ b/howto/ganeti.md
@@ -1263,14 +1263,50 @@ reboots on those machines. The `reboot` script in `tsa-misc` takes
 care of the special steps involved (which is basically to empty a
 node before rebooting it).

-Such a reboot should be ran interactively, inside a `tmux` or `screen`
-session, and takes over 15 minutes to complete right now, but depends
-on the size of the cluster (in terms of core memory usage).
+Such a reboot should be run interactively.

-Once the reboot is completed, all instances might end up on a single
-machine, and the cluster might need to be rebalanced, see
-below. (Note: the update script should eventually do that, see [ticket
-33406](https://bugs.torproject.org/33406)).
+### Full fleet reboot
+
+This command will reboot the entire Ganeti fleet, including the
+hosted VMs. Use this when, for example, you have kernel upgrades to
+deploy everywhere:
+
+    ./reboot --skip-ganeti-empty -v --reason 'qemu flagged in needrestart' \
+        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
+           chi-node-1{0,1}.torproject.org \
+           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
+
+This is long and rather disruptive. Notifications should be posted on
+IRC, in `#tor-project`, as instances are rebooted.
+
+It can take about a day to complete a full fleet-wide reboot.
+
+### Node-only reboot
+
+In certain cases (Open vSwitch restarts, for example), only the nodes
+need a reboot, and not the instances. In that case, migrate the
+instances off each node before rebooting it, then migrate them back
+when done. This incantation should do so:
+
+    ./reboot --ganeti-migrate-back -v --reason 'Open vSwitch upgrade' \
+        -H fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
+
+This should cause no user-visible disruption.
+
+### Instance-only restarts
+
+An alternative procedure should be used if only the `ganeti.service`
+requires a restart. This happens when a Qemu dependency has been
+upgraded, for example `libxml` or OpenSSL.
+
+This will only migrate the VMs without rebooting the hosts:
+
+    ./reboot --ganeti-migrate-back --kind=cancel -v --reason 'qemu flagged in needrestart' \
+        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
+           chi-node-1{0,1}.torproject.org \
+           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
+
+This should cause no user-visible disruption.

 ### Slow disk sync after rebooting/Broken migrate-back

@@ -2175,6 +2211,49 @@ The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.h

 TODO: document mass cluster migrations.

+### Reboot procedures
+
+If you get this email from Nagios:
+
+    Subject: ** PROBLEM Service Alert: chi-node-01/needrestart is WARNING **
+
+... and in the detailed results, you see:
+
+    WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none
+    Services:
+    - ganeti.service
+
+You can try to make `needrestart` fix Ganeti by hand:
+
+    root@chi-node-01:~# needrestart
+    Scanning processes...
+    Scanning candidates...
+    Scanning processor microcode...
+    Scanning linux images...
+
+    Running kernel seems to be up-to-date.
+
+    The processor microcode seems to be up-to-date.
+
+    Restarting services...
+     systemctl restart ganeti.service
+
+    No containers need to be restarted.
+
+    No user sessions are running outdated binaries.
+    root@chi-node-01:~#
+
+... but it's actually likely this didn't fix anything. A rerun will
+yield the same result.
+
+That is likely because the virtual machines, each running inside a
+`qemu` process, need a restart. This can be fixed by rebooting the
+entire host, if it needs a reboot, or, if it doesn't, by just
+migrating the VMs around.
+
+See the [Ganeti reboot procedures](#rebooting) for how to proceed from
+here on. This is likely a case of an [Instance-only restart](#instance-only-restarts).
+
 ## Disaster recovery

 If things get completely out of hand and the cluster becomes too