This merges a bunch of different procedures that had accumulated all over the place. It also adds new procedures that were flying around as copy-paste ideas in IRC channels. It should now be possible to copy-paste from the wiki instead, which is a slight improvement. See: #32920

anarcat · a2957440
--- a/howto/upgrades.md
+++ b/howto/upgrades.md
@@ -118,38 +118,19 @@ There are a few scenarios here:
 * `ganeti.service`: typically this is an OpenSSL upgrade that affects
   qemu, and restarting ganeti (thankfully) doesn't restart VMs. to
-   fix this, migrate all VMs to their secondaries and back:
-        ./reboot --ganeti-migrate-back -v --kind=cancel --reason 'qemu flagged in needrestart' \
-          -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
-             chi-node-1{0,1}.torproject.org \
-             fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
+   fix this, migrate all VMs to their secondaries and back, see
+   [Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
+   restart](#instance-only-restarts) procedure.
 * **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
   [bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart,
-   OVS, then migrate the machines back.
-   1. on the Ganeti master, list the instances on the Ganeti node:
-        INSTANCES=$(gnt-instance list -o name --no-headers --filter "pnode == \"$NODE\"")
-   2. on the Ganeti master, empty the Ganeti node:
-        gnt-node migrate -f $NODE
-   2. on the Ganeti node where OVS needs to be upgraded:
-        service openvswitch-nonetwork.service restart
-   3. on the Ganeti master, migrate all the instances back:
-        gnt-instance migrate -f $INSTANCES
-      the instance list comes from the first step
-  Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
-  Debian is marked as fixed, but will still need to be tested on our
-  side first. Update: it hasn't been fixed.
+   OVS, then migrate the machines back. It's actually easier to just
+   treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure, see the
+   [Ganeti reboot procedures](howto/ganeti#rebooting) instead.
+   Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
+   Debian is marked as fixed, but will still need to be tested on our
+   side first. Update: it hasn't been fixed.
 - **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
   well, so it is blocked. to upgrade, make sure the install device is
@@ -223,67 +204,13 @@ The remaining is the "manual" procedure, the KVM hosts:
 ### Rebooting Ganeti nodes
-The ganeti hosts, using Fabric:
-    ./reboot -v --delay-shutdown 1 --delay-hosts 30 -H fsn-node-0{1,2,3,4,5,6,7}.torproject.org
-This can be done in parallel across clusters:
-    ./reboot -v --delay-shutdown 1 --delay-hosts 30 -H chi-node-0{1,2,3,4}.torproject.org
-This is also documented in the [howto/ganeti](howto/ganeti) section. Do not
-forget to rebalance the cluster after the reboot.
-### Rebooting Ganeti guests
-If you see this in Nagios:
-    The following processes have libs linked that were upgraded: ganeti14: qemu-system-x86 (41509): ganeti15: qemu-system-x86 (41081): ganeti8: qemu-system-x86 (22106)
-... and the Ganeti node itself doesn't need to be restarted, you can
-see a stressful reboot by just migrating the instances between the
-nodes. This will restart the `qemu` processes and complete the
-upgrade, while imposing minimal (if any) downtime.
-The process here is to do a `gnt-node migrate` on all nodes, which
-will empty one node at a time. When that is complete, the cluster
-needs to be rebalanced. This is not exactly an "idempotent" process:
-you might not end up with exactly the same state as you had in the
-beginning, even after rebalancing the cluster.
-Make sure you run in a screen session, because this process takes
-time:
-    screen
-Then, look at the current state of the cluster:
-    hbal -L -C -v
-Take note of the score and the proposed solution, but do not execute
-it. This will give you an idea of how good or bad things are after the
-migrate.
-Then migrate all guests, for example:
-    for node in chi-node-0{1,2,3,4}; do gnt-node migrate -f $node; done
-Once that is done, all the warnings should be gone from Nagios.
-Then rebalance the cluster:
-    hbal -L -C -v --no-disk-moves
-Note that we use `--no-disk-moves` to try to keep the solver from
-moving actual disks. Since the `migrate` task above shouldn't have
-moved any disk, it should be able to find a solution with a score
-similar than the one we started with, without moving disks (which is
-an even slower operation).
+See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
+procedure.
 ### Remaining nodes
-When all hosts are rebooted, see [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) to
-confirm.
+The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
+might have been missed by the above procedure..
 #### Generic upgrade routines