move all ganeti reboot procedures in howto/ganeti authored by anarcat
This merges a bunch of different procedures that had accumulated all
over the place.

It also adds new procedures that were flying around as copy-paste
ideas in IRC channels. It should now be possible to copy-paste from
the wiki instead, which is a slight improvement.

See: #32920
@@ -118,34 +118,15 @@ There are a few scenarios here:
* `ganeti.service`: typically this is an OpenSSL upgrade that affects
qemu, and restarting ganeti (thankfully) doesn't restart VMs. To
fix this, migrate all VMs to their secondaries and back:
./reboot --ganeti-migrate-back -v --kind=cancel --reason 'qemu flagged in needrestart' \
-H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
chi-node-1{0,1}.torproject.org \
fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
fix this, migrate all VMs to their secondaries and back: see the
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
restart](#instance-only-restarts) procedure.
* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart
OVS, then migrate the machines back.
1. on the Ganeti master, list the instances on the Ganeti node:
INSTANCES=$(gnt-instance list -o name --no-headers --filter "pnode == \"$NODE\"")
2. on the Ganeti master, empty the Ganeti node:
gnt-node migrate -f $NODE
3. on the Ganeti node where OVS needs to be upgraded:
service openvswitch-nonetwork.service restart
4. on the Ganeti master, migrate all the instances back:
gnt-instance migrate -f $INSTANCES
the instance list comes from the first step
OVS, then migrate the machines back. It's actually easier to just
treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure; see the
[Ganeti reboot procedures](howto/ganeti#rebooting) instead.
Note that this might be fixed in Debian bullseye: [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
Debian is marked as fixed, but will still need to be tested on our
@@ -223,67 +204,13 @@ The remaining is the "manual" procedure, the KVM hosts:
### Rebooting Ganeti nodes
The ganeti hosts, using Fabric:
./reboot -v --delay-shutdown 1 --delay-hosts 30 -H fsn-node-0{1,2,3,4,5,6,7}.torproject.org
This can be done in parallel across clusters:
./reboot -v --delay-shutdown 1 --delay-hosts 30 -H chi-node-0{1,2,3,4}.torproject.org
This is also documented in the [howto/ganeti](howto/ganeti) section. Do not
forget to rebalance the cluster after the reboot.
### Rebooting Ganeti guests
If you see this in Nagios:
The following processes have libs linked that were upgraded: ganeti14: qemu-system-x86 (41509): ganeti15: qemu-system-x86 (41081): ganeti8: qemu-system-x86 (22106)
... and the Ganeti node itself doesn't need to be restarted, you can
avoid a stressful reboot by just migrating the instances between the
nodes. This will restart the `qemu` processes and complete the
upgrade, while imposing minimal (if any) downtime.
The process here is to do a `gnt-node migrate` on all nodes, which
will empty one node at a time. When that is complete, the cluster
needs to be rebalanced. This is not exactly an "idempotent" process:
you might not end up with exactly the same state as you had in the
beginning, even after rebalancing the cluster.
Make sure you run in a screen session, because this process takes
time:
screen
Then, look at the current state of the cluster:
hbal -L -C -v
Take note of the score and the proposed solution, but do not execute
it. This will give you an idea of how good or bad things are after the
migrate.
Then migrate all guests, for example:
for node in chi-node-0{1,2,3,4}; do gnt-node migrate -f $node; done
Once that is done, all the warnings should be gone from Nagios.
Then rebalance the cluster:
hbal -L -C -v --no-disk-moves
Note that we use `--no-disk-moves` to try to keep the solver from
moving actual disks. Since the `migrate` task above shouldn't have
moved any disk, it should be able to find a solution with a score
similar to the one we started with, without moving disks (which is
an even slower operation).
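The steps above can be chained in a small wrapper. This is only a sketch: the `run` helper, the `DRY_RUN` variable and the `migrate_and_rebalance` function are hypothetical names, not part of Ganeti or of the `reboot` script. It only reuses the `hbal` and `gnt-node migrate` invocations shown above, and defaults to printing the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default),
# run it for real otherwise.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the procedure described above.
# Arguments: the Ganeti nodes to empty, one after the other.
migrate_and_rebalance() {
    # Show the current score and proposed solution (nothing is executed
    # by hbal here), to compare against after the migration.
    run hbal -L -C -v
    for node in "$@"; do
        # Migrate all primary instances off this node; stop at the
        # first failure so a broken node is noticed early.
        run gnt-node migrate -f "$node" || return 1
    done
    # Ask for a rebalancing solution that avoids moving disks.
    run hbal -L -C -v --no-disk-moves
}
```

For example, `migrate_and_rebalance chi-node-01 chi-node-02` prints the four commands that would be run, and `DRY_RUN=0` runs them on the Ganeti master, ideally inside the `screen` session mentioned above.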
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
procedure.
### Remaining nodes
When all hosts are rebooted, see [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) to
confirm.
The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
might have been missed by the above procedure.
#### Generic upgrade routines