describe a software ganeti restart procedure, authored by anarcat
......@@ -211,6 +211,52 @@ This can be done in parallel across clusters:
This is also documented in the [howto/ganeti](howto/ganeti) section. Do not
forget to rebalance the cluster after the reboot.
### Rebooting Ganeti guests
If you see this in Nagios:
    The following processes have libs linked that were upgraded: ganeti14: qemu-system-x86 (41509): ganeti15: qemu-system-x86 (41081): ganeti8: qemu-system-x86 (22106)
... and the Ganeti node itself doesn't need to be restarted, you can
avoid a stressful reboot by just migrating the instances between the
nodes. This will restart the `qemu` processes and complete the
upgrade, while imposing minimal (if any) downtime.
The process here is to do a `gnt-node migrate` on all nodes, which
will empty one node at a time. When that is complete, the cluster
needs to be rebalanced. This is not exactly an "idempotent" process:
you might not end up with exactly the same state as you had in the
beginning, even after rebalancing the cluster.
Make sure you run in a screen session, because this process takes
time:
    screen
Then, look at the current state of the cluster:
    hbal -L -C -v
Take note of the score and the proposed solution, but do not execute
it. This gives you a baseline for judging how good or bad things are
after the migration.
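To compare the score before and after the migration without copying it by hand, the `hbal` report can be saved to a file and the score extracted from it. This is only a sketch, not part of the documented procedure: `hbal_score` and the `hbal-before.txt` filename are hypothetical, and it assumes the report contains a line of the form `Initial score: <number>`, which may differ between Ganeti versions.

```shell
# Hypothetical helper: print the score found in a saved hbal report.
# Assumption: the report has a line like "Initial score: 3.28402043".
hbal_score() {
    awk -F': ' '/^Initial score:/ { print $2; exit }' "$1"
}

# Usage, commented out so the sketch is safe to paste:
# hbal -L -C -v | tee hbal-before.txt
# hbal_score hbal-before.txt
```

Running the same two commands with a different filename after the migration gives a second number to compare against the first.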
Then migrate all guests, for example:
    for node in chi-node-0{1,2,3,4}; do gnt-node migrate -f $node; done
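The one-liner above keeps going even if one migration fails. A more cautious variant could stop at the first failure, so that the remaining nodes are not emptied while the cluster is in an unexpected state. The sketch below is one possible way to do that: the `migrate_nodes` function and the `DRY_RUN` variable are hypothetical names, and it defaults to only printing the commands it would run.

```shell
# Hypothetical wrapper around the migrate loop.
# DRY_RUN=1 (the default here) only prints the commands;
# set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

migrate_nodes() {
    for node in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "gnt-node migrate -f $node"
        else
            # Stop at the first failed migration instead of moving on.
            gnt-node migrate -f "$node" || {
                echo "migration failed on $node, stopping" >&2
                return 1
            }
        fi
    done
}

migrate_nodes chi-node-01 chi-node-02 chi-node-03 chi-node-04
```

Node names are passed as arguments, so the same function can be reused on another cluster by changing the list.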
Once that is done, all the warnings should be gone from Nagios.
Then rebalance the cluster:
    hbal -L -C -v --no-disk-moves
Note that we use `--no-disk-moves` to try to keep the solver from
moving actual disks. Since the `migrate` task above shouldn't have
moved any disk, it should be able to find a solution with a score
similar to the one we started with, without moving disks (which is
an even slower operation).
### Remaining nodes
The scaleway box needs special handholding, see [ticket 32920](https://bugs.torproject.org/32920). The
......
......