... | ... | @@ -1263,14 +1263,50 @@ reboots on those machines. The `reboot` script in `tsa-misc` takes |
|
|
care of the special steps involved (which is basically to empty a
node before rebooting it).
|
|
|
|
|
|
Such a reboot should be run interactively, inside a `tmux` or `screen`
session. It currently takes over 15 minutes to complete, depending
on the size of the cluster (in terms of core memory usage).
|
|
|
Such a reboot should be run interactively.
|
|
|
|
|
|
Once the reboot is completed, all instances might end up on a single
machine, and the cluster might need to be rebalanced (see
below). Note: the update script should eventually do that, see [ticket
33406](https://bugs.torproject.org/33406).
|
|
|
### Full fleet reboot
|
|
|
|
|
|
This command will reboot the entire Ganeti fleet, including the
hosted VMs. Use this when (for example) you have kernel upgrades to
deploy everywhere:
|
|
|
|
|
|
    ./reboot --skip-ganeti-empty -v --reason 'qemu flagged in needrestart' \
        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
           chi-node-1{0,1}.torproject.org \
           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
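Before launching a fleet-wide run, it can help to preview which hosts the brace patterns expand to. This is only a sketch: it assumes a `bash` shell (brace expansion is not available in plain POSIX `sh`), and the `fleet_hosts` function name is made up for illustration; it is not part of `tsa-misc`.

```shell
# Hypothetical helper (not part of tsa-misc): print every host matched
# by the brace patterns used in the command above, one per line.
# Requires bash, since it relies on brace expansion.
fleet_hosts() {
    printf '%s\n' \
        chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
        chi-node-1{0,1}.torproject.org \
        fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
}

fleet_hosts
```

Comparing that list against the nodes actually in the cluster is a cheap way to catch a node that was added or retired since the pattern was written.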
|
|
|
|
|
|
This is long and rather disruptive. Notifications should be posted on
IRC, in `#tor-project`, as instances are rebooted.
|
|
|
|
|
|
It can take about a day to complete a full fleet-wide reboot.
|
|
|
|
|
|
### Node-only reboot
|
|
|
|
|
|
In certain cases (Open vSwitch restarts, for example), only the nodes
need a reboot, and not the instances. In that case, migrate the
instances off each node before rebooting it, then migrate them back
when done. This incantation should do so:
|
|
|
|
|
|
    ./reboot --ganeti-migrate-back -v --reason 'Open vSwitch upgrade' \
        -H fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
|
|
|
|
|
|
This should cause no user-visible disruption.
|
|
|
|
|
|
### Instance-only restarts
|
|
|
|
|
|
An alternative procedure should be used if only the `ganeti.service`
requires a restart. This happens when a Qemu dependency has been
upgraded, for example `libxml` or OpenSSL.
|
|
|
|
|
|
This will only migrate the VMs without rebooting the hosts:
|
|
|
|
|
|
    ./reboot --ganeti-migrate-back --kind=cancel -v --reason 'qemu flagged in needrestart' \
        -H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
           chi-node-1{0,1}.torproject.org \
           fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
|
|
|
|
|
|
This should cause no user-visible disruption.
|
|
|
|
|
|
### Slow disk sync after rebooting/Broken migrate-back
|
|
|
|
... | ... | @@ -2175,6 +2211,49 @@ The [move-instance](https://docs.ganeti.org/docs/ganeti/3.0/html/move-instance.h |
|
|
|
|
|
TODO: document mass cluster migrations.
|
|
|
|
|
|
### Reboot procedures
|
|
|
|
|
|
If you get this email from Nagios:
|
|
|
|
|
|
Subject: ** PROBLEM Service Alert: chi-node-01/needrestart is WARNING **
|
|
|
|
|
|
... and in the detailed results, you see:
|
|
|
|
|
|
    WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none
    Services:
    - ganeti.service
|
|
|
|
|
|
You can try to make `needrestart` fix Ganeti by hand:
|
|
|
|
|
|
    root@chi-node-01:~# needrestart
    Scanning processes...
    Scanning candidates...
    Scanning processor microcode...
    Scanning linux images...

    Running kernel seems to be up-to-date.

    The processor microcode seems to be up-to-date.

    Restarting services...
     systemctl restart ganeti.service

    No containers need to be restarted.

    No user sessions are running outdated binaries.
    root@chi-node-01:~#
|
|
|
|
|
|
... but it's actually likely this didn't fix anything. A rerun will
yield the same result.
|
|
|
|
|
|
That is likely because the virtual machines, each running inside a
`qemu` process, need a restart. This can be fixed by rebooting the
entire host if it needs a reboot or, if it doesn't, by just migrating
the VMs around.
|
|
|
|
|
|
See the [Ganeti reboot procedures](#rebooting) for how to proceed from
here on. This is likely a case of an [Instance-only restart](#instance-only-restarts).
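The choice between a host reboot and an instance-only restart can be sketched as a small shell helper that reads the status line of the check. This is an illustration only: the `suggest_procedure` name is hypothetical, and it assumes that the `Kernel:` field carries a `(!)` or `(!!)` marker when a newer kernel is pending, in the same way the `Services:` field does in the sample above.

```shell
# Hypothetical helper: suggest a procedure from the status line of the
# needrestart check. Assumption: the "Kernel:" field contains "(!)" or
# "(!!)" when the running kernel is outdated.
suggest_procedure() {
    status_line=$1
    # Extract the "Kernel: ..." field, up to the next comma.
    kernel_field=$(printf '%s\n' "$status_line" \
        | sed -n 's/.*Kernel: \([^,]*\),.*/\1/p')
    case "$kernel_field" in
        *'(!'*) echo "reboot" ;;                 # kernel flagged: reboot the host
        *)      echo "instance-only restart" ;;  # kernel current: migrate the VMs
    esac
}

suggest_procedure 'WARN - Kernel: 5.10.0-19-amd64, Microcode: CURRENT, Services: 1 (!), Containers: none, Sessions: none'
```

For the sample alert above, where the kernel field carries no marker, the helper suggests an instance-only restart, which matches the conclusion in the text.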
|
|
|
|
|
|
## Disaster recovery
|
|
|
|
|
|
If things get completely out of hand and the cluster becomes too
|