... | @@ -118,38 +118,19 @@ There are a few scenarios here: |
... | @@ -118,38 +118,19 @@ There are a few scenarios here: |
|
|
|
|
|
* `ganeti.service`: typically this is an OpenSSL upgrade that affects
|
|
* `ganeti.service`: typically this is an OpenSSL upgrade that affects
|
|
qemu, and restarting ganeti (thankfully) doesn't restart VMs. To
|
|
qemu, and restarting ganeti (thankfully) doesn't restart VMs. To
|
|
fix this, migrate all VMs to their secondaries and back:
|
|
fix this, migrate all VMs to their secondaries and back, see
|
|
|
|
[Ganeti reboot procedures](howto/ganeti#rebooting), possibly the [instance-only
|
|
./reboot --ganeti-migrate-back -v --kind=cancel --reason 'qemu flagged in needrestart' \
|
|
restart](#instance-only-restarts) procedure.
|
|
-H chi-node-0{1,2,3,4,5,6,7,8,9}.torproject.org \
|
|
|
|
chi-node-1{0,1}.torproject.org \
|
|
|
|
fsn-node-0{1,2,3,4,5,6,7,8}.torproject.org
|
|
|
|
|
|
|
|
* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
|
|
* **Open vSwitch** (`openvswitch-switch` and `openvswitch-common`,
|
|
[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart
|
|
[bug 34185](https://bugs.torproject.org/34185)): to upgrade manually, empty the server, restart
|
|
OVS, then migrate the machines back.
|
|
OVS, then migrate the machines back. It's actually easier to just
|
|
|
|
treat this as a "[reboot the nodes only](howto/ganeti#node-only-reboot)" procedure, see the
|
|
1. on the Ganeti master, list the instances on the Ganeti node:
|
|
[Ganeti reboot procedures](howto/ganeti#rebooting) instead.
|
|
|
|
|
|
INSTANCES=$(gnt-instance list -o name --no-headers --filter "pnode == \"$NODE\"")
|
|
|
|
|
|
|
|
2. on the Ganeti master, empty the Ganeti node:
|
|
|
|
|
|
|
|
gnt-node migrate -f $NODE
|
|
|
|
|
|
|
|
3. on the Ganeti node where OVS needs to be upgraded:
|
|
|
|
|
|
|
|
service openvswitch-nonetwork.service restart
|
|
|
|
|
|
|
|
4. on the Ganeti master, migrate all the instances back:
|
|
|
|
|
|
|
|
gnt-instance migrate -f $INSTANCES
|
|
|
|
|
|
|
|
the instance list comes from the first step
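
The manual Open vSwitch steps listed above can be strung together roughly as follows. This is only a sketch under assumptions: `DRY_RUN`, `run` and `ovs_manual_restart` are hypothetical names, and running the restart over `ssh` from the Ganeti master is an assumption (the steps only say to run it on the node). By default it only prints the commands.

```shell
# Sketch only: DRY_RUN, run and ovs_manual_restart are hypothetical names,
# not part of Ganeti or Open vSwitch.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

ovs_manual_restart() {
    node="$1"
    # List the instances whose primary node is $node, before emptying it.
    if [ "$DRY_RUN" = "0" ]; then
        instances=$(gnt-instance list -o name --no-headers --filter "pnode == \"$node\"")
    else
        instances="INSTANCE_LIST_PLACEHOLDER"
    fi
    # Empty the node.
    run gnt-node migrate -f "$node"
    # Restart OVS on the node (ssh from the master is an assumption).
    run ssh "$node" service openvswitch-nonetwork.service restart
    # Migrate the instances back; $instances is left unquoted on purpose
    # so that each instance name becomes its own argument.
    run gnt-instance migrate -f $instances
}
```

Calling `ovs_manual_restart chi-node-01.torproject.org` on the Ganeti master prints the three commands; setting `DRY_RUN=0` first would actually run them.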
|
|
Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
|
|
|
|
Debian is marked as fixed, but will still need to be tested on our
|
|
Note that this might be fixed in Debian bullseye, [bug 961746](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=961746) in
|
|
side first. Update: it hasn't been fixed.
|
|
Debian is marked as fixed, but will still need to be tested on our
|
|
|
|
side first. Update: it hasn't been fixed.
|
|
|
|
|
|
|
|
- **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
|
|
- **Grub** (`grub-pc`, [bug 40042](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40042)) has been known to have issues as
|
|
well, so it is blocked. To upgrade, make sure the install device is
|
|
well, so it is blocked. To upgrade, make sure the install device is
|
... | @@ -223,67 +204,13 @@ The remaining is the "manual" procedure, the KVM hosts: |
... | @@ -223,67 +204,13 @@ The remaining is the "manual" procedure, the KVM hosts: |
|
|
|
|
|
### Rebooting Ganeti nodes
|
|
### Rebooting Ganeti nodes
|
|
|
|
|
|
The ganeti hosts, using Fabric:
|
|
See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
|
|
|
|
procedure.
|
|
./reboot -v --delay-shutdown 1 --delay-hosts 30 -H fsn-node-0{1,2,3,4,5,6,7}.torproject.org
|
|
|
|
|
|
|
|
This can be done in parallel across clusters:
|
|
|
|
|
|
|
|
./reboot -v --delay-shutdown 1 --delay-hosts 30 -H chi-node-0{1,2,3,4}.torproject.org
|
|
|
|
|
|
|
|
This is also documented in the [howto/ganeti](howto/ganeti) section. Do not
|
|
|
|
forget to rebalance the cluster after the reboot.
|
|
|
|
|
|
|
|
### Rebooting Ganeti guests
|
|
|
|
|
|
|
|
If you see this in Nagios:
|
|
|
|
|
|
|
|
The following processes have libs linked that were upgraded: ganeti14: qemu-system-x86 (41509): ganeti15: qemu-system-x86 (41081): ganeti8: qemu-system-x86 (22106)
|
|
|
|
|
|
|
|
... and the Ganeti node itself doesn't need to be restarted, you can
|
|
|
|
avoid a stressful reboot by just migrating the instances between the
|
|
|
|
nodes. This will restart the `qemu` processes and complete the
|
|
|
|
upgrade, while imposing minimal (if any) downtime.
|
|
|
|
|
|
|
|
The process here is to do a `gnt-node migrate` on all nodes, which
|
|
|
|
will empty one node at a time. When that is complete, the cluster
|
|
|
|
needs to be rebalanced. This is not exactly an "idempotent" process:
|
|
|
|
you might not end up with exactly the same state as you had in the
|
|
|
|
beginning, even after rebalancing the cluster.
|
|
|
|
|
|
|
|
Make sure you run in a screen session, because this process takes
|
|
|
|
time:
|
|
|
|
|
|
|
|
screen
|
|
|
|
|
|
|
|
Then, look at the current state of the cluster:
|
|
|
|
|
|
|
|
hbal -L -C -v
|
|
|
|
|
|
|
|
Take note of the score and the proposed solution, but do not execute
|
|
|
|
it. This will give you an idea of how good or bad things are after the
|
|
|
|
migrate.
|
|
|
|
|
|
|
|
Then migrate all guests, for example:
|
|
|
|
|
|
|
|
for node in chi-node-0{1,2,3,4}; do gnt-node migrate -f $node; done
|
|
|
|
|
|
|
|
Once that is done, all the warnings should be gone from Nagios.
|
|
|
|
|
|
|
|
Then rebalance the cluster:
|
|
|
|
|
|
|
|
hbal -L -C -v --no-disk-moves
|
|
|
|
|
|
|
|
Note that we use `--no-disk-moves` to try to keep the solver from
|
|
|
|
moving actual disks. Since the `migrate` task above shouldn't have
|
|
|
|
moved any disk, it should be able to find a solution with a score
|
|
|
|
similar to the one we started with, without moving disks (which is
|
|
|
|
an even slower operation).
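
The guest migration sequence described in this section could be wrapped in a small helper like the one below. It is a sketch, not an established tool: `DRY_RUN`, `run` and `migrate_all_guests` are hypothetical names, and the commands themselves are the ones quoted in this section. By default it only prints what it would run, which is a reasonable first pass inside the `screen` session.

```shell
# Sketch only: DRY_RUN, run and migrate_all_guests are hypothetical names,
# not part of Ganeti.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

migrate_all_guests() {
    # Arguments: the Ganeti nodes to empty, one at a time.
    run hbal -L -C -v                     # note the score before migrating
    for node in "$@"; do
        run gnt-node migrate -f "$node"   # empty this node
    done
    run hbal -L -C -v --no-disk-moves     # compare the score afterwards
}
```

For example, `migrate_all_guests chi-node-0{1,2,3,4}` prints the two `hbal` calls around four `gnt-node migrate` calls, in the same order as the steps above.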
|
|
|
|
|
|
|
|
### Remaining nodes
|
|
### Remaining nodes
|
|
|
|
|
|
When all hosts are rebooted, see [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) to
|
|
The [Nagios unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems) will show remaining hosts that
|
|
confirm.
|
|
might have been missed by the above procedure.
|
|
|
|
|
|
#### Generic upgrade routines
|
|
#### Generic upgrade routines
|
|
|
|
|
... | | ... | |