it directly:

    LV Tags
    originstname+bacula-director-01.torproject.org

### Node failure

Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by

WARNING: the following procedure should be considered a LAST
RESORT. In the vast majority of cases, it is simpler and less risky to
just restart the node using a remote power cycle to restore the
service than risking a split brain scenario, which this procedure can
cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from
actually returns on its own without you being able to stop it, it
*may* run those DRBD disks and virtual machines, and you *may* end
up in a split brain scenario.

If, say, `fsn-node-07` completely fails and you need to restore
service to the virtual machines running on that server, you can fail
over to the secondaries. Before you do, however, you need to be
completely confident it is not still running in parallel, which could
lead to a "split brain" scenario. For that, cut the power to the
machine using out-of-band management (e.g. on Hetzner, power down the
machine through the Hetzner Robot; on Cymru, use the iDRAC to cut the
power to the main board).
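
Where the out-of-band interface speaks IPMI (an iDRAC does, when
IPMI-over-LAN is enabled; the Hetzner Robot does not), the power cut
can also be done from a shell. This is only a sketch: the BMC
hostname and credentials below are placeholders, not real values:

    # hypothetical BMC address and credentials, adjust to the real
    # out-of-band interface of the failed node
    ipmitool -I lanplus -H fsn-node-07.oob.example.com -U admin -P "$IPMI_PASSWORD" \
      chassis power off
    # confirm the chassis is actually off before going further
    ipmitool -I lanplus -H fsn-node-07.oob.example.com -U admin -P "$IPMI_PASSWORD" \
      chassis power status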

Once the machine is powered down, instruct Ganeti to stop using it
altogether:

    gnt-node modify --offline=yes fsn-node-07
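
To double-check that Ganeti really considers the node gone, the node
list should now show it flagged as offline (a sketch; the exact
columns vary between Ganeti versions):

    gnt-node list
    # or just the affected node, with explicit fields
    gnt-node list -o name,offline fsn-node-07.torproject.org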

Then, once the machine is offline and Ganeti also agrees, switch all
the instances on that node to their secondaries:

    gnt-node failover fsn-node-07.torproject.org
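
Once the failover completes, it is worth confirming that no instance
still lists the dead node as its primary. A possible check, assuming
the standard `gnt-instance list` field names:

    gnt-instance list -o name,pnode,snodes,status | grep fsn-node-07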

It's possible that you need `--ignore-consistency`, but this has caused
trouble in the past (see [40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). In any case, it is [not used at
the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for example: they explicitly say they never needed the
flag.

Note that it will still try to connect to the failed node to shut down
the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed
server is repaired and restarts, it will contact the master to ask for
instances to start. Since the instances have been migrated to other
machines, none will be started and there *should* not be any
inconsistencies.

Once the machine is up and running and you are confident you do not
have a split brain scenario, you can re-add the machine to the cluster
with:

    gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster, because you now have an
empty node which could (hopefully) be reused. It might, obviously, be
worth exploring the root cause of the failure before re-adding the
machine to the cluster.
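
Rebalancing is typically done with the `hbal` tool from
`ganeti-htools`; a minimal sketch, run on the master node (`-L` talks
to the local Luxi socket, `-X` actually executes the moves):

    # dry run: only print the planned moves
    hbal -L
    # then execute the rebalancing for real
    hbal -L -X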

Recoveries could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in

### Master node failure

TODO: expand documentation on master node failure recovery.

### Split brain recovery

A split brain occurred during a partial failure, failover, then
unexpected recovery of `fsn-node-07` ([issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). It might
occur in other scenarios, but this section documents that specific
one. Hopefully the recovery will be similar in other scenarios.

The split brain was the result of an operator running this command to
fail over the instances running on the node:

    gnt-node failover --ignore-consistency fsn-node-07.torproject.org

The symptom of the split brain is that a VM is running on two
machines at once. You will see that in `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying node status
    Thu Apr 22 01:28:04 2021 - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org

In the above, the verification finds an instance running on an
unexpected server (the old primary). Disks will be in a similar
"degraded" state, according to `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying instance status
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'

We can also see that symptom on an individual instance:

    root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org
    - Instance name: onionbalance-01.torproject.org
    [...]
      Disks:
        - disk/0: drbd, size 10.0G
          access mode: rw
          nodeA: fsn-node-05.torproject.org, minor=29
          nodeB: fsn-node-07.torproject.org, minor=26
          port: 11031
          on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED*
          on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED*
    [...]
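
The DRBD state can also be inspected directly on the nodes, outside of
Ganeti. This is not required for the procedure below, but it can help
confirm which side is out of sync; a sketch only: `/proc/drbd` exists
with the drbd8 kernel module, and `drbdsetup status` requires a more
recent drbd-utils:

    # the device minor numbers match the ones shown by "gnt-instance info"
    cat /proc/drbd
    # on newer drbd-utils this gives a friendlier summary
    drbdsetup status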

The first (optional) thing to do in a split brain scenario is to stop
the damage done by the running instances: stop all the instances
running in parallel, on both the previous and new primaries:

    gnt-instance stop $INSTANCES
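
Here `$INSTANCES` is simply the list of affected instance names. It
can be read off the `gnt-cluster verify` errors above, or built with
something like this sketch (the field selection and the filter are
assumptions, double-check the result before acting on it):

    INSTANCES=$(gnt-instance list -o name,pnode,snodes --no-headers \
        | awk '/fsn-node-07/ {print $1}')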

Then, on `fsn-node-07`, use `kill(1)` to shut down the `qemu`
processes running the VMs directly. Now the instances should all be
shut down and no further changes that could be lost will be made on
the VMs.
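
A sketch for finding those stray processes on the old primary,
assuming the usual KVM process names (double-check the list before
killing anything):

    # list the running qemu processes and which instance each belongs to
    pgrep -a -f qemu-system
    # then send a TERM signal to each of them, for example:
    kill 12345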

(This step is optional because you can also skip straight to the hard
decision below, while leaving the instances running. But that adds
pressure to you, and we don't want to do that to your poor brain right
now.)

That will leave you time to make a more important decision: which node
will be authoritative (and will keep running as primary) and which one
will "lose" (and will have its instances destroyed)? There's no easy
right or wrong answer for this: it's a judgement call. In any case,
there might already have been data loss: for as long as both nodes
were available and the VMs running on both, data written on one of the
nodes during the split brain will be lost when we destroy the state on
the "losing" node.

If you have picked the previous primary as the "new" primary, you will
need to *first* revert the failover and flip the instances back to the
previous primary:

    for instance in $INSTANCES; do
        gnt-instance failover $instance
    done

When that is done, or if you have picked the "new" primary (the one
the instances were originally failed over to) as the official one, you
need to fix the disks' state. For this, flip to a "plain" disk
(i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring
the disk, and reallocate a new disk in the right place. Assuming all
instances are stopped, this should do it:

    for instance in $INSTANCES ; do
        # drop the DRBD mirror, keeping only the copy on the surviving primary
        gnt-instance modify -t plain $instance
        # recreate a fresh mirror; the resync happens in the background
        gnt-instance modify -t drbd --no-wait-for-sync $instance
        gnt-instance start $instance
        # watch the console to make sure the instance boots properly
        gnt-instance console $instance
    done

Then the machines should be back up, each on a single node, and the
split brain scenario resolved. Note that this means the other side of
the DRBD mirror is destroyed in the procedure: that is the step that
drops the data which was written to the wrong side of the "split
brain".
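
At this point a final check that the cluster is healthy again does
not hurt; both of these should come back clean:

    gnt-cluster verify
    gnt-cluster verify-disks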

Once everything is back to normal, it might be a good idea to
rebalance the cluster.

References:

* the `-t plain` hack comes from [this post on the Ganeti list](https://groups.google.com/g/ganeti/c/l8www_IcFFI)
* [this procedure](https://blkperl.github.io/split-brain-ganeti.html) suggests using `replace-disks -n` (see the sketch
  after this list), which also works, but requires us to pick the
  secondary by hand each time, which is annoying
* [this procedure](https://www.ipserverone.info/knowledge-base/how-to-fix-drbd-recovery-from-split-brain/) has instructions on how to recover at the DRBD
  level directly, but we have not needed those instructions so far
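
For completeness, this is roughly what the `replace-disks` alternative
mentioned above looks like; a sketch only, where the new secondary
node has to be picked by hand for each instance:

    # NEW_SECONDARY is a node you choose by hand for each instance
    gnt-instance replace-disks -n $NEW_SECONDARY $instance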

### Bridge configuration failures