... | ... | @@ -1239,14 +1239,64 @@ it directly: |
|
|
|
|
|
Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
|
|
|
one machine disappears, the cluster should be able to recover by
|
|
|
failing over other nodes. This is currently done manually, however; see
|
|
|
the migrate section above.
|
|
|
|
|
|
WARNING: the following procedure should be considered a LAST
|
|
|
RESORT. In the vast majority of cases, it is simpler and less risky to
|
|
|
just restart the node using a remote power cycle to restore the
|
|
|
service than to risk a split brain scenario, which this procedure can
|
|
|
cause.
|
|
|
|
|
|
WARNING, AGAIN: if for some reason the node you are failing over from
|
|
|
actually returns on its own without you being able to stop it, it
|
|
|
*will* start those DRBD disks and virtual machines, and you *will* end
|
|
|
up in a split brain scenario.
|
|
|
|
|
|
If, say, `fsn-node-07` completely fails and you are completely
|
|
|
confident it is not still running in parallel (which could lead to a
|
|
|
"split brain" scenario), you can run this command which will switch
|
|
|
all the instances on that node to their secondaries:
|
|
|
|
|
|
    gnt-node failover fsn-node-07.torproject.org
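
As a sanity check before running the command above, the failover can be wrapped in a guard that refuses to proceed while the node still answers on the network. This is only a minimal sketch: `node_is_reachable` and `guarded_failover` are hypothetical helper names, the `PROBE` and `RUN` variables exist only for dry runs, and the `ping` flags assume the Linux iputils version. A node that does not answer ping is *not* proven to be down (it may only be partitioned from the network), so this complements, rather than replaces, confirming that the machine is powered off.

```shell
#!/bin/sh
# Sketch only: refuse to fail over a node that still answers on the network.
# node_is_reachable and guarded_failover are hypothetical helper names.

# Return 0 if the node answers a single ping within 2 seconds.
# PROBE may be overridden for dry runs (e.g. PROBE=true or PROBE=false).
node_is_reachable() {
    ${PROBE:-ping -c 1 -W 2} "$1" >/dev/null 2>&1
}

# Run the failover only if the node looks unreachable.
# RUN=echo prints the command instead of executing it.
guarded_failover() {
    node="$1"
    if node_is_reachable "$node"; then
        echo "refusing: $node still answers, power it off first" >&2
        return 1
    fi
    ${RUN:-} gnt-node failover "$node"
}
```

For example, `RUN=echo guarded_failover fsn-node-07.torproject.org` only prints the `gnt-node` command it would run, and only if the node does not answer ping.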
|
|
|
|
|
|
It's possible that you need `--ignore-consistency` but this has caused
|
|
|
trouble in at least one instance (although it might have been
|
|
|
unrelated). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node); for
|
|
|
example, they explicitly say "this has never happened in our
|
|
|
setup".
|
|
|
|
|
|
Note that it will still try to connect to the failed node to shut down
|
|
|
the DRBD devices, as a last resort.
|
|
|
|
|
|
Also note that recovering from this failure is non-trivial. You will,
|
|
|
at the very least, need to boot the machine *without* starting the
|
|
|
DRBD disks and virtual machines somehow, otherwise that is how you end
|
|
|
up in the split brain scenario.
|
|
|
|
|
|
Recoveries could eventually be automated if such situations occur more
|
|
|
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
|
|
|
Debian by default. See also the [autorepair](http://docs.ganeti.org/docs/ganeti/2.15/html/admin.html#autorepair) section of the admin
|
|
|
manual.
|
|
|
|
|
|
### Master node failure
|
|
|
|
|
|
A master node failure is a special case, as you do not have access to
|
|
|
the node to run Ganeti commands. We have not established our own
|
|
|
procedure for this yet; see:
|
|
|
|
|
|
* [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
|
|
|
* [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)
|
|
|
|
|
|
TODO: expand documentation on master node failure recovery.
|
|
|
|
|
|
### Split brain recovery
|
|
|
|
|
|
A split brain occurred during a partial failure, failover, then
|
|
|
unexpected recovery of `fsn-node-07`. This needs to be documented
|
|
|
properly, but a work log is available in [issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229).
|
|
|
|
|
|
TODO: expand documentation on split brain recovery.
|
|
|
|
|
|
### Bridge configuration failures
|
|
|
|
|
|
If you get the following error while trying to bring up the bridge:
|
... | ... | @@ -1347,11 +1397,6 @@ For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device |
|
|
|
|
|
### Other troubleshooting
|
|
|
|
|
|
Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
|
|
|
master failover, which we haven't tested. There's also upstream
|
|
|
documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
|
|
|
master failover scenario.
|
|
|
|
|
|
The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
|
|
|
problems.
|
|
|
|
... | ... | |