Unverified commit 6ba5ccbb authored by anarcat

expand on node failures after tonight's adventure

parent d9ccd76e
@@ -1239,14 +1239,64 @@ it directly:
Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by
failing over other nodes. This is currently done manually, see the
migrate section above.
failing over other nodes. This is currently done manually, however.
This could eventually be automated if such situations occur more
WARNING: the following procedure should be considered a LAST
RESORT. In the vast majority of cases, it is simpler and less risky to
just restart the node using a remote power cycle to restore the
service than to risk the split brain scenario which this procedure can
cause.
WARNING, AGAIN: if for some reason the node you are failing over from
actually returns on its own without you being able to stop it, it
*will* start those DRBD disks and virtual machines, and you *will* end
up in a split brain scenario.
If, say, `fsn-node-07` fails completely and you are fully
confident it is not still running in parallel (which could lead to a
"split brain" scenario), you can run this command, which will switch
all the instances on that node to their secondaries:
gnt-node failover fsn-node-07.torproject.org
You might need `--ignore-consistency`, but this has caused
trouble in at least one instance (although it might have been
unrelated). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for
example: they explicitly say "this has never happened in our
setup".
Note that it will still try to connect to the failed node to shut down
the DRBD devices, as a last resort.
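
The precondition above (being confident the failed node is really gone
before running `gnt-node failover`) can be sketched as a small
wrapper. This is only an illustration, not an established procedure:
the `safe_failover` function and the `PROBE` and `FAILOVER` variables
are hypothetical names, and the `ping` options assume the Linux
iputils version. Note that the check only works in one direction: a
node that answers is definitely still up, but a node that does not
answer may simply be unreachable over the network, so a remote power
off remains the only real guarantee.

```shell
#!/bin/sh
# Hypothetical pre-flight wrapper around "gnt-node failover".
# PROBE and FAILOVER are assumptions made for this sketch; they can be
# overridden to exercise the logic without touching a real cluster.
PROBE="${PROBE:-ping -c 3 -W 2}"
FAILOVER="${FAILOVER:-gnt-node failover}"

safe_failover() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: safe_failover NODE" >&2
        return 2
    fi
    # A node that still answers may still be running its DRBD disks
    # and instances: failing over now risks a split brain.
    if $PROBE "$node" >/dev/null 2>&1; then
        echo "refusing: $node still responds, power it off first" >&2
        return 1
    fi
    # No answer is NOT proof that the node is down; the operator must
    # have confirmed that out of band before getting here.
    $FAILOVER "$node"
}
```

On the master, this would be called as `safe_failover
fsn-node-07.torproject.org`. Setting `FAILOVER=echo` beforehand prints
the node name instead of running the failover.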
Also note that recovering from this failure is non-trivial. You will,
at the very least, need to boot the machine *without* starting the
DRBD disks and virtual machines somehow, otherwise that is how you end
up in the split brain scenario.
Recoveries could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
Debian by default. See also the [autorepair](http://docs.ganeti.org/docs/ganeti/2.15/html/admin.html#autorepair) section of the admin
manual.
### Master node failure
A master node failure is a special case, as you do not have access to
the node to run Ganeti commands. We have not established our own
procedure for this yet; see:
* [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
* [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)
TODO: expand documentation on master node failure recovery.
### Split brain recovery
A split brain occurred during a partial failure, failover, then
unexpected recovery of fsn-node-07. This needs to be documented
properly, but a work log is available in [issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229).
TODO: expand documentation on split brain recovery.
### Bridge configuration failures
If you get the following error while trying to bring up the bridge:
@@ -1347,11 +1397,6 @@ For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device
### Other troubleshooting
Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
master failover, which we haven't tested. There's also upstream
documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
master failover scenario.
The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
problems.
......