Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by
failing instances over to other nodes. This is currently done
manually; see the migrate section above.
WARNING: the following procedure should be considered a LAST
RESORT. In the vast majority of cases, it is simpler and less risky to
just restart the node using a remote power cycle to restore the
service than to risk the split brain scenario which this procedure can
cause.
WARNING, AGAIN: if for some reason the node you are failing over from
actually returns on its own without you being able to stop it, it
*will* start those DRBD disks and virtual machines, and you *will* end
up in a split brain scenario.
If, say, `fsn-node-07` has completely failed and you are confident it
is not still running in parallel (which could lead to a "split brain"
scenario), you can run this command, which will switch all the
instances on that node to their secondaries:
    gnt-node failover fsn-node-07.torproject.org
It's possible that you need `--ignore-consistency`, but this has
caused trouble in at least one instance (although it might have been
unrelated). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node); for example,
they explicitly say "this has never happened in our setup".
Note that it will still try to connect to the failed node to shut down
the DRBD devices, as a last resort.
Also note that recovering from this failure is non-trivial. You will,
at the very least, need to boot the machine *without* starting the
DRBD disks and virtual machines, otherwise you will end up in the
split brain scenario described above.
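
One possible way to do that, assuming a Debian machine running systemd
where the Ganeti daemons are started through a `ganeti.service` unit
(an assumption to verify on the actual node), is to boot into rescue
mode and mask that unit before letting the machine come back up
normally:

    # hedged sketch, assuming systemd and a ganeti.service unit
    # 1. at the GRUB menu, edit the boot entry and append: systemd.unit=rescue.target
    # 2. from the rescue shell, keep Ganeti (and the DRBD disks and instances
    #    it would bring back up) from starting on the next normal boot:
    systemctl mask ganeti.service
    # 3. reboot; unmask only once the node has been cleaned up from the master
    systemctl reboot
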
Recoveries could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
manual.
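
A minimal sketch of what such a cron job could look like, assuming
`harep` is installed as `/usr/bin/harep` and should only run on the
master node (both assumptions to verify):

    # /etc/cron.d/ganeti-harep -- hypothetical, not shipped by the Debian package
    # run the Ganeti auto-repair tool every 30 minutes on the master node;
    # harep only acts on instances tagged with ganeti:watcher:autorepair:* tags
    */30 * * * * root /usr/bin/harep >> /var/log/ganeti/harep.log 2>&1
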
### Master node failure
A master node failure is a special case, as you do not have access to
the node to run Ganeti commands. We have not established our own
procedure for this yet, see:
 * [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
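
In short, the upstream procedure comes down to promoting one of the
master candidates; a hedged sketch of that step, to be run only once
you are certain the old master is really down:

    # run this on a master candidate node, *not* on the failed master
    gnt-cluster master-failover
    # if fewer than half of the nodes are reachable, upstream documents a
    # --no-voting override; treat it with the same caution as --ignore-consistency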