expand on node failures after tonight's adventure

Unverified Commit 6ba5ccbb authored 3 years ago by anarcat
--- a/howto/ganeti.md
+++ b/howto/ganeti.md
@@ -1239,14 +1239,64 @@ it directly:

 Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
 one machine disappears, the cluster should be able to recover by
-failing over other nodes. This is currently done manually, see the
-migrate section above.
+failing over other nodes. This is currently done manually, however.

-This could eventually be automated if such situations occur more
+WARNING: the following procedure should be considered a LAST
+RESORT. In the vast majority of cases, it is simpler and less risky to
+just restart the node using a remote power cycle to restore the
+service than to risk the split brain scenario which this procedure can
+cause.
+
+WARNING, AGAIN: if for some reason the node you are failing over from
+actually returns on its own without you being able to stop it, it
+*will* start those DRBD disks and virtual machines, and you *will* end
+up in a split brain scenario. 
+
+If, say, `fsn-node-07` fails completely and you are fully
+confident it is not still running in parallel (which could lead to a
+"split brain" scenario), you can run this command, which will switch
+all the instances on that node to their secondaries:
+
+    gnt-node failover fsn-node-07.torproject.org
+
+You might need `--ignore-consistency`, but this has caused
+trouble in at least one instance (although it might have been
+unrelated). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for
+example: they explicitly say "this has never happened in our
+setup".
+
+Note that the command will still try to connect to the failed node to
+shut down the DRBD devices, as a last resort.
+
+Also note that recovering from this failure is non-trivial. You will,
+at the very least, need to boot the machine *without* starting the
+DRBD disks and virtual machines somehow, otherwise that is how you end
+up in the split brain scenario.
+
+Recoveries could eventually be automated if such situations occur more
 often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
-Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
+Debian by default. See also the [autorepair](http://docs.ganeti.org/docs/ganeti/2.15/html/admin.html#autorepair) section of the admin
 manual.

+### Master node failure
+
+A master node failure is a special case, as you do not have access to
+the node to run Ganeti commands. We have not established our own
+procedure for this yet. In the meantime, see:
+
+ * [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
+ * [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)
+
+TODO: expand documentation on master node failure recovery.
+
+### Split brain recovery
+
+A split brain occurred during a partial failure, failover, then
+unexpected recovery of fsn-node-07. This needs to be documented
+properly, but a work log is available in [issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229).
+
+TODO: expand documentation on split brain recovery.
+
 ### Bridge configuration failures

 If you get the following error while trying to bring up the bridge:
@@ -1347,11 +1397,6 @@ For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device

 ### Other troubleshooting

-Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
-master failover, which we haven't tested. There's also upstream
-documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
-master failover scenario.
-
 The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve common
 problems.
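
The node failure procedure added in the first hunk depends on being confident that the failed node is really down before running `gnt-node failover`. A minimal sketch of a pre-flight guard for that step follows. The `probe` and `failover_command` helpers are hypothetical and not part of Ganeti, and the sketch assumes Linux iputils `ping`. It only prints the command rather than running it, because a node that does not answer ping may merely be unreachable from where you are, not powered off.

```shell
#!/bin/sh
# Sketch only: prints the failover command instead of running it.
# probe and failover_command are hypothetical helpers, not part of Ganeti.

# probe NODE: succeeds if NODE answers ping.
# Assumption: Linux iputils ping (-c count, -W reply timeout in seconds).
probe() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# failover_command NODE: print the failover command only if NODE looks dead.
# A silent node is NOT proof that it is powered off, so a human must
# still confirm (for example through remote power control) before running it.
failover_command() {
    node="$1"
    if probe "$node"; then
        echo "refusing: $node still answers ping, not failing over" >&2
        return 1
    fi
    echo "gnt-node failover $node"
}
```

After sourcing the file, `failover_command fsn-node-07.torproject.org` prints the command from the procedure only when the node does not answer. Printing instead of executing keeps the final decision with the operator, in line with the "last resort" warning.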