diff --git a/howto/ganeti.md b/howto/ganeti.md
index 609a8e0162d99fdc3da6be200a945a8df8c8a1cc..4a4ca78f53f5e5ee5d03d28f96e12d1d452058ab 100644
--- a/howto/ganeti.md
+++ b/howto/ganeti.md
@@ -1239,14 +1239,64 @@ it directly:
 
 Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
 one machine disappears, the cluster should be able to recover by
-failing over other nodes. This is currently done manually, see the
-migrate section above.
+failing over other nodes. This is currently done manually, however.
 
-This could eventually be automated if such situations occur more
+WARNING: the following procedure should be considered a LAST
+RESORT. In the vast majority of cases, it is simpler and less risky to
+just restart the node using a remote power cycle to restore the
+service than risking a split brain scenario which this procedure can
+cause.
+
+WARNING, AGAIN: if for some reason the node you are failing over from
+actually returns on its own without you being able to stop it, it
+*will* start those DRBD disks and virtual machines, and you *will* end
+up in a split brain scenario.
+
+If, say, `fsn-node-07` completely fails and you are completely
+confident it is not still running in parallel (which could lead to a
+"split brain" scenario), you can run this command which will switch
+all the instances on that node to their secondaries:
+
+    gnt-node failover fsn-node-07.torproject.org
+
+It's possible that you need `--ignore-consistency` but this has caused
+trouble in at least one instance (although it might have been
+unrelated). In any case, it is [not used at the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for
+example, they explicitly say "this has never happened in our
+setup".
+
+Note that it will still try to connect to the failed node to shut down
+the DRBD devices, as a last resort.
+
+Also note that recovering from this failure is non-trivial. You will,
+at the very least, need to boot the machine *without* starting the
+DRBD disks and virtual machines somehow, otherwise that is how you end
+up in the split brain scenario.
+
+Recoveries could eventually be automated if such situations occur more
 often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in
-Debian by default. See also the [autorepair](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair) section of the admin
+Debian by default. See also the [autorepair](http://docs.ganeti.org/docs/ganeti/2.15/html/admin.html#autorepair) section of the admin
 manual.
 
+### Master node failure
+
+A master node failure is a special case, as you do not have access to
+the node to run Ganeti commands. We have not established our own
+procedure for this yet, see:
+
+ * [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
+ * [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)
+
+TODO: expand documentation on master node failure recovery.
+
+### Split brain recovery
+
+A split brain occurred during a partial failure, failover, then
+unexpected recovery of fsn-node-07. This needs to be documented
+properly, but a work log is available in [issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229).
+
+TODO: expand documentation on split brain recovery.
+
 ### Bridge configuration failures
 
 If you get the following error while trying to bring up the bridge:
@@ -1347,11 +1397,6 @@ For this, see [DRBD: deleting a stray device](howto/drbd#deleting-a-stray-device
 
 ### Other troubleshooting
 
-Riseup has [documentation on various failure scenarios](https://we.riseup.net/riseup+tech/ganeti#failure-scenarios) including
-master failover, which we haven't tested. There's also upstream
-documentation on [changing node roles](http://docs.ganeti.org/ganeti/2.15/html/admin.html#changing-the-node-role) which might be useful for a
-master failover scenario.
-
 The [walkthrough](http://docs.ganeti.org/ganeti/2.15/html/walkthrough.html) also has a few recipes to resolve
 common problems.
 