it directly:

    LV Tags
    originstname+bacula-director-01.torproject.org

### Node failure

Ganeti clusters are designed to be [self-healing](http://docs.ganeti.org/ganeti/2.15/html/admin.html#autorepair). As long as only
one machine disappears, the cluster should be able to recover by

WARNING: the following procedure should be considered a LAST
RESORT. In the vast majority of cases, it is simpler and less risky to
just restart the node using a remote power cycle to restore the
service than risking a split brain scenario, which this procedure can
cause when not followed properly.

WARNING, AGAIN: if for some reason the node you are failing over from
actually returns on its own without you being able to stop it, it
*may* run those DRBD disks and virtual machines, and you *may* end
up in a split brain scenario.

If, say, `fsn-node-07` completely fails and you need to restore
service to the virtual machines running on that server, you can fail
over to the secondaries. Before you do, however, you need to be
completely confident it is not still running in parallel, which could
lead to a "split brain" scenario. For that, cut the power to the
machine using out-of-band management (e.g. on Hetzner, power down the
machine through the Hetzner Robot; on Cymru, use the iDRAC to cut the
power to the main board).
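
Where the out-of-band interface speaks IPMI (an iDRAC does, when
IPMI-over-LAN is enabled; the Hetzner Robot does not), the power cut
can also be done from a shell. This is only a sketch: the BMC
hostname and credentials below are placeholders, not real values:

    # hypothetical BMC address and credentials, adjust to the real
    # out-of-band interface of the failed node
    ipmitool -I lanplus -H fsn-node-07.oob.example.com -U admin -P "$IPMI_PASSWORD" \
      chassis power off
    # confirm the chassis is actually off before going further
    ipmitool -I lanplus -H fsn-node-07.oob.example.com -U admin -P "$IPMI_PASSWORD" \
      chassis power status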

Once the machine is powered down, instruct Ganeti to stop using it
altogether:

    gnt-node modify --offline=yes fsn-node-07
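
To double-check that Ganeti really considers the node gone, the node
list should now show it flagged as offline (a sketch; the exact
columns vary between Ganeti versions):

    gnt-node list
    # or just the affected node, with explicit fields
    gnt-node list -o name,offline fsn-node-07.torproject.org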

Then, once the machine is offline and Ganeti also agrees, switch all
the instances on that node to their secondaries:

    gnt-node failover fsn-node-07.torproject.org
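
Once the failover completes, it is worth confirming that no instance
still lists the dead node as its primary. A possible check, assuming
the standard `gnt-instance list` field names:

    gnt-instance list -o name,pnode,snodes,status | grep fsn-node-07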

It's possible that you need `--ignore-consistency`, but this has caused
trouble in the past (see [40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). In any case, it is [not used at
the WMF](https://wikitech.wikimedia.org/wiki/Ganeti#Failed_hardware_node), for example: they explicitly say they never needed the
flag.

Note that it will still try to connect to the failed node to shut down
the DRBD devices, as a last resort.

Recovering from the failure should be automatic: once the failed
server is repaired and restarts, it will contact the master to ask for
instances to start. Since the instances have been migrated to other
machines, none will be started and there *should* not be any
inconsistencies.

Once the machine is up and running and you are confident you do not
have a split brain scenario, you can re-add the machine to the cluster
with:

    gnt-node add --readd fsn-node-07.torproject.org

Once that is done, rebalance the cluster, because you now have an
empty node which could (hopefully) be reused. It might, obviously, be
worth exploring the root cause of the failure before re-adding the
machine to the cluster.
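
Rebalancing is typically done with the `hbal` tool from
`ganeti-htools`; a minimal sketch, run on the master node (`-L` talks
to the local Luxi socket, `-X` actually executes the moves):

    # dry run: only print the planned moves
    hbal -L
    # then execute the rebalancing for real
    hbal -L -X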

Recoveries could eventually be automated if such situations occur more
often, by scheduling a [harep](http://docs.ganeti.org/ganeti/2.15/man/harep.html) cron job, which isn't enabled in

### Master node failure

TODO: expand documentation on master node failure recovery.

### Split brain recovery

A split brain occurred during a partial failure, failover, then
unexpected recovery of `fsn-node-07` ([issue 40229](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40229)). It might
occur in other scenarios, but this section documents that specific
one. Hopefully the recovery will be similar in other scenarios.

The split brain was the result of an operator running this command to
fail over the instances running on the node:

    gnt-node failover --ignore-consistency fsn-node-07.torproject.org

The symptom of the split brain is that a VM is running on two
machines at once. You will see that in `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying node status
    Thu Apr 22 01:28:04 2021 - ERROR: instance palmeri.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance onionoo-backend-02.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance polyanthum.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance onionbalance-01.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance henryi.torproject.org: instance should not run on node fsn-node-07.torproject.org
    Thu Apr 22 01:28:04 2021 - ERROR: instance nevii.torproject.org: instance should not run on node fsn-node-07.torproject.org

In the above, the verification finds an instance running on an
unexpected server (the old primary). Disks will be in a similar
"degraded" state, according to `gnt-cluster verify`:

    Thu Apr 22 01:28:04 2021 * Verifying instance status
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-07.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/0 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/1 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'
    Thu Apr 22 01:28:04 2021 - WARNING: instance onionoo-backend-02.torproject.org: disk/2 on fsn-node-06.torproject.org is degraded; local disk state is 'ok'

We can also see that symptom on an individual instance:

    root@fsn-node-01:~# gnt-instance info onionbalance-01.torproject.org
    - Instance name: onionbalance-01.torproject.org
    [...]
      Disks:
        - disk/0: drbd, size 10.0G
          access mode: rw
          nodeA: fsn-node-05.torproject.org, minor=29
          nodeB: fsn-node-07.torproject.org, minor=26
          port: 11031
          on primary: /dev/drbd29 (147:29) in sync, status *DEGRADED*
          on secondary: /dev/drbd26 (147:26) in sync, status *DEGRADED*
    [...]
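
The DRBD state can also be inspected directly on the nodes, outside of
Ganeti. This is not required for the procedure below, but it can help
confirm which side is out of sync; a sketch only: `/proc/drbd` exists
with the drbd8 kernel module, and `drbdsetup status` requires a more
recent drbd-utils:

    # the device minor numbers match the ones shown by "gnt-instance info"
    cat /proc/drbd
    # on newer drbd-utils this gives a friendlier summary
    drbdsetup status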

The first (optional) thing to do in a split brain scenario is to stop
the damage done by the running instances: stop all the instances
running in parallel, on both the previous and new primaries:

    gnt-instance stop $INSTANCES
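
Here `$INSTANCES` is simply the list of affected instance names. It
can be read off the `gnt-cluster verify` errors above, or built with
something like this sketch (the field selection and the filter are
assumptions, double-check the result before acting on it):

    INSTANCES=$(gnt-instance list -o name,pnode,snodes --no-headers \
        | awk '/fsn-node-07/ {print $1}')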

Then, on `fsn-node-07`, use `kill(1)` to shut down the `qemu`
processes running the VMs directly. Now the instances should all be
shut down and no further changes that could be lost will be made on
the VMs.
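
A sketch for finding those stray processes on the old primary,
assuming the usual KVM process names (double-check the list before
killing anything):

    # list the running qemu processes and which instance each belongs to
    pgrep -a -f qemu-system
    # then send a TERM signal to each of them, for example:
    kill 12345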

(This step is optional because you can also skip straight to the hard
decision below, while leaving the instances running. But that adds
pressure to you, and we don't want to do that to your poor brain right
now.)

That will leave you time to make a more important decision: which node
will be authoritative (and will keep running as primary) and which one
will "lose" (and will have its instances destroyed)? There's no easy
right or wrong answer for this: it's a judgement call. In any case,
there might already have been data loss: for as long as both nodes
were available and the VMs running on both, data written on one of the
nodes during the split brain will be lost when we destroy the state on
the "losing" node.

If you have picked the previous primary as the "new" primary, you will
need to *first* revert the failover and flip the instances back to the
previous primary:

    for instance in $INSTANCES; do
        gnt-instance failover $instance
    done

When that is done, or if you have picked the "new" primary (the one
the instances were originally failed over to) as the official one, you
need to fix the disks' state. For this, flip to a "plain" disk
(i.e. turn off DRBD) and turn DRBD back on. This will stop mirroring
the disk, and reallocate a new disk in the right place. Assuming all
instances are stopped, this should do it:

    for instance in $INSTANCES ; do
        # drop the DRBD mirror, keeping only the copy on the surviving primary
        gnt-instance modify -t plain $instance
        # recreate a fresh mirror; the resync happens in the background
        gnt-instance modify -t drbd --no-wait-for-sync $instance
        gnt-instance start $instance
        # watch the console to make sure the instance boots properly
        gnt-instance console $instance
    done

Then the machines should be back up, each on a single node, and the
split brain scenario resolved. Note that this means the other side of
the DRBD mirror is destroyed in the procedure: that is the step that
drops the data which was written to the wrong side of the "split
brain".
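
At this point a final check that the cluster is healthy again does
not hurt; both of these should come back clean:

    gnt-cluster verify
    gnt-cluster verify-disks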

Once everything is back to normal, it might be a good idea to
rebalance the cluster.

References:

* the `-t plain` hack comes from [this post on the Ganeti list](https://groups.google.com/g/ganeti/c/l8www_IcFFI)
* [this procedure](https://blkperl.github.io/split-brain-ganeti.html) suggests using `replace-disks -n` (see the sketch
  after this list), which also works, but requires us to pick the
  secondary by hand each time, which is annoying
* [this procedure](https://www.ipserverone.info/knowledge-base/how-to-fix-drbd-recovery-from-split-brain/) has instructions on how to recover at the DRBD
  level directly, but we have not needed those instructions so far
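
For completeness, this is roughly what the `replace-disks` alternative
mentioned above looks like; a sketch only, where the new secondary
node has to be picked by hand for each instance:

    # NEW_SECONDARY is a node you choose by hand for each instance
    gnt-instance replace-disks -n $NEW_SECONDARY $instance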

### Bridge configuration failures