### Master node failure

A master node failure is a special case, as you may not have access to
the node to run Ganeti commands. The [Ganeti wiki master failover
procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) has good documentation on this, but we also include
scenarios specific to our use cases, to make sure this is also
available offline.

* [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
* [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)

There are two different scenarios that might require a master
failover:

1. the master is *expected* to fail or go down for maintenance
   (looming HDD failure, planned maintenance) and we want to retain
   availability

2. the master has completely failed (motherboard fried, power failure,
   etc)

The key difference between scenario 1 and 2 here is that in scenario
1, the master is *still* available.

#### Scenario 1: preventive maintenance

This is the best-case scenario, as the master is still available. In
that case, it should simply be a matter of running the
`master-failover` command and marking the old master as offline.

On the machine you want to elect as the new master:

    gnt-cluster master-failover
    gnt-node modify --offline yes OLDMASTER.torproject.org
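
Before running the failover, you can confirm that the target node is
actually a master candidate, since `master-failover` can only be run
on one. A minimal check (the `master_candidate` and `offline` output
fields should be available on recent Ganeti versions, but verify with
`gnt-node list-fields`):

    # show the master candidate and offline flags for all nodes
    gnt-node list -o name,master_candidate,offline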

When the old master is available again, re-add it to the cluster with:

    gnt-node add --readd OLDMASTER.torproject.org

Note that it *should* be safe to boot the old master normally, as long
as it doesn't think it's the master before the reboot. That is because
it's the master which tells nodes which VMs to start on boot. You can
check that by running this on the OLDMASTER:

    gnt-cluster getmaster

It should return the *NEW* master.
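
To be extra careful, you can check that every node in the cluster
agrees on who the master is. A possible sketch, assuming root SSH
access to the nodes (the node names here are just examples):

    # ask each node who it thinks the master is; answers should all match
    for node in fsn-node-01 fsn-node-02 fsn-node-03; do
        ssh "root@$node.torproject.org" gnt-cluster getmaster
    done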

Here's an example of a routine failover performed on `fsn-node-01`,
the nominal master of the `gnt-fsn` cluster, failing over to a
secondary master (we picked `fsn-node-02` here) in preparation for a
disk replacement:

    root@fsn-node-02:~# gnt-cluster master-failover
    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-02.torproject.org
    root@fsn-node-02:~# gnt-node modify --offline yes fsn-node-01.torproject.org
    Tue Jun 21 14:30:56 2022 Failed to stop KVM daemon on node 'fsn-node-01.torproject.org': Node is marked offline
    Modified node fsn-node-01.torproject.org
     - master_candidate -> False
     - offline -> True

And indeed, `fsn-node-01` now thinks it's not the master anymore:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-02.torproject.org

And this is how the node was recovered, after a reboot, on the new
master:

    root@fsn-node-02:~# gnt-node add --readd fsn-node-01.torproject.org
    2022-06-21 16:43:52,666: The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies.
    Tue Jun 21 16:43:54 2022 - INFO: Readding a node, the offline/drained flags were reset
    Tue Jun 21 16:43:54 2022 - INFO: Node will be a master candidate
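
If you get the certificate warning shown above, the transcript itself
suggests the fix: renew the certificates cluster-wide. A hedged
sketch, assuming a Ganeti version recent enough to support renewing
only the node certificates:

    # regenerate the per-node SSL certificates across the cluster;
    # see gnt-cluster(8) for the other renew-crypto options
    gnt-cluster renew-crypto --new-node-certificates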

And to promote it back, on the old master:

    root@fsn-node-01:~# gnt-cluster master-failover
    root@fsn-node-01:~#

And both nodes agree on who the master is:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

Now is a good time to verify the cluster too:

    gnt-cluster verify

That's pretty much it! See [tpo/tpa/team#40805](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/40805) for the rest of
that incident.

#### Scenario 2: complete master node failure

In this scenario, the master node is *completely* unavailable. In this
case, the [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) should be
followed pretty much to the letter.

WARNING: if you follow this procedure but skip step 1, you will
probably end up with a split-brain scenario (recovery is documented
below). So make absolutely sure the old master is *REALLY* unavailable
before moving ahead with this.
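
One way to double-check that the old master is really gone, assuming
it normally answers pings and SSH (plain shell, nothing
Ganeti-specific):

    # both of these should FAIL if the old master is really unavailable
    ping -c 3 -w 5 OLDMASTER.torproject.org
    ssh -o ConnectTimeout=5 root@OLDMASTER.torproject.org hostname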

The procedure is, at the time of writing (WARNING: UNTESTED):

1. Make sure that the original failed master won't start again while
   a new master is present, preferably by physically shutting down
   the node.

2. To promote one of the master candidates to master, issue the
   following command on the machine you intend to be the new master:

       gnt-cluster master-failover

3. Offline the old master so the new master doesn't try to
   communicate with it, by issuing the following command:

       gnt-node modify --offline yes oldmaster

4. If there were any DRBD instances on the old master node, they can
   be failed over by issuing the following commands:

       gnt-node evacuate -s oldmaster
       gnt-node evacuate -p oldmaster

5. Any plain instances on the old master need to be recreated; see
   the sketch below for a way to find them.
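
To find which instances had the old master as their primary node, and
with which disk template, something like this should work (the field
names are from `gnt-instance list -o`; double-check them with
`gnt-instance list-fields` on your version):

    # list instances with primary node and disk template, then keep
    # only those hosted on the old master
    gnt-instance list -o name,pnode,disk_template | grep oldmaster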

If the old master becomes available again, re-add it to the cluster
with:

    gnt-node add --readd OLDMASTER.torproject.org

The above procedure is UNTESTED. See also the [Riseup master failover
procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails) for further ideas.

### Split brain recovery