### Master node failure

A master node failure is a special case, as you may not have access to
the node to run Ganeti commands. The [Ganeti wiki master failover
procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) has good documentation on this, but we also include
scenarios specific to our use cases, to make sure this is also
available offline.

* [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master)
* [Riseup master failover procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails)

There are two different scenarios that might require a master
failover:

1. the master is *expected* to fail or go down for maintenance
   (looming HDD failure, planned maintenance) and we want to retain
   availability

2. the master has completely failed (motherboard fried, power failure,
   etc)

The key difference between scenario 1 and 2 here is that in scenario
1, the master is *still* available.

#### Scenario 1: preventive maintenance

This is the best-case scenario, as the master is still available. In
that case, it should simply be a matter of running the
`master-failover` command and marking the old master as offline.

On the machine you want to elect as the new master:

    gnt-cluster master-failover
    gnt-node modify --offline yes OLDMASTER.torproject.org
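
Before running the failover, you can confirm that the target node is
actually a master candidate, since `master-failover` can only be run
on one. A minimal check (the `master_candidate` and `offline` output
fields should be available on recent Ganeti versions, but verify with
`gnt-node list-fields`):

    # show the master candidate and offline flags for all nodes
    gnt-node list -o name,master_candidate,offline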

When the old master is available again, re-add it to the cluster with:

    gnt-node add --readd OLDMASTER.torproject.org

Note that it *should* be safe to boot the old master normally, as long
as it doesn't think it's the master before the reboot. That is because
it's the master which tells nodes which VMs to start on boot. You can
check that by running this on the OLDMASTER:

    gnt-cluster getmaster

It should return the *NEW* master.
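
To be extra careful, you can check that every node in the cluster
agrees on who the master is. A possible sketch, assuming root SSH
access to the nodes (the node names here are just examples):

    # ask each node who it thinks the master is; answers should all match
    for node in fsn-node-01 fsn-node-02 fsn-node-03; do
        ssh "root@$node.torproject.org" gnt-cluster getmaster
    done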

Here's an example of a routine failover performed on `fsn-node-01`,
the nominal master of the `gnt-fsn` cluster, failing over to a
secondary master (we picked `fsn-node-02` here) in preparation for a
disk replacement:

    root@fsn-node-02:~# gnt-cluster master-failover
    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-02.torproject.org
    root@fsn-node-02:~# gnt-node modify --offline yes fsn-node-01.torproject.org
    Tue Jun 21 14:30:56 2022 Failed to stop KVM daemon on node 'fsn-node-01.torproject.org': Node is marked offline
    Modified node fsn-node-01.torproject.org
     - master_candidate -> False
     - offline -> True

And indeed, `fsn-node-01` now thinks it's not the master anymore:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-02.torproject.org

And this is how the node was recovered, after a reboot, on the new
master:

    root@fsn-node-02:~# gnt-node add --readd fsn-node-01.torproject.org
    2022-06-21 16:43:52,666: The certificate differs after being reencoded. Please renew the certificates cluster-wide to prevent future inconsistencies.
    Tue Jun 21 16:43:54 2022 - INFO: Readding a node, the offline/drained flags were reset
    Tue Jun 21 16:43:54 2022 - INFO: Node will be a master candidate
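
If you get the certificate warning shown above, the transcript itself
suggests the fix: renew the certificates cluster-wide. A hedged
sketch, assuming a Ganeti version recent enough to support renewing
only the node certificates:

    # regenerate the per-node SSL certificates across the cluster;
    # see gnt-cluster(8) for the other renew-crypto options
    gnt-cluster renew-crypto --new-node-certificates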

And to promote it back, on the old master:

    root@fsn-node-01:~# gnt-cluster master-failover
    root@fsn-node-01:~#

And both nodes agree on who the master is:

    root@fsn-node-01:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

    root@fsn-node-02:~# gnt-cluster getmaster
    fsn-node-01.torproject.org

Now is a good time to verify the cluster too:

    gnt-cluster verify

That's pretty much it! See [tpo/tpa/team#40805](https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/40805) for the rest of
that incident.

#### Scenario 2: complete master node failure

In this scenario, the master node is *completely* unavailable. In this
case, the [Ganeti wiki master failover procedure](https://github.com/ganeti/ganeti/wiki/Common-Issues#master-failuresafter-a-failure-two-nodes-think-they-are-master) should be
followed pretty much to the letter.

WARNING: if you follow this procedure but skip step 1, you will
probably end up with a split-brain scenario (recovery is documented
below). So make absolutely sure the old master is *REALLY* unavailable
before moving ahead with this.
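
One way to double-check that the old master is really gone, assuming
it normally answers pings and SSH (plain shell, nothing
Ganeti-specific):

    # both of these should FAIL if the old master is really unavailable
    ping -c 3 -w 5 OLDMASTER.torproject.org
    ssh -o ConnectTimeout=5 root@OLDMASTER.torproject.org hostname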

The procedure is, at the time of writing (WARNING: UNTESTED):

1. Make sure that the original failed master won't start again while
   a new master is present, preferably by physically shutting down
   the node.

2. To promote one of the master candidates to master, issue the
   following command on the machine you intend to be the new master:

       gnt-cluster master-failover

3. Offline the old master so the new master doesn't try to
   communicate with it, by issuing the following command:

       gnt-node modify --offline yes oldmaster

4. If there were any DRBD instances on the old master node, they can
   be failed over by issuing the following commands:

       gnt-node evacuate -s oldmaster
       gnt-node evacuate -p oldmaster

5. Any plain instances on the old master need to be recreated; see
   the sketch below for a way to find them.
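
To find which instances had the old master as their primary node, and
with which disk template, something like this should work (the field
names are from `gnt-instance list -o`; double-check them with
`gnt-instance list-fields` on your version):

    # list instances with primary node and disk template, then keep
    # only those hosted on the old master
    gnt-instance list -o name,pnode,disk_template | grep oldmaster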

If the old master becomes available again, re-add it to the cluster
with:

    gnt-node add --readd OLDMASTER.torproject.org

The above procedure is UNTESTED. See also the [Riseup master failover
procedure](https://we.riseup.net/riseup+tech/ganeti#primary-node-fails) for further ideas.

### Split brain recovery