disks (so to speak) with, for example:
|
|
|
|
|
gnt-instance activate-disks onionbalance-02.torproject.org
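
If several instances might have inactive disks (after a node outage, for
example), a cluster-wide check can save time. This is only a sketch: it
assumes a Ganeti version where `gnt-cluster verify-disks` is available, which
reports instances with degraded or inactive disks and tries to re-activate
them:

    gnt-cluster verify-disks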
|
|
### Failed disk on node
|
|
|
|
|
|
|
|
If a disk fails on a node, we should get it replaced as soon as possible. Here
are the steps one can follow to achieve that:
|
1. Open an incident-type issue in GitLab in the TPA/Team project. Set its
   priority to High.
|
2. Empty the node of its instances. In the `fabric-tasks` repository:
   `./ganeti -H $cluster-node-$number.torproject.org empty-node`
   * Take note in the issue of which instances were migrated by this
     operation.
|
3. Open a support ticket with Hetzner and then, once the machine is back
   online with the new disk, replace it in the appropriate RAID arrays. See
   [the RAID documentation page](howto/raid#replacing-a-drive).
|
4. Finally, bring back the instances on the node, using the list of instances
   noted down at step 2. Still in `fabric-tasks`: `fab -H $cluster_master -i
   instance1 -i instance2` (see the consolidated sketch after this list).
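
As a rough consolidation of steps 2 and 4, the whole procedure might look like
the following sketch. The node placeholder is the same as above, the instance
names are hypothetical, and the exact `fab` invocation should be
double-checked against the `fabric-tasks` repository before running it:

    # inside a checkout of the fabric-tasks repository
    # step 2: drain the failing node, noting which instances get migrated
    ./ganeti -H $cluster-node-$number.torproject.org empty-node

    # step 3 happens out of band: Hetzner swaps the disk, then the new drive
    # is re-added to the RAID arrays (see howto/raid#replacing-a-drive)

    # step 4: once the node is healthy again, move the noted instances back,
    # running against the cluster master
    fab -H $cluster_master -i instance-01.torproject.org -i instance-02.torproject.org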
|
|
|
|
|
|
## Disaster recovery
|
|
If things get completely out of hand and the cluster becomes too
|
|