anarcat · 840401c3
--- a/howto/backup.md
+++ b/howto/backup.md
@@ -703,6 +703,79 @@ itself. This is now part of the [host retirement procedure][].
 Hint: see also the [howto/postgresql](howto/postgresql) documentation for the backup
 procedures specific to that database.

+### Out of disk scenario
+
+The storage server's disk *can* fill up (and *has*), which causes
+backup jobs to fail. A first sign of this is a Nagios warning
+about disk usage:
+
+    DISK WARNING - free space: /srv/backups/bacula 5891123 MB (9% inode=99%):
+
+Normally, this is not too much of an issue: above, there is still a
+whopping 5TB of disk space available on the server! But in certain
+conditions, this can actually disappear quickly. In October 2023, that
+5TB was filled up in less than 24 hours ([tpo/tpa/team#41361](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41361/)),
+leading to the critical notification:
+
+    Subject: ** PROBLEM Service Alert: bungei/disk usage on /srv/backups/bacula is CRITICAL **
+
+Then jobs started failing:
+
+    Date: Wed, 18 Oct 2023 17:15:47 +0000
+    From: bacula-service@torproject.org
+    To: bacula-service@torproject.org
+    Subject: Bacula: Intervention needed for archive-01.torproject.org.2023-10-18_13.15.43_59
+
+    18-Oct 17:15 bungei.torproject.org-sd JobId 246219: Job archive-01.torproject.org.2023-10-18_13.15.43_59 is waiting. Cannot find any appendable volumes.
+    Please use the "label" command to create a new Volume for:
+        Storage:      "FileStorage-archive-01.torproject.org" (/srv/backups/bacula/archive-01.torproject.org)
+        Pool:         poolfull-torproject-archive-01.torproject.org
+        Media type:   File-archive-01.torproject.org
+
+Eventually, an email with the following first line goes out:
+
+    18-Oct 18:15 bungei.torproject.org-sd JobId 246219: Please mount append Volume "torproject-archive-01.torproject.org-full.2023-10-18_18:10" or label a new one for:
+
+At this point, space needs to be made on the backup server. Normally,
+there's extra space on the volume group available in LVM that can be
+allocated to deal with such a situation. See the output of the `vgs`
+command and follow the resize procedures in the [LVM docs](howto/lvm) in that
+case.
+
+If there *isn't* any space available on the volume group, it *may* be
+acceptable to manually remove old, large files from the storage
+server, but that is generally not recommended.
+
+Once disk space is available again, there will be pending jobs listed
+in `bconsole`'s `status director`:
+
+    JobId  Type Level     Files     Bytes  Name              Status
+    ======================================================================
+    246219  Back Full    723,866    5.763 T archive-01.torproject.org is running
+    246222  Back Incr          0         0  dangerzone-01.torproject.org is waiting for a mount request
+    246223  Back Incr          0         0  ns5.torproject.org is waiting for a mount request
+    246224  Back Incr          0         0  tb-build-05.torproject.org is waiting for a mount request
+    246225  Back Incr          0         0  crm-ext-01.torproject.org is waiting for a mount request
+    246226  Back Incr          0         0  media-01.torproject.org is waiting for a mount request
+    246227  Back Incr          0         0  weather-01.torproject.org is waiting for a mount request
+    246228  Back Incr          0         0  neriniflorum.torproject.org is waiting for a mount request
+    246229  Back Incr          0         0  tb-build-02.torproject.org is waiting for a mount request
+    246230  Back Incr          0         0  survey-01.torproject.org is waiting for a mount request
+
+In the above, the `archive-01` job was the one which took up all free
+space. It had been restarted and was running again, but all the
+other jobs were `waiting for a mount request`. The solution there is
+to just do that mount, using the job ID, for example, for the
+`dangerzone-01` job above:
+
+    bconsole> mount jobid=246222
+
+This should resume all jobs and eventually fix the Nagios warnings.
+
+Note that when the available space becomes too low (say less than 10%
+of the volume size), plans should be made to order new hardware, so
+once the emergency subsides, a ticket should be created for followup.
+
 ### Out of date backups

 If a job is behaving strangely, you can inspect its job log to see