... | ... | @@ -703,6 +703,79 @@ itself. This is now part of the [host retirement procedure][]. |
|
|
Hint: see also the [howto/postgresql](howto/postgresql) documentation for the backup
procedures specific to that database.
|
|
|
|
|
|
### Out of disk scenario
|
|
|
|
|
|
The storage server *can* (and *has*) run out of disk space, which
leads to backup jobs failing. A first sign of this is a Nagios warning
about disk usage:
|
|
|
|
|
|
    DISK WARNING - free space: /srv/backups/bacula 5891123 MB (9% inode=99%):
|
|
|
|
|
|
Normally, this is not too much of an issue: in the warning above, there
is still a whopping 5TB of disk space available on the server! But under
certain conditions, that space can disappear quickly. In October 2023,
those 5TB filled up in less than 24 hours ([tpo/tpa/team#41361](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41361/)),
leading to the critical notification:
|
|
|
|
|
|
    Subject: ** PROBLEM Service Alert: bungei/disk usage on /srv/backups/bacula is CRITICAL **
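To get a feel for how quickly the space in the `DISK WARNING` line shown earlier can run out, the figures can be pulled out of the alert and combined with a fill rate. The following is a minimal sketch, not part of the existing tooling: the function names are hypothetical, the regular expression only targets the exact line format quoted in this section, and the constant fill rate is an assumption.

```python
import re

# Matches the alert line quoted in this section; other plugins or versions
# may format their output differently (assumption).
DISK_LINE = re.compile(
    r"DISK (?P<state>\w+) - free space: (?P<mount>\S+) "
    r"(?P<free_mb>\d+) MB \((?P<free_pct>\d+)% inode=(?P<inode_pct>\d+)%\)"
)


def parse_disk_alert(line):
    """Return state, mount point and free-space figures from an alert line."""
    match = DISK_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized alert line: {line!r}")
    fields = match.groupdict()
    for key in ("free_mb", "free_pct", "inode_pct"):
        fields[key] = int(fields[key])
    return fields


def hours_until_full(free_mb, fill_rate_mb_per_hour):
    """Estimate the hours left, assuming a constant fill rate."""
    if fill_rate_mb_per_hour <= 0:
        raise ValueError("fill rate must be positive")
    return free_mb / fill_rate_mb_per_hour


alert = parse_disk_alert(
    "DISK WARNING - free space: /srv/backups/bacula 5891123 MB (9% inode=99%):"
)
# Hypothetical rate: all of the remaining space consumed within 24 hours,
# roughly what happened in October 2023.
rate = alert["free_mb"] / 24
print(alert["mount"], alert["state"], f"{rate:.0f} MB/hour")
```

A single large full backup writing at that kind of rate is enough to exhaust the margin in a day, which is why the warning deserves attention even when terabytes are still free.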
|
|
|
|
|
|
Then jobs started failing:
|
|
|
|
|
|
    Date: Wed, 18 Oct 2023 17:15:47 +0000
    From: bacula-service@torproject.org
    To: bacula-service@torproject.org
    Subject: Bacula: Intervention needed for archive-01.torproject.org.2023-10-18_13.15.43_59

    18-Oct 17:15 bungei.torproject.org-sd JobId 246219: Job archive-01.torproject.org.2023-10-18_13.15.43_59 is waiting. Cannot find any appendable volumes.
    Please use the "label" command to create a new Volume for:
    Storage: "FileStorage-archive-01.torproject.org" (/srv/backups/bacula/archive-01.torproject.org)
    Pool: poolfull-torproject-archive-01.torproject.org
    Media type: File-archive-01.torproject.org
|
|
|
|
|
|
Eventually, an email with the following first line goes out:
|
|
|
|
|
|
    18-Oct 18:15 bungei.torproject.org-sd JobId 246219: Please mount append Volume "torproject-archive-01.torproject.org-full.2023-10-18_18:10" or label a new one for:
|
|
|
|
|
|
At this point, space needs to be made on the backup server. Normally,
there's extra space available in the LVM volume group that can be
allocated to deal with such a situation. See the output of the `vgs`
command and follow the resize procedures in the [LVM docs](howto/lvm) in that
case.
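As a rough illustration of the `vgs` check, the sketch below reads the free space per volume group in a script-friendly form. It assumes `vgs` is run as root with the `--noheadings`, `--nosuffix`, `--units b` and `-o vg_name,vg_free` options; the helper names and the sample volume group name are hypothetical, and the actual resize should still follow the LVM docs.

```python
import subprocess


def parse_vgs_free(output):
    """Map volume group names to free bytes.

    Expects the output of:
      vgs --noheadings --nosuffix --units b -o vg_name,vg_free
    """
    free = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, free_bytes = parts
        free[name] = int(free_bytes)
    return free


def query_vgs_free():
    """Run vgs (needs root and LVM installed) and parse its output."""
    result = subprocess.run(
        ["vgs", "--noheadings", "--nosuffix", "--units", "b",
         "-o", "vg_name,vg_free"],
        check=True, capture_output=True, text=True,
    )
    return parse_vgs_free(result.stdout)


# Made-up sample: a volume group with 1 TiB of unallocated space.
sample = "  vg_example 1099511627776\n"
for name, free_bytes in parse_vgs_free(sample).items():
    print(f"{name}: {free_bytes / 2**40:.2f} TiB free")
```

If a volume group reports zero free bytes, there is nothing left to allocate and the resize procedure does not apply.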
|
|
|
|
|
|
If there *isn't* any space available on the volume group, it *may* be
acceptable to manually remove old, large files from the storage
server, but that is generally not recommended.
|
|
|
|
|
|
Once disk space is available again, there will be pending jobs listed
in `bconsole`'s `status director`:
|
|
|
|
|
|
     JobId  Type Level     Files     Bytes  Name              Status
    ======================================================================
    246219  Back Full    723,866    5.763 T archive-01.torproject.org is running
    246222  Back Incr          0         0  dangerzone-01.torproject.org is waiting for a mount request
    246223  Back Incr          0         0  ns5.torproject.org is waiting for a mount request
    246224  Back Incr          0         0  tb-build-05.torproject.org is waiting for a mount request
    246225  Back Incr          0         0  crm-ext-01.torproject.org is waiting for a mount request
    246226  Back Incr          0         0  media-01.torproject.org is waiting for a mount request
    246227  Back Incr          0         0  weather-01.torproject.org is waiting for a mount request
    246228  Back Incr          0         0  neriniflorum.torproject.org is waiting for a mount request
    246229  Back Incr          0         0  tb-build-02.torproject.org is waiting for a mount request
    246230  Back Incr          0         0  survey-01.torproject.org is waiting for a mount request
|
|
|
|
|
|
In the above, the `archive-01` job was the one which took up all the free
space. That job was restarted and is shown running above, but all the
other ones were `waiting for a mount request`. The solution is
to issue that mount with the job ID of a waiting job. For example, for the
`dangerzone-01` job above:
|
|
|
|
|
|
    bconsole> mount jobid=246222
|
|
|
|
|
|
This should resume all jobs and eventually fix the Nagios warnings.
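If some jobs remain in the `waiting for a mount request` state afterwards, the same command can be repeated for each of them. The sketch below is one hypothetical way to build that list from a saved copy of the `status director` output; the function name is made up, and it relies only on the `mount jobid=` form and the status wording shown in this section.

```python
WAITING = "is waiting for a mount request"


def mount_commands(status_output):
    """Return one 'mount jobid=N' command per job waiting for a mount."""
    commands = []
    for line in status_output.splitlines():
        line = line.strip()
        if not line.endswith(WAITING):
            continue
        fields = line.split()
        if not fields[0].isdigit():
            continue  # not a job row
        commands.append(f"mount jobid={fields[0]}")
    return commands


# Excerpt of the listing shown in this section.
sample = """\
246219  Back Full    723,866    5.763 T archive-01.torproject.org is running
246222  Back Incr          0         0  dangerzone-01.torproject.org is waiting for a mount request
246223  Back Incr          0         0  ns5.torproject.org is waiting for a mount request
"""

for command in mount_commands(sample):
    print(command)
```

The printed commands are meant to be reviewed and then typed or pasted into `bconsole` by hand; the running `archive-01` job is deliberately left out.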
|
|
|
|
|
|
Note that when the available space becomes too low (say less than 10%
of the volume size), plans should be made to order new hardware, so once
the emergency subsides, a ticket should be created for followup.
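The 10% rule of thumb can be checked with a few lines of Python. This is only a sketch under assumptions: it measures the mounted filesystem with the standard library's `shutil.disk_usage`, which may not be what "volume size" refers to if unallocated LVM space should also be counted, and the function names are hypothetical.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_followup(path, threshold=0.10):
    """True when free space drops below the threshold (10% by default)."""
    return free_fraction(path) < threshold


# "." keeps the example runnable anywhere; on the storage server the path
# would be the backup mount point, for example /srv/backups/bacula.
print(f"{free_fraction('.'):.1%} free, follow-up needed: {needs_followup('.')}")
```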
|
|
|
|
|
|
### Out of date backups
|
|
|
|
|
|
If a job is behaving strangely, you can inspect its job log to see
|
... | ... | |