colchicifolium backup job hung

we're getting one of those emails every 5 minutes now:

07-Dec 15:33 bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device.

It is, indeed, waiting for a device:

*list joblog jobid=146833
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 146833: Start Backup JobId 146833, Job=colchicifolium.torproject.org.2020-12-07_15.18.44_05 |
| bacula-director-01.torproject.org-dir JobId 146833: Created new Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_15:18", Pool="pooldiff-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bungei.torproject.org-sd JobId 146833: JobId=146833, Job colchicifolium.torproject.org.2020-12-07_15.18.44_05 waiting to reserve a device. |
| [... the same "waiting to reserve a device" message repeated 16 more times ...] |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| 146,833 | colchicifolium.torproject.org | 2020-12-07 15:18:46 | B    | D     |        0 |        0 | R         |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+

And indeed, according to the storage server, another backup job is running:

*status
Status available for:
     1: Director
     2: Storage
     3: Client
     4: Scheduled
     5: Network
     6: All
Select daemon type for status (1-6): 2
Automatically selected Storage: File-alberti.torproject.org
Connecting to Storage daemon File-alberti.torproject.org at bungei.torproject.org:9103

bungei.torproject.org-sd Version: 9.4.2 (04 February 2019) x86_64-pc-linux-gnu debian 10.5
Daemon started 21-Nov-20 17:58. Jobs: run=1280, running=2.
 Heap: heap=331,776 smbytes=3,095,113 max_bytes=943,958,428 bufs=997 max_bufs=5,349,436
 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 mode=0,0 newbsr=0
 Res: ndevices=79 nautochgr=0

Running Jobs:
Writing: Differential Backup job colchicifolium.torproject.org JobId=146826 Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52"
    pool="pooldiff-torproject-colchicifolium.torproject.org" device="FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org)
    spooling=0 despooling=0 despool_wait=0
    Files=713,896 Bytes=69,913,318,123 AveBytes/sec=1,662,663 LastBytes/sec=1,536,422
    FDReadSeqNo=5,289,218 in_msg=3871535 out_msg=6 fd=10
Writing: Differential Backup job colchicifolium.torproject.org JobId=146833 Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52"
    pool="pooldiff-torproject-colchicifolium.torproject.org" device="FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org)
    spooling=0 despooling=0 despool_wait=0
    Files=0 Bytes=0 AveBytes/sec=0 LastBytes/sec=0
    FDSocket closed
====

Jobs waiting to reserve a drive:
   3611 JobId=146833 Volume max jobs=1 exceeded on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org).
====

Terminated Jobs:
 JobId  Level    Files      Bytes   Status   Finished        Name 
===================================================================
146827  Diff      8,794    2.895 G  OK       07-Dec-20 05:00 henryi.torproject.org
146828  Diff     14,561    723.3 M  OK       07-Dec-20 07:27 alberti.torproject.org
146829  Diff     36,983    557.8 M  OK       07-Dec-20 08:10 chi-node-03.torproject.org
146830  Diff      8,008    486.7 M  OK       07-Dec-20 13:23 web-cymru-01.torproject.org
146804  Diff   6,377,078    101.7 G  OK       07-Dec-20 15:14 corsicum.torproject.org
146836  Incr     11,593    597.2 M  OK       07-Dec-20 15:27 build-x86-06.torproject.org
146832  Diff    219,907    4.949 G  OK       07-Dec-20 15:29 build-x86-12.torproject.org
146835  Incr     11,535    449.5 M  OK       07-Dec-20 15:30 cupani.torproject.org
146834  Incr     12,425    432.3 M  OK       07-Dec-20 15:30 majus.torproject.org
146837  Incr     11,260    498.0 M  OK       07-Dec-20 15:34 moly.torproject.org
====

Device status:

[...]
==

Device File is "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org) mounted with:
    Volume:      torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52
    Pool:        pooldiff-torproject-colchicifolium.torproject.org
    Media type:  File-colchicifolium.torproject.org
    Total Bytes=69,981,777,264 Blocks=1,084,787 Bytes/block=64,511
    Positioned at File=0 Block=0
   Available Space=5.588 TB
==

[...]
==
====

Used Volume status:
Reserved volume: torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52 on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org)
    Reader=0 writers=1 reserves=0 volinuse=1 worm=0
====

Attr spooling: 2 active jobs, 1,490,775,378 bytes; 1279 total jobs, 1,655,086,784 max bytes.
====
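The "Volume max jobs=1 exceeded" message above matches how the pools are set up: one job per volume. A hedged sketch of what the relevant Pool resource presumably looks like (the directive names are standard Bacula configuration; the exact values here are assumptions, not copied from our config):

```
Pool {
  Name = pooldiff-torproject-colchicifolium.torproject.org
  Pool Type = Backup
  # One job per volume: once a job writes to a volume, the director
  # marks it Used and no other job may write to it.
  Maximum Volume Jobs = 1
}
```

So as long as the dangling job 146826 still counts as a writer on the mounted volume, job 146833 can never reserve the device, hence the loop.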

Yet it's unclear why the storage server is still hanging on to that volume. On the director, job 146826 has finished and is marked as failed (jobstatus `f`, a fatal error), although why it failed isn't clear either:

*llist  jobid=146826
           jobid: 146,826
             job: colchicifolium.torproject.org.2020-12-07_04.45.53_42
            name: colchicifolium.torproject.org
     purgedfiles: 0
            type: B
           level: D
        clientid: 55
      clientname: colchicifolium.torproject.org-fd
       jobstatus: f
       schedtime: 2020-12-07 04:45:53
       starttime: 2020-12-07 04:52:15
         endtime: 2020-12-07 04:52:15
     realendtime: 
        jobtdate: 1,607,316,735
    volsessionid: 0
  volsessiontime: 0
        jobfiles: 0
        jobbytes: 0
       readbytes: 0
       joberrors: 0
 jobmissingfiles: 0
          poolid: 221
        poolname: pooldiff-torproject-colchicifolium.torproject.org
      priorjobid: 0
       filesetid: 1
         fileset: Standard Set
         hasbase: 0
        hascache: 0
         comment: 

The job log is not helpful either:

*list joblog jobid=146826
+----------------------------------------------------------------------------------------------------+
| logtext                                                                                              |
+----------------------------------------------------------------------------------------------------+
| bacula-director-01.torproject.org-dir JobId 146826: Start Backup JobId 146826, Job=colchicifolium.torproject.org.2020-12-07_04.45.53_42 |
| bacula-director-01.torproject.org-dir JobId 146826: There are no more Jobs associated with Volume "torproject-colchicifolium.torproject.org-diff.2020-10-13_09:54". Marking it purged. |
| bacula-director-01.torproject.org-dir JobId 146826: New Pool is: poolgraveyard-torproject-colchicifolium.torproject.org |
| bacula-director-01.torproject.org-dir JobId 146826: All records pruned from Volume "torproject-colchicifolium.torproject.org-diff.2020-10-13_09:54"; marking it "Purged" |
| bacula-director-01.torproject.org-dir JobId 146826: Created new Volume="torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52", Pool="pooldiff-torproject-colchicifolium.torproject.org", MediaType="File-colchicifolium.torproject.org" in catalog. |
| bacula-director-01.torproject.org-dir JobId 146826: Using Device "FileStorage-colchicifolium.torproject.org" to write. |
| bacula-director-01.torproject.org-dir JobId 146826: Sending Accurate information to the FD.          |
| bungei.torproject.org-sd JobId 146826: Labeled new Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org). |
| bungei.torproject.org-sd JobId 146826: Wrote label to prelabeled Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" on File device "FileStorage-colchicifolium.torproject.org" (/srv/backups/bacula/colchicifolium.torproject.org) |
| bacula-director-01.torproject.org-dir JobId 146826: Max Volume jobs=1 exceeded. Marking Volume "torproject-colchicifolium.torproject.org-diff.2020-12-07_04:52" as Used. |
| colchicifolium.torproject.org-fd JobId 146826:      /home is a different filesystem. Will not descend from / into it. |
| colchicifolium.torproject.org-fd JobId 146826:      /run is a different filesystem. Will not descend from / into it. |
+----------------------------------------------------------------------------------------------------+
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| jobid   | name                          | starttime           | type | level | jobfiles | jobbytes | jobstatus |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+
| 146,826 | colchicifolium.torproject.org | 2020-12-07 04:52:15 | B    | D     |        0 |        0 | f         |
+---------+-------------------------------+---------------------+------+-------+----------+----------+-----------+

There was a reboot on the director today which might have interrupted that backup:

reboot   system boot  4.19.0-13-amd64  Mon Dec  7 15:14   still running

the storage server (bungei) hasn't been rebooted since nov 2nd.

how do we recover from this state? @weasel, any idea?

i have canceled job 146833 at least, so the warnings stop for now. But this is bound to come back at the next scheduled run...
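If the dangling writer job is what's blocking the device, cancelling it might release the reservation; a hedged sketch (`cancel` is a standard bconsole command, but whether the storage daemon actually drops the stale writer without a restart is untested):

```
*cancel jobid=146826
```

If the SD still shows a writer on the volume after that (`status storage` → "Used Volume status"), restarting the bacula-sd daemon on bungei is probably the heavier-handed fallback.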

This also raises the question of how we can reboot the bacula servers more safely.
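A hedged sketch of what a safer reboot procedure on the director could look like (the commands are real Debian/bconsole invocations, but the exact sequence is my assumption, not an established procedure):

```
# check for running jobs first (assumes bconsole is configured for root)
echo 'status dir' | bconsole

# wait for (or cancel) any running jobs, then stop the director cleanly
# so jobs aren't killed mid-write by the reboot:
systemctl stop bacula-director.service
systemctl reboot
```

Something similar (stopping bacula-sd first) would presumably apply to the storage server side.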