unbound crashes when running out of disk space (fixed in trixie)
we have a ticket about monitoring for this (#41967 (closed)), but the underlying problem is ridiculous and should be fixed, either in unbound or by switching to another resolver.
it's not clear to me if this is a problem in our integration or in unbound itself, but when a server runs out of disk space, unbound can very easily end up in a state where it cannot resolve anything, with some of its trust anchor files in /var/lib/unbound/ left empty (zero bytes). it will look something like this:
root@dal-rescue-01:~# ls -al /var/lib/unbound/
total 12
drwxrwxr-x 2 unbound unbound 4096 Mar 16 06:45 .
drwxr-xr-x 39 root root 4096 Feb 26 21:40 ..
-rw-r--r-- 1 unbound unbound 794 Mar 16 06:38 30.172.in-addr.arpa.key
-rw-r--r-- 1 unbound unbound 0 Mar 16 06:45 root.key
-rw-r--r-- 1 unbound unbound 0 Mar 16 06:40 torproject.org.key
this happens so often that we had a check specifically for those files in Nagios. (Why we had a check for broken files instead of, you know, FIXING THE FILES, is beyond me, but that's another question.)
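for reference, a check like that could be sketched roughly as follows. this is not the actual Nagios check we had, just a minimal illustration: the `*.key` glob and the function names are assumptions, the directory comes from the listing above, and the return codes follow the usual Nagios plugin convention (0 is OK, 2 is CRITICAL).

```python
from pathlib import Path

# directory taken from the listing above; the "*.key" glob is an assumption
KEY_DIR = Path("/var/lib/unbound")


def find_empty_keys(key_dir=KEY_DIR):
    """Return the sorted list of zero-byte *.key files in key_dir."""
    return sorted(
        p for p in Path(key_dir).glob("*.key")
        if p.is_file() and p.stat().st_size == 0
    )


def check_keys(key_dir=KEY_DIR):
    """Print a Nagios-style status line and return the matching exit code."""
    empty = find_empty_keys(key_dir)
    if empty:
        names = ", ".join(p.name for p in empty)
        print(f"CRITICAL: empty trust anchor files in {key_dir}: {names}")
        return 2  # CRITICAL
    print(f"OK: no empty trust anchor files in {key_dir}")
    return 0  # OK
```

to run it as a plugin, the script would end with something like `raise SystemExit(check_keys())`. it only detects the symptom, of course, which is exactly the complaint above.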
root.key is easy: that file is shipped by the dns-root-data package and we should just use that file directly instead of shipping our own.
the torproject.org.key file is more delicate: it's not shipped in dns-root-data (obviously), so we still need to deal with it being wiped. i suspect, however, that having a valid root.key around might make it easier to recover torproject.org.key. perhaps wiping the file out and running puppet would be sufficient? we could have an ExecStartPre= in the systemd unit, maybe, to bring us to such a state?
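to make the idea concrete, here is a rough sketch of what such a pre-start cleanup could do: restore empty key files from a pristine copy when one exists, and remove the others so that whatever manages them can recreate them. everything here is an assumption to be verified: the `FALLBACKS` mapping and the function name are made up, `/usr/share/dns/root.key` is where i'd expect dns-root-data to ship the file, and i have not tested whether unbound starts with a key file missing or whether puppet recreates it.

```python
import shutil
from pathlib import Path

KEY_DIR = Path("/var/lib/unbound")  # from the listing in this ticket
# assumption: pristine copies to restore from; check that the dns-root-data
# path is correct on the target system before relying on it
FALLBACKS = {"root.key": Path("/usr/share/dns/root.key")}


def repair_empty_keys(key_dir=KEY_DIR, fallbacks=FALLBACKS):
    """Restore or remove zero-byte *.key files; return {filename: action}."""
    actions = {}
    for path in sorted(Path(key_dir).glob("*.key")):
        if not path.is_file() or path.stat().st_size > 0:
            continue  # non-empty (or not a regular file): leave it alone
        source = fallbacks.get(path.name)
        if source is not None and source.is_file() and source.stat().st_size > 0:
            # overwrite the empty file with the pristine copy
            shutil.copyfile(source, path)
            actions[path.name] = "restored"
        else:
            # no pristine copy: remove it so it can be recreated (puppet?)
            path.unlink()
            actions[path.name] = "removed"
    return actions
```

a script calling `repair_empty_keys()` could then be hooked into the unit with an `ExecStartPre=` line pointing at it. note that if the disk is still full, the copy will fail with an `OSError`, so this only helps once space has been freed.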
anyways, just throwing ideas at the wall to see what sticks here too. perhaps a good first step would be running a simulation (on idle-fsn?) to reproduce the issue and figure out whether it's unbound, puppet, ud-replicate or something else that creates the empty files, then filing (or finding) an issue upstream to correlate...