+9
−1
+5
−0
Loading
Today, colchicifolium's backups are stalled, but this alert wasn't
detecting it. That's because it *did* try to run them, but
somewhat *failed*, which reverted to the *old* "last execution"
timestamp!
I tried *many* things before ending up with this. First, I tried to
figure out a check that would evaluate how often that metric was set
to zero, but couldn't figure out the right promql.
Then I found the status was record, so I formulated the dubious:
quantile_over_time(0.5, bacula_job_last_execution_job_status[7d])
... but that's *slow. If rendered on a week-long graph, it times out!
So I tried the faster:
avg_over_time(bacula_job_last_execution_job_status[7d])
... but then it gives me a float that's much harder to interpret.
So I ended up with this check.