Verified commit 8d7353f7, authored by anarcat

backups: handle weird edge case

Today, colchicifolium's backups are stalled, but this alert wasn't
detecting it. That's because it *did* try to run them, but they
somehow *failed*, which reverted the metric to the *old* "last
execution" timestamp!

I tried *many* things before ending up with this. First, I tried to
figure out a check that would evaluate how often that metric was set
to zero, but couldn't figure out the right PromQL.

Then I found the status was recorded, so I formulated the dubious:

    quantile_over_time(0.5, bacula_job_last_execution_job_status[7d])

... but that's *slow*. If rendered on a week-long graph, it times out!

So I tried the faster:

    avg_over_time(bacula_job_last_execution_job_status[7d])

... but then it gives me a float that's much harder to interpret.

So I ended up with this check.
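
To illustrate why the extra clause matters, here is a minimal Python model of the new two-part expression. It is only a sketch, not how Prometheus evaluates the rule: `changes` mimics the PromQL function of the same name on a plain list of samples, and `backup_stalled`, the sample lists, and the timestamps are hypothetical names and values made up for the example. It assumes, as the test comment in this commit suggests, that the metric reads 0 while a job is running.

```python
from typing import Sequence

DAY = 24 * 60 * 60  # seconds in a day, as in the rule's (24*60*60)


def changes(samples: Sequence[float]) -> int:
    """Count value changes between consecutive samples (like PromQL changes())."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def backup_stalled(samples: Sequence[float], now: float, window_days: float = 7) -> bool:
    """Model of: changes(x[7d]) < 1 or (time() - max_over_time(x[7d])) / DAY > 7."""
    never_changed = changes(samples) < 1
    newest_end_time_too_old = (now - max(samples)) / DAY > window_days
    return never_changed or newest_end_time_too_old


# Hypothetical evaluation time and end-of-job timestamps (assumptions).
now = 100 * DAY
old_end = now - 10 * DAY  # last successful backup ended 10 days ago

# Backup finished two days ago, then again yesterday.
healthy = [now - 2 * DAY] * 24 + [now - 1 * DAY] * 24
# Timestamp never moves: caught by the original changes() clause.
frozen = [old_end] * 48
# Job starts (metric reads 0), fails, and the old timestamp comes back.
reverted = [old_end] * 24 + [0] * 2 + [old_end] * 24

print(backup_stalled(healthy, now))   # False
print(backup_stalled(frozen, now))    # True
print(changes(reverted) < 1)          # False: the old check alone misses it
print(backup_stalled(reverted, now))  # True: the max_over_time clause catches it
```

In the `reverted` case the value changes twice inside the window, so the original `changes(...) < 1` test stays quiet. Only the age of the newest timestamp in the window shows that no backup has finished recently.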
parent 26ca912e
+9 −1
@@ -27,7 +27,15 @@ groups:
       dashboard: "https://grafana.torproject.org/d/ang5zlv/backups-health?var-server={{ $labels.bacula_job }}"
 
   - alert: BackupStalled
-    expr: changes(bacula_job_last_execution_end_time[7d]) < 1
+    # this alert checks two things:
+    # 1. that the execution time actually changes, which means backup
+    # jobs *are* running
+    # 2. that the stored timestamp is actually within the checked time
+    # range (in this case, 7 days), as it seems like the timestamp can
+    # revert back to an old one on failure
+    expr: |
+      changes(bacula_job_last_execution_end_time[7d]) < 1
+      or (time()-max_over_time(bacula_job_last_execution_end_time[7d]))/(24*60*60)>7
     labels:
       severity: warning
     annotations:
+5 −0
@@ -102,6 +102,11 @@ tests:
       # case: backup completes every day
       - series: 'bacula_job_last_execution_end_time{alias="web-dal-07.torproject.org",backup_host="bacula-director-01.torproject.org",bacula_job="web-dal-07.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}'
         values: '150x24 86550x24 172950x24 259350x24 345750x24 432150x24 518550x24'
+      # weird case: backups running (time = 0) for 3 days then some
+      # random timestamp for an hour because the backup failed, then
+      # we are running it again for 3 days
+      - series: 'bacula_job_last_execution_end_time{alias="test-01.torproject.org",backup_host="bacula-director-01.torproject.org",bacula_job="test-01.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}'
+        values: '0x72 1779735254x1 0x73'
     alert_rule_test:
       - eval_time: 7d
         alertname: BackupStalled