Verified commit f28317e5, authored by anarcat

add dashboard links to disk alerts

This is an attempt to make incident response easier. It adds a
`dashboard` annotation that links directly to the relevant dashboard
for the alert, in this case the disk usage metrics for the particular
mountpoint on the host affected by the alert.
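
As a rough illustration of how the annotation template is expected to expand, here is a minimal Python sketch. It is not part of the commit: `dashboard_url` is a hypothetical helper, and `quote_plus` is only assumed to approximate the `urlquery` template function for the simple paths used in the tests. As in the template, only the mountpoint is escaped; the alias is inserted as-is.

```python
from urllib.parse import quote_plus

# URL layout copied from the `dashboard` annotation in this commit.
BASE = "https://grafana.torproject.org/d/zbCoGRjnz/disk-usage"


def dashboard_url(alias: str, mountpoint: str) -> str:
    """Hypothetical helper mirroring the annotation template."""
    # quote_plus stands in for `urlquery` (an assumption, not the real
    # template engine); it escapes "/" as "%2F" because its default
    # `safe` set is empty.
    return (
        f"{BASE}?from=now-7d&to=now"
        f"&var-instance={alias}"
        f"&var-Filters=mountpoint|%3D|{quote_plus(mountpoint)}"
    )


# The two host/mountpoint pairs used in the unit tests of this commit.
print(dashboard_url("archive-01.torproject.org", "/"))
print(dashboard_url("staticiforme.torproject.org", "/tmp"))
```

For those two inputs, the printed URLs should match the `dashboard` values expected by the rule tests in the diff.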

This should appear as a direct link in Karma and in email
alerts, although I have not confirmed either.

This might duplicate what's present in the playbook. I think that's
fine, as it provides a more direct reference than what the playbook
can provide.

I'm hoping we can start using this more uniformly, but I'm not sure it
should be as mandatory as playbooks.

See also #16
parent 7f7c844a
+2 −0
@@ -40,6 +40,7 @@ groups:
      summary: 'Disk {{ $labels.mountpoint }} on {{ $labels.alias }} is almost full'
      description: 'Disk {{ $labels.mountpoint }} on {{ $labels.alias }} will be full in less than 24 hours'
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#disk-is-full-or-nearly-full"
      dashboard: "https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-7d&to=now&var-instance={{ $labels.alias }}&var-Filters=mountpoint|%3D|{{ $labels.mountpoint | urlquery }}"

  - alert: DiskFull
    expr: node_filesystem_avail_bytes == 0
@@ -50,6 +51,7 @@ groups:
      summary: 'Disk {{ $labels.mountpoint }} on {{ $labels.alias }} is full'
      description: 'Disk {{ $labels.mountpoint }} on {{ $labels.alias }} has no space available.'
      playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#disk-is-full-or-nearly-full"
      dashboard: "https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-7d&to=now&var-instance={{ $labels.alias }}&var-Filters=mountpoint|%3D|{{ $labels.mountpoint | urlquery }}"

  - alert: InodeCountLow
    # vfat shows up in the metrics, but exposes 0 for both metrics which makes
+2 −0
@@ -68,6 +68,7 @@ tests:
              summary: 'Disk / on archive-01.torproject.org is almost full'
              description: 'Disk / on archive-01.torproject.org will be full in less than 24 hours'
              playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#disk-is-full-or-nearly-full"
              dashboard: "https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-7d&to=now&var-instance=archive-01.torproject.org&var-Filters=mountpoint|%3D|%2F"

  - interval: 1m
    input_series:
@@ -90,6 +91,7 @@ tests:
              summary: 'Disk /tmp on staticiforme.torproject.org is full'
              description: 'Disk /tmp on staticiforme.torproject.org has no space available.'
              playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#disk-is-full-or-nearly-full"
              dashboard: "https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?from=now-7d&to=now&var-instance=staticiforme.torproject.org&var-Filters=mountpoint|%3D|%2Ftmp"

  - interval: 1m
    input_series: