check ICMP reachability of all hosts

We currently use the node exporter as a metric of host availability:

- name: tpa_node
  rules:
  - alert: HostDown
    expr: up{job="node"} < 1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: 'Host {{ $labels.alias }} is not responding'
      description: 'The host {{ $labels.alias }} has stopped responding for more than 15 minutes.'
      playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#server-down'

The problem is that this doesn't actually check whether the host is up: it checks whether the node exporter is up, and that can fail for a slew of other reasons:

  • failure to start the node exporter
  • MTU or transient network problems leading to TCP failures
  • Puppet exporting rules failures

I think a better approach might be to just monitor all hosts for ICMP reachability, which would solve some (but not all) of the above problems and give us some nice benefits:

This is also a followup to "port icinga DNS and DNSSEC checks to prometheus" (#41794, closed), in the sense that the blackbox exporter could serve as a poor man's DNSSEC monitor. It relates to #41967 (closed) as well, since it also monitors DNS resolution (if not quite as explicitly).
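
As a rough sketch of what this could look like with the blackbox exporter (not a tested config): the module name `icmp`, the job name `blackbox_icmp`, the rule group `tpa_icmp`, the alert name `HostUnreachable`, the target `host1.example.org` and the exporter address `localhost:9115` are all placeholders I made up for illustration. The three YAML documents below would live in three different files, as noted in the comments.

```yaml
# --- blackbox exporter configuration (e.g. blackbox.yml) ---
modules:
  icmp:                  # hypothetical module name
    prober: icmp         # send ICMP echo requests instead of opening a TCP connection
    timeout: 5s
---
# --- Prometheus scrape configuration (fragment of prometheus.yml) ---
scrape_configs:
  - job_name: blackbox_icmp          # hypothetical job name
    metrics_path: /probe
    params:
      module: [icmp]                 # must match the module name above
    static_configs:
      - targets:
          - host1.example.org        # placeholder; the real list would come from the host inventory
    relabel_configs:
      # pass the host name to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed host name as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # actually scrape the blackbox exporter, not the host itself
      - target_label: __address__
        replacement: localhost:9115  # assumed exporter address
---
# --- alerting rule, modeled on the HostDown rule above ---
groups:
  - name: tpa_icmp                   # hypothetical group name
    rules:
      - alert: HostUnreachable       # hypothetical alert name
        expr: probe_success{job="blackbox_icmp"} < 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: 'Host {{ $labels.instance }} does not respond to ICMP'
          description: 'The host {{ $labels.instance }} has not answered ICMP probes for more than 15 minutes.'
```

Since the exporter has to resolve the target name before probing it, a failed DNS lookup also shows up as `probe_success` being 0, which is where the implicit DNS monitoring comes from. Whether that catches DNSSEC failures depends on the resolver the exporter uses actually validating, which I'm assuming here.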

Edited by anarcat