check ICMP reachability of all hosts

We currently use the node exporter as a metric of host availability:

- name: tpa_node
  rules:
  - alert: HostDown
    expr: up{job="node"} < 1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: 'Host {{ $labels.alias }} is not responding'
      description: 'The host {{ $labels.alias }} has stopped responding for more than 15 minutes.'
      playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#server-down'

The problem is that this doesn't actually check whether the host is up: it checks whether the node exporter is up, which can fail for a slew of other reasons:

  • failure to start the node exporter
  • MTU or transient network problems leading to TCP failures
  • Puppet exporting rules failures

I think a better approach might be to just monitor all hosts for ICMP reachability, which would solve some (but not all) of the above problems and give us some nice benefits:

  • it's more lightweight than checking TCP
  • it's more reliable (modulo firewalls)
  • it can easily check IPv4 and IPv6 (see also triple stack (IPv4, IPv6, .onion) monitoring (#41714))
  • it collects metrics on latency, DNS resolution and so on
  • more importantly, it could distinguish between DNS and reachability failures, which is what brought me here in the first place (from the internal DNSSEC failures (#42308 - closed) post-mortem)
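
As a rough sketch of what this could look like on the blackbox exporter side, here is one ICMP module per address family. The module names (`icmp_ipv4`, `icmp_ipv6`) and the timeout are assumptions for illustration, not something we have deployed. Disabling the protocol fallback is what would keep a broken IPv6 path from being masked by a working IPv4 one:

```yaml
# blackbox exporter configuration (sketch, module names are hypothetical)
modules:
  icmp_ipv4:
    prober: icmp
    timeout: 5s            # assumed value, tune to taste
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false   # fail instead of silently trying IPv6
  icmp_ipv6:
    prober: icmp
    timeout: 5s            # assumed value, tune to taste
    icmp:
      preferred_ip_protocol: ip6
      ip_protocol_fallback: false   # fail instead of silently trying IPv4
```

Each host would then be probed once per module, so IPv4 and IPv6 reachability show up as separate `probe_success` series.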

This is also a followup to port icinga DNS and DNSSEC checks to prometheus (#41794 - closed) in the sense that the blackbox exporter could serve as a poor man's DNSSEC monitor. It also relates to #41967 (closed) in the sense that it also monitors DNS resolution (if not quite as explicitly).
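
To make the DNS versus reachability distinction concrete, the alerting side might look something like the following. This is only a sketch: the group name, the alert names, the `blackbox_icmp` job label and the 15 minute delay are all hypothetical, and the `probe_ip_protocol == 0` test assumes the blackbox exporter leaves that gauge at zero when name resolution fails, which should be checked against the version we actually run:

```yaml
- name: tpa_blackbox_icmp        # hypothetical group name
  rules:
  # the name resolved, but the host did not answer pings
  - alert: HostUnreachable
    expr: probe_success{job="blackbox_icmp"} == 0 and probe_ip_protocol{job="blackbox_icmp"} != 0
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: 'Host {{ $labels.instance }} does not respond to ICMP'
  # the probe failed before sending anything: likely a DNS (or DNSSEC) problem
  - alert: HostResolutionFailure
    expr: probe_success{job="blackbox_icmp"} == 0 and probe_ip_protocol{job="blackbox_icmp"} == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: 'Host {{ $labels.instance }} could not be resolved'
```

If many hosts fire `HostResolutionFailure` at once, that points at the resolver rather than at the hosts themselves, which is the poor man's DNSSEC monitor mentioned above.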

Edited Sep 29, 2025 by anarcat