check ICMP reachability of all hosts

We currently use the node exporter as a metric of host availability:

- name: tpa_node
  rules:
  - alert: HostDown
    expr: up{job="node"} < 1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: 'Host {{ $labels.alias }} is not responding'
      description: 'The host {{ $labels.alias }} has stopped responding for more than 15 minutes.'
      playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#server-down'

The problem is that this rule doesn't actually check whether the host is up. It checks whether the node exporter is up, which could fail for a slew of other reasons:

- failure to start the node exporter
- MTU or transient network problems leading to TCP failures
- Puppet exporting rules failures

I think a better approach might be to just monitor all hosts for ICMP reachability, which would solve some (but not all) of the above problems and give us some nice benefits:

- it's more lightweight than checking TCP
- it's more reliable (modulo firewalls)
- it can easily check IPv4 and IPv6 (see also triple stack (IPv4, IPv6, .onion) monitoring (#41714))
- it collects metrics on latency, DNS resolution and so on
- more importantly, it could distinguish between DNS and reachability failures, which is what brought me here in the first place (from the internal DNSSEC failures (#42308 - closed) post-mortem)
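
As a rough sketch of what this could look like with the blackbox exporter, assuming one module per address family so that IPv4 and IPv6 fail independently. The module names `icmp_ipv4` and `icmp_ipv6` are made up for illustration and the timeout is a guess, not a tested value:

```yaml
# blackbox exporter configuration (sketch, not deployed anywhere)
modules:
  icmp_ipv4:              # hypothetical module name
    prober: icmp
    timeout: 5s           # assumed value
    icmp:
      preferred_ip_protocol: ip4
      ip_protocol_fallback: false   # do not silently fall back to IPv6
  icmp_ipv6:              # hypothetical module name
    prober: icmp
    timeout: 5s           # assumed value
    icmp:
      preferred_ip_protocol: ip6
      ip_protocol_fallback: false   # do not silently fall back to IPv4
```

A matching alert could then mirror the existing `HostDown` rule but key on `probe_success` instead of `up`. The group name `tpa_icmp`, the alert name `HostUnreachable` and the job label `blackbox_icmp` are placeholders, and this assumes the probed host ends up in the `instance` label, which depends on how the scrape job is relabeled:

```yaml
- name: tpa_icmp          # hypothetical group name
  rules:
  - alert: HostUnreachable    # hypothetical alert name
    # probe_success is 0 when the ICMP probe (including its DNS lookup) fails
    expr: probe_success{job="blackbox_icmp"} < 1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: 'Host {{ $labels.instance }} is not answering ICMP probes'
      description: 'The host {{ $labels.instance }} has not answered ICMP probes for more than 15 minutes.'
      playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#server-down'
```

The same probe also exports `probe_dns_lookup_time_seconds` and `probe_duration_seconds`, which is where the latency and DNS resolution metrics mentioned above would come from.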

This is also a follow-up to port icinga DNS and DNSSEC checks to prometheus (#41794 - closed) in the sense that the blackbox exporter could serve as a poor man's DNSSEC monitor. It also relates to #41967 (closed) in the sense that it also monitors DNS resolution (if not quite as explicitly).

Edited Sep 29, 2025 by anarcat