check ICMP reachability of all hosts
We currently use the node exporter as a metric of host availability:
    - name: tpa_node
      rules:
        - alert: HostDown
          expr: up{job="node"} < 1
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: 'Host {{ $labels.alias }} is not responding'
            description: 'The host {{ $labels.alias }} has stopped responding for more than 15 minutes.'
            playbook: 'https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/incident-response#server-down'
The problem is that this doesn't actually check whether the host is up. It checks whether the node exporter is up, and that can fail for a slew of other reasons:
- failure to start the node exporter
- MTU or transient network problems leading to TCP failures
- Puppet exporting rules failures
I think a better approach might be to just monitor all hosts for ICMP reachability, which would solve some (but not all) of the above problems and give us some nice benefits:
- it's more lightweight than checking TCP
- it's more reliable (modulo firewalls)
- it can easily check IPv4 and IPv6 (see also triple stack (IPv4, IPv6, .onion) monitoring (#41714))
- it collects metrics on latency, DNS resolution and so on
- more importantly, it could distinguish between DNS and reachability failures, which is what brought me here in the first place (from the internal DNSSEC failures (#42308 - closed) post-mortem)
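To make the proposal a bit more concrete, here is a rough sketch of what ICMP probing with the blackbox exporter could look like. This is not our actual configuration: the module names (`icmp_ipv4`, `icmp_ipv6`), the job name, the target `host.example.org` and the exporter address `localhost:9115` are all placeholders I made up for illustration. The first fragment would go in the blackbox exporter configuration, with one module per address family so that IPv4 and IPv6 failures show up separately:

```yaml
# blackbox exporter configuration (sketch, module names are hypothetical)
modules:
  icmp_ipv4:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4
      # do not silently fall back to IPv6 if IPv4 resolution fails
      ip_protocol_fallback: false
  icmp_ipv6:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip6
      # do not silently fall back to IPv4 if IPv6 resolution fails
      ip_protocol_fallback: false
```

The second fragment would go on the Prometheus side. It uses the usual blackbox relabeling pattern to pass each target as a probe parameter, and an alert modeled on the existing `HostDown` rule but based on `probe_success` instead of `up`. The group and alert names are also assumptions:

```yaml
# prometheus.yml scrape job (sketch, job name and targets are hypothetical)
scrape_configs:
  - job_name: blackbox_icmp_ipv4
    metrics_path: /probe
    params:
      module: [icmp_ipv4]
    static_configs:
      - targets:
          - host.example.org  # placeholder, would be generated per host
    relabel_configs:
      # pass the host name to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # keep the probed host as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # actually scrape the blackbox exporter, not the host itself
      - target_label: __address__
        replacement: localhost:9115  # assumed exporter address
---
# alerting rules file (sketch, group and alert names are hypothetical)
groups:
  - name: tpa_blackbox
    rules:
      - alert: HostUnreachable
        expr: probe_success{job="blackbox_icmp_ipv4"} < 1
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: 'Host {{ $labels.instance }} is not responding to ICMP'
```

An IPv6 job would be a copy of the scrape job with `module: [icmp_ipv6]`. The exporter also exposes `probe_dns_lookup_time_seconds` and `probe_duration_seconds`, which is where the latency and DNS resolution metrics mentioned above would come from. Note that the ICMP prober needs permission to open ICMP sockets (for example `CAP_NET_RAW`), so the exporter's service configuration may need adjusting.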
This is also a follow-up to port icinga DNS and DNSSEC checks to prometheus (#41794 - closed), in the sense that the blackbox exporter could serve as a poor man's DNSSEC monitor. It also relates to #41967 (closed), since it monitors DNS resolution as well (if not quite as explicitly).
Edited by anarcat