
check systemd units for failures in Prometheus

Add Prometheus metrics to get a warning when a systemd unit fails.

This can be done with the node exporter's node_systemd_unit_state metric, but it requires enabling the systemd collector through the node exporter's command-line flags.
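A minimal sketch of what such an alert might look like, assuming the collector is enabled with the node exporter's `--collector.systemd` flag and that the metric carries `name` and `state` labels, as it does in current node exporter releases. The group name, alert name, `for` duration and severity label are placeholders chosen for illustration, not an agreed configuration.

```yaml
# Hypothetical alerting rule file; names and thresholds are illustrative only.
groups:
  - name: systemd
    rules:
      - alert: SystemdUnitFailed
        # The metric is 1 for the state a unit is currently in, 0 otherwise,
        # so this matches every unit currently in the "failed" state.
        expr: node_systemd_unit_state{state="failed"} == 1
        # Wait a few minutes so units that restart quickly do not fire.
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```

This alerts per unit, which is more detailed than a single host-wide check; whether that granularity is wanted is a design choice left open here.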

Watch out for a cardinality explosion from the detailed per-unit stats, which will probably call for recording rules to drop or aggregate them. This caused an outage (out of disk, #41070 (closed)) in the past.
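As a rough illustration of the aggregation idea, a recording rule could collapse the per-unit series into one failed-unit count per host. This is only a sketch: the group and rule names are hypothetical, and it assumes the `state` label exposed by current node exporter releases. Note that a recording rule on its own only adds an aggregated series; the raw per-unit series are still stored unless they are also dropped at scrape time.

```yaml
# Hypothetical recording rule file; the rule name follows the usual
# level:metric:operation convention but is not an established name.
groups:
  - name: systemd_aggregates
    rules:
      - record: instance:node_systemd_unit_state_failed:sum
        # Sum the 0/1 "failed" indicator over all units of a host,
        # which yields 0 (rather than no series) when nothing has failed.
        expr: sum by (instance) (node_systemd_unit_state{state="failed"})
```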

This is the equivalent of NRPE's systemctl is-system-running check.

Spun out of #41639 (closed) because it was found to be more complicated than just adding an alert, and higher priority than other checks in #41791 (closed).
