check catalog runs in Prometheus

Closed Issue created 6 months ago by anarcat

Add Prometheus metrics to make sure we get warned when Puppet nodes have a failing catalog, or have been paused for too long.

This can probably be done with the puppet-prometheus_reporter module, but if that's too complicated, consider just writing a metric by hand to the textfile collector when Puppet runs (in fact, we might want to do both).
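For the "by hand" option, here is a minimal sketch of what a post-run hook could write for the node exporter textfile collector. The metric names (puppet_last_run_timestamp_seconds, puppet_last_run_failed) and the helper functions are made up for illustration and are not the ones exported by puppet-prometheus_reporter. The file is written under a temporary name and renamed into place, so the collector, which only reads *.prom files, never sees a half-written file.

```python
import os
import tempfile
import time


def render_metrics(failed: bool, now: float) -> str:
    """Render two gauges in the Prometheus text exposition format.

    Both metric names are hypothetical, chosen only for this sketch.
    """
    lines = [
        "# HELP puppet_last_run_timestamp_seconds Unix time of the last Puppet run.",
        "# TYPE puppet_last_run_timestamp_seconds gauge",
        f"puppet_last_run_timestamp_seconds {now:.0f}",
        "# HELP puppet_last_run_failed 1 if the last Puppet run failed, 0 otherwise.",
        "# TYPE puppet_last_run_failed gauge",
        f"puppet_last_run_failed {1 if failed else 0}",
    ]
    return "\n".join(lines) + "\n"


def write_textfile(directory: str, name: str, body: str) -> str:
    """Write body to directory/name atomically and return the final path."""
    # The temporary file ends in .tmp, so the collector ignores it.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=name + ".", suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    # mkstemp creates the file with mode 0600; make it readable by the exporter.
    os.chmod(tmp, 0o644)
    final = os.path.join(directory, name)
    # Rename within the same directory so readers see the old or the new file.
    os.replace(tmp, final)
    return final
```

A hook would call something like write_textfile(directory, "puppet_run.prom", render_metrics(failed, time.time())), where directory is whatever the node exporter's --collector.textfile.directory flag points to on the host (an assumption to check per installation).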

prometheus-alerts!56 (merged) is a draft alert that matches the metrics of the above reporter.

Note that Icinga was running check_puppetdb_nodes, which checks for failed catalog runs; that is probably equivalent to puppet_status{state="failed"} > 0 and time() - puppet_report > TIMEOUT.
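To make the intended semantics concrete, here is a rough sketch of the two conditions as plain Python over sample values. It reads them as two separate conditions that are each worth reporting, which is an interpretation and not something the Icinga check or the draft alert is confirmed to do. It also assumes puppet_report is a Unix timestamp in seconds, as the time() subtraction implies. NodeSample and problems are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One node's metric values (hypothetical container for this sketch)."""
    name: str
    failed_resources: float  # value of puppet_status{state="failed"}
    last_report: float       # value of puppet_report, assumed Unix seconds


def problems(node: NodeSample, now: float, timeout: float) -> list[str]:
    """Return which of the two conditions hold for this node."""
    found = []
    # Mirrors: puppet_status{state="failed"} > 0
    if node.failed_resources > 0:
        found.append("failed")
    # Mirrors: time() - puppet_report > TIMEOUT
    if now - node.last_report > timeout:
        found.append("stale")
    return found
```

A paused node stops reporting but shows no failed resources, so it only trips the second condition. A node with a failing catalog can trip the first while still reporting on schedule.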

Watch out for cardinality explosion on puppet_report_time; we will likely need a recording rule to drop those series or sum them up without individual resource labels.

Spun out of #41639 (closed) because it was found to be more complicated than just adding an alert, and higher priority than other checks in #41791 (closed).
