check catalog runs in Prometheus
Add Prometheus metrics to make sure we get warned when Puppet nodes have a failing catalog, or have been paused for too long.
This is probably done with the puppet-prometheus_reporter module, but if that's too complicated, consider just writing a metric by hand to the textfile collector when Puppet runs (in fact, we might want to do both).
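As a rough sketch of the "by hand" option, a small script run after each Puppet run could write a file for the node exporter's textfile collector. The metric names below reuse puppet_report and puppet_status as they are mentioned in this ticket; whether hand-written metrics should share those names with the reporter is an assumption, and the state list, file name and helper functions are hypothetical.

```python
import os
import tempfile
import time

# Hypothetical set of run outcomes; the real reporter may expose others.
STATES = ("changed", "failed", "unchanged")


def render_metrics(state, now=None):
    """Render the last-run metrics in the Prometheus text exposition format."""
    if state not in STATES:
        raise ValueError(f"unknown state: {state!r}")
    now = time.time() if now is None else now
    lines = [
        "# HELP puppet_report Unix timestamp of the last Puppet run",
        "# TYPE puppet_report gauge",
        f"puppet_report {int(now)}",
        "# HELP puppet_status Outcome of the last Puppet run (1 for the current state)",
        "# TYPE puppet_status gauge",
    ]
    for s in STATES:
        lines.append(f'puppet_status{{state="{s}"}} {1 if s == state else 0}')
    return "\n".join(lines) + "\n"


def write_textfile(directory, content, name="puppet_run.prom"):
    """Write the file atomically so the collector never reads a partial file."""
    # The temporary file does not end in .prom, so the collector ignores it.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write(content)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    path = os.path.join(directory, name)
    os.replace(tmp, path)  # atomic rename within the same directory
    return path
```

Something like write_textfile(textfile_dir, render_metrics("failed")) would then be called from a post-run hook, with textfile_dir set to whatever directory the node exporter is configured to read.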
prometheus-alerts!56 (merged) is a draft alert that matches the metrics of the above reporter.
Note that Icinga was running check_puppetdb_nodes, which checks for failed catalog runs, probably equivalent to puppet_status{state="failed"} > 0 and time() - puppet_report > TIMEOUT.
Watch out for cardinality explosion on puppet_report_time; it will likely need a recording rule to drop those series or sum them up without individual resource labels.
Spun out of #41639 because it was found to be more complicated than just adding an alert, and higher priority than other checks in #41791 (closed).