priority B metrics and alerts deployment

Deploy the following metrics/exporters with alerting rules:

  • node exporter: load, uptime, swap, NTP, systemd, obsolete packages, filesystem checks
  • blackbox: ping
  • textfile: LDAP freshness (see the sketch after this list)
  • ganeti exporter: running instances, cluster verification?
  • unbound resolvers: ?
  • puppet exporter: last run time, catalog failures, disabled state (the exporter does not report the disabled state, since a disabled puppet agent does not generate a report -- we will see this condition as an agent not having reported in a timely fashion; the pager playbook should tell responders to check whether the agent was disabled)
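
For the textfile/LDAP freshness item above, the usual pattern is a cron job writing a timestamp metric into the node exporter's textfile collector directory, with an alert firing when that timestamp goes stale. This is only a sketch: the metric name ldap_last_sync_timestamp_seconds, the file path and the two-hour threshold are invented for the example and would need to match whatever the cron job actually exports.

```yaml
# hypothetical cron job on the LDAP host, writing into the directory passed
# to node_exporter's --collector.textfile.directory flag:
#   echo "ldap_last_sync_timestamp_seconds $(date +%s)" \
#     > /var/lib/prometheus/node-exporter/ldap.prom

groups:
  - name: ldap-freshness
    rules:
      - alert: LDAPSyncStale
        # fire when the timestamp has not been refreshed for two hours
        expr: time() - ldap_last_sync_timestamp_seconds > 2 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LDAP data on {{ $labels.instance }} has not been refreshed recently"
```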

Quote from TPA-RFC-33:

We assign each Icinga check an exporter and a priority:

  • A: must have, should be completed before Icinga is shut down, as soon as possible
  • B: should have, would ideally be done before Icinga is shut down, but we can live without it for a while
  • C: nice to have, we can live without it
  • D: drop, we wouldn't even keep checking this in Icinga if we kept it
  • E: what on earth is this thing and how do we deal with it, to review

In the appendix, the Icinga checks inventory lists every Icinga check and what should happen with it.

Summary:

| Kind | Checks | A | B | C | D | E | Exporters |
|------|--------|---|---|---|---|---|-----------|
| existing | 8 | 4 | 4 | | | | 1 |
| missing, existing exporter | 8 | 5 | 3 | | | | 3 |
| missing, new exporters | 8 | 4 | 4 | | | | 8 |
| DNS | 7 | 1 | 6 | | | | 3? |
| To investigate | 4 | 2 | 1 | | | 1 | 1 existing, 2 new? |
| dropped | 8 | | | | 8 | | 0 |
| delegated to service admins | 4 | | | 4 | | | 4? |
| new exporters | 0 | | | | | | 14 (priority C) |

Checks by alerting levels:

  • warning: 31
  • critical: 3
  • dropped: 12

This is a follow-up to the priority A deployment (#41633 (closed)).

We checked the RFC after the audit (#41713 (closed)) to come up with a list of alerts to generate, based on the table above, which comes from TPA-RFC-33.

So here is the list of alerts to create:

existing metrics, easy

  • failed systemd units (systemd_unit_state or node_systemd_unit_state; make sure we exclude the detailed per-unit stats that lead to cardinality explosion, see also tpo/tpa/team#41070), NRPE's systemctl is-system-running; spun off into #41807 (closed), see the sketch after this list
  • sudden peak of puppet catalog failures (puppet_status{state="failed"}) -- the value needs to be smoothed out to catch only sudden peaks, see the sketch after this list
  • pressure counters, see prometheus-alerts!63 (closed); not an actual problem symptom
  • Nagios' check_load -- not sure what this translates to in node metrics; presumably, if we monitor the pressure counters adequately, we won't need a load check at all. conclusion: don't check the load, check the pressure counters, if anything
  • reboot alerts (time()-node_boot_time_seconds (source) or reboots per day: changes(process_start_time_seconds[1d])), Nagios' dsa-check-uptime
  • swap usage? (maybe check the paging rate, or node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes < 0.5?) marked as a "sanity check" in the RFC, consider dropping; Nagios' check_swap is not directly indicative of a service outage or impending doom
  • NTP sync, see #41639 (comment 3097338)
  • inodes: this wasn't actually checked by Nagios (!) and should be checked by Prom, the same way we check for disk space, with the node_filesystem_files metric (see also this post, and the sketch after this list)
    • readonly fs: node_filesystem_readonly > 0 is usually very indicative of serious fs problems
  • SMART checks: this wasn't checked by Nagios, but should probably be checked by Prom
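
A minimal sketch of the failed systemd units alert from the list above, assuming the node exporter's systemd collector is enabled and that the per-unit series have already been trimmed down as discussed in tpo/tpa/team#41070; the 30m delay is a placeholder.

```yaml
groups:
  - name: systemd
    rules:
      - alert: SystemdUnitFailed
        # node_systemd_unit_state has one series per unit and state;
        # keeping only state="failed" avoids alerting on the other states
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "systemd unit {{ $labels.name }} failed on {{ $labels.instance }}"
```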
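
For the sudden peak of puppet catalog failures, one possible smoothing approach is to compare the current number of failed catalogs with a longer-term average, so that a handful of long-standing failures does not keep the alert firing but a sudden jump does. This assumes puppet_status{state="failed"} counts hosts per state; the 6h window and the threshold of 5 are guesses to tune.

```yaml
groups:
  - name: puppet-catalogs
    rules:
      - alert: PuppetCatalogFailureSpike
        # current failures minus the smoothed 6h baseline: only a sudden
        # increase pushes the difference above the threshold
        expr: sum(puppet_status{state="failed"}) - sum(avg_over_time(puppet_status{state="failed"}[6h])) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "sudden increase in failed Puppet catalog runs"
```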
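
For the inode and read-only filesystem items, a sketch mirroring the shape of the usual disk space alerts; the 10% threshold, the fstype filter and the delays are placeholders.

```yaml
groups:
  - name: filesystems
    rules:
      - alert: FilesystemInodesLow
        # less than 10% of inodes left, using the node_filesystem_files metrics
        expr: node_filesystem_files_free{fstype!~"tmpfs|fuse.*"} / node_filesystem_files{fstype!~"tmpfs|fuse.*"} < 0.10
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is running out of inodes"
      - alert: FilesystemReadOnly
        # a filesystem remounted read-only is usually a sign of serious trouble
        expr: node_filesystem_readonly{fstype!~"tmpfs|fuse.*"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is mounted read-only"
```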

existing exporters requiring extra config, easy

  • ICMP checks on all nodes? we have SSH checks everywhere, is that enough? critical after 1h? inhibit other errors? Nagios' check_ping; see the sketch below
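
If we do add ICMP probes through the blackbox exporter, the alerting side reduces to a probe_success check; this sketch assumes a scrape job named blackbox_icmp (the job name, the 1h delay and the critical severity are placeholders, and inhibiting other alerts from the same host would be Alertmanager configuration, not shown here).

```yaml
groups:
  - name: blackbox-icmp
    rules:
      - alert: HostUnreachableICMP
        # probe_success is 1 when the blackbox exporter's ICMP probe got an answer
        expr: probe_success{job="blackbox_icmp"} == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} does not answer ICMP pings"
```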

new checks to implement

those should be filed in separate tickets:

new exporters to deploy

those should be filed in separate tickets:

  • Ganeti health checks (ganeti-exporter): Nagios does a "verify" and checks the output; it also checks for stopped instances, but maybe we don't care about that (Nagios' check_ganeti_instances). the check_ganeti_disks check was disabled in 2020 because it was timing out, and check_ganeti_cluster in 2024 because it was flooding the job queue. the exporter should give us hbal health values we can check, and other metrics to investigate; moved to #41968
  • Puppet health checks (puppet-exporter): Nagios runs check_puppetdb_nodes, which checks for failed catalog runs -- probably equivalent to puppet_status{state="failed"} > 0 and time() - puppet_report > TIMEOUT; watch out for cardinality explosion on puppet_report_time, which will likely need a recording rule to drop those series or sum them up without individual resource labels; filed as #41806 (closed), see the sketch below
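
A sketch of the Puppet freshness check described above, assuming puppet_report is a unix timestamp of the agent's last report (the 24h threshold is a placeholder, and the puppet_report_time cardinality problem would be handled separately, for example by dropping or aggregating that metric at scrape time).

```yaml
groups:
  - name: puppet-agent
    rules:
      - alert: PuppetAgentStale
        # also catches disabled agents, since a disabled agent stops
        # submitting reports (hence the pager playbook note above)
        expr: time() - puppet_report > 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Puppet agent on {{ $labels.instance }} has not reported in over 24 hours"
```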

All of those alerts are warnings, not critical.

Consider splitting all of the above into different tickets unless they already have metrics recorded in Prometheus -- particularly new exporter deployments, but also textfile changes that require cron jobs and whatnot. Those took an especially long time in #41633 (closed) and we should split up the ticket to reflect this.

Note that the above list was re-evaluated and prioritized in #41791 (closed). There, the following checks were prioritized:
