priority B metrics and alerts deployment
Deploy the following metrics/exporters with alerting rules:
- node exporter: load, uptime, swap, NTP, systemd, obsolete packages, filesystem checks
- blackbox: ping
- textfile: LDAP freshness
- ganeti exporter: running instances, cluster verification?
- unbound resolvers: ?
- puppet exporter: last run time, catalog failures, disabled state (the exporter does not report the disabled state since in that case Puppet will not generate a report -- we will see this condition as an agent not having reported in a timely fashion; we should document in our pager playbook to look for whether the agent was disabled or not)
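Since a disabled agent simply stops reporting, the practical check is report freshness. A minimal sketch of such a rule, assuming the `puppet_report` timestamp metric discussed under "new exporters to deploy" below; the alert name, threshold and duration are placeholders, not settled values:

```yaml
# Sketch only: alert name, threshold and duration are placeholders.
groups:
  - name: puppet-freshness
    rules:
      - alert: PuppetAgentStale
        # puppet_report is the timestamp of the last Puppet report; a disabled
        # agent stops reporting, so it also ends up tripping this alert.
        expr: time() - puppet_report > 2 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Puppet agent on {{ $labels.instance }} has not reported recently"
```

When this fires, the playbook entry should tell the responder to check whether the agent was merely disabled on the host.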
Quote from TPA-RFC-33:
We assign each Icinga check an exporter and a priority:
- A: must have, should be completed before Icinga is shut down, as soon as possible
- B: should have, would ideally be done before Icinga is shut down, but we can live without it for a while
- C: nice to have, we can live without it
- D: drop, we wouldn't even keep checking this in Icinga if we kept it
- E: what on earth is this thing and how do we deal with it, to review
In the appendix, the Icinga checks inventory lists every Icinga check and what should happen with it.
Summary:

| Kind | Checks | A | B | C | D | E | Exporters |
|------|--------|---|---|---|---|---|-----------|
| existing | 8 | 4 | 4 | | | | 1 |
| missing, existing exporter | 8 | 5 | 3 | | | | 3 |
| missing, new exporters | 8 | 4 | 4 | | | | 8 |
| DNS | 7 | 1 | 6 | | | | 3? |
| To investigate | 4 | 2 | 1 | | | 1 | 1 existing, 2 new? |
| dropped | 8 | | | | 8 | | 0 |
| delegated to service admins | 4 | | | 4 | | | 4? |
| new exporters | 0 | | | | | | 14 (priority C) |

Checks by alerting levels:
- warning: 31
- critical: 3
- dropped: 12
This is a follow-up to the priority A deployment (#41633 (closed)).
We checked the RFC after the audit (#41713 (closed)) to come up with a list of alerts to generate, based on the table above, which comes from TPA-RFC-33.
So here is the list of alerts to create:
existing metrics, easy
- failed systemd units (`systemd_unit_state` or `node_systemd_unit_state`, make sure we exclude detailed per-unit stats that lead to cardinality explosion, see also tpo/tpa/team#41070), NRPE's `systemctl is-system-running`, spun off into #41807 (closed)
- sudden peak of puppet catalog failures (`puppet_status{state="failed"}`) -- need to smooth out the value to catch only sudden peaks
- pressure counters, see prometheus-alerts!63 (closed) -- not an actual problem symptom
- Nagios' `check_load` -- not sure what this translates to in node metrics; presumably if we monitor the pressure counters adequately, we won't need to check for the load? (don't check the load, check pressure counters, if at all)
- reboot alerts (`time() - node_boot_time_seconds` (source) or reboots per day: `changes(process_start_time_seconds[1d])`), Nagios' `dsa-check-uptime`
- swap usage? (maybe check paging rate or `node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes < 0.5`?), marked as "sanity check" in RFC, consider dropping, Nagios' `check_swap` -- not directly indicative of a service outage or impending doom
- NTP sync, see #41639 (comment 3097338)
- inodes: this wasn't actually checked by Nagios (!) and should be checked by Prom, the same way we check for disk space, with the `node_filesystem_files` metric, see also this post
- readonly fs: `node_filesystem_readonly > 0` is usually very indicative of serious fs problems
- SMART checks: this wasn't checked by Nagios, but should probably be checked by Prom (see the rule sketch after this list):
  - temperature: `((smartmon_airflow_temperature_cel_raw_value) OR (smartmon_temperature_celsius_raw_value)) > 50` (from riseup), see #41639 (comment 3142814)
  - health: `smartmon_device_smart_healthy < 1`, see #41639 (comment 3142813)
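To make the inode, readonly fs and SMART items above concrete, here is a rough sketch of what the rules could look like. The SMART expressions are the ones quoted above; the inode expression, alert names, thresholds, durations and label names (e.g. `disk`, `mountpoint`) are assumptions to adjust against what the exporters actually expose:

```yaml
# Sketch only: alert names, thresholds and durations are placeholders.
groups:
  - name: node-filesystems
    rules:
      - alert: FilesystemInodesLow
        # mirror the disk-space check, but for inodes (assumed 10% threshold)
        expr: node_filesystem_files_free / node_filesystem_files < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is low on inodes"
      - alert: FilesystemReadOnly
        expr: node_filesystem_readonly > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is mounted read-only"
  - name: smart
    rules:
      - alert: SmartTemperatureHigh
        # expression quoted from the list above (riseup)
        expr: ((smartmon_airflow_temperature_cel_raw_value) OR (smartmon_temperature_celsius_raw_value)) > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "disk {{ $labels.disk }} on {{ $labels.instance }} is running hot"
      - alert: SmartUnhealthy
        expr: smartmon_device_smart_healthy < 1
        labels:
          severity: warning
        annotations:
          summary: "disk {{ $labels.disk }} on {{ $labels.instance }} reports SMART failure"
```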
existing exporters requiring extra config, easy
- ICMP checks on all nodes? we have SSH checks everywhere, is that enough? critical after 1h? inhibit other errors? Nagios' `check_ping` (see the blackbox sketch below)
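If we do add ICMP probes, the blackbox exporter side is small. A sketch, where the `icmp` module name, the `blackbox_icmp` job label, the one-hour delay and the severity are all assumptions, not settled choices:

```yaml
# blackbox exporter module definition (sketch)
modules:
  icmp:
    prober: icmp
    timeout: 5s
```

```yaml
# Prometheus alerting rule (sketch): job label, duration and severity are placeholders.
groups:
  - name: blackbox-icmp
    rules:
      - alert: HostUnreachableICMP
        expr: probe_success{job="blackbox_icmp"} == 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} does not answer ICMP probes"
```

Whether a failed ICMP probe should inhibit other per-host alerts, as floated above, would be an Alertmanager inhibition rule rather than anything in these files.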
new checks to implement
those should be filed in separate tickets:
- filesystem health checks, Nagios' `dsa-check-filesystems`, seemingly missing from the node exporter (upstream issue 3113), workaround: check `/sys/fs/ext4/*/errors_count != 0`
- LDAP freshness: make a "timestamp of file `$foo`" metric, in this case `/var/lib/misc/thishost/last_update.trace`, Nagios' `dsa-check-udldap-freshness` (see the alert sketch after this list)
- unbound: it keeps crashing when root hints are missing or broken, and leaves those stray files around. perhaps systemd unit checks cover for this? review Nagios' `dsa-check-unbound-anchors`
- apt obsolete packages (see #41712)
- DNS (see #41794)
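For the LDAP freshness item, whatever cron job ends up writing the textfile metric, the alert itself is just an age check. A sketch, where `ldap_last_update_timestamp_seconds` is a hypothetical metric name (the mtime of `/var/lib/misc/thishost/last_update.trace` exposed via the textfile collector) and the threshold is arbitrary:

```yaml
# Sketch: ldap_last_update_timestamp_seconds is a hypothetical textfile-collector
# metric; the threshold and duration are placeholders.
groups:
  - name: ldap-freshness
    rules:
      - alert: LdapReplicationStale
        expr: time() - ldap_last_update_timestamp_seconds > 2 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LDAP data on {{ $labels.instance }} has not been refreshed recently"
```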
new exporters to deploy
those should be filed in separate tickets:
- Ganeti health checks (ganeti-exporter): Nagios does a "verify" and checks the output, it also checks for stopped instances, but maybe we don't care about that, Nagios' `check_ganeti_instances`; `check_ganeti_disks` was disabled in 2020 because it was timing out and `check_ganeti_cluster` in 2024 because it was flooding the job queue; the exporter should give us hbal health values we can check, and other metrics to investigate
- Puppet health checks (puppet-exporter): Nagios runs `check_puppetdb_nodes`, which checks for failed catalog runs, probably equivalent to `puppet_status{state="failed"} > 0` and `time() - puppet_report > TIMEOUT`; watch out for cardinality explosion on `puppet_report_time`, will likely need a recording rule to drop those or sum them up without individual resource labels (see the sketch below), filed as #41806 (closed)
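On the cardinality point, a recording rule summing `puppet_report_time` per host, next to the catalog-failure alert itself, might look like this; the rule name, the `name` label being dropped, and the duration are assumptions to check against what the exporter actually emits:

```yaml
# Sketch: names are placeholders; adjust the aggregation labels to whatever
# labels the puppet exporter actually attaches to puppet_report_time.
groups:
  - name: puppet
    rules:
      # keep only a per-host total instead of one series per resource
      - record: instance:puppet_report_time:sum
        expr: sum without (name) (puppet_report_time)
      - alert: PuppetCatalogFailed
        expr: puppet_status{state="failed"} > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Puppet catalog run failed on {{ $labels.instance }}"
```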
All of those alerts are warnings, not critical.
Consider splitting all of the above into different tickets unless they already have metrics recorded in Prometheus, particularly new exporter deployments, but also textfile changes that require cron jobs and whatnot. Those took an especially long time in #41633 (closed) and we should split up the ticket to reflect this.
Note that the above list was re-evaluated and prioritized in #41791 (closed). There, the following checks were prioritized:
- DNS - DS expiry (#41794)
- puppet - catalog run (#41806 (closed))
- puppet - all catalog runs (#41806 (closed))
- SSL cert - db.torproject.org (#41732 (closed))
- SSL certs - LE (#41731 (closed))
- system - all services running (#41807 (closed))