From 7af6da4b056866b7d1bcb99a738655b4690b6aaa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org> Date: Thu, 9 May 2024 17:13:57 -0400 Subject: [PATCH] review all nagios metrics (tpo/tpa/team#40755) --- howto/prometheus.md | 12 -- policy/tpa-rfc-33-monitoring.md | 222 ++++++++++++++++++++++++++------ 2 files changed, 183 insertions(+), 51 deletions(-) diff --git a/howto/prometheus.md b/howto/prometheus.md index c0a7d6f0..4c9a0ec7 100644 --- a/howto/prometheus.md +++ b/howto/prometheus.md @@ -394,18 +394,6 @@ Basically, Prometheus is similar to Munin in many ways: without sending duplicate alerts - `munin-limits` can only run on a single server -## Migrating from Icinga / Nagios - -Key metric equivalence: - - * uptime: `time()-node_boot_time_seconds` ([source](https://github.com/m-lab/prometheus-support/issues/91#issuecomment-687785774)) also: count - reboots per day: `changes(process_start_time_seconds[1d])`, see - also [alerting on crash loops](https://www.robustperception.io/alerting-on-crash-loops-with-prometheus/) - * availability: `avg_over_time(up{job="node"}[7d])` ([source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2540787)) - -More ideas in [this issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864), followup on the migration in [this -issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755). See also [TPA-RFC-33: Monitoring](policy/tpa-rfc-33-monitoring). - ## Push metrics to the Pushgateway The [Pushgateway][] is setup on the secondary Prometheus server diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md index 9602f0dc..5de4196e 100644 --- a/policy/tpa-rfc-33-monitoring.md +++ b/policy/tpa-rfc-33-monitoring.md @@ -404,7 +404,7 @@ trends. He builds dashboards by clicking around Grafana and saving the resulting JSON in the [grafana-dashboards git repository](https://gitlab.com/anarcat/grafana-dashboards). Ethan would love to monitor user endpoints better, and particularly -wants to have [better monitoring for webserver response times][] +wants to have [better monitoring for webserver response times][tpo/tpa/team#40568]. ### Note @@ -593,7 +593,7 @@ In any case, we might want to have a separate targets directory for TPA services than service admins. Some work is clearly necessary to clean up [this mess](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/cd0f37fc296a29e71a87e44d24d9908f4b3bd333/targets.d/README.md). -### Inventory +### Metrics types In [monitoring distributed systems][], Google defines 4 "golden signals", categories of metrics that need to be monitored: @@ -607,36 +607,155 @@ In the book, they argue all four should issue pager alerts, but we believe warnings for saturation, except for extreme cases ("disk actually full") might be sufficient. -TODO: Get a sense of what metrics we have and what we want to keep. - - * EDAC: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2908372 - * DRBD: - https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119 - andhttps://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908 - * unexpected open ports: https://github.com/stanford-esrg/lzr - * disk full: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792 - * needrestart: https://github.com/liske/needrestart/issues/291 - * cert expirations: https://github.com/joe-elliott/cert-exporter - * fingerprint checking: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385 - * imap/web roundtrips: https://git.autistici.org/ai3/tools/service-prober - * puppet: https://github.com/voxpupuli/puppet-prometheus_reporter - -https://github.com/chrj/prometheus-dnssec-exporter -https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter -https://github.com/hipages/php-fpm_exporter -https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028 -fail2ban: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544 -gitlab issue counts: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591 -gitlab mail processing: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410 -network interfaces: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387 -ipmi dashboard: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569 -technical debt: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456 +### Existing metrics + +Those checks are present in Nagios and have a corresponding metric in +Prometheus, and an alerting rule might need to be created. + +| Check | Type | Exporter | Metric | Rule level | Note | +|--------------------|-------|----------|-----------------------------------------------------------|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------| +| `check_disk` | NRPE | node | TODO | warning / critical | disk full, critical when [< 24h to full][] | +| `check_load` | NRPE | node | `node_load1` or `node_pressure_cpu_waiting_seconds_total` | warning | sanity check, if using load, compare against CPU count | +| `dsa-check-uptime` | NRPE | node | `node_boot_time_seconds` | warning | `time()-node_boot_time_seconds` ([source][mlab91]), reboots per day: `changes(process_start_time_seconds[1d])`, [alerting on crash loops][] | +| `check_swap` | NRPE | node | `node_memory_SwapFree_bytes` | warning | sanity check, reuse checks from [memory dashboard][] | +| `check_nrpe` | local | node | `up` | warning | | +| `check_ntp_peer` | NRPE | node | `node_ntp_offset_seconds` | warning | | + +[mlab91]: https://github.com/m-lab/prometheus-support/issues/91#issuecomment-687785774 +[alerting on crash loops]: https://www.robustperception.io/alerting-on-crash-loops-with-prometheus/ +[memory dashboard]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1 +[< 24h to full]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792 + +### Missing metrics + +| Check | Type | Exporter | Metric | Rule level | Note | +|----------------------------------------|-------|---------------------|-------------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `check_ping` | local | blackbox | | warning | critical after 1h? inhibit other errors? | +| `needrestart -p` | NRPE | textfile | `kernel_status`, `microcode_status` | warning | [not supported upstream][needrestart#291], [alternative implementation lacking][] | +| `systemctl is-system-running` | NRPE | node | TODO? | warning | sanity check, checks for failing timers and services | +| `check_ssh --timeout=40` | local | blackbox | TODO | warning | sanity check, overlaps with systemd check, but better be safe | +| `check_smtp` | local | blackbox | TODO | warning | incomplete, need [end-to-end deliverability checks][] | +| `dsa-check-udldap-freshness` | NRPE | textfile | TODO | warning | make a "timestamp of file $foo" metric, in this case `/var/lib/misc/thishost/last_update.trace` | +| `dsa-check-cert-expire` | NRPE | [cert-exporter][] | TODO | warning | checks local CA for expiry, on disk, `/etc/ssl/certs/thishost.pem` and `db.torproject.org.pem` on each host | +| `dsa-check-raid-sw` | NRPE | ? | TODO | warning | warns about inconsistent arrays, maybe in node exporter? | +| `dsa-check-drbd` | NRPE | ? | TODO | warning | TODO: review https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119 https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908 | +| `check_ganeti_cluster` | NRPE | [ganeti-exporter][] | TODO | warning | runs a full verify, costly | +| `check_ganeti_instances` | NRPE | idem | TODO | warning | currently noisy: warns about retired hosts waiting for destruction, drop? | +| `check_http` | local | blackbox | TODO | warning/critical | critical only for key sites, after significant delay, see also [tpo/tpa/team#40568][] | +| `check_https` | local | idem | idem | idem | idem | +| `dsa_check_cert` | local | [cert-exporter][] | | warning | check for cert expiry for all sites, the above will check for real user-visible failures, this is about "pending renewal failed", nagios checks for 14 days | +| `dsa-check-unbound-anchors` | NRPE | ??? | ? | warning? | checks if `/var/lib/unbound` files have the string `VALID` and are newer than 5 days, catches bug in unbound that writes empty files on full disk, fix bug? | +| "redis liveness" | NRPE | blackbox | TODO | warning? | checks that the Redis tunnel works, might require blackbox exporter, possibly better served by end-to-end donation testing? | +| `dsa_check_staticsync` | NRPE | textfile? | TODO | warning | runs on all mirrors, see if components are up to date, to rewrite? | +| `dsa_check_soas_add` | NRPE | ??? | TODO | warning | checks that zones are in sync on secondaries | +| `dsa-check-zone-rrsig-expiration-many` | NRPE | [dnssec-exporter][] | TODO? | warning | TODO, drop DNSSEC? | +| `dsa-check-zone-signature-all` | NRPE | ??? | TODO? | warning | idem | +| `dsa-check-dnssec-delegation` | NRPE | ??? | TODO? | warning | idem | +| "DNS - key coverage" | NRPE | ??? | TODO? | warning | idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage` on nevii, could be converted as is | +| "DNS - DS expiry" | NRPE | ??? | TODO? | warning | idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds` on nevii | +| `dsa-check-backuppg` | NRPE | [barman-exporter][] | TODO | warning | tricky dependency on [barman rebuild][], [maybe builtin?][] | +| `dsa-check-bacula` | NRPE | ? | TODO | warning | | +| `check_puppetdb_nodes` | NRPE | [puppet-exporter][] | TODO | warning | | + +[needrestart#291]: https://github.com/liske/needrestart/issues/291 +[alternative implementation lacking]: https://git.fsmpi.rwth-aachen.de/thomas/needrestart2prom/ +[end-to-end deliverability checks]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40494 +[cert-exporter]: https://github.com/joe-elliott/cert-exporter +[ganeti-exporter]: https://github.com/ganeti/prometheus-ganeti-exporter + +The "redis liveness" check is particularly tricky to implement, here +is the magic configuration right now: + +``` + - + name: "redis liveness" + nrpe: "if echo PING | nc -w 1 localhost 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi" + hosts: crm-int-01 + + - + name: "redis liveness on crm-int-01 from crm-ext-01" + nrpe: "if echo PING | nc -w 1 crm-int-01-priv 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi" + hosts: crm-ext-01 +``` + +[maybe builtin?]: https://github.com/EnterpriseDB/barman/issues/220 +[barman rebuild]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950 +[dnssec-exporter]: https://github.com/chrj/prometheus-dnssec-exporter +[puppet-exporter]: https://github.com/voxpupuli/puppet-prometheus_reporter +[barman-exporter]: https://github.com/marcinhlybin/prometheus-barman-exporter + +### Dropped checks + +| Check | Type | Rationale | +|------------------------------|-------|-------------------------------------------------------------------| +| `check_users` | NRPE | who has logged-in users?? | +| `check_procs -s Z` | NRPE | useless | +| `check_procs 620` | NRPE | too noisy, needed exclusions for builders | +| `check_procs $foo` | NRPE | better to check systemd | +| weird Let's Encrypt X3 check | NRPE | see below | +| `dsa-check-ucode-intel` | NRPE | overlaps with `needrestart` check | +| "unexpected sw raid" | NRPE | needlessly noisy, just means an extra module is loaded, who cares | +| `dsa_check_port_closed` | local | needlessly noisy, if we *really* want this, use [lzr][] | + +`check_procs`, in particular, was generating a *lot* of noise in +Nagios, as we were checking dozens of different processes, which would +all explode at once when a host would go down and Nagios didn't notice +the host being down. + +In [tpo/tpa/team#40052](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40052), weasel implemented a NRPE check like this: + +``` + - + name: "LE - chain - see tpo/tpa/team#40052" + nrpe: "if [ -e /home/letsencrypt/non-X3-cert-encountered ]; then echo 'CRITICAL: found flag file'; exit 1; else echo 'OK: flag-file not found (good)'; fi" + hosts: nevii +``` + +It's unclear what it does or why it is necessary, assuming sanity. + +[lzr]: https://github.com/stanford-esrg/lzr + +### Dropped checks to delegate to service admins + +| Check | Type | Note | +|---------------------------|-------|-----------------------------| +| "bridges.tpo web service" | local | `check_http` on bridges.tpo | +| "mail queue" | NRPE | `check_mailq` on polyanthum | +| `tor_check_collector` | NRPE | ??? | +| `tor-check-onionoo` | NRPE | ??? | + +### To investigate + +| Check | Type | Possible exporter | Rule level | Note | +|-------------------------|------|-------------------|------------|------------------------------------------------| +| `dsa-check-filesystems` | NRPE | node | `warning` | checks for fsck errors with tune2fs | +| `check_ntp_time` | NRPE | node | `warning` | unclear how that differs from `check_ntp_peer` | + +### Other possible metrics + + * error detection and correction ([EDAC][]) checks in the node exporter (`node_edac_correctable_errors_total`) + * [multiple viewpoint fingerprint checking](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385) + * [fail2ban](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544) + * [gitlab issue counts](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591) + * [gitlab mail processing](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410) + * [network interfaces](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387) + * [ipmi dashboard](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569) + * [technical debt](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456) + * [php-fpm exporter](https://github.com/hipages/php-fpm_exporter) + * [git exporter](https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter) - latency for push/pull + * [hetzner exporter](https://promhippie.github.io/hetzner_exporter/) and [hcloud exporter](https://promhippie.github.io/hcloud_exporter/) to monitor costs + +TODO: review for possible new exporters in https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028 + +TODO: review the sr.ht stack: https://man.sr.ht/ops/monitoring.md https://git.sr.ht/~sircmpwn/metrics.sr.ht https://metrics.sr.ht/rules https://metrics.sr.ht/alerts +[EDAC]: https://en.wikipedia.org/wiki/Error_detection_and_correction#Error-correcting_memory + ### Retention We have been looking at [longer-term metrics retention][]. This could @@ -764,7 +883,7 @@ TODO: move elsewhere * **availability**: * how many hosts are online at any given point: `sum(count(up==1))/sum(count(up)) by (alias)` - * percentage of hosts available over a given period: `avg_over_time(up{job="node"}[7d])` + * percentage of hosts available over a given period: `avg_over_time(up{job="node"}[7d])` ([source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2540787)) * memory pressure: @@ -994,17 +1113,33 @@ operators. It features: There is a [Karma demo](https://demo.karma-dashboard.io/) available although it's a bit slow and crowded, hopefully ours will look cleaner. -### silences +### Silences & Inhibitions + +Alertmanager supports two different concepts for turning off +notifications: -https://github.com/prymitive/kthxbye + * silences: operator issued override that turns off notifications for + a given amount of time -### inhibitions + * inhibitions: configured override that turns off notifications for + an alert if another alert is already firing -TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert +We will make sure we can silence alerts from the Karma dashboard, +which should work out of the box. It should also be possible to +silence alerts in the built-in Alertmanager web interface, although +that might require some manual work to deploy correctly in the Debian +package. -### Unit tests +By default, silences have a time limit in Alertmanager. If that +becomes a problem, we could deploy [kthxbye](https://github.com/prymitive/kthxbye) to automatically +extend alerts. -TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/ +The other system, inhibitions, needs configuration to be +effective. Micah said it is worth spending at least some time +configuring some basic inhibitions to keep major outages from flooding +operators with alerts, for example turning off alerts on reboots and +so on. There are also [ways to write alerting rules that do not need +inhibitions at all](https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert). ## Notifications: IRC / Matrix? @@ -1211,11 +1346,20 @@ objectives) generator" which generates Grafana dashboards and alerts. [Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications. -## Additional metrics +## Dashboard variables consistency + +One of the issues with dashboards right now is the lack of consistency +in variable names. Some dashboards use `node`, `instance`, `alias` or +`host` to all basically refer to the same thing, the frigging machine +on which the metrics are. That variability makes it hard to cross-link +dashboards and reuse panels. + +We would love to fix this, but it's out of scope of this proposal. + +## Alerting rules unit tests -https://promhippie.github.io/hetzner_exporter/ -https://promhippie.github.io/hcloud_exporter/ -https://github.com/ganeti/prometheus-ganeti-exporter +It's possible to [write unit tests for alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/) but this +seems a little premature and overkill at this stage. ## Other implementations @@ -1263,8 +1407,8 @@ This proposal is discussed in [tpo/tpa/team#40755][]. * [automate deployment of Grafana dashboards][tpo/tpa/team#41312] * [exporter policy][tpo/tpa/team#41280] * [improve incident response procedures][tpo/tpa/team#40421] - * [better monitoring for webserver response times][] + * [better monitoring for webserver response times][tpo/tpa/team#40568] * [longer-term metrics retention][] -[better monitoring for webserver response times]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568 + [tpo/tpa/team#40568]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568 [longer-term metrics retention]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330 -- GitLab