From 7af6da4b056866b7d1bcb99a738655b4690b6aaa Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Thu, 9 May 2024 17:13:57 -0400
Subject: [PATCH] review all nagios metrics (tpo/tpa/team#40755)

---
 howto/prometheus.md             |  12 --
 policy/tpa-rfc-33-monitoring.md | 222 ++++++++++++++++++++++++++------
 2 files changed, 183 insertions(+), 51 deletions(-)

diff --git a/howto/prometheus.md b/howto/prometheus.md
index c0a7d6f0..4c9a0ec7 100644
--- a/howto/prometheus.md
+++ b/howto/prometheus.md
@@ -394,18 +394,6 @@ Basically, Prometheus is similar to Munin in many ways:
    without sending duplicate alerts - `munin-limits` can only run on a
    single server
 
-## Migrating from Icinga / Nagios
-
-Key metric equivalence:
-
- * uptime: `time()-node_boot_time_seconds` ([source](https://github.com/m-lab/prometheus-support/issues/91#issuecomment-687785774)) also: count
-   reboots per day: `changes(process_start_time_seconds[1d])`, see
-   also [alerting on crash loops](https://www.robustperception.io/alerting-on-crash-loops-with-prometheus/)
- * availability: `avg_over_time(up{job="node"}[7d])` ([source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2540787))
-
-More ideas in [this issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864), followup on the migration in [this
-issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755). See also [TPA-RFC-33: Monitoring](policy/tpa-rfc-33-monitoring).
-
 ## Push metrics to the Pushgateway
 
 The [Pushgateway][] is setup on the secondary Prometheus server
diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 9602f0dc..5de4196e 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -404,7 +404,7 @@ trends. He builds dashboards by clicking around Grafana and saving the
 resulting JSON in the [grafana-dashboards git repository](https://gitlab.com/anarcat/grafana-dashboards).
 
 Ethan would love to monitor user endpoints better, and particularly
-wants to have [better monitoring for webserver response times][]
+wants to have [better monitoring for webserver response times][tpo/tpa/team#40568].
 
 ### Note
 
@@ -593,7 +593,7 @@ In any case, we might want to have a separate targets directory for
 TPA services than service admins. Some work is clearly necessary to
 clean up [this mess](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/cd0f37fc296a29e71a87e44d24d9908f4b3bd333/targets.d/README.md).
 
-### Inventory
+### Metrics types
 
 In [monitoring distributed systems][], Google defines 4 "golden
 signals", categories of metrics that need to be monitored:
@@ -607,36 +607,155 @@ In the book, they argue all four should issue pager alerts, but we
 believe warnings for saturation, except for extreme cases ("disk
 actually full") might be sufficient.
 
-TODO: Get a sense of what metrics we have and what we want to keep.
-
- * EDAC: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2908372
- * DRBD:
-   https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119
-   andhttps://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908
- * unexpected open ports: https://github.com/stanford-esrg/lzr
- * disk full: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792
- * needrestart: https://github.com/liske/needrestart/issues/291
- * cert expirations: https://github.com/joe-elliott/cert-exporter
- * fingerprint checking: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385
- * imap/web roundtrips: https://git.autistici.org/ai3/tools/service-prober
- * puppet: https://github.com/voxpupuli/puppet-prometheus_reporter
-
-https://github.com/chrj/prometheus-dnssec-exporter
-https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter
-https://github.com/hipages/php-fpm_exporter
-https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
-fail2ban: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544
-gitlab issue counts: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591
-gitlab mail processing: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410
-network interfaces: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387
-ipmi dashboard: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569
-technical debt: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456
+### Existing metrics
+
+Those checks are present in Nagios and have a corresponding metric in
+Prometheus, and an alerting rule might need to be created.
+
+| Check              | Type  | Exporter | Metric                                                    | Rule level         | Note                                                                                                                                         |
+|--------------------|-------|----------|-----------------------------------------------------------|--------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
+| `check_disk`       | NRPE  | node     | TODO                                                      | warning / critical | disk full, critical when [< 24h to full][]                                                                                                       |
+| `check_load`       | NRPE  | node     | `node_load1` or `node_pressure_cpu_waiting_seconds_total` | warning            | sanity check, if using load, compare against CPU count                                                                                       |
+| `dsa-check-uptime` | NRPE  | node     | `node_boot_time_seconds`                                  | warning            | `time()-node_boot_time_seconds`  ([source][mlab91]), reboots per day: `changes(process_start_time_seconds[1d])`, [alerting on crash loops][] |
+| `check_swap`       | NRPE  | node     | `node_memory_SwapFree_bytes`                              | warning            | sanity check, reuse checks from [memory dashboard][]                                                                                         |
+| `check_nrpe`       | local | node     | `up`                                                      | warning            |                                                                                                                                              |
+| `check_ntp_peer`   | NRPE  | node     | `node_ntp_offset_seconds`                                 | warning            |                                                                                                                                              |
+
+[mlab91]: https://github.com/m-lab/prometheus-support/issues/91#issuecomment-687785774
+[alerting on crash loops]: https://www.robustperception.io/alerting-on-crash-loops-with-prometheus/
+[memory dashboard]: https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1
+[< 24h to full]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792
+
+### Missing metrics
+
+| Check                                  | Type  | Exporter            | Metric                              | Rule level       | Note                                                                                                                                                        |
+|----------------------------------------|-------|---------------------|-------------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `check_ping`                           | local | blackbox            |                                     | warning          | critical after 1h? inhibit other errors?                                                                                                                    |
+| `needrestart -p`                       | NRPE  | textfile            | `kernel_status`, `microcode_status` | warning          | [not supported upstream][needrestart#291], [alternative implementation lacking][]                                                                           |
+| `systemctl is-system-running`          | NRPE  | node                | TODO?                               | warning          | sanity check, checks for failing timers and services                                                                                                        |
+| `check_ssh --timeout=40`               | local | blackbox            | TODO                                | warning          | sanity check, overlaps with systemd check, but better be safe                                                                                               |
+| `check_smtp`                           | local | blackbox            | TODO                                | warning          | incomplete, need [end-to-end deliverability checks][]                                                                                                       |
+| `dsa-check-udldap-freshness`           | NRPE  | textfile            | TODO                                | warning          | make a "timestamp of file $foo" metric, in this case `/var/lib/misc/thishost/last_update.trace`                                                             |
+| `dsa-check-cert-expire`                | NRPE  | [cert-exporter][]   | TODO                                | warning          | checks local CA for expiry, on disk, `/etc/ssl/certs/thishost.pem` and `db.torproject.org.pem` on each host                                                 |
+| `dsa-check-raid-sw`                    | NRPE  | ?                   | TODO                                | warning          | warns about inconsistent arrays, maybe in node exporter?                                                                                                    |
+| `dsa-check-drbd`                       | NRPE  | ?                   | TODO                                | warning          | TODO: review https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119 https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908  |
+| `check_ganeti_cluster`                 | NRPE  | [ganeti-exporter][] | TODO                                | warning          | runs a full verify, costly                                                                                                                                  |
+| `check_ganeti_instances`               | NRPE  | idem                | TODO                                | warning          | currently noisy: warns about retired hosts waiting for destruction, drop?                                                                                   |
+| `check_http`                           | local | blackbox            | TODO                                | warning/critical | critical only for key sites, after significant delay, see also [tpo/tpa/team#40568][]                                                                       |
+| `check_https`                          | local | idem                | idem                                | idem             | idem                                                                                                                                                        |
+| `dsa_check_cert`                       | local | [cert-exporter][]   |                                     | warning          | check for cert expiry for all sites, the above will check for real user-visible failures, this is about "pending renewal failed", nagios checks for 14 days |
+| `dsa-check-unbound-anchors`            | NRPE  | ???                 | ?                                   | warning?         | checks if `/var/lib/unbound` files have the string `VALID` and are newer than 5 days, catches bug in unbound that writes empty files on full disk, fix bug? |
+| "redis liveness"                       | NRPE  | blackbox            | TODO                                | warning?         | checks that the Redis tunnel works, might require blackbox exporter, possibly better served by end-to-end donation testing?                                 |
+| `dsa_check_staticsync`                 | NRPE  | textfile?           | TODO                                | warning          | runs on all mirrors, see if components are up to date, to rewrite?                                                                                          |
+| `dsa_check_soas_add`                   | NRPE  | ???                 | TODO                                | warning          | checks that zones are in sync on secondaries                                                                                                                |
+| `dsa-check-zone-rrsig-expiration-many` | NRPE  | [dnssec-exporter][] | TODO?                               | warning          | TODO, drop DNSSEC?                                                                                                                                          |
+| `dsa-check-zone-signature-all`         | NRPE  | ???                 | TODO?                               | warning          | idem                                                                                                                                                        |
+| `dsa-check-dnssec-delegation`          | NRPE  | ???                 | TODO?                               | warning          | idem                                                                                                                                                        |
+| "DNS - key coverage"                   | NRPE  | ???                 | TODO?                               | warning          | idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/coverage` on nevii, could be converted as is                                                 |
+| "DNS - DS expiry"                      | NRPE  | ???                 | TODO?                               | warning          | idem, `dsa-check-statusfile /srv/dns.torproject.org/var/nagios/ds` on nevii                                                                                 |
+| `dsa-check-backuppg`                   | NRPE  | [barman-exporter][] | TODO                                | warning          | tricky dependency on [barman rebuild][], [maybe builtin?][]                                                                                                 |
+| `dsa-check-bacula`                     | NRPE  | ?                   | TODO                                | warning          |                                                                                                                                                             |
+| `check_puppetdb_nodes`                 | NRPE  | [puppet-exporter][] | TODO                                | warning          |                                                                                                                                                             |
+
+[needrestart#291]: https://github.com/liske/needrestart/issues/291
+[alternative implementation lacking]: https://git.fsmpi.rwth-aachen.de/thomas/needrestart2prom/
+[end-to-end deliverability checks]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40494
+[cert-exporter]: https://github.com/joe-elliott/cert-exporter
+[ganeti-exporter]: https://github.com/ganeti/prometheus-ganeti-exporter
+
+The "redis liveness" check is particularly tricky to implement, here
+is the magic configuration right now:
+
+```
+  -
+    name: "redis liveness"
+    nrpe: "if echo PING | nc -w 1 localhost 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
+    hosts: crm-int-01
+
+  -
+    name: "redis liveness on crm-int-01 from crm-ext-01"
+    nrpe: "if echo PING | nc -w 1 crm-int-01-priv 6379 | grep -m 1 -q +PONG; then echo 'OK: redis seems to be alive.'; else echo 'CRITICAL: Did not get a PONG from redis.'; exit 2; fi"
+    hosts: crm-ext-01
+```
+
+[maybe builtin?]: https://github.com/EnterpriseDB/barman/issues/220
+[barman rebuild]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950
+[dnssec-exporter]: https://github.com/chrj/prometheus-dnssec-exporter
+[puppet-exporter]: https://github.com/voxpupuli/puppet-prometheus_reporter
+[barman-exporter]: https://github.com/marcinhlybin/prometheus-barman-exporter
+
+### Dropped checks
+
+| Check                        | Type  | Rationale                                                         |
+|------------------------------|-------|-------------------------------------------------------------------|
+| `check_users`                | NRPE  | who has logged-in users??                                         |
+| `check_procs -s Z`           | NRPE  | useless                                                           |
+| `check_procs 620`            | NRPE  | too noisy, needed exclusions for builders                         |
+| `check_procs $foo`           | NRPE  | better to check systemd                                           |
+| weird Let's Encrypt X3 check | NRPE  | see below                                                         |
+| `dsa-check-ucode-intel`      | NRPE  | overlaps with `needrestart` check                                 |
+| "unexpected sw raid"         | NRPE  | needlessly noisy, just means an extra module is loaded, who cares |
+| `dsa_check_port_closed`      | local | needlessly noisy, if we *really* want this, use [lzr][]                                                  |
+
+`check_procs`, in particular, was generating a *lot* of noise in
+Nagios, as we were checking dozens of different processes, which would
+all explode at once when a host would go down and Nagios didn't notice
+the host being down.
+
+In [tpo/tpa/team#40052](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40052), weasel implemented a NRPE check like this:
+
+```
+  -
+    name: "LE - chain - see tpo/tpa/team#40052"
+    nrpe: "if [ -e /home/letsencrypt/non-X3-cert-encountered ]; then echo 'CRITICAL: found flag file'; exit 1; else echo 'OK: flag-file not found (good)'; fi"
+    hosts: nevii
+```
+
+It's unclear what it does or why it is necessary, assuming sanity.
+
+[lzr]: https://github.com/stanford-esrg/lzr
+
+### Dropped checks to delegate to service admins
+
+| Check                     | Type  | Note                        |
+|---------------------------|-------|-----------------------------|
+| "bridges.tpo web service" | local | `check_http` on bridges.tpo |
+| "mail queue"              | NRPE  | `check_mailq` on polyanthum |
+| `tor_check_collector`     | NRPE  | ???                         |
+| `tor-check-onionoo`       | NRPE  | ???                            |
+
+### To investigate
+
+| Check                   | Type | Possible exporter | Rule level | Note                                           |
+|-------------------------|------|-------------------|------------|------------------------------------------------|
+| `dsa-check-filesystems` | NRPE | node              | `warning`  | checks for fsck errors with tune2fs            |
+| `check_ntp_time`        | NRPE | node              | `warning`  | unclear how that differs from `check_ntp_peer` |
+
+### Other possible metrics
+
+ * error detection and correction ([EDAC][]) checks in the node exporter (`node_edac_correctable_errors_total`)
+ * [multiple viewpoint fingerprint checking](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385)
+ * [fail2ban](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544)
+ * [gitlab issue counts](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591)
+ * [gitlab mail processing](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410)
+ * [network interfaces](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387)
+ * [ipmi dashboard](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569)
+ * [technical debt](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456)
+ * [php-fpm exporter](https://github.com/hipages/php-fpm_exporter)
+ * [git exporter](https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter)  - latency for push/pull
+ * [hetzner exporter](https://promhippie.github.io/hetzner_exporter/) and [hcloud exporter](https://promhippie.github.io/hcloud_exporter/) to monitor costs
+
+TODO: review for possible new exporters in https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
+
+TODO: review the sr.ht stack:
 
 https://man.sr.ht/ops/monitoring.md
 https://git.sr.ht/~sircmpwn/metrics.sr.ht
 https://metrics.sr.ht/rules
 https://metrics.sr.ht/alerts
 
+[EDAC]: https://en.wikipedia.org/wiki/Error_detection_and_correction#Error-correcting_memory
+
 ### Retention
 
 We have been looking at [longer-term metrics retention][]. This could
@@ -764,7 +883,7 @@ TODO: move elsewhere
 
  * **availability**:
    * how many hosts are online at any given point: `sum(count(up==1))/sum(count(up)) by (alias)`
-   * percentage of hosts available over a given period: `avg_over_time(up{job="node"}[7d])`
+   * percentage of hosts available over a given period: `avg_over_time(up{job="node"}[7d])` ([source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2540787))
 
  * memory pressure:
 
@@ -994,17 +1113,33 @@ operators. It features:
 There is a [Karma demo](https://demo.karma-dashboard.io/) available although it's a bit slow and
 crowded, hopefully ours will look cleaner.
 
-### silences
+### Silences & Inhibitions
+
+Alertmanager supports two different concepts for turning off
+notifications:
 
-https://github.com/prymitive/kthxbye
+ * silences: operator issued override that turns off notifications for
+   a given amount of time
 
-### inhibitions
+ * inhibitions: configured override that turns off notifications for
+   an alert if another alert is already firing
 
-TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert
+We will make sure we can silence alerts from the Karma dashboard,
+which should work out of the box. It should also be possible to
+silence alerts in the built-in Alertmanager web interface, although
+that might require some manual work to deploy correctly in the Debian
+package.
 
-### Unit tests
+By default, silences have a time limit in Alertmanager. If that
+becomes a problem, we could deploy [kthxbye](https://github.com/prymitive/kthxbye) to automatically
+extend alerts.
 
-TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
+The other system, inhibitions, needs configuration to be
+effective. Micah said it is worth spending at least some time
+configuring some basic inhibitions to keep major outages from flooding
+operators with alerts, for example turning off alerts on reboots and
+so on. There are also [ways to write alerting rules that do not need
+inhibitions at all](https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert).
 
 ## Notifications: IRC / Matrix?
 
@@ -1211,11 +1346,20 @@ objectives) generator" which generates Grafana dashboards and alerts.
 
 [Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications.
 
-## Additional metrics
+## Dashboard variables consistency
+
+One of the issues with dashboards right now is the lack of consistency
+in variable names. Some dashboards use `node`, `instance`, `alias` or
+`host` to all basically refer to the same thing, the frigging machine
+on which the metrics are. That variability makes it hard to cross-link
+dashboards and reuse panels.
+
+We would love to fix this, but it's out of scope of this proposal.
+
+## Alerting rules unit tests
 
-https://promhippie.github.io/hetzner_exporter/
-https://promhippie.github.io/hcloud_exporter/
-https://github.com/ganeti/prometheus-ganeti-exporter
+It's possible to [write unit tests for alerting rules](https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/) but this
+seems a little premature and overkill at this stage.
 
 ## Other implementations
 
@@ -1263,8 +1407,8 @@ This proposal is discussed in [tpo/tpa/team#40755][].
  * [automate deployment of Grafana dashboards][tpo/tpa/team#41312]
  * [exporter policy][tpo/tpa/team#41280]
  * [improve incident response procedures][tpo/tpa/team#40421]
- * [better monitoring for webserver response times][]
+ * [better monitoring for webserver response times][tpo/tpa/team#40568]
  * [longer-term metrics retention][]
 
-[better monitoring for webserver response times]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568
+ [tpo/tpa/team#40568]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568
  [longer-term metrics retention]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
-- 
GitLab