document alternatives to nagios checks (#41655), authored by anarcat
@@ -2471,3 +2471,153 @@ Basically, Prometheus is similar to Munin in many ways:
to its `alertmanager` that can run multiple copies in parallel
without sending duplicate alerts - `munin-limits` can only run on a
single server
### Migrating from Nagios/Icinga
Near the end of 2024, Icinga was replaced by Prometheus and
Alertmanager, as part of [TPA-RFC-33][].
TODO: document how the actual migration went, including the three
stages and milestones.
Before Icinga was retired, we audited the notifications sent from
Icinga about our services ([#41791][]) to see if we were missing
coverage of anything critical.
Overall, phase A covered most of the critical alerts we were worried
about, but it also left out some key components, which are not
currently monitored.
[#41791]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41791
#### Prometheus equivalence for Icinga/Nagios checks
This table maps Nagios checks to their equivalent Prometheus
metrics, for checks that were explicitly converted into Prometheus
alerts and metrics as part of phase A.
| Name | Command | Metric | Severity | Note |
|--------------------------------|--------------------------|------------------------------------------------------|------------------------|------------------------------------------------------------------------------|
| `disk usage - *` | `check_disk` | `node_filesystem_avail_bytes` | `warning` / `critical` | Critical when less than 24h to full |
| `network service - nrpe` | `check_tcp!5666` | `up` | `warning` | |
| `raid - DRBD` | `dsa-check-drbd` | `node_drbd_out_of_sync_bytes`, `node_drbd_connected` | `warning` | |
| `raid - sw raid` | `dsa-check-raid-sw` | `node_md_disks` / `node_md_state` | `warning` | Does not warn about array synchronization |
| `apt - security updates` | `dsa-check-statusfile` | `apt_upgrades_*` | `warning` | [Incomplete][] |
| `needrestart` | `needrestart -p` | `kernel_status`, `microcode_status` | `warning` | Required patching upstream |
| `network service - sshd` | `check_ssh --timeout=40` | `probe_success` | `warning` | Sanity check, overlaps with systemd check, but better safe than sorry |
| `network service - smtp` | `check_smtp` | `probe_success` | `warning` | Incomplete, need [end-to-end deliverability checks][], scheduled for phase B |
| `network service - submission` | `check_smtp_port!587` | `probe_success` | `warning` | |
| `network service - smtps` | `dsa_check_cert!465` | `probe_success` | `warning` | |
| `network service - http` | `check_http` | `probe_http_duration_seconds` | `warning` | See also [#40568][] for phase B |
| `network service - https` | `check_https` | Idem | `warning` | Idem, see also [#41731][] for exhaustive coverage of HTTPS sites |
| `https cert` and `smtps` | `dsa_check_cert` | `probe_ssl_earliest_cert_expiry` | `warning` | Checks cert expiry for all sites; this catches failed renewals |
| `backup - bacula - *` | `dsa-check-bacula` | `bacula_job_last_good_backup` | `warning` | Based on WMF's [`check_bacula.py`][] |
| `redis liveness` | Custom command | `probe_success` | `warning` | Checks that the Redis tunnel works |
| `postgresql backups` | `dsa-check-backuppg` | `tpa_backuppg_last_check_timestamp_seconds` | `warning` | Built on top of the NRPE check for now, see [TPA-RFC-65][] for the long term |
Actual alerting rules can be found in the [`prometheus-alerts.git`
repository][].
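
The "less than 24h to full" criterion on `disk usage` amounts to a
linear extrapolation of free space. Here is a minimal Python sketch of
that idea, assuming two samples of `node_filesystem_avail_bytes` taken
some seconds apart. The function names and sample values are
hypothetical, and the actual alerting rule may be written differently
(for example with PromQL's `predict_linear`).

```python
from typing import Optional


def hours_until_full(
    avail_then: float, avail_now: float, elapsed_seconds: float
) -> Optional[float]:
    """Linearly extrapolate when available bytes would reach zero.

    Returns None when free space is stable or growing.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # Bytes consumed per second between the two samples.
    rate = (avail_then - avail_now) / elapsed_seconds
    if rate <= 0:
        return None
    return avail_now / rate / 3600


def is_critical(
    avail_then: float,
    avail_now: float,
    elapsed_seconds: float,
    critical_hours: float = 24.0,
) -> bool:
    """True when the filesystem is predicted to fill within critical_hours."""
    hours = hours_until_full(avail_then, avail_now, elapsed_seconds)
    return hours is not None and hours < critical_hours


# Hypothetical samples: 10 GB consumed in the last hour, 90 GB left.
print(is_critical(100e9, 90e9, 3600))
```

With those sample values, the disk would fill in about nine hours at
the current rate, so the sketch reports the filesystem as critical.
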
[incomplete]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41712
[`check_bacula.py`]: https://wikitech.wikimedia.org/wiki/Check_bacula.py
[end-to-end deliverability checks]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40494
[TPA-RFC-65]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40950
[#40568]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568
#### High priority missing checks, phase B
[#41731]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41731
[#41794]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41794
[#41732]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41732
[#41639]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41639
These checks are *all* scheduled in phase B and are considered high
priority, or at least have specific due dates set in their issues to
make sure we don't miss (for example) the next certificate expiry
dates.
| Name | Command | Metric | Severity | Note |
|---------------------------------|-------------------------------|---------------------------|-----------|-----------------------------------------------------------------------------------------------------------------------------|
| `DNS - DS expiry` | `dsa-check-statusfile` | TBD | `warning` | Drop DNSSEC? See [#41795][] |
| `Ganeti - cluster` | `check_ganeti_cluster` | [`ganeti-exporter`][] | `warning` | Runs a full verify, costly, was already disabled |
| `Ganeti - disks` | `check_ganeti_instances` | Idem | `warning` | Was timing out and already disabled |
| `Ganeti - instances` | `check_ganeti_instances` | Idem | `warning` | Currently noisy: warns about retired hosts waiting for destruction, drop? |
| `SSL cert - LE` | `dsa-check-cert-expire-dir` | TBD | `warning` | Exhaustively check *all* certs, see [#41731][], possibly with `critical` severity for actual prolonged downtimes |
| `SSL cert - db.torproject.org` | `dsa-check-cert-expire` | TBD | `warning` | Checks local CA for expiry, on disk, `/etc/ssl/certs/thishost.pem` and `db.torproject.org.pem` on each host, see [#41732][] |
| `puppet - * catalog run(s)` | `check_puppetdb_nodes` | [`puppet-exporter`][] | `warning` | |
| `system - all services running` | `systemctl is-system-running` | `node_systemd_unit_state` | `warning` | Sanity check, checks for failing timers and services |
[`ganeti-exporter`]: https://github.com/ganeti/prometheus-ganeti-exporter
[`puppet-exporter`]: https://github.com/voxpupuli/puppet-prometheus_reporter
Those checks are covered by the priority "B" ticket ([#41639][]),
unless otherwise noted.
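
The certificate checks above all boil down to comparing an expiry time
against the current time. Here is a minimal Python sketch, assuming
the expiry is available as a Unix timestamp in seconds, as
`probe_ssl_earliest_cert_expiry` exposes it. The function names and
the 14-day threshold are hypothetical, not taken from the actual
rules.

```python
import time
from typing import Optional

SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_timestamp: float, now: Optional[float] = None) -> float:
    """Number of days left before the certificate expires (negative if expired)."""
    if now is None:
        now = time.time()
    return (expiry_timestamp - now) / SECONDS_PER_DAY


def expires_soon(
    expiry_timestamp: float,
    threshold_days: float = 14.0,  # hypothetical threshold
    now: Optional[float] = None,
) -> bool:
    """True when the certificate expires within threshold_days."""
    return days_until_expiry(expiry_timestamp, now) < threshold_days


# Hypothetical certificate expiring ten days after a fixed "now".
now = 1_700_000_000
print(expires_soon(now + 10 * SECONDS_PER_DAY, now=now))
```
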
[#41795]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41795
#### Low priority missing checks, phase B
Unless otherwise mentioned, most of these checks are noisy and
generally do not indicate an actual failure, so they were not
considered priorities.
| Name | Command | Metric | Severity | Note |
|-----------------------------------------|----------------------------------------|-------------------------------------------|-----------|---------------------------------------------------------|
| `DNS - delegation and signature expiry` | `dsa-check-zone-rrsig-expiration-many` | [`dnssec-exporter`][] | `warning` | |
| `DNS - key coverage` | `dsa-check-statusfile` | TBD | `warning` | |
| `DNS - security delegations` | `dsa-check-dnssec-delegation` | TBD | `warning` | |
| `DNS - zones signed properly` | `dsa-check-zone-signature-all` | TBD | `warning` | |
| `DNS SOA sync - *` | `dsa_check_soas_add` | TBD | `warning` | Never actually failed |
| `PING` | `check_ping` | `probe_success` | `warning` | |
| `load` | `check_load` | `node_pressure_cpu_waiting_seconds_total` | `warning` | Sanity check, replace with the better pressure counters |
| `mirror (static) sync - *` | `dsa_check_staticsync` | TBD | `warning` | Never actually failed |
| `network service - ntp peer` | `check_ntp_peer` | `node_ntp_offset_seconds` | `warning` | |
| `network service - ntp time` | `check_ntp_time` | TBD | `warning` | Unclear how that differs from `check_ntp_peer` |
| `setup - ud-ldap freshness` | `dsa-check-udldap-freshness` | TBD | `warning` | |
| `swap usage - *` | `check_swap` | `node_memory_SwapFree_bytes` | `warning` | |
| `system - filesystem check` | `dsa-check-filesystems` | TBD | `warning` | |
| `unbound trust anchors` | `dsa-check-unbound-anchors` | TBD | `warning` | |
| `uptime check` | `dsa-check-uptime` | `node_boot_time_seconds` | `warning` | |
Those are also covered by the priority "B" ticket ([#41639][]), unless
otherwise noted. In particular, all DNS issues are covered by issue [#41794][].
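
As an illustration of how one of these checks could be derived from
node exporter metrics, here is a minimal Python sketch of a swap usage
ratio. It assumes `node_memory_SwapFree_bytes` is paired with
`node_memory_SwapTotal_bytes`, which is not listed in the table above.
The function name is hypothetical and no alerting threshold is
implied.

```python
def swap_used_ratio(swap_free_bytes: float, swap_total_bytes: float) -> float:
    """Fraction of swap in use, between 0.0 and 1.0.

    Hosts without swap (total of zero) are reported as using none.
    """
    if swap_total_bytes <= 0:
        return 0.0
    return 1.0 - swap_free_bytes / swap_total_bytes


# Hypothetical host with 4 GiB of swap, 1 GiB of it free.
GIB = 2**30
print(swap_used_ratio(1 * GIB, 4 * GIB))
```
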
[upstream issue 3113]: https://github.com/prometheus/node_exporter/issues/3113
[`dnssec-exporter`]: https://github.com/chrj/prometheus-dnssec-exporter
#### Retired checks
| Name | Command | Rationale |
|--------------------------------|-----------------------------|-------------------------------------------------------------------|
| `users` | `check_users` | Who has logged-in users?? |
| `processes - zombies` | `check_procs -s Z` | Useless |
| `processes - total` | `check_procs 620 700` | Too noisy, needed exclusions for builders |
| `processes - *` | `check_procs $foo` | Better to check systemd |
| `unwanted processes - *` | `check_procs $foo` | Basically the opposite of the above, useless |
| `LE - chain` | Checks for flag file | See [#40052][] |
| `CPU - intel ucode` | `dsa-check-ucode-intel` | Overlaps with `needrestart` check |
| `unexpected sw raid` | Checks for `/proc/mdstat` | Needlessly noisy, just means an extra module is loaded, who cares |
| `unwanted network service - *` | `dsa_check_port_closed` | Needlessly noisy, if we *really* want this, use [`lzr`][] |
| `network - v6 gw` | `dsa-check-ipv6-default-gw` | Useless, see [#41714][] for analysis |
`check_procs`, in particular, generated a *lot* of noise in Icinga:
we were checking dozens of different processes, and all of those
checks would fire at once when a host went down and Icinga did not
notice that the host itself was down.
[#41714]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41714
[`lzr`]: https://github.com/stanford-esrg/lzr
#### Service admin checks
The following checks were not audited by TPA; they were instead
checked by the service admins of the respective teams.
| Check | Team |
|---------------------------|-----------------|
| `bridges.tpo web service` | Anti-censorship |
| `mail queue` | Anti-censorship |
| `tor_check_collector` | Network health |
| `tor-check-onionoo` | Network health |
[#40052]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40052