diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index b70af0089ae8a2bc354310d8a3cfce795757cd5b..2bc6b54493c263e1e4050320fd7c9df7d918cf0d 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -259,15 +259,96 @@ monitoring system, as provided by TPA.
 syslog-ng, rsyslog, journald, or loki are currently out of scope of
 this proposal
 
-# Examples or Personas
-
-Examples:
-
- * ...
-
-Counter examples:
-
- * ...
+# Personas
+
+## Jackie, the TPA admin
+
+Jackie is a member of the TPA team. She has access to the Puppet
+repository, and all other Git repositories managed by TPA. She has
+access to everything and the kitchen sink, and is generally asked to
+fix all of this on a regular basis.
+
+She sometimes ends up rotating as the "star of the week", which makes her
+responsible for handling "interruptions", new tickets, and also
+keeping an eye on the monitoring server. This involves responding to
+alerts like these, in order of frequency over the last year:
+
+ * 2805 pending upgrades (packages blocked from unattended upgrades)
+ * 2325 pending restarts (services blocked from needrestart) or reboots
+ * 1818 load alerts
+ * 1709 disk usage alerts
+ * 1062 puppet catalog failures
+ * 999 uptime alerts (after reboots)
+ * 843 reachability alerts
+ * 602 process count alerts
+ * 585 swap usage alerts
+ * 499 backup alerts
+ * 484 systemd alerts (e.g. systemd says "degraded" and you get to
+   figure out what didn't start)
+ * 383 zombie alerts
+ * 199 missing process (e.g. "0 postgresql processes")
+ * 168 unwanted processes or network services
+ * numerous warnings about service admin specific things:
+   * 129 mirror static sync alert storms (15 at a time), mostly host
+     unreachability warnings
+   * 69 bridgedb
+   * 67 collector
+   * 26 out of date chroots
+   * 17 mail queue (polyanthum)
+   * 14 translation cron - stuck
+ * 96 RAID - DRBD warnings, mostly false alerts
+ * 95 SSL cert warnings about db.torproject.org, all about the same
+   problem
+ * 94 DNS SOA synchronization alerts
+ * 88 DNSSEC alerts (81 delegation and signature expiry, 4 DS expiry,
+   2 security delegations)
+ * 69 hardware RAID warnings
+ * 69 Ganeti cluster verification warnings
+ * numerous alerts about NRPE availability, often falsely flagged as an
+   error in a specific service (e.g. "SSL cert - host")
+ * 28 unbound trust alerts
+ * 24 alerts about unexpected software RAID
+ * 19 SAN health alerts
+ * 5 false (?) alerts about mdadm resyncing
+ * 4 onionoo backend reachability alerts
+ * 3 expiring Let's Encrypt X.509 certificate alerts
+ * 3 redis liveness alerts
+
+Jackie finds that this is way too much noise. That list is actually an
+interpretation of the actual alerts received to make them more human
+readable.
+
+The current Nagios dashboard, that said, is pretty useful in the sense
+that she can ignore all of those emails and just look at the dashboard
+to see what's *actually* going on right now. This sometimes causes her
+to miss some problems, however.
+
+TODO: what does she want out of monitoring?
+
+### Note
+
+The alert list was created with the following utterly horrible shell
+pipeline:
+
+    notmuch search --format=sexp tag:nagios date:2021-06-20.. \
+      | sed -n '/PROBLEM/{s/.*:subject "//;s/" :query .*//;s/.*Alert: [^\/ ]*[\/ ]//;p}' \
+      | sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
+      -e 's/disk usage .*/disk usage/' \
+      -e 's/mirror static sync.*/mirror static sync/' \
+      -e 's/unwanted.*/unwanted/' \
+      -e '/DNS/s/ - .*//' \
+      -e 's/process - .*/process/' \
+      -e 's/network service - .*/network service/' \
+      -e 's/backup - .*/backup/' \
+      -e 's/mirror sync - .*/mirror sync/' \
+      | sort | uniq -c | sort -n
+
+Then the alerts were parsed by a TPA brain. Some alerts were redacted
+because they were considered mostly noise.
+
+## Ethan, the service admin
+
+TODO: what do service admins want?
 
 # Proposal
 