From e6838005df24e106cde5be2eea21b9f22409ca1b Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Mon, 20 Jun 2022 17:01:22 -0400
Subject: [PATCH] start crafting service admins

---
 policy/tpa-rfc-33-monitoring.md | 99 ++++++++++++++++++++++++++++++---
 1 file changed, 90 insertions(+), 9 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index b70af008..2bc6b544 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -259,15 +259,96 @@ monitoring system, as provided by TPA.
    syslog-ng, rsyslog, journald, or loki are currently out of scope of
    this proposal
 
-# Examples or Personas
-
-Examples:
-
- * ...
-
-Counter examples:
-
- * ...
+# Personas
+
+## Jackie, the TPA admin
+
+Jackie is a member of the TPA team. She has access to the Puppet
+repository, and all other Git repositories managed by TPA. She has
+access to everything and the kitchen sink, and is generally asked to
+fix all of this on a regular basis.
+
+She sometimes ends up rotating as the "star of the week", which makes her
+responsible for handling "interruptions", new tickets, and also
+keeping an eye on the monitoring server. This involves responding to
+alerts like the following, roughly in order of frequency over the last year:
+
+ * 2805 pending upgrades (packages blocked from unattended upgrades)
+ * 2325 pending restarts (services blocked from needrestart) or reboots
+ * 1818 load alerts
+ * 1709 disk usage alerts
+ * 1062 puppet catalog failures
+ * 999 uptime alerts (after reboots)
+ * 843 reachability alerts
+ * 602 process count alerts
+ * 585 swap usage alerts
+ * 499 backup alerts
+ * 484 systemd alerts (e.g. systemd says "degraded" and you get to
+   figure out what didn't start)
+ * 383 zombie alerts
+ * 199 missing process (e.g. "0 postgresql processes")
+ * 168 unwanted processes or network services
+ * numerous warnings about service admin specific things:
+   * 129 mirror static sync alert storms (15 at a time), mostly host
+     unreachability warnings
+   * 69 bridgedb
+   * 67 collector
+   * 26 out of date chroots
+   * 14 translation cron - stuck
+   * 17 mail queue (polyanthum)
+ * 96 RAID - DRBD warnings, mostly false alerts
+ * 95 SSL cert warnings about db.torproject.org, all about the same
+   problem
+ * 94 DNS SOA synchronization alerts
+ * 88 DNSSEC alerts (81 delegation and signature expiry, 4 DS expiry,
+   2 security delegations)
+ * 69 hardware RAID warnings
+ * 69 Ganeti cluster verification warnings
+ * numerous alerts about NRPE availability, often falsely flagged as an
+   error in a specific service (e.g. "SSL cert - host")
+ * 28 unbound trust alerts
+ * 24 alerts about unexpected software RAID
+ * 19 SAN health alerts
+ * 5 false (?) alerts about mdadm resyncing
+ * 3 expiring Let's Encrypt X509 certificates alerts
+ * 3 redis liveness alerts
+ * 4 onionoo backend reachability alerts
+
+Jackie finds that this is way too much noise. That list is actually an
+interpretation of the alerts received, rewritten to be more
+human-readable.
+
+The current Nagios dashboard, that said, is pretty useful in the sense
+that she can ignore all of those emails and just look at the dashboard
+to see what's *actually* going on right now. This sometimes causes her
+to miss some problems, however.
+
+TODO: what does she want out of monitoring?
+
+### Note
+
+The alert list was created with the following utterly horrible shell
+pipeline:
+
+    notmuch search --format=sexp  tag:nagios date:2021-06-20.. \
+      | sed -n '/PROBLEM/{s/.*:subject "//;s/" :query .*//;s/.*Alert: [^\/ ]*[\/ ]//;p}' \
+      | sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
+        -e 's/disk usage .*/disk usage/' \
+        -e 's/mirror static sync.*/mirror static sync/' \
+        -e 's/unwanted.*/unwanted/' \
+        -e '/DNS/s/ - .*//' \
+        -e 's/process - .*/process/' \
+        -e 's/network service - .*/network service/' \
+        -e 's/backup - .*/backup/' \
+        -e 's/mirror sync - .*/mirror sync/' \
+        | sort | uniq -c | sort -n
+
+Then the alerts were parsed by a TPA brain. Some alerts were redacted
+because they were considered mostly noise.
+
+## Ethan, the service admin
+
+TODO: what do service admins want?
 
 # Proposal
 
-- 
GitLab