Verified Commit e6838005 authored by anarcat's avatar anarcat

start crafting service admins

parent dcda9e6b
......@@ -259,15 +259,96 @@ monitoring system, as provided by TPA.
syslog-ng, rsyslog, journald, or loki are currently out of scope of
this proposal
# Examples or Personas
Examples:
* ...
Counter examples:
* ...
# Personas
## Jackie, the TPA admin
Jackie is a member of the TPA team. She has access to the Puppet
repository, and all other Git repositories managed by TPA. She has
access to everything and the kitchen sink, and is generally asked to
fix all of this on a regular basis.
She sometimes ends up rotating as the "star of the week", which makes her
responsible for handling "interruptions", new tickets, and also
keeping an eye on the monitoring server. This involves responding to
alerts like, by order of frequency in the last year:
* 2805 pending upgrades (packages blocked from unattended upgrades)
* 2325 pending restarts (services blocked from needrestart) or reboots
* 1818 load alerts
* 1709 disk usage alerts
* 1062 puppet catalog failures
* 999 uptime alerts (after reboots)
* 843 reachability alerts
* 602 process count alerts
* 585 swap usage alerts
* 499 backup alerts
* 484 systemd alerts (e.g. systemd says "degraded" and you get to
figure out what didn't start)
* 383 zombie alerts
* 199 missing process (e.g. "0 postgresql processes")
* 168 unwanted processes or network services
* numerous warnings about service admin specific things:
  * 129 mirror static sync alert storms (15 at a time), mostly host
    unreachability warnings
  * 69 bridgedb
  * 67 collector
  * 26 out of date chroots
  * 14 translation cron - stuck
  * 17 mail queue (polyanthum)
* 96 RAID - DRBD warnings, mostly false alerts
* 95 SSL cert warnings about db.torproject.org, all about the same
problem
* 94 DNS SOA synchronization alerts
* 88 DNSSEC alerts (81 delegation and signature expiry, 4 DS expiry,
2 security delegations)
* 69 hardware RAID warnings
* 69 Ganeti cluster verification warnings
* numerous alerts about NRPE availability, often falsely flagged as an
error in a specific service (e.g. "SSL cert - host")
* 28 unbound trust alerts
* 24 alerts about unexpected software RAID
* 19 SAN health alerts
* 5 false (?) alerts about mdadm resyncing
* 3 expiring Let's Encrypt X509 certificates alerts
* 3 redis liveness alerts
* 4 onionoo backend reachability alerts
Jackie finds that this is way too much noise. That list is itself an
interpretation of the raw alerts received, condensed to make them more
human-readable.
The current Nagios dashboard, that said, is pretty useful in the sense
that she can ignore all of those emails and just look at the dashboard
to see what's *actually* going on right now. This sometimes causes her
to miss some problems, however.
TODO: what does she want out of monitoring?
### Note
The alert list was created with the following utterly horrible shell
pipeline:
notmuch search --format=sexp tag:nagios date:2021-06-20.. \
| sed -n '/PROBLEM/{s/.*:subject "//;s/" :query .*//;s/.*Alert: [^\/ ]*[\/ ]//;p}' \
| sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
-e 's/disk usage .*/disk usage/' \
-e 's/mirror static sync.*/mirror static sync/' \
-e 's/unwanted.*/unwanted/' \
-e '/DNS/s/ - .*//' \
-e 's/process - .*/process/' \
-e 's/network service - .*/network service/' \
-e 's/backup - .*/backup/' \
-e 's/mirror sync - .*/mirror sync/' \
| sort | uniq -c | sort -n
Then the alerts were parsed by a TPA brain. Some alerts were left out
because they were considered mostly noise.
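To illustrate what the normalization and counting stages of that pipeline do, here is a minimal sketch that runs a reduced set of the same `sed` expressions over a few made-up alert subjects. The `normalize_alerts` function name and the sample subjects are hypothetical, and the subjects are assumed to be in the shape left by the first `sed` stage (host name and `Alert:` prefix already stripped); real Nagios subjects may differ.

```shell
# Hypothetical helper: collapse alert subjects into categories, then tally them.
# Uses a subset of the sed expressions from the pipeline above.
normalize_alerts() {
    sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
        -e 's/disk usage .*/disk usage/' \
        -e 's/process - .*/process/' \
      | sort | uniq -c | sort -n
}

# Made-up sample subjects, for illustration only.
printf '%s\n' \
    'disk usage - /srv is WARNING' \
    'disk usage - / is CRITICAL' \
    'process - postgresql is CRITICAL' \
    'load is WARNING' \
  | normalize_alerts
```

The state suffix is stripped first, so that the per-check expressions only have to deal with the check name and its arguments. The two disk usage subjects then collapse into a single category, which `uniq -c` counts and `sort -n` places last.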
## Ethan, the service admin
TODO: what do service admins want?
# Proposal
......