Commit e6838005 (verified), authored 2 years ago by anarcat
start crafting service admins
Parent: dcda9e6b
policy/tpa-rfc-33-monitoring.md: 90 additions, 9 deletions
@@ -259,15 +259,96 @@ monitoring system, as provided by TPA.
syslog-ng, rsyslog, journald, or loki are currently out of scope of
this proposal
# Examples or Personas
Examples:
* ...
Counter examples:
* ...
# Personas
## Jackie, the TPA admin
Jackie is a member of the TPA team. She has access to the Puppet
repository, and all other Git repositories managed by TPA. She has
access to everything and the kitchen sink, and is generally asked to
fix all of this on a regular basis.
She sometimes ends up rotating as the "star of the week", which makes her
responsible for handling "interruptions", new tickets, and also
keeping an eye on the monitoring server. This involves responding to
alerts like the following, listed by order of frequency over the last year:
* 2805 pending upgrades (packages blocked from unattended upgrades)
* 2325 pending restarts (services blocked from needrestart) or reboots
* 1818 load alerts
* 1709 disk usage alerts
* 1062 puppet catalog failures
* 999 uptime alerts (after reboots)
* 843 reachability alerts
* 602 process count alerts
* 585 swap usage alerts
* 499 backup alerts
* 484 systemd alerts (e.g. systemd says "degraded" and you get to
  figure out what didn't start)
* 383 zombie alerts
* 199 missing process (e.g. "0 postgresql processes")
* 168 unwanted processes or network services
* numerous warnings about service admin specific things:
  * 129 mirror static sync alert storms (15 at a time), mostly host
    unreachability warnings
  * 69 bridgedb
  * 67 collector
  * 26 out of date chroots
  * 14 translation cron - stuck
  * 17 mail queue (polyanthum)
* 96 RAID - DRBD warnings, mostly false alerts
* 95 SSL cert warnings about db.torproject.org, all about the same
  problem
* 94 DNS SOA synchronization alerts
* 88 DNSSEC alerts (81 delegation and signature expiry, 4 DS expiry,
  2 security delegations)
* 69 hardware RAID warnings
* 69 Ganeti cluster verification warnings
* numerous alerts about NRPE availability, often falsely flagged as an
  error in a specific service (e.g. "SSL cert - host")
* 28 unbound trust alerts
* 24 alerts about unexpected software RAID
* 19 SAN health alerts
* 5 false (?) alerts about mdadm resyncing
* 3 expiring Let's Encrypt X509 certificates alerts
* 3 redis liveness alerts
* 4 onionoo backend reachability alerts
Jackie finds that this is way too much noise. The list is itself an
interpretation of the alerts actually received, condensed to make them
more human-readable.
The current Nagios dashboard, that said, is pretty useful in the sense
that she can ignore all of those emails and just look at the dashboard
to see what's *actually* going on right now. This sometimes causes her
to miss some problems, however.
TODO: what does she want out of monitoring?
### Note
The alert list was created with the following utterly horrible shell
pipeline:

    notmuch search --format=sexp tag:nagios date:2021-06-20.. \
      | sed -n '/PROBLEM/{s/.*:subject "//;s/" :query .*//;s/.*Alert: [^\/ ]*[\/ ]//;p}' \
      | sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
          -e 's/disk usage .*/disk usage/' \
          -e 's/mirror static sync.*/mirror static sync/' \
          -e 's/unwanted.*/unwanted/' \
          -e '/DNS/s/ - .*//' \
          -e 's/process - .*/process/' \
          -e 's/network service - .*/network service/' \
          -e 's/backup - .*/backup/' \
          -e 's/mirror sync - .*/mirror sync/' \
      | sort | uniq -c | sort -n

Then the alerts were parsed by a TPA brain. Some alerts were redacted
because they were considered mostly noise.
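
The normalization stage of that pipeline can be tried in isolation,
without a notmuch mail store. The following is a minimal sketch, not the
script TPA actually ran: `summarize_alerts` is a hypothetical name, only
a subset of the original `sed` expressions is kept, and the three input
lines are made-up samples of what the first `sed` stage is assumed to
emit (a service name followed by its state).

```shell
#!/bin/sh
# Hypothetical helper (not part of the original pipeline): collapse
# per-host alert names read from stdin into broad categories, then
# print one count per category, least frequent first.
summarize_alerts() {
    # Drop the trailing state, then fold details into a category name.
    sed -e 's/ is UNKNOWN.*//' -e 's/ is WARNING.*//' -e 's/ is CRITICAL.*//' \
        -e 's/disk usage .*/disk usage/' \
        -e 's/process - .*/process/' \
        -e 's/backup - .*/backup/' \
    | sort | uniq -c | sort -n
}

# Made-up sample input for illustration only, not real alert data.
printf '%s\n' \
    'disk usage - /srv is WARNING' \
    'disk usage - /var is CRITICAL' \
    'backup - bacula is UNKNOWN' \
    | summarize_alerts
```

With that sample input, the function reports a count of 1 for `backup`
and 2 for `disk usage`. Keeping the substitutions in a function makes it
possible to check each rule against a handful of subjects before running
it over a year of mail.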
## Ethan, the service admin
TODO: what do service admins want?
# Proposal
...
...