Verified commit 43671378 authored by anarcat

dump bunch more work in tpa-rfc-33 (team#40755)

parent 6b17c87a
@@ -4,8 +4,7 @@ title: TPA-RFC-33: Monitoring
[[_TOC_]]
Summary: TODO.
# Background
@@ -18,10 +17,14 @@ Icinga 1 is not available in Debian bullseye and this is therefore a
mandatory upgrade. Because of the design of the service, it cannot just
be converted over easily, so we are considering alternatives.
This has become urgent as of May 2024, as Debian buster will stop
being supported by [Debian LTS][] in June 2024.
[Debian bullseye upgrades]: howto/upgrades/bullseye
[Debian Icinga 1 package]: https://tracker.debian.org/pkg/icinga
[Icinga 2]: https://tracker.debian.org/pkg/icinga2
[switching to Prometheus]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864
[Debian LTS]: https://wiki.debian.org/LTS
## History
@@ -96,12 +99,6 @@ minutes. It processes about 200 checks per minute.
[previously estimated]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965
[tor-nagios.git repository]: https://gitweb.torproject.org/admin/tor-nagios.git/
Icinga is running version 1.14, from Debian buster.
TODO: document the upgrade problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695
TODO: document puppetization problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
TODO: document why Nagios is not managed by Puppet (so that Nagios tests the Puppet config; a rejected idea)
## Problem statement
The current Icinga deployment cannot be upgraded to Debian bullseye as is.
@@ -109,21 +106,18 @@ At the very least the post-receive hook in git needs to be rewritten to
support the Icinga 2 configuration files, since Icinga 2 has dropped
support for Nagios configurations.
The Nagios configuration is error-prone: because of the way the script
is deployed (post-receive), an error in the configuration can go
undetected and leave changes undeployed for extended periods of time,
which has led some services to stay unmonitored.
Having Nagios be a separate source of truth for host information was
originally a deliberate decision: it allowed for external verification
of configurations deployed by Puppet.
But since new services must be manually configured in Nagios, this has
led to new servers and services not being monitored at all, and in
fact many services do not have any form of monitoring.
The way the NRPE configuration is deployed is also problematic: because
the files get deployed asynchronously, it's common for warnings to pop
@@ -148,6 +142,16 @@ Prometheus/Grafana services. In particular, both:
[DSA Puppet configuration]: https://salsa.debian.org/dsa-team/mirror/dsa-puppet
[a custom Puppet module]: https://salsa.debian.org/dsa-team/mirror/dsa-puppet-weasel-mon
Note that weasel has started rewriting the [DSA Puppet
configuration][] to automatically generate Icinga 2 configurations using
[a custom Puppet module][], ditching the "push to git" design. This has
the limitation that service admins will not be able to modify the
alerting configuration unless they somehow get access to the Puppet
repository. We do have the option to [automate Nagios configuration][],
of course, either with DSA's work or another Nagios module.
[automate Nagios configuration]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
## Definitions
- **"system" metrics**: directly under the responsibility of TPA, for
@@ -285,8 +289,10 @@ monitoring system, as provided by TPA.
### Non-Goals
- **SLA**: we do not plan on providing any specific Service Level
Agreement through this proposal; those are still defined in
[TPA-RFC-2: Support][].
[TPA-RFC-2: Support]: policy/tpa-rfc-2-support
- **on-call rotation**: we do not provide 24/7 on-call services, nor do
we subscribe to an on-call schedule - there is a "star of the week"
@@ -294,6 +300,14 @@ monitoring system, as provided by TPA.
interruptions, but they do so during work hours, in their own time, in
accordance with [TPA-RFC-2: Support][]
In particular, we do not introduce notifications that "page"
operators on their mobile devices; instead, we keep the current
"email / IRC" notifications, with optional integration with GitLab.
We will absolutely not wake up humans at night for servers. If we
desire 24/7 availability, shifts should be implemented with staff in
multiple time zones instead.
- **escalation**: we do not need to call person Y when person X fails to
answer, mainly because we do not expect either X or Y to answer alerts
immediately
@@ -302,18 +316,30 @@ monitoring system, as provided by TPA.
of our monitoring systems, the questions of whether we use syslog-ng,
rsyslog, journald, or loki are currently out of scope of this proposal
- **exporter policy**: we need to clarify how new exporters are set
up; this is covered by another issue, in [tpo/tpa/team#41280][]
- **incident response**: we need to improve our incident response
procedures, but those are not covered by this policy, see
[tpo/tpa/team#40421][] for that discussion
- **public dashboards**: we currently copy-paste screenshots into
GitLab when we want to share data publicly and will continue to do
so, see the [Authentication](#authentication) section for more details
[tpo/tpa/team#40421]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421
[tpo/tpa/team#41280]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41280
# Personas
## Ethan, the TPA admin
Ethan is a member of the TPA team. He has access to the Puppet
repository, and all other Git repositories managed by TPA. He has
access to everything and the kitchen sink, and is generally asked to fix
all of this on a regular basis.
He sometimes ends up rotating as the "star of the week", which makes him
responsible for handling "interruptions", new tickets, and also keeping
an eye on the monitoring server. This involves responding to alerts
like, in order of frequency over the last year:
@@ -329,7 +355,7 @@ like, by order of frequency in the last year:
- 585 swap usage alerts
- 499 backup alerts
- 484 systemd alerts (e.g. systemd says "degraded" and you get to
figure out what didn't start)
- 383 zombie alerts
- 199 missing process (e.g. "0 postgresql processes")
- 168 unwanted processes or network services
@@ -359,16 +385,21 @@ like, by order of frequency in the last year:
- 3 redis liveness alerts
- 4 onionoo backend reachability alerts
Ethan finds this is way too much noise. That list is actually an
interpretation of the actual alerts received, to make them more
human-readable.
The current Nagios dashboard, that said, is pretty useful in the sense
that he can ignore all of those emails and just look at the dashboard
to see what's *actually* going on right now. This sometimes causes him
to miss some problems, however.
Ethan uses Grafana for trending, to diagnose issues and see long-term
trends. He builds dashboards by clicking around Grafana and saving the
resulting JSON in the [grafana-dashboards git repository](https://gitlab.com/anarcat/grafana-dashboards).
Ethan would love to monitor user endpoints better, and particularly
wants to have [better monitoring for webserver response times][].
### Note
@@ -391,9 +422,33 @@ pipeline:
Then the alerts were parsed by a TPA brain. Some alerts were redacted
because they were considered mostly noise.
## Jackie, the service admin
Jackie manages a service deployed on TPA servers, but doesn't have
administrative access on the servers or the monitoring servers, either
Nagios or Prometheus. She can, however, submit merge requests to the
[prometheus-alerts][] repository to deploy targets and alerting
rules. She also has access to the Grafana server with a shared
password that gets passed along.
She would love to use a more normal authentication method than sharing
the password, which feels wrong. She wonders how exporters should be
set up: all on different ports, or as subpaths on the same domain name?
Should there be authentication and transport-layer security (TLS)?
She also feels that clicking through Grafana to build dashboards is
suboptimal and would love to have a more declarative mechanism to
build dashboards; in fact, she has worked on such a system based on
Python and [grafanalib][]. She directly participates in the
[discussion to automate deployment of Grafana dashboards][tpo/tpa/team#41312].
She would love to get [alerts over Matrix][], but currently receives
notifications by email, sometimes to a Mailman mailing list.
[prometheus-alerts]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
[grafanalib]: https://github.com/weaveworks/grafanalib
[alerts over Matrix]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[tpo/tpa/team#41312]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41312
# Proposal
@@ -436,6 +491,13 @@ TODO: Get a sense of what metrics we have and what we want to keep.
https://github.com/chrj/prometheus-dnssec-exporter
https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter
https://github.com/hipages/php-fpm_exporter
https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
fail2ban: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544
gitlab issue counts: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591
gitlab mail processing: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410
network interfaces: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387
ipmi dashboard: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569
technical debt: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456
https://man.sr.ht/ops/monitoring.md
https://git.sr.ht/~sircmpwn/metrics.sr.ht
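
To make the above wishlist more concrete, here is a minimal sketch of
how one additional exporter could be wired into the Prometheus scrape
configuration; the job name, host and port are placeholders, and on
our systems this stanza would presumably be generated by Puppet rather
than written by hand:

```yaml
# Hypothetical scrape job for one additional exporter (e.g. a fail2ban
# exporter); the job name, target host and port are illustrative only.
scrape_configs:
  - job_name: fail2ban
    scrape_interval: 1m
    static_configs:
      - targets:
          - "example-host.torproject.org:9191"
```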
@@ -487,7 +549,48 @@ TODO: review https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertma
## Authentication
We should use this opportunity to fix authentication on the Prometheus
and Grafana servers.
### Current situation
Authentication is currently handled as follows:
* Nagios: static htpasswd file, not managed by Puppet, modified
manually when onboarding/offboarding
* Prometheus 1: static htpasswd file with dummy password managed by
Puppet
* Grafana 1: same, with an extra admin password kept in Trocla
* Prometheus 2: static htpasswd file with real admin password
deployed, extra password generated for [prometheus-alerts][]
continuous integration (CI) validation, all deployed through Puppet
* Grafana 2: static htpasswd file with real admin password for
"admin" and "metrics", both of which are shared with an unclear
number of people
### Proposed changes
The plan was originally to just [delegate authentication to
Grafana](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40124) but we're concerned this is going to introduce yet another
authentication source, which we want to avoid. Instead, we're looking
at re-enabling the `webPassword` field in [LDAP](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap), which was
mysteriously dropped in `userdir-ldap-cgi`'s `7cba921` (drop many
fields from update form, 2016-03-20), but could be trivially re-enabled.
This would conveniently allow any tor-internal person to access the
dashboards, provided they can jump through the LDAP hoops.
Access level controls would be managed inside the Grafana database.
Prometheus servers would reuse the same password file, allowing
tor-internal users to access a more "raw" interface for alerts and
queries.
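
As a sketch of what this could look like on the Prometheus side, the
server's built-in `--web.config.file` mechanism accepts bcrypt-hashed
credentials; a file generated from the LDAP `webPassword` data
(hypothetical usernames and dummy hashes below) might look like this:

```yaml
# web-config.yml, passed to Prometheus via --web.config.file.
# Usernames and hashes are illustrative; in practice this file would be
# generated by Puppet from the LDAP webPassword field.
basic_auth_users:
  # bcrypt hashes, e.g. generated with: htpasswd -nBC 10 "" | tr -d ':\n'
  ethan: "$2y$10$wJ/eyZp0O1Qn0Vv0F0e1KOn7xJq1rYvXk9a9w5yqKk9vJb0m1t0Gm"
  jackie: "$2y$10$Xk9a9w5yqKk9vJb0m1t0GuwJ/eyZp0O1Qn0Vv0F0e1KOn7xJq1rY"
```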
We briefly considered making Grafana dashboards publicly
available, but ultimately rejected this idea, as it would mean having
two entirely different time series datasets, which would be too hard
to reliably separate. That would also impose a combinatorial explosion
of servers if we want to provide high availability.
## Trending: Grafana
@@ -518,10 +621,6 @@ this document, but a few glossary items seem worth defining here:
TODO: rules in Puppet and/or git?
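
Whichever of Puppet or git ends up carrying them, the rules themselves
are plain Prometheus YAML; a minimal sketch of the kind of rule that
could live in the [prometheus-alerts][] repository (the alert name,
threshold and labels are made up for illustration):

```yaml
# Illustrative alerting rule, not an actual TPA rule.
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 10 minutes"
```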
TODO: incident response procedures?
### Dashboard
We will deploy a [Karma](https://github.com/prymitive/karma) dashboard to expose Prometheus alerts to
@@ -564,6 +663,14 @@ systems][], part of the [Site Reliability Engineering book][].
[monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
[Philosophy on Alerting]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic
### Silences
https://github.com/prymitive/kthxbye
### Inhibitions
TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert
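
As a rough illustration of what inhibitions look like in Alertmanager
(the alert names reuse the hypothetical rule sketched earlier), a
host-down alert can be used to suppress lower-severity alerts about
the same instance:

```yaml
# Alertmanager inhibition sketch: while InstanceDown fires for a host,
# suppress warning-level alerts about that same instance.
inhibit_rules:
  - source_matchers:
      - 'alertname = "InstanceDown"'
    target_matchers:
      - 'severity = "warning"'
    equal: ['instance']
```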
### Unit tests
TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
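
Upstream's `promtool test rules` would let us check rules like the
hypothetical `InstanceDown` sketch above in the [prometheus-alerts][]
CI; a minimal test file (assuming the rule lives in `alerts.yml`)
could look like:

```yaml
# promtool unit test sketch: verify that the illustrative InstanceDown
# rule fires once a target has been down for more than 10 minutes.
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="example-host:9100"}'
        values: "0x20"  # down for the whole 20-minute window
    alert_rule_test:
      - eval_time: 15m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: warning
              job: node
              instance: example-host:9100
            exp_annotations:
              summary: "example-host:9100 has been unreachable for 10 minutes"
```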
@@ -647,49 +754,26 @@ TODO: https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-so
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2801540
## SLA and notifications improvements

We make no change to the current support policy ([TPA-RFC-2][]); in
particular, this does not introduce a new "pager" service that rings
operators on their phones.

[TPA-RFC-2]: policy/tpa-rfc-2-support

We keep the current "email / IRC" notifications, with the possible
addition of GitLab incidents/alerts.
We *MAY* introduce push notifications (e.g. with [ntfy.sh](https://ntfy.sh/) or
Signal) if we significantly trim down the amount of noise emanating
from the monitoring server, and *only* if we send notifications during
business hours of the affected parties.
If we do want to improve on SLA metrics, we should consider using
[Sloth](https://github.com/slok/sloth), an "easy and simple Prometheus SLO (service level
objectives) generator" which generates Grafana dashboards and alerts.
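
For the record, a Sloth service level objective is itself a small YAML
file from which the tool generates the recording rules, alerts and
dashboards; a hedged sketch, with a made-up service and queries, might
look like:

```yaml
# Sloth SLO sketch; service name, queries and objective are invented.
# "sloth generate" would turn this into Prometheus recording and
# alerting rules.
version: "prometheus/v1"
service: "web"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Most HTTP requests should succeed."
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: WebHighErrorRate
      page_alert:
        disable: true
      ticket_alert:
        labels:
          severity: warning
```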
[Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications.
## Additional metrics
https://promhippie.github.io/hetzner_exporter/
https://promhippie.github.io/hcloud_exporter/
https://github.com/ganeti/prometheus-ganeti-exporter
## Flap detection
https://github.com/prometheus/alertmanager/issues/204
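
Alertmanager has no real flap detection (see the issue above). The
closest approximations live on the Prometheus side: longer `for:`
durations and, in recent Prometheus releases (2.42+), the
`keep_firing_for` rule field, which keeps an alert firing for a while
after its expression stops matching, so short recoveries do not
generate resolve/re-fire noise. A hedged sketch, reusing the
illustrative rule from earlier:

```yaml
# Flap-dampening sketch: require the condition to hold for 10 minutes
# before firing, then keep the alert firing for 15 minutes after it
# clears. Names and thresholds are illustrative.
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10m
        keep_firing_for: 15m
        labels:
          severity: warning
```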
@@ -708,3 +792,13 @@ This proposal is currently in the `draft` state.
This proposal is discussed in [tpo/tpa/team#40755][].
[tpo/tpa/team#40755]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755
## Related issues
* [Nagios server retirement issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695)
* [automate deployment of Grafana dashboards][tpo/tpa/team#41312]
* [exporter policy][tpo/tpa/team#41280]
* [improve incident response procedures][tpo/tpa/team#40421]
* [better monitoring for webserver response times][]
[better monitoring for webserver response times]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568