Verified commit 43671378 authored by anarcat

dump bunch more work in tpa-rfc-33 (team#40755)

parent 6b17c87a
@@ -4,8 +4,7 @@ title: TPA-RFC-33: Monitoring
[[_TOC_]]
Summary: TODO.
# Background
@@ -18,10 +17,14 @@ Icinga 1 is not available in Debian bullseye and this is therefore a
mandatory upgrade. Because of the design of the service, it cannot just
be converted over easily, so we are considering alternatives.
This has become urgent as of May 2024, as Debian buster will stop
being supported by [Debian LTS][] in June 2024.
[Debian bullseye upgrades]: howto/upgrades/bullseye
[Debian Icinga 1 package]: https://tracker.debian.org/pkg/icinga
[Icinga 2]: https://tracker.debian.org/pkg/icinga2
[switching to Prometheus]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864
[Debian LTS]: https://wiki.debian.org/LTS
## History
@@ -96,12 +99,6 @@ minutes. It processes about 200 checks per minute.
[previously estimated]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965
[tor-nagios.git repository]: https://gitweb.torproject.org/admin/tor-nagios.git/
Icinga is running version 1.14, from Debian buster.
TODO: document the upgrade problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695
TODO: document puppetization problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
TODO: document why Nagios is not managed by Puppet (so that Nagios tests the Puppet config; a rejected idea)
## Problem statement
The current Icinga deployment cannot be upgraded to Debian bullseye as is.
@@ -109,21 +106,18 @@ At the very least the post-receive hook in git needs to be rewritten to
support the Icinga 2 configuration files, since Icinga 2 has dropped
support for Nagios configurations.
The Nagios configuration is error-prone: because of the way the script
is deployed (post-receive), an error in the configuration can go
undetected and leave changes undeployed for extended periods of time,
which has led some services to stay unmonitored.
Having Nagios be a separate source of truth for host information was
originally a deliberate decision: it allowed for external verification
of configurations deployed by Puppet.
But since new services must be manually configured in Nagios, this has
led to new servers and services not being monitored at all, and in
fact many services do not have any form of monitoring.
The way the NRPE configuration is deployed is also problematic: because
the files get deployed asynchronously, it's common for warnings to pop
@@ -148,6 +142,16 @@ Prometheus/Grafana services. In particular, both:
[DSA Puppet configuration]: https://salsa.debian.org/dsa-team/mirror/dsa-puppet
[a custom Puppet module]: https://salsa.debian.org/dsa-team/mirror/dsa-puppet-weasel-mon
Note that weasel has started rewriting the [DSA Puppet
configuration][] to automatically generate Icinga 2 configurations using
[a custom Puppet module][], ditching the "push to git" design. This has
the limitation that service admins will not be able to modify the
alerting configuration unless they somehow get access to the Puppet
repository. We do have the option to [automate Nagios configuration][],
of course, either with DSA's work or another Nagios module.
[automate Nagios configuration]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
## Definitions
- **"system" metrics**: directly under the responsibility of TPA, for
@@ -285,8 +289,10 @@ monitoring system, as provided by TPA.
### Non-Goals
- **SLA**: we do not plan on providing any specific Service Level
Agreement through this proposal; those are still defined in
[TPA-RFC-2: Support][].
[TPA-RFC-2: Support]: policy/tpa-rfc-2-support
- **on-call rotation**: we do not provide 24/7 on-call services, nor do
we subscribe to an on-call schedule - there is a "star of the week"
@@ -294,6 +300,14 @@ monitoring system, as provided by TPA.
interruptions, but they do so during work hours, in their own time, in
accordance with [TPA-RFC-2: Support][]
In particular, we do not introduce notifications that "page"
operators on their mobile devices; instead, we keep the current
"email / IRC" notifications, with optional integration with GitLab.
We will absolutely not wake up humans at night for servers. If we
desire 24/7 availability, shifts should be implemented with staff in
multiple time zones instead.
- **escalation**: we do not need to call person Y when person X fails to
answer, mainly because we do not expect either X or Y to answer alerts
immediately
@@ -302,18 +316,30 @@ monitoring system, as provided by TPA.
of our monitoring systems, the questions of whether we use syslog-ng,
rsyslog, journald, or loki are currently out of scope of this proposal
- **exporter policy**: we need to clarify how new exporters are set
up; this is covered by another issue, in [tpo/tpa/team#41280][]
- **incident response**: we need to improve our incident response
procedures, but those are not covered by this policy, see
[tpo/tpa/team#40421][] for that discussion
- **public dashboards**: we currently copy-paste screenshots into
GitLab when we want to share data publicly and will continue to do
so, see the [Authentication](#authentication) section for more details
[tpo/tpa/team#40421]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421
[tpo/tpa/team#41280]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41280
# Personas
## Ethan, the TPA admin
Ethan is a member of the TPA team. He has access to the Puppet
repository, and all other Git repositories managed by TPA. He has
access to everything and the kitchen sink, and is generally asked to fix
all of this on a regular basis.
He sometimes ends up rotating as the "star of the week", which makes him
responsible for handling "interruptions", new tickets, and also keeping
an eye on the monitoring server. This involves responding to alerts
like, in order of frequency over the last year:
@@ -329,7 +355,7 @@ like, by order of frequency in the last year:
- 585 swap usage alerts
- 499 backup alerts
- 484 systemd alerts (e.g. systemd says "degraded" and you get to
figure out what didn't start)
- 383 zombie alerts
- 199 missing process (e.g. "0 postgresql processes")
- 168 unwanted processes or network services
@@ -359,16 +385,21 @@ like, by order of frequency in the last year:
- 3 redis liveness alerts
- 4 onionoo backend reachability alerts
Ethan finds this is way too much noise. That list is actually an
interpretation of the actual alerts received, to make them more
human-readable.
The current Nagios dashboard, that said, is pretty useful in the sense
that he can ignore all of those emails and just look at the dashboard
to see what's *actually* going on right now. This sometimes causes him
to miss some problems, however.
Ethan uses Grafana for trending, to diagnose issues and see long-term
trends. He builds dashboards by clicking around Grafana and saving the
resulting JSON in the [grafana-dashboards git repository](https://gitlab.com/anarcat/grafana-dashboards).
Ethan would love to monitor user endpoints better, and particularly
wants to have [better monitoring for webserver response times][].
### Note
@@ -391,9 +422,33 @@ pipeline:
Then the alerts were parsed by a TPA brain. Some alerts were redacted
because they were considered mostly noise.
## Jackie, the service admin
Jackie manages a service deployed on TPA servers, but doesn't have
administrative access on the servers or the monitoring servers, either
Nagios or Prometheus. She can, however, submit merge requests to the
[prometheus-alerts][] repository to deploy targets and alerting
rules. She also has access to the Grafana server with a shared
password that gets passed along.
She would love to use a more normal authentication method than sharing
the password, which feels wrong. She wonders how exporters should be
set up: all on different ports, or as subpaths on the same domain name?
Should there be authentication and transport-layer security (TLS)?
She also feels that clicking through Grafana to build dashboards is
suboptimal and would love to have a more declarative mechanism to
build dashboards; in fact, she has worked on such a system based on
Python and [grafanalib][]. She directly participates in the
[discussion to automate deployment of Grafana dashboards][tpo/tpa/team#41312].
She would love to get [alerts over Matrix][], but currently receives
notifications by email, sometimes to a Mailman mailing list.
[prometheus-alerts]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
[grafanalib]: https://github.com/weaveworks/grafanalib
[alerts over Matrix]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[tpo/tpa/team#41312]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41312
# Proposal
@@ -436,6 +491,13 @@ TODO: Get a sense of what metrics we have and what we want to keep.
https://github.com/chrj/prometheus-dnssec-exporter
https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter
https://github.com/hipages/php-fpm_exporter
https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
fail2ban: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41544
gitlab issue counts: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591
gitlab mail processing: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410
network interfaces: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41387
ipmi dashboard: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41569
technical debt: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456
https://man.sr.ht/ops/monitoring.md
https://git.sr.ht/~sircmpwn/metrics.sr.ht
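
To make the above wishlist more concrete, here is a minimal sketch of
how one additional exporter could be wired into the Prometheus scrape
configuration; the job name, host and port are placeholders, and on
our systems this stanza would presumably be generated by Puppet rather
than written by hand:

```yaml
# Hypothetical scrape job for one additional exporter (e.g. a fail2ban
# exporter); the job name, target host and port are illustrative only.
scrape_configs:
  - job_name: fail2ban
    scrape_interval: 1m
    static_configs:
      - targets:
          - "example-host.torproject.org:9191"
```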
@@ -487,7 +549,48 @@ TODO: review https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertma
## Authentication
We should use this opportunity to fix authentication on the Prometheus
and Grafana servers.
### Current situation
Authentication is currently handled as follows:
* Nagios: static htpasswd file, not managed by Puppet, modified
manually when onboarding/offboarding
* Prometheus 1: static htpasswd file with dummy password managed by
Puppet
* Grafana 1: same, with an extra admin password kept in Trocla
* Prometheus 2: static htpasswd file with real admin password
deployed, extra password generated for [prometheus-alerts][]
continuous integration (CI) validation, all deployed through Puppet
* Grafana 2: static htpasswd file with real admin password for
"admin" and "metrics", both of which are shared with an unclear
number of people
### Proposed changes
The plan was originally to just [delegate authentication to
Grafana](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40124) but we're concerned this is going to introduce yet another
authentication source, which we want to avoid. Instead, we're looking
at re-enabling the `webPassword` field in [LDAP](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap), which was
mysteriously dropped in `userdir-ldap-cgi`'s `7cba921` (drop many
fields from update form, 2016-03-20), but could be trivially re-enabled.
This would conveniently allow any tor-internal person to access the
dashboards, provided they can jump through the LDAP hoops.
Access level controls would be managed inside the Grafana database.
Prometheus servers would reuse the same password file, allowing
tor-internal users to access a more "raw" interface for alerts and
queries.
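
As a sketch of what this could look like on the Prometheus side, the
server's built-in `--web.config.file` mechanism accepts bcrypt-hashed
credentials; a file generated from the LDAP `webPassword` data
(hypothetical usernames and dummy hashes below) might look like this:

```yaml
# web-config.yml, passed to Prometheus via --web.config.file.
# Usernames and hashes are illustrative; in practice this file would be
# generated by Puppet from the LDAP webPassword field.
basic_auth_users:
  # bcrypt hashes, e.g. generated with: htpasswd -nBC 10 "" | tr -d ':\n'
  ethan: "$2y$10$wJ/eyZp0O1Qn0Vv0F0e1KOn7xJq1rYvXk9a9w5yqKk9vJb0m1t0Gm"
  jackie: "$2y$10$Xk9a9w5yqKk9vJb0m1t0GuwJ/eyZp0O1Qn0Vv0F0e1KOn7xJq1rY"
```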
We briefly considered making Grafana dashboards publicly
available, but ultimately rejected this idea, as it would mean having
two entirely different time series datasets, which would be too hard
to reliably separate. That would also impose a combinatorial explosion
of servers if we want to provide high availability.
## Trending: Grafana
@@ -518,10 +621,6 @@ this document, but a few glossary items seem worth defining here:
TODO: rules in Puppet and/or git?
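
Whichever of Puppet or git ends up carrying them, the rules themselves
are plain Prometheus YAML; a minimal sketch of the kind of rule that
could live in the [prometheus-alerts][] repository (the alert name,
threshold and labels are made up for illustration):

```yaml
# Illustrative alerting rule, not an actual TPA rule.
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 10 minutes"
```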
TODO: incident response procedures?
### Dashboard
We will deploy a [Karma](https://github.com/prymitive/karma) dashboard to expose Prometheus alerts to
@@ -564,6 +663,14 @@ systems][], part of the [Site Reliability Engineering book][].
[monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
[Philosophy on Alerting]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic
### Silences
https://github.com/prymitive/kthxbye
### Inhibitions
TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert
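
As a rough illustration of what inhibitions look like in Alertmanager
(the alert names reuse the hypothetical rule sketched earlier), a
host-down alert can be used to suppress lower-severity alerts about
the same instance:

```yaml
# Alertmanager inhibition sketch: while InstanceDown fires for a host,
# suppress warning-level alerts about that same instance.
inhibit_rules:
  - source_matchers:
      - 'alertname = "InstanceDown"'
    target_matchers:
      - 'severity = "warning"'
    equal: ['instance']
```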
### Unit tests
TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
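
Upstream's `promtool test rules` would let us check rules like the
hypothetical `InstanceDown` sketch above in the [prometheus-alerts][]
CI; a minimal test file (assuming the rule lives in `alerts.yml`)
could look like:

```yaml
# promtool unit test sketch: verify that the illustrative InstanceDown
# rule fires once a target has been down for more than 10 minutes.
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="node", instance="example-host:9100"}'
        values: "0x20"  # down for the whole 20-minute window
    alert_rule_test:
      - eval_time: 15m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: warning
              job: node
              instance: example-host:9100
            exp_annotations:
              summary: "example-host:9100 has been unreachable for 10 minutes"
```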
@@ -647,49 +754,26 @@ TODO: https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-so
https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2801540
## SLA and notifications improvements

We make no change to the current support policy ([TPA-RFC-2][]); in
particular, this does not introduce a new "pager" service that rings
operators on their phones.

[TPA-RFC-2]: policy/tpa-rfc-2-support

We keep the current "email / IRC" notifications, with the possible
addition of GitLab incidents/alerts.
We *MAY* introduce push notifications (e.g. with [ntfy.sh](https://ntfy.sh/) or
Signal) if we significantly trim down the amount of noise emanating
from the monitoring server, and *only* if we send notifications during
business hours of the affected parties.
If we do want to improve on SLA metrics, we should consider using
[Sloth](https://github.com/slok/sloth), an "easy and simple Prometheus SLO (service level
objectives) generator" which generates Grafana dashboards and alerts.
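
For the record, a Sloth service level objective is itself a small YAML
file from which the tool generates the recording rules, alerts and
dashboards; a hedged sketch, with a made-up service and queries, might
look like:

```yaml
# Sloth SLO sketch; service name, queries and objective are invented.
# "sloth generate" would turn this into Prometheus recording and
# alerting rules.
version: "prometheus/v1"
service: "web"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Most HTTP requests should succeed."
    sli:
      events:
        error_query: sum(rate(http_requests_total{code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      name: WebHighErrorRate
      page_alert:
        disable: true
      ticket_alert:
        labels:
          severity: warning
```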
[Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications.
## Additional metrics
https://promhippie.github.io/hetzner_exporter/
https://promhippie.github.io/hcloud_exporter/
https://github.com/ganeti/prometheus-ganeti-exporter
## Flap detection
https://github.com/prometheus/alertmanager/issues/204
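
Alertmanager has no real flap detection (see the issue above). The
closest approximations live on the Prometheus side: longer `for:`
durations and, in recent Prometheus releases (2.42+), the
`keep_firing_for` rule field, which keeps an alert firing for a while
after its expression stops matching, so short recoveries do not
generate resolve/re-fire noise. A hedged sketch, reusing the
illustrative rule from earlier:

```yaml
# Flap-dampening sketch: require the condition to hold for 10 minutes
# before firing, then keep the alert firing for 15 minutes after it
# clears. Names and thresholds are illustrative.
groups:
  - name: node
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10m
        keep_firing_for: 15m
        labels:
          severity: warning
```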
@@ -708,3 +792,13 @@ This proposal is currently in the `draft` state.
This proposal is discussed in [tpo/tpa/team#40755][].
[tpo/tpa/team#40755]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755
## Related issues
* [Nagios server retirement issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695)
* [automate deployment of Grafana dashboards][tpo/tpa/team#41312]
* [exporter policy][tpo/tpa/team#41280]
* [improve incident response procedures][tpo/tpa/team#40421]
* [better monitoring for webserver response times][]
[better monitoring for webserver response times]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40568