lots more notes about prometheus, now all is in the doc (e29dc3c3) · Commits · The Tor Project / TPA / Wiki Replica

policy/tpa-rfc-33-monitoring.md

+302 −2

		@@ -96,6 +96,12 @@ minutes. It processes about 200 checks per minute.
		[previously estimated]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965
		[tor-nagios.git repository]: https://gitweb.torproject.org/admin/tor-nagios.git/

		Icinga is running version 1.14, from Debian buster.

		TODO: document the upgrade problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695
		TODO: document puppetization problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
		TODO: document why nagios is not puppet (so nagios tests the puppet config, rejected idea)

		## Problem statement

		The current Icinga deployment cannot be upgraded without Bullseye as is.
		@@ -391,8 +397,302 @@ TODO: what do service admins want?

		# Proposal

		TODO: overview

		## Architecture overview

		TODO: architecture diagram before / after?

		## Metrics: Prometheus

		In [monitoring distributed systems][], Google defines 4 "golden
		signals", categories of metrics that need to be monitored:

		* Latency: time to service a request
		* Traffic: transactions per second or bandwidth
		* Errors: failure rates, e.g. 500 errors in web servers
		* Saturation: full disks, memory, CPU utilization, etc

		In the book, they argue all four should issue pager alerts, but we
		believe warnings for saturation, except for extreme cases ("disk
		actually full"), might be sufficient.
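
		As a rough sketch of that split, the following rules file warns when
		a disk is predicted to fill up and only raises an error when it is
		actually full. It assumes the node exporter's
		`node_filesystem_avail_bytes` metric; the alert names, thresholds,
		filesystem filter and `severity` values are illustrative
		assumptions, not an agreed configuration.

		```yaml
		groups:
		  - name: saturation-example  # hypothetical group name
		    rules:
		      # Warning: based on the last 6 hours, the filesystem is
		      # predicted to run out of space within 24 hours.
		      - alert: DiskWillFillSoon
		        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24 * 3600) < 0
		        for: 1h
		        labels:
		          severity: warning
		      # Error: the extreme case, the disk is actually full.
		      - alert: DiskFull
		        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} == 0
		        for: 5m
		        labels:
		          severity: error
		```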

		### Inventory

		TODO: Get a sense of what metrics we have and what we want to keep.

		* EDAC: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2908372
		* DRBD:
		https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119
		and https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908
		* unexpected open ports: https://github.com/stanford-esrg/lzr
		* disk full: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792
		* needrestart: https://github.com/liske/needrestart/issues/291
		* cert expirations: https://github.com/joe-elliott/cert-exporter
		* fingerprint checking: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385
		* imap/web roundtrips: https://git.autistici.org/ai3/tools/service-prober
		* puppet: https://github.com/voxpupuli/puppet-prometheus_reporter

		https://github.com/chrj/prometheus-dnssec-exporter
		https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter
		https://github.com/hipages/php-fpm_exporter

		https://man.sr.ht/ops/monitoring.md
		https://git.sr.ht/~sircmpwn/metrics.sr.ht
		https://metrics.sr.ht/rules
		https://metrics.sr.ht/alerts

		### Retention

		TODO: long term storage? https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330

		### Privacy

		TODO: prom1/prom2

		### Self-monitoring

		Prometheus should monitor itself and its [Alertmanager][] for
		outages. Some mechanism should be set to make sure alerts can and do
		get delivered, probably through a "dead man's switch" that
		continuously sends alerts and makes sure they get delivered.

		Prometheus calls this [metamonitoring](https://prometheus.io/docs/practices/alerting/#metamonitoring).
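
		A common way to implement such a dead man's switch is an alerting
		rule that always fires, paired with an external receiver that raises
		the alarm when the alert stops arriving. A minimal sketch, where the
		alert name and the `deadman` label are assumptions made for
		illustration:

		```yaml
		groups:
		  - name: meta-monitoring-example  # hypothetical group name
		    rules:
		      - alert: Watchdog
		        # vector(1) always returns a sample, so this alert fires
		        # for as long as rule evaluation works at all.
		        expr: vector(1)
		        labels:
		          deadman: "true"  # hypothetical label, used only for routing
		        annotations:
		          summary: "Always-firing alert used as a dead man's switch"
		```

		The receiving end, which has to notice the missing heartbeat, is not
		shown here and still needs to be chosen.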

		TODO: review https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertmanagerHealth

		### Queries cheat sheet

		* availability:
		  * how many hosts are online at any given point: `sum(count(up==1))/sum(count(up)) by (alias)`
		  * percentage of hosts available over a given period: `avg_over_time(up{job="node"}[7d])`

		* memory pressure:

		```
		# PSI alerts - in testing mode for now.
		- alert: HostMemoryPressureHigh
		  expr: rate(node_pressure_memory_waiting_seconds_total[10m]) > 0.2
		  for: 10m
		  labels:
		    scope: host
		    severity: warn
		  annotations:
		    summary: "High memory pressure on host {{$labels.host}}"
		    description: |
		      PSI metrics report high memory pressure on host {{$labels.host}}:
		      {{$value}} > 0.2.
		      Processes might be at risk of eventually OOMing.
		```

		## Authentication

		TODO: check if we have a web password in LDAP, use it for auth

		## Trending: Grafana

		TODO: document the (future) grafana setup

		## Alerting: Alertmanager, Karma

		Alerting will be performed by [Alertmanager][], ideally in a
		high-availability cluster. Documenting Alertmanager is out of scope of
		this document, but a few glossary items seem worth defining here:

		* alerting rules: rules defined, in PromQL, on the Prometheus
		server that fire if they are true (e.g. `node_reboot_required > 0`
		for a host requiring a reboot)
		* alert: an alert sent following an alerting rule "firing" from a
		Prometheus server
		* grouping: grouping multiple alerts together in a single
		notification
		* inhibition: suppressing notification from an alert if another
		is already firing, configured in the Alertmanager configuration file
		* silence: muting an alert for a specific amount of time,
		configured through the Alertmanager web interface
		* high availability: support for receiving alerts from multiple
		Prometheus servers and avoiding duplicate notifications between
		multiple Alertmanager servers
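
		To illustrate grouping and inhibition, here is a minimal sketch of
		an Alertmanager configuration file. The `HostDown` alert name, the
		receiver name and the timings are assumptions picked for the
		example, not a proposed configuration.

		```yaml
		route:
		  receiver: default
		  # grouping: alerts sharing these labels are batched into a
		  # single notification
		  group_by: ['alertname', 'instance']
		  group_wait: 30s
		  group_interval: 5m
		  repeat_interval: 4h

		inhibit_rules:
		  # inhibition: while HostDown fires for an instance, suppress
		  # notifications for every other alert on the same instance
		  - source_matchers:
		      - 'alertname = "HostDown"'
		    target_matchers:
		      - 'alertname != "HostDown"'
		    equal: ['instance']

		receivers:
		  - name: default  # no notifier configured in this sketch
		```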

		### Configuration

		TODO: rules in Puppet and/or git?

		TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert

		TODO: incident response procedures?

		### Dashboard

		We will deploy a [Karma](https://github.com/prymitive/karma) dashboard to expose Prometheus alerts to
		operators. It features:

		* silencing alerts
		* showing alert inhibitions
		* aggregating alerts from multiple Alertmanager servers
		* alert groups
		* alert history
		* dead man's switch (an alert always firing that signals an error
		when it stops firing)

		There is a [Karma demo](https://demo.karma-dashboard.io/) available, although it's a bit slow and
		crowded. Hopefully ours will look cleaner.

		### Alert levels

		The current noise levels in Icinga are unsustainable and make alert
		fatigue such a problem that we often miss critical issues until it's
		too late. And while Icinga operators (anarcat, in particular, has
		experience with this) succeeded in reducing the amount of noise from
		monitoring, we feel a different approach is necessary here.

		From the start, we'll take the approach of labeling each alert with
		one of two `severity` labels:

		* `warning`: non-urgent condition, requiring investigation and
		fixing, but not immediately, no user-visible impact; example:
		server needs to be rebooted
		* `error`: serious condition with disruptive user-visible impact
		which requires prompt response; example: donation site gives a 500
		error

		This distinction is partly inspired by Rob Ewaschuk's [Philosophy on
		Alerting][], which forms the basis of Google's [monitoring distributed
		systems][], part of the [Site Reliability Engineering book][].
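
		To make the two levels concrete, the two examples above could be
		written as the following alerting rules. This is only a sketch: it
		assumes the donation site is probed by a blackbox exporter under a
		hypothetical `blackbox-donate` job, and the alert names and `for`
		durations are placeholders.

		```yaml
		groups:
		  - name: severity-example  # hypothetical group name
		    rules:
		      # warning: needs fixing, but not immediately
		      - alert: RebootRequired
		        expr: node_reboot_required > 0
		        for: 1h
		        labels:
		          severity: warning
		        annotations:
		          summary: "{{ $labels.instance }} needs to be rebooted"
		      # error: user-visible disruption, requires prompt response
		      - alert: DonationSiteError
		        expr: probe_http_status_code{job="blackbox-donate"} >= 500
		        for: 5m
		        labels:
		          severity: error
		        annotations:
		          summary: "Donation site returns HTTP {{ $value }}"
		```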

		[Site Reliability Engineering book]: https://sre.google/sre-book/table-of-contents/
		[monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
		[Philosophy on Alerting]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic

		### Unit tests

		TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/
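
		As a starting point, a `promtool` unit test file could look like the
		sketch below. It assumes a hypothetical rule file, `rules/reboot.yml`,
		defining a `RebootRequired` alert on `node_reboot_required > 0` with
		a `severity: warning` label, no `for` clause and no annotations. The
		paths and the test host name are made up for the example.

		```yaml
		# Run with: promtool test rules reboot_test.yml
		rule_files:
		  - rules/reboot.yml  # hypothetical rule file, see above
		evaluation_interval: 1m

		tests:
		  - interval: 1m
		    # Fake series: the host reports "reboot required" (1) for the
		    # whole test, one sample per minute.
		    input_series:
		      - series: 'node_reboot_required{instance="test1.example.com"}'
		        values: '1+0x10'
		    alert_rule_test:
		      # Five minutes in, the alert is expected to fire with the
		      # labels of the series plus the label set by the rule.
		      - eval_time: 5m
		        alertname: RebootRequired
		        exp_alerts:
		          - exp_labels:
		              severity: warning
		              instance: test1.example.com
		```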

		## Notifications: IRC / Matrix?

		TODO: experiment with IRC alerting a little more to get a go / no-go
		on this.

		Avoid pages as much as possible, see https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlertsAsNotificationsFreedom

		gitlab alerting example
		https://gitlab.torproject.org/tpo/community/l10n/-/alert_management

		tpa incidents https://gitlab.torproject.org/tpo/tpa/team/-/incidents

		We will aggressively restrict the kind and number of alerts that will
		actually send notifications.

		* the dashboard has everything
		* IRC notifications for warnings; micah suggests keeping that to pages to reduce the noise. Maybe split: pages in the main channel, everything in a separate channel?
		* email / GitLab incidents for pages?
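
		The "split" idea could be expressed as an Alertmanager routing tree
		along these lines. This is a sketch under several assumptions:
		Alertmanager has no native IRC support, so it presumes some
		webhook-to-IRC relay listening on localhost, and the URLs, channel
		names and receiver names are all hypothetical.

		```yaml
		route:
		  receiver: irc-all
		  routes:
		    # errors ("pages") go to the main channel...
		    - matchers:
		        - 'severity = "error"'
		      receiver: irc-main
		      # ...and keep matching, so they also reach the next route
		      continue: true
		    # everything, including errors, goes to the separate channel
		    - receiver: irc-all

		receivers:
		  - name: irc-main
		    webhook_configs:
		      - url: http://localhost:8000/main-channel    # hypothetical relay URL
		  - name: irc-all
		    webhook_configs:
		      - url: http://localhost:8000/alerts-channel  # hypothetical relay URL
		```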

		TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter

		### Dashboard management

		TODO: see [tpo/tpa/team#41312](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41312)

		### Access control

		TODO: see
		[tpo/tpa/team#40124](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40124) https://gitlab.torproject.org/tpo/tpa/team/-/issues/30023

		## Migration plan

		* deploy Alertmanager on prometheus1
		* reimplement the Nagios alerting commands (optional?)
		* send Nagios alerts through the Alertmanager (optional?)
		* rewrite (non-NRPE) commands (9) as Prometheus alerts
		* scrape the NRPE metrics from Prometheus (optional)
		* create a dashboard and/or alerts for the NRPE metrics (optional)
		* review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
		* turn off the Icinga server
		* remove all traces of NRPE on all nodes

		# Alternatives considered

		## Limitations

		TODO: flapping, re https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusOnExtendingAlerts

		## Wikimedia Foundation

		TODO: evaluate https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2907267

		## fedora tracer

		https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2968812

		## Other dashboards

		### Grafana

		Grafana was tested to provide an alerting dashboard, but seemed
		insufficient. There's a [builtin "dashboard"](https://grafana2.torproject.org/alerting/list?view=state) for alerts it finds
		already with the existing Prometheus data source.

		It doesn't support silencing alerts.

		It's possible to make Grafana dashboards with queries as well. I found
		only a couple that only use the Prometheus stats; most of the better
		ones use the Alertmanager metrics themselves. It also seems dashboards
		rely on Prometheus scraping metrics off the Alertmanager.

		TODO: https://grafana.com/docs/grafana/latest/alerting/unified-alerting/

		TODO: https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-source/

		## Nagios

		https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2801540

		## Out of scope

		### Exporter policy

		TODO: exporters policy [tpo/tpa/team#41280](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41280)

		### SLA improvements

		We make no change to the current support policy ([TPA-RFC-2][]). In
		particular, this doesn't introduce a new "pager" service that rings
		operators on their phones.

		[TPA-RFC-2]: policy/tpa-rfc-2-support

		We keep the current "email / IRC" notification, with the possible
		addition of GitLab incidents/alerts.

		We MAY introduce push notifications (e.g. with [ntfy.sh](https://ntfy.sh/) or
		Signal) if we significantly trim down the amount of noise emanating
		from the monitoring server, and only if we send notifications during
		business hours of the affected parties.
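
		If we go that way, the business hours restriction could be enforced
		in Alertmanager itself with time intervals, as in the sketch below.
		The interval name, the hours and the receiver names are assumptions.
		The times are interpreted as UTC by default, so covering the
		business hours of each affected party would likely need one interval
		and one route per time zone.

		```yaml
		time_intervals:
		  - name: business-hours  # hypothetical name
		    time_intervals:
		      - weekdays: ['monday:friday']
		        times:
		          - start_time: '09:00'
		            end_time: '17:00'

		route:
		  receiver: irc
		  routes:
		    - matchers:
		        - 'severity = "error"'
		      receiver: push
		      # notifications on this route are only sent during the
		      # interval above and are suppressed outside of it
		      active_time_intervals: ['business-hours']

		receivers:
		  - name: irc   # notifier configuration omitted in this sketch
		  - name: push  # e.g. a webhook to a push service, omitted here
		```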

		We will absolutely not wake up humans at night for servers. If we
		desire 24/7 availability, shifts should be implemented with staff in
		multiple time zones instead.

		If we do want to improve on SLA metrics, we should consider using
		[Sloth](https://github.com/slok/sloth), an "easy and simple Prometheus SLO (service level
		objectives) generator" which generates Grafana dashboards and alerts.

		[Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications.

		### Incident response procedures

		see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421

		### Additional metrics

		https://promhippie.github.io/hetzner_exporter/
		https://promhippie.github.io/hcloud_exporter/
		https://github.com/ganeti/prometheus-ganeti-exporter

		### Flap detection

		https://github.com/prometheus/alertmanager/issues/204

		# Costs

		# Approval