Commit e29dc3c3 authored by anarcat

lots more notes about prometheus, now all is in the doc

Next step is to run through the entire todo list and design like
hell.

see team#40755
parent 1e6f12d7
+302 −2
@@ -96,6 +96,12 @@ minutes. It processes about 200 checks per minute.
  [previously estimated]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31244#note_2541965
  [tor-nagios.git repository]: https://gitweb.torproject.org/admin/tor-nagios.git/

Icinga is running version 1.14, from Debian buster.

TODO: document the upgrade problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/40695
TODO: document puppetization problem. https://gitlab.torproject.org/tpo/tpa/team/-/issues/32901
TODO: document why nagios is not puppet (so nagios tests the puppet config, rejected idea)

## Problem statement

The current Icinga deployment cannot be upgraded to Bullseye as is.
@@ -391,8 +397,302 @@ TODO: what do service admins want?

# Proposal

TODO: overview

## Architecture overview

TODO: architecture diagram before / after?

## Metrics: Prometheus

In [monitoring distributed systems][], Google defines 4 "golden
signals", categories of metrics that need to be monitored:

 * **Latency**: time to service a request
 * **Traffic**: transactions per second or bandwidth
 * **Errors**: failure rates, e.g. 500 errors in web servers
 * **Saturation**: full disks, memory, CPU utilization, etc

In the book, they argue all four should issue pager alerts, but we
believe warnings for saturation, except for extreme cases ("disk
actually full"), might be sufficient.
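
A minimal sketch of how that split could look as alerting rules,
assuming the standard node exporter filesystem metrics and the
`warning` / `error` severity labels proposed in the "Alert levels"
section. The alert names, thresholds, durations and the `fstype`
filter are illustrative assumptions, not decided values:

```yaml
groups:
  - name: saturation-example
    rules:
      # Saturation approaching: worth investigating, but not urgent.
      - alert: DiskSpaceLow
        expr: |
          node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% free on {{ $labels.mountpoint }} ({{ $labels.instance }})"
      # Extreme case ("disk actually full"): user-visible impact is likely.
      - alert: DiskFull
        expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} == 0
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} is full on {{ $labels.instance }}"
```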

### Inventory

TODO: Get a sense of what metrics we have and what we want to keep.

 * EDAC: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2908372
 * DRBD:
   https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2912119
   and https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2903908
 * unexpected open ports: https://github.com/stanford-esrg/lzr
 * disk full: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2946792
 * needrestart: https://github.com/liske/needrestart/issues/291
 * cert expirations: https://github.com/joe-elliott/cert-exporter
 * fingerprint checking: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41385
 * imap/web roundtrips: https://git.autistici.org/ai3/tools/service-prober
 * puppet: https://github.com/voxpupuli/puppet-prometheus_reporter

 * https://github.com/chrj/prometheus-dnssec-exporter
 * https://gitlab.com/gitlab-com/gl-infra/prometheus-git-exporter
 * https://github.com/hipages/php-fpm_exporter

 * https://man.sr.ht/ops/monitoring.md
 * https://git.sr.ht/~sircmpwn/metrics.sr.ht
 * https://metrics.sr.ht/rules
 * https://metrics.sr.ht/alerts

### Retention

TODO: long term storage? https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330

### Privacy

TODO: prom1/prom2

### Self-monitoring

Prometheus should monitor itself and its [Alertmanager][] for
outages. Some mechanism should be set up to make sure alerts can and
do get delivered, probably through a "dead man's switch" that
continuously sends alerts and makes sure they get delivered.

Prometheus calls this [metamonitoring](https://prometheus.io/docs/practices/alerting/#metamonitoring).
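
As a sketch of the Prometheus side of such a dead man's switch, an
alerting rule can be made to fire permanently. The alert name and
group name here are hypothetical, and the receiving end (whatever
notices that the alert *stopped* arriving) is not shown and would
still need to be chosen:

```yaml
groups:
  - name: meta-monitoring-example
    rules:
      # vector(1) always returns one sample, so this alert never resolves
      # while Prometheus is evaluating rules. The check that matters
      # happens on the receiving side, which must raise the alarm when
      # this alert stops being delivered.
      - alert: DeadMansSwitch
        expr: vector(1)
        annotations:
          summary: "Always-firing alert used to verify the alerting pipeline"
```

On the Alertmanager side, this alert would presumably get its own
route with a short `repeat_interval`, pointing at a receiver outside
of the monitoring server itself.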

TODO: review https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusAlertmanagerHealth

### Queries cheat sheet

 * **availability**:
   * proportion of hosts online at any given point: `sum(count(up==1))/sum(count(up)) by (alias)`
   * availability of each host over a given period, as a ratio between 0 and 1: `avg_over_time(up{job="node"}[7d])`

 * **memory pressure**:

```
  # PSI alerts - in testing mode for now.
  - alert: HostMemoryPressureHigh
    expr: rate(node_pressure_memory_waiting_seconds_total[10m]) > 0.2
    for: 10m
    labels:
      scope: host
      severity: warn
    annotations:
      summary: "High memory pressure on host {{$labels.host}}"
      description: |
        PSI metrics report high memory pressure on host {{$labels.host}}:
          {{$value}} > 0.2.
        Processes might be at risk of eventually OOMing.
```

## Authentication

TODO: check if we have a web password in LDAP, use it for auth

## Trending: Grafana

TODO: document the (future) grafana setup

## Alerting: Alertmanager, Karma

Alerting will be performed by [Alertmanager][], ideally in a
high-availability cluster. Documenting Alertmanager is out of scope of
this document, but a few glossary items seem worth defining here:

 * **alerting rules**: rules defined, in PromQL, on the Prometheus
   server that fire if they are true (e.g. `node_reboot_required > 0`
   for a host requiring a reboot)
 * **alert**: an alert sent following an alerting rule "firing" from a
   Prometheus server
 * **grouping**: grouping multiple alerts together in a single
   notification
 * **inhibition**: suppressing notification from an alert if another
   is already firing, configured in the Alertmanager configuration file
 * **silence**: muting an alert for a specific amount of time,
   configured through the Alertmanager web interface
 * **high availability**: support for receiving alerts from multiple
   Prometheus servers and avoiding duplicate notifications between
   multiple Alertmanager servers
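
To make grouping and inhibition more concrete, here is a minimal
sketch of an Alertmanager configuration. It assumes a reasonably
recent Alertmanager (the `matchers` syntax), a hypothetical `HostDown`
alert, and the `severity` labels proposed in this document; the
timings are placeholders, not recommendations:

```yaml
route:
  receiver: default
  # Grouping: all firing instances of the same alert (for example every
  # host needing a reboot) are batched into a single notification.
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

inhibit_rules:
  # Inhibition: while HostDown fires for an instance, do not notify
  # about warnings on that same instance.
  - source_matchers:
      - alertname="HostDown"
    target_matchers:
      - severity="warning"
    equal: ['instance']

receivers:
  # A receiver without any configuration sends no notification.
  - name: default
```

Silences are not part of this file: they are created at runtime
through the web interface, as described above.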

### Configuration

TODO: rules in Puppet and/or git?

TODO: inhibitions, see also https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusGoodDownExporterAlert

TODO: incident response procedures?

### Dashboard

We will deploy a [Karma](https://github.com/prymitive/karma) dashboard to expose Prometheus alerts to
operators. It features:

 * silencing alerts
 * showing alert inhibitions
 * aggregating alerts from multiple Alertmanagers
 * alert groups
 * alert history
 * dead man's switch (an alert always firing that signals an error
   when it *stops* firing)

There is a [Karma demo](https://demo.karma-dashboard.io/) available, although it's a bit slow and
crowded. Hopefully ours will look cleaner.

### Alert levels

The current noise levels in Icinga are unsustainable and make alert
fatigue such a problem that we often miss critical issues until it's
too late. And while Icinga operators (anarcat, in particular, has
experience with this) succeeded in reducing the amount of noise from
monitoring, we feel a different approach is necessary here.

From the start, we'll take the approach of labeling each alert with
one of two `severity` labels:

 * `warning`: non-urgent condition, requiring investigation and
   fixing, but not immediately, no user-visible impact; example:
   server needs to be rebooted
 * `error`: serious condition with disruptive user-visible impact
   which requires prompt response; example: donation site gives a 500
   error
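
The two examples above could be written as the following rules. This
is only a sketch: `node_reboot_required` is the metric already
mentioned in the glossary, while the blackbox exporter job name, the
probed URL, the alert names and the `for` durations are assumptions
made for illustration:

```yaml
groups:
  - name: severity-example
    rules:
      # warning: needs fixing, but not immediately.
      - alert: RebootRequired
        expr: node_reboot_required > 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} needs to be rebooted"
      # error: disruptive, user-visible, needs a prompt response.
      # Assumes the site is probed by the blackbox exporter under a
      # hypothetical "blackbox_https" job.
      - alert: DonationSiteServerError
        expr: probe_http_status_code{job="blackbox_https", instance="https://donate.example.org/"} >= 500
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "{{ $labels.instance }} returns HTTP {{ $value }}"
```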

This distinction is partly inspired by Rob Ewaschuk's [Philosophy on
Alerting][] which forms the basis of Google's [monitoring distributed
systems][], part of the [Site Reliability Engineering book][].

 [Site Reliability Engineering book]: https://sre.google/sre-book/table-of-contents/
 [monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
 [Philosophy on Alerting]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/mobilebasic

### Unit tests

TODO: https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/

## Notifications: IRC / Matrix?

TODO: experiment with IRC alerting a little more to get a go / no-go
on this.

 * avoid pages as much as possible, see https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlertsAsNotificationsFreedom
 * GitLab alerting example: https://gitlab.torproject.org/tpo/community/l10n/-/alert_management
 * TPA incidents: https://gitlab.torproject.org/tpo/tpa/team/-/incidents

We will aggressively restrict the kind and number of alerts that will
actually send notifications.

 * the dashboard has everything
 * IRC notifications for warnings; micah suggests keeping that to pages
   to reduce the noise... maybe split: pages in main channel,
   everything in a separate channel?
 * email / GitLab incidents for pages?
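
One way the "split" idea could be expressed in the Alertmanager
routing tree is sketched below. Nothing here is decided: the receiver
names, the webhook URLs (standing in for some IRC relay, since
Alertmanager has no native IRC support) and the email address are all
hypothetical, and the global SMTP settings needed by `email_configs`
are left out:

```yaml
route:
  # Default: no notification at all, alerts are only visible in the
  # dashboard.
  receiver: dashboard-only
  group_by: ['alertname']
  routes:
    # Pages: main IRC channel and email, then keep evaluating routes.
    - matchers:
        - severity="error"
      receiver: pages
      continue: true
    # Everything with a severity also lands in a separate channel.
    - matchers:
        - severity=~"warning|error"
      receiver: irc-everything

receivers:
  - name: dashboard-only
  - name: pages
    webhook_configs:
      - url: "http://localhost:8000/main-channel"   # hypothetical IRC relay
    email_configs:
      - to: "alerts@example.org"                    # hypothetical address
  - name: irc-everything
    webhook_configs:
      - url: "http://localhost:8000/alerts-channel" # hypothetical IRC relay
```

Dropping the second route would give the variant micah suggests, where
only pages reach IRC.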

TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter

### Dashboard management

TODO: see [tpo/tpa/team#41312](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41312)

### Access control

TODO: see
[tpo/tpa/team#40124](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40124) and https://gitlab.torproject.org/tpo/tpa/team/-/issues/30023

## Migration plan

 * deploy Alertmanager on prometheus1
 * reimplement the Nagios alerting commands (optional?)
 * send Nagios alerts through the alertmanager (optional?)
 * rewrite (non-NRPE) commands (9) as Prometheus alerts
 * scrape the NRPE metrics from Prometheus (optional)
 * create a dashboard and/or alerts for the NRPE metrics (optional)
 * review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
 * turn off the Icinga server
 * remove all traces of NRPE on all nodes

# Alternatives considered

## Limitations

TODO: flapping, re https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusOnExtendingAlerts

## Wikimedia Foundation

TODO: evaluate https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2907267

## fedora tracer

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2968812

## Other dashboards

### Grafana

Grafana was tested to provide an alerting dashboard, but seemed
insufficient. There's a [builtin "dashboard"](https://grafana2.torproject.org/alerting/list?view=state) for alerts it finds
already with the existing Prometheus data source.

It doesn't support silencing alerts.

It's possible to make Grafana dashboards with queries as well. I found
only a couple that only use the Prometheus stats; most of the better
ones use the Alertmanager metrics themselves. It also seems dashboards
rely on Prometheus scraping metrics off the Alertmanager.

TODO: https://grafana.com/docs/grafana/latest/alerting/unified-alerting/

TODO: https://grafana.com/blog/2022/06/14/introducing-grafana-oncall-oss-open-source/

## Nagios

https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864#note_2801540

## Out of scope

### Exporter policy

TODO: exporters policy [tpo/tpa/team#41280](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41280)

### SLA improvements

We make no change to the current support policy ([TPA-RFC-2][]). In
particular, this doesn't introduce a new "pager" service that rings
operators on their phones.

[TPA-RFC-2]: policy/tpa-rfc-2-support

We keep the current "email / IRC" notification, with the possible
addition of GitLab incidents/alerts.

We *MAY* introduce push notifications (e.g. with [ntfy.sh](https://ntfy.sh/) or
Signal) if we significantly trim down the amount of noise emanating
from the monitoring server, and *only* if we send notifications during
business hours of the affected parties.

We will absolutely not wake up humans at night for servers. If we
desire 24/7 availability, shifts should be implemented with staff in
multiple time zones instead.
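
If push notifications are ever introduced, the business hours
restriction could be enforced in Alertmanager itself rather than on
the operators' devices. The following sketch assumes a recent
Alertmanager release supporting `time_intervals`,
`active_time_intervals` and the `location` field. The hours, time
zone, receiver names and webhook URL are placeholders:

```yaml
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'
        location: 'America/Toronto'   # assumption: operator's time zone

route:
  receiver: default
  routes:
    - matchers:
        - severity="error"
      receiver: push
      # Notifications on this route are only sent inside the interval;
      # outside of it they are muted.
      active_time_intervals:
        - business-hours

receivers:
  - name: default
  - name: push
    webhook_configs:
      - url: "https://push.example.org/alerts"   # hypothetical push bridge
```

A single interval like this only covers one time zone. Operators in
different zones would presumably each need their own interval and
route.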

If we do want to improve on SLA metrics, we should consider using
[Sloth](https://github.com/slok/sloth), an "easy and simple Prometheus SLO (service level
objectives) generator" which generates Grafana dashboards and alerts.

[Sachet](https://github.com/messagebird/sachet/) could be used to send SMS notifications.

### Incident response procedures

see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421

### Additional metrics

 * https://promhippie.github.io/hetzner_exporter/
 * https://promhippie.github.io/hcloud_exporter/
 * https://github.com/ganeti/prometheus-ganeti-exporter

### Flap detection

https://github.com/prometheus/alertmanager/issues/204

# Costs

# Approval