TODO: talk about `scrape_jobs` for in-puppet configurations.

TODO: show how to hook a custom scrape job, and on which server to put it.

## Web dashboard usage
The main web dashboard for the internal Prometheus server should be
accessible at <https://prometheus.torproject.org> using the
well-known, public username.
The dashboard for the external Prometheus server, however, is not
publicly available. To access it, forward the relevant ports over SSH
with the following command line:

    ssh -L 9090:localhost:9090 -L 9091:localhost:9091 -L 9093:localhost:9093 prometheus2.torproject.org

The above will also forward the management interfaces of the
Alertmanager (port 9093) and the Pushgateway (port 9091).

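With the forward in place, a quick way to confirm the tunnels work is
to hit the health endpoints of the daemons. This is a minimal sketch,
assuming recent enough Prometheus and Alertmanager releases that
expose `/-/healthy`:

    # check that the forwarded Prometheus and Alertmanager answer locally
    curl -s http://localhost:9090/-/healthy
    curl -s http://localhost:9093/-/healthy
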
## Alerting
We currently do not do alerting for TPA services with Prometheus. We
do, however, have the Alertmanager set up to do alerting for other
teams on the secondary Prometheus server (`prometheus2`). This
documentation details how that works; it could also eventually cover
the main server, if Prometheus replaces [Nagios](howto/nagios) for
alerting ([ticket 29864][]).

In general, the upstream documentation for alerting starts from [the
Alerting Overview](https://prometheus.io/docs/alerting/latest/overview/), but I have found it lacking at times. I have
instead been following [this tutorial](https://ashish.one/blogs/setup-alertmanager/), which was quite helpful.

### Adding alerts
The Alertmanager is currently managed through Puppet, in
`profile::prometheus::server::external`. An alerting rule is defined
like this:

    {
      'name'  => 'bridgestrap',
      'rules' => [
        {
          'alert'  => 'Bridges down',
          'expr'   => 'bridgestrap_fraction_functional < 0.50',
          'for'    => '5m',
          'labels' => {
            'severity' => 'critical',
            'team'     => 'anti-censorship',
          },
          'annotations' => {
            'title'       => 'Bridges down',
            'description' => 'Too many bridges down',
            # use humanizePercentage when upgrading to prom > 2.11
            'summary'     => 'Number of functional bridges is `{{$value}}%`',
            'host'        => '{{$labels.instance}}',
          },
        },
      ],
    },

The key part of the alert is the `expr` setting, which is a PromQL
expression that, when it evaluates to "true" for more than `5m` (the
`for` setting), will fire an alert to the Alertmanager. Also note the
`team` label, which will route the message to the right team. Those
routes are defined later, in the `routes` and `receivers` settings.
Note that those might move to separate files and/or Hiera later on.

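Before deploying a new rule, it can be useful to evaluate the
expression by hand to see what it currently returns. A minimal
sketch, assuming the SSH port forward from the
[web dashboard instructions](#web-dashboard-usage) is in place:

    # evaluate the alert expression against the Prometheus HTTP API;
    # an empty result means the alert would not fire right now
    curl -sG http://localhost:9090/api/v1/query \
        --data-urlencode 'query=bridgestrap_fraction_functional < 0.50'

The same expression can also be pasted into the `/graph` page of the
web dashboard.
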
### Adding alert recipients
To add a new recipient for alerts, look for the `receivers` setting
and add something like this:

    receivers => [
      {
        'name'          => 'anti-censorship team',
        'email_configs' => [
          {
            'to'          => 'anti-censorship-alerts@lists.torproject.org',
            # see above
            'require_tls' => false,
          },
        ],
      },
      # [...]

Then alerts can be routed to that receiver by adding a "route" in the
`routes` setting. For example, this will route alerts with the `team:
anti-censorship` label:
    routes => [
      {
        'receiver' => 'anti-censorship team',
        'match'    => {
          'team' => 'anti-censorship',
        },
      },
    ],

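Once the Puppet-generated Alertmanager configuration is in place, the
routing tree can be checked from the Alertmanager host. A sketch only:
the configuration path below is an assumption (adjust it to wherever
Puppet actually writes `alertmanager.yml` on Debian), and
`amtool config routes` requires a reasonably recent `amtool` (0.17+):

    # validate the whole configuration file
    amtool check-config /etc/prometheus/alertmanager.yml

    # ask amtool which receiver a given label set would be routed to
    amtool config routes test --config.file=/etc/prometheus/alertmanager.yml \
        team=anti-censorship severity=critical
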
### Testing alerts
Normally, alerts should fire on the Prometheus server and be sent out
to the Alertmanager server, provided the latter is correctly
configured (i.e. it is declared in the `alerting` section of
`prometheus.yml`, see [Installation](#installation) below).

If you're not sure alerts are working, head to the web dashboard (see
[the access instructions](#web-dashboard-usage)) and look at the
`/alerts` and `/rules` pages. For example, if you're using port
forwarding:

 * <http://localhost:9090/alerts> - should show the configured alerts,
   and whether they are firing
 * <http://localhost:9090/rules> - should show the configured rules,
   and whether they match

Typically, the <http://localhost:9093> URL should also be useful to
manage the Alertmanager, but in practice the Debian package does not
ship the web interface, so it is of limited use in that regard. See
the `amtool` section below for more information.

Note that the `/targets` page is also generally useful to diagnose
problems with exporters.

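Two more things can be checked from the command line: that the rule
files parse, and that the Alertmanager actually delivers
notifications. The following is a sketch only; the rule file path is
a guess (adjust it to wherever Puppet drops the rule files), and the
second command will trigger a real notification to the matching
receiver:

    # syntax-check the alerting rules shipped to Prometheus
    promtool check rules /etc/prometheus/rules/*.yml

    # inject a synthetic alert straight into the Alertmanager to test
    # routing and delivery (this *will* notify the matching receiver)
    curl -s -XPOST http://localhost:9093/api/v1/alerts \
        -H 'Content-Type: application/json' \
        -d '[{"labels": {"alertname": "TestAlert", "team": "anti-censorship", "severity": "critical"},
              "annotations": {"summary": "test alert, please ignore"}}]'
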
### Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool](https://manpages.debian.org/amtool.1) command. A few useful commands:
* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
--comment="working on it" ALERTNAME`: silence alert ALERTNAME for
an hour, with some comments
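
Note that `amtool` needs to be told where the Alertmanager API lives;
when running it over the SSH port forward (or anywhere other than the
Alertmanager host itself), the URL has to be passed explicitly. A
sketch, assuming the forward from the
[web dashboard instructions](#web-dashboard-usage):

    # query firing alerts through the forwarded Alertmanager API
    amtool --alertmanager.url=http://localhost:9093 alert query

    # the URL can also be stored in amtool's configuration file so it
    # does not need to be repeated (see the amtool manual page for the
    # exact search paths)
    mkdir -p ~/.config/amtool
    echo 'alertmanager.url: http://localhost:9093' > ~/.config/amtool/config.yml
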
## Pager playbook
TBD.
dashboard is not available, how to bypass authentication restrictions
on said dashboard, talk about the Alertmanager (lack of?) UI, the
Pushgateway UI, how to access them, `amtool`, rules debugging...

TODO: talk about `/targets`.

## Disaster recovery
If a Prometheus/Grafana is destroyed, it should be completely
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios](howto/nagios) ([ticket 29864][]).
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
profile if it is deployed on more than one server.
TODO: document how to add stuff to the Alertmanager.
Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the `prometheus.yml` file:
    alerting:
      alert_relabel_configs: []
      alertmanagers:
      - static_configs:
        - targets:
          - localhost:9093

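After reloading Prometheus, it is worth confirming that the
Alertmanager was actually picked up. A quick check, assuming the SSH
port forward from the [web dashboard instructions](#web-dashboard-usage):

    # list the Alertmanagers Prometheus is currently aware of; the
    # "active" list should include localhost:9093
    curl -s http://localhost:9090/api/v1/alertmanagers
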
### Manual node configuration