[Kubernetes]: https://kubernetes.io/
It does not show that Prometheus can federate to multiple instances
and the Alertmanager can be configured for high availability. We have
a monolithic server setup right now; changing that is planned as part
of [TPA-RFC-33-C][].
### Metrics types
In [monitoring distributed systems][], Google defines 4 "golden
signals", categories of metrics that need to be monitored:
* **Latency**: time to service a request
* **Traffic**: transactions per second or bandwidth
* **Errors**: failure rates, e.g. 500 errors in web servers
* **Saturation**: full disks, memory, CPU utilization, etc.
In the book, they argue all four should issue pager alerts, but we
believe warnings for saturation, except in extreme cases ("disk
actually full"), might be sufficient.
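To make those categories concrete, here is how each signal could be
probed with ad hoc queries against our server. This is only a sketch:
the metric names assume the node exporter and Prometheus' own
self-metrics are scraped, and passing credentials in the URL is an
assumption borrowed from the `curl` example further down.

    # Sketch only: metric names depend on which exporters are actually deployed.
    SERVER="https://$HTTP_USER@prometheus.torproject.org"

    # Latency: 90th percentile of Prometheus' own HTTP request durations
    promtool query instant "$SERVER" \
        'histogram_quantile(0.9, rate(prometheus_http_request_duration_seconds_bucket[5m]))'

    # Traffic: inbound network bandwidth per host
    promtool query instant "$SERVER" 'rate(node_network_receive_bytes_total[5m])'

    # Errors: rate of HTTP 5xx responses served by Prometheus itself
    promtool query instant "$SERVER" 'rate(prometheus_http_requests_total{code=~"5.."}[5m])'

    # Saturation: fraction of disk space still available per filesystem
    promtool query instant "$SERVER" 'node_filesystem_avail_bytes / node_filesystem_size_bytes'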
[monitoring distributed systems]: https://sre.google/sre-book/monitoring-distributed-systems/
### Alertmanager
but it's not deployed in our configuration, we use [Karma][]
(previously Cloudflare's [unsee][]) instead.
[the "My Philosophy on Alerting" paper from a Google engineer]: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
[Monitoring distributed systems]: https://www.oreilly.com/radar/monitoring-distributed-systems/
[Site Reliability Engineering]: https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/
[kthxbye bot]: https://github.com/prymitive/kthxbye
would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
## Configuration
The Prometheus server is currently configured mostly through Puppet,
where modules define exporters and "exported resources" that get
collected on the central server, which then scrapes those targets.
The [`prometheus-alerts.git` repository][] contains all alerts and
some non-TPA targets, specified in the `targets.d` directory for all
teams.
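As a rough illustration of what adding such a target involves, a file
in `targets.d` can be expected to follow Prometheus' usual static
target format; the file name, job and labels below are made up, not
the actual repository layout:

    # Hypothetical example: declare a scrape target for a non-TPA team.
    # Exact file naming and label conventions are assumptions.
    cat > targets.d/example-team.yaml <<'EOF'
    - targets:
        - example-host.torproject.org:9100
      labels:
        team: example
        job: node
    EOF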
## Services
Prometheus is made of multiple components:
There's also a [list of third-party exporters][] in the Prometheus documentation.
## Interfaces
<!-- TODO e.g. web APIs, commandline clients, etc -->
This system has multiple interfaces. Let's take them one by one.
### Trending: Grafana
Long-term trends are visible in the [Grafana][] dashboards, which tap
into the Prometheus API to graph history. Documentation on that is in
the [Grafana][] wiki page.
### Alerting: Karma
The main alerting dashboard is the [Karma dashboard][], which shows
the currently firing alerts and allows users to silence them.
Technically, alerts are generated by the Prometheus server and relayed
through the Alertmanager server; Karma then taps into the Alertmanager
API to show those alerts. Karma provides the following features:
* Silencing alerts
* Showing alert inhibitions
* Aggregating alerts from multiple Alertmanagers
* Alert groups
* Alert history
* Dead man's switch (an alert that is always firing and signals an
  error when it *stops* firing; see the sketch after this list)
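Such a dead man's switch boils down to an alerting rule that always
evaluates to true. A minimal sketch follows, with an alert name and
labels that are assumptions rather than what `prometheus-alerts.git`
actually contains:

    # Sketch of a "dead man's switch" rule; promtool can validate the syntax.
    cat > deadman.yml <<'EOF'
    groups:
      - name: meta
        rules:
          - alert: PrometheusAlwaysFiring
            expr: vector(1)
            labels:
              severity: warning
            annotations:
              summary: "Dead man's switch: this alert should never stop firing"
    EOF
    promtool check rules deadman.yml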
### Notifications: Alertmanager
We aggressively restrict the kind and number of alerts that will
actually send notifications. This was done mainly by creating two
different alerting levels ("warning" and "critical", above), and
drastically limiting the number of critical alerts.
The basic idea is that the dashboard (Karma) has "everything": alerts
with both "warning" and "critical" levels show up there, and it's
expected to be "noisy". Operators are expected to look at the
dashboard while on rotation for tasks to do. A typical example is
pending reboots, but anomalies like high load on a server or a
partition that will need expanding in a few weeks are also expected.
All notifications are also sent over IRC (`#tor-alerts` on OFTC) and
logged through `tpa_http_post_dump.service`. Operators are expected to
check their email or the IRC channel regularly and act upon those
notifications promptly.
IRC notifications are handled by the [`alertmanager-irc-relay`][].

[`alertmanager-irc-relay`]: https://github.com/google/alertmanager-irc-relay
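To make the severity split concrete, here is a minimal sketch of what
such routing could look like in an Alertmanager configuration; the
receiver names, addresses and webhook URL are assumptions, not our
actual setup:

    # Illustration only: "critical" alerts go to email and IRC, everything
    # else (e.g. "warning") only to the IRC relay webhook.
    cat > alertmanager-sketch.yml <<'EOF'
    global:
      smtp_smarthost: localhost:25
      smtp_from: alertmanager@example.torproject.org
    route:
      receiver: irc
      routes:
        - match:
            severity: critical
          receiver: email-and-irc
    receivers:
      - name: irc
        webhook_configs:
          - url: http://localhost:8000/alerts   # alertmanager-irc-relay endpoint (URL assumed)
      - name: email-and-irc
        email_configs:
          - to: example-admins@torproject.org
        webhook_configs:
          - url: http://localhost:8000/alerts
    EOF
    amtool check-config alertmanager-sketch.yml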
### Command-line
Prometheus has a [`promtool`][] that allows you to query the server
from the command-line, but there's also an
[HTTP API](https://prometheus.io/docs/prometheus/latest/querying/api/)
that we can use with `curl`. For example, this shows the hosts with
pending upgrades:
    curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
        "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
        | jq -r .data.result[].metric.alias \
        | grep -v '^null$' | paste -sd,
The output can be passed to a tool like [Cumin][], for example. This
is actually used in the `fleet.pending-upgrades` task to show an
inventory of the pending upgrades across the fleet.
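As a sketch of that idea, the host list could be captured in a
variable and handed to Cumin directly; the exact Cumin query syntax
depends on the configured backend, and the command run on the hosts is
only an example:

    # Hypothetical: run a command on every host reporting pending upgrades.
    hosts=$(curl -sSL --data-urlencode query='apt_upgrades_pending>0' \
        "https://$HTTP_USER@prometheus.torproject.org/api/v1/query" \
        | jq -r .data.result[].metric.alias | grep -v '^null$' | paste -sd,)
    cumin "$hosts" 'apt list --upgradable'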
[`promtool`]: https://manpages.debian.org/promtool.1
Alertmanager also has an [`amtool`](https://manpages.debian.org/amtool.1)
tool which can be used to inspect alerts and issue silences. It's used
in our test suite.
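For example, something along these lines lists what is currently
firing and silences an alert for a day; the alert name is made up, and
the `--alertmanager.url` value assumes the Alertmanager answers on its
default port on the local host:

    # List currently firing alerts (Alertmanager URL is an assumption).
    amtool --alertmanager.url=http://localhost:9093 alert query

    # Silence a hypothetical alert on one host for 24 hours.
    amtool --alertmanager.url=http://localhost:9093 silence add \
        --comment='reboot scheduled' --duration=24h \
        alertname=NeedsReboot instance=example-host.torproject.org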
## Authentication
The server monitors itself for system-level metrics but also
application-specific metrics. There's a long-term plan for
high-availability in [TPA-RFC-33-C][].

[TPA-RFC-33-C]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/15
Metrics are held for about a year or less, depending on the server;
see [ticket 29388][] for storage requirements and possible
* [Prometheus developer blog][]
* [Awesome Prometheus](https://github.com/roaldnefs/awesome-prometheus) list
* [Blue book](https://lyz-code.github.io/blue-book/devops/prometheus/prometheus/) - interesting guide
* [Robust Perception consulting](https://www.robustperception.io/) has a
  [series of blog posts on Prometheus](https://www.robustperception.io/tag/prometheus/)

[Prometheus home page]: https://prometheus.io/
[Prometheus documentation]: https://prometheus.io/docs/introduction/overview/
[Elm compiler]: https://github.com/elm/compiler
[not in Debian]: http://bugs.debian.org/973915
[Crochet]: https://github.com/simonpasquier/crochet
### Mobile notifications
Like [others][], we do not intend to have an on-call rotation yet, and
will not ring people on their mobile devices at first. After all
exporters have been deployed (priority "C", "nice to have") and alerts
properly configured, we will evaluate the number of notifications that
get sent out. If levels are acceptable (say, once a month or so), we
might implement push notifications during business hours for
consenting staff.
[others]: https://utcc.utoronto.ca/~cks/space/blog/sysadmin/AlertsAsNotificationsFreedom
We have been advised to avoid Signal notifications as that setup is
often brittle: `signal.org` frequently changes its API, which leads to
silent failures. We might implement [alerts over Matrix][], depending
on what messaging platform gets standardized in the Tor Project.
[alerts over Matrix]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216