prom: reorder docs (#41655), authored by anarcat
@@ -1317,6 +1317,26 @@ IRC relay:

[default route errors]: #default-route-errors
## Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
Beyond how it's configured, it's also useful to know how to debug it. You can
query the exporter from `localhost` to get more information; when using this
method for debugging, you'll most probably want to include debugging output.
For example, to run an ICMP test on host `pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
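For reference, the eventual scrape configuration for such a target follows the
usual upstream blackbox exporter pattern, roughly like this (a hedged sketch in
plain Prometheus terms; our real configuration is generated by Puppet from
Hiera, so names and placement differ):

    scrape_configs:
      - job_name: 'blackbox_icmp'
        metrics_path: /probe
        params:
          module: [icmp]    # must match a module defined in the exporter's config
        static_configs:
          - targets:
              - pauli.torproject.org
        relabel_configs:
          # pass the scraped target as the ?target= URL parameter
          - source_labels: [__address__]
            target_label: __param_target
          # keep the probed host as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # and actually talk to the blackbox exporter itself
          - target_label: __address__
            replacement: localhost:9115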
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
## Advanced metrics ingestion

This section documents more advanced metrics ingestion topics that we
@@ -2010,9 +2030,9 @@ See also [Adding metrics to applications][], above.
## Upgrades

Upgrades are automatically handled by official Debian packages
everywhere, except for Grafana, which is managed through upstream
packages, and Karma, which is managed through a container; both are
still automated.
## SLA

@@ -2046,95 +2066,6 @@ Nagios deployment.
It does not show that Prometheus can federate to multiple instances,
nor that Alertmanager can be configured for high availability.
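For context, federation works by having one Prometheus server scrape selected
series from another server's `/federate` endpoint, along these lines (a generic
upstream-style sketch with placeholder names, not something currently deployed
here):

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{job="node"}'      # only pull the series we care about
        static_configs:
          - targets:
              - 'other-prometheus.example.org:9090'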
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the `alerting`
section of `prometheus.yml`; see the [Installation][] section.
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set. Receivers and routes are
defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts over a
number of other communication channels. For example, to send IRC notifications,
we have a daemon bound to `localhost` on the Prometheus server waiting for
webhook calls, and the corresponding receiver has a `webhook_configs` section
instead of `email_configs`.
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default recipient uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
### Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
Beyond how it's configured, it's also useful to know how to debug it. You can
query the exporter from `localhost` to get more information; when using this
method for debugging, you'll most probably want to include debugging output.
For example, to run an ICMP test on host `pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
### Alertmanager

The [Alertmanager][] is a separate program that receives notifications
@@ -2365,15 +2296,103 @@ notification in a particularly flappy alert][].
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the `alerting`
section of `prometheus.yml`; see the [Installation][] section.
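For illustration, that `alerting` section typically looks something like this
(a sketch with a placeholder target, not a copy of our actual
`prometheus.yml`):

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 'localhost:9093'   # placeholder: wherever Alertmanager listens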
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set. Receivers and routes are
defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts over a
number of other communication channels. For example, to send IRC notifications,
we have a daemon bound to `localhost` on the Prometheus server waiting for
webhook calls, and the corresponding receiver has a `webhook_configs` section
instead of `email_configs`.
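As a rough sketch (the URL and port here are made up, not our actual
configuration), such a webhook receiver would look like:

    - name: 'irc-tor-admin'
      webhook_configs:
        - url: 'http://localhost:8099/alerts'   # hypothetical local IRC relay daemon
          send_resolved: true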
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default recipient uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
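Putting the pieces together, the whole `prometheus::alertmanager::route` key
would look roughly like this (a hedged sketch: the grouping and timing options
shown are illustrative, not necessarily the values in our Hiera):

    prometheus::alertmanager::route:
      receiver: 'fallback'
      group_by: ['alertname', 'team']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # the first matching route wins, unless it sets `continue: true`
        - receiver: 'TPA-email'
          matchers:
            - 'team = "TPA"'
            - 'severity = "critical"'
          continue: true
        - receiver: 'irc-tor-admin'
          matchers:
            - 'team = "TPA"'
            - 'severity =~ "critical|warning"'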
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
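For example, a job can push a metric to the Pushgateway with a plain HTTP
request, along these lines (the hostname and metric name are placeholders):

    echo 'some_job_last_success_timestamp_seconds 1700000000' | \
      curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job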
## Services

Prometheus is made of multiple components:
- Prometheus: a daemon with an HTTP API that scrapes exporters and
targets for metrics, evaluates alerting rules and sends alerts to
the Alertmanager
- Alertmanager: another daemon with HTTP APIs that receives alerts
from one or more Prometheus daemons, gossips with other
  Alertmanagers to deduplicate alerts, and sends notifications to
receivers
- Exporters: HTTP endpoints that expose Prometheus metrics, scraped
by Prometheus
- Node exporter: a specific exporter to expose system-level metrics
like memory, CPU, disk usage and so on
- Text file collector: a directory read by the node exporter where
other tools can drop metrics
So almost everything happens over HTTP or HTTPS.
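For instance, the node exporter can be queried by hand exactly the way
Prometheus scrapes it (assuming the default port 9100):

    curl --silent http://localhost:9100/metrics | grep '^node_load1'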
Many services expose their metrics by running cron jobs or systemd
timers that write to the node exporter text file collector.
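Such a job typically writes a `.prom` file and renames it into place so the
node exporter never reads a half-written file, roughly like this (the path is
the Debian default for the textfile collector, and the metric name is made up):

    echo 'tpa_backup_last_success_timestamp_seconds 1700000000' \
      > /var/lib/prometheus/node-exporter/backup.prom.$$
    mv /var/lib/prometheus/node-exporter/backup.prom.$$ \
       /var/lib/prometheus/node-exporter/backup.prom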
### Monitored services

Those are the actual services monitored by Prometheus.

#### Internal server (`prometheus1`)

The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
@@ -2387,7 +2406,7 @@ authentication only to keep bots away.

[`node_exporter`]: https://github.com/prometheus/node_exporter

#### External server (`prometheus2`)

The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
@@ -2420,7 +2439,7 @@ July 2019 following [#31159][].

[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159

#### Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
@@ -2504,7 +2523,7 @@ for the full deployment plan.

No major issue resolved so far is worth mentioning here.

## Maintainers

The Prometheus services have been set up and are managed by anarcat
inside TPA. The internal Prometheus server is mostly used by TPA staff
...