The [Alertmanager][] is a separate program that receives notifications
...
[in `dispatch.go`, line 460, function `aggrGroup.run()`]:https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]:https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
This assumes Alertmanager is correctly configured in the `alerting` section of
`prometheus.yml`; see the [Installation][] section.
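
For illustration only, a minimal `alerting` section looks something like this
(the target below is an assumption, using Alertmanager's default port; the
real configuration is managed by Puppet):

    alerting:
      alertmanagers:
        - static_configs:
            # assumed address: Alertmanager running on the same host
            - targets: ['localhost:9093']
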
Alert routes are organized as a hierarchical tree in which the first route that
matches handles the alert. A matching route may also ask Alertmanager to keep
processing the routes that follow (the `continue` option), so that the same
alert can match multiple routes. This is how TPA receives emails for critical
alerts as well as IRC notifications for both warning and critical alerts; a
sketch of that setup is shown at the end of the Routes subsection below.
Each route needs to have a receiver set.
Receivers and routes are defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
- name: 'TPA-email'
email_configs:
- to: 'recipient@example.com'
require_tls: false
text: '{{ template "email.custom.txt" . }}'
headers:
subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts through
a number of other communication channels. For example, to send IRC
notifications, we have a daemon bound to `localhost` on the Prometheus server
waiting for webhook calls, and the corresponding receiver has a
`webhook_configs` section instead of `email_configs`.
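
As a sketch only (the URL and port below are assumptions, not the production
values), such a receiver could look like:

    - name: 'irc-tor-admin'
      webhook_configs:
      # assumed endpoint: the local IRC notification daemon
      - url: 'http://localhost:8099/alerts'
        send_resolved: true
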
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and defines default options that the other routes inherit.
The default route _should not be used_ by any alert: we always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the fallback receiver uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
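
A simplified sketch of what the top-level key can look like (all values here
are illustrative, not the actual production settings):

    prometheus::alertmanager::route:
      receiver: 'fallback'
      # illustrative defaults, inherited by the child routes
      group_by: ['alertname']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes: []   # the actual per-team routes are listed here, see below
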
The default route has a key `routes`: this is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
- receiver: 'irc-tor-admin'
matchers:
- 'team = "TPA"'
- 'severity =~ "critical|warning"'
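
As mentioned above, the `continue` option lets the same alert match several
routes. A sketch of how this combines with email notifications (the ordering
and the email route's matchers are illustrative, not the production settings):

    - receiver: 'irc-tor-admin'
      matchers:
      - 'team = "TPA"'
      - 'severity =~ "critical|warning"'
      # keep evaluating the routes that follow this one
      continue: true
    - receiver: 'TPA-email'
      matchers:
      - 'team = "TPA"'
      - 'severity = "critical"'
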
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
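
One detail worth noting: the scrape job for the Pushgateway should set
`honor_labels: true` so that the `job` and `instance` labels pushed with the
metrics are preserved instead of being overwritten at scrape time. A sketch of
such a job, assuming the Pushgateway listens on its default port 9091:

    scrape_configs:
      - job_name: 'pushgateway'
        # keep the job/instance labels that were pushed with the metrics
        honor_labels: true
        static_configs:
        - targets: ['localhost:9091']
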
## Services
<!-- TODO: open ports, daemons, cron jobs -->
Prometheus is made of multiple components:
- Prometheus: a daemon with an HTTP API that scrapes exporters and
targets for metrics, evaluates alerting rules and sends alerts to
the Alertmanager
- Alertmanager: another daemon with HTTP APIs that receives alerts
  from one or more Prometheus daemons, gossips with other
  Alertmanagers to deduplicate alerts, and sends notifications to
  receivers
- Exporters: HTTP endpoints that expose Prometheus metrics, scraped
by Prometheus
- Node exporter: a specific exporter to expose system-level metrics
like memory, CPU, disk usage and so on
- Text file collector: a directory read by the node exporter where
other tools can drop metrics
So almost everything happens over HTTP or HTTPS.
Many services expose their metrics by running cron jobs or systemd
timers that write to the node exporter text file collector.
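
For example, a cron job or systemd timer can atomically drop a file like the
following into the collector directory (the path and metric name are
illustrative; the directory is whatever `--collector.textfile.directory` points
to on the host):

    # e.g. /var/lib/prometheus/node-exporter/backup.prom (assumed path)
    # HELP backup_last_run_timestamp_seconds Unix time of the last successful backup
    # TYPE backup_last_run_timestamp_seconds gauge
    backup_last_run_timestamp_seconds 1700000000
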
### Monitored services
Those are the actual services monitored by Prometheus.
#### Internal server (`prometheus1`)
The "internal" server scrapes all hosts managed by Puppet for
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
...