prom: reorder docs (#41655), authored by anarcat
@@ -1317,6 +1317,26 @@ IRC relay:

[default route errors]: #default-route-errors
## Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
Beyond how it's configured, it's also useful to know how to debug it. You can
query the exporter from `localhost` to get more information; when using this
method for debugging, you'll most probably want to include debugging output.
For example, to run an ICMP test on host `pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
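For reference, the eventual scrape configuration for such a target follows the
usual upstream blackbox exporter pattern, roughly like this (a hedged sketch in
plain Prometheus terms; our real configuration is generated by Puppet from
Hiera, so names and placement differ):

    scrape_configs:
      - job_name: 'blackbox_icmp'
        metrics_path: /probe
        params:
          module: [icmp]    # must match a module defined in the exporter's config
        static_configs:
          - targets:
              - pauli.torproject.org
        relabel_configs:
          # pass the scraped target as the ?target= URL parameter
          - source_labels: [__address__]
            target_label: __param_target
          # keep the probed host as the instance label
          - source_labels: [__param_target]
            target_label: instance
          # and actually talk to the blackbox exporter itself
          - target_label: __address__
            replacement: localhost:9115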
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
## Advanced metrics ingestion

This section documents more advanced metrics ingestion topics that we
@@ -2010,9 +2030,9 @@ See also [Adding metrics to applications][], above.
## Upgrades

Upgrades are automatically handled by official Debian packages
everywhere, except for Grafana, which is managed through upstream
packages, and Karma, which is managed through a container; both are
still automated.
## SLA

@@ -2046,95 +2066,6 @@ Nagios deployment.
It does not show that Prometheus can federate to multiple instances,
nor that Alertmanager can be configured for high availability.
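For context, federation works by having one Prometheus server scrape selected
series from another server's `/federate` endpoint, along these lines (a generic
upstream-style sketch with placeholder names, not something currently deployed
here):

    scrape_configs:
      - job_name: 'federate'
        honor_labels: true
        metrics_path: /federate
        params:
          'match[]':
            - '{job="node"}'      # only pull the series we care about
        static_configs:
          - targets:
              - 'other-prometheus.example.org:9090'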
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the `alerting`
section of `prometheus.yml`; see the [Installation][] section.
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set. Receivers and routes are
defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts over a
number of other communication channels. For example, to send IRC notifications,
we have a daemon bound to `localhost` on the Prometheus server waiting for
webhook calls, and the corresponding receiver has a `webhook_configs` section
instead of `email_configs`.
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default recipient uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
### Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
Beyond how it's configured, it's also useful to know how to debug it. You can
query the exporter from `localhost` to get more information; when using this
method for debugging, you'll most probably want to include debugging output.
For example, to run an ICMP test on host `pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the blackbox exporter. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
[above]: #adding-alert-rules
### Alertmanager

The [Alertmanager][] is a separate program that receives notifications
@@ -2365,15 +2296,103 @@ notification in a particularly flappy alert][].
[in `dispatch.go`, line 460, function `aggrGroup.run()`]: https://github.com/prometheus/alertmanager/blob/e9904f93a7efa063bac628ed0b74184acf1c7401/dispatch/dispatch.go#L460
[mysterious failure to send notification in a particularly flappy alert]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/issues/18
### Alert routing details
Once Prometheus has created an alert, it sends it to one or more instances of
Alertmanager, which in turn is responsible for routing the alert to the right
communication channel.
That is, provided Alertmanager is correctly configured in the `alerting`
section of `prometheus.yml`; see the [Installation][] section.
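For illustration, that `alerting` section typically looks something like this
(a sketch with a placeholder target, not a copy of our actual
`prometheus.yml`):

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 'localhost:9093'   # placeholder: wherever Alertmanager listens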
Alert routes are set as a hierarchical tree in which the first route that
matches gets to handle the alert. The first-matching route may decide to ask
Alertmanager to continue processing with other routes so that the same alert can
match multiple routes. This is how TPA receives emails for critical alerts and
also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set. Receivers and routes are
defined in Hiera, in `hiera/common/prometheus.yaml`.
#### Receivers
Receivers are set in the key `prometheus::alertmanager::receivers` and look like
this:
    - name: 'TPA-email'
      email_configs:
        - to: 'recipient@example.com'
          require_tls: false
          text: '{{ template "email.custom.txt" . }}'
          headers:
            subject: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .GroupLabels.SortedPairs.Values | join " -- " }}'
Here we've configured an email recipient. Alertmanager can send alerts over a
number of other communication channels. For example, to send IRC notifications,
we have a daemon bound to `localhost` on the Prometheus server waiting for
webhook calls, and the corresponding receiver has a `webhook_configs` section
instead of `email_configs`.
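As a rough sketch (the URL and port here are made up, not our actual
configuration), such a webhook receiver would look like:

    - name: 'irc-tor-admin'
      webhook_configs:
        - url: 'http://localhost:8099/alerts'   # hypothetical local IRC relay daemon
          send_resolved: true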
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
The default route _should not be explicitly used_ by alerts. We always want to
explicitly match on a set of labels to send alerts to the correct destination.
Thus, the default recipient uses a different message template that explicitly
says there is a configuration error. This way we can more easily catch what's
been wrongly configured.
The default route has a key `routes`. This is where additional routes are set.
A route needs to set a receiver and can then match on certain label values,
using the `matchers` list. Here's an example for the TPA IRC route:
    - receiver: 'irc-tor-admin'
      matchers:
        - 'team = "TPA"'
        - 'severity =~ "critical|warning"'
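Putting the pieces together, the whole `prometheus::alertmanager::route` key
would look roughly like this (a hedged sketch: the grouping and timing options
shown are illustrative, not necessarily the values in our Hiera):

    prometheus::alertmanager::route:
      receiver: 'fallback'
      group_by: ['alertname', 'team']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        # the first matching route wins, unless it sets `continue: true`
        - receiver: 'TPA-email'
          matchers:
            - 'team = "TPA"'
            - 'severity = "critical"'
          continue: true
        - receiver: 'irc-tor-admin'
          matchers:
            - 'team = "TPA"'
            - 'severity =~ "critical|warning"'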
### Pushgateway
The [Pushgateway][] is a separate server from the main Prometheus
server that is designed to "hold" onto metrics for ephemeral jobs that
would otherwise not be around long enough for Prometheus to scrape
their metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
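For example, a job can push a metric to the Pushgateway with a plain HTTP
request, along these lines (the hostname and metric name are placeholders):

    echo 'some_job_last_success_timestamp_seconds 1700000000' | \
      curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job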
## Services

Prometheus is made of multiple components:
- Prometheus: a daemon with an HTTP API that scrapes exporters and
targets for metrics, evaluates alerting rules and sends alerts to
the Alertmanager
- Alertmanager: another daemon with HTTP APIs that receives alerts
from one or more Prometheus daemons, gossips with other
  Alertmanagers to deduplicate alerts, and sends notifications to
receivers
- Exporters: HTTP endpoints that expose Prometheus metrics, scraped
by Prometheus
- Node exporter: a specific exporter to expose system-level metrics
like memory, CPU, disk usage and so on
- Text file collector: a directory read by the node exporter where
other tools can drop metrics
So almost everything happens over HTTP or HTTPS.
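For instance, the node exporter can be queried by hand exactly the way
Prometheus scrapes it (assuming the default port 9100):

    curl --silent http://localhost:9100/metrics | grep '^node_load1'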
Many services expose their metrics by running cron jobs or systemd
timers that write to the node exporter text file collector.
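Such a job typically writes a `.prom` file and renames it into place so the
node exporter never reads a half-written file, roughly like this (the path is
the Debian default for the textfile collector, and the metric name is made up):

    echo 'tpa_backup_last_success_timestamp_seconds 1700000000' \
      > /var/lib/prometheus/node-exporter/backup.prom.$$
    mv /var/lib/prometheus/node-exporter/backup.prom.$$ \
       /var/lib/prometheus/node-exporter/backup.prom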
### Monitored services

Those are the actual services monitored by Prometheus.

#### Internal server (`prometheus1`)

The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
@@ -2387,7 +2406,7 @@ authentication only to keep bots away.

[`node_exporter`]: https://github.com/prometheus/node_exporter

#### External server (`prometheus2`)

The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
@@ -2420,7 +2439,7 @@ July 2019 following [#31159][].

[this ticket]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159
[#31159]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159

#### Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
@@ -2504,7 +2523,7 @@ for the full deployment plan.

No major issue resolved so far is worth mentioning here.

## Maintainers

The Prometheus services have been set up and are managed by anarcat
inside TPA. The internal Prometheus server is mostly used by TPA staff
...