spell-check prom docs with harper

Authored by anarcat
Did this distractedly while idling in a meeting. Filed a bunch of
issues upstream too:

https://github.com/elijah-potter/harper/issues/196
https://github.com/elijah-potter/harper/issues/195
https://github.com/elijah-potter/harper/issues/194
@@ -201,7 +201,7 @@ To add a scrape job in a puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.

Here is, for example, how the GitLab runners are scraped:

```
# tell Prometheus to scrape the exporter
@@ -255,9 +255,9 @@ In another example, to configure the ssh scrape jobs (in
},
}

But because this is a `blackbox_exporter`, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the `blackbox_exporter` work:
- job_name: 'blackbox_ssh_banner'
  metrics_path: '/probe'

@@ -274,7 +274,7 @@ configuration is more involved, as it needs to define the
    - target_label: '__address__'
      replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in Hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:

@@ -323,7 +323,7 @@ configure a service, you *may* define extra jobs in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is not fully managed by Puppet
(e.g. [`gitlab#20`][], but other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:

@@ -361,7 +361,7 @@ then we open the port to the Prometheus server on the exporter, with:
Those rules are declared on the server, in `profile::prometheus::server::internal`.

[`gitlab#20`]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
## Writing an alert

@@ -374,9 +374,9 @@ discussion on that.
An [alerting rule][] is a simple YAML file that consists mainly of:

- A name (say `JobDown`).
- A Prometheus query, or "expression" (say `up != 1`).
- Extra labels and annotations (a minimal sketch combining these follows).
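Here is a minimal sketch of what such a rule file can look like; the
name, expression, duration and label values below are illustrative,
not an actual TPA rule:

```
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Exporter job {{ $labels.job }} on {{ $labels.instance }} is down"
```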
### Expressions

@@ -415,7 +415,7 @@ flapping and temporary conditions. Rules of thumb:
  more than 24h), `RAIDDegraded` (failed disk won't come back on its
  own in 15m)
- `15m`: availability checks, designed to ignore transient errors.
  Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
  incorrectly but could recover on its own. Examples:
  `OutdatedLibraries`, as `needrestart` might recover at the end of
@@ -509,8 +509,8 @@ configuration, or alerting rule:
>   fixing, but not immediately, no user-visible impact; example:
>   server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
>   which requires prompt response; example: donation site returns 500
>   errors
### Annotations

@@ -529,17 +529,17 @@ with the alert.
The playbook *must* include those things:

1. The actual code name of the alert (e.g. `JobDown` or
   `DiskWillFillSoon`).
2. An example of the alert output (e.g. `Exporter job gitlab_runner
   on tb-build-02.torproject.org:9252 is down`).
3. Why the alert triggered and what its impact is.
4. Optionally, how to reproduce the issue.
5. How to fix it.

How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
@@ -562,8 +562,8 @@ fixed.
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.

A good example of a proper playbook is the [Textfile collector errors
playbook here][]. It has all the above points, including actual
fixes for several real-world scenarios.

Here's a template to get started:
@@ -590,7 +590,7 @@ document here how you fix this next time.
```

[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors

### Alerting rule template

@@ -628,8 +628,8 @@ groups:
  rules:
```
That structure just serves to declare the rest of the alerts in the
file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
@@ -680,7 +680,7 @@ predict if a disk will fill in less than 24h:
)

The core of the logic is the magic `predict_linear` function, but
note how it also restricts its checks to file systems with only 20%
space left, to avoid warning about normal write spikes.
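The exact TPA rule is not reproduced here, but the general shape of
such an expression looks roughly like this (the 20% threshold, 6h
range and 24h horizon are illustrative):

```
(
  node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.20
)
and
(
  predict_linear(node_filesystem_avail_bytes[6h], 24 * 60 * 60) < 0
)
```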
[metrics in your application]: #adding-metrics-to-applications

@@ -759,7 +759,7 @@ Those are visible in the [main Grafana dashboard][].

    sort_desc(sum(up{job=~\"$job\"}) by (job)

[Number of CPU cores, memory size, file system and LVM sizes][]:

    count(node_cpu_seconds_total{classes=~\"$class\",mode=\"system\"})
    sum(node_memory_MemTotal_bytes{classes=~\"$class\"}) by (alias)

@@ -775,7 +775,7 @@ See also the [CPU][], [memory][], and [disk][] dashboards.
[Number of machines]: https://prometheus.torproject.org/graph?g0.expr=count(up{job%3D"node"})
[Number of machine per OS version]: https://prometheus.torproject.org/graph?g0.expr=count(node_os_info)+by+(version_id,+version_codename)
[Number of machines per exporters, or technically, number of machines per job]: https://prometheus.torproject.org/graph?g0.expr=sort_desc(sum(up{job%3D~\"$job\"})+by+(job)
[Number of CPU cores, memory size, file system and LVM sizes]: https://prometheus.torproject.org/graph?g0.expr=count(node_cpu_seconds_total{classes%3D~\"$class\",mode%3D\"system\"})
[Uptime, in days]: https://prometheus.torproject.org/graph?g0.expr=round((time()+-+node_boot_time_seconds)+/+(24*60*60))
[main Grafana dashboard]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview
[CPU]: https://grafana.torproject.org/d/gex9eLcWz/cpu-usage

@@ -882,17 +882,17 @@ dashboards][] section for details.
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with `amtool`

Since the Alertmanager web UI is not available in Debian, you need to
use the [`amtool`][] command. A few useful commands:

* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
  --comment="working on it" ALERTNAME`: silence alert `ALERTNAME` for
  an hour, with some comments
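Two other subcommands that can come in handy (check `amtool --help`
for the exact set supported by the installed version):

* `amtool silence query`: list active silences
* `amtool silence expire SILENCE_ID`: expire a silence before its end time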
[`amtool`]: https://manpages.debian.org/amtool.1

### Checking alert history

@@ -1009,7 +1009,7 @@ defined series), a specific query will generate a specific alert with a given
set of labels and annotations.

Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the Alertmanager
configuration with:

    amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
@@ -1035,8 +1035,8 @@ happens if the `team` label is missing or incorrect, to confirm
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).

Note that you can also deliver an alert to a web hook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:

    curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
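To also exercise the message formatting, you can post a payload that
loosely mimics the Alertmanager web hook format (the alert name and
annotations below are made up):

    curl --header "Content-Type: application/json" --request POST \
      --data '{"version": "4", "status": "firing", "alerts": [{"status": "firing", "labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"summary": "test notification"}}]}' \
      http://localhost:8098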
@@ -1050,14 +1050,14 @@ IRC relay:
This section documents more advanced metrics injection topics that we
rarely need or use.
### Back-filling

Starting with version 2.24, Prometheus [now supports][]
[back-filling][]. This is untested, but [this guide][] might provide a
good tutorial.
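As an untested sketch: assuming the historical data has first been
exported in the OpenMetrics text format (here a hypothetical
`metrics.om` file), `promtool` can convert it into TSDB blocks, which
are then copied into the storage directory of a stopped Prometheus
server:

    promtool tsdb create-blocks-from openmetrics metrics.om ./blocks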
[now supports]: https://github.com/prometheus/prometheus/issues/535
[back-filling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling

### Push metrics to the Pushgateway
@@ -1068,7 +1068,7 @@ see the [article about pushing metrics][] before going down this
route.

The Pushgateway is fairly particular: it listens on port 9091 and gets
data through a simple [curl-friendly command line][] [API][]. We
have found that, once installed, this command just "does the right
thing", more or less:
@@ -1087,7 +1087,7 @@ Note that it's [not possible to push timestamps][] into the
Pushgateway, so it's not useful for ingesting historical data.

[article about pushing metrics]: https://prometheus.io/docs/practices/pushing/
[curl-friendly command line]: https://github.com/prometheus/pushgateway#command-line
[API]: https://github.com/prometheus/pushgateway#api
[not possible to push timestamps]: https://github.com/prometheus/pushgateway#about-timestamps
@@ -1187,7 +1187,7 @@ like this every second:

    Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196

It's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:

@@ -1212,7 +1212,7 @@ the metrics it collects, and allow you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.

To pull metrics by hand, you can pull directly from the Pushgateway:

    curl localhost:9091/metrics
@@ -1223,7 +1223,7 @@ If you get this error while pulling metrics from the exporter:

    collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, one of the [known problems][] in
earlier versions that was [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`

@@ -1234,7 +1234,7 @@ flag).
### Running out of disk space

In [#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even though the
number of targets didn't change. This is a typical problem in time
series databases like this one, where the "cardinality" of metrics grows without

@@ -1242,7 +1242,7 @@ bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"[sawtooth wave][]" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions

@@ -1269,12 +1269,13 @@ long-term storage][] which suggests tweaking the
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
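To find out which metrics are responsible for the cardinality growth,
the TSDB status endpoint is a reasonable starting point (`jq` is only
used here for readability):

    curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.seriesCountByMetricName

A similar (but potentially expensive) PromQL query shows the top
series counts per metric name:

    topk(10, count by (__name__) ({__name__=~".+"}))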
[#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
[sawtooth wave]: https://en.wikipedia.org/wiki/Sawtooth_wave
### Default route errors

@@ -1336,9 +1337,9 @@ host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.

In this case, [#41667][] was filed.

[#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667

#### Fixing routing

@@ -1348,7 +1349,7 @@ if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|--------|-------|--------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | Firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |

In this case, we can see there's no `team` label on that metric, which
is the root cause.
@@ -1379,7 +1380,7 @@ and the following rule:
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't actually run the exact same query and expect to
find the same host; instead, we need to broaden the query by dropping the
conditional (so just `up`) *and* add the right labels. In this case
this should do the trick:

    up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
@@ -1485,10 +1486,10 @@ no value was provided for a metric, like this:

    # TYPE civicrm_torcrm_resque_processor_status_up gauge
    civicrm_torcrm_resque_processor_status_up

See [`web/civicrm#149`][] for further details on this
outage.

[`web/civicrm#149`]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149

#### Forbidden errors

@@ -1496,15 +1497,15 @@ Another example might be:

    server returned HTTP status 403 Forbidden

This indicates a permission issue on the exporter endpoint. Try
to reproduce the issue by pulling the endpoint directly, on the
Prometheus server, with, for example:

    curl -sSL https://donate.torproject.org:443/metrics

Use whatever URL is visible in the targets listing above. This could be
a web server configuration problem or a lack of matching credentials in the
exporter configuration. Look in `tor-puppet.git`, under the
`profile::prometheus::server::internal::collect_scrape` key in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
@@ -1516,20 +1517,20 @@ test.example.com` (`ApacheScrapingFailed`), Apache is up, but the
[Apache exporter][] cannot pull its metrics from there.

That means the exporter cannot pull the URL
`http://localhost/server-status/?auto`. To reproduce, pull the URL
with curl from the affected server, for example:

    root@test.example.com:~# curl http://localhost/server-status/?auto

This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default virtual host" was disabled (`apache2::default_vhost` in
Hiera).

There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` virtual host to answer properly on this address. Verify that it's
present; consider using `apache2ctl -S` to see the virtual host
configuration.

See also the [Apache web server diagnostics][] in the incident

@@ -1538,17 +1539,17 @@ response docs for broader issues with web servers.
[Apache exporter]: https://github.com/Lusitaniae/apache_exporter/
[Apache web server diagnostics]: #apache-web-server-diagnostics
### Textfile collector errors

The `NodeTextfileCollectorErrors` alert looks like this:

    Node exporter textfile collector errors on test.torproject.org

It means that the [textfile collector][] is having trouble parsing one
or more of the files in its `--collector.textfile.directory` (defaults
to `/var/lib/prometheus/node-exporter`).

[textfile collector]: https://github.com/prometheus/node_exporter#textfile-collector

The error should be visible in the node exporter logs; run the
following command to see it:
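On a Debian host that usually means something like this (assuming the
packaged `prometheus-node-exporter` systemd unit):

    journalctl -u prometheus-node-exporter | grep -i textfile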
@@ -1564,7 +1565,7 @@ might be different.
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```

In this case, the file was created as a temporary file and moved into place
without fixing the permission. The fix was simply to create the file
without the `tempfile` Python library, using a `.tmp` suffix, and
move it into place.
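The general pattern, sketched here in shell with a made-up metric and
file name, is to write the new data next to the final location and
rename it, so the collector never sees a partial or wrongly-owned
file (the collector only reads files ending in `.prom`):

    printf 'node_example_metric 1\n' > /var/lib/prometheus/node-exporter/example.prom.tmp
    mv /var/lib/prometheus/node-exporter/example.prom.tmp /var/lib/prometheus/node-exporter/example.prom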
@@ -1575,7 +1576,7 @@ move it into place.
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```

This was an experimental metric designed in [#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:

@@ -1596,12 +1597,12 @@ node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789
But the file was simply removed in this case.

[#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new

@@ -1693,8 +1694,8 @@ A real-life (simplified) example:

    node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node `alberti` has the device `/dev/sda1` mounted
on `/`, formatted as an `ext4` file system, which has 16160059392 bytes
(~16GB) free.
[OpenMetrics]: https://openmetrics.io/

@@ -1711,21 +1712,21 @@ exporter", with the following steps:

    apt install -t stretch-backports prometheus-node-exporter

This assumes that backports is already configured. If it isn't, a line
like this in `/etc/apt/sources.list.d/backports.debian.org.list`
should suffice, followed by an `apt update`:

    deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:

* The host name for the exporter
* The port of the exporter (varies according to the exporter, 9100
  for the node exporter)
* How often to scrape the target, if non-default (default: 15 seconds)

Then TPA needs to hook those up as part of a new node `job` in the
`scrape_configs`, in `prometheus.yml`, from Puppet, in
@@ -1739,7 +1740,7 @@ See also [Adding metrics to applications][], above.
Those are the actual services monitored by Prometheus.

### Internal server (`prometheus1`)

The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which

@@ -1753,7 +1754,7 @@ authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter

### External server (`prometheus2`)

The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics

@@ -1764,10 +1765,10 @@ manually configured by TPA.
Those are the services currently monitored by the external server:

* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
  [`blackbox_exporter`][])

Note that this list might become out of sync with the actual
@@ -1778,8 +1779,8 @@ This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].

[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
@@ -1788,22 +1789,22 @@ July 2019 following [#31159][].
### Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
can document more such exporters we find along the way:

* [Prometheus Onion Service Exporter][] - "Export the status and
  latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
  attempts, warm-up and error counts
* [`haproxy_exporter`][]

There's also a [list of third-party exporters][] in the Prometheus documentation.

[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/

## SLA

@@ -1856,7 +1857,7 @@ also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera, in `hiera/common/prometheus.yaml`.

#### Receivers

@@ -1879,7 +1880,7 @@ instead of `email_configs`.
#### Routes

Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
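As an illustration only (the matcher and the `network-email` receiver
name below are invented, the real tree lives in
`hiera/common/prometheus.yaml`), such a key has roughly this shape:

```
prometheus::alertmanager::route:
  receiver: 'fallback'
  routes:
    - match:
        team: 'network'
      receiver: 'network-email'
```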
@@ -1907,30 +1908,30 @@ would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.

## `blackbox_exporter`

Most exporters are pretty straightforward: a service binds to a port and exposes
metrics over HTTP on that port, generally at the `/metrics` URL.

The `blackbox_exporter`, however, is a little more involved. The exporter can
be configured to run a bunch of different tests (TCP connections, HTTP
requests, ICMP ping, etc.) for a list of targets of its own. So the Prometheus
server has one target, the host with the port for the `blackbox_exporter`, but
that exporter in turn is set to check other hosts.

The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.

One other useful thing to know is how to debug it. You can query the
exporter from `localhost` to get more information. If you are using
this method for debugging, you'll most probably want to include
debugging output. For example, to run an ICMP test on host
`pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'

Note that the above trick can be used for _any_ target, not just for ones
currently configured in the `blackbox_exporter`. So you can also use this to test
things before creating the final configuration for the target.

[upstream documentation]: https://github.com/prometheus/blackbox_exporter
@@ -1962,16 +1963,16 @@ builtin support for:
* [Opsgenie][] (now Atlassian)
* Wechat

There's also a [generic web hook receiver][] which is typically used
to send notifications. Many other endpoints are implemented through
that web hook, for example:

* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
  also [#40216][]
* [Mattermost][]
* [Microsoft teams][]

@@ -1982,13 +1983,13 @@ that webhook, for example:
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]

And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might have more.

The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
@@ -2012,14 +2013,14 @@ again. The [kthxbye bot][] works around that issue.
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams

@@ -2030,14 +2031,14 @@ again. The [kthxbye bot][] works around that issue.
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
@@ -2098,7 +2099,7 @@ route's `group_by` setting, and then Alertmanager will evaluate the
timers set on the particular route that was matched. An alert group is
created when an alert is received and no other alerts already match
the same values for the `group_by` criteria. An alert group is removed
when all alerts in a group are in state `inactive` (e.g. resolved).

Fourth, there's the `group_wait` setting (defaults to 5 seconds, can
be [customized by route][]). This will keep Alertmanager from

@@ -2120,10 +2121,10 @@ relay that alert to the Alertmanager, and another timer comes in.
Fifth, before relaying that new alert that's already part of a firing
group, Alertmanager will wait `group_interval` (defaults to 5m) before
re-sending a notification to a group.

When Alertmanager first creates an alert group, a thread is started
for that group and the *route's* `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
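In the Alertmanager configuration, those timers sit directly on the
route; an illustrative (not TPA's actual) set of values:

```
route:
  receiver: 'fallback'
  group_by: ['alertname', 'team']
  # wait before sending the first notification for a new group
  group_wait: 5s
  # wait before notifying about new alerts added to an already-firing group
  group_interval: 5m
  # wait before repeating a notification for a group that is still firing
  repeat_interval: 4h
```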
@@ -2180,23 +2181,23 @@ There is no issue tracker specifically for this project, [File][new-ticket] or
Those are major issues worth knowing about, both in Prometheus in
general and in our setup in particular:

- Bind mounts generate duplicate metrics, upstream issue: [Way to
  distinguish bind mounted path?][], possible workaround: manually
  specify known bind mount points
  (e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
  but that can hide real mount points, possible fix: the
  `node_filesystem_mount_info` metric, [added in PR 2970 from
  2024-07-14][], unreleased as of 2024-08-28
- High cardinality metrics from exporters we do not control can fill
  the disk
- No long-term metrics storage, issue: [multi-year metrics storage][]
- The web user interface is really limited, and is actually deprecated, with the
  new [React-based one not (yet?) packaged][]

In general, the service is still being launched; see [TPA-RFC-33][]
for the full deployment plan.
[Way to distinguish bind mounted path?]: https://github.com/prometheus/node_exporter/issues/600
[added in PR 2970 from 2024-07-14]: https://github.com/prometheus/node_exporter/pull/2970
[multi-year metrics storage]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
[React-based one not (yet?) packaged]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41790
...@@ -2225,7 +2226,7 @@ but it was [salvaged][] by the [Prometheus community][]. ...@@ -2225,7 +2226,7 @@ but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [voxpupuli big Puppet module, [`puppet-prometheus`][], managed by the [Voxpupuli
collective][]. Our integration with the module is not yet complete: collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by Debian packages. A lot of work has been done to complete that integration by
...@@ -2237,13 +2238,13 @@ details. ...@@ -2237,13 +2238,13 @@ details.
[bind_exporter]: https://github.com/digitalocean/bind_exporter/ [bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55 [salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15 [Prometheus community]: https://github.com/prometheus-community/community/issues/15
[voxpupuli collective]: https://github.com/voxpupuli [Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32 [upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing ## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module. the upstream Prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It The server is monitored for basic system-level metrics by Nagios. It
also monitors itself for system-level metrics but also also monitors itself for system-level metrics but also
...@@ -2279,13 +2280,13 @@ require little backups. The metrics themselves are kept in ...@@ -2279,13 +2280,13 @@ require little backups. The metrics themselves are kept in
WAL (write-ahead log) files are ignored by the backups, which can lead WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [tpo/tpa/team#41627][] for the of a total failure (the TSDB writes a new block to disk only about every two hours, so the newest samples exist only in the WAL); see [#41627][] for the
discussion. This should eventually be mitigated by a high availability discussion. This should eventually be mitigated by a high availability
setup ([tpo/tpa/team#41643][]). setup ([#41643][]).
[backup procedures]: service/backup [backup procedures]: service/backup
[tpo/tpa/team#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627 [#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[tpo/tpa/team#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643 [#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation ## Other documentation
...@@ -2313,7 +2314,7 @@ traces of Munin were removed in early April 2019 ([ticket 29682][]). ...@@ -2313,7 +2314,7 @@ traces of Munin were removed in early April 2019 ([ticket 29682][]).
Resource requirements were researched in [ticket 29388][] and it was Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a downsampling server in the eventually be expanded further with a down-sampling server in the
future. future.
[ticket 31244]: https://bugs.torproject.org/31244 [ticket 31244]: https://bugs.torproject.org/31244
...@@ -2334,7 +2335,7 @@ metrics are just that: metrics, without thresholds... This makes it ...@@ -2334,7 +2335,7 @@ metrics are just that: metrics, without thresholds... This makes it
more difficult to replace Nagios because a ton of alerts need to be more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and rewritten to replace the existing ones. A lot of reports and
functionality built-in to Nagios, like availability reports, functionality built into Nagios, like availability reports,
acknowledgements and other reports, would need to be reimplemented as acknowledgments and other reports, would need to be re-implemented as
well. well.
## Goals ## Goals
...@@ -2362,12 +2363,12 @@ really just second-guessing... ...@@ -2362,12 +2363,12 @@ really just second-guessing...
## Approvals required ## Approvals required
Primary Prometheus server was decided [in the Brussels 2019 The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in 29389][]). The secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in [meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][]. [meeting/2019-11-25][].
[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring [in the Brussels 2019 developer meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389 [ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08 [meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25 [meeting/2019-11-25]: meeting/2019-11-25
...@@ -2378,7 +2379,7 @@ Prometheus was chosen, see also [Grafana][]. ...@@ -2378,7 +2379,7 @@ Prometheus was chosen, see also [Grafana][].
## Cost ## Cost
N/A. N/A
## Alternatives considered ## Alternatives considered
...@@ -2389,7 +2390,7 @@ from Prometheus, but ultimately decided against it in [TPA-RFC-33][]. ...@@ -2389,7 +2390,7 @@ from Prometheus, but ultimately decided against it in [TPA-RFC-33][].
Alerting rules are currently stored in an external Alerting rules are currently stored in an external
[`prometheus-alerts.git` repository][] that holds not only TPA's [`prometheus-alerts.git` repository][] that holds not only TPA's
alerts, but also those of other teams. So the rules alerts, but also those of other teams. So the rules
are _not_ directly managed by puppet -- although puppet will ensure are *not* directly managed by Puppet -- although Puppet will ensure
that the repository is checked out with the most recent commit on the that the repository is checked out with the most recent commit on the
Prometheus servers. Prometheus servers.
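For context, the server only loads whatever rule files exist on disk at the paths listed under `rule_files` in its main configuration (at startup and on reload); the repository checkout is what puts the files there. A minimal sketch, with a hypothetical checkout path, since the real paths on our servers are set by Puppet:

```
# fragment of prometheus.yml
rule_files:
  - '/etc/prometheus-alerts/rules.d/*.yml'   # hypothetical location of the prometheus-alerts.git checkout
```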
...@@ -2432,7 +2433,7 @@ Basically, Prometheus is similar to Munin in many ways: ...@@ -2432,7 +2433,7 @@ Basically, Prometheus is similar to Munin in many ways:
like Munin like Munin
* The agent running on the nodes is called `prometheus-node-exporter` * The agent running on the nodes is called `prometheus-node-exporter`
instead of `munin-node`. it scrapes only a set of built-in instead of `munin-node`. It scrapes only a set of built-in
parameters like CPU, disk space and so on, different exporters are parameters like CPU, disk space and so on; different exporters are
necessary for different applications (like necessary for different applications (like
`prometheus-apache-exporter`) and any application can easily `prometheus-apache-exporter`) and any application can easily
...@@ -2440,18 +2441,18 @@ Basically, Prometheus is similar to Munin in many ways: ...@@ -2440,18 +2441,18 @@ Basically, Prometheus is similar to Munin in many ways:
`/metrics` endpoint `/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of * Like Munin, the node exporter doesn't have any form of
authentication built-in. we rely on IP-level firewalls to avoid authentication built-in. We rely on IP-level firewalls to avoid
leakage leakage
* The central server is simply called `prometheus` and runs as a * The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron` called from `munin-cron` and before that `cron`
* graphics are generated on the fly through the crude Prometheus web * Graphs are generated on the fly through the crude Prometheus web
interface or by frontends like Grafana, instead of being constantly interface or by frontends like Grafana, instead of being constantly
regenerated by `munin-graph` regenerated by `munin-graph`
* samples are stored in a custom "time series database" (TSDB) in * Samples are stored in a custom "time series database" (TSDB) in
Prometheus instead of the (ad-hoc) RRD standard Prometheus instead of the (ad-hoc) RRD standard
* Prometheus performs *no* down-sampling like RRD and Prom relies on * Prometheus performs *no* down-sampling, unlike RRD, and relies on
... ...
......