spell-check prom docs with harper, authored by anarcat
Did this distractedly while idling in a meeting. Filed a bunch of
issues upstream too:

https://github.com/elijah-potter/harper/issues/196
https://github.com/elijah-potter/harper/issues/195
https://github.com/elijah-potter/harper/issues/194
To add a scrape job in a Puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.
Here is, for example, how the GitLab runners are scraped:
```
# tell Prometheus to scrape the exporter
[...]
```
In another example, to configure the ssh scrape jobs (in

    },
    }
But because this is a `blackbox_exporter`, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the `blackbox_exporter` work:
- job_name: 'blackbox_ssh_banner'
metrics_path: '/probe'
- target_label: '__address__'
replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in Hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:
configure a service, you *may* define extra jobs in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is fully managed by Puppet
(e.g. [`gitlab#20`][], but other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:
Those rules are declared on the server, in `profile::prometheus::server::internal`.
[`gitlab#20`]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
## Writing an alert
An [alerting rule][] is a simple YAML file that consists mainly of:
- A name (say `JobDown`).
- A Prometheus query, or "expression" (say `up != 1`).
- Extra labels and annotations.
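Put together, a minimal rule combining those three parts could look something like this (a sketch only: the group name, `for` duration and annotation text here are placeholders, see the alerting rule template section below for the structure we actually use):

```
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Exporter job {{ $labels.job }} on {{ $labels.instance }} is down"
```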
### Expressions
flapping and temporary conditions. Rules of thumb:
more than 24h), `RAIDDegraded` (failed disk won't come back on its
own in 15m)
- `15m`: availability checks, designed to ignore transient errors.
Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
incorrectly but could recover on its own. Examples:
`OutdatedLibraries`, as `needrestart` might recover at the end of
> fixing, but not immediately, no user-visible impact; example:
> server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
> which requires prompt response; example: donation site returns 500
> errors
### Annotations
The playbook *must* include those things:
1. The actual code name of the alert (e.g. `JobDown` or
`DiskWillFillSoon`).
2. An example of the alert output (e.g. `Exporter job gitlab_runner
on tb-build-02.torproject.org:9252 is down`).
3. Why this alert triggered and what its impact is.
4. Optionally, how to reproduce the issue.
5. How to fix it.
How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.
A good example of a proper playbook is the [text file collector errors
playbook here][]. It has all the above points, including actual
fixes for different scenarios.
Here's a template to get started:

```
[...]
document here how you fix this next time.
```
[Fabric]: howto/fabric
[text file collector errors playbook here]: #textfile-collector-errors
### Alerting rule template
```
groups:
- name: [...]
  rules:
```
That structure just serves to declare the rest of the alerts in the
file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with `amtool`
Since the Alertmanager web UI is not available in Debian, you need to
use the [`amtool`][] command. A few useful commands:
* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
--comment="working on it" ALERTNAME`: silence alert ALERTNAME for
--comment="working on it" ALERTNAME`: silence alert `ALERTNAME` for
an hour, with some comments
[`amtool`]: https://manpages.debian.org/amtool.1
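Existing silences can be listed and lifted with the same tool, for example (the silence ID here is made up):

    amtool silence query
    amtool silence expire 8e2b4c9a-1f3d-4e5a-9c7b-123456789abc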
### Checking alert history
defined series), a specific query will generate a specific alert with a given
set of labels and annotations.
Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the Alertmanager
configuration with:
amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).
Note that you can also deliver an alert to a web hook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:
curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
This section documents more advanced metrics injection topics that we
rarely need or use.
### Back-filling
Starting from Prometheus 2.24, Prometheus [now
supports][] [back-filling][]. This is untested, but [this guide][]
might provide a good tutorial.
[now supports]: https://github.com/prometheus/prometheus/issues/535
[back-filling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling
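We have not tried this here; a rough sketch of what it might look like with `promtool` (the file and directory names are assumptions, and the data first needs to be exported in the OpenMetrics text format):

    # turn an OpenMetrics dump into TSDB blocks, then copy them into the data directory
    promtool tsdb create-blocks-from openmetrics metrics.om ./blocks
    cp -r ./blocks/* /var/lib/prometheus/metrics2/   # Debian's default data directory, adjust as needed
    systemctl restart prometheus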
### Push metrics to the Pushgateway
like this every second:
Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196
It's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:
the metrics it collects, and allow you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the Pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem][known problems] in
earlier versions and [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).
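A sketch of that workaround, assuming the Debian package's service name:

    systemctl restart prometheus-pushgateway
    # if persistence is enabled, also remove the file pointed to by --persistence.file first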
### Running out of disk space
In [#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even if the
number of targets didn't change. This is a typical problem in time
series like this where the "cardinality" of metrics grows without
......@@ -1242,7 +1242,7 @@ bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
"[sawtooth wave][]" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions
......@@ -1269,12 +1269,13 @@ long-term storage][] which suggests tweaking the
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
[#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
[sawtooth wave]: https://en.wikipedia.org/wiki/Sawtooth_wave
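To find *which* metrics are responsible, a classic cardinality query lists the metric names with the most series; a sketch, assuming the server listens on the default `localhost:9090`:

    # top 10 metric names by number of series
    promtool query instant http://localhost:9090 'topk(10, count by (__name__)({__name__=~".+"}))'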
### Default route errors
host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.
In this case, [#41667][] was filed.
[#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667
#### Fixing routing
if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|----------------------------------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | Firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |
In this case, we can see there's no `team` label on that metric, which
is the root cause.
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't actually do the exact same query and expect to
find the same host; instead we need to broaden the query without the
conditional (so just `up`) *and* add the right labels. In this case
this should do the trick:
up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
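The same query can also be run from the command line on the Prometheus server, for example (again assuming the default `localhost:9090` listener):

    promtool query instant http://localhost:9090 'up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}'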
no value was provided for a metric, like this:
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up
See [`web/civicrm#149`][] for further details on this
outage.
[`web/civicrm#149`]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149
#### Forbidden errors
Another example might be:
server returned HTTP status 403 Forbidden
In that case, there's a permission issue on the exporter endpoint. Try
to reproduce the issue by pulling the endpoint directly, on the
Prometheus server, with, for example:
curl -sSL https://donate.torproject.org:443/metrics
Or whatever URL is visible in the targets listing above. This could be
a web server configuration issue or a lack of matching credentials in the
exporter configuration. Look in `tor-puppet.git`, the
`profile::prometheus::server::internal::collect_scrape` in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default vhost" was disabled (`apache2::default_vhost` in
"default virtual host" was disabled (`apache2::default_vhost` in
Hiera).
There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` virtual host to answer properly on this address. Verify that it's
present; consider using `apache2ctl -S` to see the virtual host
configuration.
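A quick way to check both from the affected server, as a sketch (the `?auto` URL is what the Apache exporter typically scrapes, and may differ in our configuration):

    curl -s 'http://localhost/server-status?auto' | head
    apache2ctl -S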
See also the [Apache web server diagnostics][] in the incident
```
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```
In this case, the file was created as a temporary file and moved into place
without fixing the permission. The fix was to simply create the file
without the `tempfile` Python library, with a `.tmp` suffix, and just
move it into place.
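In other words, something along these lines (the metric and file names are just illustrations):

    # write to a temporary file next to the final one, then rename it into place
    printf 'tpa_example_metric 42\n' > /var/lib/prometheus/node-exporter/tpa_example.prom.tmp
    mv /var/lib/prometheus/node-exporter/tpa_example.prom.tmp /var/lib/prometheus/node-exporter/tpa_example.prom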
```
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```
This was an experimental metric designed in [#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:
But the file was simply removed in this case.
[#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely
re-buildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node `alberti` has the device `/dev/sda1` mounted
on `/`, formatted as an `ext4` file system which has 16160059392 bytes
(~16GB) free.
exporter", with the following steps:
apt install -t stretch-backports prometheus-node-exporter
This assumes that backports is already configured. If it isn't, a line
like this in `/etc/apt/sources.list.d/backports.debian.org.list` should
suffice, followed by an `apt update`:
deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:
* The host name for the exporter
* The port of the exporter (varies according to the exporter, 9100
for the node exporter)
* How often to scrape the target, if non-default (default: 15 seconds)
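Before filing that ticket, it can be worth confirming that the exporter actually serves metrics locally, for example (9100 being the node exporter's default port):

    curl -s http://localhost:9100/metrics | head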
Then TPA needs to hook those as part of a new node `job` in the
`scrape_configs`, in `prometheus.yml`, from Puppet, in
See also [Adding metrics to applications][], above.
Those are the actual services monitored by Prometheus.
### Internal server (`prometheus1`)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
......@@ -1753,7 +1754,7 @@ authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (`prometheus2`)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
......@@ -1764,10 +1765,10 @@ manually configured by TPA.
Those are the services currently monitored by the external server:
* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might become out of sync with the actual
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
can document more such exporters we find along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [`haproxy_exporter`][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## SLA
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera in `hiera/common/prometheus.yaml`.
#### Receivers
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
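As a rough sketch (only the `fallback` receiver and the `team` label come from our actual setup, the other receiver name is made up), a route keyed on the `team` label might look like:

```
prometheus::alertmanager::route:
  receiver: 'fallback'
  routes:
    - match:
        team: 'network'
      receiver: 'network-team-email'   # made-up receiver name
```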
would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
## `blackbox_exporter`
Most exporters are pretty straightforward: a service binds to a port and exposes
metrics through HTTP requests on that port, generally on the `/metrics` URL.
The `blackbox_exporter`, however, is a little bit more contrived. The exporter can
be configured to run a bunch of different tests (e.g. TCP connections, HTTP
requests, ICMP ping, etc) for a list of targets of its own. So the Prometheus
server has one target, the host with the port for the `blackbox_exporter`, but
that exporter in turn is set to check other hosts.
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
One thing that's nice to know, in addition to how it's configured, is how you can
debug it. You can query the exporter from `localhost` in order to get more
information. If you are using this method for debugging, you'll most probably
want to include debugging output. For example, to run an ICMP test on host
`pauli.torproject.org`:
    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the `blackbox_exporter`. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
that webhook, for example:
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
......@@ -1982,13 +1983,13 @@ that webhook, for example:
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
......@@ -2030,14 +2031,14 @@ again. The [kthxbye bot][] works around that issue.
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
relay that alert to the Alertmanager, and another timer comes in.
Fifth, before relaying that new alert that's already part of a firing
group, Alertmanager will wait `group_interval` (defaults to 5m) before
re-sending a notification to a group.
When Alertmanager first creates an alert group, a thread is started
for that group and the *route's* `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
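The three timers interact roughly like this in a route definition (a sketch; `30s`, `5m` and `4h` are the upstream defaults, not necessarily what we configure):

```
route:
  group_wait: 30s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h  # wait before re-sending a notification that is still firing
```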
Those are major issues that are worth knowing about, in Prometheus in
general and in our setup in particular:
- Bind mounts generate duplicate metrics, upstream issue: [Way to
distinguish bind mounted path?][], possible workaround: manually
specify known bind mount points
(e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
but that can hide actual, real mount points, possible fix: the
`node_filesystem_mount_info` metric, [added in PR 2970 from
2024-07-14][], unreleased as of 2024-08-28
- High cardinality metrics from exporters we do not control can fill
the disk
- No long-term metrics storage, issue: [multi-year metrics storage][]
- The web user interface is really limited, and is actually deprecated, with the
new [React-based one not (yet?) packaged][]
In general, the service is still being launched, see [TPA-RFC-33][]
but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [Voxpupuli
collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by
[bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15
[Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself for system-level metrics but also
WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [#41627][] for the
discussion. This should eventually be mitigated by a high availability
setup ([#41643][]).
[backup procedures]: service/backup
[#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation
Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a down-sampling server in the
future.
[ticket 31244]: https://bugs.torproject.org/31244
metrics are just that: metrics, without thresholds... This makes it
more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and
functionality built into Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
## Goals
## Approvals required
Primary Prometheus server was decided [in the Brussels 2019
developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][].
[in the Brussels 2019 developer meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Cost
N/A
## Alternatives considered
Basically, Prometheus is similar to Munin in many ways:
like Munin
* The agent running on the nodes is called `prometheus-node-exporter`
instead of `munin-node`. It scrapes only a set of built-in
parameters like CPU, disk space and so on; different exporters are
necessary for different applications (like
`prometheus-apache-exporter`) and any application can easily
`/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of
authentication built-in. We rely on IP-level firewalls to avoid
leakage
* The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron`
* Graphics are generated on the fly through the crude Prometheus web
interface or by frontends like Grafana, instead of being constantly
regenerated by `munin-graph`
* Samples are stored in a custom "time series database" (TSDB) in
Prometheus instead of the (ad-hoc) RRD standard
* Prometheus performs *no* down-sampling like RRD and Prom relies on