spell-check prom docs with harper, authored by anarcat
Did this distractedly while idling in a meeting. Filed a bunch of
issues upstream too:

https://github.com/elijah-potter/harper/issues/196
https://github.com/elijah-potter/harper/issues/195
https://github.com/elijah-potter/harper/issues/194
To add a scrape job in a Puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.
Here is, for example, how the GitLab runners are scraped:
```
# tell Prometheus to scrape the exporter
[...]
```
In another example, to configure the ssh scrape jobs (in

    },
    }
But because this is a `blackbox_exporter`, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the `blackbox_exporter` work:
- job_name: 'blackbox_ssh_banner'
metrics_path: '/probe'
- target_label: '__address__'
replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in Hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:
configure a service, you *may* define extra jobs in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is fully managed by Puppet
(e.g. [`gitlab#20`][], but other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:
Those rules are declared on the server, in `profile::prometheus::server::internal`.
[`gitlab#20`]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
## Writing an alert
An [alerting rule][] is a simple YAML file that consists mainly of:
- A name (say `JobDown`).
- A Prometheus query, or "expression" (say `up != 1`).
- Extra labels and annotations.
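Put together, a minimal rule combining those three parts could look something like this (a sketch only: the group name, `for` duration and annotation text here are placeholders, see the alerting rule template section below for the structure we actually use):

```
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Exporter job {{ $labels.job }} on {{ $labels.instance }} is down"
```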
### Expressions
flapping and temporary conditions. Rules of thumb:
more than 24h), `RAIDDegraded` (failed disk won't come back on its
own in 15m)
- `15m`: availability checks, designed to ignore transient errors.
Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
incorrectly but could recover on its own. Examples:
`OutdatedLibraries`, as `needrestart` might recover at the end of
> fixing, but not immediately, no user-visible impact; example:
> server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
> which requires prompt response; example: donation site returns 500
> errors
### Annotations
The playbook *must* include those things:
1. The actual code name of the alert (e.g. `JobDown` or
`DiskWillFillSoon`).
2. An example of the alert output (e.g. `Exporter job gitlab_runner
on tb-build-02.torproject.org:9252 is down`).
3. Why this alert triggered and what its impact is.
4. Optionally, how to reproduce the issue.
5. How to fix it.
How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.
A good example of a proper playbook is the [text file collector errors
playbook here][]. It has all the above points, including actual
fixes for different scenarios.
Here's a template to get started:

```
[...]
document here how you fix this next time.
```
[Fabric]: howto/fabric
[text file collector errors playbook here]: #textfile-collector-errors
### Alerting rule template
```
groups:
- name: [...]
  rules:
```
That structure just serves to declare the rest of the alerts in the
file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with `amtool`
Since the Alertmanager web UI is not available in Debian, you need to
use the [`amtool`][] command. A few useful commands:
* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
--comment="working on it" ALERTNAME`: silence alert ALERTNAME for
--comment="working on it" ALERTNAME`: silence alert `ALERTNAME` for
an hour, with some comments
[`amtool`]: https://manpages.debian.org/amtool.1
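Existing silences can be listed and lifted with the same tool, for example (the silence ID here is made up):

    amtool silence query
    amtool silence expire 8e2b4c9a-1f3d-4e5a-9c7b-123456789abc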
### Checking alert history
defined series), a specific query will generate a specific alert with a given
set of labels and annotations.
Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the Alertmanager
configuration with:
amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).
Note that you can also deliver an alert to a web hook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:
curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
This section documents more advanced metrics injection topics that we
rarely need or use.
### Back-filling
Starting from Prometheus 2.24, Prometheus [now
supports][] [back-filling][]. This is untested, but [this guide][]
might provide a good tutorial.
[now supports]: https://github.com/prometheus/prometheus/issues/535
[back-filling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling
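We have not tried this here; a rough sketch of what it might look like with `promtool` (the file and directory names are assumptions, and the data first needs to be exported in the OpenMetrics text format):

    # turn an OpenMetrics dump into TSDB blocks, then copy them into the data directory
    promtool tsdb create-blocks-from openmetrics metrics.om ./blocks
    cp -r ./blocks/* /var/lib/prometheus/metrics2/   # Debian's default data directory, adjust as needed
    systemctl restart prometheus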
### Push metrics to the Pushgateway
like this every second:
Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196
It's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:
the metrics it collects, and allow you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the Pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, a [known problem][known problems] in
earlier versions and [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`
flag).
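A sketch of that workaround, assuming the Debian package's service name:

    systemctl restart prometheus-pushgateway
    # if persistence is enabled, also remove the file pointed to by --persistence.file first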
### Running out of disk space
In [#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even if the
number of targets didn't change. This is a typical problem in time
series like this where the "cardinality" of metrics grows without
......@@ -1242,7 +1242,7 @@ bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"sawtooth" pattern where compactions happen regularly (about once
"[sawtooth wave][]" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions
......@@ -1269,12 +1269,13 @@ long-term storage][] which suggests tweaking the
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
[#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
[sawtooth wave]: https://en.wikipedia.org/wiki/Sawtooth_wave
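To find *which* metrics are responsible, a classic cardinality query lists the metric names with the most series; a sketch, assuming the server listens on the default `localhost:9090`:

    # top 10 metric names by number of series
    promtool query instant http://localhost:9090 'topk(10, count by (__name__)({__name__=~".+"}))'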
### Default route errors
host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.
In this case, [#41667][] was filed.
[#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667
#### Fixing routing
if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------|----------------------------------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | Firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |
In this case, we can see there's no `team` label on that metric, which
is the root cause.
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't actually do the exact same query and expect to
find the same host; instead we need to broaden the query without the
conditional (so just `up`) *and* add the right labels. In this case
this should do the trick:
up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
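The same query can also be run from the command line on the Prometheus server, for example (again assuming the default `localhost:9090` listener):

    promtool query instant http://localhost:9090 'up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}'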
no value was provided for a metric, like this:
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up
See [`web/civicrm#149`][] for further details on this
outage.
[`web/civicrm#149`]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149
#### Forbidden errors
Another example might be:
server returned HTTP status 403 Forbidden
In that case, there's a permission issue on the exporter endpoint. Try
to reproduce the issue by pulling the endpoint directly, on the
Prometheus server, with, for example:
curl -sSL https://donate.torproject.org:443/metrics
Or whatever URL is visible in the targets listing above. This could be
a web server configuration issue or a lack of matching credentials in the
exporter configuration. Look in `tor-puppet.git`, the
`profile::prometheus::server::internal::collect_scrape` in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default vhost" was disabled (`apache2::default_vhost` in
"default virtual host" was disabled (`apache2::default_vhost` in
Hiera).
There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` virtual host to answer properly on this address. Verify that it's
present; consider using `apache2ctl -S` to see the virtual host
configuration.
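A quick way to check both from the affected server, as a sketch (the `?auto` URL is what the Apache exporter typically scrapes, and may differ in our configuration):

    curl -s 'http://localhost/server-status?auto' | head
    apache2ctl -S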
See also the [Apache web server diagnostics][] in the incident
```
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```
In this case, the file was created as a temporary file and moved into place
without fixing the permission. The fix was to simply create the file
without the `tempfile` Python library, with a `.tmp` suffix, and just
move it into place.
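In other words, something along these lines (the metric and file names are just illustrations):

    # write to a temporary file next to the final one, then rename it into place
    printf 'tpa_example_metric 42\n' > /var/lib/prometheus/node-exporter/tpa_example.prom.tmp
    mv /var/lib/prometheus/node-exporter/tpa_example.prom.tmp /var/lib/prometheus/node-exporter/tpa_example.prom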
```
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```
This was an experimental metric designed in [#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:
But the file was simply removed in this case.
[#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely
re-buildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node `alberti` has the device `/dev/sda1` mounted
on `/`, formatted as an `ext4` file system which has 16160059392 bytes
(~16GB) free.
exporter", with the following steps:
apt install -t stretch-backports prometheus-node-exporter
This assumes that backports is already configured. If it isn't, a line
like this in `/etc/apt/sources.list.d/backports.debian.org.list` should
suffice, followed by an `apt update`:
deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:
* The host name for the exporter
* The port of the exporter (varies according to the exporter, 9100
for the node exporter)
* How often to scrape the target, if non-default (default: 15 seconds)
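Before filing that ticket, it can be worth confirming that the exporter actually serves metrics locally, for example (9100 being the node exporter's default port):

    curl -s http://localhost:9100/metrics | head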
Then TPA needs to hook those as part of a new node `job` in the
`scrape_configs`, in `prometheus.yml`, from Puppet, in
See also [Adding metrics to applications][], above.
Those are the actual services monitored by Prometheus.
### Internal server (`prometheus1`)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which
......@@ -1753,7 +1754,7 @@ authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter
### External server (`prometheus2`)
The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics
......@@ -1764,10 +1765,10 @@ manually configured by TPA.
Those are the services currently monitored by the external server:
* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
[`blackbox_exporter`][])
Note that this list might become out of sync with the actual
This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].
[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
### Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
can document more such exporters we find along the way:
* [Prometheus Onion Service Exporter][] - "Export the status and
latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
attempts, warm-up and error counts
* [`haproxy_exporter`][]
There's also a [list of third-party exporters][] in the Prometheus documentation.
[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/
## SLA
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera in `hiera/common/prometheus.yaml`.
#### Receivers
#### Routes
Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
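As a rough sketch (only the `fallback` receiver and the `team` label come from our actual setup, the other receiver name is made up), a route keyed on the `team` label might look like:

```
prometheus::alertmanager::route:
  receiver: 'fallback'
  routes:
    - match:
        team: 'network'
      receiver: 'network-team-email'   # made-up receiver name
```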
would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.
## `blackbox_exporter`
Most exporters are pretty straightforward: a service binds to a port and exposes
metrics through HTTP requests on that port, generally on the `/metrics` URL.
The `blackbox_exporter`, however, is a little bit more contrived. The exporter can
be configured to run a bunch of different tests (e.g. TCP connections, HTTP
requests, ICMP ping, etc) for a list of targets of its own. So the Prometheus
server has one target, the host with the port for the `blackbox_exporter`, but
that exporter in turn is set to check other hosts.
The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.
One thing that's nice to know, in addition to how it's configured, is how you can
debug it. You can query the exporter from `localhost` in order to get more
information. If you are using this method for debugging, you'll most probably
want to include debugging output. For example, to run an ICMP test on host
`pauli.torproject.org`:
    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'
Note that the above trick can be used for _any_ target, not just for ones
currently configured in the `blackbox_exporter`. So you can also use this to test
things before creating the final configuration for the target.
[upstream documentation]: https://github.com/prometheus/blackbox_exporter
that webhook, for example:
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
also [#40216][]
* [Mattermost][]
* [Microsoft teams][]
......@@ -1982,13 +1983,13 @@ that webhook, for example:
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]
And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams
......@@ -2030,14 +2031,14 @@ again. The [kthxbye bot][] works around that issue.
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
relay that alert to the Alertmanager, and another timer comes in.
Fifth, before relaying that new alert that's already part of a firing
group, Alertmanager will wait `group_interval` (defaults to 5m) before
re-sending a notification to a group.
When Alertmanager first creates an alert group, a thread is started
for that group and the *route's* `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
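The three timers interact roughly like this in a route definition (a sketch; `30s`, `5m` and `4h` are the upstream defaults, not necessarily what we configure):

```
route:
  group_wait: 30s      # wait before sending the first notification for a new group
  group_interval: 5m   # wait before notifying about new alerts added to an existing group
  repeat_interval: 4h  # wait before re-sending a notification that is still firing
```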
Those are major issues that are worth knowing about, in Prometheus in
general and in our setup in particular:
- Bind mounts generate duplicate metrics, upstream issue: [Way to
distinguish bind mounted path?][], possible workaround: manually
specify known bind mount points
(e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
but that can hide actual, real mount points, possible fix: the
`node_filesystem_mount_info` metric, [added in PR 2970 from
2024-07-14][], unreleased as of 2024-08-28
- High cardinality metrics from exporters we do not control can fill
the disk
- No long-term metrics storage, issue: [multi-year metrics storage][]
- The web user interface is really limited, and is actually deprecated, with the
new [React-based one not (yet?) packaged][]
In general, the service is still being launched, see [TPA-RFC-33][]
but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [Voxpupuli
collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by
[bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15
[Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself for system-level metrics but also
WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [#41627][] for the
discussion. This should eventually be mitigated by a high availability
setup ([#41643][]).
[backup procedures]: service/backup
[#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation
Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a down-sampling server in the
future.
[ticket 31244]: https://bugs.torproject.org/31244
metrics are just that: metrics, without thresholds... This makes it
more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and
functionality built into Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
## Goals
## Approvals required
Primary Prometheus server was decided [in the Brussels 2019
developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][].
[in the Brussels 2019 developer meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25
## Cost
N/A
## Alternatives considered
Basically, Prometheus is similar to Munin in many ways:
like Munin
* The agent running on the nodes is called `prometheus-node-exporter`
instead of `munin-node`. It scrapes only a set of built-in
parameters like CPU, disk space and so on; different exporters are
necessary for different applications (like
`prometheus-apache-exporter`) and any application can easily
`/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of
authentication built-in. We rely on IP-level firewalls to avoid
leakage
* The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron`
* Graphics are generated on the fly through the crude Prometheus web
interface or by frontends like Grafana, instead of being constantly
regenerated by `munin-graph`
* Samples are stored in a custom "time series database" (TSDB) in
Prometheus instead of the (ad-hoc) RRD standard
* Prometheus performs *no* down-sampling like RRD and Prom relies on