spell-check prom docs with harper

Authored by anarcat
Did this distractedly while idling in a meeting. Filed a bunch of
issues upstream too:

https://github.com/elijah-potter/harper/issues/196
https://github.com/elijah-potter/harper/issues/195
https://github.com/elijah-potter/harper/issues/194
@@ -201,7 +201,7 @@ To add a scrape job in a puppet profile, you can use the
`prometheus::scrape_job` defined type, or one of the defined types which are
convenience wrappers around that.

Here is, for example, how the GitLab runners are scraped:

```
# tell Prometheus to scrape the exporter
@@ -255,9 +255,9 @@ In another example, to configure the ssh scrape jobs (in
},
}

But because this is a `blackbox_exporter`, the `scrape_configs`
configuration is more involved, as it needs to define the
`relabel_configs` element that makes the `blackbox_exporter` work:
- job_name: 'blackbox_ssh_banner'
  metrics_path: '/probe'

@@ -274,7 +274,7 @@ configuration is more involved, as it needs to define the
    - target_label: '__address__'
      replacement: 'localhost:9115'
Scrape jobs for non-TPA services are defined in Hiera under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here's one example of such a
scrape job definition:

@@ -323,7 +323,7 @@ configure a service, you *may* define extra jobs in the
`profile::prometheus::server::internal` Puppet class.
For example, because the GitLab setup is not fully managed by Puppet
(e.g. [`gitlab#20`][], but other similar issues remain), we
cannot use this automatic setup, so manual scrape targets are defined
like this:

@@ -361,7 +361,7 @@ then we open the port to the Prometheus server on the exporter, with:
Those rules are declared on the server, in `profile::prometheus::server::internal`.

[`gitlab#20`]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20
## Writing an alert

@@ -374,9 +374,9 @@ discussion on that.
An [alerting rule][] is a simple YAML file that consists mainly of:

- A name (say `JobDown`).
- A Prometheus query, or "expression" (say `up != 1`).
- Extra labels and annotations (a minimal sketch combining these follows).
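Here is a minimal sketch of what such a rule file can look like; the
name, expression, duration and label values below are illustrative,
not an actual TPA rule:

```
groups:
  - name: example
    rules:
      - alert: JobDown
        expr: up != 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Exporter job {{ $labels.job }} on {{ $labels.instance }} is down"
```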
### Expressions

@@ -415,7 +415,7 @@ flapping and temporary conditions. Rules of thumb:
  more than 24h), `RAIDDegraded` (failed disk won't come back on its
  own in 15m)
- `15m`: availability checks, designed to ignore transient errors.
  Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
  incorrectly but could recover on its own. Examples:
  `OutdatedLibraries`, as `needrestart` might recover at the end of
@@ -509,8 +509,8 @@ configuration, or alerting rule:
>   fixing, but not immediately, no user-visible impact; example:
>   server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
>   which requires prompt response; example: donation site returns 500
>   errors
### Annotations

@@ -529,17 +529,17 @@ with the alert.
The playbook *must* include those things:

1. The actual code name of the alert (e.g. `JobDown` or
   `DiskWillFillSoon`).
2. An example of the alert output (e.g. `Exporter job gitlab_runner
   on tb-build-02.torproject.org:9252 is down`).
3. Why the alert triggered and what its impact is.
4. Optionally, how to reproduce the issue.
5. How to fix it.

How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
@@ -562,8 +562,8 @@ fixed.
If the playbook becomes too complicated, consider making a [Fabric][]
script out of it.

A good example of a proper playbook is the [Textfile collector errors
playbook here][]. It has all the above points, including actual
fixes for several real-world scenarios.

Here's a template to get started:
@@ -590,7 +590,7 @@ document here how you fix this next time.
```

[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors

### Alerting rule template

@@ -628,8 +628,8 @@ groups:
  rules:
```
That structure just serves to declare the rest of the alerts in the
file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
@@ -680,7 +680,7 @@ predict if a disk will fill in less than 24h:
)

The core of the logic is the magic `predict_linear` function, but
note how it also restricts its checks to file systems with only 20%
space left, to avoid warning about normal write spikes.
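The exact TPA rule is not reproduced here, but the general shape of
such an expression looks roughly like this (the 20% threshold, 6h
range and 24h horizon are illustrative):

```
(
  node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.20
)
and
(
  predict_linear(node_filesystem_avail_bytes[6h], 24 * 60 * 60) < 0
)
```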
[metrics in your application]: #adding-metrics-to-applications

@@ -759,7 +759,7 @@ Those are visible in the [main Grafana dashboard][].

    sort_desc(sum(up{job=~\"$job\"}) by (job)

[Number of CPU cores, memory size, file system and LVM sizes][]:

    count(node_cpu_seconds_total{classes=~\"$class\",mode=\"system\"})
    sum(node_memory_MemTotal_bytes{classes=~\"$class\"}) by (alias)

@@ -775,7 +775,7 @@ See also the [CPU][], [memory][], and [disk][] dashboards.
[Number of machines]: https://prometheus.torproject.org/graph?g0.expr=count(up{job%3D"node"})
[Number of machine per OS version]: https://prometheus.torproject.org/graph?g0.expr=count(node_os_info)+by+(version_id,+version_codename)
[Number of machines per exporters, or technically, number of machines per job]: https://prometheus.torproject.org/graph?g0.expr=sort_desc(sum(up{job%3D~\"$job\"})+by+(job)
[Number of CPU cores, memory size, file system and LVM sizes]: https://prometheus.torproject.org/graph?g0.expr=count(node_cpu_seconds_total{classes%3D~\"$class\",mode%3D\"system\"})
[Uptime, in days]: https://prometheus.torproject.org/graph?g0.expr=round((time()+-+node_boot_time_seconds)+/+(24*60*60))
[main Grafana dashboard]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview
[CPU]: https://grafana.torproject.org/d/gex9eLcWz/cpu-usage

@@ -882,17 +882,17 @@ dashboards][] section for details.
[exposed]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41733
[Alerting dashboards]: #alerting-dashboards
### Managing alerts with `amtool`

Since the Alertmanager web UI is not available in Debian, you need to
use the [`amtool`][] command. A few useful commands:

* `amtool alert`: show firing alerts
* `amtool silence add --duration=1h --author=anarcat
  --comment="working on it" ALERTNAME`: silence alert `ALERTNAME` for
  an hour, with some comments
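Two other subcommands that can come in handy (check `amtool --help`
for the exact set supported by the installed version):

* `amtool silence query`: list active silences
* `amtool silence expire SILENCE_ID`: expire a silence before its end time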
[`amtool`]: https://manpages.debian.org/amtool.1

### Checking alert history

@@ -1009,7 +1009,7 @@ defined series), a specific query will generate a specific alert with a given
set of labels and annotations.

Those labels can then be fed into `amtool` to test routing. For
example, the above alert can be tested against the Alertmanager
configuration with:

    amtool config routes test alertname="NeedsReboot" instance="akka.0x90.dk:9100" job="relay" severity="warning" team="network"
@@ -1035,8 +1035,8 @@ happens if the `team` label is missing or incorrect, to confirm
The above, for example, confirms that `networking` is not the correct
team name (it should be `network`).

Note that you can also deliver an alert to a web hook receiver
synthetically. For example, this will deliver an empty message to the
IRC relay:

    curl --header "Content-Type: application/json" --request POST --data "{}" http://localhost:8098
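To also exercise the message formatting, you can post a payload that
loosely mimics the Alertmanager web hook format (the alert name and
annotations below are made up):

    curl --header "Content-Type: application/json" --request POST \
      --data '{"version": "4", "status": "firing", "alerts": [{"status": "firing", "labels": {"alertname": "TestAlert", "severity": "warning"}, "annotations": {"summary": "test notification"}}]}' \
      http://localhost:8098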
@@ -1050,14 +1050,14 @@ IRC relay:
This section documents more advanced metrics injection topics that we
rarely need or use.
### Back-filling

Starting with version 2.24, Prometheus [now supports][]
[back-filling][]. This is untested, but [this guide][] might provide a
good tutorial.
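As an untested sketch: assuming the historical data has first been
exported in the OpenMetrics text format (here a hypothetical
`metrics.om` file), `promtool` can convert it into TSDB blocks, which
are then copied into the storage directory of a stopped Prometheus
server:

    promtool tsdb create-blocks-from openmetrics metrics.om ./blocks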
[now supports]: https://github.com/prometheus/prometheus/issues/535
[back-filling]: https://prometheus.io/docs/prometheus/latest/storage/#backfilling-from-openmetrics-format
[this guide]: https://tlvince.com/prometheus-backfilling

### Push metrics to the Pushgateway
@@ -1068,7 +1068,7 @@ see the [article about pushing metrics][] before going down this
route.

The Pushgateway is fairly particular: it listens on port 9091 and gets
data through a simple [curl-friendly command line][] [API][]. We
have found that, once installed, this command just "does the right
thing", more or less:
@@ -1087,7 +1087,7 @@ Note that it's [not possible to push timestamps][] into the
Pushgateway, so it's not useful for ingesting historical data.

[article about pushing metrics]: https://prometheus.io/docs/practices/pushing/
[curl-friendly command line]: https://github.com/prometheus/pushgateway#command-line
[API]: https://github.com/prometheus/pushgateway#api
[not possible to push timestamps]: https://github.com/prometheus/pushgateway#about-timestamps
@@ -1187,7 +1187,7 @@ like this every second:

    Nov 01 19:43:03 hetzner-nbg1-02 prometheus[49182]: level=info ts=2022-11-01T19:43:03.788Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=30182 maxSegment=30196

It's somewhat normal. At the time of writing, Prometheus2 takes
over a minute to start because of this problem. When it's done, it
will show the timing information, which is currently:

@@ -1212,7 +1212,7 @@ the metrics it collects, and allow you to view the pending metrics
before they get scraped by Prometheus, which may be useful to
troubleshoot issues with the gateway.

To pull metrics by hand, you can pull directly from the Pushgateway:

    curl localhost:9091/metrics
@@ -1223,7 +1223,7 @@ If you get this error while pulling metrics from the exporter:

    collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which
corrupts the state of the Pushgateway, one of the [known problems][] in
earlier versions that was [fixed in 0.10][] (Debian bullseye and later). A
workaround is simply to restart the Pushgateway (and clear the
storage, if persistence is enabled, see the `--persistence.file`

@@ -1234,7 +1234,7 @@ flag).
### Running out of disk space

In [#41070][], we encountered a situation where disk
usage on the main Prometheus server was growing linearly even though the
number of targets didn't change. This is a typical problem in time
series databases like this one, where the "cardinality" of metrics grows without

@@ -1242,7 +1242,7 @@ bound, consuming more and more disk space as time goes by.
The first step is to confirm the diagnosis by looking at the [Grafana
graph showing Prometheus disk usage][] over time. This should show a
"[sawtooth wave][]" pattern where compactions happen regularly (about once
every three weeks), but without growing much over longer periods of
time. In the above ticket, the usage was growing despite
compactions. There are also shorter-term (~4h) and smaller compactions

@@ -1269,12 +1269,13 @@ long-term storage][] which suggests tweaking the
[This guide from Alexandre Vazquez][] also had some useful queries and
tips we didn't fully investigate.
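To find out which metrics are responsible for the cardinality growth,
the TSDB status endpoint is a reasonable starting point (`jq` is only
used here for readability):

    curl -s http://localhost:9090/api/v1/status/tsdb | jq .data.seriesCountByMetricName

A similar (but potentially expensive) PromQL query shows the top
series counts per metric name:

    topk(10, count by (__name__) ({__name__=~".+"}))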
[#41070]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41070
[Grafana graph showing Prometheus disk usage]: https://grafana.torproject.org/d/000000012/prometheus-2-0-stats?orgId=1&refresh=1m&viewPanel=40&from=now-1y&to=now
[disk usage graphic]: https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=hetzner-nbg1-01.torproject.org&from=now-3d&to=now&viewPanel=2
[upstream Storage documentation]: https://prometheus.io/docs/prometheus/1.8/storage/
[advice on long-term storage]: https://prometheus.io/docs/prometheus/1.8/storage/#settings-for-very-long-retention-time
[This guide from Alexandre Vazquez]: https://alexandre-vazquez.com/how-it-optimize-the-disk-usage-in-the-prometheus-database/
[sawtooth wave]: https://en.wikipedia.org/wiki/Sawtooth_wave
### Default route errors

@@ -1336,9 +1337,9 @@ host are managed by the anti-censorship team service admins. If the
host was *not* managed by TPA or this was a notification about a
*service* operated by the team, then a ticket should be filed there.

In this case, [#41667][] was filed.

[#41667]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41667

#### Fixing routing

@@ -1348,7 +1349,7 @@ if the alert is still firing. In this case, we see this:
| Labels | State | Active Since | Value |
|--------|-------|--------------|-------|
| `alertname="JobDown"` `alias="rdsys-test-01.torproject.org"` `classes="role::rdsys::backend"` `instance="rdsys-test-01.torproject.org:3903"` `job="mtail"` `severity="warning"` | Firing | 2024-07-03 13:51:17.36676096 +0000 UTC | 0 |

In this case, we can see there's no `team` label on that metric, which
is the root cause.
@@ -1379,7 +1380,7 @@ and the following rule:
The query, in this case, is therefore `up < 1`. But since the alert
has resolved, we can't actually run the exact same query and expect to
find the same host; instead, we need to broaden the query by dropping the
conditional (so just `up`) *and* add the right labels. In this case
this should do the trick:

    up{instance="rdsys-test-01.torproject.org:3903",job="mtail"}
@@ -1485,10 +1486,10 @@ no value was provided for a metric, like this:

    # TYPE civicrm_torcrm_resque_processor_status_up gauge
    civicrm_torcrm_resque_processor_status_up

See [`web/civicrm#149`][] for further details on this
outage.

[`web/civicrm#149`]: https://gitlab.torproject.org/tpo/web/civicrm/-/issues/149

#### Forbidden errors

@@ -1496,15 +1497,15 @@ Another example might be:

    server returned HTTP status 403 Forbidden

This indicates a permission issue on the exporter endpoint. Try
to reproduce the issue by pulling the endpoint directly, on the
Prometheus server, with, for example:

    curl -sSL https://donate.torproject.org:443/metrics

Use whatever URL is visible in the targets listing above. This could be
a web server configuration problem or a lack of matching credentials in the
exporter configuration. Look in `tor-puppet.git`, under the
`profile::prometheus::server::internal::collect_scrape` key in
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
@@ -1516,20 +1517,20 @@ test.example.com` (`ApacheScrapingFailed`), Apache is up, but the
[Apache exporter][] cannot pull its metrics from there.

That means the exporter cannot pull the URL
`http://localhost/server-status/?auto`. To reproduce, pull the URL
with curl from the affected server, for example:

    root@test.example.com:~# curl http://localhost/server-status/?auto

This is a typical configuration error in Apache where the
`/server-status` host is not available to the exporter because the
"default virtual host" was disabled (`apache2::default_vhost` in
Hiera).

There is normally a workaround for this in the
`profile::prometheus::apache_exporter` class, which configures a
`localhost` virtual host to answer properly on this address. Verify that it's
present; consider using `apache2ctl -S` to see the virtual host
configuration.

See also the [Apache web server diagnostics][] in the incident

@@ -1538,17 +1539,17 @@ response docs for broader issues with web servers.
[Apache exporter]: https://github.com/Lusitaniae/apache_exporter/
[Apache web server diagnostics]: #apache-web-server-diagnostics
### Textfile collector errors

The `NodeTextfileCollectorErrors` alert looks like this:

    Node exporter textfile collector errors on test.torproject.org

It means that the [textfile collector][] is having trouble parsing one
or more of the files in its `--collector.textfile.directory` (defaults
to `/var/lib/prometheus/node-exporter`).

[textfile collector]: https://github.com/prometheus/node_exporter#textfile-collector

The error should be visible in the node exporter logs; run the
following command to see it:
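On a Debian host that usually means something like this (assuming the
packaged `prometheus-node-exporter` systemd unit):

    journalctl -u prometheus-node-exporter | grep -i textfile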
@@ -1564,7 +1565,7 @@ might be different.
Sep 24 20:56:53 bungei prometheus-node-exporter[1387]: ts=2024-09-24T20:56:53.280Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa_backuppg.prom err="failed to open textfile data file \"/var/lib/prometheus/node-exporter/tpa_backuppg.prom\": open /var/lib/prometheus/node-exporter/tpa_backuppg.prom: permission denied"
```

In this case, the file was created as a temporary file and moved into place
without fixing the permission. The fix was simply to create the file
without the `tempfile` Python library, using a `.tmp` suffix, and
move it into place.
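The general pattern, sketched here in shell with a made-up metric and
file name, is to write the new data next to the final location and
rename it, so the collector never sees a partial or wrongly-owned
file (the collector only reads files ending in `.prom`):

    printf 'node_example_metric 1\n' > /var/lib/prometheus/node-exporter/example.prom.tmp
    mv /var/lib/prometheus/node-exporter/example.prom.tmp /var/lib/prometheus/node-exporter/example.prom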
@@ -1575,7 +1576,7 @@ move it into place.
Sep 24 21:14:41 perdulce prometheus-node-exporter[429]: ts=2024-09-24T21:14:41.783Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=scheduled_shutdown_metric.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/scheduled_shutdown_metric.prom\": text format parsing error in line 3: expected '\"' at start of label value, found 'r'"
```

This was an experimental metric designed in [#41734][] to
keep track of scheduled reboot times, but it was formatted
incorrectly. The entire file content was:

@@ -1596,12 +1597,12 @@ node_shutdown_scheduled_timestamp_seconds{kind="reboot"} 1725545703.588789
But the file was simply removed in this case.

[#41734]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41734
## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new

@@ -1693,8 +1694,8 @@ A real-life (simplified) example:

    node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node `alberti` has the device `/dev/sda1` mounted
on `/`, formatted as an `ext4` file system, which has 16160059392 bytes
(~16GB) free.
[OpenMetrics]: https://openmetrics.io/

@@ -1711,21 +1712,21 @@ exporter", with the following steps:

    apt install -t stretch-backports prometheus-node-exporter

This assumes that backports is already configured. If it isn't, a line
like this in `/etc/apt/sources.list.d/backports.debian.org.list`
should suffice, followed by an `apt update`:

    deb https://deb.debian.org/debian/ stretch-backports main contrib non-free
The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:

* The host name for the exporter
* The port of the exporter (varies according to the exporter, 9100
  for the node exporter)
* How often to scrape the target, if non-default (default: 15 seconds)

Then TPA needs to hook those up as part of a new node `job` in the
`scrape_configs`, in `prometheus.yml`, from Puppet, in
@@ -1739,7 +1740,7 @@ See also [Adding metrics to applications][], above.
Those are the actual services monitored by Prometheus.

### Internal server (`prometheus1`)

The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a [`node_exporter`][] on *all* servers, which

@@ -1753,7 +1754,7 @@ authentication only to keep bots away.
[`node_exporter`]: https://github.com/prometheus/node_exporter

### External server (`prometheus2`)

The "external" server, on the other hand, is more restrictive and does
not allow public access. This is out of concern that specific metrics

@@ -1764,10 +1765,10 @@ manually configured by TPA.
Those are the services currently monitored by the external server:

* [`bridgestrap`][]
* [`rdsys`][]
* OnionPerf external nodes' `node_exporter`
* Connectivity test on (some?) bridges (using the
  [`blackbox_exporter`][])

Note that this list might become out of sync with the actual
@@ -1778,8 +1779,8 @@ This separate server was actually provisioned for the anti-censorship
team (see [this comment for background][]). The server was set up in
July 2019 following [#31159][].

[`bridgestrap`]: https://bridges.torproject.org/bridgestrap-metrics
[`rdsys`]: https://bridges.torproject.org/rdsys-backend-metrics
[`blackbox_exporter`]: https://github.com/prometheus/blackbox_exporter/
[Puppet]: howto/puppet
[this comment for background]: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/29863#note_2593114
@@ -1788,22 +1789,22 @@ July 2019 following [#31159][].
### Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was
built in [ticket #30028][] around launch time. Here we
can document more such exporters we find along the way:

* [Prometheus Onion Service Exporter][] - "Export the status and
  latency of an onion service"
* [`hsprober`][] - similar, but also with histogram buckets, multiple
  attempts, warm-up and error counts
* [`haproxy_exporter`][]

There's also a [list of third-party exporters][] in the Prometheus documentation.

[ticket #30028]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/30028
[Prometheus Onion Service Exporter]: https://github.com/systemli/prometheus-onion-service-exporter/
[`hsprober`]: https://git.autistici.org/ale/hsprober
[`haproxy_exporter`]: https://github.com/prometheus/haproxy_exporter
[list of third-party exporters]: https://prometheus.io/docs/instrumenting/exporters/

## SLA

@@ -1856,7 +1857,7 @@ also IRC notifications for both warning and critical.
Each route needs to have one or more receivers set.
Receivers and routes are defined in Hiera, in `hiera/common/prometheus.yaml`.

#### Receivers

@@ -1879,7 +1880,7 @@ instead of `email_configs`.
#### Routes

Alert routes are set in the key `prometheus::alertmanager::route` in Hiera. The
default route, the one set at the top level of that key, uses the receiver
`fallback` and some default options for other routes.
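As an illustration only (the matcher and the `network-email` receiver
name below are invented, the real tree lives in
`hiera/common/prometheus.yaml`), such a key has roughly this shape:

```
prometheus::alertmanager::route:
  receiver: 'fallback'
  routes:
    - match:
        team: 'network'
      receiver: 'network-email'
```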
@@ -1907,30 +1908,30 @@ would otherwise be around long enough for Prometheus to scrape their
metrics. We use it as a workaround to bridge Metrics data with
Prometheus/Grafana.

## `blackbox_exporter`

Most exporters are pretty straightforward: a service binds to a port and exposes
metrics over HTTP on that port, generally at the `/metrics` URL.

The `blackbox_exporter`, however, is a little more involved. The exporter can
be configured to run a bunch of different tests (TCP connections, HTTP
requests, ICMP ping, etc.) for a list of targets of its own. So the Prometheus
server has one target, the host with the port for the `blackbox_exporter`, but
that exporter in turn is set to check other hosts.

The [upstream documentation][] has some details that can help. We also
have examples [above][] for how to configure it in our setup.

One other useful thing to know is how to debug it. You can query the
exporter from `localhost` to get more information. If you are using
this method for debugging, you'll most probably want to include
debugging output. For example, to run an ICMP test on host
`pauli.torproject.org`:

    curl 'http://localhost:9115/probe?target=pauli.torproject.org&module=icmp&debug=true'

Note that the above trick can be used for _any_ target, not just for ones
currently configured in the `blackbox_exporter`. So you can also use this to test
things before creating the final configuration for the target.

[upstream documentation]: https://github.com/prometheus/blackbox_exporter
@@ -1962,16 +1963,16 @@ builtin support for:
* [Opsgenie][] (now Atlassian)
* Wechat

There's also a [generic web hook receiver][] which is typically used
to send notifications. Many other endpoints are implemented through
that web hook, for example:

* [Cachet][]
* [Dingtalk][]
* [Discord][]
* [Google Chat][]
* [IRC][]
* Matrix: [`matrix-alertmanager`][] (JavaScript) or [knopfler][] (Python), see
  also [#40216][]
* [Mattermost][]
* [Microsoft teams][]

@@ -1982,13 +1983,13 @@ that webhook, for example:
* [Signal][] (or [Signald][])
* [Splunk][]
* [SNMP][]
* Telegram: [`nopp/alertmanager-webhook-telegram-python`][] or [`metalmatze/alertmanager-bot`][]
* [Twilio][]
* [Wechat][]
* Zabbix: [`alertmanager-zabbix-webhook`][] or [`zabbix-alertmanager`][]

And that is only what was available at the time of writing; the
[`alertmanager-webhook`][] and [`alertmanager` tags][] on GitHub might have more.

The Alertmanager has its own web interface to see and silence alerts,
but there are also alternatives like [Karma][] (previously
@@ -2012,14 +2013,14 @@ again. The [kthxbye bot][] works around that issue.
[Victorops]: https://victorops.com
[Pagerduty]: https://pagerduty.com/
[Opsgenie]: https://opsgenie.com
[generic web hook receiver]: https://prometheus.io/docs/alerting/latest/configuration/#webhook_config
[Cachet]: https://github.com/oxyno-zeta/prometheus-cachethq
[Dingtalk]: https://github.com/timonwong/prometheus-webhook-dingtalk
[Discord]: https://github.com/rogerrum/alertmanager-discord
[Google Chat]: https://github.com/mr-karan/calert
[IRC]: https://github.com/crisidev/alertmanager_irc
[#40216]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
[`matrix-alertmanager`]: https://github.com/jaywink/matrix-alertmanager
[knopfler]: https://github.com/sinnwerkstatt/knopfler
[Mattermost]: https://github.com/cpanato/mattermost-plugin-alertmanager
[Microsoft teams]: https://github.com/prometheus-msteams/prometheus-msteams

@@ -2030,14 +2031,14 @@ again. The [kthxbye bot][] works around that issue.
[Signald]: https://github.com/dgl/alertmanager-webhook-signald
[Splunk]: https://github.com/sylr/alertmanager-splunkbot
[SNMP]: https://github.com/maxwo/snmp_notifier
[`nopp/alertmanager-webhook-telegram-python`]: https://github.com/nopp/alertmanager-webhook-telegram-python
[`metalmatze/alertmanager-bot`]: https://github.com/metalmatze/alertmanager-bot
[Twilio]: https://github.com/Swatto/promtotwilio
[Wechat]: https://github.com/daozzg/work_wechat_robot
[`alertmanager-zabbix-webhook`]: https://github.com/gmauleon/alertmanager-zabbix-webhook
[`zabbix-alertmanager`]: https://github.com/devopyio/zabbix-alertmanager
[`alertmanager-webhook`]: https://github.com/topics/alertmanager-webhook
[`alertmanager` tags]: https://github.com/topics/alertmanager
[Karma]: https://karma-dashboard.io/
[unsee]: https://github.com/cloudflare/unsee
[Elm compiler]: https://github.com/elm/compiler
@@ -2098,7 +2099,7 @@ route's `group_by` setting, and then Alertmanager will evaluate the
timers set on the particular route that was matched. An alert group is
created when an alert is received and no other alerts already match
the same values for the `group_by` criteria. An alert group is removed
when all alerts in a group are in state `inactive` (e.g. resolved).

Fourth, there's the `group_wait` setting (defaults to 5 seconds, can
be [customized by route][]). This will keep Alertmanager from

@@ -2120,10 +2121,10 @@ relay that alert to the Alertmanager, and another timer comes in.
Fifth, before relaying that new alert that's already part of a firing
group, Alertmanager will wait `group_interval` (defaults to 5m) before
re-sending a notification to a group.

When Alertmanager first creates an alert group, a thread is started
for that group and the *route's* `group_interval` acts like a time
ticker. Notifications are only sent when the `group_interval` period
repeats.
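In the Alertmanager configuration, those timers sit directly on the
route; an illustrative (not TPA's actual) set of values:

```
route:
  receiver: 'fallback'
  group_by: ['alertname', 'team']
  # wait before sending the first notification for a new group
  group_wait: 5s
  # wait before notifying about new alerts added to an already-firing group
  group_interval: 5m
  # wait before repeating a notification for a group that is still firing
  repeat_interval: 4h
```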
@@ -2180,23 +2181,23 @@ There is no issue tracker specifically for this project, [File][new-ticket] or
Those are major issues worth knowing about, both in Prometheus in
general and in our setup in particular:

- Bind mounts generate duplicate metrics, upstream issue: [Way to
  distinguish bind mounted path?][], possible workaround: manually
  specify known bind mount points
  (e.g. `node_filesystem_avail_bytes{instance=~"$instance:.*",fstype!='tmpfs',fstype!='shm',mountpoint!~"/home|/var/lib/postgresql"}`),
  but that can hide real mount points, possible fix: the
  `node_filesystem_mount_info` metric, [added in PR 2970 from
  2024-07-14][], unreleased as of 2024-08-28
- High cardinality metrics from exporters we do not control can fill
  the disk
- No long-term metrics storage, issue: [multi-year metrics storage][]
- The web user interface is really limited, and is actually deprecated, with the
  new [React-based one not (yet?) packaged][]

In general, the service is still being launched; see [TPA-RFC-33][]
for the full deployment plan.
[Way to distinguish bind mounted path?]: https://github.com/prometheus/node_exporter/issues/600
[added in PR 2970 from 2024-07-14]: https://github.com/prometheus/node_exporter/pull/2970
[multi-year metrics storage]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
[React-based one not (yet?) packaged]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41790
...@@ -2225,7 +2226,7 @@ but it was [salvaged][] by the [Prometheus community][]. ...@@ -2225,7 +2226,7 @@ but it was [salvaged][] by the [Prometheus community][].
Another important layer is the large amount of Puppet code that is Another important layer is the large amount of Puppet code that is
used to deploy Prometheus and its components. This is all part of a used to deploy Prometheus and its components. This is all part of a
big Puppet module, [`puppet-prometheus`][], managed by the [voxpupuli big Puppet module, [`puppet-prometheus`][], managed by the [Voxpupuli
collective][]. Our integration with the module is not yet complete: collective][]. Our integration with the module is not yet complete:
we have a lot of glue code on top of it to correctly make it work with we have a lot of glue code on top of it to correctly make it work with
Debian packages. A lot of work has been done to complete that work by Debian packages. A lot of work has been done to complete that integration by
...@@ -2237,13 +2238,13 @@ details. ...@@ -2237,13 +2238,13 @@ details.
[bind_exporter]: https://github.com/digitalocean/bind_exporter/ [bind_exporter]: https://github.com/digitalocean/bind_exporter/
[salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55 [salvaged]: https://github.com/prometheus-community/bind_exporter/issues/55
[Prometheus community]: https://github.com/prometheus-community/community/issues/15 [Prometheus community]: https://github.com/prometheus-community/community/issues/15
[voxpupuli collective]: https://github.com/voxpupuli [Voxpupuli collective]: https://github.com/voxpupuli
[upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32 [upstream issue 32]: https://github.com/voxpupuli/puppet-prometheus/issues/32
## Monitoring and testing ## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module. the upstream Prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It The server is monitored for basic system-level metrics by Nagios. It
also monitors itself for system-level metrics but also also monitors itself for system-level metrics but also
...@@ -2279,13 +2280,13 @@ require little backups. The metrics themselves are kept in ...@@ -2279,13 +2280,13 @@ require little backups. The metrics themselves are kept in
WAL (write-ahead log) files are ignored by the backups, which can lead WAL (write-ahead log) files are ignored by the backups, which can lead
to an extra 2-3 hours of data loss since the last backup in the case to an extra 2-3 hours of data loss since the last backup in the case
of a total failure, see [tpo/tpa/team#41627][] for the of a total failure (the TSDB writes a new block to disk only about every two hours, so the newest samples exist only in the WAL); see [#41627][] for the
discussion. This should eventually be mitigated by a high availability discussion. This should eventually be mitigated by a high availability
setup ([tpo/tpa/team#41643][]). setup ([#41643][]).
[backup procedures]: service/backup [backup procedures]: service/backup
[tpo/tpa/team#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627 [#41627]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41627
[tpo/tpa/team#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643 [#41643]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41643
## Other documentation ## Other documentation
...@@ -2313,7 +2314,7 @@ traces of Munin were removed in early April 2019 ([ticket 29682][]). ...@@ -2313,7 +2314,7 @@ traces of Munin were removed in early April 2019 ([ticket 29682][]).
Resource requirements were researched in [ticket 29388][] and it was Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope this could one year in November 2019 ([ticket 31244][]) with the hope this could
eventually be expanded further with a downsampling server in the eventually be expanded further with a down-sampling server in the
future. future.
[ticket 31244]: https://bugs.torproject.org/31244 [ticket 31244]: https://bugs.torproject.org/31244
...@@ -2334,7 +2335,7 @@ metrics are just that: metrics, without thresholds... This makes it ...@@ -2334,7 +2335,7 @@ metrics are just that: metrics, without thresholds... This makes it
more difficult to replace Nagios because a ton of alerts need to be more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and rewritten to replace the existing ones. A lot of reports and
functionality built-in to Nagios, like availability reports, functionality built into Nagios, like availability reports,
acknowledgements and other reports, would need to be reimplemented as acknowledgments and other reports, would need to be re-implemented as
well. well.
## Goals ## Goals
...@@ -2362,12 +2363,12 @@ really just second-guessing... ...@@ -2362,12 +2363,12 @@ really just second-guessing...
## Approvals required ## Approvals required
Primary Prometheus server was decided [in the Brussels 2019 The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket developer meeting][], before anarcat joined the team ([ticket
29389][]). Secondary Prometheus server was approved in 29389][]). The secondary Prometheus server was approved in
[meeting/2019-04-08][]. Storage expansion was approved in [meeting/2019-04-08][]. Storage expansion was approved in
[meeting/2019-11-25][]. [meeting/2019-11-25][].
[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring [in the Brussels 2019 developer meeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389 [ticket 29389]: https://bugs.torproject.org/29389
[meeting/2019-04-08]: meeting/2019-04-08 [meeting/2019-04-08]: meeting/2019-04-08
[meeting/2019-11-25]: meeting/2019-11-25 [meeting/2019-11-25]: meeting/2019-11-25
...@@ -2378,7 +2379,7 @@ Prometheus was chosen, see also [Grafana][]. ...@@ -2378,7 +2379,7 @@ Prometheus was chosen, see also [Grafana][].
## Cost ## Cost
N/A. N/A
## Alternatives considered ## Alternatives considered
...@@ -2389,7 +2390,7 @@ from Prometheus, but ultimately decided against it in [TPA-RFC-33][]. ...@@ -2389,7 +2390,7 @@ from Prometheus, but ultimately decided against it in [TPA-RFC-33][].
Alerting rules are currently stored in an external Alerting rules are currently stored in an external
[`prometheus-alerts.git` repository][] that holds not only TPA's [`prometheus-alerts.git` repository][] that holds not only TPA's
alerts, but also those of other teams. So the rules alerts, but also those of other teams. So the rules
are _not_ directly managed by puppet -- although puppet will ensure are *not* directly managed by Puppet -- although Puppet will ensure
that the repository is checked out with the most recent commit on the that the repository is checked out with the most recent commit on the
Prometheus servers. Prometheus servers.
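For context, the server only loads whatever rule files exist on disk at the paths listed under `rule_files` in its main configuration (at startup and on reload); the repository checkout is what puts the files there. A minimal sketch, with a hypothetical checkout path, since the real paths on our servers are set by Puppet:

```
# fragment of prometheus.yml
rule_files:
  - '/etc/prometheus-alerts/rules.d/*.yml'   # hypothetical location of the prometheus-alerts.git checkout
```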
...@@ -2432,7 +2433,7 @@ Basically, Prometheus is similar to Munin in many ways: ...@@ -2432,7 +2433,7 @@ Basically, Prometheus is similar to Munin in many ways:
like Munin like Munin
* The agent running on the nodes is called `prometheus-node-exporter` * The agent running on the nodes is called `prometheus-node-exporter`
instead of `munin-node`. it scrapes only a set of built-in instead of `munin-node`. It scrapes only a set of built-in
parameters like CPU, disk space and so on, different exporters are parameters like CPU, disk space and so on; different exporters are
necessary for different applications (like necessary for different applications (like
`prometheus-apache-exporter`) and any application can easily `prometheus-apache-exporter`) and any application can easily
...@@ -2440,18 +2441,18 @@ Basically, Prometheus is similar to Munin in many ways: ...@@ -2440,18 +2441,18 @@ Basically, Prometheus is similar to Munin in many ways:
`/metrics` endpoint `/metrics` endpoint
* Like Munin, the node exporter doesn't have any form of * Like Munin, the node exporter doesn't have any form of
authentication built-in. we rely on IP-level firewalls to avoid authentication built-in. We rely on IP-level firewalls to avoid
leakage leakage
* The central server is simply called `prometheus` and runs as a * The central server is simply called `prometheus` and runs as a
daemon that wakes up on its own, instead of `munin-update` which is daemon that wakes up on its own, instead of `munin-update` which is
called from `munin-cron` and before that `cron` called from `munin-cron` and before that `cron`
* graphics are generated on the fly through the crude Prometheus web * Graphs are generated on the fly through the crude Prometheus web
interface or by frontends like Grafana, instead of being constantly interface or by frontends like Grafana, instead of being constantly
regenerated by `munin-graph` regenerated by `munin-graph`
* samples are stored in a custom "time series database" (TSDB) in * Samples are stored in a custom "time series database" (TSDB) in
Prometheus instead of the (ad-hoc) RRD standard Prometheus instead of the (ad-hoc) RRD standard
* Prometheus performs *no* down-sampling like RRD and Prom relies on * Prometheus performs *no* down-sampling, unlike RRD, and relies on
... ...
......