write an alert writing tutorial (#41655), authored Oct 01, 2024 by anarcat
service/prometheus.md
@@ -363,11 +363,164 @@ Those rules are declared on the server, in `prometheus::prometheus::server::inte
[tpo/tpa/gitlab#20]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20

## Writing an alert

Now that you have [metrics in your application][] and those are
[scraped by Prometheus][], you are likely going to want to alert on
some of those metrics. Be careful to write alerts that are not too
noisy, and alert on user-visible symptoms, not on underlying technical
issues you *think* might affect users; see our [Alerting philosophy][]
for a discussion of that.

An [alerting rule][] is a simple YAML file that consists mainly of:

- a name (say `JobDown`)
- a Prometheus query, or "expression" (say `up != 1`)
- extra labels and annotations

### Expressions

The most important part of the alert is the `expr` field, which is a
Prometheus query that should evaluate to "true" (non-zero) for the
alert to fire.

Here is, for example, the first alert in the [`rules.d/tpa_node.rules`
file](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/21d67a21ce9926b2eeef0e14b04bb317fb5c94c0/rules.d/tpa_node.rules):

```
- alert: JobDown
  expr: up < 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
    description: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 15 minutes.'
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"
```

In the above, Prometheus will generate an alert if the metric `up` is
not equal to 1 for more than 15 minutes, hence `up < 1`.
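
Expressions can also narrow down which time series they match with
label selectors. For example (a hypothetical variation, not one of our
deployed rules), to alert only on targets of the `node` scrape job:

    up{job="node"} < 1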

### Duration

The `for` field means the alert is not passed down to the
Alertmanager until that much time has passed. It is useful to avoid
flapping and temporary conditions. Rules of thumb:

- `0s`: checks that already have a built-in time threshold in their
  expression (see below), or critical conditions requiring immediate
  action and immediate notification (default). Examples:
  `AptUpdateLagging` (checks for `apt update` not running for more
  than 24h), `RAIDDegraded` (a failed disk won't come back on its own
  in 15m)
- `15m`: availability checks, designed to ignore transient errors.
  Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
  incorrectly but that could recover on their own. Examples:
  `OutdatedLibraries`, as `needrestart` might recover at the end of
  the upgrade job, which could take more than 15m
- `1d`: daily consistency checks. Examples: `PackagesPendingTooLong`
  (upgrades are supposed to run daily)
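
Note that while an alert is waiting out its `for` period, Prometheus
marks it as "pending" rather than "firing". A quick way to see what is
currently pending is to query the built-in `ALERTS` metric in the
expression browser, something like:

    ALERTS{alertstate="pending"}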

### Grouping

At this point, what Prometheus effectively does is generate a message
that it passes along to the Alertmanager with the annotations and the
labels defined in the alerting rule (`severity="warning"`). It also
passes along all other labels that might be attached to the `up`
metric, which is important, as the query can modify which labels are
visible. For example, the `up` metric typically looks like this:

```
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 1
```

Also note that this single expression *will* generate multiple alerts
for multiple matches. For example, if two hosts are down, the metric
would look like this:

```
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 0
up{alias="test-02.torproject.org",classes="role::ldapdb",instance="test-02.torproject.org:9100",job="node",team="TPA"} 0
```

This will generate *two* alerts. This matters, because it can create a
lot of noise and confusion on the other end. A good way to deal with
this is to use [aggregation operators][]. For example, here is the
DRBD alerting rule, which often fires for multiple disks at once
because we're mass-migrating instances in Ganeti:

```
- alert: DRBDDegraded
  expr: count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "DRBD has {{ $value }} out of date disks on {{ $labels.alias }}"
    description: "Found {{ $value }} disks that are out of date on {{ $labels.alias }}."
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/drbd#resyncing-disks"
```

The expression, here, is:

```
count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
```

This matters because otherwise this would create a *lot* of alerts,
one per disk! For example, on `fsn-node-01`, there are *52* drives:

    count(node_drbd_disk_state_is_up_to_date{alias=~"fsn-node-01.*"}) == 52

So we use the `count()` function to count the number of drives per
machine. Technically, we count `by (job, instance, alias, team)`, but
typically those 4 labels will be the same for each alert. We still
have to specify all of them because otherwise they get dropped by the
aggregation function.

Note that the Alertmanager does its own grouping as well, see the
`group_by` setting.
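
For reference, that grouping happens in the Alertmanager's routing
configuration. Here is a minimal sketch of what such a route could
look like; the receiver name and label choices are illustrative, not
our deployed settings:

```
route:
  receiver: 'tpa-email'  # hypothetical receiver name
  # batch alerts sharing these labels into a single notification
  group_by: ['alias', 'team']
  # how long to wait before sending the first notification for a new group
  group_wait: 30s
  # how long to wait before notifying about new alerts added to an existing group
  group_interval: 5m
```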

### Labels

As mentioned above, labels typically come from the metrics used in the
alerting rule itself. It's the job of the exporter and the Prometheus
configuration to attach most of the necessary labels to the metrics
for the Alertmanager to function properly. We expect the following
labels to be produced by either the exporter, the Prometheus scrape
configuration, or the alerting rule:

- `job`: name of the scrape job (e.g. `node`)
- `instance`: host name and port of the affected device, including the
  URL for some `blackbox` probes (e.g. `test-01.torproject.org:9100`,
  `https://www.torproject.org`)
- `alias`: similar to `instance`, without the port number
  (e.g. `test-01.torproject.org`, `https://www.torproject.org`)
- `team`: which group to contact for this alert, which affects how
  alerts get routed
- `severity`: `warning` or `critical`, which also affects routing; use
  `warning` unless the alert is absolutely `critical`.

[TPA-RFC-33][] defines the [alert levels][] as:

> * `warning` (new): non-urgent condition, requiring investigation and
> fixing, but not immediately, no user-visible impact; example:
> server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
> which requires prompt response; example: donation site gives a 500
> error
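
To illustrate the "Prometheus scrape configuration" part of the above,
labels like `alias` and `team` can be attached to every metric of a
target in the scrape job definition. A minimal sketch (our actual
configuration is generated by Puppet, so this is illustrative only):

```
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['test-01.torproject.org:9100']
        # labels attached to every metric scraped from these targets
        labels:
          alias: test-01.torproject.org
          team: TPA
```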

### Annotations

Annotations are another field that's part of the alert generated by
Prometheus. Those are used to generate messages for the users,
depending on the Alertmanager routing. The `summary` field ends up in
the `Subject` field of outgoing email, and the `description` is the
email body, for example.
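
Annotations are Go templates, so they can interpolate
`{{ $labels.foo }}` and `{{ $value }}` as in the examples above, and
Prometheus also provides formatting helpers such as `humanize1024`. A
hypothetical annotations block using one:

```
annotations:
  summary: 'Disk almost full on {{ $labels.alias }}'
  # humanize1024 formats the raw byte count with binary (1024-based) prefixes
  description: 'Only {{ $value | humanize1024 }}B free on {{ $labels.alias }}.'
```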

### Writing a playbook

Every alert in Prometheus *must* have a playbook annotation. This is
(if done well) a URL pointing at a service page like this one,

@@ -418,7 +571,7 @@ Here's a template to get started:
```
### Foo errors

The `FooDegraded` alert looks like this:

    Service Foo has too many errors on test.torproject.org
@@ -439,24 +592,50 @@ document here how you fix this next time.
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors

### Alerting rule template

Here is an alert template that has most fields you should be using in
your alerts. But it already assumes some metrics are available and
scraped by Prometheus. For this, ensure you have followed the
tutorials [Adding metrics to applications][] and
[Adding scrape targets][].

```
- alert: FooDegraded
  expr: sum(foo_error_count) by (job, instance, alias, team)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Service Foo has too many errors on {{ $labels.alias }}"
    description: "Found {{ $value }} errors in service Foo on {{ $labels.alias }}."
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/foo#too-many-errors"
```
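
Before committing a new rule, you can check the file's syntax with
`promtool`, which ships with Prometheus (the file name here is just an
example):

    promtool check rules rules.d/foo.rules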

[Adding scrape targets]: #adding-scrape-targets

### Adding alerting rules to Prometheus

Now that you have an alert, you need to deploy it. The Prometheus
servers regularly pull the [`prometheus-alerts.git` repository][] for
alerting rule and target definitions. Alert rules can be added through
the repository by adding a file in the `rules.d` directory; see the
[`rules.d`][] directory for more documentation on that.

[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d

Note the top of the `.rules` file; for example, in the above
`tpa_node.rules` sample, we didn't include:

```
groups:
  - name: tpa_node
    rules:
```

... as that structure just serves to declare the rest of the alerts in
the file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
we have (above) `tpa_node` for alerts pertaining to the
[`node_exporter`][], for example.
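
Putting the pieces together, a complete minimal rules file (a sketch
built from the `JobDown` example above, not a verbatim copy of
`tpa_node.rules`) would look something like this:

```
groups:
  - name: tpa_node
    rules:
      - alert: JobDown
        expr: up < 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
          playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"
```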

After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
@@ -467,6 +646,52 @@ reloading the Prometheus server with:

    git -C /etc/prometheus-alerts/ pull
    systemctl reload prometheus

### Other expression examples

The `AptUpdateLagging` alert is a good example of an expression with a
built-in threshold:

    (time() - apt_package_cache_timestamp_seconds)/(60*60) > 24

What this does is calculate the age of the package cache (given by the
`apt_package_cache_timestamp_seconds` metric) by subtracting it from
the current time. That gives us a number of seconds, which we convert
to hours (`/3600`) and then check against our threshold (`> 24`). This
gives us a value (in this case, in hours) that we can reuse in our
annotation. In general, the formula looks like:

    (time() - metric_seconds)/$tick > $threshold

where `$tick` is the order of magnitude (minutes, hours, days, etc.)
matching the unit of the threshold. Note that operator precedence here
requires putting the `60*60` tick in parentheses.
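
As a made-up illustration of that formula (the metric name here is
hypothetical, not one we actually export), alerting when a backup has
not succeeded in more than 48 hours could look like:

    # hypothetical metric: unix timestamp of the last successful backup
    (time() - backup_last_success_timestamp_seconds)/(60*60) > 48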

The `DiskWillFillSoon` alert does a [linear regression][] to try to
predict if a disk will fill up in less than 24h:

    (node_filesystem_readonly != 1)
    and (
      node_filesystem_avail_bytes
      / node_filesystem_size_bytes < 0.2
    )
    and (
      predict_linear(node_filesystem_avail_bytes[6h], 24*60*60)
      < 0
    )

The core of the logic is the magic `predict_linear` function, but note
how the expression also restricts its checks to filesystems with only
20% space left, to avoid warning about normal write spikes.
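
To get a feel for what the prediction returns, the `predict_linear()`
call can be run on its own in the expression browser, narrowed down to
a single host (the `alias` matcher here is just an example):

    predict_linear(node_filesystem_avail_bytes{alias="test-01.torproject.org"}[6h], 24*60*60)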

[metrics in your application]: #adding-metrics-to-applications
[scraped by Prometheus]: #adding-scrape-targets
[Alerting philosophy]: #alerting-philosophy
[alerting rule]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[recording rules documentation]: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules
[aggregation operators]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
[alert levels]: policy/tpa-rfc-33-monitoring#alert-levels
[linear regression]: https://en.wikipedia.org/wiki/Linear_regression

# How-to
## Queries cheat sheet
@@ -1773,6 +1998,8 @@ compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].

### Alerting philosophy

In general, when working on alerting, keeping [the "My Philosophy on
Alerting" paper from a Google engineer][] (now the [Monitoring
distributed systems][] chapter of the [Site Reliability