write an alert writing tutorial (#41655), authored Oct 01, 2024 by anarcat
service/prometheus.md
@@ -363,11 +363,164 @@ Those rules are declared on the server, in `prometheus::prometheus::server::inte
[tpo/tpa/gitlab#20]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/20

## Writing an alert

Now that you have [metrics in your application][] and those are
[scraped by Prometheus][], you are likely going to want to alert on
some of those metrics. Be careful to write alerts that are not too
noisy, and alert on user-visible symptoms, not on underlying technical
issues you *think* might affect users; see our [Alerting philosophy][]
for a discussion of that.

An [alerting rule][] is a simple YAML file that consists mainly of:

- a name (say `JobDown`)
- a Prometheus query, or "expression" (say `up != 1`)
- extra labels and annotations

### Expressions

The most important part of the alert is the `expr` field, which is a
Prometheus query that should evaluate to "true" (non-zero) for the
alert to fire.

Here is, for example, the first alert in the [`rules.d/tpa_node.rules`
file](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/blob/21d67a21ce9926b2eeef0e14b04bb317fb5c94c0/rules.d/tpa_node.rules):

```
- alert: JobDown
  expr: up < 1
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
    description: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} has been unreachable for more than 15 minutes.'
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"
```

In the above, Prometheus will generate an alert if the metric `up` is
not equal to 1 for more than 15 minutes, hence `up < 1`.
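
Expressions can also narrow down which time series they match with
label selectors. For example (a hypothetical variation, not one of our
deployed rules), to alert only on targets of the `node` scrape job:

    up{job="node"} < 1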

### Duration

The `for` field means the alert is not passed down to the
Alertmanager until that much time has passed. It is useful to avoid
flapping and temporary conditions. Rules of thumb:

- `0s`: checks that already have a built-in time threshold in their
  expression (see below), or critical conditions requiring immediate
  action and immediate notification (default). Examples:
  `AptUpdateLagging` (checks for `apt update` not running for more
  than 24h), `RAIDDegraded` (a failed disk won't come back on its own
  in 15m)
- `15m`: availability checks, designed to ignore transient errors.
  Examples: `JobDown`, `DiskFull`
- `1h`: consistency checks, things an operator might have deployed
  incorrectly but that could recover on their own. Examples:
  `OutdatedLibraries`, as `needrestart` might recover at the end of
  the upgrade job, which could take more than 15m
- `1d`: daily consistency checks. Examples: `PackagesPendingTooLong`
  (upgrades are supposed to run daily)
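
Note that while an alert is waiting out its `for` period, Prometheus
marks it as "pending" rather than "firing". A quick way to see what is
currently pending is to query the built-in `ALERTS` metric in the
expression browser, something like:

    ALERTS{alertstate="pending"}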

### Grouping

At this point, what Prometheus effectively does is generate a message
that it passes along to the Alertmanager with the annotations and the
labels defined in the alerting rule (`severity="warning"`). It also
passes along all other labels that might be attached to the `up`
metric, which is important, as the query can modify which labels are
visible. For example, the `up` metric typically looks like this:

```
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 1
```

Also note that this single expression *will* generate multiple alerts
for multiple matches. For example, if two hosts are down, the metric
would look like this:

```
up{alias="test-01.torproject.org",classes="role::ldapdb",instance="test-01.torproject.org:9100",job="node",team="TPA"} 0
up{alias="test-02.torproject.org",classes="role::ldapdb",instance="test-02.torproject.org:9100",job="node",team="TPA"} 0
```

This will generate *two* alerts. This matters, because it can create a
lot of noise and confusion on the other end. A good way to deal with
this is to use [aggregation operators][]. For example, here is the
DRBD alerting rule, which often fires for multiple disks at once
because we're mass-migrating instances in Ganeti:

```
- alert: DRBDDegraded
  expr: count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "DRBD has {{ $value }} out of date disks on {{ $labels.alias }}"
    description: "Found {{ $value }} disks that are out of date on {{ $labels.alias }}."
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/drbd#resyncing-disks"
```

The expression, here, is:

```
count(node_drbd_disk_state_is_up_to_date != 1) by (job, instance, alias, team)
```

This matters because otherwise this would create a *lot* of alerts,
one per disk! For example, on `fsn-node-01`, there are *52* drives:

    count(node_drbd_disk_state_is_up_to_date{alias=~"fsn-node-01.*"}) == 52

So we use the `count()` function to count the number of drives per
machine. Technically, we count `by (job, instance, alias, team)`, but
typically those 4 labels will be the same for each alert. We still
have to specify all of them because otherwise they get dropped by the
aggregation function.

Note that the Alertmanager does its own grouping as well, see the
`group_by` setting.
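
For reference, that grouping happens in the Alertmanager's routing
configuration. Here is a minimal sketch of what such a route could
look like; the receiver name and label choices are illustrative, not
our deployed settings:

```
route:
  receiver: 'tpa-email'  # hypothetical receiver name
  # batch alerts sharing these labels into a single notification
  group_by: ['alias', 'team']
  # how long to wait before sending the first notification for a new group
  group_wait: 30s
  # how long to wait before notifying about new alerts added to an existing group
  group_interval: 5m
```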

### Labels

As mentioned above, labels typically come from the metrics used in the
alerting rule itself. It's the job of the exporter and the Prometheus
configuration to attach most of the necessary labels to the metrics
for the Alertmanager to function properly. We expect the following
labels to be produced by either the exporter, the Prometheus scrape
configuration, or the alerting rule:

- `job`: name of the scrape job (e.g. `node`)
- `instance`: host name and port of the affected device, including the
  URL for some `blackbox` probes (e.g. `test-01.torproject.org:9100`,
  `https://www.torproject.org`)
- `alias`: similar to `instance`, without the port number
  (e.g. `test-01.torproject.org`, `https://www.torproject.org`)
- `team`: which group to contact for this alert, which affects how
  alerts get routed
- `severity`: `warning` or `critical`, which also affects routing; use
  `warning` unless the alert is absolutely `critical`.

[TPA-RFC-33][] defines the [alert levels][] as:

> * `warning` (new): non-urgent condition, requiring investigation and
> fixing, but not immediately, no user-visible impact; example:
> server needs to be rebooted
> * `critical`: serious condition with disruptive user-visible impact
> which requires prompt response; example: donation site gives a 500
> error
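
To illustrate the "Prometheus scrape configuration" part of the above,
labels like `alias` and `team` can be attached to every metric of a
target in the scrape job definition. A minimal sketch (our actual
configuration is generated by Puppet, so this is illustrative only):

```
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['test-01.torproject.org:9100']
        # labels attached to every metric scraped from these targets
        labels:
          alias: test-01.torproject.org
          team: TPA
```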

### Annotations

Annotations are another field that's part of the alert generated by
Prometheus. Those are used to generate messages for the users,
depending on the Alertmanager routing. The `summary` field ends up in
the `Subject` field of outgoing email, and the `description` is the
email body, for example.
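
Annotations are Go templates, so they can interpolate
`{{ $labels.foo }}` and `{{ $value }}` as in the examples above, and
Prometheus also provides formatting helpers such as `humanize1024`. A
hypothetical annotations block using one:

```
annotations:
  summary: 'Disk almost full on {{ $labels.alias }}'
  # humanize1024 formats the raw byte count with binary (1024-based) prefixes
  description: 'Only {{ $value | humanize1024 }}B free on {{ $labels.alias }}.'
```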

### Writing a playbook

Every alert in Prometheus *must* have a playbook annotation. This is
(if done well) a URL pointing at a service page like this one,

@@ -418,7 +571,7 @@ Here's a template to get started:
```
### Foo errors

The `FooDegraded` alert looks like this:

    Service Foo has too many errors on test.torproject.org
@@ -439,24 +592,50 @@ document here how you fix this next time.
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors

### Alerting rule template

Here is an alert template that has most fields you should be using in
your alerts. But it already assumes some metrics are available and
scraped by Prometheus. For this, ensure you have followed the
tutorials [Adding metrics to applications][] and
[Adding scrape targets][].

```
- alert: FooDegraded
  expr: sum(foo_error_count) by (job, instance, alias, team)
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Service Foo has too many errors on {{ $labels.alias }}"
    description: "Found {{ $value }} errors in service Foo on {{ $labels.alias }}."
    playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/foo#too-many-errors"
```
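
Before committing a new rule, you can check the file's syntax with
`promtool`, which ships with Prometheus (the file name here is just an
example):

    promtool check rules rules.d/foo.rules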

[Adding scrape targets]: #adding-scrape-targets

### Adding alerting rules to Prometheus

Now that you have an alert, you need to deploy it. The Prometheus
servers regularly pull the [`prometheus-alerts.git` repository][] for
alerting rule and target definitions. Alert rules can be added through
the repository by adding a file in the `rules.d` directory; see the
[`rules.d`][] directory for more documentation on that.

[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d

Note the top of the `.rules` file; for example, in the above
`tpa_node.rules` sample, we didn't include:

```
groups:
  - name: tpa_node
    rules:
```

... as that structure just serves to declare the rest of the alerts in
the file. However, consider that "rules within a group are run
sequentially at a regular interval, with the same evaluation time"
(see the [recording rules documentation][]). So avoid putting *all*
alerts inside the same file. In TPA, we group alerts by exporter, so
we have (above) `tpa_node` for alerts pertaining to the
[`node_exporter`][], for example.
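
Putting the pieces together, a complete minimal rules file (a sketch
built from the `JobDown` example above, not a verbatim copy of
`tpa_node.rules`) would look something like this:

```
groups:
  - name: tpa_node
    rules:
      - alert: JobDown
        expr: up < 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Exporter job {{ $labels.job }} on {{ $labels.instance }} is down'
          playbook: "https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/prometheus/#exporter-job-down-warnings"
```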

After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
@@ -467,6 +646,52 @@ reloading the Prometheus server with:

    git -C /etc/prometheus-alerts/ pull
    systemctl reload prometheus

### Other expression examples

The `AptUpdateLagging` alert is a good example of an expression with a
built-in threshold:

    (time() - apt_package_cache_timestamp_seconds)/(60*60) > 24

What this does is calculate the age of the package cache (given by the
`apt_package_cache_timestamp_seconds` metric) by subtracting it from
the current time. That gives us a number of seconds, which we convert
to hours (`/3600`) and then check against our threshold (`> 24`). This
gives us a value (in this case, in hours) that we can reuse in our
annotation. In general, the formula looks like:

    (time() - metric_seconds)/$tick > $threshold

where `$tick` is the order of magnitude (minutes, hours, days, etc.)
matching the unit of the threshold. Note that operator precedence here
requires putting the `60*60` tick in parentheses.
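
As a made-up illustration of that formula (the metric name here is
hypothetical, not one we actually export), alerting when a backup has
not succeeded in more than 48 hours could look like:

    # hypothetical metric: unix timestamp of the last successful backup
    (time() - backup_last_success_timestamp_seconds)/(60*60) > 48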

The `DiskWillFillSoon` alert does a [linear regression][] to try to
predict if a disk will fill up in less than 24h:

    (node_filesystem_readonly != 1)
    and (
      node_filesystem_avail_bytes
      / node_filesystem_size_bytes < 0.2
    )
    and (
      predict_linear(node_filesystem_avail_bytes[6h], 24*60*60)
      < 0
    )

The core of the logic is the magic `predict_linear` function, but note
how the expression also restricts its checks to filesystems with only
20% space left, to avoid warning about normal write spikes.
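
To get a feel for what the prediction returns, the `predict_linear()`
call can be run on its own in the expression browser, narrowed down to
a single host (the `alias` matcher here is just an example):

    predict_linear(node_filesystem_avail_bytes{alias="test-01.torproject.org"}[6h], 24*60*60)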

[metrics in your application]: #adding-metrics-to-applications
[scraped by Prometheus]: #adding-scrape-targets
[Alerting philosophy]: #alerting-philosophy
[alerting rule]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[recording rules documentation]: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#recording-rules
[aggregation operators]: https://prometheus.io/docs/prometheus/latest/querying/operators/#aggregation-operators
[alert levels]: policy/tpa-rfc-33-monitoring#alert-levels
[linear regression]: https://en.wikipedia.org/wiki/Linear_regression

# How-to
## Queries cheat sheet
@@ -1773,6 +1998,8 @@ compiler][] which is [not in Debian][]. It can be built by hand
using the `debian/generate-ui.sh` script, but only in newer,
post-buster versions. Another alternative to consider is [Crochet][].

### Alerting philosophy

In general, when working on alerting, keeping [the "My Philosophy on
Alerting" paper from a Google engineer][] (now the [Monitoring
distributed systems][] chapter of the [Site Reliability