move alerting tutorials up in the tutorials section (#41655), authored Oct 01, 2024 by anarcat
service/prometheus.md
...
## Writing an alerting rule
TODO
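In the meantime, here is a rough sketch of the general shape of a
Prometheus alerting rule: a name, the PromQL expression that triggers
it, how long the condition must hold (`for`), plus labels and
annotations attached to the resulting alert. The alert name, metric
and threshold below are hypothetical, not an actual TPA rule:

```yaml
# Hypothetical example: alert name, metric and duration are made up
# for illustration, this is not an actual TPA rule.
- alert: ApacheDown
  expr: apache_up == 0    # PromQL expression that must hold...
  for: 5m                 # ...for this long before the alert fires
  labels:
    severity: warning     # used for routing and escalation
  annotations:
    summary: "Apache on {{ $labels.instance }} is down"
    # a `playbook` annotation is also required, see the next section
```

See the "Adding alerting rules to Prometheus" section below for where
such a rule actually lives.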
## Writing a playbook
Every alert in Prometheus *must* have a playbook annotation. This is,
if done well, a URL pointing at a service page like this one,
typically in the `Pager playbook` section, that explains how to deal
with the alert.
The playbook *must* include the following:

 1. the actual code name of the alert (e.g. `JobDown` or
    `DiskWillFillSoon`)
 2. an example of the alert output (e.g. `Exporter job gitlab_runner
    on tb-build-02.torproject.org:9252 is down`)
 3. why this alert triggered and what its impact is
 4. optionally, how to reproduce the issue
 5. how to fix it

How to reproduce the issue is optional, but important. Think of
yourself in the future, tired and panicking because things are
broken:

 - Where do you think the error will be visible?
 - Can we `curl` something to see it happening?
 - Is there a dashboard where you can see trends?
 - Is there a specific Prometheus query to run live?
 - Which log file can we inspect?
 - Which systemd service is running it?
The "how to fix it" can be a simple one line, or it can go into a
multiple case example of scenarios that were found in the wild. It's
the hard part: sometimes, when you make an alert, you don't actually
*know*
how to handle the situation. If so, explicitly state that
problem in the playbook, and say you're sorry, and that it should be
fixed.
If the playbook becomes too complicated, consider making a
[
Fabric
][]
script out of it.
A good example of a proper playbook is the
[
Textfile collector errors
playbook here
][]
. It has all of the above points, including actual
fixes for different actual scenarios.
Here's a template to get started:
```
### Foo errors

The `FooAlert` looks like this:

    Service Foo has too many errors on test.torproject.org

It means that the service Foo is having some kind of trouble. [Explain
why this happened, what the impact is, and what this means for which
users. Are we losing money, data, exposing users, etc.]

[Optional] You can tell this is a real issue by going to place X and
trying Y.

[Ideal] To fix this issue, [inverse the polarity of the shift inverter
in service Foo].

[Optional] We do not yet exactly know how to fix this issue, sorry.
Please document here how you fix it next time.
```
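Once such a playbook section exists, the alert's playbook annotation
should point at its anchor on the service page. A minimal sketch,
assuming the annotation key is literally `playbook` and using a
placeholder URL for the `Foo errors` section from the template above:

```yaml
# The URL is a placeholder: point it at the real service page and at
# the anchor of the playbook section (here, "Foo errors").
annotations:
  summary: "Service Foo has too many errors on {{ $labels.instance }}"
  playbook: "https://example.org/wiki/service/foo#foo-errors"
```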
[Fabric]: howto/fabric
[Textfile collector errors playbook here]: #textfile-collector-errors
## Adding alerting rules to Prometheus
Adding an alert mostly comes down to writing an alerting rule
definition, matching on a PromQL expression, in a Git repository.
But it already assumes some metrics are available and scraped by
Prometheus. For this, ensure you have followed the tutorials [Adding
metrics to applications][] and [Adding scrape targets][].
[Adding scrape targets]: #adding-scrape-targets
The Prometheus servers regularly pull the [`prometheus-alerts.git`
repository][] for alerting rule and target definitions. Alert rules
are added by committing a file to the `rules.d` directory; see the
[`rules.d`][] directory for more documentation on that.
[`rules.d`]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d
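A rules file follows the standard Prometheus layout: a top-level
`groups` list, where each group holds one or more alerting rules.
Here is a minimal sketch; the file name, metric and threshold are
hypothetical, and the conventions already in use in `rules.d` take
precedence. If `promtool` is installed locally, such a file can also
be checked with `promtool check rules <file>` before opening a merge
request:

```yaml
# Hypothetical rules.d/foo.rules; follow the conventions already used
# in the prometheus-alerts repository.
groups:
  - name: foo
    rules:
      - alert: FooTooManyErrors
        expr: rate(foo_errors_total[5m]) > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Service Foo has too many errors on {{ $labels.instance }}"
          playbook: "https://example.org/wiki/service/foo#foo-errors"
```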
After being merged, the changes should propagate within [4 to 6
hours][]. Prometheus does _not_ automatically reload those rules by
itself, but Puppet should handle reloading the service as a
consequence of the file changes. TPA members can accelerate this by
running Puppet on the Prometheus servers, or by pulling the code and
reloading the Prometheus server with:

    git -C /etc/prometheus-alerts/ pull
    systemctl reload prometheus
# How-to

## Queries cheat sheet
...
## Alert debugging

We are now using Prometheus for alerting for TPA services. Here's a basic
overview of how things interact around alerting:
...
### Diagnosing alerting failures

Normally, alerts should fire on the Prometheus server and be sent out
...