Cecylia Bocovich / Wiki Replica / Commits / e4a68d22
Commit e4a68d22 (unverified), authored 3 years ago by anarcat
add more info about alerts
parent 5a4be1f7
howto/prometheus.md: 142 additions, 2 deletions
@@ -89,6 +89,135 @@ TODO: talk about `scrape_jobs` for in-puppet configurations.
TODO: show how to hook a custom scrape job, and on which server to put
it.
## Web dashboard usage
The main web dashboard for the internal Prometheus server should be
accessible at <https://prometheus.torproject.org> using the
well-known, public username.
The dashboard for the external Prometheus server, however, is not
publicly available. To reach it, use the following command line to
forward ports over SSH:

    ssh -L 9090:localhost:9090 -L 9091:localhost:9091 -L 9093:localhost:9093 prometheus2.torproject.org

The above will also forward the management interfaces of the
Alertmanager (port 9093) and Pushgateway (9091).
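
As a rough sketch, the forwarding command can be assembled from a table of services and ports, which makes it easier to see which local port belongs to which service. The helper name `ssh_forward_command` and the `SERVICES` table are hypothetical; the host and port numbers come from the command above, and port 9090 is assumed to be the Prometheus web interface because that is its conventional default.

```python
import shlex

# Ports taken from the SSH command in this section. 9090 is assumed to be
# the Prometheus web interface (its conventional default port).
SERVICES = {
    "prometheus": 9090,
    "pushgateway": 9091,
    "alertmanager": 9093,
}


def ssh_forward_command(host, services=SERVICES):
    """Build an ssh argv forwarding each service port to the same local port."""
    argv = ["ssh"]
    for port in sorted(services.values()):
        argv += ["-L", f"{port}:localhost:{port}"]
    argv.append(host)
    return argv


if __name__ == "__main__":
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(ssh_forward_command("prometheus2.torproject.org")))
```

Keeping the ports in one table means adding another forwarded service is a one-line change.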
## Alerting
We currently do not do alerting for TPA services with Prometheus. We
do, however, have the Alertmanager set up to do alerting for other
teams on the secondary Prometheus server (`prometheus2`). This
documentation details how that works, but could also eventually cover
the main server, if it eventually replaces [Nagios](howto/nagios) for
alerting ([ticket 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview](https://prometheus.io/docs/alerting/latest/overview/)
but I have found it to be lacking at times. I have instead been
following [this tutorial](https://ashish.one/blogs/setup-alertmanager/)
which was quite helpful.
### Adding alerts
The Alertmanager is currently managed through Puppet, in
`profile::prometheus::server::external`. An alerting rule is defined
like:

    {
      'name' => 'bridgestrap',
      'rules' => [
        'alert' => 'Bridges down',
        'expr' => 'bridgestrap_fraction_functional < 0.50',
        'for' => '5m',
        'labels' =>
        {
          'severity' => 'critical',
          'team' => 'anti-censorship',
        },
        'annotations' =>
        {
          'title' => 'Bridges down',
          'description' => 'Too many bridges down',
          # use humanizePercentage when upgrading to prom > 2.11
          'summary' => 'Number of functional bridges is `{{$value}}%`',
          'host' => '{{$labels.instance}}',
        },
      ],
    },

The key part of the alert is the `expr` setting, which is a PromQL
expression that, when it holds for more than `5m` (the `for`
setting), will fire an alert at the Alertmanager. Also note the
`team` label, which will route the message to the right team. Those
routes are defined later, in the `routes` and `receivers` settings.

Note that those might move to separate files and/or Hiera later on.
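
To make the interplay between `expr` and `for` more concrete, here is a simplified model of how an alert moves from inactive to pending to firing. It is only a sketch: the function name `alert_states`, the one-minute evaluation interval, and the sample values are all assumptions made for illustration, and the real rule evaluation in Prometheus handles more cases than this.

```python
from datetime import timedelta


def alert_states(samples, threshold=0.50, hold=timedelta(minutes=5)):
    """Return the alert state at each evaluation.

    `samples` is a list of (elapsed, value) pairs, where `elapsed` is a
    timedelta since the first evaluation. The alert is "pending" as soon
    as the expression (value < threshold) is true, and "firing" once it
    has been true continuously for at least `hold`.
    """
    states = []
    active_since = None  # time at which the expression last became true
    for elapsed, value in samples:
        if value < threshold:
            if active_since is None:
                active_since = elapsed
            state = "firing" if elapsed - active_since >= hold else "pending"
        else:
            active_since = None  # a single false evaluation resets the timer
            state = "inactive"
        states.append(state)
    return states


if __name__ == "__main__":
    minute = timedelta(minutes=1)
    values = [0.80, 0.40, 0.45, 0.60, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30]
    samples = [(i * minute, v) for i, v in enumerate(values)]
    for (elapsed, value), state in zip(samples, alert_states(samples)):
        print(elapsed, value, state)
```

In this model a brief recovery (the `0.60` sample) resets the timer, which is why a short dip below the threshold does not page anyone.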
### Adding alert recipients
To add a new recipient for alerts, look for the `receivers` setting
and add something like this:

    receivers => [
      {
        'name' => 'anti-censorship team',
        'email_configs' => [
          'to' => 'anti-censorship-alerts@lists.torproject.org',
          # see above
          'require_tls' => false,
        ],
      },
      # [...]

Then alerts can be routed to that receiver by adding a "route" in the
`routes` setting. For example, this will route alerts with the
`team: anti-censorship` label:

    routes => [
      {
        'receiver' => 'anti-censorship team',
        'match' => {
          'team' => 'anti-censorship',
        },
      },
    ],

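
The matching logic behind such a route can be sketched as follows. This is a deliberately simplified model, not the Alertmanager implementation: it only handles exact `match` entries, picks the first route that matches, and falls back to a default receiver. The function name `pick_receiver` and the `fallback` receiver name are hypothetical.

```python
# Mirrors the route shown above: alerts labelled team=anti-censorship
# go to the 'anti-censorship team' receiver.
ROUTES = [
    {"receiver": "anti-censorship team", "match": {"team": "anti-censorship"}},
]


def pick_receiver(labels, routes=ROUTES, default="fallback"):
    """Return the receiver of the first route whose `match` labels all agree.

    `default` stands in for the top-level receiver; its name here is
    made up for the example.
    """
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return default


if __name__ == "__main__":
    print(pick_receiver({"team": "anti-censorship", "severity": "critical"}))
    print(pick_receiver({"team": "metrics"}))
```

Extra labels on the alert (such as `severity`) do not prevent a match; only the labels listed in `match` are compared.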
### Testing alerts
Normally, alerts should fire on the Prometheus server and be sent out
to the Alertmanager server, if the latter is correctly configured
(i.e. if it's configured in the `alerting` section of
`prometheus.yml`, see [Installation](#installation) below).
If you're not sure alerts are working, head to the web dashboard (see
[the access instructions](#web-dashboard-usage)) and look at the
`/alerts` and `/rules` pages. For example, if you're using port
forwarding:

 * <http://localhost:9090/alerts> - should show the configured alerts,
   and if they are firing
 * <http://localhost:9090/rules> - should show the configured rules,
   and whether they match
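
The same information is also exposed as JSON by the Prometheus HTTP API, which can be handy when the web pages are awkward to reach. The sketch below assumes the SSH port forwarding described earlier is active, so that Prometheus answers on `localhost:9090`; the function names are hypothetical, and the summary only reads the `alertname` and `instance` labels and the `state` field.

```python
import json
from urllib.request import urlopen


def fetch_alerts(base_url="http://localhost:9090"):
    """Fetch active alerts from the Prometheus HTTP API (/api/v1/alerts)."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=10) as response:
        return json.load(response)


def summarize_alerts(payload):
    """Return sorted (alertname, state, instance) tuples from an API response."""
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        (
            alert["labels"].get("alertname", "?"),
            alert.get("state", "?"),
            alert["labels"].get("instance", "?"),
        )
        for alert in alerts
    )
```

With the tunnel open, `summarize_alerts(fetch_alerts())` returns one tuple per pending or firing alert. An empty list means no alert is currently active, which is not the same as no rule being configured; the `/rules` page covers the latter.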
Typically, the <http://localhost:9093> URL should also be useful to
manage the Alertmanager, but in practice the Debian package does not
ship the web interface, so it is of limited use in that regard. See
the `amtool` section below for more information.

Note that the `/targets` URL is also useful to diagnose problems with
exporters, in general.
### Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to
use the [amtool](https://manpages.debian.org/amtool.1) command. A few
useful commands:

 * `amtool alert`: show firing alerts
 * `amtool silence add --duration=1h --author=anarcat
   --comment="working on it" ALERTNAME`: silence alert ALERTNAME for
   an hour, with some comments
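
When silences are created from a script, it is easy to get the shell quoting of the comment wrong. The following sketch builds the argument list for the `amtool silence add` command shown above, using only the flags that appear in that command. The function name `silence_command` and its defaults are hypothetical.

```python
import shlex


def silence_command(alertname, duration="1h", author="anarcat",
                    comment="working on it"):
    """Build the argv for `amtool silence add` with the flags shown above."""
    return [
        "amtool", "silence", "add",
        f"--duration={duration}",
        f"--author={author}",
        f"--comment={comment}",
        alertname,
    ]


if __name__ == "__main__":
    # shlex.join quotes arguments containing spaces, such as the comment.
    print(shlex.join(silence_command("ALERTNAME")))
```

Passing the list directly to `subprocess.run` avoids the quoting problem entirely, since no shell is involved in that case.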
## Pager playbook
TBD.
...
...
@@ -101,6 +230,8 @@ dashboard is not available, how to bypass authentication restrictions
on said dashboard, talk about the Alertmanager (lack of?) UI, the
Pushgateway UI, how to access them, `amtool`, rules debugging...

TODO: talk about `/targets`.

## Disaster recovery
If a Prometheus/Grafana is destroyed, it should be completely
...
...
@@ -257,13 +388,22 @@ changed.
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios](howto/nagios) ([ticket
29864](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864)).

It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
profile if it is deployed on more than one server.

TODO: document how to add stuff to the Alertmanager.

Note that Alertmanager only dispatches alerts, which are actually
generated on the Prometheus server side of things. Make sure the
following block exists in the `prometheus.yml` file:

    alerting:
      alert_relabel_configs: []
      alertmanagers:
      - static_configs:
        - targets:
          - localhost:9093

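
A quick way to check for that block is to load the file and list the Alertmanager targets it declares. The sketch below works on an already parsed configuration, that is, the nested mapping a YAML parser such as PyYAML's `yaml.safe_load` would return (PyYAML is an assumption here and is not part of the standard library). The function name `alertmanager_targets` is hypothetical, and only `static_configs` entries are considered.

```python
def alertmanager_targets(config):
    """Collect Alertmanager targets from a parsed prometheus.yml mapping.

    Only `static_configs` entries are inspected; other discovery
    mechanisms are ignored in this sketch.
    """
    targets = []
    alerting = config.get("alerting") or {}
    for manager in alerting.get("alertmanagers") or []:
        for static in manager.get("static_configs") or []:
            targets.extend(static.get("targets") or [])
    return targets


# The block shown above, as a YAML parser would return it.
EXAMPLE = {
    "alerting": {
        "alert_relabel_configs": [],
        "alertmanagers": [
            {"static_configs": [{"targets": ["localhost:9093"]}]},
        ],
    },
}

if __name__ == "__main__":
    print(alertmanager_targets(EXAMPLE))
```

An empty result means Prometheus has nowhere to send alerts, even if the rules themselves evaluate correctly.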
### Manual node configuration
...
...