Verified Commit e8741f85 authored by lelutin's avatar lelutin
Reorganize and rephrase rules + scrape jobs/targets

Currently rules are *not* defined in puppet. However, scrape jobs and
targets should be for all TPA-related services.
parent c0dda00c
Each alert should have a URL to a "runbook" in its annotations, typically a link
to this very wiki, in the "Pager playbook" section, which shows how to handle
any particular outage. If it's not present, it's a bug and can be filed as such.
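A rule's annotations might carry that runbook link alongside the usual summary. As a sketch, using the standard Prometheus annotation syntax (the annotation name and URL here are hypothetical placeholders, not this wiki's actual layout):

```yaml
annotations:
  summary: "service is unreachable on {{ $labels.instance }}"
  # hypothetical runbook URL; point it at the relevant wiki section
  runbook: "https://example.org/wiki/service#pager-playbook"
```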
### Configuring alert rules in Prometheus
Adding alerts is done in two parts:

* the alert rule definition, which matches on a PromQL expression
  * this is currently not done in Puppet, but in a git repository
* a scrape job that ensures we obtain the metrics we need
  * this also defines the scrape job's target(s)
  * some scrape jobs obtain their list of targets from a file; this is mostly
    done for non-TPA services
  * for any TPA-maintained service, it is best to use profiles to export
    scrape jobs
#### Scrape job + target

TPA-managed services should define their scrape jobs, and thus their targets,
via Puppet profiles.

To add a scrape job in a Puppet profile, use the `prometheus::scrape_job`
defined type, or one of the defined types that are convenience wrappers around
it. For example, the SSH scrape job is created in
`modules/profile/manifests/ssh.pp` with:
```puppet
@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
  job_name => 'blackbox_ssh_banner',
  targets  => [ "${facts['networking']['fqdn']}:22" ],
  labels   => {
    'alias' => $facts['networking']['fqdn'],
    'team'  => 'TPA',
  },
}
```
Scrape jobs for non-TPA services are defined in Hiera, under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here is an example of such
a scrape job definition:
```yaml
profile::prometheus::server::external::scrape_configs:
  # generic blackbox exporters from any team
  - job_name: blackbox
    metrics_path: "/probe"
    params:
      module:
        - http_2xx
    file_sd_configs:
      - files:
          - "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
```
Some scrape jobs can be simpler and skip the relabeling part. In the above
case, relabeling is needed because the exporter runs on the Prometheus server
itself rather than on the actual target.
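For comparison, a scrape job whose exporter listens on the target itself needs no relabeling. A hypothetical sketch (the job name and file pattern are made up for illustration):

```yaml
- job_name: some_exporter
  file_sd_configs:
    - files:
        - "/etc/prometheus-alerts/targets.d/some_exporter_*.yaml"
```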
Targets for scrape jobs defined in Hiera are, however, not managed by Puppet.
They are defined through files in the [prometheus-alerts][prometheus-alerts]
repository; see the section below for more details on how things are maintained
there. In the above example, targets are obtained from files on disk: the
[prometheus-alerts][prometheus-alerts] repository is cloned to
`/etc/prometheus-alerts` on the Prometheus servers.
[prometheus-alerts]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
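Those target files use Prometheus' standard `file_sd` format: a list of target groups, each with `targets` and optional `labels`. A hypothetical `targets.d/blackbox_example.yaml` could look like (file name, target and labels are illustrative, not actual entries from the repository):

```yaml
- targets:
    - "https://www.torproject.org/"
  labels:
    team: TPA
```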
Note: we currently have a handful of blackbox_exporter-related targets for TPA
services, namely for the `http*` checks. We intend to move those into Puppet
profiles whenever possible.
#### Alert rules and targets through Git

Both Prometheus servers regularly pull the
[prometheus-alerts][prometheus-alerts] repository for alert rule and target
definitions.

All alerting rules are currently defined in
[prometheus-alerts][prometheus-alerts], so they are _not_ directly managed by
Puppet, although Puppet does ensure that the most recent commit of the
repository is checked out on the Prometheus servers. Rule definitions should
appear in only one place, and this repository was already available for
configuring alerting rules.
Alert rules can be added by committing a file in the repository's
[rules.d](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d)
directory; see that directory for more documentation.
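Such a file uses the standard Prometheus alerting rule syntax. A minimal sketch (the group name, rule name, expression and threshold are made up for illustration):

```yaml
groups:
  - name: example
    rules:
      - alert: ExampleProbeFailed
        # probe_success is the standard blackbox_exporter success metric
        expr: probe_success == 0
        for: 10m
        labels:
          severity: warning
          team: TPA
        annotations:
          summary: "probe failing on {{ $labels.instance }}"
```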
TPA-managed services should define their scrape job through Puppet; see the
section above. Targets that are not managed by TPA are defined in
[prometheus-alerts][prometheus-alerts] under `targets.d/$exporter_*.yaml`.
After being merged, the changes should propagate within [4 to 6
hours](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/puppet/#cron-and-scheduling).
Prometheus does _not_ automatically reload those rules by itself, but Puppet
should handle reloading the service when the files change. TPA members can
speed this up by running Puppet manually on the Prometheus servers.
Note that all scrape jobs for non-TPA services are managed by TPA through
puppet. See the section above for details.
### Adding alert recipients