Verified Commit e8741f85 authored by lelutin's avatar lelutin
Reorganize and rephrase rules + scrape jobs/targets

Currently rules are *not* defined in puppet. However, scrape jobs and
targets should be for all TPA-related services.
parent c0dda00c
Each alert should have a URL to a "runbook" in its annotations, typically a link
to this very wiki, in the "Pager playbook" section, which shows how to handle
any particular outage. If it's not present, it's a bug and can be filed as such.
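A rule's annotations might carry that runbook link alongside the usual summary. As a sketch, using the standard Prometheus annotation syntax (the annotation name and URL here are hypothetical placeholders, not this wiki's actual layout):

```yaml
annotations:
  summary: "service is unreachable on {{ $labels.instance }}"
  # hypothetical runbook URL; point it at the relevant wiki section
  runbook: "https://example.org/wiki/service#pager-playbook"
```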
### Configuring alert rules in Prometheus
Adding alerts is done in two parts:

* the alert rule definition, which matches on a PromQL expression
  * this is currently not done in Puppet, but in a git repository
* a scrape job that ensures we obtain the metrics we need
  * this also defines the scrape job's target(s)
  * some scrape jobs obtain their list of targets from a file; this is mostly
    done for non-TPA services
  * for any TPA-maintained service, it is best to use profiles to export
    scrape jobs
#### Scrape job + target

TPA-managed services should define their scrape jobs, and thus their targets,
via Puppet profiles.

To add a scrape job in a Puppet profile, use the `prometheus::scrape_job`
defined type, or one of the defined types that are convenience wrappers around
it. For example, the SSH scrape job is created in
`modules/profile/manifests/ssh.pp` with:
```puppet
@@prometheus::scrape_job { "blackbox_ssh_banner_${facts['networking']['fqdn']}":
  job_name => 'blackbox_ssh_banner',
  targets  => [ "${facts['networking']['fqdn']}:22" ],
  labels   => {
    'alias' => $facts['networking']['fqdn'],
    'team'  => 'TPA',
  },
}
```
Scrape jobs for non-TPA services are defined in Hiera, under keys named
`scrape_configs` in `hiera/common/prometheus.yaml`. Here is an example of such
a scrape job definition:
```yaml
profile::prometheus::server::external::scrape_configs:
  # generic blackbox exporters from any team
  - job_name: blackbox
    metrics_path: "/probe"
    params:
      module:
        - http_2xx
    file_sd_configs:
      - files:
          - "/etc/prometheus-alerts/targets.d/blackbox_*.yaml"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
```
Some scrape jobs can be simpler and skip the relabeling part. In the above
case, relabeling is needed because the exporter runs on the Prometheus server
itself rather than on the actual target.
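For comparison, a scrape job whose exporter listens on the target itself needs no relabeling. A hypothetical sketch (the job name and file pattern are made up for illustration):

```yaml
- job_name: some_exporter
  file_sd_configs:
    - files:
        - "/etc/prometheus-alerts/targets.d/some_exporter_*.yaml"
```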
Targets for scrape jobs defined in Hiera are, however, not managed by Puppet.
They are defined through files in the [prometheus-alerts][prometheus-alerts]
repository; see the section below for more details on how things are maintained
there. In the above example, targets are obtained from files on disk: the
[prometheus-alerts][prometheus-alerts] repository is cloned to
`/etc/prometheus-alerts` on the Prometheus servers.
[prometheus-alerts]: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts
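Those target files use Prometheus' standard `file_sd` format: a list of target groups, each with `targets` and optional `labels`. A hypothetical `targets.d/blackbox_example.yaml` could look like (file name, target and labels are illustrative, not actual entries from the repository):

```yaml
- targets:
    - "https://www.torproject.org/"
  labels:
    team: TPA
```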
Note: we currently have a handful of blackbox_exporter-related targets for TPA
services, namely for the `http*` checks. We intend to move those into Puppet
profiles whenever possible.
#### Alert rules and targets through Git

Both Prometheus servers regularly pull the
[prometheus-alerts][prometheus-alerts] repository for alert rule and target
definitions.

All alerting rules are currently defined in
[prometheus-alerts][prometheus-alerts], so they are _not_ directly managed by
Puppet, although Puppet does ensure that the most recent commit of the
repository is checked out on the Prometheus servers. Rule definitions should
appear in only one place, and this repository was already available for
configuring alerting rules.
Alert rules can be added by committing a file in the repository's
[rules.d](https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/tree/main/rules.d)
directory; see that directory for more documentation.
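Such a file uses the standard Prometheus alerting rule syntax. A minimal sketch (the group name, rule name, expression and threshold are made up for illustration):

```yaml
groups:
  - name: example
    rules:
      - alert: ExampleProbeFailed
        # probe_success is the standard blackbox_exporter success metric
        expr: probe_success == 0
        for: 10m
        labels:
          severity: warning
          team: TPA
        annotations:
          summary: "probe failing on {{ $labels.instance }}"
```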
TPA-managed services should define their scrape job through Puppet; see the
section above. Targets that are not managed by TPA are defined in
[prometheus-alerts][prometheus-alerts] under `targets.d/$exporter_*.yaml`.
After being merged, the changes should propagate within [4 to 6
hours](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/puppet/#cron-and-scheduling).
Prometheus does _not_ automatically reload those rules by itself, but Puppet
should handle reloading the service when the files change. TPA members can
speed this up by running Puppet manually on the Prometheus servers.
Note that all scrape jobs for non-TPA services are managed by TPA through
puppet. See the section above for details.
### Adding alert recipients