document what we learned about metric relabeling (#41642), authored by anarcat
......@@ -1463,6 +1463,93 @@ IRC relay:
[default route errors]: #default-route-errors
## Metric relabeling
The [blackbox target documentation](#adding-a-blackbox-target) uses a technique called
"relabeling" to have the blackbox exporter actually provide useful
labels. This is done with the [`relabel_configs`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#relabel_configs) configuration,
which rewrites labels before the scrape is performed, so that the
blackbox exporter is scraped instead of the configured target, and
the configured target is passed to the exporter as a parameter.
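
As a rough illustration of that mechanism, the sketch below follows the pattern from the upstream blackbox exporter documentation. The job name, module, target URL, and exporter address (`localhost:9115`) are placeholders chosen for the example, not our actual configuration:

```yaml
- job_name: 'blackbox_example'  # hypothetical job name
  metrics_path: '/probe'
  params:
    module: ['http_2xx']  # assumes a module of that name exists in the exporter config
  static_configs:
    - targets:
        - 'https://example.com'  # placeholder target to probe
  relabel_configs:
    # pass the configured target to the exporter as the ?target= URL parameter
    - source_labels: ['__address__']
      target_label: '__param_target'
    # keep the probed target visible as the instance label
    - source_labels: ['__param_target']
      target_label: 'instance'
    # scrape the blackbox exporter itself instead of the target
    - target_label: '__address__'
      replacement: 'localhost:9115'  # placeholder exporter address
```

All three rules run before the scrape, which is why they can change where the scrape goes (`__address__`) and what is sent along with it (`__param_target`).
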
There are other uses for this. In the `bacula` job, for example, we
relabel the `alias` label so that it points at the host being backed
up instead of the host where backups are stored:
```yaml
- job_name: 'bacula'
  metric_relabel_configs:
    # the alias label is what's displayed in IRC summary lines. we want to
    # know which backup jobs failed, not which backup host contains the
    # failed jobs.
    - source_labels:
        - 'alias'
      target_label: 'backup_host'
    - source_labels:
        - 'bacula_job'
      target_label: 'alias'
```
The above takes the `alias` label (e.g. `bungei.torproject.org`) and
copies it to a new label, `backup_host`. It then takes the
`bacula_job` label and uses *that* as an `alias` label. This has the
effect of turning a metric like this:
```
bacula_job_last_execution_end_time{alias="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}
```
into that:
```
bacula_job_last_execution_end_time{alias="alberti.torproject.org",backup_host="bacula-director-01.torproject.org",bacula_job="alberti.torproject.org",instance="bacula-director-01.torproject.org:9133",job="bacula",team="TPA"}
```
This configuration differs from the blackbox exporter one because
`metric_relabel_configs` operates *after* the scrape, and therefore
affects labels coming out of the exporter (which plain
`relabel_configs` *can't* do).
This can be really tricky to get right. The equivalent change for the
Puppet reporter initially caused problems because it removed the
`alias` label from *all* `node` metrics. This was the incorrect
configuration:
```yaml
- job_name: 'node'
  metric_relabel_configs:
    - source_labels: ['host']
      target_label: 'alias'
      action: 'replace'
    - regex: '^host$'
      action: 'labeldrop'
```
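That destroyed the `alias` label because the first rule matches even
when the `host` label is absent: the default `regex` is `(.*)`, which
also matches an empty string, so `alias` was overwritten with an empty
value, and Prometheus removes labels that have an empty value. The fix
was to match *something* (anything!) in the `host` label, making sure
it was present, by setting the `regex` field: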
```yaml
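- job_name: 'node'
  metric_relabel_configs:
    - source_labels: ['host']
      target_label: 'alias'
      action: 'replace'
      # only match a non-empty host; the capture group is what the
      # default replacement ($1) copies into alias
      regex: '(.+)'
    - regex: '^host$'
      action: 'labeldrop'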
```
Those configurations were done to make it possible to inhibit alerts
based on common labels. Before those changes, the `alias` label was
not shared between, for example, the Puppet metrics and the regular
`node` exporter metrics, which made it impossible to suppress an alert
about a stale Puppet catalog when the host itself is down. See
[tpo/tpa/team#41642](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41642) for a full discussion on this.
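
To make the benefit of a shared label concrete, here is a minimal sketch of an Alertmanager inhibition rule keyed on `alias`. The alert names (`HostDown`, `PuppetCatalogStale`) are hypothetical and are not taken from our rule set, and the `source_matchers`/`target_matchers` syntax assumes a reasonably recent Alertmanager:

```yaml
inhibit_rules:
  # when a host is down (hypothetical alert name)...
  - source_matchers:
      - 'alertname = "HostDown"'
    # ...mute the stale catalog alert (hypothetical alert name)...
    target_matchers:
      - 'alertname = "PuppetCatalogStale"'
    # ...but only when both alerts carry the same alias value
    equal: ['alias']
```

A rule like this can only work if both alerts end up with the same `alias` value, which is what the relabeling above is meant to ensure.
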
The site [`relabeler.promlabs.com`](https://relabeler.promlabs.com/) can be extremely useful to
iterate more quickly over those configurations.
## Debugging the blackbox exporter
The [upstream documentation][] has some details that can help. We also
......
......