Normalize meaning of metrics labels in prometheus

We need to normalize the semantics of some of the labels that we attach to metrics. What we need to do:

  • document the agreed-upon semantics for labels (wiki-replica!61 (merged))
  • if at all possible, modify puppet resources to use the suggested semantics
    • on prometheus1
    • on prometheus2 -- coordinate with alert changes to avoid disrupting alerts for other teams
  • modify scrape jobs configured in hiera to change the concerned labels, in particular stop using "host" for the puppet exporter
  • change alerts so that they match the new labels (including removing the now-deprecated host and node labels)
  • change grafana dashboards to use the new labels instead (perhaps delegate to a separate issue)
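
As a starting point for the alert and dashboard cleanup, here is a rough sketch of a helper that flags expressions still selecting on the deprecated labels. It is only an illustration: the function name and the regex are made up for this example, it only looks at label matchers inside `{...}` (not `by (...)` or `on (...)` clauses), and it does not parse PromQL properly.

```python
import re

# Labels this issue proposes to deprecate.
DEPRECATED_LABELS = ("host", "node")

# Crude pattern (assumption for this sketch): a deprecated label name that
# directly follows "{" or "," and is followed by a PromQL matcher operator.
MATCHER_RE = re.compile(
    r'[{,]\s*(%s)\s*(=~|!~|!=|=)' % "|".join(DEPRECATED_LABELS)
)


def deprecated_labels_in(expr: str) -> set:
    """Return the deprecated label names used as matchers in expr."""
    return {match.group(1) for match in MATCHER_RE.finditer(expr)}


if __name__ == "__main__":
    for expr in (
        'up{job="puppet", host="idle-fsn-01.torproject.org"} == 0',
        'up{alias="idle-fsn-01.torproject.org"} == 0',
    ):
        print(expr, "->", sorted(deprecated_labels_in(expr)))
```

Labels such as backup_host are not flagged, since the pattern requires the label name to start right after `{` or `,`.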

This is what @anarcat suggested in #41642 (closed) :

We really need to formalize what all those labels mean, and come up with a global meaning for everything.

Right now, I think we have this:

| Label | syntax | normal example | blackbox example | note |
|---|---|---|---|---|
| instance | host:port | idle-fsn-01.torproject.org:9100 | http://idle-fsn-01.torproject.org? | |
| alias | host | idle-fsn-01.torproject.org | http://idle-fsn-01.torproject.org | |
| host | host | idle-fsn-01.torproject.org | N/A | used in some Grafana dashboard variables and puppet exporter |
| node | host or host:port | idle-fsn-01.torproject.org or ...:9100 | N/A | used in some Grafana dashboard variables |
| backup_host | host | bacula-director-01.torproject.org | N/A | used in bacula exporter |

I would propose we do this instead:

| Label | syntax | normal example | blackbox example | note |
|---|---|---|---|---|
| instance | host:port | idle-fsn-01.torproject.org:9100 | idle-fsn-01.torproject.org:80 | |
| alias | host | idle-fsn-01.torproject.org | idle-fsn-01.torproject.org | |
| host | N/A | N/A | N/A | deprecated |
| node | N/A | N/A | N/A | deprecated |
| target | full URL | N/A? | http://idle-fsn-01.torproject.org/ | new, generated at relabel_configs stage |
| exporter_instance | host:port | bacula-director-01.torproject.org:9133 | localhost:9115 | new, generated at relabel_configs stage |

That is:

  • remove the "scheme" (e.g. "HTTP") part from the URL passed to blackbox because it doesn't work anyway: if you tell the blackbox exporter to scrape http://example.com/ with an HTTPS probe, it will just fail
  • concretely, remove the URL from instance and alias
  • add the port number to the instance (probably in relabel_configs as well)
  • retire the host label, which is confusing because it's similar to alias, but not quite the same?
  • similarly deprecate the node variable (which is not an actual prometheus label I've seen anywhere, but that is used in some grafana dashboards)
  • uniformly have alias refer to the host's FQDN, regardless of where the metric comes from (e.g. even if it's backups scraped from bacula or puppet jobs scraped from pauli, we're talking about, say, idle-fsn-01.torproject.org here)
  • add a target label that has the actual, nice URL that we would expect for (say) http probes
  • add an exporter_instance label that's similar to what we currently use instance for, but that has the address of the exporter generating the metric, if any (currently, this is __address__ in relabel_configs, and I don't think it's accessible outside of that stage)

We could also add exporter_alias if we need to match on that.
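
To make the proposed semantics concrete, here is a minimal sketch of how the labels for a blackbox probe could be derived from the probed URL. This is a Python model of the mapping, not actual relabel_configs. The function name, the default-port table and the choice of the full URL as input are assumptions for illustration; `localhost:9115` is the blackbox exporter address from the table above.

```python
from urllib.parse import urlsplit

# Default port per URL scheme (assumption for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def blackbox_labels(url: str, exporter_address: str = "localhost:9115") -> dict:
    """Model the proposed label set for one blackbox probe target."""
    parts = urlsplit(url)
    host = parts.hostname
    # Use the explicit port if the URL has one, else the scheme's default.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return {
        "instance": f"{host}:{port}",           # host:port, no scheme
        "alias": host,                          # FQDN only
        "target": url,                          # the full, nice URL
        "exporter_instance": exporter_address,  # where the exporter runs
    }


if __name__ == "__main__":
    for name, value in blackbox_labels("http://idle-fsn-01.torproject.org/").items():
        print(f"{name}={value}")
```

For `http://idle-fsn-01.torproject.org/`, this yields the values in the "blackbox example" column of the proposed table.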

Note that the bacula exporter was previously marked (incorrectly) as using backup_host to point at the instance being backed up (e.g. idle-fsn-01), but it's actually set up "correctly" in the sense that backup_host is always set to bacula-director-01, and alias points at the instance (idle-fsn-01).

Edited by lelutin