Normalize meaning of metrics labels in prometheus
We need to normalize the semantics of some of the labels that we attach to metrics. What we need to do:
- document the agreed upon semantics for labels (wiki-replica!61 (merged))
- if at all possible, modify puppet resources to use the suggested semantics
  - on prometheus1
  - on prometheus2 -- coordinate with alert changes to avoid disruptions on alerts to other teams
- modify scrape jobs configured in hiera to change the concerned labels, in particular stop using "host" for the puppet exporter
- change alerts so that they match the new labels (including removing the now deprecated `host` and `node` labels)
- change grafana dashboards to use the new labels instead (perhaps delegate to a separate issue)
This is what @anarcat suggested in #41642 (closed):
We really need to formalize what all those labels mean, and come up with a global meaning for everything.
Right now, i think we have this:
| Label | syntax | normal example | blackbox example | note |
|-------|--------|----------------|------------------|------|
| `instance` | host:port | idle-fsn-01.torproject.org:9100 | http://idle-fsn-01.torproject.org | ? |
| `alias` | host | idle-fsn-01.torproject.org | http://idle-fsn-01.torproject.org | |
| `host` | host | idle-fsn-01.torproject.org | N/A | used in some Grafana dashboard variables and puppet exporter |
| `node` | host or host:port | idle-fsn-01.torproject.org or ...:9100 | N/A | used in some Grafana dashboard variables |
| `backup_host` | host | bacula-director-01.torproject.org | N/A | used in bacula exporter |

I would propose we do that instead:
| Label | syntax | normal example | blackbox example | note |
|-------|--------|----------------|------------------|------|
| `instance` | host:port | idle-fsn-01.torproject.org:9100 | idle-fsn-01.torproject.org:80 | |
| `alias` | host | idle-fsn-01.torproject.org | idle-fsn-01.torproject.org | |
| `host` | N/A | N/A | N/A | deprecated |
| `node` | N/A | N/A | N/A | deprecated |
| `target` | full URL | N/A? | http://idle-fsn-01.torproject.org/ | new, generated at `relabel_configs` stage |
| `exporter_instance` | host:port | bacula-director-01.torproject.org:9133 | localhost:9115 | new, generated at `relabel_configs` stage |

That is:
- remove the "scheme" (e.g. "HTTP") part from the URL passed to blackbox because, anyways, it doesn't work: if you tell the blackbox exporter to scrape http://example.com/ with an HTTPS probe, it will just fail
- concretely, remove the URL from `instance` and `alias`
- add the port number to the `instance` (probably in `relabel_configs` as well)
- retire the `host` label, which is confusing because it's similar to `alias`, but not quite the same?
- similarly deprecate the `node` variable (which is not an actual prometheus label i've seen anywhere, but that is used in some grafana dashboards)
- uniformly have `alias` refer to the host's FQDN, regardless of where the metric comes from (e.g. even if it's backups scraped from bacula or puppet jobs scraped from pauli, we're talking about, say, idle-fsn-01.torproject.org here)
- add a `target` label that has the actual, nice URL that we would expect for (say) http probes
- add an `exporter_instance` label that's similar to what we use `instance` for presently, but that has the address of the exporter generating the metric, if any (currently, this is `__address__` in `relabel_configs`, and i don't think it's accessible out there)

We could also add `exporter_alias` as well if we need to match on that.

Note that the bacula exporter was previously marked (incorrectly) as using `backup_host` to point at the instance being backed up (e.g. `idle-fsn-01`), but it's actually set up "correctly" in the sense that `backup_host` is always set to `bacula-director-01`, and `alias` points at the instance (`idle-fsn-01`).
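
To make the proposed mapping concrete, here is a minimal Python sketch of how the new label set could be derived for a blackbox probe and for a directly scraped exporter. It is only an illustration of the semantics in the second table, not the actual implementation: in practice this would presumably be done in `relabel_configs`. The function names `blackbox_labels` and `exporter_labels` are hypothetical, the default ports (80 for http, 443 for https) are an assumption for URLs without an explicit port, and `localhost:9115` is simply the blackbox exporter address used in the table.

```python
from urllib.parse import urlsplit

# Assumption: default port per URL scheme when the probe URL has no explicit port.
DEFAULT_PORTS = {"http": 80, "https": 443}


def blackbox_labels(probe_url: str, exporter_address: str = "localhost:9115") -> dict:
    """Derive the proposed labels for one blackbox probe (hypothetical helper)."""
    parts = urlsplit(probe_url)
    host = parts.hostname
    if host is None:
        raise ValueError(f"not a full URL: {probe_url!r}")
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if port is None:
        raise ValueError(f"no known default port for scheme {parts.scheme!r}")
    return {
        "instance": f"{host}:{port}",           # host:port, no scheme
        "alias": host,                          # FQDN only
        "target": probe_url,                    # the full, "nice" URL
        "exporter_instance": exporter_address,  # where the exporter itself runs
    }


def exporter_labels(address: str) -> dict:
    """Derive instance and alias for an exporter scraped at host:port (hypothetical helper)."""
    host, sep, _port = address.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {address!r}")
    return {"instance": address, "alias": host}


if __name__ == "__main__":
    print(blackbox_labels("http://idle-fsn-01.torproject.org/"))
    print(exporter_labels("idle-fsn-01.torproject.org:9100"))
```

With these assumptions, both the blackbox probe and the node exporter scrape end up with the same `alias` (`idle-fsn-01.torproject.org`), which is the point of the proposal: dashboards and alerts can match on `alias` regardless of where the metric comes from.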
Edited by lelutin