fix pint warnings on prod server
I have enabled new checks in CI here, and it seems a bunch of rules are failing some of those checks. This job has an example:
https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/jobs/191508
... which is copied below:
rules.d/metrics.rules:5: metric "metrics_log_warnings" with label {alias="CollecTor"} is only sometimes present on prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org with average life span of 14h47m (promql/series)
5 | expr: metrics_log_warnings{alias="CollecTor", level="ERROR"} > 1
rules.d/metrics.rules:5: prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org has "metrics_log_warnings" metric with "level" label but there are no series matching {level="ERROR"} in the last 1w (promql/series)
5 | expr: metrics_log_warnings{alias="CollecTor", level="ERROR"} > 1
rules.d/metrics.rules:16: metric "metrics_log_warnings" with label {alias="OnionooService"} is only sometimes present on prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org with average life span of 1d13h12m30s (promql/series)
16 | expr: metrics_log_warnings{alias="OnionooService", level="ERROR"} > 1
rules.d/metrics.rules:16: prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org has "metrics_log_warnings" metric with "level" label but there are no series matching {level="ERROR"} in the last 1w (promql/series)
16 | expr: metrics_log_warnings{alias="OnionooService", level="ERROR"} > 1
rules.d/metrics.rules:27: prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org has "exits_list_last_updated_in_minutes" metric with "status" label but there are no series matching {status="UNSTABLE"} in the last 1w (promql/series)
27 | expr: exits_list_last_updated_in_minutes{status="UNSTABLE"} > 0
rules.d/metrics.rules:60: prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org has "onionperf_analysis_logs_staleness" metric with "status" label but there are no series matching {status="UNSTABLE"} in the last 1w (promql/series)
60 | expr: onionperf_analysis_logs_staleness{status="UNSTABLE"} > 0
rules.d/rdsys.rules:5: metric "gettor_request_total" is only sometimes present on prometheus "prod" at https://ci:[MASKED]@prometheus2.torproject.org with average life span of 1d3h37m30s in the last 1w (promql/series)
5 | expr: sum(increase(gettor_request_total[1h])) == 0
basically, it looks like metrics like (say) metrics_log_warnings don't explicitly emit the requested label value (presumably when it's zero). i haven't checked, but i bet that, for example, that metric looks like this in the exporter output:
metrics_log_warnings{alias="CollecTor", level="WARNING"} 10
metrics_log_warnings{alias="CollecTor", level="INFO"} 100
and that, since there were no ERROR messages, no such line is emitted at all. for some reason, this makes pint unhappy... i'm not exactly sure why, but i suspect it's to catch incorrect labels or metric names in rules. for example, if you wrote an alerting rule with metrics_log_warnings{alias="CollecTor", level="ERREUR"} because, i don't know, you're french or something, pint would catch that and tell you that label value doesn't exist. (it won't tell you the right one is ERROR, of course, can't have nice things.)
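as an aside, if you want to double-check which label values a metric actually exposes on the server, the prometheus series API will tell you. a minimal sketch in python, assuming you have working credentials for the instance; the URL and matcher are just taken from the job output above, and the credentials are placeholders:

import requests

# list all series for the metric, to see which "level"
# values are actually present on the server
resp = requests.get(
    "https://prometheus2.torproject.org/api/v1/series",
    params={"match[]": 'metrics_log_warnings{alias="CollecTor"}'},
    auth=("ci", "REDACTED"),  # placeholder credentials
)
resp.raise_for_status()
for series in resp.json()["data"]:
    print(series)

that should print one label set per series, with ERROR conspicuously absent.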
so i think it makes sense to have that check, and i'd encourage you (i think both @hiro and @meskio have things to fix here) to look into this and improve the exporters so that those labels are emitted correctly even when the value is zero.
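for exporters written with the official python client, for instance, it's usually enough to touch every label combination once at startup: calling .labels() instantiates the child series at zero, so it shows up in the scrape output before anything happens. a minimal sketch (the metric type, label values, port and alias are assumptions on my part, i haven't looked at the actual exporter code):

from prometheus_client import Gauge, start_http_server

# assumed label values; adjust to whatever the exporter really uses
LEVELS = ("INFO", "WARNING", "ERROR")

log_warnings = Gauge(
    "metrics_log_warnings",
    "number of log messages seen, by alias and level",
    ["alias", "level"],
)

def init_series(alias):
    # creating the child is enough: it gets exported with value 0
    for level in LEVELS:
        log_warnings.labels(alias=alias, level=level)

init_series("CollecTor")
start_http_server(9100)  # arbitrary port for the example

the same trick works in the go client (client_golang) by calling WithLabelValues() once for each expected label combination, which is probably what rdsys would need for gettor_request_total.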
for now, the job is marked as allow_failure, but i'd like to enforce that check eventually, so we'd need to have it go green first.
assigning to @hiro first, since there's more metrics-team stuff than anti-censorship stuff in there.