Add a label to the tor_bug_reached metric and rename it to avoid Grafana warnings
Summary
We would like to be able to ignore specific high-frequency tor bug events in our Prometheus alerting. This is currently not possible, because the tor_bug_reached metric does not say which bug was reached. To make it possible, the metric would need to carry that information as a label.
This metric was implemented in #40839 (closed).
What is the expected behavior?
Given this example:
zgrep 'tor_bug_occurred_()' syslog.*|cut -d" " -f4-|sort|uniq -c
1112 tor_bug_occurred_(): Bug: ../src/core/or/conflux.c:567: conflux_pick_first_leg: Non-fatal assertion !(smartlist_len(cfx->legs) <= 0) failed.
4 tor_bug_occurred_(): Bug: ../src/core/or/relay.c:2338: connection_edge_package_raw_inbuf: Non-fatal assertion !(conn->base_.marked_for_close) failed.
the metric could look like this:
tor_bug_reached_count{function="conflux_pick_first_leg"} 1112
tor_bug_reached_count{function="connection_edge_package_raw_inbuf"} 4
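The mapping from log lines to labeled samples can be sketched in Python. This is only an illustration of the requested output, derived by post-processing the log format shown above; it is not how tor would implement the metric. The names `BUG_RE` and `bug_metric_lines` are hypothetical, and the regular expression assumes every bug line follows the `file:line: function:` layout of the two examples.

```python
import re
from collections import Counter

# Assumption: bug lines look like
#   tor_bug_occurred_(): Bug: <file>:<line>: <function>: <message>
BUG_RE = re.compile(r"tor_bug_occurred_\(\): Bug: \S+:\d+: (\w+): ")

def bug_metric_lines(log_lines, metric="tor_bug_reached_count"):
    """Count bug events per function and render one labeled sample per function."""
    counts = Counter()
    for line in log_lines:
        match = BUG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    # Sort by function name so the output order is stable.
    return [f'{metric}{{function="{fn}"}} {n}' for fn, n in sorted(counts.items())]

sample = [
    "tor_bug_occurred_(): Bug: ../src/core/or/conflux.c:567: "
    "conflux_pick_first_leg: Non-fatal assertion "
    "!(smartlist_len(cfx->legs) <= 0) failed.",
    "tor_bug_occurred_(): Bug: ../src/core/or/relay.c:2338: "
    "connection_edge_package_raw_inbuf: Non-fatal assertion "
    "!(conn->base_.marked_for_close) failed.",
]

for sample_line in bug_metric_lines(sample):
    print(sample_line)
```

With a per-function label like this, an alerting rule could exclude a single noisy function by its label value while still firing on every other bug.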
In addition, it would be best to rename this metric to tor_bug_reached_count; otherwise Grafana displays a warning sign saying:
PromQL info: metric might not be a counter, name does not end in _total/_sum/_count
Renaming this metric seems fine since there is no tor release with this metric yet.
Assumption about cardinality
Adding high-cardinality information to labels is not considered best practice:
https://prometheus.io/docs/practices/naming/#labels
CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.
However, we expect the number of distinct bug events never to exceed 100 in any given month, so this should be fine.
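One way to check this assumption against existing logs is to count the distinct label values they would produce. The sketch below is hypothetical (the names `FUNC_RE`, `MAX_SERIES`, `distinct_bug_functions` and `within_cardinality_budget` are not part of tor) and assumes the same log line layout as the example above. Note that it counts distinct functions, which can only be lower than or equal to the number of distinct assertion sites.

```python
import re

# Assumption: the function name follows "<file>:<line>: " in each bug line.
FUNC_RE = re.compile(r"Bug: \S+:\d+: (\w+): ")
MAX_SERIES = 100  # the bound assumed in the text, not a Prometheus limit

def distinct_bug_functions(log_lines):
    """Return the set of function names that would become label values."""
    return {m.group(1) for line in log_lines if (m := FUNC_RE.search(line))}

def within_cardinality_budget(log_lines, limit=MAX_SERIES):
    """True if the number of distinct label values stays within the limit."""
    return len(distinct_bug_functions(log_lines)) <= limit

sample = [
    "tor_bug_occurred_(): Bug: ../src/core/or/conflux.c:567: "
    "conflux_pick_first_leg: Non-fatal assertion failed.",
    "tor_bug_occurred_(): Bug: ../src/core/or/relay.c:2338: "
    "connection_edge_package_raw_inbuf: Non-fatal assertion failed.",
    "tor_bug_occurred_(): Bug: ../src/core/or/conflux.c:567: "
    "conflux_pick_first_leg: Non-fatal assertion failed.",
]

print(len(distinct_bug_functions(sample)), within_cardinality_budget(sample))
```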