Some LD_BUG logs don't come with any warn/err logs, leading to silent metrics port bursts of 2000+ bug counts
I've been working with @toralf who reported surprising counts on the METRICS_NAME(bug_reached_count)
metrics port counter:
<toralf> and I observed 2000 increments within 3 min in the past [...]
They were a mystery because the warn-level logs remained empty.
I looked into it some more, and it looks like this counter increments every time there is a log entry of domain LD_BUG.
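To make the mechanism concrete, here is a minimal toy model of what I think is happening. It is not the actual tor logging code: `toy_log`, the `SEV_*` and `DOM_*` names, and both counter variables are made up for illustration. The only assumption taken from above is that the counter is bumped on the log domain alone, before any severity filtering.

```c
#include <stdio.h>

/* Illustrative severities, ordered syslog-style: lower value = more severe. */
enum { SEV_ERR = 3, SEV_WARN = 4, SEV_NOTICE = 5, SEV_INFO = 6, SEV_DEBUG = 7 };

/* Hypothetical domain bits standing in for LD_BUG and LD_CIRC. */
#define DOM_BUG  (1u << 0)
#define DOM_CIRC (1u << 1)

static unsigned long bug_reached_count = 0; /* stand-in for the metrics counter */
static unsigned long lines_emitted = 0;     /* lines that reached the log */
static int min_severity = SEV_WARN;         /* log configured at warn level */

/* Toy logger: counts by domain alone, then filters output by severity. */
static void toy_log(int severity, unsigned domain, const char *msg)
{
  if (domain & DOM_BUG)
    bug_reached_count++;  /* assumed behavior: severity is not consulted */
  if (severity > min_severity)
    return;               /* less severe than the configured level: dropped */
  lines_emitted++;
  fprintf(stderr, "%s\n", msg);
}
```

In this model, 2000 calls like `toy_log(SEV_INFO, DOM_BUG, "...")` push the counter to 2000 while the warn-level log stays empty, which matches the symptom.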
For the most part we only log with LD_BUG at severity warn or err, but there are some exceptions, e.g.
- conflux_log_set() uses LD_BUG, but here it is called with LOG_PROTOCOL_WARN, which maps to LOG_INFO by default:
log_fn(LOG_PROTOCOL_WARN, LD_CIRC,
"Conflux set has too many legs to link. "
"Rejecting this circuit.");
conflux_log_set(LOG_PROTOCOL_WARN, unlinked->cfx, unlinked->is_client);
- In or/circuitstats.c we have some lines of the form
log_debug(LD_BUG,
- On the client side,
if (endreason != END_STREAM_REASON_RESOLVEFAILED) {
log_info(LD_BUG,
"No origin circuit for successful SOCKS stream %"PRIu64
So... there appear to be quite a few edge cases like this.
I guess the hope was that we have an invariant where you only use LD_BUG when you are also (loudly) logging details about a real bug?
The simple, robust fix would be to no longer increment the counter for LD_BUG.
The more pervasive and more brittle fix would be to audit all uses of LD_BUG and make sure we keep to our invariant.
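For the audit option, the invariant we would be checking each call site against can be written down as a small predicate. This is only a sketch of the rule as I understand it: `bug_domain_is_loud`, `SEV_*`, and `DOM_*` are hypothetical names, not anything that exists in the tree.

```c
#include <stdbool.h>

/* Illustrative severities, ordered syslog-style: lower value = more severe. */
enum { SEV_ERR = 3, SEV_WARN = 4, SEV_NOTICE = 5, SEV_INFO = 6, SEV_DEBUG = 7 };

/* Hypothetical domain bits standing in for LD_BUG and LD_CIRC. */
#define DOM_BUG  (1u << 0)
#define DOM_CIRC (1u << 1)

/* Returns true when a (severity, domain) pair keeps to the invariant:
 * anything in the bug domain must be logged at warn or more severe.
 * Messages outside the bug domain are not constrained. */
static bool bug_domain_is_loud(int severity, unsigned domain)
{
  if (!(domain & DOM_BUG))
    return true;
  return severity <= SEV_WARN;
}
```

Under this rule the three examples above all fail the check: the conflux_log_set() call at its default info level, the log_debug(LD_BUG, ...) lines, and the log_info(LD_BUG, ...) on the client side.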
Calling @dgoulet's name since maybe he can help us with what the original intent was for this metrics port counter.