Directory authorities might store wrong descriptor in relay list
This morning we had a relay operator wondering why their healthy guard relay with a consensus weight of 30000 suddenly dropped to a consensus weight of 20, in particular as the sibling relay on the same machine was still behaving fine.
It turned out the relay, nognu, got just two measurements and barely made it into the consensus, so the consensus weight = 20 fallback kicked in. Now, why exactly did it not make it into, e.g., moria1's vote then? For some reason moria1 did not get the latest two descriptors directly but from different directory authorities:
published 2022-10-11 17:06:38
@downloaded-at 2022-10-11 17:50:44
@source "154.35.175.225"
published 2022-10-11 18:36:51
@downloaded-at 2022-10-11 18:50:03
@source "199.58.81.140"
That's not too bad. However, moria1 thought neither of those was the latest descriptor; rather, it was the one it fetched from dizum:
published 2022-10-10 23:06:35
@downloaded-at 2022-10-11 18:50:05
@source "45.66.33.45"
It seems it arrived two seconds after moria1 got the latest descriptor, suddenly making this old descriptor the latest. And given that it had already expired more than ROUTER_MAX_AGE_TO_PUBLISH ago, that descriptor got discarded, and now moria1 (and, it seems, a bunch of other directory authorities) thinks that the relay does not exist, i.e. they don't include it in their vote.
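To make the mismatch concrete, here is a minimal sketch in C that contrasts two selection rules over the three descriptors quoted above: "newest by published time" versus "whichever arrived last". The `desc_record_t` type and both helper functions are hypothetical names made up for illustration; they are not tor's data structures or API, and the sketch does not claim that tor literally selects by arrival order.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record for illustration only; this is NOT tor's routerinfo_t. */
typedef struct {
  const char *published;      /* "published" line of the descriptor */
  const char *downloaded_at;  /* "@downloaded-at" annotation */
  const char *source;         /* "@source" annotation */
} desc_record_t;

/* The three nognu descriptors quoted in this ticket, as seen by moria1. */
const desc_record_t nognu_descs[] = {
  { "2022-10-11 17:06:38", "2022-10-11 17:50:44", "154.35.175.225" },
  { "2022-10-11 18:36:51", "2022-10-11 18:50:03", "199.58.81.140" },
  { "2022-10-10 23:06:35", "2022-10-11 18:50:05", "45.66.33.45" },
};

/* Fixed-width "YYYY-MM-DD HH:MM:SS" strings sort chronologically
 * under plain strcmp(), so no time parsing is needed here. */
static int ts_cmp(const char *a, const char *b)
{
  return strcmp(a, b);
}

/* Return the index of the record with the newest "published" time. */
size_t pick_newest_published(const desc_record_t *d, size_t n)
{
  size_t best = 0;
  for (size_t i = 1; i < n; i++) {
    if (ts_cmp(d[i].published, d[best].published) > 0)
      best = i;
  }
  return best;
}

/* Return the index of the record that was downloaded last. */
size_t pick_last_arrival(const desc_record_t *d, size_t n)
{
  size_t best = 0;
  for (size_t i = 1; i < n; i++) {
    if (ts_cmp(d[i].downloaded_at, d[best].downloaded_at) > 0)
      best = i;
  }
  return best;
}
```

With the three records above, `pick_newest_published(nognu_descs, 3)` returns index 1 (published 18:36:51), while `pick_last_arrival(nognu_descs, 3)` returns index 2, the descriptor published on 2022-10-10 that came from 45.66.33.45.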
@arma did a bunch of debugging here, so I'll take the liberty of pasting the IRC debug logs into this ticket so I don't lose important details:
09:52 <+arma1> and then issue two is: there seems to be a bug where tor dir auths
store the wrong new descriptor in the relay list
09:53 <+arma1> they should take the new descriptor and compare published-by and
take the newest
09:53 <+arma1> though now that i think about it, i think there is some logic to
try to consense upon the most popular one
09:53 <+arma1> so if a relay publishes a new one every minute and there are
thousands of descriptors for it, the dir auths don't all vote
about a different one
09:57 <+arma1> yes, here is that logic: see dirserv_add_descriptor() in
feature/dirauth/process_descs.c,
09:57 <+arma1> /* Check whether this descriptor is semantically identical to
the last one
09:57 <+arma1> * from this server. (We do this here and not in
router_add_to_routerlist
09:57 <+arma1> * because we want to be able to accept the newest router
descriptor that
09:57 <+arma1> * another authority has, so we all converge on the same one.) */
09:59 <+arma1> we hit a race where in one voting period, we got two votes about
nognu each naming a different descriptor
10:02 <+arma1> Oct 11 14:50:03.773 [notice] longclaw posted a vote to me from
199.58.81.140.
10:02 <+arma1> Oct 11 14:50:07.099 [notice] dizum posted a vote to me from
45.66.33.45.
10:04 <+arma1> i wonder if the downloaded-at's are utc or local time
10:04 <+arma1> looks like they are utc
10:10 <+arma1> how did i receive the nognu descriptor from dizum at 18:50:03 if i
received dizum's vote at 14:50:07.099? that is weird.
10:10 <+arma1> erm, i mean 18:50:05. how did i receive it from dizum at 18:50:05
if i got the vote 2 seconds after that.
10:14 <+arma1> ok, i think i know where the bug happened,
10:14 <+arma1> in router_add_to_routerlist(),
10:14 <+arma1> if (old_router) {
10:14 <+arma1> if (!in_consensus && (router->cache_info.published_on <=
10:14 <+arma1> old_router->cache_info.published_on)) {
10:14 <+arma1> } else {
10:14 <+arma1> /* Same key, and either new, or listed in the consensus. */
10:14 <+arma1> log_debug(LD_DIR, "Replacing entry for router %s",
10:14 <+arma1> router_describe(router));
10:15 <+arma1> i bet that earlier nognu descriptor was the one listed in the
consensus at 18:00 utc on that day
10:15 <+arma1> so when i heard it from dizum, i fetched a copy, and put it as my
primary descriptor for nognu because that's what other people were
voting about at the time
10:20 <+arma1> https://gitlab.torproject.org/tpo/core/tor/-/issues/543 is the
original bug
10:20 -zwiebelbot:#tor-relays- tor:tpo/core/tor#543: 0.2.0.9-alpha servers don't
update enough dir info - https://bugs.torproject.org/tpo/core/tor/543
10:21 <+arma1> as added in commit acaa9a7f696
10:24 <+arma1> we do have code to rescue old descriptors if we see them listed in
the consensus,
10:24 <+arma1> log_info(LD_DIR, "%d router descriptors listed in consensus
are "
10:24 <+arma1> "currently in old_routers; making them current.",
10:24 <+arma1> smartlist_len(no_longer_old));
10:24 <+arma1> but we seem to have wrapped that code in if
(!authdir_mode_v3(options)
10:24 <+arma1> i.e. everybody does the rescuing except v3 dir auths
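The condition quoted from router_add_to_routerlist() in the log above can be modeled as a small predicate. This is a simplified sketch under the assumption that only the two branches shown in the paste matter; `would_replace_existing()` is a hypothetical helper written for this ticket, not a function in tor, and the timestamps in the usage note are arbitrary.

```c
#include <stdbool.h>
#include <time.h>

/* Simplified model of the quoted condition (hypothetical helper, not tor code).
 * Returns true when the incoming descriptor would take the
 * "Replacing entry for router" branch. */
bool would_replace_existing(bool in_consensus,
                            time_t new_published,
                            time_t old_published)
{
  if (!in_consensus && new_published <= old_published) {
    /* Not listed in the consensus and not newer than what we hold:
     * the replace branch is not taken. */
    return false;
  }
  /* Same key, and either newer, or listed in the consensus. */
  return true;
}
```

Under this model, `would_replace_existing(true, 100, 200)` is true: a descriptor that is older than the stored one still replaces it as long as it is listed in the consensus, which matches the theory in the log that the 2022-10-10 descriptor displaced the newer one. Without the consensus listing, `would_replace_existing(false, 100, 200)` is false.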
Tagging @nickm as I heard he might be our best bet here. :)