Tor relays publish a new descriptor, the authorities drop it as only cosmetically different, and the relay then waits 18 more hours to republish, falling out of the consensus
We have a design flaw, or at least an impedance mismatch, in our descriptor publishing algorithm.
Relays publish a new descriptor when they think something has sufficiently changed (e.g. bandwidth, IP address, or exit policy) or when 18 hours have passed.
Directory authorities accept the new descriptor when they think it has sufficiently changed. If they think it hasn't, they quietly drop it:
    log_info(LD_DIRSERV,
             "Not replacing descriptor from %s (source: %s); "
             "differences are cosmetic.",
             router_describe(ri), source);
The trouble comes when things get out of sync: the relay thinks it published recently, so it is still early in its 18-hour timer, but the authorities discarded that descriptor. Then, when the descriptor the authorities still consider "current" becomes 24 hours old, it gets discarded too, and the relay falls out of the consensus.
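To make the timing concrete, here is a minimal sketch of that scenario. It is a toy model, not Tor code: the 18-hour and 24-hour figures come from the description above, while the function name and the simplifications (no other republish triggers, no consensus-based early republish, no hourly consensus granularity) are my assumptions.

```python
REPUBLISH_INTERVAL_H = 18  # relay-side forced republish timer (from the text)
DESCRIPTOR_MAX_AGE_H = 24  # age at which authorities discard a descriptor (from the text)


def unlisted_window(accepted_at_h, dropped_at_h):
    """Return (start, end) hours during which the relay has no valid
    descriptor at the authorities, or None if there is no such window.

    accepted_at_h: hour at which the authorities last accepted a descriptor.
    dropped_at_h:  hour at which the relay published a descriptor that the
                   authorities dropped as cosmetic. The relay does not know
                   it was dropped, so its own timer restarts here.

    Hypothetical helper: it ignores every other republish trigger.
    """
    expires_at = accepted_at_h + DESCRIPTOR_MAX_AGE_H
    next_publish_at = dropped_at_h + REPUBLISH_INTERVAL_H
    if next_publish_at <= expires_at:
        return None  # the timer fires before the old descriptor expires
    return (expires_at, next_publish_at)


if __name__ == "__main__":
    # Accepted at hour 0, cosmetic upload dropped at hour 10:
    # the old descriptor expires at hour 24, the next upload is at hour 28.
    print(unlisted_window(0, 10))
```

In this model a gap opens whenever the dropped upload happens more than 6 hours (24 minus 18) after the last accepted one.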
I don't have stats on how frequently this desynchronization actually happens, but it's enough to have tickets filed about it (#23638), and enough to produce confused/sad posts from relay operators every month:
https://lists.torproject.org/pipermail/tor-dev/2018-March/013030.html
https://lists.torproject.org/pipermail/tor-relays/2018-March/014764.html
We deployed a bandaid in 0.2.3.4-alpha (commit 1f4b694, #3327) that makes relays look in the consensus and publish a new descriptor more aggressively if they find they're not listed. That hack is apparently needed quite often: in #21642 I said "So 426 of our ~7300 relays stayed in the consensus in the last 12.5 hours because of this hack."
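Roughly, the bandaid's decision could be sketched like this. This is an illustration of the idea only, not the code from that commit: the function name, the parameter names, and the 1.5-hour fast-retry default are hypothetical. Only the 18-hour interval and the "version listed in consensus is quite old" reason string appear in this ticket; the other reason strings are illustrative labels.

```python
def early_republish_reason(listed_age_h, hours_since_publish,
                           slow_interval_h=18.0, fast_retry_h=1.5):
    """Return a reason string if the relay should publish now, else None.

    listed_age_h: age in hours of the descriptor that the consensus lists
                  for this relay, or None if the relay is not listed.
    hours_since_publish: hours since the relay last uploaded a descriptor.
    fast_retry_h: hypothetical minimum spacing between aggressive retries.
    """
    # Regular timer: always republish once the slow interval has passed.
    if hours_since_publish >= slow_interval_h:
        return "regular republish timer"

    # Bandaid: check what the consensus says about us.
    reason = None
    if listed_age_h is None:
        reason = "not listed in consensus"
    elif listed_age_h > slow_interval_h:
        reason = "version listed in consensus is quite old"

    # Retry aggressively, but not more often than fast_retry_h.
    if reason is not None and hours_since_publish >= fast_retry_h:
        return reason
    return None
```

For example, if the consensus lists a descriptor published 20 hours ago and the relay last uploaded 10 hours ago, this sketch asks for an early republish instead of waiting out the remaining 8 hours.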
But I think we haven't actually explored whether the bandaid helps all of the relays stay in the consensus all of the time, or whether there are still "holes" in it that mean some relays fall out sometimes. The reports above make me think that yes, there are still holes.
Potential ways forward:
Match up the descriptor upload timings, as seen by a dir auth, with the appearance of relays in the consensus. See how many of the relays publishing for reason "version listed in consensus is quite old" are missing any hours in the consensus.
If there are some that fall out of the consensus entirely, think about ways to make the republish more aggressive and earlier, or if it is already more aggressive and earlier, figure out why it isn't sticking.
Think about ways to make our relay-side decisions about "is it different enough" synchronize better with our dirauth-side decisions. Now that we're doing hourly consensus documents, can the dir auths be more lenient about accepting similar-ish descriptors, because there's only one "winner" of a descriptor each hour? This poor synchronization is part of why we couldn't implement proposal 275 when we wanted to.
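The first item above (matching upload timings against consensus appearance) could start from something like the sketch below. The input formats are hypothetical: I'm assuming we can get a list of (fingerprint, hour, reason) upload records and a per-relay set of hours in which the relay was listed. How to attribute a publish reason to each upload is left open here.

```python
QUITE_OLD = "version listed in consensus is quite old"


def missing_hours(first_hour, last_hour, listed_hours):
    """Hours in [first_hour, last_hour] in which the relay was not listed."""
    return [h for h in range(first_hour, last_hour + 1)
            if h not in listed_hours]


def relays_with_holes(uploads, presence, first_hour, last_hour,
                      reason=QUITE_OLD):
    """Find relays that published for `reason` yet still missed consensuses.

    uploads:  iterable of (fingerprint, hour, reason) upload records
              (hypothetical format).
    presence: dict mapping fingerprint -> set of hours in which the relay
              appeared in the consensus (hypothetical format).
    Returns a dict mapping fingerprint -> list of missing hours.
    """
    flagged = {fp for fp, _hour, why in uploads if why == reason}
    holes = {}
    for fp in sorted(flagged):
        gaps = missing_hours(first_hour, last_hour, presence.get(fp, set()))
        if gaps:
            holes[fp] = gaps
    return holes


if __name__ == "__main__":
    uploads = [("A", 3, QUITE_OLD), ("B", 4, "bandwidth changed"),
               ("C", 5, QUITE_OLD)]
    presence = {"A": {0, 1, 2, 3, 4, 5}, "B": {0, 1}, "C": {0, 1, 2, 5}}
    print(relays_with_holes(uploads, presence, 0, 5))
```

Any relay that shows up in the result is one where the bandaid fired but did not fully prevent a gap, which is the "holes" case this ticket is asking about.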