Mystery bug causes relays to attempt many, many descriptor publishes, with no X-Desc-Gen-Reason header
Sometimes relays get into a state where they try to publish a new descriptor every second, and this state lasts for hours or days.
I instrumented moria1 to keep track of the X-Desc-Gen-Reason headers in each descriptor upload attempt:
diff --git a/src/or/directory.c b/src/or/directory.c
index c419b61..4570b20 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -5102,6 +5102,15 @@ directory_handle_command_post,(dir_connection_t *conn, const char *headers,
     const char *msg = "[None]";
     uint8_t purpose = authdir_mode_bridge(options) ?
                       ROUTER_PURPOSE_BRIDGE : ROUTER_PURPOSE_GENERAL;
+
+    {
+      char *genreason = http_get_header(headers, "X-Desc-Gen-Reason: ");
+      log_info(LD_DIRSERV,
+               "New descriptor post, because: %s",
+               genreason ? genreason : "not specified");
+      tor_free(genreason);
+    }
+
     was_router_added_t r = dirserv_add_multiple_descriptors(body, purpose,
                                              conn->base_.address, &msg);
     tor_assert(msg);
And then I counted up the number of upload attempts of each type over the last two-ish weeks:
$ grep "New descriptor post, because:" moria1-info|cut -d: -f5-|sort|uniq -c
  20685 bandwidth has changed
      1 Chosen Or/DirPort changed
  53191 config change
    601 configured managed proxies
  41663 DirPort found reachable
     96 dns resolvers back
    191 IP address changed
 131619 not listed in consensus
1647625 not specified
  31440 ORPort found reachable
   8518 rotated onion key
     28 set onion key
 120487 time for new descriptor
    156 Tor just started
  25362 version listed in consensus is quite old
Now, the weird "not listed in consensus" and "version listed in consensus is quite old" ones are covered by #25685.
But there are a huge number that are simply lacking this header. These tend to come from a relatively small number of relays that are just bombing me with publish attempts.
In fact, out of those 1647625 attempts that didn't provide a reason, nearly all of them (about 99.8%) got discarded:
$ grep -A1 "New descriptor post, because: not specified" moria1-info|grep "Not replacing descriptor"|wc -l
1645051
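To get the discard count for every reason in one pass, rather than one grep per reason, something like the following might work. It is only a sketch: the function name summarize_posts is made up here, and it assumes the same thing the grep -A1 above assumes, namely that a "Not replacing descriptor" line, when there is one, directly follows the "New descriptor post" line it belongs to.

```shell
# Hypothetical helper, not part of Tor: summarize descriptor-post reasons
# from an info-level log that contains the instrumented log line.
# Prints one row per reason: total posts, posts followed by a
# "Not replacing descriptor" line, and the reason, separated by tabs.
summarize_posts() {
  awk '
    /New descriptor post, because: / {
      # Keep only the text after the marker as the reason.
      sub(/^.*New descriptor post, because: /, "")
      total[$0]++
      pending = $0      # remember the reason for the next line only
      next
    }
    # Only the line right after a post is checked, like grep -A1.
    pending != "" && /Not replacing descriptor/ { discarded[pending]++ }
    { pending = "" }
    END {
      for (r in total)
        printf "%d\t%d\t%s\n", total[r], discarded[r], r
    }
  ' "$@" | LC_ALL=C sort -t "$(printf '\t')" -k3
}

# Example invocation (log file name taken from the commands above):
#   summarize_posts moria1-info
```

Comparing the second column against the first for each reason should show whether the high discard rate is specific to the headerless posts or also shows up for the other reasons.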
We should try to track down what bug on the relay side causes republishes without including a reason header. Maybe we do this by examining the relay code for paths where a republish can happen without also setting the reason header?
See #3942 for a time long ago when we had problems with listing our reason. And see #21642 for where a lot of this analysis started.