Mystery bug causes relays to attempt many, many descriptor publishes, with no X-Desc-Gen-Reason header

Sometimes relays get into a state where they try to publish a new descriptor every second, and this state lasts for hours or days.

I instrumented moria1 to keep track of the X-Desc-Gen-Reason headers in each descriptor upload attempt:

diff --git a/src/or/directory.c b/src/or/directory.c
index c419b61..4570b20 100644
--- a/src/or/directory.c
+++ b/src/or/directory.c
@@ -5102,6 +5102,15 @@ directory_handle_command_post,(dir_connection_t *conn, const char *headers,
     const char *msg = "[None]";
     uint8_t purpose = authdir_mode_bridge(options) ?
                       ROUTER_PURPOSE_BRIDGE : ROUTER_PURPOSE_GENERAL;
+
+    {
+      char *genreason = http_get_header(headers, "X-Desc-Gen-Reason: ");
+      log_info(LD_DIRSERV,
+               "New descriptor post, because: %s",
+               genreason ? genreason : "not specified");
+      tor_free(genreason);
+    }
+
     was_router_added_t r = dirserv_add_multiple_descriptors(body, purpose,
                                              conn->base_.address, &msg);
     tor_assert(msg);

And then I counted up the number of upload attempts of each type over the last two-ish weeks:

$ grep "New descriptor post, because:" moria1-info|cut -d: -f5-|sort|uniq -c
  20685  bandwidth has changed
      1  Chosen Or/DirPort changed
  53191  config change
    601  configured managed proxies
  41663  DirPort found reachable
     96  dns resolvers back
    191  IP address changed
 131619  not listed in consensus
1647625  not specified
  31440  ORPort found reachable
   8518  rotated onion key
     28  set onion key
 120487  time for new descriptor
    156  Tor just started
  25362  version listed in consensus is quite old

Now, the weird "not listed in consensus" and "version listed in consensus is quite old" ones are covered by #25685.

But there are a huge number that are simply lacking this header. These tend to come from a relatively small number of relays that are just bombing me with publish attempts.

In fact, out of those 1647625 attempts that didn't provide a reason, nearly all of them (about 99.8%) got discarded:

$ grep -A1 "New descriptor post, because: not specified" moria1-info|grep "Not replacing descriptor"|wc -l
1645051
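
For anyone who wants to redo this count on their own authority's info log, here is a rough Python sketch that does both tallies above in one pass. It assumes only the "New descriptor post, because:" message from the patch above and the "Not replacing descriptor" string from the grep; the function name tally_posts is made up for illustration. Like the grep -A1 pipeline, it only looks at the single line right after each post, so it will miss a discard message that is logged further down.

```python
from collections import Counter

# Strings taken from the patch and the grep commands in this ticket.
POST_MARKER = "New descriptor post, because: "
DISCARD_MARKER = "Not replacing descriptor"


def tally_posts(lines):
    """Return (posts per reason, discarded posts per reason).

    A post counts as discarded only when the line immediately after it
    contains DISCARD_MARKER, mirroring `grep -A1 ... | grep ...`.
    """
    posts = Counter()
    discarded = Counter()
    pending = None  # reason from the previous line, if it was a post
    for line in lines:
        line = line.rstrip("\n")
        if pending is not None and DISCARD_MARKER in line:
            discarded[pending] += 1
        pending = None
        idx = line.find(POST_MARKER)
        if idx != -1:
            pending = line[idx + len(POST_MARKER):].strip()
            posts[pending] += 1
    return posts, discarded


# Hypothetical usage (the log path is whatever your info log is called):
#   with open("moria1-info", errors="replace") as log:
#       posts, discarded = tally_posts(log)
#   for reason, count in posts.most_common():
#       print(count, reason, "discarded:", discarded[reason])
```

Having the discard count per reason, rather than only for "not specified", would also show whether the posts that do carry a reason get thrown away at a similar rate.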

We should try to track down what bug on the relay side causes republishes without including a reason header. Maybe we do this by examining the code and looking for mistakes where a republish can happen without also setting the reason header?

See #3942 for a time long ago that we had problems with listing our reason. And see #21642 for where a lot of this analysis started.