prop224: Disconnects on long-lasting HS connections (possibly because of mds)

Hello,

so I've been noticing the following pattern in my prop224 IRC client:

Everything goes well for like a day or so, IRC connection stays open and everything. Then at some point after like 30+ hours, there is a DC on my IRC connection for a few secs, and then a reconnect.

That could be caused by any of the 6 rend circuit hops going down, but I've seen it happen 4-5 times now and I'm starting to suspect that's not the case.

So this happened yesterday night as well. Here are some logs from before/during/after the reconnect happened (at 01:13):

Sep 16 00:50:55.000 [info] handle_response_fetch_consensus(): Received consensus directory (body size 2082670) from server '137.205.124.35:1720'
Sep 16 00:50:55.000 [info] I learned some more directory information, but not enough to build a circuit: We're missing descriptors for some of our primary entry guards
Sep 16 00:50:55.000 [info] handle_response_fetch_consensus(): Successfully loaded consensus.
Sep 16 00:50:55.000 [info] connection_free_(): Freeing linked Directory connection [client finished] with 0 bytes on inbuf, 0 on outbuf.
Sep 16 00:51:53.000 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
Sep 16 00:51:57.000 [info] circuit_mark_for_close_(): Circuit 0 marked for close at src/or/command.c:599 (orig reason: 521, new reason: 0)
Sep 16 00:52:53.000 [info] routerlist_remove_old_routers(): We have 0 live routers and 0 old router descriptors.
...
...
Sep 16 01:13:30.000 [info] connection_edge_reached_eof(): conn (fd 18) reached eof. Closing.
Sep 16 01:13:31.000 [info] connection_handle_listener_read(): New SOCKS connection opened from 127.0.0.1.
Sep 16 01:13:31.000 [info] connection_ap_handle_onion(): Got a hidden service request for ID 'd6sfftbz6pkwfwwl'
Sep 16 01:13:31.000 [info] rep_hist_note_used_internal(): New port prediction added. Will continue predictive circ building for 2492 more seconds.
Sep 16 01:13:31.000 [info] connection_ap_handle_onion(): Descriptor is here. Great.
Sep 16 01:13:31.000 [info] connection_edge_process_inbuf(): data from edge while in 'waiting for circuit' state. Leaving it on buffer.
...
Sep 16 01:13:31.000 [notice] Application request when we haven't used client functionality lately. Optimistically trying directory fetches again.
...
Sep 16 01:13:33.000 [notice] We now have enough directory information to build circuits.
...
Sep 16 01:13:34.000 [notice] Tor has successfully opened a circuit. Looks like client functionality is working.
Sep 16 01:13:34.000 [notice] Received an RENDEZVOUS_ESTABLISHED. This circuit is now ready for rendezvous.
Sep 16 01:13:34.000 [notice] Re-extending circ 3572311836, this time to $AB0CC19517476E610B86FF796B703643BFE21E89~$AB0CC19517476E610B at 5.51.106.108.
Sep 16 01:13:34.000 [notice] Re-extending circ 3572311836, this time to $DB4080D411B0E62788EE38164FFDC275F20159BB~$DB4080D411B0E62788 at 163.172.152.231.
Sep 16 01:13:36.000 [notice] Got RENDEZVOUS2 cell from hidden service on circuit 2915300815.
Sep 16 06:04:51.000 [notice] Heartbeat: Tor's uptime is 1 day 17:59 hours, with 2 circuits open. I've sent 4.01 MB and received 21.77 MB.
Sep 16 06:04:51.000 [notice] Average packaged cell fullness: 23.706%. TLS write overhead: 5%

So it seems like after many many hours of operation we somehow ended up with no mds for our primary guards.

I've noticed this the other day in a different way as well. Tor was running for days, and I would get this message every once in a while:

 I learned some more directory information, but not enough to build a circuit: We need more microdescriptors: we have 5822/6822...

but everytime the number of mds would decrease more and more. Then at some point it reached below minimum dir info, Tor stalled and then it started fetching mds again. It's like something expires mds while in memory, but there is no logic to fetch them back in before we reach the minimum point.