Mystery: why do some busy relays consistently not answer my create cells?
I adapted my bermuda exit scanner into an 'extendprobe' tool, which builds two-hop circuits from my Tor client, through each relay under test, to a destination relay that I run. The goal is to identify which relays are bad at finishing these circuits to my relay, and then to diagnose and fix them. About 100-200 relays in the network consistently have problems extending:
https://lists.torproject.org/pipermail/network-health/2021-March/000668.html
I mailed the operators I could. Some of their relays were out of file descriptors, some were censoring outbound ports that included my relay's ORPort, and some were failing for other as-yet-unclear reasons. These are the four cases I have seen where I get a destroy cell back from the middle relay:
650 CIRC 45556 FAILED $6EE7AEAFC24542BF4DB6E724E0B49D85DF9516FA~LANETRelay PURPOSE=GENERAL TIME_CREATED=2021-04-01T08:13:41.165376 REASON=DESTROYED REMOTE_REASON=RESOURCELIMIT
650 CIRC 58 FAILED $B891CB6370CF7C51C6FB24D80947AFB7ED463D00~niftygrolantor PURPOSE=GENERAL TIME_CREATED=2021-03-30T04:18:40.899307 REASON=DESTROYED REMOTE_REASON=FINISHED
650 CIRC 57871 FAILED $42A8C3AECAFD03F0242D09FA2C022AED849BF933~router PURPOSE=GENERAL TIME_CREATED=2021-04-02T01:03:31.822972 REASON=DESTROYED REMOTE_REASON=CHANNEL_CLOSED
650 CIRC 120 FAILED $45B8EB2373436BB8B714FD7A64F29FD85C71D308~darknetdotservices PURPOSE=GENERAL TIME_CREATED=2021-03-30T04:19:42.911016 REASON=DESTROYED REMOTE_REASON=CONNECTFAILED
But many of them, especially really fast ones, consistently fail not with a destroy cell, but because my client times out and gives up waiting for the circuit:
650 CIRC 57764 FAILED $98F793C7320CE3C15A45353AFCC165747A40366D~F3Netze PURPOSE=GENERAL TIME_CREATED=2021-04-02T00:32:28.661684 REASON=TIMEOUT
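To separate the destroy-cell failures from the timeouts across many probes, the event lines above can be tallied by failure class. The following is a minimal sketch, not part of extendprobe: the function names (`parse_circ_failure`, `classify`, `tally`) are hypothetical, and it assumes every event looks like the lines quoted here, with a `$FINGERPRINT~nickname` path followed by `KEY=VALUE` fields.

```python
import re
from collections import Counter

# Assumption: relays in the path are written as $<40 hex digits>~<nickname>,
# as in the log lines quoted in this post.
_RELAY = re.compile(r"\$([0-9A-Fa-f]{40})~([^,\s]+)")
_KEYVAL = re.compile(r"(\w+)=(\S+)")


def parse_circ_failure(line):
    """Parse a '650 CIRC <id> FAILED ...' line; return None for anything else."""
    parts = line.split()
    if len(parts) < 4 or parts[:2] != ["650", "CIRC"] or parts[3] != "FAILED":
        return None
    # The path is optional; when present it is the token right after FAILED.
    relays = []
    if len(parts) > 4 and parts[4].startswith("$"):
        relays = _RELAY.findall(parts[4])
    fingerprint, nickname = relays[-1] if relays else (None, None)
    return {
        "circ_id": parts[2],
        "fingerprint": fingerprint,  # last relay listed in the path
        "nickname": nickname,
        "fields": dict(_KEYVAL.findall(line)),
    }


def classify(event):
    """Label an event as a destroy (with its remote reason), a timeout, or other."""
    reason = event["fields"].get("REASON", "UNKNOWN")
    if reason == "DESTROYED":
        return "destroy:" + event["fields"].get("REMOTE_REASON", "UNKNOWN")
    if reason == "TIMEOUT":
        return "timeout"
    return "other:" + reason


def tally(lines):
    """Count failure classes over an iterable of control-port event lines."""
    counts = Counter()
    for line in lines:
        event = parse_circ_failure(line)
        if event is not None:
            counts[classify(event)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "650 CIRC 45556 FAILED $6EE7AEAFC24542BF4DB6E724E0B49D85DF9516FA~LANETRelay "
        "PURPOSE=GENERAL TIME_CREATED=2021-04-01T08:13:41.165376 "
        "REASON=DESTROYED REMOTE_REASON=RESOURCELIMIT",
        "650 CIRC 57764 FAILED $98F793C7320CE3C15A45353AFCC165747A40366D~F3Netze "
        "PURPOSE=GENERAL TIME_CREATED=2021-04-02T00:32:28.661684 REASON=TIMEOUT",
    ]
    for label, count in sorted(tally(sample).items()):
        print(label, count)
```

Grouping the same counts per fingerprint rather than globally would show which relays time out consistently as opposed to occasionally.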
Now, obviously these busy relays are busy -- that is, they're handling a lot of traffic, and so they must be able to make (or receive) some connections in order to be getting that traffic. But how come they won't make my connections?
One hypothesis is that they are good at sending cells (eventually), but not at establishing new ORConns (in time), and so when they're more idle, they make connections, and enough circuits stay established that those connections endure long-term. I specifically set up this first experiment to (likely) require both a new ORConn and a circuit handshake.
But there are many other possible explanations. Maybe we have a mis-design in our scheduler that prioritizes existing conns rather than handshaking ones, so a busy relay just never gets around to finishing the conns? Maybe it does eventually finish but my client has given up well before then? (Do I send a destroy when I give up, meaning there are no pending circuits, meaning it then closes that conn it took so long to open?) Maybe there is a difference between making time for outgoing ORConns vs making time for incoming ORConns?
So that's mystery one: figure out the mechanism for why some big relays consistently end up timing out my extend request.
And then, whatever the actual mechanism is, mystery two is: what is this doing to the circuits that exist on the network as a whole? If I'm timing out because there's no existing conn between those relays, and also they won't make one, does that mean there are edges in the network graph that are essentially never used, and the phenomenon gets worse as relays get busy, maybe even in a way that could be externally induced? Tor clients will work around these problems, hiding the extent of the issue from us, because they make their circuits preemptively. That's great for robustness but poor for finding problems and also potentially poor for anonymity.