Internet connection issues cause arti to get into a state where all connections time out

I think that the following issues are all the same problem:

(@opara seems to have a similar idea in #2071 (closed))

If we agree that these are likely to all be the same issue, I'd like to close all those in favor of this issue, just to have a single place to discuss this.

I do not have a consistent repro for this, although once you're in this state, copying the ~/.local/share/arti/ directory lets you toggle back and forth between a working and a non-working state. However, sometimes the non-working state will magically start working again. In doing this, I have found that my current non-working state can be fixed by deleting the state/circuit_timeouts.json file. Here are the non-working and working (blank) files:

Non-working `state/circuit_timeouts.json`
{
  "version": 1,
  "histogram": [
    [355, 4],  [365, 6],  [375, 4],  [385, 6],  [395, 9],  [405, 9],
    [415, 17], [425, 6],  [435, 17], [445, 7],  [455, 7],  [465, 9],
    [475, 8],  [485, 8],  [495, 13], [505, 5],  [515, 12], [525, 9],
    [535, 7],  [545, 17], [555, 10], [565, 35], [575, 26], [585, 16],
    [595, 18], [605, 20], [615, 24], [625, 28], [635, 25], [645, 30],
    [655, 37], [665, 28], [675, 36], [685, 32], [695, 38], [705, 33],
    [715, 32], [725, 33], [735, 38], [745, 34], [755, 32], [765, 18],
    [775, 19], [785, 21], [795, 36], [805, 36], [815, 31], [825, 15],
    [835, 16], [845, 12], [855, 9],  [865, 2]
  ],
  "current_timeout": 748
}
Working (blank) `state/circuit_timeouts.json`
{
  "version": 1,
  "histogram": [],
  "current_timeout": 60000
}
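
For reference, here is a minimal sketch of how one might load and inspect these files outside of arti. It assumes the serde and serde_json crates, and the field meanings are my reading of the data above (the struct name and units are assumptions, not arti's actual types): histogram entries look like (build-time bucket in ms, observation count) pairs, and current_timeout looks like milliseconds, with 60000 ms being the fresh default.

// Sketch only: assumed field meanings, not arti's actual types.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct CircuitTimeoutState {
    version: u32,
    /// Pairs of (circuit build-time bucket in ms, observation count) -- assumed.
    histogram: Vec<(u32, u32)>,
    /// Learned circuit build timeout in ms (60000 appears to be the fresh default).
    current_timeout: u32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = std::env::args()
        .nth(1)
        .unwrap_or_else(|| "circuit_timeouts.json".to_string());
    let state: CircuitTimeoutState =
        serde_json::from_str(&std::fs::read_to_string(path)?)?;
    let samples: u32 = state.histogram.iter().map(|&(_, n)| n).sum();
    println!(
        "version {}: {} samples, current_timeout = {} ms",
        state.version, samples, state.current_timeout
    );
    Ok(())
}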

To me, the difference between these two files implies that choosing a higher minimum timeout could help alleviate this issue: the non-working state has learned a current_timeout of 748 ms, while a fresh state falls back to 60000 ms, and presumably a degraded network can make every circuit build exceed the learned value.
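
As a concrete version of that idea, here is a tiny sketch of clamping the learned timeout to a floor. MIN_CIRCUIT_TIMEOUT and effective_timeout are hypothetical names for illustration, not existing arti code:

use std::time::Duration;

/// Hypothetical floor for the learned circuit build timeout (the "maybe
/// 2500ms?" idea suggested further down); not an existing arti constant.
const MIN_CIRCUIT_TIMEOUT: Duration = Duration::from_millis(2500);

/// Whatever the estimator learns, never let the effective timeout drop
/// below the floor.
fn effective_timeout(learned: Duration) -> Duration {
    learned.max(MIN_CIRCUIT_TIMEOUT)
}

fn main() {
    // With the non-working state above, the learned value is 748 ms.
    let learned = Duration::from_millis(748);
    assert_eq!(effective_timeout(learned), MIN_CIRCUIT_TIMEOUT);
    println!("effective timeout: {:?}", effective_timeout(learned));
}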

My suspicion is that only certain kinds of network issues cause this. Arti handles the network being completely down fine; I think the problem comes when you have 100% packet loss or similar situations, but my iptables skills were not up to the task of simulating that readily. If a 100% packet loss situation does indeed cause this, it might be worth thinking about whether that's something we can detect and treat the same as the network being down. I'm not sure whether that complexity is worth it, though.
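
For concreteness, one rough shape such detection could take is below. The threshold and all names here are made up for illustration; this is not arti's actual code:

/// A made-up tracker for counting consecutive circuit-build timeouts.
struct TimeoutStreak {
    consecutive_timeouts: u32,
}

impl TimeoutStreak {
    /// Arbitrary illustrative threshold.
    const SUSPECT_NETWORK_UNUSABLE_AFTER: u32 = 10;

    fn new() -> Self {
        Self { consecutive_timeouts: 0 }
    }

    fn record_success(&mut self) {
        self.consecutive_timeouts = 0;
    }

    fn record_timeout(&mut self) {
        self.consecutive_timeouts += 1;
    }

    /// When this returns true, a caller could skip feeding the sample into
    /// the timeout estimator and instead run the same recovery path used
    /// when the network is reported as down.
    fn network_probably_unusable(&self) -> bool {
        self.consecutive_timeouts >= Self::SUSPECT_NETWORK_UNUSABLE_AFTER
    }
}

fn main() {
    let mut streak = TimeoutStreak::new();
    for _ in 0..10 {
        streak.record_timeout();
    }
    assert!(streak.network_probably_unusable());
    streak.record_success();
    assert!(!streak.network_probably_unusable());
}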

There is also the matter of these kinds of log messages:

No usable guards. Rejected 60/60 as down, then 0/0 as pending, then 0/0 as unsuitable to purpose, then 0/0 with filter.

In #1861 (closed), @nickm writes:

So it seems the behavior here is that we have previously tried all our possible guards, marked them all as down, and decided not to try them any more because they are down.

IIRC (cc @arma @dgoulet) C tor has a behavior here where, in this situation, we optimistically retry our primary guards, to see if they work. We should probably:

  • see what C tor does in this situation,
  • see whether it is reasonable, and come up with something better if it isn't,
  • make sure that the specs document it,
  • implement it.

If this is indeed what's going on, retrying primary guards once we've rejected all guards seems like a reasonable solution. However, I'm not totally clear on how this code works, so I can't say whether this diagnosis is correct: if it were, I'd be surprised that resetting circuit_timeouts.json without touching guards.json can fix this. If the rejected status of guards is reset when the process restarts, that would make sense; if not, then this diagnosis seems incorrect, or at least can't be the complete picture.
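
To make the retry idea concrete, here is an illustrative sketch of "retry primary guards once everything is marked down". The types and method names are hypothetical stand-ins, not arti's GuardMgr API:

/// Illustrative stand-ins for guard state; not arti's actual guard types.
struct Guard {
    is_primary: bool,
    marked_down: bool,
}

struct GuardSet {
    guards: Vec<Guard>,
}

impl GuardSet {
    fn all_marked_down(&self) -> bool {
        !self.guards.is_empty() && self.guards.iter().all(|g| g.marked_down)
    }

    /// If every guard has been rejected as down (the "Rejected 60/60 as
    /// down" situation in the log message above), optimistically clear the
    /// "down" mark on the primary guards so the next circuit attempt
    /// retries them instead of failing immediately.
    fn maybe_retry_primaries(&mut self) {
        if self.all_marked_down() {
            for g in self.guards.iter_mut().filter(|g| g.is_primary) {
                g.marked_down = false;
            }
        }
    }
}

fn main() {
    let mut set = GuardSet {
        guards: vec![
            Guard { is_primary: true, marked_down: true },
            Guard { is_primary: false, marked_down: true },
        ],
    };
    set.maybe_retry_primaries();
    assert!(!set.guards[0].marked_down);
    assert!(set.guards[1].marked_down);
}

Whether this would belong in arti's guard manager or somewhere else, I don't know; the point is just that clearing the "down" mark on primaries gives the next attempt something to try.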

I'll add this to the discussion for the next team meeting; it would be nice to get thoughts on possible solutions and ways forward here. Some possibilities:

  • Increase minimum timeout (maybe 2500ms?)
  • Detect situation where all guards are rejected
    • Reset timeout?
    • Retry primary guards?
  • Try to detect kinds of network failures that lead to this, and don't update timeouts in those situations.

cc @syphyr
