We have some indicators that serious general overload is going on at non-exit relays. We "solved" the exit relay issues by simply not using the DNS failure metric anymore (see team#139 (closed) for some analysis of that problem). We might need to tune the metrics that get triggered by non-exits as well.
We should probably use our network-health relays for testing and for figuring out what is going on.
s7r's post in particular sounded like a guard relay that was suddenly being hit by tons of circuit requests, or a similar DoS attack.
Yeah. I think at this point it's not clear. Looking at the last graphs I uploaded, I'd say some spikes could be mapped to that kind of DoS, but not all of them.
For that particular timeframe (around 12/20/2021 - 03/01/2022) we have the following spikes, taking the 72h overload-general window for guard relays (the spikes are caused by non-exits, as the number of overloaded exits stays around 100-200 over the whole period):
I've checked our laundry list of things to look at for network performance issues (modulo the outlier analysis, which I've not done yet) but nothing jumps out here.
One thing we noticed while graphing only the overload reports that are not more than 18h in the past (18h being the maximum time between descriptor uploads) is that the amount of overload drops drastically for the timeframe in question, while other periods are not affected.
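For reference, here is a minimal sketch of that kind of windowed count, assuming the Onionoo `details` document exposes an `overload_general_timestamp` field (milliseconds since the epoch) next to `flags`; the field names and semantics should be double-checked against the Onionoo protocol spec before relying on the numbers.

```python
# Sketch: count guard-only (non-exit) relays whose overload-general report
# falls inside a given window, so the 72h and 18h views can be compared.
import time
import requests

def overloaded_guard_only(window_hours):
    resp = requests.get(
        "https://onionoo.torproject.org/details",
        params={"running": "true",
                "fields": "fingerprint,flags,overload_general_timestamp"},
        timeout=60,
    )
    resp.raise_for_status()
    cutoff_ms = (time.time() - window_hours * 3600) * 1000
    return {
        r["fingerprint"]
        for r in resp.json().get("relays", [])
        if r.get("overload_general_timestamp", 0) >= cutoff_ms
        and "Guard" in r.get("flags", []) and "Exit" not in r.get("flags", [])
    }

if __name__ == "__main__":
    broad, strict = overloaded_guard_only(72), overloaded_guard_only(18)
    print(f"72h window: {len(broad)} guard-only relays, "
          f"18h window: {len(strict)}")
```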
One theory is that those spurious overloads are caused by ntor onionskins being dropped (assuming that neither spurious TCP port exhaustion nor OOM invocations due to memory pressure are at play here). In fact, we have evidence that this indicator got triggered on relays within the timeframe we are concerned with in this ticket, by overloading relays for just a few minutes every couple of days. We made that indicator more robust so that this kind of probing no longer triggers overload reporting. That change shipped in 0.4.7.5-alpha.
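A relay operator wanting to check whether dropped ntor onionskins line up with their overload-general report could scrape the relay's MetricsPort (Prometheus text format, available since 0.4.7). The sketch below just greps the output for onionskin-related counters with a "dropped" label; the MetricsPort address is hypothetical and the exact metric names should be verified against your relay's actual output.

```python
# Sketch: pull onionskin "dropped" counters from a relay's MetricsPort.
import urllib.request

METRICS_URL = "http://127.0.0.1:9035/metrics"  # hypothetical MetricsPort address

def dropped_onionskin_lines(url=METRICS_URL):
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [line for line in text.splitlines()
            if "onionskins" in line and "dropped" in line
            and not line.startswith("#")]

for line in dropped_onionskin_lines():
    print(line)
```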
I picked that spike to look a bit closer at whether there are any patterns in the 18h overload numbers for that period, but none are obvious. For one, there are no large operators that restarted their relays, e.g. to make the overload warning on relay-search go away (we've been in contact with some folks doing that). Secondly, no single AS or similar grouping seems to be affected.
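A rough sketch of that clustering check, assuming the Onionoo `details` document carries `as`, `as_name` and `contact` fields for each relay (field names to be confirmed against the spec): group the currently overloaded relays by AS and by contact string and look at the biggest buckets.

```python
# Sketch: do the overloaded relays cluster on one AS or one operator?
from collections import Counter
import requests

def overload_clusters():
    resp = requests.get(
        "https://onionoo.torproject.org/details",
        params={"running": "true",
                "fields": "fingerprint,overload_general_timestamp,as,as_name,contact"},
        timeout=60,
    )
    resp.raise_for_status()
    overloaded = [r for r in resp.json().get("relays", [])
                  if r.get("overload_general_timestamp")]
    by_as = Counter(f"{r.get('as', '?')} {r.get('as_name', '')}".strip()
                    for r in overloaded)
    by_contact = Counter(r.get("contact", "<none>") for r in overloaded)
    return by_as.most_common(10), by_contact.most_common(10)

if __name__ == "__main__":
    top_as, top_contacts = overload_clusters()
    print("Top ASes:", top_as)
    print("Top contacts:", top_contacts)
```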
Looking over the relays that show stable overload between 01/28 and 02/03, there are between ca. 40 and 70 Guard-only ones (out of 130 for 01/28, 140 for 02/01, 302 for 02/02, and 256 for 02/03). So the remaining ones are pretty much fluctuating. I suspect those are still due to relay probing, even though for those relays it might not be as spurious as in the 72h window.
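The stable vs. fluctuating split is just set arithmetic over per-day snapshots of overloaded fingerprints (e.g. daily dumps of the windowed query above; where those snapshots come from is an assumption here):

```python
# Sketch: relays overloaded on every day of the range are "stable",
# everything else in the union is "fluctuating".
def stable_vs_fluctuating(daily_sets):
    """daily_sets: dict mapping 'YYYY-MM-DD' -> set of fingerprints."""
    if not daily_sets:
        return set(), set()
    stable = set.intersection(*daily_sets.values())
    fluctuating = set.union(*daily_sets.values()) - stable
    return stable, fluctuating

# Hypothetical usage with the dates from the comment above:
# stable, fluctuating = stable_vs_fluctuating({
#     "2022-01-28": day1, "2022-02-01": day2,
#     "2022-02-02": day3, "2022-02-03": day4,
# })
# print(len(stable), "stable,", len(fluctuating), "fluctuating")
```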
Another thing I looked at was whether there are bandwidth spikes at the affected relays. I checked a couple of them but did not see any weird pattern for the period in question either.
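For anyone wanting to repeat that per-relay check, a sketch using the Onionoo `bandwidth` document; the `write_history`/`1_month` layout and the `factor` scaling are assumed to follow the Onionoo history-object format, so verify against the spec.

```python
# Sketch: fetch a relay's recent write-bandwidth history and eyeball it
# for spikes in the period of interest.
import requests

def bandwidth_values(fingerprint, period="1_month"):
    resp = requests.get(
        "https://onionoo.torproject.org/bandwidth",
        params={"lookup": fingerprint},
        timeout=60,
    )
    resp.raise_for_status()
    relays = resp.json().get("relays", [])
    if not relays:
        return []
    history = relays[0].get("write_history", {}).get(period)
    if not history:
        return []
    factor = history.get("factor", 1)
    return [v * factor if v is not None else None
            for v in history.get("values", [])]

# values = bandwidth_values("<fingerprint>")  # bytes/s per interval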