We have some indicators that serious general overload is going on at non-exit relays. We "solved" the exit relay issues by simply not using the DNS failure metric anymore (see team#139 (closed) for some analysis of that problem). We might need to tune the metrics that get triggered by non-exits as well.
We should probably use our network-health relays for testing and for figuring out what is going on.
s7r's post in particular sounded like a guard relay that was suddenly being hit by tons of circuit requests, or a similar DoS attack.
Yeah. I think at this point it's not clear. Looking at the last graphs I uploaded, I'd say some spikes could be mapped to that kind of DoS, but not all of them.
For that particular timeframe (around 12/20/2021 - 03/01/2022) we have the following spikes, taking the 72h overload-general window for guard relays (the spikes are caused by non-exits, as the number of overloaded exits stays around 100-200 over the whole period):
I've checked our laundry list of things to look at for network performance issues (modulo the outlier analysis, which I've not done yet) but nothing jumps out here.
One thing we noticed while graphing only the overload reports that are not more than 18h in the past (18h being the maximum time between descriptor uploads) is that the amount of overload drops drastically for the timeframe in question, while other periods are not affected.
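For reference, here is a minimal sketch of that kind of windowed count, assuming the Onionoo `details` document exposes an `overload_general_timestamp` field (milliseconds since the epoch) next to `flags`; the field names and semantics should be double-checked against the Onionoo protocol spec before relying on the numbers.

```python
# Sketch: count guard-only (non-exit) relays whose overload-general report
# falls inside a given window, so the 72h and 18h views can be compared.
import time
import requests

def overloaded_guard_only(window_hours):
    resp = requests.get(
        "https://onionoo.torproject.org/details",
        params={"running": "true",
                "fields": "fingerprint,flags,overload_general_timestamp"},
        timeout=60,
    )
    resp.raise_for_status()
    cutoff_ms = (time.time() - window_hours * 3600) * 1000
    return {
        r["fingerprint"]
        for r in resp.json().get("relays", [])
        if r.get("overload_general_timestamp", 0) >= cutoff_ms
        and "Guard" in r.get("flags", []) and "Exit" not in r.get("flags", [])
    }

if __name__ == "__main__":
    broad, strict = overloaded_guard_only(72), overloaded_guard_only(18)
    print(f"72h window: {len(broad)} guard-only relays, "
          f"18h window: {len(strict)}")
```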
One theory is that those spurious overloads are caused by ntor onionskins being dropped (assuming that neither spurious TCP port exhaustion nor OOM invocations due to memory pressure are at play here). In fact, we have evidence that this indicator got triggered on relays within the timeframe we are concerned with in this ticket, by overloading relays for just a few minutes every couple of days. We made that indicator more robust so that this kind of probing no longer triggers overload reporting. That change shipped in 0.4.7.5-alpha.
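A relay operator wanting to check whether dropped ntor onionskins line up with their overload-general report could scrape the relay's MetricsPort (Prometheus text format, available since 0.4.7). The sketch below just greps the output for onionskin-related counters with a "dropped" label; the MetricsPort address is hypothetical and the exact metric names should be verified against your relay's actual output.

```python
# Sketch: pull onionskin "dropped" counters from a relay's MetricsPort.
import urllib.request

METRICS_URL = "http://127.0.0.1:9035/metrics"  # hypothetical MetricsPort address

def dropped_onionskin_lines(url=METRICS_URL):
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    return [line for line in text.splitlines()
            if "onionskins" in line and "dropped" in line
            and not line.startswith("#")]

for line in dropped_onionskin_lines():
    print(line)
```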
I picked that spike to look a bit closer at whether there are any patterns in the 18h overload numbers for that period, but none are obvious. For one, there are no large operators that restarted their relays, e.g. to make the overload warning on relay-search go away (we've been in contact with some folks doing that). Secondly, no single AS or similar grouping seems to be affected.
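A rough sketch of that clustering check, assuming the Onionoo `details` document carries `as`, `as_name` and `contact` fields for each relay (field names to be confirmed against the spec): group the currently overloaded relays by AS and by contact string and look at the biggest buckets.

```python
# Sketch: do the overloaded relays cluster on one AS or one operator?
from collections import Counter
import requests

def overload_clusters():
    resp = requests.get(
        "https://onionoo.torproject.org/details",
        params={"running": "true",
                "fields": "fingerprint,overload_general_timestamp,as,as_name,contact"},
        timeout=60,
    )
    resp.raise_for_status()
    overloaded = [r for r in resp.json().get("relays", [])
                  if r.get("overload_general_timestamp")]
    by_as = Counter(f"{r.get('as', '?')} {r.get('as_name', '')}".strip()
                    for r in overloaded)
    by_contact = Counter(r.get("contact", "<none>") for r in overloaded)
    return by_as.most_common(10), by_contact.most_common(10)

if __name__ == "__main__":
    top_as, top_contacts = overload_clusters()
    print("Top ASes:", top_as)
    print("Top contacts:", top_contacts)
```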
Looking over the relays that show stable overload between 01/28 and 02/03, there are between ca. 40 and 70 Guard-only ones (out of 130 for 01/28, 140 for 02/01, 302 for 02/02, and 256 for 02/03). So the remaining ones are pretty much fluctuating. I suspect those are still due to relay probing, even though for those relays it might not be as spurious as in the 72h window.
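The stable vs. fluctuating split is just set arithmetic over per-day snapshots of overloaded fingerprints (e.g. daily dumps of the windowed query above; where those snapshots come from is an assumption here):

```python
# Sketch: relays overloaded on every day of the range are "stable",
# everything else in the union is "fluctuating".
def stable_vs_fluctuating(daily_sets):
    """daily_sets: dict mapping 'YYYY-MM-DD' -> set of fingerprints."""
    if not daily_sets:
        return set(), set()
    stable = set.intersection(*daily_sets.values())
    fluctuating = set.union(*daily_sets.values()) - stable
    return stable, fluctuating

# Hypothetical usage with the dates from the comment above:
# stable, fluctuating = stable_vs_fluctuating({
#     "2022-01-28": day1, "2022-02-01": day2,
#     "2022-02-02": day3, "2022-02-03": day4,
# })
# print(len(stable), "stable,", len(fluctuating), "fluctuating")
```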
Another thing I looked at was whether there are bandwidth spikes at the affected relays. I checked a couple of them but did not see any weird pattern for the period in question either.
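For anyone wanting to repeat that per-relay check, a sketch using the Onionoo `bandwidth` document; the `write_history`/`1_month` layout and the `factor` scaling are assumed to follow the Onionoo history-object format, so verify against the spec.

```python
# Sketch: fetch a relay's recent write-bandwidth history and eyeball it
# for spikes in the period of interest.
import requests

def bandwidth_values(fingerprint, period="1_month"):
    resp = requests.get(
        "https://onionoo.torproject.org/bandwidth",
        params={"lookup": fingerprint},
        timeout=60,
    )
    resp.raise_for_status()
    relays = resp.json().get("relays", [])
    if not relays:
        return []
    history = relays[0].get("write_history", {}).get(period)
    if not history:
        return []
    factor = history.get("factor", 1)
    return [v * factor if v is not None else None
            for v in history.get("values", [])]

# values = bandwidth_values("<fingerprint>")  # bytes/s per interval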