I have a relay running the vanilla scheduler (I disabled KIST everywhere when I noticed #29427 (moved), around the time it appeared) where the transfer rate now maxes out at 2 Mbyte/sec--even when traffic is light and right after a restart. I just noticed this; two years ago it could easily manage 5 Mbyte/sec for a single connection even under moderately high load. Direct curl downloads of the same files on the system where the relay runs hit a sustained 24 Mbyte/sec.
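For anyone who wants to reproduce the comparison, here is a minimal measurement sketch, assuming a local tor client (SocksPort 9050) whose circuit traverses the relay under test; the URL and the requests/PySocks dependency (`pip install requests[socks]`) are my assumptions, not part of the original setup:

{{{
#!python
import time
import requests

URL = "https://example.com/testfile"  # placeholder; substitute your own test file

def measure(proxies=None):
    """Stream the file and return the observed rate in bytes/sec."""
    start, total = time.time(), 0
    with requests.get(URL, proxies=proxies, stream=True, timeout=120) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
    return total / (time.time() - start)

direct = measure()
# socks5h:// makes tor resolve the hostname inside the circuit
via_tor = measure({"https": "socks5h://127.0.0.1:9050"})
print(f"direct: {direct / 1e6:.1f} Mbyte/s, via tor: {via_tor / 1e6:.1f} Mbyte/s")
}}}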
I have logged bandwidth scanner votes for over two years and observe a sharp and permanent drop in the Torflow ratio in December 2018. The system is on the central European Internet and has CDN-grade connectivity.
I plan to investigate further by running old versions of tor in no-publish-descriptor mode and repeating the experiments, but this is an important issue, so I opened the ticket once I became convinced the problem is real rather than waiting. If it proves out, it has implications for the state of bandwidth measurements and load balancing in the network.
The KIST-related transfer rate impairment shown in the graph in #29427 (moved) is huge. I am doubtful about the cost vs. benefit of the feature.
Just to be clear, that issue affects clients only. Relays do not suffer from the interval issue because they have thousands of open connections, compared to a client's handful at best.
So in theory relays should really not be capped relative to the Vanilla scheduler... If there is evidence that KIST performs poorly compared to Vanilla on the relay side, then that is a different story.
But without data, we can't really go further here.
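To make the interval reasoning concrete, a back-of-the-envelope sketch (the per-visit writable space is an assumed figure, not a measurement): if KIST only revisits a socket once per run interval, a client with one busy socket is bounded by how much the kernel will accept per visit, while a relay's thousands of sockets keep the scheduler busy between intervals.

{{{
#!python
# Hypothetical numbers for illustration only.
writable_space = 20 * 1024   # bytes the kernel accepts per scheduler visit (assumed)
run_interval = 0.010         # seconds; KISTSchedRunInterval defaults to 10 msec
ceiling = writable_space / run_interval
print(f"per-socket ceiling: {ceiling / 1e6:.1f} Mbyte/s")  # ~2.0 Mbyte/s
}}}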
As for the bandwidth scanner issue, that is a very different problem, and a big one that we are trying to address.
I've seen very rare occurrences of circuit cell queues hitting their maximum on all my relays (there is a log entry in the heartbeat).
Also, that cell limit is 6.25 times the limit actually enforced by the flow control protocol (SENDME), so I don't see how it can affect performance :S...
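For concreteness, assuming the usual circuit-level SENDME window of 1000 cells:

{{{
#!python
sendme_window = 1000             # cells a circuit may have in flight under SENDME
queue_limit = 6.25 * sendme_window
print(int(queue_limit))          # 6250 cells, far beyond what SENDME lets in flight
}}}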
I ran several experiments and found that I was wrong about a possible performance regression. I reproduced fast single-circuit transfer rates with 0.4.3, which may even be a touch faster than older versions. I will close this ticket after allowing time for further comment.
Earlier I was running a client on a system lacking hardware acceleration for AES encryption, forgetting a change made some time back to one of my typical setups. What is interesting is that CPU consumption was very low on the non-AES CPU during the flawed tests. I suspect the problem in that case is an accumulation of store-decrypt/encrypt-and-forward latencies interacting badly with various other path components, producing terrible throughput.
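A rough sketch of why accumulated latency alone can produce such a cliff, assuming the fixed circuit-level SENDME window of 1000 cells (~498 payload bytes each); the RTT figures are illustrative:

{{{
#!python
# With a fixed in-flight window, throughput <= window_bytes / effective_RTT,
# so every store-decrypt/encrypt-and-forward delay added along the path
# lowers the ceiling directly.
window_bytes = 1000 * 498                # circuit SENDME window in payload bytes
for rtt in (0.10, 0.25, 0.50):           # effective round-trip times (assumed)
    print(f"RTT {rtt * 1000:3.0f} ms -> ceiling {window_bytes / rtt / 1e6:.1f} Mbyte/s")
# 250 ms of accumulated latency already caps a single circuit near 2 Mbyte/s
}}}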
This is notable if one examines a ranking of the fastest relays by SBWS absolute bandwidth, either mean or median: at any given time only a handful of relays (literally fewer than five) are rated faster than the 2 Mbyte/sec maximum I observed in the problematic test. It might indicate that typical-case degradation due to accumulated end-to-end latencies deserves a close look, with an eye toward discovering and correcting a performance cliff. Ticket #29427 (moved) could be the culprit--I haven't yet completely eliminated KIST from my tests because recent changes prevent one-hop circuits. It seems to me the reduction of the default KISTSchedRunInterval to 2 ms should be completed (see the torrc sketch below).
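For reference, a torrc sketch of the two scheduler knobs discussed here; both options are documented in the tor man page, and the 2 msec value mirrors the reduction proposed above:

{{{
# Bypass KIST entirely:
Schedulers Vanilla

# Or keep KIST but shorten the run interval:
#Schedulers KIST
#KISTSchedRunInterval 2 msec
}}}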
Trac:
Summary: possible single circuit maximum transfer rate regression in 0.3.4 to possible single circuit maximum transfer rate regression in typical circuit path cases
Version: Tor: 0.3.4.10 to Tor: 0.4.3.5