kist: Poor performance with a small number of sockets
We recently found that KIST performs very poorly when tor has very few sockets.
How KIST operates
KIST is scheduled when cells are put on a circuit queue. A scheduler run might not handle all cells, because how much it can write depends on the available space in the socket's TCP buffer. What KIST does at the moment is reschedule itself 10ms later (a static value).
The problem is that if there are very few sockets (as with most tor clients), KIST can handle one socket very quickly, let's say in 1ms, and then sleeps for the remaining 9ms until it is rescheduled.
During those 9ms, tor is not pushing bytes onto the wire even though it could. See the attached graph made by pastly: it shows how badly KIST underperforms with the current 10ms interval.
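To put rough numbers on this, here is a back-of-the-envelope model in Python. It is only a sketch: it assumes each scheduler run can write at most a fixed number of bytes to the socket, and the 64 KiB figure used below is a hypothetical value for illustration, not something measured in tor. The function names are made up for this example.

```python
def idle_fraction(busy_ms: float, interval_ms: float) -> float:
    """Fraction of each scheduling interval spent asleep."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    busy = min(busy_ms, interval_ms)  # a run cannot be busy longer than the interval
    return (interval_ms - busy) / interval_ms


def throughput_cap_bytes_per_sec(bytes_per_run: int, interval_ms: float) -> float:
    """Upper bound on throughput if only one run happens per interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    return bytes_per_run * 1000.0 / interval_ms


# Scenario from the ticket: 1ms of work, then sleep until the 10ms timer fires.
idle = idle_fraction(busy_ms=1.0, interval_ms=10.0)

# Hypothetical: assume one run can write at most 64 KiB to the single socket.
HYPOTHETICAL_BYTES_PER_RUN = 64 * 1024
cap_10ms = throughput_cap_bytes_per_sec(HYPOTHETICAL_BYTES_PER_RUN, 10.0)
cap_2ms = throughput_cap_bytes_per_sec(HYPOTHETICAL_BYTES_PER_RUN, 2.0)
```

Under these assumptions the single socket sits idle for 90% of each interval, and the cap scales inversely with the interval: going from 10ms to 2ms raises it by a factor of five.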
Consequences
(There might be more; don't treat this as an exhaustive list.)
- Clients are basically capped in bandwidth because they generally talk only to the Guard, on a single socket.
- A new relay joining the network won't have any connections, so when the authorities or our bandwidth scanners measure it, they will only be able to measure a capped value rather than what the relay could actually do (if higher). This measurement will recover after a while, once the relay starts seeing traffic and the number of sockets ramps up.
Solution
As you can see in the attached graph, bringing the scheduler interval down to 2ms gives us better performance than Vanilla. That could be a short-term solution.
A better, more medium-term solution would be to make the scheduling interval dynamic, based on how fast tor thinks the TCP buffer on the socket will be emptied. That basically depends on the connection's throughput. For example, a 100mbit NIC might only push 10mbit towards a Guard, so tor would need a way to learn the throughput per connection, which would allow KIST to estimate when it needs to be rescheduled for that connection.
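As a rough illustration of the dynamic interval idea, the sketch below estimates the next reschedule delay from the bytes still sitting in the socket buffer and a per-connection throughput estimate. Everything here is an assumption for discussion: the function name, the parameters, and the 1ms/10ms clamp bounds are hypothetical and do not correspond to existing tor code, and how tor would actually learn the throughput is left open.

```python
def estimate_reschedule_ms(
    buffered_bytes: int,
    throughput_bytes_per_sec: float,
    min_ms: float = 1.0,   # hypothetical lower bound to avoid busy-looping
    max_ms: float = 10.0,  # hypothetical upper bound, the current static value
) -> float:
    """Estimate how long until the socket buffer drains, clamped to a range."""
    if throughput_bytes_per_sec <= 0:
        # No usable estimate yet: fall back to the static interval.
        return max_ms
    drain_ms = buffered_bytes * 1000.0 / throughput_bytes_per_sec
    return max(min_ms, min(max_ms, drain_ms))


# Example from the ticket: the connection to the Guard only pushes 10mbit/s.
GUARD_THROUGHPUT = 10_000_000 / 8  # 10 Mbit/s expressed in bytes per second

# Hypothetical: 5000 bytes are still queued in the socket buffer.
next_run_ms = estimate_reschedule_ms(5000, GUARD_THROUGHPUT)
```

With these illustrative inputs the buffer would drain in about 4ms, so KIST would be rescheduled then instead of waiting the full 10ms. The clamp keeps the interval from collapsing to zero on an empty buffer or growing unbounded on a slow connection.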