We recently found that KIST performs very poorly when tor has very few sockets.
How KIST operates
KIST is scheduled when cells are put on a circuit queue. A scheduler run might not handle all cells, because how much it can write depends on the available space in the TCP buffer of each socket. What KIST does at the moment is reschedule itself in 10ms (a static value).
The problem is that if there are very few sockets (as on most tor clients), KIST can finish with its one socket very quickly, say in 1ms, and then it sleeps for the remaining 9ms until it is rescheduled.
During those 9ms, tor is not pushing bytes onto the wire even though it could. See the attached graph made by pastly: it shows how badly KIST underperforms with the current 10ms interval.
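To make the cap concrete with made-up numbers: suppose one run can queue at most 256 KiB into the socket's TCP buffer, and the link drains that in 2ms. The socket then sits idle for 8ms, so the effective rate is 256 KiB every 10ms instead of 256 KiB every 2ms, a 5x cut.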
Consequences
(There might be more; don't treat this as an exhaustive list.)
Clients are effectively capped in bandwidth, because in general they talk to their Guard over a single socket.
A new relay joining the network has no connections yet, so when a directory authority or our bandwidth scanners measure it, they will only see a capped value compared to what the relay could actually do (if higher). The measurement recovers after a while, once the relay starts seeing traffic and its number of sockets ramps up.
Solution
As you can see on the attached graph, bringing the scheduler interval down to 2ms gives us better performance than Vanilla. That could be a short-term solution.
A better, more medium-term solution would be to make the scheduling interval dynamic, depending on how fast tor thinks the TCP buffer on the socket will empty. That basically depends on the connection throughput. For example, a 100mbit NIC towards a Guard might only push through 10mbit, so tor would need a way to learn that per connection, which would let KIST estimate when it needs to be rescheduled for that connection.
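A minimal sketch of one way tor could learn such a per-connection drain rate, assuming a Linux socket (SIOCOUTQNSD is Linux-only, which is fine since KIST only runs on Linux/BSD anyway); every name here is hypothetical, not tor's actual scheduler code:

/* Hypothetical sketch: estimate how fast the kernel drains a socket's
 * send buffer by sampling the not-yet-sent byte count over time. */
#include <sys/ioctl.h>
#include <linux/sockios.h> /* SIOCOUTQNSD: bytes not yet sent to the wire */
#include <stdint.h>
#include <time.h>

typedef struct drain_estimate_t {
  uint64_t last_notsent;   /* not-yet-sent bytes at the last sample */
  uint64_t last_sample_ns; /* monotonic time of the last sample */
  double bytes_per_msec;   /* smoothed drain-rate estimate */
} drain_estimate_t;

static uint64_t
monotime_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Sample the socket, update the smoothed drain rate, and return the
 * estimated milliseconds until the current backlog is empty. */
static double
drain_estimate_update(drain_estimate_t *est, int fd)
{
  int notsent = 0;
  if (ioctl(fd, SIOCOUTQNSD, &notsent) < 0)
    return 10.0; /* on error, fall back to the static interval */
  uint64_t now = monotime_ns();
  double elapsed_ms = (now - est->last_sample_ns) / 1e6;
  if (elapsed_ms > 0 && (uint64_t)notsent < est->last_notsent) {
    double drained = (double)(est->last_notsent - (uint64_t)notsent);
    /* EWMA so one quiet interval doesn't wreck the estimate. Note this
     * only sees the net decrease; a real version would also account
     * for bytes tor wrote between the two samples. */
    est->bytes_per_msec = 0.8 * est->bytes_per_msec
                        + 0.2 * (drained / elapsed_ms);
  }
  est->last_notsent = (uint64_t)notsent;
  est->last_sample_ns = now;
  if (est->bytes_per_msec <= 0.0)
    return 10.0; /* no estimate yet: keep the static interval */
  return (double)notsent / est->bytes_per_msec;
}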
KIST tries not to overload the kernel when there are many sockets, and so it only runs every 10 msec. On high performance relays with lots of sockets, this is a good thing.
But on a relay with only 1 active socket, it is possible that you fill the kernel socket buffer, and the NIC only takes e.g. 2 msec to send it, but then KIST doesn't run again for another 8 msec. So the NIC is sitting there idle for those 8 msec.
This is only an issue when the sum of bytes in all kernel-level socket buffers is less than the number of bytes the NIC could send in 10 msec. This can happen when there are only a few sockets, or on a really really fast NIC.
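Stated as an inequality (just formalizing the sentence above), the NIC can go idle during an interval exactly when:
sum_over_sockets(kernel_buffered_bytes) < nic_bytes_per_msec * 10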
= What to do:
== Clients:
KIST was designed for relays. Clients don't need to prioritize traffic the same way relays do, so they don't really need KIST. Clients can simply run the vanilla scheduler so that they read/write ASAP (rather than deferring I/O like KIST does). Or clients can run KIST with a 1 msec scheduling frequency.
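As a concrete torrc illustration of both options (Schedulers and KISTSchedRunInterval are real tor options; treat the values as this discussion's suggestions, not tested recommendations):

# Option A: a client skips KIST's deferred writes entirely
Schedulers Vanilla

# Option B: keep KIST but run it every millisecond
Schedulers KIST
KISTSchedRunInterval 1 msec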
== Relays:
For relays, we could guess how long it would take the kernel to send out all of the notsent bytes sitting in kernel buffers plus all outbuf bytes sitting in Tor outbufs. If the time we guess is less than 10 msec, then we could run KIST sooner. This guess would probably involve knowing or estimating the NIC speed.
Slightly more formally: if we know we can write b bytes per KIST scheduling run, i.e. b bytes per m milliseconds, but we only actually wrote w bytes where w < b, then we know we should call back sooner than m milliseconds. We can dynamically compute the time it will take the kernel to send those w bytes as:
new_callback_time = m * (w/b)
Then we check back again in new_callback_time milliseconds (the time when the kernel will be empty) instead of m milliseconds.
Then also, KIST will never let itself write more than b bytes across all sockets, because it knows that its network card can't possibly write more than b bytes.
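A minimal sketch of that computation (hypothetical names, not tor's actual scheduler code), with a made-up 1 msec floor so a tiny write doesn't turn the scheduler into a busy loop:

/* Hypothetical: compute the next KIST callback time from the formula
 * above. m is the normal interval in msec, b the byte budget per run,
 * w the bytes actually written this run. */
static double
kist_next_callback_msec(double m, double b, double w)
{
  if (b <= 0.0 || w >= b)
    return m;               /* no budget estimate, or wrote a full budget */
  double t = m * (w / b);   /* time for the kernel to drain the w bytes */
  return t < 1.0 ? 1.0 : t; /* made-up floor: don't busy-loop */
}

For example, with m = 10 msec, b = 1 MiB and w = 256 KiB, this gives 10 * (256/1024) = 2.5 msec, so KIST would check back in 2.5 msec instead of 10.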
== Issues:
I don't know how each relay can reliably compute the value of b. Maybe we start with the "observed bandwidth" as an estimate? But then we need to allow b to grow in case the relay suddenly got faster, or for new relays?
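One possible shape for that, purely a sketch with a made-up growth rule: seed b from the observed bandwidth, then grow it whenever the relay actually writes its whole budget, so a relay that suddenly got faster (or a brand new one) can discover more capacity:

/* Hypothetical estimator for b, the per-run byte budget. Seed it from
 * the relay's observed bandwidth, then grow it multiplicatively each
 * time we saturate it, so the estimate isn't stuck at a stale value. */
static uint64_t
update_byte_budget(uint64_t b, uint64_t bytes_written_this_run)
{
  if (bytes_written_this_run >= b)
    b += b / 4; /* hit the cap: maybe the NIC can do more, grow by 25% */
  /* No shrink rule needed here: runs that write less than b are already
   * handled by scheduling the next run sooner, as above. */
  return b;
}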
Fortunately, right now it is easy for Tor to know whether it is running as a relay or not. The easy solution is to adjust KISTSchedRunInterval to 2 msec for clients only (initial testing at 1 msec apparently locks tor up; that needs to be investigated).
What worries me here is onion services. They can have a lot of circuits to many rendezvous points, so there is a clear need for circuit priority and for not overloading the Guard link, which is exactly what KIST helps with. But then, we don't have a way to measure the NIC throughput used by clients/HS :S ...
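A sketch of that easy solution, using tor's existing server_mode() helper to decide the role (the surrounding function name and constants are just this ticket's suggestions):

/* Hypothetical: choose the KIST run interval by role. */
static int32_t
kist_scheduler_run_interval_msec(const or_options_t *options)
{
  if (server_mode(options))
    return 10; /* relays: many sockets, keep the original interval */
  return 2;    /* clients: 2 msec (1 msec apparently locks tor up) */
}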
For relays, I think the observed bandwidth from the consensus could be a good start until we have a reliable way for Tor to measure its throughput regularly.
I have more confidence in the following graph as a useful piece of evidence.
This is a tiny 10 relay network run on localhost on my desktop Debian 9 computer. There is one Tor client with one curl downloading a single large file of all zeros from nginx also running on localhost. The client builds normal three hop circuits to this webserver, always choosing the target relay as the first hop. All relays and the client have the same scheduler, and in the case of KIST, the same run interval too. Everyone is running Debian's Tor 0.3.5.7 unmodified.
Here is the configuration of a relay in the network:
$ cat torrc-common
ShutdownWaitLength 2
ExitRelay 1
IPv6Exit 1
ExitPolicy accept *:*
CookieAuthentication 1
ContactInfo pastly@torproject.org
LogTimeGranularity 1
SafeLogging 0
DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F
TestingTorNetwork 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
# param and waiting for it to be disseminated
# to all
Schedulers Vanilla

$ cat relay1/torrc
%include torrc-common
DataDirectory relay1
PidFile relay1/tor.pid
#Log notice file relay1/notice.log
Address 127.100.1.1
SocksPort 127.100.1.1:10112
ControlPort 127.100.1.1:10113
ControlSocket /redacted/path/to/relay1/control_socket
ORPort 127.100.1.1:10114
DirPort 127.100.1.1:10115
Nickname relay1
CacheDirectory /tmp/relay1
Here is the configuration of the client in this network:
$ cat torrc-common
DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F
TestingTorNetwork 1
NumCPUs 1
LogTimeGranularity 1
SafeLogging 0
ShutdownWaitLength 2
CookieAuthentication 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
# param and waiting for it to be disseminated
# to all
Schedulers Vanilla

$ cat client10301/torrc
%include torrc-common
DataDirectory client10301
PidFile client10301/tor.pid
#Log notice file client10301/notice.log
SocksPort 127.0.0.1:10301
ControlSocket /redacted/path/to/client10301/control_socket
CacheDirectory /tmp/client10301
EntryNodes relay1
Please use the following graph for insight instead of the previously shared perf-10ms.png. It is much closer to the real world (unmodified Tor binary, 3-hop circuits, etc.).
We do not. The choice of interval comes from the torrc and the consensus, so it shouldn't matter which platform?
My concern is that different platforms sometimes handle small timers very differently.
True. For multiple platforms, I have to say no :S.
Timers at the msec level should, I assume, be fine on most of our targeted platforms. Also, this specific interval is only used on Linux and BSD; Windows uses the Vanilla scheduler, which doesn't have this problem.
Should there be separate consensus parameters for client and server?
Yes, this is a good idea actually. But does that mean 043, considering we are past our feature freeze?
Setting myself as reviewer per discussion at meeting.
One more question: is this something we want to think about potentially backporting? If not, should it wait for 043 when we can treat it as a feature and add a new consensus parameter?
Discussed with nickm on IRC. There is a good argument about preventing the partitioning of clients/HS into two buckets of "performance".
Because of this, we'll defer this to 043, add two consensus parameters (client and relay scheduling intervals), and backport it to 035. We'll have a good chunk of the 043 cycle to make sure it works properly.
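A sketch of how those two parameters might be wired up, using tor's existing networkstatus_get_param() helper; the parameter names, defaults, and bounds are illustrative, not the final ones:

/* Hypothetical: fetch the role-specific run interval from the
 * consensus, with a sane default and clamping. */
static int32_t
kist_run_interval_from_consensus(const or_options_t *options)
{
  const char *param = server_mode(options) ?
    "KISTSchedRunIntervalRelay" : "KISTSchedRunIntervalClient";
  int32_t dflt = server_mode(options) ? 10 : 2; /* msec */
  /* NULL means "use the current consensus"; clamp to [1, 100] msec. */
  return networkstatus_get_param(NULL, param, dflt, 1, 100);
}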
Trac: Status: needs_review → needs_revision; Keywords: N/A deleted; 041-backport, 035-backport, 040-backport, 042-backport added; Milestone: Tor: 0.4.2.x-final → Tor: 0.4.3.x-final
It's quite unclear when we can do this. It would just be a bandaid on a much broader problem, namely the overall "grace period" that KIST has. There are ways we can get rid of that, and we should probably spend time on that work rather than adding more bandaids to KIST...
With the consensus parameter change, can we close this ticket and make another one for making the scheduler interval not a fixed amount of time? Or should we leave this open? It seems unlikely we will move on changing the scheduler until we have congestion control and network utilization is high enough for EWMA to kick in. That can be explored in future KIST+EWMA experiments.
(Time spent here is just mine. dgoulet and others probably spent more time)