We recently found that KIST performs very poorly when tor has very few sockets.
How KIST operates
KIST is scheduled when cells are put on a circuit queue. A scheduler run might not handle all cells, because how much it can write depends on the available space in the TCP buffer of each socket. What KIST does at the moment is reschedule itself in 10ms (a static value).
The problem is that if there are very few sockets (as on most tor clients), KIST can finish with its one socket very quickly, say in 1ms, and then it sleeps for the remaining 9ms until it is rescheduled.
During those 9ms, tor is not pushing bytes onto the wire even though it could. See the attached graph made by pastly: it shows how badly KIST underperforms with the current 10ms interval.
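To make the cap concrete with made-up numbers: suppose one run can queue at most 256 KiB into the socket's TCP buffer, and the link drains that in 2ms. The socket then sits idle for 8ms, so the effective rate is 256 KiB every 10ms instead of 256 KiB every 2ms, a 5x cut.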
Consequences
(There might be more; don't treat this as an exhaustive list.)
Clients are effectively capped in bandwidth, because in general they talk to their Guard over a single socket.
A new relay joining the network has no connections yet, so when a directory authority or our bandwidth scanners measure it, they will only see a capped value compared to what the relay could actually do (if higher). The measurement recovers after a while, once the relay starts seeing traffic and its number of sockets ramps up.
Solution
As you can see on the attached graph, bringing the scheduler interval down to 2ms gives us better performance than Vanilla. That could be a short-term solution.
A better, more medium-term solution would be to make the scheduling interval dynamic, depending on how fast tor thinks the TCP buffer on the socket will empty. That basically depends on the connection throughput. For example, a 100mbit NIC towards a Guard might only push through 10mbit, so tor would need a way to learn that per connection, which would let KIST estimate when it needs to be rescheduled for that connection.
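A minimal sketch of one way tor could learn such a per-connection drain rate, assuming a Linux socket (SIOCOUTQNSD is Linux-only, which is fine since KIST only runs on Linux/BSD anyway); every name here is hypothetical, not tor's actual scheduler code:

/* Hypothetical sketch: estimate how fast the kernel drains a socket's
 * send buffer by sampling the not-yet-sent byte count over time. */
#include <sys/ioctl.h>
#include <linux/sockios.h> /* SIOCOUTQNSD: bytes not yet sent to the wire */
#include <stdint.h>
#include <time.h>

typedef struct drain_estimate_t {
  uint64_t last_notsent;   /* not-yet-sent bytes at the last sample */
  uint64_t last_sample_ns; /* monotonic time of the last sample */
  double bytes_per_msec;   /* smoothed drain-rate estimate */
} drain_estimate_t;

static uint64_t
monotime_ns(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Sample the socket, update the smoothed drain rate, and return the
 * estimated milliseconds until the current backlog is empty. */
static double
drain_estimate_update(drain_estimate_t *est, int fd)
{
  int notsent = 0;
  if (ioctl(fd, SIOCOUTQNSD, &notsent) < 0)
    return 10.0; /* on error, fall back to the static interval */
  uint64_t now = monotime_ns();
  double elapsed_ms = (now - est->last_sample_ns) / 1e6;
  if (elapsed_ms > 0 && (uint64_t)notsent < est->last_notsent) {
    double drained = (double)(est->last_notsent - (uint64_t)notsent);
    /* EWMA so one quiet interval doesn't wreck the estimate. Note this
     * only sees the net decrease; a real version would also account
     * for bytes tor wrote between the two samples. */
    est->bytes_per_msec = 0.8 * est->bytes_per_msec
                        + 0.2 * (drained / elapsed_ms);
  }
  est->last_notsent = (uint64_t)notsent;
  est->last_sample_ns = now;
  if (est->bytes_per_msec <= 0.0)
    return 10.0; /* no estimate yet: keep the static interval */
  return (double)notsent / est->bytes_per_msec;
}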
KIST tries not to overload the kernel when there are many sockets, and so it only runs every 10 msec. On high performance relays with lots of sockets, this is a good thing.
But on a relay with only 1 active socket, it is possible that you fill the kernel socket buffer, and the NIC only takes e.g. 2 msec to send it, but then KIST doesn't run again for another 8 msec. So the NIC is sitting there idle for those 8 msec.
This is only an issue when the sum of bytes in all kernel-level socket buffers is less than the number of bytes the NIC could send in 10 msec. This can happen when there are only a few sockets, or on a really really fast NIC.
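Stated as an inequality (just formalizing the sentence above), the NIC can go idle during an interval exactly when:
sum_over_sockets(kernel_buffered_bytes) < nic_bytes_per_msec * 10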
= What to do:
== Clients:
KIST was designed for relays. Clients don't need to prioritize traffic the same way relays do, so they don't really need KIST. Clients can simply run the vanilla scheduler so that they read/write ASAP (rather than deferring I/O like KIST does). Or clients can run KIST with a 1 msec scheduling frequency.
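As a concrete torrc illustration of both options (Schedulers and KISTSchedRunInterval are real tor options; treat the values as this discussion's suggestions, not tested recommendations):

# Option A: a client skips KIST's deferred writes entirely
Schedulers Vanilla

# Option B: keep KIST but run it every millisecond
Schedulers KIST
KISTSchedRunInterval 1 msec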
== Relays:
For relays, we could guess how long it would take the kernel to send out all of the notsent bytes sitting in kernel buffers plus all outbuf bytes sitting in Tor outbufs. If the time we guess is less than 10 msec, then we could run KIST sooner. This guess would probably involve knowing or estimating the NIC speed.
Slightly more formally: if we know we can write b bytes per KIST scheduling run, i.e. b bytes per m milliseconds, but we only actually wrote w bytes where w < b, then we know we should call back sooner than m milliseconds. We can dynamically compute the time it will take the kernel to send those w bytes as:
new_callback_time = m * (w/b)
Then we check back again in new_callback_time milliseconds (the time when the kernel will be empty) instead of m milliseconds.
Then also, KIST will never let itself write more than b bytes across all sockets, because it knows that its network card can't possibly write more than b bytes.
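A minimal sketch of that computation (hypothetical names, not tor's actual scheduler code), with a made-up 1 msec floor so a tiny write doesn't turn the scheduler into a busy loop:

/* Hypothetical: compute the next KIST callback time from the formula
 * above. m is the normal interval in msec, b the byte budget per run,
 * w the bytes actually written this run. */
static double
kist_next_callback_msec(double m, double b, double w)
{
  if (b <= 0.0 || w >= b)
    return m;               /* no budget estimate, or wrote a full budget */
  double t = m * (w / b);   /* time for the kernel to drain the w bytes */
  return t < 1.0 ? 1.0 : t; /* made-up floor: don't busy-loop */
}

For example, with m = 10 msec, b = 1 MiB and w = 256 KiB, this gives 10 * (256/1024) = 2.5 msec, so KIST would check back in 2.5 msec instead of 10.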
== Issues:
I don't know how each relay can reliably compute the value of b. Maybe we start with the "observed bandwidth" as an estimate? But then we need to allow b to grow in case the relay suddenly got faster, or for new relays?
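One possible shape for that, purely a sketch with a made-up growth rule: seed b from the observed bandwidth, then grow it whenever the relay actually writes its whole budget, so a relay that suddenly got faster (or a brand new one) can discover more capacity:

/* Hypothetical estimator for b, the per-run byte budget. Seed it from
 * the relay's observed bandwidth, then grow it multiplicatively each
 * time we saturate it, so the estimate isn't stuck at a stale value. */
static uint64_t
update_byte_budget(uint64_t b, uint64_t bytes_written_this_run)
{
  if (bytes_written_this_run >= b)
    b += b / 4; /* hit the cap: maybe the NIC can do more, grow by 25% */
  /* No shrink rule needed here: runs that write less than b are already
   * handled by scheduling the next run sooner, as above. */
  return b;
}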
Fortunately, right now it is easy for Tor to know whether it is running as a relay or not. The easy solution is to adjust KISTSchedRunInterval to 2 msec for clients only (initial testing at 1 msec apparently locks tor up; that needs to be investigated).
What worries me here is onion services. They can have a lot of circuits to many rendezvous points, so there is a clear need for circuit priority and for not overloading the Guard link, which is exactly what KIST helps with. But then, we don't have a way to measure the NIC throughput used by clients/HS :S ...
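A sketch of that easy solution, using tor's existing server_mode() helper to decide the role (the surrounding function name and constants are just this ticket's suggestions):

/* Hypothetical: choose the KIST run interval by role. */
static int32_t
kist_scheduler_run_interval_msec(const or_options_t *options)
{
  if (server_mode(options))
    return 10; /* relays: many sockets, keep the original interval */
  return 2;    /* clients: 2 msec (1 msec apparently locks tor up) */
}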
For relays, I think the observed bandwidth from the consensus could be a good start until we have a reliable way for Tor to measure its throughput regularly.
I have more confidence in the following graph as a useful piece of evidence.
This is a tiny 10 relay network run on localhost on my desktop Debian 9 computer. There is one Tor client with one curl downloading a single large file of all zeros from nginx also running on localhost. The client builds normal three hop circuits to this webserver, always choosing the target relay as the first hop. All relays and the client have the same scheduler, and in the case of KIST, the same run interval too. Everyone is running Debian's Tor 0.3.5.7 unmodified.
Here is the configuration of a relay in the network:
$ cat torrc-common
ShutdownWaitLength 2
ExitRelay 1
IPv6Exit 1
ExitPolicy accept *:*
CookieAuthentication 1
ContactInfo pastly@torproject.org
LogTimeGranularity 1
SafeLogging 0
DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F
TestingTorNetwork 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
# param and waiting for it to be disseminated
# to all
Schedulers Vanilla

$ cat relay1/torrc
%include torrc-common
DataDirectory relay1
PidFile relay1/tor.pid
#Log notice file relay1/notice.log
Address 127.100.1.1
SocksPort 127.100.1.1:10112
ControlPort 127.100.1.1:10113
ControlSocket /redacted/path/to/relay1/control_socket
ORPort 127.100.1.1:10114
DirPort 127.100.1.1:10115
Nickname relay1
CacheDirectory /tmp/relay1
Here is the configuration of the client in this network:
$ cat torrc-common
DirAuthority auth1 orport=10102 no-v2 v3ident=13572CEF296468E344506CAE402BDE55A28C21CD 127.100.1.1:10103 04C4B152E7EE3960B947BDE96823728132BE2A06
DirAuthority auth2 orport=10106 no-v2 v3ident=47188F93370723370B6C1F441C9131F68F65F54C 127.100.1.1:10107 A182371ABFBDE825B359AD005EEA795F27F91C81
DirAuthority auth3 orport=10110 no-v2 v3ident=CA8134FE7E018D48C4821E3C3233DE5A6C68C823 127.100.1.1:10111 71A9A9E880118B4BCA5B5A4303BF8C0534F92D2F
TestingTorNetwork 1
NumCPUs 1
LogTimeGranularity 1
SafeLogging 0
ShutdownWaitLength 2
CookieAuthentication 1
# change between kist and vanilla here
# change KISTSchedRunInterval with consensus
# param and waiting for it to be disseminated
# to all
Schedulers Vanilla

$ cat client10301/torrc
%include torrc-common
DataDirectory client10301
PidFile client10301/tor.pid
#Log notice file client10301/notice.log
SocksPort 127.0.0.1:10301
ControlSocket /redacted/path/to/client10301/control_socket
CacheDirectory /tmp/client10301
EntryNodes relay1
Please use the following graph for insight instead of the previously shared perf-10ms.png. It is much closer to the real world (unmodified Tor binary, 3-hop circuits, etc.).
We do not. The choice of interval comes from the torrc and the consensus, so it shouldn't matter which platform?
My concern is that different platforms sometimes handle small timers very differently.
True. For multiple platforms, I have to say no :S.
Timers at the msec level should, I assume, be fine on most of our targeted platforms. Also, this specific interval is only used on Linux and BSD; Windows uses the Vanilla scheduler, which doesn't have this problem.
Should there be separate consensus parameters for client and server?
Yes, this is a good idea actually. But does that mean 043, considering we are past our feature freeze?
Setting myself as reviewer per discussion at meeting.
One more question: is this something we want to think about potentially backporting? If not, should it wait for 043 when we can treat it as a feature and add a new consensus parameter?
Discussed with nickm on IRC. There is a good argument about preventing the partitioning of clients/HS into two buckets of "performance".
Because of this, we'll defer this to 043, add two consensus parameters (client and relay scheduling intervals), and backport it to 035. We'll have a good chunk of the 043 cycle to make sure it works properly.
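A sketch of how those two parameters might be wired up, using tor's existing networkstatus_get_param() helper; the parameter names, defaults, and bounds are illustrative, not the final ones:

/* Hypothetical: fetch the role-specific run interval from the
 * consensus, with a sane default and clamping. */
static int32_t
kist_run_interval_from_consensus(const or_options_t *options)
{
  const char *param = server_mode(options) ?
    "KISTSchedRunIntervalRelay" : "KISTSchedRunIntervalClient";
  int32_t dflt = server_mode(options) ? 10 : 2; /* msec */
  /* NULL means "use the current consensus"; clamp to [1, 100] msec. */
  return networkstatus_get_param(NULL, param, dflt, 1, 100);
}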
Trac: Status: needs_review → needs_revision; Keywords: N/A deleted; 041-backport, 035-backport, 040-backport, 042-backport added; Milestone: Tor: 0.4.2.x-final → Tor: 0.4.3.x-final
It's quite unclear when we can do this. It would just be a bandaid on a much broader problem, namely the overall "grace period" that KIST has. There are ways we can get rid of that, and we should probably spend time on that work rather than adding more bandaids to KIST...
With the consensus parameter change, can we close this ticket and make another one for making the scheduler interval not a fixed amount of time? Or should we leave this open? It seems unlikely we will move on changing the scheduler until we have congestion control and network utilization is high enough for EWMA to kick in. That can be explored in future KIST+EWMA experiments.
(Time spent here is just mine. dgoulet and others probably spent more time)