Design and implement optimizations for socket write limits
KIST has two components: global scheduling (legacy/trac#9262) and socket write limits. This ticket tracks discussion of both the design and the implementation of socket write limits.
The goal of the write limit is to never hand the kernel data that it would not send to the network anyway because of throttling at the TCP layer. Rob's USENIX Security paper computed the write limit for each socket as:
    sock_space = sock_buf_size - sock_buf_len
    tcp_space = (snd_cwnd - snd_unacked) * mss
    sock_write_limit = min(sock_space, tcp_space)
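As a rough illustration, the per-socket computation above could be written as a small pure function. The function name and the clamp at zero are assumptions made for this sketch, not something taken from the paper or from Tor.

```python
def socket_write_limit(sock_buf_size, sock_buf_len, snd_cwnd, snd_unacked, mss):
    """Return how many bytes are worth writing to one socket this round."""
    # Free space left in the kernel's socket send buffer, in bytes.
    sock_space = sock_buf_size - sock_buf_len
    # Room left in the congestion window, converted from packets to bytes.
    tcp_space = (snd_cwnd - snd_unacked) * mss
    # Assumption: clamp at zero so a full buffer or a full window never
    # produces a negative limit.
    return max(0, min(sock_space, tcp_space))
```

For example, with a 65536-byte buffer holding 16384 bytes, snd_cwnd = 10, snd_unacked = 4 and mss = 1448, the congestion window is the tighter constraint and the limit is 6 * 1448 = 8688 bytes.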
A global write limit across all sockets is then computed for each scheduling round from the relay's upstream bandwidth and the configured write callback interval. Writing in a given round ends when either the global limit is reached or every socket has reached its own limit.
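A minimal sketch of one scheduling round, under stated assumptions: the global limit is taken to be simply bandwidth times interval (the exact formula is not given here), and sockets are visited in dictionary order, whereas the real ordering would come from the global scheduling component. All names are hypothetical.

```python
def global_write_limit(upstream_bytes_per_sec, interval_ms):
    """Assumed formula: bytes the relay can push upstream in one interval."""
    return upstream_bytes_per_sec * interval_ms // 1000


def plan_round(pending, socket_limits, global_limit):
    """Decide how many bytes to write to each socket in one round.

    pending and socket_limits map a socket id to a byte count.
    """
    remaining = global_limit
    plan = {}
    for sock_id, queued in pending.items():
        if remaining <= 0:
            break  # the global limit ends the round early
        allowed = min(queued, socket_limits.get(sock_id, 0), remaining)
        if allowed > 0:
            plan[sock_id] = allowed
            remaining -= allowed
    # Falling out of the loop means every socket hit its own limit first.
    return plan
```

With a 10 Mbit/s uplink (1,250,000 bytes per second) and a 10 ms interval, the global limit under this assumption is 12,500 bytes per round.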
The TCP information can be collected with a getsockopt call, but doing this for every socket in every write round (callback interval) can get expensive. A kernel hacker, Patrick McHardy, suggested using the "netlink socket diag" interface (examples here and here) to collect information for multiple sockets at once instead of making a separate system call for each.
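For reference, the per-socket getsockopt approach might look roughly like the sketch below. It is Linux-only and the byte offsets are an assumption based on the layout of struct tcp_info in linux/tcp.h (eight one-byte fields followed by 32-bit counters), so they should be checked against the target kernel headers. The batched netlink alternative is not shown.

```python
import socket
import struct

# Assumed byte offsets into Linux's struct tcp_info (verify against linux/tcp.h).
_OFF_SND_MSS = 16    # tcpi_snd_mss
_OFF_UNACKED = 24    # tcpi_unacked
_OFF_SND_CWND = 80   # tcpi_snd_cwnd
_MIN_LEN = 84        # must reach the end of tcpi_snd_cwnd


def parse_tcp_info(buf):
    """Extract the three fields the write limit needs from raw tcp_info bytes."""
    if len(buf) < _MIN_LEN:
        raise ValueError("tcp_info buffer too short")
    (mss,) = struct.unpack_from("=I", buf, _OFF_SND_MSS)
    (unacked,) = struct.unpack_from("=I", buf, _OFF_UNACKED)
    (cwnd,) = struct.unpack_from("=I", buf, _OFF_SND_CWND)
    return {"mss": mss, "snd_unacked": unacked, "snd_cwnd": cwnd}


def query_tcp_info(sock):
    """One system call per socket: the cost this ticket wants to avoid."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    return parse_tcp_info(raw)
```

Calling query_tcp_info once per socket per round is the expensive pattern described above; a batched query would replace that function while leaving the parsing idea unchanged.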
Note that the socket write limit need not actually be computed, because the kernel will return EAGAIN anyway once the socket buffer is full. Along these lines, Bryan Ford suggested setting the socket buffer size to the amount Tor thinks it should send plus a little extra (e.g., tcp_space*1.25), and letting the kernel push back automatically instead of computing a new write limit for every socket in every write round. Tor can then keep writing as much as it can until the kernel pushes back. In this case, we need to ensure that TCP auto-tuning is disabled, as otherwise it may undo our settings by adjusting the socket buffer sizes underneath us.
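A hedged sketch of the buffer-sizing idea: size the send buffer from tcp_space with some headroom, then write to a non-blocking socket until the kernel reports EAGAIN (surfaced in Python as BlockingIOError). The helper names are hypothetical. Note that Linux doubles the value passed to SO_SNDBUF for bookkeeping overhead, and explicitly setting it is generally understood to stop send-buffer auto-tuning for that socket, which should be verified for the kernels in question.

```python
import socket


def target_sndbuf(tcp_space, headroom=1.25):
    """Buffer size suggested above: tcp_space plus a little extra."""
    return int(tcp_space * headroom)


def apply_sndbuf(sock, tcp_space, headroom=1.25):
    """Request a send buffer of the target size; the kernel may adjust it."""
    size = target_sndbuf(tcp_space, headroom)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, size)
    return size


def write_until_pushback(sock, data):
    """Write to a non-blocking socket until done or the kernel pushes back."""
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        try:
            sent += sock.send(view[sent:])
        except BlockingIOError:
            break  # EAGAIN: the socket buffer is full, stop for this round
    return sent
```

The caller would keep whatever was not sent queued for the next write round.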
I think we need two intervals: for example, try to write every 10 milliseconds, and update snd_cwnd, the write limits, and the socket buffer sizes every 100 milliseconds.
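The two-interval idea can be sketched with a simulated clock. The function and callback names are hypothetical, and a real implementation would hang off Tor's event loop timers rather than a loop like this.

```python
def run_two_interval_schedule(duration_ms, on_write, on_refresh,
                              write_interval_ms=10, refresh_interval_ms=100):
    """Simulate write rounds with a slower refresh of TCP state.

    on_write(now_ms) runs every write interval; on_refresh(now_ms) runs
    whenever the refresh interval has elapsed, just before that round's write.
    """
    now = 0
    next_refresh = 0
    while now < duration_ms:
        if now >= next_refresh:
            # Slow path: re-read snd_cwnd, recompute limits, resize buffers.
            on_refresh(now)
            next_refresh += refresh_interval_ms
        # Fast path: attempt writes using the most recent limits.
        on_write(now)
        now += write_interval_ms
```

Over one simulated second with the default intervals, this runs 100 write rounds and 10 refreshes, so the expensive TCP queries happen on one round in ten.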