Increase kcp and smux window and buffer sizes
This should increase the maximum amount of in-flight data and hopefully improve the performance of Snowflake, especially for clients that are geographically distant from proxies and the server.
See https://lists.torproject.org/pipermail/anti-censorship-team/2021-July/000185.html for a justification of this change.
requested review from @dcf
It also occurs to me that the kcp.MaxWindowSize has to be at least as big as the `MaxStreamBuffer` size to notice any improvements; otherwise that will be the limiting factor on the amount of in-flight data. Right now this is set to 64 KB for both the client and the server.
I believe `UDPSession.SetWindowSize` is in units of packets, not bytes. The current values of 65535 should be plenty. Does it still work with the current values? The kcptun defaults are 128 and 512.

In the project I posted about (an AMP cache tunnel), I'm using `SetWindowSize(1024, 1024)`. Actually, I have measurements testing that change separately from increasing the smux buffers:

| direct | thru cache | condition (cumulative) |
|--------|------------|------------------------|
| 329.0  | 260.5      | baseline |
| 477.4  | 416.9      | `SetWindowSize(32, 32)` → `(1024, 1024)` |
| 1388.3 | 1405.8     | `MaxStreamBuffer = 64*1024` → `1*1024*1024`, `MaxReceiveBuffer = 4*1024*1024` → `16*1024*1024` (server only) |

Ah, I think you're right. Hmm, I wonder why I wasn't getting any noticeable difference in download speed by changing the `MaxStreamBuffer` then (because, if I understand correctly, changing it at the client should be enough to affect downloads as opposed to uploads). I'll take a closer look at this again.
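For concreteness, here is a minimal sketch of where these two knobs live when layering smux over a `kcp.UDPSession`. It assumes the `xtaci/kcp-go` v5 and `xtaci/smux` APIs and is not the literal Snowflake client code; the values are the ones discussed in this thread, not necessarily what ends up committed. `SetWindowSize` counts packets, while `MaxReceiveBuffer` and `MaxStreamBuffer` are byte counts.

```go
package example

import (
	"net"

	"github.com/xtaci/kcp-go/v5"
	"github.com/xtaci/smux"
)

// dialSession is an illustrative helper: open a KCP session over an existing
// net.PacketConn, then run smux on top of it.
func dialSession(pconn net.PacketConn, addr net.Addr) (*smux.Session, error) {
	conn, err := kcp.NewConn2(addr, nil, 0, 0, pconn)
	if err != nil {
		return nil, err
	}
	conn.SetStreamMode(true)
	// Send and receive windows, measured in packets, not bytes.
	conn.SetWindowSize(1024, 1024)

	smuxConfig := smux.DefaultConfig()
	smuxConfig.Version = 2
	smuxConfig.MaxReceiveBuffer = 4 * 1024 * 1024 // per-session receive buffer, in bytes
	smuxConfig.MaxStreamBuffer = 1 * 1024 * 1024  // per-stream receive buffer, in bytes
	return smux.Client(conn, smuxConfig)
}
```

With a typical MTU of roughly 1200–1500 bytes, a window of 1024 packets corresponds to over 1 MB of in-flight data, so with these example values the smux stream buffer, not the KCP window, is the binding limit.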
In experimenting with increasing the buffer sizes in dnstt, I found that another parameter that matters for tuning is the turbotunnel `queueSize`, which is the size of internal buffering for `QueuePacketConn`, currently 32. If it's too small on the sending side, packets get dropped inside `QueuePacketConn` and have to be retransmitted later. You can check whether drops are happening with the following patch. For me, there were drops in sending on the server side.

```diff
diff --git a/common/turbotunnel/queuepacketconn.go b/common/turbotunnel/queuepacketconn.go
index 14a9833..ede3433 100644
--- a/common/turbotunnel/queuepacketconn.go
+++ b/common/turbotunnel/queuepacketconn.go
@@ -1,6 +1,7 @@
 package turbotunnel
 
 import (
+	"log"
 	"net"
 	"sync"
 	"sync/atomic"
@@ -57,6 +58,7 @@ func (c *QueuePacketConn) QueueIncoming(p []byte, addr net.Addr) {
 	select {
 	case c.recvQueue <- taggedPacket{buf, addr}:
 	default:
+		log.Printf("drop incoming %d", len(buf))
 		// Drop the incoming packet if the receive queue is full.
 	}
 }
@@ -99,6 +101,7 @@ func (c *QueuePacketConn) WriteTo(p []byte, addr net.Addr) (int, error) {
 		return len(buf), nil
 	default:
 		// Drop the outgoing packet if the send queue is full.
+		log.Printf("drop outgoing %d", len(buf))
 		return len(buf), nil
 	}
 }
```
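To make the role of `queueSize` concrete, here is a small self-contained sketch of the same bounded-channel pattern, with illustrative names rather than the actual turbotunnel source: packets are handed off with a non-blocking send into a fixed-capacity channel, so once the queue is full, further packets are silently dropped and KCP has to retransmit them later.

```go
package main

import (
	"fmt"
	"net"
)

// queueSize bounds how many packets can sit in the queue at once; turbotunnel
// currently uses 32, which this thread proposes to raise.
const queueSize = 32

type taggedPacket struct {
	buf  []byte
	addr net.Addr
}

func main() {
	sendQueue := make(chan taggedPacket, queueSize)

	// enqueue mimics the non-blocking send in QueuePacketConn.WriteTo: if the
	// channel is full, the packet is dropped instead of blocking the caller.
	enqueue := func(p []byte, addr net.Addr) {
		buf := append([]byte(nil), p...) // copy so the caller may reuse p
		select {
		case sendQueue <- taggedPacket{buf, addr}:
		default:
			fmt.Printf("drop outgoing %d\n", len(buf))
		}
	}

	addr := &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 9999}
	for i := 0; i < queueSize+8; i++ {
		enqueue(make([]byte, 1200), addr) // the last 8 sends are dropped
	}
	fmt.Printf("queued %d packets\n", len(sendQueue))
}
```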
In some quick tests, I started to get faster results in dnstt when I tried

- `queueSize = 2048` (was 32)
- `SetWindowSize(1024, 1024)` (was 32)
- `MaxStreamBuffer = 4 * 1024 * 1024` (was 65536)

It was only a little slower with `SetWindowSize(128, 128)`. I imagine there could be drawbacks to an overly large `queueSize` or `SetWindowSize`: when a packet needs to be retransmitted it goes to the end of the queue. For reference, the settings I was using when I first noticed the potential of increasing `MaxStreamBuffer` in the AMP cache tunnel were:

- `queueSize = 256`
- `SetWindowSize(1024, 1024)`
- `MaxStreamBuffer = 1 * 1024 * 1024`
added 7 commits

- 1470b25b...e3d376ca - 6 commits from branch `tpo/anti-censorship/pluggable-transports:main`
- b5b5334f - Increase smux and QueuePacketConn buffer sizes
I spent some time on Friday and yesterday trying out the new Shadow release to see if it would work with Go yet. Turns out there's still some work that needs to be done to support Go binaries, but I'm hopeful that this will be useful for doing more in-depth experiments with Snowflake.
In the nearer term, I don't see an issue with increasing the `queueSize` and `MaxStreamBuffer` as suggested here at the server and trying it out. I just pushed an update to the merge request.

Okay, let's go ahead with this.
With a large `queueSize`, let's watch out for bufferbloat effects. What may happen is that when a connection is interrupted, the buffer fills up with packets that cannot be sent yet, possibly including retransmissions of packets already in the buffer. All these old stale packets will need to be flushed out of the buffer before anything new can be sent. I was thinking of instituting some kind of random early detection drops into `turbotunnel.QueuePacketConn` to mitigate these kinds of effects.

I just merged and deployed this. Thanks for pointing out the potential bufferbloat issue. If I understand this correctly, two ways of monitoring this would be:
- in packet captures for clients that have interrupted connections, we'll see a lot of duplicate retransmissions following the outage and a longer than usual recovery time
- increased memory usage at the server
These sound difficult to actively monitor; we might just need to run a few intentional experiments to determine whether this is an issue.
The random early detection is interesting, thanks for the link!
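For illustration only, a random-early-detection-style drop policy for the send queue might look something like the sketch below. Nothing like this exists in turbotunnel yet; the threshold and the linear drop curve are invented for the example. The idea is to start dropping a growing fraction of packets once the queue passes a threshold, instead of only dropping when it is completely full, so stale retransmissions are less likely to saturate the buffer after an outage.

```go
package main

import (
	"fmt"
	"math/rand"
)

const queueSize = 2048

// shouldDrop implements a simple RED-like curve: no early drops below half
// occupancy, then a drop probability that rises linearly to 1 as the queue
// approaches its capacity. The threshold and shape are arbitrary choices here.
func shouldDrop(queueLen int) bool {
	const minThresh = queueSize / 2
	if queueLen < minThresh {
		return false
	}
	p := float64(queueLen-minThresh) / float64(queueSize-minThresh)
	return rand.Float64() < p
}

func main() {
	sendQueue := make(chan []byte, queueSize)
	dropped := 0
	for i := 0; i < queueSize; i++ {
		if shouldDrop(len(sendQueue)) {
			dropped++ // early drop while the queue still has room
			continue
		}
		select {
		case sendQueue <- make([]byte, 1200):
		default:
			dropped++ // hard drop when the queue is actually full
		}
	}
	fmt.Printf("queued %d, dropped %d\n", len(sendQueue), dropped)
}
```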