Use a sync.Pool to reduce allocation in QueuePacketConn
This is the continuation of a thread that started at #40260 (comment 2885524). #40260 (closed) undid an erroneous performance optimization; however that optimization was tried for a reason, and we talked about possible correct alternatives.
This merge request uses a sync.Pool
in QueuePacketConn
as a source of byte buffers that are stored in queues,
rather than allocating fresh buffers with make
and letting each one be garbage collected.
The most notable change relative to the patch proposed in
#40260 (comment 2885539)
is that buffers are automatically returned to the pool in more situations:
when a packet is silently dropped in QueueIncoming
and WriteTo
,
and with the packets whose contents are returned by ReadFrom
.
Besides that, 97c93001 is a fix for
benchmark interference from TestQueuePacketConnWriteToKCP
noted at #40260 (comment 2885832).
The test started a goroutine that was supposed to stop when the test function returned,
but a bug meant the goroutine kept running
and consumed resources during the benchmark.
6bae31f0 reduces memory allocation during
BenchmarkQueueIncoming
and BenchmarkWriteTo
,
which is no longer necessary since #40260 (closed) went back to permitting
the caller to reuse its passed-in buffer.
I went with the model of having a sync.Pool
be an instance variable of QueuePacketConn
.
That makes it easier to control the MTU as a parameter to NewQueuePacketConn
.
It does forego the possibility of some sharing of a single pool over multiple
QueuePacketConn
(in practice we do use multiple QueuePacketConn
, 4 currently).
My intuition says it does not matter much.
These are the QueuePacketConn
benchmarks at 97c93001
(i.e., without the sync.Pool
change but with the benchmark fixes):
snowflake/common/turbotunnel$ go test -bench 'Benchmark(QueueIncoming|WriteTo)' -benchmem -benchtime 10s
goos: linux
goarch: amd64
pkg: git.torproject.org/pluggable-transports/snowflake.git/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkQueueIncoming-4 42610378 299.1 ns/op 514 B/op 1 allocs/op
BenchmarkWriteTo-4 22618404 463.1 ns/op 516 B/op 1 allocs/op
The results at c097d5f3
(i.e., with the sync.Pool
change):
snowflake/common/turbotunnel$ go test -bench 'Benchmark(QueueIncoming|WriteTo)' -benchmem -benchtime 10s
goos: linux
goarch: amd64
pkg: git.torproject.org/pluggable-transports/snowflake.git/v2/common/turbotunnel
cpu: Intel(R) Core(TM) i5 CPU 680 @ 3.60GHz
BenchmarkQueueIncoming-4 77492412 187.0 ns/op 28 B/op 1 allocs/op
BenchmarkWriteTo-4 29897580 362.6 ns/op 36 B/op 1 allocs/op