There aren't many clues in the logs. The last few messages before it stopped working were:
2021/03/08 13:14:51 acceptStreams: io: read/write on closed pipe
2021/03/08 13:14:52 error copying ORPort to WebSocket io: read/write on closed pipe
2021/03/08 13:14:52 error copying WebSocket to ORPort io: read/write on closed pipe
2021/03/08 13:14:52 acceptStreams: io: read/write on closed pipe
2021/03/08 13:15:03 error copying WebSocket to ORPort io: read/write on closed pipe
2021/03/08 13:15:03 error closing read after copying ORPort to WebSocket close tcp [scrubbed]->[scrubbed]: shutdown: transport endpoint is not connected
2021/03/08 13:15:03 acceptStreams: io: read/write on closed pipe
However, all of these messages occur frequently throughout the log, not just at the time of the failure. As an aside, I think I'm going to clean up our logging a bit, because most of these messages are not helpful and might be burying something that is.
This just happened again and I checked /var/log/syslog to see:
May 3 02:35:19 snowflake kernel: [3292296.426221] proxy-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
May 3 02:35:19 snowflake kernel: [3292296.426302] CPU: 1 PID: 24623 Comm: proxy-go Not tainted 5.4.22 #2
[snip]
May 3 02:35:19 snowflake kernel: [3292296.560904] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/system-tor.slice/tor@default.service,task=snowflake-serve,pid=10523,uid=106
May 3 02:35:19 snowflake kernel: [3292296.565317] Out of memory: Killed process 10523 (snowflake-serve) total-vm:2240608kB, anon-rss:1123648kB, file-rss:0kB, shmem-rss:0kB, UID:106 pgtables:2440kB oom_score_adj:0
May 3 02:35:19 snowflake kernel: [3292296.622117] oom_reaper: reaped process 10523 (snowflake-serve), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 3 03:17:01 snowflake CRON[3865]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
So proxy-go invoked the OOM killer, but either the server or the proxy-go instances could be what is filling up the memory. I'll take a look today and see if there are any likely memory leaks.
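The plan is to temporarily hook net/http/pprof into the server and pull heap and goroutine profiles. A minimal sketch of the kind of snippet that can be dropped into the server's main package for this (the localhost port and the use of an init function are just for illustration, not what's actually in the tree):

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func init() {
	go func() {
		// Listen on localhost only so the debug endpoints aren't reachable from outside.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

With that running, http://localhost:6060/debug/pprof/heap shows live heap allocations, and http://localhost:6060/debug/pprof/goroutine?debug=1 lists every live goroutine with its stack, which makes a per-connection goroutine leak easy to spot.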
I used the net/http/pprof package (https://golang.org/pkg/net/http/pprof/) and it looks like there are no heap leaks behind the memory usage, but we are leaking goroutines, from the looks of it only about 2 per incoming connection. These don't take up a lot of memory individually, but they add up over time. Here's the profiling output after starting and then closing 24 snowflake clients in https://github.com/cohosh/snowbox:
Good analysis. It's clear that there's nothing to stop the goroutine that reads from pconn.OutgoingQueue().
I compared this with the Turbo Tunnel example code, and there it has a done channel that the conn loop uses to signal the other loop to terminate. Looks like we need that here?
var wg sync.WaitGroup
wg.Add(2)
done := make(chan struct{})
go func() {
	defer wg.Done()
	defer close(done) // Signal the write loop to finish.
	for {
		p, err := turbotunnel.ReadPacket(conn)
		if err != nil {
			return
		}
		c.QueuePacketConn.QueueIncoming(p, sessionID)
	}
}()
go func() {
	defer wg.Done()
	defer conn.Close() // Signal the read loop to finish.
	for {
		select {
		case <-done:
			return
		case p, ok := <-c.QueuePacketConn.OutgoingQueue(sessionID):
			if ok {
				err := turbotunnel.WritePacket(conn, p)
				if err != nil {
					return
				}
			}
		}
	}
}()
wg.Wait()
return nil
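If I'm reading that right, the two defers are what break the cycle in both directions: when the read loop exits (ReadPacket returns an error), close(done) unblocks the write loop's select even if no outgoing packet ever arrives, and when the write loop exits first, conn.Close() makes the blocked ReadPacket return an error, so neither goroutine is left hanging after the connection goes away.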
I opened a merge request, but it'll conflict a little with !31 (closed). The leak is very slow so I don't mind waiting until that is merged to rebase this and merge it.