There aren't many clues in the logs. The last few messages before it stopped working were:
2021/03/08 13:14:51 acceptStreams: io: read/write on closed pipe
2021/03/08 13:14:52 error copying ORPort to WebSocket io: read/write on closed pipe
2021/03/08 13:14:52 error copying WebSocket to ORPort io: read/write on closed pipe
2021/03/08 13:14:52 acceptStreams: io: read/write on closed pipe
2021/03/08 13:15:03 error copying WebSocket to ORPort io: read/write on closed pipe
2021/03/08 13:15:03 error closing read after copying ORPort to WebSocket close tcp [scrubbed]->[scrubbed]: shutdown: transport endpoint is not connected
2021/03/08 13:15:03 acceptStreams: io: read/write on closed pipe
However, all of these messages occur frequently throughout the log, not just at the time of the failure. As an aside, I think I'm going to clean up our logging a bit, because most of these messages are not helpful and might be burying something that is.
This just happened again and I checked /var/log/syslog to see:
May 3 02:35:19 snowflake kernel: [3292296.426221] proxy-go invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
May 3 02:35:19 snowflake kernel: [3292296.426302] CPU: 1 PID: 24623 Comm: proxy-go Not tainted 5.4.22 #2
[snip]
May 3 02:35:19 snowflake kernel: [3292296.560904] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/system-tor.slice/tor@default.service,task=snowflake-serve,pid=10523,uid=106
May 3 02:35:19 snowflake kernel: [3292296.565317] Out of memory: Killed process 10523 (snowflake-serve) total-vm:2240608kB, anon-rss:1123648kB, file-rss:0kB, shmem-rss:0kB, UID:106 pgtables:2440kB oom_score_adj:0
May 3 02:35:19 snowflake kernel: [3292296.622117] oom_reaper: reaped process 10523 (snowflake-serve), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
May 3 03:17:01 snowflake CRON[3865]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
So proxy-go invoked the OOM killer, but either the server or the proxy-go instances could be what is filling up the memory. I'll take a look today and see if there are any likely memory leaks.
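The plan is to temporarily hook net/http/pprof into the server and pull heap and goroutine profiles. A minimal sketch of the kind of snippet that can be dropped into the server's main package for this (the localhost port and the use of an init function are just for illustration, not what's actually in the tree):

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side-effect import: registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func init() {
	go func() {
		// Listen on localhost only so the debug endpoints aren't reachable from outside.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}

With that running, http://localhost:6060/debug/pprof/heap shows live heap allocations, and http://localhost:6060/debug/pprof/goroutine?debug=1 lists every live goroutine with its stack, which makes a per-connection goroutine leak easy to spot.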
I used the net/http/pprof package (https://golang.org/pkg/net/http/pprof/) and it looks like there are no heap leaks behind the memory usage, but we are leaking goroutines, from the looks of it only about 2 per incoming connection. These don't take up a lot of memory individually, but they add up over time. Here's the profiling output after starting and then closing 24 snowflake clients in https://github.com/cohosh/snowbox:
Good analysis. It's clear that there's nothing to stop the goroutine that reads from pconn.OutgoingQueue().
I compared this with the Turbo Tunnel example code, and there it has a done channel that the conn loop uses to signal the other loop to terminate. Looks like we need that here?
var wg sync.WaitGroup
wg.Add(2)
done := make(chan struct{})
go func() {
	defer wg.Done()
	defer close(done) // Signal the write loop to finish.
	for {
		p, err := turbotunnel.ReadPacket(conn)
		if err != nil {
			return
		}
		c.QueuePacketConn.QueueIncoming(p, sessionID)
	}
}()
go func() {
	defer wg.Done()
	defer conn.Close() // Signal the read loop to finish.
	for {
		select {
		case <-done:
			return
		case p, ok := <-c.QueuePacketConn.OutgoingQueue(sessionID):
			if ok {
				err := turbotunnel.WritePacket(conn, p)
				if err != nil {
					return
				}
			}
		}
	}
}()
wg.Wait()
return nil
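If I'm reading that right, the two defers are what break the cycle in both directions: when the read loop exits (ReadPacket returns an error), close(done) unblocks the write loop's select even if no outgoing packet ever arrives, and when the write loop exits first, conn.Close() makes the blocked ReadPacket return an error, so neither goroutine is left hanging after the connection goes away.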
I opened a merge request, but it'll conflict a little with !31 (closed). The leak is very slow so I don't mind waiting until that is merged to rebase this and merge it.