I have checked, and out of 4GB of memory only 1GB is used when onbasca is not running (2.5GB are used by caches). Running onbasca raised the consumption by ~400MB. It is the biggest memory and CPU consumer in this VM after launching it, maybe because it is testing a lot of bridges, but it is still not hitting the limits of the server. After leaving it running for a bit, the CPU consumption goes down and it keeps using 7% of memory.
If it gets killed because of running out of memory we should get an OOM error, shouldn't we? But maybe it's the tor subprocess being killed and onbasca failing to recover from that?
Anyway, I left it running again, let's see how long it survives.
> If it gets killed because of running out of memory we should get an OOM error, shouldn't we? But maybe it's the tor subprocess being killed and onbasca failing to recover from that?
Ah, right. The exception is raised by stem because the tor socket is closed. I've just remembered there should be tor logs at ~/.onbrisca/tor, so let's try to figure out from those what is making tor die.
The only way to recover from a closed socket would be to try to launch tor again, but it might die again...
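A band-aid along those lines could look roughly like this with stem (a sketch only; the `relaunch_tor` callable is hypothetical, standing in for whatever onbrisca already uses to start tor and get back an authenticated Controller):

```python
import logging

import stem

log = logging.getLogger(__name__)


def set_bridgelines_with_retry(controller, bridgelines, relaunch_tor):
    """Send SETCONF Bridge; if the control socket is gone, relaunch tor once.

    `relaunch_tor` is a hypothetical callable that starts a new tor process
    and returns a fresh, authenticated stem Controller.
    """
    try:
        controller.set_conf("Bridge", bridgelines)
        return controller
    except (stem.SocketClosed, stem.SocketError):
        log.warning("Control socket closed; relaunching tor and retrying once.")
        controller = relaunch_tor()
        controller.set_conf("Bridge", bridgelines)
        return controller
```

Of course, if whatever is killing tor is still around, the relaunched tor would likely die too.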
```
Jul 24 10:54:16 bridge_heartbeat[815043]: <INFO> (MainThread) bridge_heartbeat.py:79 - log_status - Sleeping for 60.008979 secs.
Jul 24 10:54:17 log[815043]: <INFO> (Tor listener) log.py:174 - log - Error while receiving a control message (SocketClosed): empty socket content
Jul 24 10:55:17 bridge_scanner[815043]: <INFO> (MainThread) bridge_scanner.py:195 - scan - Finished a loop.
Jul 24 10:55:17 bridge_scanner[815043]: <DEBUG> (MainThread) bridge_scanner.py:79 - scan_bridges - Starting new loop.
[...]
Jul 24 10:55:17 bridge_torcontrol[815043]: <DEBUG> (ThreadPoolExecutor-0_0) bridge_torcontrol.py:44 - set_bridgelines - tor has 20 bridgelines
[..]
  File "/home/onbasca/onbasca/onbrisca/bridge_torcontrol.py", line 51, in set_bridgelines
    self.controller.set_conf("Bridge", new_bridgeline)
```
It looks like the tor socket dies while onbrisca is sleeping before ending a loop. Or it could die at some other moment and only be detected when trying to send a query to the control socket, as in the previous traceback?
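One way to narrow down when the socket actually dies would be to poll stem's is_alive() from a separate thread and log the closure as soon as it happens, instead of only noticing it at the next query. A rough sketch (the 5 second interval is arbitrary):

```python
import logging
import threading
import time

log = logging.getLogger(__name__)


def watch_controller(controller, interval=5):
    """Log as soon as the control connection drops, to pin down the timing."""
    while controller.is_alive():
        time.sleep(interval)
    log.warning("tor control socket is no longer alive.")


def start_watcher(controller):
    # Daemon thread so it never blocks shutdown of the main process.
    threading.Thread(target=watch_controller, args=(controller,), daemon=True).start()
```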
From the tor log, the only thing I can spot so far is that something sent SIGTERM to tor:
```
Jul 24 10:54:16.000 [notice] Catching signal TERM, exiting cleanly.
```
What in onbrisca could be sending SIGTERM to tor?
The last change we made before this started happening was to solve a stem.InvalidRequest exception when setting bridges via the control socket (#157), which only happened when receiving an invalid bridgeline. So in principle it seems unrelated.
In the distant past, Tor's control port had issues where it would queue up too many bytes waiting to get written out to the controller application, and if the controller kept not reading, eventually it would face a choice of either (a) using an unbounded amount of memory, which is a DoS vector, or (b) closing the socket to defend itself.
I think we have improved things a few times over the years, so I don't imagine that's the issue here. But just in case: how much queued data are we talking about here, and are you reading the answer promptly before asking for more?
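For asynchronous events at least, "reading promptly" with stem mostly means keeping any event listeners trivial, so stem's reader thread never stalls while tor still has data to write. A minimal sketch (the queue and the event type are just examples):

```python
import queue

from stem.control import EventType

pending_events = queue.Queue()


def enqueue_event(event):
    # Do no real work in the listener: stem's reader thread calls it, so
    # anything slow here delays draining the control socket.
    pending_events.put(event)


def subscribe(controller):
    controller.add_event_listener(enqueue_event, EventType.STATUS_CLIENT)
```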
Before this issue, we were sending one SETCONF Bridge with the 25 bridge lines; now we're sending 25 individual SETCONF Bridge queries plus the one with the 25 bridge lines.
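Roughly, in stem terms, the two patterns look like this (a sketch; `bridgelines` stands for the 25 lines of one loop):

```python
def set_bridgelines_before(controller, bridgelines):
    # Before: a single SETCONF Bridge carrying all 25 bridge lines.
    controller.set_conf("Bridge", bridgelines)


def set_bridgelines_now(controller, bridgelines):
    # Now: one SETCONF Bridge per line (25 queries), plus the final
    # SETCONF Bridge with the full list of 25 lines.
    for bridgeline in bridgelines:
        controller.set_conf("Bridge", bridgeline)
    controller.set_conf("Bridge", bridgelines)
```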
In a quick search for SIGTERM in C-tor, I found lost_owning_controller activating the signal, but I don't know in which cases the controller would be lost.
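If I read that code right, lost_owning_controller fires when a control connection that took ownership of tor goes away, and it then activates SIGTERM internally, which would match the clean TERM shutdown in the tor log. With stem, ownership is asserted when tor is launched with take_ownership; a sketch (ports and the data directory are made-up example values):

```python
import stem.process

# Sketch: with take_ownership=True stem asserts ownership over the tor
# process, so tor shuts itself down if the owning controller (or this
# python process) goes away. All values below are examples.
tor_process = stem.process.launch_tor_with_config(
    config={
        "SocksPort": "auto",
        "ControlPort": "9051",
        "DataDirectory": "/tmp/onbrisca-tor-example",
    },
    take_ownership=True,
)
```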
Regarding reading the answer, that's done by stem. I think this line: https://gitlab.torproject.org/tpo/network-health/stem/-/blob/master/stem/socket.py#L601 is where it tries to read and gets the error we are having (see the line in the traceback above: Jul 24 10:54:17 log[815043]: <INFO> (Tor listener) log.py:174 - log - Error while receiving a control message (SocketClosed): empty socket content).
I can try later to send fewer queries again, or to wait between them, to see whether that stops this issue.
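For the waiting variant, a minimal sketch (the 0.1 second pause is an arbitrary starting value):

```python
import time


def set_bridgelines_throttled(controller, bridgelines, pause=0.1):
    """Send one SETCONF Bridge per line, pausing between queries."""
    for bridgeline in bridgelines:
        controller.set_conf("Bridge", bridgeline)
        time.sleep(pause)
```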