I have checked, and out of 4GB of memory only 1GB is used when onbasca is not running (2.5GB are used by caches). Running onbasca raised the consumption by ~400MB. It is the biggest memory and CPU consumer in this VM after launching it, maybe because it is testing a lot of bridges, but it is still not hitting the limits of the server. After leaving it running for a bit, the CPU consumption goes down and it keeps using 7% of memory.
If it gets killed because of running out of memory we should get an OOM error, shouldn't we? But maybe it's the tor subprocess being killed and onbasca failing to recover from that?
Anyway, I left it running again, let's see how long it survives.
> If it gets killed because of running out of memory we should get an OOM error, shouldn't we? But maybe it's the tor subprocess being killed and onbasca failing to recover from that?
Ah, right. The exception is raised by stem because the tor socket is closed. I've just remembered there should be tor logs at ~/.onbrisca/tor, so let's try to figure out from those what is making tor die.
The only way to recover from a closed socket would be to try to launch tor again, but it might die again...
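A band-aid along those lines could look roughly like this with stem (a sketch only; the `relaunch_tor` callable is hypothetical, standing in for whatever onbrisca already uses to start tor and get back an authenticated Controller):

```python
import logging

import stem

log = logging.getLogger(__name__)


def set_bridgelines_with_retry(controller, bridgelines, relaunch_tor):
    """Send SETCONF Bridge; if the control socket is gone, relaunch tor once.

    `relaunch_tor` is a hypothetical callable that starts a new tor process
    and returns a fresh, authenticated stem Controller.
    """
    try:
        controller.set_conf("Bridge", bridgelines)
        return controller
    except (stem.SocketClosed, stem.SocketError):
        log.warning("Control socket closed; relaunching tor and retrying once.")
        controller = relaunch_tor()
        controller.set_conf("Bridge", bridgelines)
        return controller
```

Of course, if whatever is killing tor is still around, the relaunched tor would likely die too.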
```
Jul 24 10:54:16 bridge_heartbeat[815043]: <INFO> (MainThread) bridge_heartbeat.py:79 - log_status - Sleeping for 60.008979 secs.
Jul 24 10:54:17 log[815043]: <INFO> (Tor listener) log.py:174 - log - Error while receiving a control message (SocketClosed): empty socket content
Jul 24 10:55:17 bridge_scanner[815043]: <INFO> (MainThread) bridge_scanner.py:195 - scan - Finished a loop.
Jul 24 10:55:17 bridge_scanner[815043]: <DEBUG> (MainThread) bridge_scanner.py:79 - scan_bridges - Starting new loop.
[...]
Jul 24 10:55:17 bridge_torcontrol[815043]: <DEBUG> (ThreadPoolExecutor-0_0) bridge_torcontrol.py:44 - set_bridgelines - tor has 20 bridgelines
[..]
  File "/home/onbasca/onbasca/onbrisca/bridge_torcontrol.py", line 51, in set_bridgelines
    self.controller.set_conf("Bridge", new_bridgeline)
```
It looks like the tor socket dies while onbrisca is sleeping before ending a loop. Or it could die at some other moment and only be detected when trying to send a query to the control socket, as in the previous traceback?
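One way to narrow down when the socket actually dies would be to poll stem's is_alive() from a separate thread and log the closure as soon as it happens, instead of only noticing it at the next query. A rough sketch (the 5 second interval is arbitrary):

```python
import logging
import threading
import time

log = logging.getLogger(__name__)


def watch_controller(controller, interval=5):
    """Log as soon as the control connection drops, to pin down the timing."""
    while controller.is_alive():
        time.sleep(interval)
    log.warning("tor control socket is no longer alive.")


def start_watcher(controller):
    # Daemon thread so it never blocks shutdown of the main process.
    threading.Thread(target=watch_controller, args=(controller,), daemon=True).start()
```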
From the tor log, the only thing I can spot so far is that something sent SIGTERM to tor:
```
Jul 24 10:54:16.000 [notice] Catching signal TERM, exiting cleanly.
```
What in onbrisca could be sending SIGTERM to tor?
The last change we made before this started happening was to solve a stem.InvalidRequest exception when setting bridges via the control socket (#157), which only happened when receiving an invalid bridgeline. So in principle it seems unrelated.
In the distant past, Tor's control port had issues where it would queue up too many bytes waiting to get written out to the controller application, and if the controller kept not reading, eventually it would face a choice of either (a) using an unbounded amount of memory, which is a DoS vector, or (b) closing the socket to defend itself.
I think we have improved things a few times over the years, so I don't imagine that's the issue here. But just in case: how much queued data are we talking about here, and are you reading the answer promptly before asking for more?
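For asynchronous events at least, "reading promptly" with stem mostly means keeping any event listeners trivial, so stem's reader thread never stalls while tor still has data to write. A minimal sketch (the queue and the event type are just examples):

```python
import queue

from stem.control import EventType

pending_events = queue.Queue()


def enqueue_event(event):
    # Do no real work in the listener: stem's reader thread calls it, so
    # anything slow here delays draining the control socket.
    pending_events.put(event)


def subscribe(controller):
    controller.add_event_listener(enqueue_event, EventType.STATUS_CLIENT)
```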
Before this issue, we were sending one SETCONF Bridge with the 25 bridge lines; now we're sending 25 individual SETCONF Bridge queries plus the one with the 25 bridge lines.
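Roughly, in stem terms, the two patterns look like this (a sketch; `bridgelines` stands for the 25 lines of one loop):

```python
def set_bridgelines_before(controller, bridgelines):
    # Before: a single SETCONF Bridge carrying all 25 bridge lines.
    controller.set_conf("Bridge", bridgelines)


def set_bridgelines_now(controller, bridgelines):
    # Now: one SETCONF Bridge per line (25 queries), plus the final
    # SETCONF Bridge with the full list of 25 lines.
    for bridgeline in bridgelines:
        controller.set_conf("Bridge", bridgeline)
    controller.set_conf("Bridge", bridgelines)
```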
In a quick search for SIGTERM in C-tor, I found lost_owning_controller activating the signal, but I don't know in which cases the controller would be lost.
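If I read that code right, lost_owning_controller fires when a control connection that took ownership of tor goes away, and it then activates SIGTERM internally, which would match the clean TERM shutdown in the tor log. With stem, ownership is asserted when tor is launched with take_ownership; a sketch (ports and the data directory are made-up example values):

```python
import stem.process

# Sketch: with take_ownership=True stem asserts ownership over the tor
# process, so tor shuts itself down if the owning controller (or this
# python process) goes away. All values below are examples.
tor_process = stem.process.launch_tor_with_config(
    config={
        "SocksPort": "auto",
        "ControlPort": "9051",
        "DataDirectory": "/tmp/onbrisca-tor-example",
    },
    take_ownership=True,
)
```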
Regarding reading the answer, that's done by stem. I think this line: https://gitlab.torproject.org/tpo/network-health/stem/-/blob/master/stem/socket.py#L601 is where it tries to read and gets the error we are having (see the line in the traceback above: Jul 24 10:54:17 log[815043]: <INFO> (Tor listener) log.py:174 - log - Error while receiving a control message (SocketClosed): empty socket content).
I can try later to send fewer queries again, or to wait between them, to see whether that stops this issue.
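For the waiting variant, a minimal sketch (the 0.1 second pause is an arbitrary starting value):

```python
import time


def set_bridgelines_throttled(controller, bridgelines, pause=0.1):
    """Send one SETCONF Bridge per line, pausing between queries."""
    for bridgeline in bridgelines:
        controller.set_conf("Bridge", bridgeline)
        time.sleep(pause)
```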