libtor.a: pubsub_install tor_raw_abort

Summary

We see OONI Probe Android crashes where pubsub_install calls tor_raw_abort for tor 0.4.7.13 using libtor.a embedded into a dynamic library loaded by an Android app. As of 2023-02-09 (around when we started investigating), this issue occurred 526 times in the last 28 days and was one of the main sources of crashes for the OONI Probe Android app.

A typical stack trace obtained from the Google Play console looks like this:

backtrace:
  #00  pc 0x0000000000089b0c  .../lib64/bionic/libc.so (abort+164)
  #01  pc 0x00000000013778a4  .../split_config.arm64_v8a.apk (tor_raw_abort_+12)
  #02  pc 0x0000000001382150  .../split_config.arm64_v8a.apk (tor_abort_+12)
  #03  pc 0x00000000012470a0  .../split_config.arm64_v8a.apk (pubsub_install+120)
  #04  pc 0x0000000001247170  .../split_config.arm64_v8a.apk (tor_run_main+136)

We investigated this issue and manage to reproduce it initially on OONI Probe Android, then in Linux using our Go code for managing libtor.a, and finally with a pure C test case working under Linux. During this investigating we have never seen the first bootstrap failing. Rather, in some cases it took > 30 repeated bootstraps to observe the abort; in other cases, it occurred within the first 3-10 bootstraps.

I searched in the issue tracker for "pubsub", "pubsub_install", "SIGABRT", and "abort". AFAICT, there is no other open issue discussing this problem, however, I think #32729 may be related and ~similar.

Steps to reproduce:

The following steps allowed me to reproduce the problem on Ubuntu 22.04.2:

git clone https://gitlab.torproject.org/tpo/core/tor
cd tor
git checkout tor-0.4.7.13
git apply 004.diff where 004.diff is

diff --git a/src/lib/pubsub/pubsub_check.c b/src/lib/pubsub/pubsub_check.c
index 99e604d715..a5cc4b7658 100644
--- a/src/lib/pubsub/pubsub_check.c
+++ b/src/lib/pubsub/pubsub_check.c
@@ -25,6 +25,7 @@
 #include "lib/malloc/malloc.h"
 #include "lib/string/compat_string.h"
 
+#include <stdio.h>
 #include <string.h>
 
 static void pubsub_adjmap_add(pubsub_adjmap_t *map,
@@ -343,21 +344,27 @@ lint_message(const pubsub_adjmap_t *map, message_id_t msg)
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has subscribers, but no publishers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_pub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   } else if (n_sub == 0) {
     log_warn(LD_MESG|LD_BUG,
              "Message \"%s\" has publishers, but no subscribers.",
              get_message_id_name(msg));
+    fprintf(stderr, "SBSDEBUG: n_sub == 0 for %s\n", get_message_id_name(msg));
     ok = false;
   }
 
   /* Check the message graph topology. */
-  if (lint_message_graph(map, msg, pub, sub) < 0)
+  if (lint_message_graph(map, msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_graph failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   /* Check whether the messages have the same fields set on them. */
-  if (lint_message_consistency(msg, pub, sub) < 0)
+  if (lint_message_consistency(msg, pub, sub) < 0) {
+    fprintf(stderr, "SBSDEBUG: lint_message_consistency failed for %s\n", get_message_id_name(msg));
     ok = false;
+  }
 
   if (!ok) {
     /* There was a problem -- let's log all the publishers and subscribers on
@@ -385,6 +392,7 @@ pubsub_adjmap_check(const pubsub_adjmap_t *map)
   bool all_ok = true;
   for (unsigned i = 0; i < map->n_msgs; ++i) {
     if (lint_message(map, i) < 0) {
+      fprintf(stderr, "SBSDEBUG: lint_message failed for %u %s\n", i, get_message_id_name((message_id_t)i));
       all_ok = false;
     }
   }
@@ -401,11 +409,15 @@ pubsub_builder_check(pubsub_builder_t *builder)
   pubsub_adjmap_t *map = pubsub_build_adjacency_map(builder->items);
   int rv = -1;
 
-  if (!map)
+  if (!map) {
+    fprintf(stderr, "SBSDEBUG: pubsub_build_adjacency_map failed\n");
     goto err; // should be impossible
+  }
 
-  if (pubsub_adjmap_check(map) < 0)
+  if (pubsub_adjmap_check(map) < 0) {
+    fprintf(stderr, "SBSDEBUG: pubsub_adjmap_check failed\n");
     goto err;
+  }
 
   rv = 0;
  err:

./autogen.sh
./configure --disable-asciidoc
make
mkdir tmp
vi tmp/main.c where main.c contains

#include "../src/feature/api/tor_api.h"

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>

static void *threadMain(void *ptr) {
	int *fdp = (int*)ptr;
	(void)sleep(45 /* seconds */);
	(void)close(*fdp);
	free(fdp);
	return NULL;
}

int main() {
	for (;;) {
		tor_main_configuration_t *config = tor_main_configuration_new();
		if (config == NULL) {
			exit(1);
		}
		char *argv[] = {
			"tor",
			"Log",
			"notice stderr",
			"DataDirectory",
			"./x",
			NULL,
		};
		int argc = 5;
		if (tor_main_configuration_set_command_line(config, argc, argv) != 0) {
			exit(2);
		}
		int filedesc = tor_main_configuration_setup_control_socket(config);
		if (filedesc < 0) {
			exit(3);
		}
		int *fdp = malloc(sizeof(*fdp));
		if (fdp == NULL) {
			exit(4);
		}
		*fdp = filedesc;
		pthread_t thread;
		if (pthread_create(&thread, NULL, threadMain, /* move */ fdp) != 0) {
			exit(5);
		}
		(void)tor_run_main(config);
		if (pthread_join(thread, NULL) != 0) {
			exit(6);
		}
		fprintf(stderr, "********** doing another round\n");
	}
}

gcc -Wall tmp/main.c -L. -ltor -levent -lcrypto -lssl -lz -lm
./a.out 2>&1|tee LOG.txt

The tmp/main.c command is a reasonable approximation of what our Go code for running tor does. The main difference is that we start tor with DisableNetwork set and re-enable network later. This difference does not seem to have any impact, since we saw aborts in both cases.

We run repeated bootstraps because the OONI Probe Android app loads tor and the Go code as a shared library and calls tor_run_main each time we run a OONI experiment that requires tor (typically, vanilla_tor and torsf).

What is the current bug behavior?

We can cluster the kind of crashes we observed into two groups.

pubsub_adjmap_check failed

This crash has been the most frequent one we observed. With the above patch applied, it generally looks like this:

[... omitting logs from several bootstraps ...]

Mar 22 14:07:21.000 [notice] Owning controller connection has closed -- exiting now.
Mar 22 14:07:21.000 [notice] Catching signal TERM, exiting cleanly.
********** doing another round
SBSDEBUG: n_sub == 0 for orconn_state
SBSDEBUG: lint_message failed for 5 orconn_state
SBSDEBUG: n_pub == 0 for orconn_state
SBSDEBUG: lint_message failed for 34 orconn_state
SBSDEBUG: pubsub_adjmap_check failed
[1]    300227 IOT instruction (core dumped)  ./a.out 2>&1 | 
       300228 done                           tee LOG.txt

When running this via Go code, we see a different message before the abort. I think this happens because Go installs its own handler for SIGABRT, while the C code does not install any handler. My understanding is also that "IOT instruction" is related to SIGIOT, which seems to be an alias for SIGABRT judging from include/linux/signal.h and Glib's bits/signum-generic.h.

My understanding of the above logs is that, somehow, a message is registered twice: once without publishers, and once without subscribers.

It's also important to point out that the message causing failure has not always been orconn_state. Based on all the aborts we have examined, it seems that also orconn_status could cause failures. For the sake of brevity, I am not going to copy here all the logs we collected, but you can read them along with my thought process when analyzing the bug at https://github.com/ooni/probe/issues/2406.

INTERNAL ERROR: Raw assertion failed in Tor 0.4.7.13 at src/app/main/subsysmgr.c:183: 0

This specific error occurred very rarely (2-3 times). It is not clear whether this is the same issue or not, however I think it makes sense to mention it in the same issue, because it occurred when using the above code to investigate pubsub_install aborts.

2023/03/21 17:59:13  info tunnel: tor: exec: <internal/libtor> x/tunnel/torsf/tor [...]
BUG: subsystem btrack (at 55) could not connect to publish/subscribe system.
============================================================ T= 1679421553
INTERNAL ERROR: Raw assertion failed in Tor 0.4.7.13 at src/app/main/subsysmgr.c:183: 0
A subsystem couldn't be connected.
./testtorsf(dump_stack_symbols_to_error_fds+0x58)[0xe6df08]
./testtorsf(tor_raw_assertion_failed_msg_+0x97)[0xe6e8d7]
./testtorsf(subsystems_add_pubsub_upto+0x128)[0xe47df8]
./testtorsf(pubsub_install+0x29)[0xdf9c99]
./testtorsf(tor_run_main+0x8a)[0xdf9e2a]
./testtorsf(_cgo_2d785783cadf_Cfunc_tor_run_main+0x1b)[0xdf665b]
./testtorsf[0x500e04]

SIGABRT: abort
PC=0x7fa00f89aa7c m=14 sigcode=18446744073709551610
signal arrived during cgo execution

(Because this specific error occurred when using Go code, here you see also the output of Go SIGABRT handler.)

The specific assertion that fails in this case is the following:

int
subsystems_add_pubsub_upto(pubsub_builder_t *builder,
                           int target_level)
{
  for (unsigned i = 0; i < n_tor_subsystems; ++i) {
    const subsys_fns_t *sys = tor_subsystems[i];
    if (!sys->supported)
      continue;
    if (sys->level > target_level)
      break;
    if (! sys_status[i].initialized)
      continue;
    int r = 0;
    if (sys->add_pubsub) {
      subsys_id_t sysid = get_subsys_id(sys->name);
      raw_assert(sysid != ERROR_ID);
      pubsub_connector_t *connector;
      connector = pubsub_connector_for_subsystem(builder, sysid);
      r = sys->add_pubsub(connector);
      pubsub_connector_free(connector);
    }
    if (r < 0) {
      fprintf(stderr, "BUG: subsystem %s (at %u) could not connect to "
              "publish/subscribe system.", sys->name, sys->level);
      raw_assert_unreached_msg("A subsystem couldn't be connected."); // <- HERE
    }
  }

  return 0;
}

What is the expected behavior?

On a very broad level, I think tor should not abort. Because I do not understand very well what is happening, it is difficult to provide a more specific recommendation about what the code should actually do.

Environment

Which version of Tor are you using? Run tor --version to get the version if you are unsure.

Always 0.4.7.13

Which operating system are you using? For example: Debian GNU/Linux 10.1, Windows 10, Ubuntu Xenial, FreeBSD 12.2, etc.

Android (several versions and devices according to the Google Play console); Android 13 on Pixel 4a arm64 (my phone); Ubuntu 22.04.2 on amd64

Which installation method did you use? Distribution package (apt, pkg, homebrew), from source tarball, from Git, etc.

Tor compiled along with all its dependencies using our build scripts as well as tor compiled from sources with Ubuntu 22.04.2 installation dependencies when reproducing the issue using the above mentioned steps.

Relevant logs and/or screenshots

I think I already provided representative logs above. The https://github.com/ooni/probe/issues/2406 issue contains all the logs we produced while investigating this issue on our end. It also describes how we progressively narrowed down the problem from an abort in the Android app to an abort using Go code on Linux to the minimal instructions for reproducing the issue that I mentioned above.

On this note, I initially suspected that there was a data race on our end. That assumption was true but the abort continued to occur after I fixed the data race inside Go code. In any case, the possible presence of data races on our end prompted me to bypass our Go code and write C code that could allow reproducing the issue. In one of my final attempts at understanding the issue using just C code, I patched tor to avoid aborting in case pubsub_install failed, recompiled and run with tsan enabled, seeing just two pubsub_install failures over 490 runs and no sign of data races.

Possible fixes

I don't know. Since the data-race theory is not supported by data and unlikely, perhaps it could be that state from previous runs causes issues with the pubsub subsystem that appear for repeated bootstraps? I'll be happy to collaborate and try other debugging strategies.

Edited Oct 05, 2023 by sbs

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information