Fix test-stem `test_take_ownership_via_controller` failure

changed milestone to %Tor: 0.4.3.x-final in legacy/trac

added 043-should in Legacy / Trac component::core tor/tor in Legacy / Trac milestone::Tor: 0.4.3.x-final in Legacy / Trac owner::teor in Legacy / Trac points::1 in Legacy / Trac priority::medium in Legacy / Trac resolution::fixed in Legacy / Trac severity::normal in Legacy / Trac status::closed in Legacy / Trac type::defect in Legacy / Trac labels

I created a diagnostics branch that just runs test-stem, and just the affected test:

do not merge: https://github.com/torproject/tor/pull/1678

This stem test passes on 0.4.2:

https://travis-ci.org/torproject/tor/jobs/637709536

But fails on master:

Stem logs: https://travis-ci.org/teor2345/tor/builds/639735548#L5757

Tor logs: https://travis-ci.org/teor2345/tor/builds/639735548#L4708 (I think the Tor logs are too verbose to show anything interesting.)

So it's probably a Tor bug.

I can also reproduce these results locally with the latest tor and stem master, so I should be able to bisect.

We broke this in 0.4.3, so we have to fix it.

Trac:
Keywords: 043-should deleted, 043-must added
Actualpoints: N/A to 0.2

This bug only occurs with ./configure --enable-fragile-hardening on my system, so it may be a tor/stem race condition bug. (legacy/trac#29437 (moved) is a similar bug; we may need legacy/trac#30901 (moved) to debug this kind of race condition.)

Trac:
Actualpoints: 0.2 to N/A
Keywords: 043-must deleted, 043-should added

It looks like this timing issue was introduced in the legacy/trac#30984 (moved) refactor, perhaps in commit c744d23c. (At least on my machine.)

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

Trac:
Resolution: N/A to worksforme
Status: assigned to closed
Cc: N/A to catalyst

Here's the script I used for bisecting:

if ! test -f configure; then
    # abort bisect if setup fails
    ./autogen.sh  || exit 255
    # fragile hardening is required to trigger the bug
    # disabling asciidoc makes configure require fewer dependencies
    ./configure --disable-asciidoc --enable-fragile-hardening || exit 255
fi

# skip bisect of this commit if it doesn't build
make src/app/tor || exit 125
python3 "$STEM_SOURCE_DIR"/run_tests.py --tor src/app/tor --integ --test process.test_take_ownership_via_controller --log TRACE --log-file stem.log

Replying to teor:

It looks like this timing issue was introduced in the legacy/trac#30984 (moved) refactor, perhaps in commit c744d23c. (At least on my machine.)

Tor doesn't guarantee control reply timing. And we're unlikely to be able to restore the old timing behaviour. So stem's tests should be adapted to work with the timing in both Tor 0.4.2 and Tor master.

I'm not sure what that commit has to do with TAKEOWNERSHIP. It seems to be about GETCONF instead. Are you suggesting that a change to the timing or formatting of GETCONF is causing a specific stem test to consistently fail?

Trac:
Status: closed to reopened
Resolution: worksforme to N/A

Let's check this test again, once legacy/trac#33039 (moved) is fixed.

Trac:
Parent: N/A to legacy/trac#33039 (moved)

Taylor and I have been investigating this and here is what we found:

The integ/process.py code is doing this test to see whether Tor is running:

        if tor_process.poll() == 0:
           return  # tor exited

This calls the poll() method of a subprocess.Popen object, which returns 0 only when the process has exited with an exit code of 0. If Tor exits with any other exit code, or is killed by a signal, poll() returns a different value, so the test does not treat Tor as having exited.
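To make the poll() semantics concrete, here is a minimal sketch of the values it can return. The helper names describe_poll and run_to_completion are hypothetical and are not part of stem; the negative-value convention for signals applies on POSIX systems.

```python
import signal
import subprocess
import sys


def describe_poll(status):
    """Translate a Popen.poll() result into a human-readable message."""
    if status is None:
        return "still running"
    if status == 0:
        return "exited cleanly"
    if status > 0:
        return "exited with code %d" % status
    # On POSIX, a negative value -N means the child was killed by signal N.
    return "killed by %s" % signal.Signals(-status).name


def run_to_completion(code):
    """Run a short Python snippet in a child process and return poll()."""
    proc = subprocess.Popen([sys.executable, "-c", code])
    proc.wait()
    return proc.poll()


if __name__ == "__main__":
    print(describe_poll(run_to_completion("raise SystemExit(0)")))
    print(describe_poll(run_to_completion("raise SystemExit(3)")))
    print(describe_poll(-signal.SIGPIPE))
```

A check written as `poll() == 0` only matches the first of these cases, which is why a Tor process killed by a signal goes unnoticed.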

In this case, I found that Tor was actually exiting with a SIGPIPE, because of this chain of events:

stderr had been closed by stem.
There was a memory leak (legacy/trac#33039 (moved)), and so LeakSanitizer was trying to write to stderr.
LeakSanitizer couldn't write to stderr (because it was closed), and so it got a SIGPIPE.
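The chain of events above can be reproduced in miniature. This is only a sketch under assumptions: it uses a Python child in place of Tor and LeakSanitizer, it assumes stderr is a pipe whose read end the parent closes (the ticket does not say exactly how stem closed it), and it needs a POSIX system. The child restores the default SIGPIPE handler because Python ignores SIGPIPE on startup, unlike a typical C program.

```python
import signal
import subprocess
import sys

# Stand-in for a process that reports something on stderr at shutdown.
CHILD = r"""
import os, signal, sys
signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # behave like a C program
sys.stdin.readline()                           # wait for the parent's go-ahead (EOF)
os.write(2, b"leak report\n")                  # write to the closed stderr pipe
"""


def run_with_closed_stderr():
    """Start the child, close its stderr pipe, and return its exit status."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    proc.stderr.close()  # the parent drops the read end of stderr
    proc.stdin.close()   # EOF on stdin tells the child to write now
    proc.wait()
    return proc.returncode


if __name__ == "__main__":
    status = run_with_closed_stderr()
    print(status == -signal.SIGPIPE)
```

The returned status is the negated SIGPIPE signal number rather than 0, so a `poll() == 0` check would not recognise that the process has exited.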

We didn't notice this at the time because there was nothing to tell us that the bug had actually occurred.

I think we have a few things to work on here.

I've opened legacy/trac#33039 (moved) for the leak. We should fix that in 0.4.3.
I've opened a pull request against stem so that it gives a more accurate message if Tor fails during these tests: https://github.com/torproject/stem/pull/54 . I hope it's in the right place. (I did not find any other cases where stem was using the poll()==0 pattern.)
We should find some way to make it so that when stem is running its tests, it does not close Tor's stderr, but rather reports stderr output as a test failure. This will make it likelier that we will notice LeakSanitizer failures in the future.
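One possible shape for the last item, as a rough sketch only: keep stderr open as a pipe, collect whatever the process writes, and treat any output or non-zero exit status as a failure. The function name run_and_check_stderr and its parameters are hypothetical; this is not how stem's test runner is structured.

```python
import subprocess
import sys


def run_and_check_stderr(args, timeout=30):
    """Run a command and return a list of problems found (empty if none)."""
    proc = subprocess.Popen(
        args,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # keep stderr open and capture it
    )
    _, err = proc.communicate(timeout=timeout)

    problems = []
    if proc.returncode != 0:
        problems.append("exit status %d" % proc.returncode)
    if err.strip():
        text = err.decode(errors="replace").strip()
        problems.append("stderr output: %s" % text)
    return problems


if __name__ == "__main__":
    noisy = [sys.executable, "-c", "import sys; sys.stderr.write('leak detected')"]
    for problem in run_and_check_stderr(noisy):
        print(problem)
```

With this approach a sanitizer report would reach the test output instead of triggering SIGPIPE in the process under test.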

Can we close this now? We've merged Taylor's fix for legacy/trac#33039 (moved), and Stem has merged my PR to fix the test failure message.

I think that's it, except for:

We should find some way to make it so that when stem is running its tests, it does not close Tor's stderr, but rather reports stderr output as a test failure. This will make it likelier that we will notice LeakSanitizer failures in the future.

Trac:
Parent: legacy/trac#33039 (moved) to N/A

I've opened https://github.com/torproject/stem/issues/55 for that issue, and am closing this.

Trac:
Resolution: N/A to fixed
Status: reopened to closed

closed

changed time estimate to 8h

mentioned in issue legacy/trac#33039 (moved)

moved from legacy/trac#33006 (moved)

