|
|
# 042 bug retrospective
|
|
|
|
|
|
POSSIBLE LESSONS TO TAKE FROM THIS RETROSPECTIVE:
|
|
|
* We should be writing tests for everything. "Having a test for the error case would have caught this" comes up way too much.
|
|
|
* teor: +1, we should add extra time for unit testing to every ticket and proposal
|
|
|
* Our lack of integation testing for certain pieces of functionality is a long-term liability. Even carefully unit testing our code would not have caught all these bugs.
|
|
|
* teor: +1, we should add extra time for integration testing to every ticket and proposal
|
|
|
* we should also change our PR reviews so we ask about integration testing every time
|
|
|
* In many cases we have introduced testing in response to these bugs, but probably not enough.
|
|
|
* Some of our bugs seem to sit around longer than we should like. There are bugs that were first fixed in 0.4.2 dating all the way back to 0.3.1.
|
|
|
* Some parts of our code base are identifiably under-tested or badly held together. They include:
|
|
|
* Accounting isn't integration-tested.
|
|
|
* Our Windows testing and QA is severely underpowered.
|
|
|
* Our seccomp2 sandbox, as it relates to the file system, is held together with twine and chewing gum.
|
|
|
* We should find a way to fuzz our configuration handling code more reasonably.
|
|
|
* +1, note that we have considered configs out of scope for fuzzing in the past
|
|
|
* We should find a way to run even the stem tests that use the tor network -- would pointing them at a chutney network work? In any case we should be running them as part of our dev process whenever we're touching controller code.
|
|
|
* Noticing what changes in a patch is essential. I think we should make pull-all start saying what files changed again.
|
|
|
|
|
|
[dgoulet] +1 ... this is something I'm missing from the normal "git pull" that tells me what changed and where. It is useful to me to know that.
|
|
|
|
|
|
[teor] +1, I check the files when I merge, but the reviewer should also be checking what files *and functions* have changed
|
|
|
|
|
|
|
|
|
|
|
|
Best practices:
|
|
|
* We should clearly distinguish "state" from "configuration" or "messages". We shouldn't mix state into configuration or parsed objects.
|
|
|
|
|
|
[dgoulet] +1.
|
|
|
|
|
|
* When writing warning messages and BUG() messages, we should think about their potential to get loud if they are hit too often, and whether a rate-limit/IF_BUG_ONCE() is appropriate.
|
|
|
|
|
|
|
|
|
LESSONS LEARNED/QUESTIONS ASKED ABOUT THIS RETROSPECTIVE PROCESS:
|
|
|
* Maybe we selected too many tickets to look at?
|
|
|
* Did we select the wrong tickets?
|
|
|
* Should we ignore unreleased bugs completely?
|
|
|
* Did we ignore any bugs that were important.
|
|
|
* teor says: we fixed some microdesc bugs in 0.4.0, 0.4.1, or 0.4.2. Some of them didn't get backported to 0.3.5, and they're causing some issues with chutney
|
|
|
|
|
|
#Methodological notes
|
|
|
```
|
|
|
places to look for bugs:
|
|
|
* tickets marked 'regression'
|
|
|
https://trac.torproject.org/projects/tor/query?keywords=~regression+042-must&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=component&order=priority
|
|
|
https://trac.torproject.org/projects/tor/query?keywords=~regression&milestone=%5ETor%3A+0.4.2.x-final&col=id&col=summary&col=type&col=status&col=priority&col=milestone&col=component&order=priority
|
|
|
-- DONE.
|
|
|
|
|
|
* any bugs fixed after the .1-alpha release
|
|
|
[see changelog]
|
|
|
-- maybe we should skip the minor ones here? if so, it's DONE
|
|
|
|
|
|
* any bugs marked major
|
|
|
[see changelog]
|
|
|
-- DONE.
|
|
|
|
|
|
Let's focus on 0.4.2.
|
|
|
|
|
|
things to look for in bugs:
|
|
|
* what kind of bug it was
|
|
|
* how "disruptive" was it?
|
|
|
* How did we actually find it?
|
|
|
* how it might be prevented in the future
|
|
|
* if it was backported to 0.2.9 or 0.3.5
|
|
|
* if that backport was avoidable
|
|
|
* if that backport made it into a release
|
|
|
("no" for all 0.2.9 backports, there hasn't been a release for 15 months)
|
|
|
* if we think any 0.2.9 backports should have triggered a release
|
|
|
|
|
|
```
|
|
|
|
|
|
# Bug notes
|
|
|
```
|
|
|
|
|
|
|
|
|
#31375 hs: Crash in token_bucket_ctr_refill() of the INTRO2 DoS defense
|
|
|
(regression)
|
|
|
|
|
|
Type: Crash from uninitialized token bucket -- division by zero?
|
|
|
Impact: Low: Found on master before release. Only triggers with DoS defense. And v2 HS?
|
|
|
Found: By dgoulet while running a relay on master
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: 9f738be8937d675929b43a149d706160641a089d
|
|
|
Could be prevented by:
|
|
|
* making token_bucket detect and work around a zero rate.
|
|
|
* unit testing of init function followd by refill function.
|
|
|
|
|
|
|
|
|
#31793 Bug: tor_addr_is_internal() called from src/feature/dirauth/process_descs.c:447 with a non-IP address of type 0
|
|
|
(regression)
|
|
|
Type: Nonfatal bug warning.
|
|
|
Impact: Low: found on master before release. Authority only.
|
|
|
Found: By arma while running a relay on master
|
|
|
Bug backported: no
|
|
|
Bug introduced in: d9a7d4779887dbd2cba082c2a5daa535fe0d36ce
|
|
|
Could be prevented by:
|
|
|
* Better unit testing for dir_allow_private_addresses()
|
|
|
* Integration testing with some private-network tests disabled?
|
|
|
* Making tor_addr_is_internal() detect null addresses?
|
|
|
* Maybe mitigated by using IF_BUG_ONCE() by default, or making BUG rate-limit by default?
|
|
|
|
|
|
#32841 sandbox error on 0.4.2.x
|
|
|
(regression)
|
|
|
Type: seccomp2 sandbox failuer
|
|
|
Impact: High. Seccomp2 failed in 0.4.2.x and in in 0.4.1.7, users complained.
|
|
|
Found: by weasel while running with the sandbox, by users running on relays.
|
|
|
Bug backported: yes
|
|
|
Bug introduced in: a22fbab98690f802ae3bda276078cc7fc767feba
|
|
|
Backport avoidable: I think so, yeah.
|
|
|
Could be be prevented by:
|
|
|
* Running integration tests with sandbox 1, always (blocked by #32817, #32722)
|
|
|
* some integration tests were running on old Ubuntu versions, the sandbox failed when we upgraded to newer versions, so we turned it off (see #32721 and related tickets)
|
|
|
* Getting more users to test our releases.
|
|
|
* Noticing when we start using any syscall that we haven't used before.
|
|
|
* Writing a list of the syscalls that we don't yet use, and grepping the code for them as part of CI
|
|
|
* Rewriting sandbox to use an improved architecture
|
|
|
|
|
|
#30146 Fix coverity failures as of 04-11-2019
|
|
|
(regression)
|
|
|
Type: memory leaks and false positives
|
|
|
Impact: Very low. The only true bug was a memory leak when failing to make a keys directory.
|
|
|
Found: By nick when going on coverity rotation
|
|
|
Bug backported: possibly/various
|
|
|
Bug introduced in: various
|
|
|
Could be prevented by:
|
|
|
* (To the extent that coverity found these issues, they _were_ prevented.)
|
|
|
* opening tickets for new coverity warnings immediately.
|
|
|
* checking for coverity backlog more often.
|
|
|
|
|
|
|
|
|
#31002 circpadding: Middle node did not accept our padding request
|
|
|
(regression)
|
|
|
Type: Log spam
|
|
|
Impact: low: log messages only.
|
|
|
Found: by dgoulet when running a client on alpha
|
|
|
Bug backported: no
|
|
|
|
|
|
|
|
|
NOTE: I don't know how to classify this one: did we ever actually fix the underlying issue? It looks like we deferred it.
|
|
|
|
|
|
|
|
|
#31352 Jenkins failure on windows: ENETUNREACH undeclared.
|
|
|
(regression)
|
|
|
Type: compilation failure on some but not all windows builds
|
|
|
Impact: very low. Nobody noticed this but us.
|
|
|
Found: by looking at Jenkins
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: c46e99c43c4ee03
|
|
|
Could be prevented by:
|
|
|
* Maintaining jenkins better and looking at it more frequently
|
|
|
* Trying more configurations in appveyor?
|
|
|
* Trying cross-compilation for windows in travis, maybe?
|
|
|
|
|
|
|
|
|
#31354 Compiler "note" in test_addr.c: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’
|
|
|
(regression)
|
|
|
Type: compilation notes (less verbose than warnings (!))
|
|
|
Impact: Low. Only an annoyance to developers, except to the extent that their verbosity made some CI sad.
|
|
|
Found: by using GCC 9.1
|
|
|
Bug backported: no
|
|
|
Could be prevented by:
|
|
|
* (To the extent that jenkins found it, this bug _was_ prevented.)
|
|
|
* Treat merely annoying verbosity as a problem?
|
|
|
* Trying more compilers in CI
|
|
|
* Keeping unit test functions smaller?
|
|
|
* activating practracker on unit tests, perhaps with higer limits?
|
|
|
* creating helpers for repeated unit tests, but with different test data
|
|
|
|
|
|
#31495 cannot configure bridges
|
|
|
(regression)
|
|
|
Type: Logic error due to NULL/empty confusion.
|
|
|
Impact: Medium; the TB team ran into this one. It would have made bridges unusable if it had shipped.
|
|
|
Found: by mcs when using tor-browser nightly builds
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: e16b90b88a76fb82702ae26c54834aca6c591d64
|
|
|
Could be prevented by:
|
|
|
* More integration testing, as we introduced after this with test_parseconf.
|
|
|
* Never designing a type where NULL and "" have different interpretations
|
|
|
* When refactoring, somehow fuzzing before- and after- to find differences?
|
|
|
|
|
|
#31578 practracker scans build directories inside the tor/ directory
|
|
|
(regression)
|
|
|
#31796 practracker git hook warning: Unusual pattern permitted.h in testdata
|
|
|
(regression.)
|
|
|
Type: Practracker logic error
|
|
|
Impact: Low; only developers noticed or could notice
|
|
|
Found: by teor after a failed 'make distcheck'
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 17dd3167494d69b
|
|
|
Could be prevented by:
|
|
|
* More integration on our cod-testing tools, so we're not replicationg the same functionality so much?
|
|
|
|
|
|
|
|
|
#31734 Add accessor functions for cb_buf, which enforce locking and unlocking
|
|
|
(regression. Mislabeled? This is bugfix on 0.2.5.3)
|
|
|
Type: Not using locks enough;
|
|
|
Impact: Low. would lead to a race if two threads had backtraces at once
|
|
|
Found: By teor, looking at the code.
|
|
|
Bug backported: no
|
|
|
Bug introduced in: a3ab31f5dc55b2 in 0.2.5.3-alpha
|
|
|
Could be prevented by:
|
|
|
* Always using this accessor pattern or something similar
|
|
|
* preferring thread-local storage to shared buffers
|
|
|
* Avoiding static buffers
|
|
|
|
|
|
#31958 connection_dir_is_anonymous: Non-fatal assertion !(CONST_TO_OR_CIRCUIT(circ)->p_chan == NULL) failed
|
|
|
(regression)
|
|
|
Type: BUG() warning due to circuit briefly outliving its channel; use-after-mark error.
|
|
|
Impact: Low. Warning only.
|
|
|
Found: By dgoulet, running a test relay
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: ef2123c7c7 in 0.4.2.1-alpha? Or did the bug exist previously?
|
|
|
Could be prevented by:
|
|
|
* more disciplined rules about when to check for marked conns/circuits?
|
|
|
|
|
|
#32060 CID 1454761: wrong type passed to unlock_cb_buf()?
|
|
|
(regression)
|
|
|
Type: Type error in handling of backtrace buffers and locking.
|
|
|
Impact: Low. if this had lasted it would probably have given us some crashes on backtrace cleanup
|
|
|
Found: by coverity scan
|
|
|
Bug backported: no
|
|
|
Bug introduced in: c23986246b970b in 0.4.2.3-alpha
|
|
|
Could be prevented by:
|
|
|
* Prefer structs to void * whenever possible, even for small arrays.
|
|
|
* Avoid substantial machinery in bugfix patches after the first alpha
|
|
|
* Technically, coverity _did_ prevent this from reaching production.
|
|
|
|
|
|
#32835 regression: log subsystem not closing files on switch back to syslog
|
|
|
(regression) [NOT ACTUALLY CLOSED!!!]
|
|
|
Type:
|
|
|
Impact:
|
|
|
Found:
|
|
|
Bug backported:
|
|
|
Bug introduced in:
|
|
|
Could be prevented by:
|
|
|
|
|
|
#32961 Backport the diagnostic logs for is_possible_guard crash
|
|
|
(regression) [but not actually a regression]
|
|
|
Type:
|
|
|
Impact:
|
|
|
Found:
|
|
|
Bug backported:
|
|
|
Bug introduced in:
|
|
|
Could be prevented by:
|
|
|
|
|
|
#31408 torrc : ClientOnionAuthDir after include directives breaks client to v2 services
|
|
|
(regression, during 0.4.1 or earlier but not found till 042) [not actually a regrssion; bugfix on 0.3.1.1-alpha]
|
|
|
Type: Logic error.
|
|
|
Impact: Medium. Running %include on a directory ending with an empty file would result in some config options getting ignored.
|
|
|
Found: by a user
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: 0.3.1.1-alpha, when we added %include directories
|
|
|
Could be prevented by:
|
|
|
* Integration testing, possibly? We probably would have missed this case.
|
|
|
* Use of SLIST rather than hand-rolled pointer manipulation? Not definite.
|
|
|
* Fuzzing of config file handling?
|
|
|
|
|
|
|
|
|
#31810 Bug: ../src/lib/process/process_unix.c:265: process_unix_exec: Assertion line should be unreached failed; aborting.
|
|
|
(regression, during 0.4.1 or earlier but not found till 042)
|
|
|
Type: assertion failure.
|
|
|
Impact: Medium. Useless message when execve fails in child process.
|
|
|
Found: by users.
|
|
|
Bug backported: no.
|
|
|
Bug introduced in: 2e957027e28449 in 0.4.0.1-alpha
|
|
|
Could be prevented by:
|
|
|
* Testing more failing cases
|
|
|
* Avoinding tor_assert()
|
|
|
* Insisting on coverage more.
|
|
|
|
|
|
#31696 Assertion failure in map-anon.c:218
|
|
|
(regression, during 0.4.1 or earlier but not found till 042)
|
|
|
Type: Assertion crash; Failure to try a fallback or check reason for failure.
|
|
|
Impact: medium: crash when built with one Linux kernel configuration but run on another.
|
|
|
Found: by a slackware user
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 8ca808f81d28bb in 0.4.0.x, which wasn't used till 0.4.1.x
|
|
|
Could be prevented by:
|
|
|
* Testing on unusual linux kernel configurations
|
|
|
* Introducing code review assumption that detected-in-headers does not mean present-in-kernel.
|
|
|
* ???
|
|
|
|
|
|
#31897 util/map_anon_nofork test fails on SunOS
|
|
|
(regression, during 0.4.1 or earlier but not found till 042)
|
|
|
Type: unit test failure, due to "char" being unsigned char.
|
|
|
Impact: low. unit test failure on an infrequent platform
|
|
|
Found: by a user.
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 0.4.0.x
|
|
|
Could be prevented by:
|
|
|
* Unit testing on SunOS
|
|
|
* Unit testing on a platform where char is unsigned.
|
|
|
* Not using char for buffers.
|
|
|
|
|
|
#31772 MAPADDRESS control command
|
|
|
(regression, during 0.4.1 or earlier but not found till 042)
|
|
|
Type: Incorrect parsing table for a command.
|
|
|
Impact: MAPADDRESS didn't work
|
|
|
Found: by a user.
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 0c0b869ba450363e36 in 0.4.1.1-alpha, which was about extendcircuit and should never have touched this
|
|
|
Could be prevented by:
|
|
|
* Noticing extraneous change to mapaddress_syntax in a patch that wasn't about that
|
|
|
* Smaller pull requests?
|
|
|
* Running stem integration tests for the mapaddress command.
|
|
|
|
|
|
#31898 Occasional crash during shutdown when using Tor API
|
|
|
(regression, during 0.4.1 or earlier but not found till 042)
|
|
|
Type: Logic error wrt C null pointers.
|
|
|
Impact: Restart-in-process would fail when certain config changes were made.
|
|
|
Found: by sysrqb when working with jni
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 47de9c7 in 0.4.1.1-alpha.
|
|
|
Could be prevented by:
|
|
|
* Using a language without NULL
|
|
|
* Writing unit tests based on explicit description of function, not expected arguments.
|
|
|
* More testing of restart-in-process with configuration changes?
|
|
|
* Fuzzing configs with restart-in-process?
|
|
|
|
|
|
|
|
|
#29819 sandbox crash on rt_sigaction with libseccomp 0.2.4
|
|
|
(marked as major)
|
|
|
Type: Library compatibility, sandbox crash.
|
|
|
Impact: Medium. unable to run with sandbox using newer versions of libseccomp.
|
|
|
Found: by users.
|
|
|
Bug backported: n/a
|
|
|
Bug introduced in: This was caused by a change in libseccomp semantics. Previously rules were considerd to be ordered: now they aren't.
|
|
|
Could be prevented by:
|
|
|
* Libseccomp should probably not have implemented ordered rules semantics unintentionally, and possibly should have documented its guarantees better.
|
|
|
* A simpler sandbox design might have avoided this problem.
|
|
|
* ???
|
|
|
|
|
|
#32778 Initialise pubsub in Windows NT service mode
|
|
|
(regression, NT service failure)
|
|
|
Type: Failure to initialize new component.
|
|
|
Impact: Medium. NT services wouldn't start.
|
|
|
Found: by users.
|
|
|
Bug backported: no
|
|
|
Bug introduced in: 0.4.1.1-alpha with the introduction of pubsub.
|
|
|
Could be prevented by:
|
|
|
* Integration-testing NT services.
|
|
|
* QA on NT services
|
|
|
* Dropping support for NT services
|
|
|
* Reducing number of entry paths to Tor so that tor_api.h is "the only way in" (but see #32883)
|
|
|
|
|
|
#32108 tor can overrun its accountingmax if it enters soft hibernation first
|
|
|
(regression during 0.4.0.1-alpha, fixed in 0.4.2.3-alpha. marked as major)
|
|
|
Type: Bandwidth accounting. Logic error in flags applied to second_elapsed_callback
|
|
|
Impact: High. Relays didn't obey accountingmax under some circumstances.
|
|
|
Found: Not sure who noticed it; Roger figured it out.
|
|
|
Bug backported:
|
|
|
Bug introduced in: 4bf79fa4fa99fe6 in 0.4.0.1-alpha
|
|
|
Could be prevented by:
|
|
|
* Eliminating second_elapsed_callback?
|
|
|
* Integration tests for accounting
|
|
|
* More internal checks about bandwidth accounting
|
|
|
* Never refactoring working code (bad idea tho)
|
|
|
|
|
|
#31548 hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints intro points
|
|
|
(marked as major)
|
|
|
Type: Logic error; race condition.
|
|
|
Impact: Medium. HS was less reliable because it was encoding intro points in descriptor before they were attempted.
|
|
|
Found: by dgoulet testing another patch.
|
|
|
Bug backported: no
|
|
|
Bug introduced in: HSv3 code in 0.3.2.1-alpha
|
|
|
Could be prevented by:
|
|
|
* Clear separation of "state" and "descriptor" for HS?
|
|
|
* ???
|
|
|
|
|
|
``` |
|
|
\ No newline at end of file |