Gaba · 100accb5
--- a/ArchivedWiki/teams-to-be-move-to-each-team/NetworkTeam/CoreTorReleases/042Retrospective.md
+++ b/ArchivedWiki/teams-to-be-move-to-each-team/NetworkTeam/CoreTorReleases/042Retrospective.md
+# 042 bug retrospective
+
+POSSIBLE LESSONS TO TAKE FROM THIS RETROSPECTIVE:
+    * We should be writing tests for everything.  "Having a test for the error case would have caught this" comes up way too much.
+        * teor: +1, we should add extra time for unit testing to every ticket and proposal
+    * Our lack of integation testing for certain pieces of functionality is a long-term liability.  Even carefully unit testing our code would not have caught all these bugs.
+        * teor: +1, we should add extra time for integration testing to every ticket and proposal
+        * we should also change our PR reviews so we ask about integration testing every time
+    * In many cases we have introduced testing in response to these bugs, but probably not enough.
+    * Some of our bugs seem to sit around longer than we should like.  There are bugs that were first fixed in 0.4.2 dating all the way back to 0.3.1.
+    * Some parts of our code base are identifiably under-tested or badly held together. They include:
+        * Accounting isn't integration-tested.
+        * Our Windows testing and QA is severely underpowered.
+        * Our seccomp2 sandbox, as it relates to the file system, is held together with twine and chewing gum.
+        * We should find a way to fuzz our configuration handling code more reasonably.
+            * +1, note that we have considered configs out of scope for fuzzing in the past
+        * We should find a way to run even the stem tests that use the tor network -- would pointing them at a chutney network work? In any case we should be running them as part of our dev process whenever we're touching controller code.
+    * Noticing what changes in a patch is essential.  I think we should make pull-all start saying what files changed again.
+
+        [dgoulet] +1 ... this is something I'm missing from the normal "git pull" that tells me what changed and where. It is useful to me to know that.
+
+        [teor] +1, I check the files when I merge, but the reviewer should also be checking what files *and functions* have changed
+
+
+
+Best practices:
+    * We should clearly distinguish "state" from "configuration" or "messages".  We shouldn't mix state into configuration or parsed objects.
+
+        [dgoulet] +1.
+
+    * When writing warning messages and BUG() messages, we should think about their potential to get loud if they are hit too often, and whether a rate-limit/IF_BUG_ONCE() is appropriate.
+
+
+LESSONS LEARNED/QUESTIONS ASKED ABOUT THIS RETROSPECTIVE PROCESS:
+    * Maybe we selected too many tickets to look at?
+    * Did we select the wrong tickets?
+    * Should we ignore unreleased bugs completely?
+    * Did we ignore any bugs that were important.
+        * teor says: we fixed some microdesc bugs in 0.4.0, 0.4.1, or 0.4.2. Some of them didn't get backported to 0.3.5, and they're causing some issues with chutney
+
+#Methodological notes
+```
+places to look for bugs:
+    * tickets marked 'regression'
+       https://trac.torproject.org/projects/tor/query?keywords=~regression+042-must&col=id&col=summary&col=status&col=type&col=priority&col=milestone&col=component&order=priority
+       https://trac.torproject.org/projects/tor/query?keywords=~regression&milestone=%5ETor%3A+0.4.2.x-final&col=id&col=summary&col=type&col=status&col=priority&col=milestone&col=component&order=priority
+        -- DONE.
+ 
+    * any bugs fixed after the .1-alpha release
+       [see changelog]
+       -- maybe we should skip the minor ones here?  if so, it's DONE
+
+    * any bugs marked major
+       [see changelog]
+        -- DONE.
+
+Let's focus on 0.4.2.
+
+things to look for in bugs:
+    * what kind of bug it was
+    * how "disruptive" was it?  
+    * How did we actually find it?
+    * how it might be prevented in the future
+    * if it was backported to 0.2.9 or 0.3.5
+    * if that backport was avoidable
+    * if that backport made it into a release
+      ("no" for all 0.2.9 backports, there hasn't been a release for 15 months)
+    * if we think any 0.2.9 backports should have triggered a release
+
+```
+
+# Bug notes
+```
+
+
+#31375         hs: Crash in token_bucket_ctr_refill() of the INTRO2 DoS defense
+ (regression)
+
+Type: Crash from uninitialized token bucket -- division by zero?
+Impact: Low: Found on master before release. Only triggers with DoS defense. And v2 HS?
+Found: By dgoulet while running a relay on master
+Bug backported: no.
+Bug introduced in: 9f738be8937d675929b43a149d706160641a089d
+Could be prevented by:
+    * making token_bucket detect and work around a zero rate.
+    * unit testing of init function followd by refill function.
+
+
+#31793         Bug: tor_addr_is_internal() called from src/feature/dirauth/process_descs.c:447 with a non-IP address of type 0
+  (regression)
+Type: Nonfatal bug warning.
+Impact: Low: found on master before release. Authority only.
+Found: By arma while running a relay on master
+Bug backported: no
+Bug introduced in: d9a7d4779887dbd2cba082c2a5daa535fe0d36ce
+Could be prevented by:
+    * Better unit testing for dir_allow_private_addresses()
+    * Integration testing with some private-network tests disabled?
+    * Making tor_addr_is_internal() detect null addresses?
+    * Maybe mitigated by using IF_BUG_ONCE() by default, or making BUG rate-limit by default?
+
+#32841         sandbox error on 0.4.2.x
+  (regression)
+Type: seccomp2 sandbox failuer
+Impact: High. Seccomp2 failed in 0.4.2.x and in in 0.4.1.7, users complained.
+Found: by weasel while running with the sandbox, by users running on relays.
+Bug backported: yes
+Bug introduced in: a22fbab98690f802ae3bda276078cc7fc767feba
+Backport avoidable: I think so, yeah.
+Could be be prevented by:
+    * Running integration tests with sandbox 1, always (blocked by #32817, #32722)
+    * some integration tests were running on old Ubuntu versions, the sandbox failed when we upgraded to newer versions, so we turned it off (see #32721 and related tickets)
+    * Getting more users to test our releases.
+    * Noticing when we start using any syscall that we haven't used before.
+    * Writing a list of the syscalls that we don't yet use, and grepping the code for them as part of CI
+    * Rewriting sandbox to use an improved architecture
+
+#30146         Fix coverity failures as of 04-11-2019
+  (regression)
+Type: memory leaks and false positives
+Impact: Very low. The only true bug was a memory leak when failing to make a keys directory.
+Found: By nick when going on coverity rotation
+Bug backported: possibly/various
+Bug introduced in: various
+Could be prevented by:
+    * (To the extent that coverity found these issues, they _were_ prevented.)
+    * opening tickets for new coverity warnings immediately.
+    * checking for coverity backlog more often.
+
+
+#31002         circpadding: Middle node did not accept our padding request
+  (regression)
+Type: Log spam
+Impact: low: log messages only.
+Found: by dgoulet when running a client on alpha
+Bug backported: no 
+
+
+NOTE: I don't know how to classify this one: did we ever actually fix the underlying issue? It looks like we deferred it.
+
+
+#31352         Jenkins failure on windows: ENETUNREACH undeclared.
+  (regression)
+Type: compilation failure on some but not all windows builds
+Impact: very low. Nobody noticed this but us.
+Found: by looking at Jenkins
+Bug backported: no.
+Bug introduced in: c46e99c43c4ee03
+Could be prevented by:
+    * Maintaining jenkins better and looking at it more frequently
+    * Trying more configurations in appveyor?
+    * Trying cross-compilation for windows in travis, maybe?
+
+
+#31354         Compiler "note" in test_addr.c: variable tracking size limit exceeded with ‘-fvar-tracking-assignments’
+  (regression)
+Type: compilation notes (less verbose than warnings (!))
+Impact: Low. Only an annoyance to developers, except to the extent that their verbosity made some CI sad.
+Found: by using GCC 9.1
+Bug backported: no
+Could be prevented by:
+    * (To the extent that jenkins found it, this bug _was_ prevented.)
+    * Treat merely annoying verbosity as a problem?
+    * Trying more compilers in CI
+    * Keeping unit test functions smaller?
+    * activating practracker on unit tests, perhaps with higer limits?
+    * creating helpers for repeated unit tests, but with different test data
+
+#31495         cannot configure bridges
+  (regression)
+Type: Logic error due to NULL/empty confusion.
+Impact: Medium; the TB team ran into this one. It would have made bridges unusable if it had shipped.
+Found: by mcs when using tor-browser nightly builds
+Bug backported: no.
+Bug introduced in: e16b90b88a76fb82702ae26c54834aca6c591d64
+Could be prevented by:
+    * More integration testing, as we introduced after this with test_parseconf.
+    * Never designing a type where NULL and "" have different interpretations
+    * When refactoring, somehow fuzzing before- and after- to find differences?
+
+#31578         practracker scans build directories inside the tor/ directory
+  (regression)
+#31796         practracker git hook warning: Unusual pattern permitted.h in testdata
+  (regression.)
+Type: Practracker logic error
+Impact: Low; only developers noticed or could notice
+Found: by teor after a failed 'make distcheck'
+Bug backported: no
+Bug introduced in: 17dd3167494d69b
+Could be prevented by:
+    * More integration on our cod-testing tools, so we're not replicationg the same functionality so much?
+
+
+#31734         Add accessor functions for cb_buf, which enforce locking and unlocking
+  (regression. Mislabeled? This is bugfix on 0.2.5.3)
+Type: Not using locks enough;
+Impact: Low. would lead to a race if two threads had backtraces at once
+Found: By teor, looking at the code.
+Bug backported: no
+Bug introduced in: a3ab31f5dc55b2 in 0.2.5.3-alpha
+Could be prevented by: 
+    * Always using this accessor pattern or something similar
+    * preferring thread-local storage to shared buffers
+    * Avoiding static buffers
+
+#31958         connection_dir_is_anonymous: Non-fatal assertion !(CONST_TO_OR_CIRCUIT(circ)->p_chan == NULL) failed
+  (regression)
+Type: BUG() warning due to circuit briefly outliving its channel; use-after-mark error.
+Impact: Low. Warning only.
+Found: By dgoulet, running a test relay
+Bug backported: no.
+Bug introduced in: ef2123c7c7 in 0.4.2.1-alpha? Or did the bug exist previously?
+Could be prevented by: 
+    * more disciplined rules about when to check for marked conns/circuits?
+
+#32060         CID 1454761: wrong type passed to unlock_cb_buf()?
+  (regression)
+Type: Type error in handling of backtrace buffers and locking.
+Impact: Low. if this had lasted it would probably have given us some crashes on backtrace cleanup
+Found: by coverity scan
+Bug backported: no
+Bug introduced in: c23986246b970b in 0.4.2.3-alpha
+Could be prevented by:
+    * Prefer structs to void * whenever possible, even for small arrays.
+    * Avoid substantial machinery in bugfix patches after the first alpha
+    * Technically, coverity _did_ prevent this from reaching production.
+
+#32835         regression: log subsystem not closing files on switch back to syslog
+  (regression) [NOT ACTUALLY CLOSED!!!]
+Type: 
+Impact:
+Found:
+Bug backported:
+Bug introduced in:
+Could be prevented by:
+
+#32961         Backport the diagnostic logs for is_possible_guard crash
+  (regression) [but not actually a regression]
+Type: 
+Impact:
+Found:
+Bug backported:
+Bug introduced in:
+Could be prevented by:
+
+#31408         torrc : ClientOnionAuthDir after include directives breaks client to v2 services
+  (regression, during 0.4.1 or earlier but not found till 042) [not actually a regrssion; bugfix on 0.3.1.1-alpha]
+Type: Logic error.
+Impact: Medium. Running %include on a directory ending with an empty file would result in some config options getting ignored.
+Found: by a user
+Bug backported: no.
+Bug introduced in: 0.3.1.1-alpha, when we added %include directories
+Could be prevented by:
+    * Integration testing, possibly? We probably would have missed this case.
+    * Use of SLIST rather than hand-rolled pointer manipulation? Not definite.
+    * Fuzzing of config file handling?
+
+
+#31810         Bug: ../src/lib/process/process_unix.c:265: process_unix_exec: Assertion line should be unreached failed; aborting.
+  (regression, during 0.4.1 or earlier but not found till 042)
+Type: assertion failure.
+Impact: Medium. Useless message when execve fails in child process.
+Found: by users.
+Bug backported: no.
+Bug introduced in: 2e957027e28449 in 0.4.0.1-alpha
+Could be prevented by:
+    * Testing more failing cases
+    * Avoinding tor_assert()
+    * Insisting on coverage more.
+
+#31696         Assertion failure in map-anon.c:218
+  (regression, during 0.4.1 or earlier but not found till 042)
+Type: Assertion crash; Failure to try a fallback or check reason for failure.
+Impact: medium: crash when built with one Linux kernel configuration but run on another.
+Found: by a slackware user
+Bug backported: no
+Bug introduced in: 8ca808f81d28bb in 0.4.0.x, which wasn't used till 0.4.1.x
+Could be prevented by:
+    * Testing on unusual linux kernel configurations
+    * Introducing code review assumption that detected-in-headers does not mean present-in-kernel.
+    * ???
+
+#31897         util/map_anon_nofork test fails on SunOS
+  (regression, during 0.4.1 or earlier but not found till 042)
+Type: unit test failure, due to "char" being unsigned char.
+Impact: low. unit test failure on an infrequent platform
+Found: by a user.
+Bug backported: no
+Bug introduced in: 0.4.0.x
+Could be prevented by:
+    * Unit testing on SunOS
+    * Unit testing on a platform where char is unsigned.
+    * Not using char for buffers.
+
+#31772         MAPADDRESS control command
+  (regression, during 0.4.1 or earlier but not found till 042)
+Type: Incorrect parsing table for a command.
+Impact: MAPADDRESS didn't work
+Found: by a user.
+Bug backported: no
+Bug introduced in: 0c0b869ba450363e36 in 0.4.1.1-alpha, which was about extendcircuit and should never have touched this
+Could be prevented by:
+    * Noticing extraneous change to mapaddress_syntax in a patch that wasn't about that
+      * Smaller pull requests?
+    * Running stem integration tests for the mapaddress command.
+
+#31898         Occasional crash during shutdown when using Tor API
+  (regression, during 0.4.1 or earlier but not found till 042)
+Type: Logic error wrt C null pointers.
+Impact: Restart-in-process would fail when certain config changes were made.
+Found: by sysrqb when working with jni
+Bug backported: no
+Bug introduced in: 47de9c7 in 0.4.1.1-alpha.
+Could be prevented by:
+    * Using a language without NULL
+    * Writing unit tests based on explicit description of function, not expected arguments.
+    * More testing of restart-in-process with configuration changes?
+    * Fuzzing configs with restart-in-process?
+
+
+#29819         sandbox crash on rt_sigaction with libseccomp 0.2.4
+  (marked as major)
+Type: Library compatibility, sandbox crash.
+Impact: Medium. unable to run with sandbox using newer versions of libseccomp.
+Found: by users.
+Bug backported: n/a
+Bug introduced in: This was caused by a change in libseccomp semantics. Previously rules were considerd to be ordered: now they aren't.
+Could be prevented by:
+    * Libseccomp should probably not have implemented ordered rules semantics unintentionally, and possibly should have documented its guarantees better.
+    * A simpler sandbox design might have avoided this problem.
+    * ???
+
+#32778         Initialise pubsub in Windows NT service mode  
+  (regression, NT service failure)
+Type: Failure to initialize new component.
+Impact: Medium. NT services wouldn't start.
+Found: by users.
+Bug backported: no
+Bug introduced in: 0.4.1.1-alpha with the introduction of pubsub.
+Could be prevented by:
+    * Integration-testing NT services.
+    * QA on NT services
+    * Dropping support for NT services
+    * Reducing number of entry paths to Tor so that tor_api.h is "the only way in" (but see #32883)
+
+#32108         tor can overrun its accountingmax if it enters soft hibernation first  
+(regression during 0.4.0.1-alpha, fixed in 0.4.2.3-alpha. marked as major)
+Type: Bandwidth accounting. Logic error in flags applied to second_elapsed_callback
+Impact: High. Relays didn't obey accountingmax under some circumstances.
+Found: Not sure who noticed it; Roger figured it out.
+Bug backported:
+Bug introduced in: 4bf79fa4fa99fe6 in 0.4.0.1-alpha
+Could be prevented by:
+    * Eliminating second_elapsed_callback?
+    * Integration tests for accounting
+    * More internal checks about bandwidth accounting
+    * Never refactoring working code (bad idea tho)
+
+#31548         hs-v3: Service can pick more than HiddenServiceNumIntroductionPoints intro points  
+ (marked as major)
+Type: Logic error; race condition.
+Impact: Medium. HS was less reliable because it was encoding intro points in descriptor before they were attempted.
+Found: by dgoulet testing another patch.
+Bug backported: no
+Bug introduced in: HSv3 code in 0.3.2.1-alpha
+Could be prevented by:
+    * Clear separation of "state" and "descriptor" for HS?
+    * ???
+
+```
\ No newline at end of file