Skip to content

Improve bug discovery process

Creating the issue as a follow up to the meeting on March 16th, 2023 about snowflake-server buffer reuse bug postmortem #40260 (closed)

(The title of this ticket could be improved as well. Feel free to do so)

The harm to users was minor, but incidents like this are a good opportunity to reflect on our process, to make similar things less likely in the future.

The bug (#40199 (closed)) might have been caught, but was not, at multiple points:

  • Code understanding and review by the initial committer
  • Code review on the merge request
  • Automated tests / CI
  • End user reports or logs
  • Logs or instrumentation at the bridge

Which of these processes, if any, should we change, to decrease the chance of mistakes?

Brainstorming during the meeting:

  • Initial merge request should have included a test to prove the assumption that buffers were not reused.
    • The reviewer might have requested that such a test be added.
  • Any such anomalies, if detected at the client, should be logged in such a way that they show up in the tor log.
  • We should log whenever KCPInErrors is non-zero, at least.
  • We are missing integration testing as part of CI. We have unit testing, but nothing where all the pieces are working together as in production.
  • Should we have another more verbose level of log (debug/trace) so that it takes less effort to debug things in general? (no need to modify code then rebuilt like hazae41 did it https://hackerone.com/reports/1880610)