Improve bug discovery process
Creating the issue as a follow up to the meeting on March 16th, 2023 about snowflake-server buffer reuse bug postmortem #40260 (closed)
(The title of this ticket could be improved as well. Feel free to do so)
The harm to users was minor, but incidents like this are a good opportunity to reflect on our process, to make similar things less likely in the future.
The bug (#40199 (closed)) might have been caught, but was not, at multiple points:
- Code understanding and review by the initial committer
- Code review on the merge request
- Automated tests / CI
- End user reports or logs
- Logs or instrumentation at the bridge
Which of these processes, if any, should we change, to decrease the chance of mistakes?
Brainstorming during the meeting:
- Initial merge request should have included a test to prove the assumption that buffers were not reused.
- The reviewer might have requested that such a test be added.
- Any such anomalies, if detected at the client, should be logged in such a way that they show up in the tor log.
- dcf's private branch that logs KCP's internal error counters:
- The fix this week made the "KCPInErrors" counter go to zero:
- We should log whenever KCPInErrors is non-zero, at least.
- We are missing integration testing as part of CI. We have unit testing, but nothing where all the pieces are working together as in production.
- shelikhoo's setup for distributed snowflake server testing: https://github.com/xiaokangwang/snowflake-mu-docker/blob/master/docker-compose.yaml
- Should we have another more verbose level of log (debug/trace) so that it takes less effort to debug things in general? (no need to modify code then rebuilt like hazae41 did it https://hackerone.com/reports/1880610)