Improve bug discovery process

Creating the issue as a follow up to the meeting on March 16th, 2023 about snowflake-server buffer reuse bug postmortem #40260 (closed)

(The title of this ticket could be improved as well. Feel free to do so)

The harm to users was minor, but incidents like this are a good opportunity to reflect on our process, to make similar things less likely in the future.

The bug (#40199 (closed)) might have been caught, but was not, at multiple points:

Code understanding and review by the initial committer

Code review on the merge request

Automated tests / CI

End user reports or logs

Logs or instrumentation at the bridge

Which of these processes, if any, should we change, to decrease the chance of mistakes?

Brainstorming during the meeting:

Initial merge request should have included a test to prove the assumption that buffers were not reused.

The reviewer might have requested that such a test be added.

Any such anomalies, if detected at the client, should be logged in such a way that they show up in the tor log.

dcf's private branch that logs KCP's internal error counters:

dcf/snowflake@9f43843b

#40262 (comment 2886018)

The fix this week made the "KCPInErrors" counter go to zero:

#40262 (comment 2886032)

We should log whenever KCPInErrors is non-zero, at least.

We are missing integration testing as part of CI. We have unit testing, but nothing where all the pieces are working together as in production.

shelikhoo's setup for distributed snowflake server testing: https://github.com/xiaokangwang/snowflake-mu-docker/blob/master/docker-compose.yaml

Should we have another more verbose level of log (debug/trace) so that it takes less effort to debug things in general? (no need to modify code then rebuilt like hazae41 did it https://hackerone.com/reports/1880610)