hs: Client intro failure cache being polluted by circuits closing without a NACK from the intro point
Under current network conditions, a lot of introduction circuits fail for clients because the intro points are overloaded and can't process the ntor handshake, which leads to the circuit being closed with a RESOURCELIMIT reason.
When this trickles down to the HS client, hs_client_circuit_cleanup_on_free() is called, which flags the intro point with a generic failure error. That in turn makes the client stop using that intro point for another 2 minutes (the time the intro point stays in the failure cache).
I have observed a situation where all intro points get flagged this way: the client re-downloads the descriptor, fails again, and then has to wait 120 seconds before being able to do anything. This conveniently aligns with the SocksTimeout of 120 seconds as well, so inevitably the SOCKS connection hangs up.
This has really bad consequences for UX and reachability because, in theory, we could simply retry the intro point a couple of times and get it to work. Instead, we fail and fall back to other intro points, which in turn overloads them and creates a snowball effect.
The solution here is to mark the failure as "unreachable" so that the HS client retries up to 5 times (MAX_INTRO_POINT_REACHABILITY_FAILURES) before giving up.
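To illustrate the difference between the two failure types, here is a minimal standalone sketch. It is not the tor source code: apart from MAX_INTRO_POINT_REACHABILITY_FAILURES and the 120 second cache lifetime mentioned above, every name in it (intro_state_t, failure_type_t, note_failure, is_usable, INTRO_FAILURE_CACHE_LIFETIME) is hypothetical, and time is modeled as plain seconds so the behavior is easy to follow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value taken from the description above: retry up to 5 times. */
#define MAX_INTRO_POINT_REACHABILITY_FAILURES 5
/* Hypothetical name for the 2 minute failure cache lifetime. */
#define INTRO_FAILURE_CACHE_LIFETIME 120

/* Hypothetical failure types, for illustration only. */
typedef enum {
  FAILURE_GENERIC,     /* intro point is excluded right away */
  FAILURE_UNREACHABLE  /* intro point may be retried a few times */
} failure_type_t;

/* Hypothetical per-intro-point entry in the failure cache. */
typedef struct {
  bool excluded;              /* set by a generic failure */
  long excluded_at;           /* time (seconds) of the generic failure */
  unsigned unreachable_count; /* number of "unreachable" failures so far */
} intro_state_t;

/* Record a failure of the given type for an intro point. */
void
note_failure(intro_state_t *ip, failure_type_t type, long now)
{
  assert(ip != NULL);
  switch (type) {
    case FAILURE_GENERIC:
      /* One generic failure is enough to exclude the intro point. */
      ip->excluded = true;
      ip->excluded_at = now;
      break;
    case FAILURE_UNREACHABLE:
      /* Only count it; the intro point stays usable until the limit. */
      ip->unreachable_count++;
      break;
  }
}

/* Return true if the client may still pick this intro point. */
bool
is_usable(const intro_state_t *ip, long now)
{
  assert(ip != NULL);
  if (ip->excluded &&
      now - ip->excluded_at < INTRO_FAILURE_CACHE_LIFETIME) {
    return false;
  }
  return ip->unreachable_count < MAX_INTRO_POINT_REACHABILITY_FAILURES;
}
```

In this sketch, a single generic failure takes the intro point out of rotation for the full cache lifetime, while an "unreachable" failure leaves it usable until the fifth one, which is the behavior change proposed here for circuits that close without a NACK.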
We should definitely backport this.