Tor cache breaking over time and causing AllGuardsDown
Summary
Steps to reproduce:
We are using arti-client in our app and keep running into a state where all of our Tor requests time out or fail. I'm still trying to narrow this down and haven't found the exact cause yet.
Deleting the arti data directory does fix the issue, which leads me to believe the problem lies in arti itself or in how it handles its internal cache.
What is the current bug behavior?
Connections time out or fail with error messages such as these:
Client(Error { detail: ObtainHsCircuit { hsid: HsId([…]oqd.onion), cause: DescriptorDownload(RetryError { doing: "retrieve hidden service descriptor", errors: [(Single(1), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) })), (Single(2), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) })), (Single(3), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) })), (Single(4), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) })), (Single(5), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) })), (Single(6), Report(DescriptorError { hsdir: [scrubbed], error: Circuit(Guard(AllGuardsDown { retry_at: None, running: FilterCount { n_accepted: 0, n_rejected: 60 }, pending: FilterCount { n_accepted: 0, n_rejected: 0 }, suitable: FilterCount { n_accepted: 0, n_rejected: 0 }, filtered: FilterCount { n_accepted: 0, n_rejected: 0 } })) }))], n_errors: 6 }) } })) }))) }))
What is the expected behavior?
Connections should be successful without having to periodically delete the arti data directory.
Environment
We are using arti-client 1.5.0, specifically pinned to this commit. See the exact versions here.
Relevant logs and/or screenshots
More logs can be found here: https://gist.github.com/binarybaron/83100a901e62568c9085e11e598662c7. Note that these also include logs unrelated to arti itself. I can provide more detailed logging if required.
Possible fixes
Deleting the data directory seems to temporarily fix the issue.
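For anyone who needs the same stopgap, the workaround can be sketched roughly as below. This is only an illustration using the Rust standard library, not something taken from arti: `reset_arti_data_dir` is a hypothetical helper name, and `data_dir` stands for whatever directory your app passes to arti for its state and cache. I assume it should only be called while no Tor client is running against that directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Wipe the arti data directory and recreate it empty.
///
/// `data_dir` is an assumption: it must be the directory your application
/// configured for arti's state and cache. This helper is hypothetical and
/// not part of arti-client.
pub fn reset_arti_data_dir(data_dir: &Path) -> io::Result<()> {
    match fs::remove_dir_all(data_dir) {
        Ok(()) => {}
        // Nothing to delete, e.g. on a first run.
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    // Recreate the directory so the next start finds an empty one.
    fs::create_dir_all(data_dir)
}
```

Note that this throws away all cached and persisted state in that directory, so it only hides the symptom and does not explain why the state goes bad in the first place.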