Smarter timing for dirmgr downloads and retries
Currently there are multiple `RetryDelay` timers and other timers used in `dirmgr`. We should document them better, and simplify how they work. I was thinking of doing this as part of #329 (closed), but @eta may get to it first as part of #90.
Here's how it should work:
So, there are two main things happening in `dirmgr`: we try to fetch a complete directory now, and once that directory is old, we try to fetch a new one.
Fetching a directory
For background, a directory consists of multiple documents: a consensus document, a set of certificates that sign the consensus, and a set of microdescriptors whose digests are listed in the consensus. We download them in that order.
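The dependency order above can be sketched as a tiny state check. Everything here is illustrative (these are not arti's real types): the point is just that each stage can only be named once the previous one has been fetched, since both the certificates and the microdescriptor digests are listed in the consensus.

```rust
/// Illustrative model of the three document kinds (not arti's real types).
#[derive(Debug, PartialEq)]
enum DirDoc {
    Consensus,
    AuthCert,
    Microdesc,
}

/// What we still need to fetch, given how far bootstrapping has gotten.
/// Certs authenticate the consensus, and microdescriptor digests are
/// listed in the consensus, so the stages must go in this order.
fn next_missing(have_consensus: bool, have_certs: bool, have_mds: bool) -> Option<DirDoc> {
    if !have_consensus {
        Some(DirDoc::Consensus)
    } else if !have_certs {
        Some(DirDoc::AuthCert)
    } else if !have_mds {
        Some(DirDoc::Microdesc)
    } else {
        None
    }
}
```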
To fetch a new directory, we currently do (approximately) this algorithm:
* Initialize `RetryDelay` "outer" and `RetryDelay` "inner".
* While the directory is incomplete or too old:
  * Figure out what documents are missing.
  * Try to download them.
  * On success, try to validate and store them.
  * If that succeeded, and we learned something from it: continue.
  * If the download or validation failed, or if we learned nothing:
    * If we have failed too many times since we last "reset" the directory,
      "reset" the directory (set it to empty), reset "inner", and wait for
      the next delay from "outer".
    * Otherwise, wait for the next delay from "inner".
* The directory is complete; declare victory.
There are a few problems with that algorithm. First, it is too happy to reset. The only case in which we should ever consider resetting the directory is when we have a consensus, but we can't find the certificates to authenticate it. (This case probably means that the consensus was never valid to begin with, and the certificates mentioned don't exist, so we should get a new consensus.)
The second problem is that when things are going wrong, it is too happy to reset its `RetryDelay` objects, and so it doesn't back off correctly.
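To see why eager resets break backoff, here is a minimal sketch of a `RetryDelay`-style exponential backoff, with delays in plain milliseconds (the real type in arti also randomizes each delay, which this sketch omits). Resetting it while things are still failing throws away the accumulated delay and puts us back to hammering the directory servers:

```rust
/// Minimal sketch of exponential backoff; not arti's actual RetryDelay,
/// which also randomizes each delay.
struct RetryDelay {
    next_ms: u64,
    max_ms: u64,
}

impl RetryDelay {
    fn new(initial_ms: u64, max_ms: u64) -> Self {
        RetryDelay { next_ms: initial_ms, max_ms }
    }

    /// Return the current delay, doubling it for next time, up to the cap.
    fn next_delay_ms(&mut self) -> u64 {
        let d = self.next_ms;
        self.next_ms = (self.next_ms * 2).min(self.max_ms);
        d
    }

    /// Start over from a short delay. Calling this while downloads are
    /// still failing is exactly the "doesn't back off correctly" bug.
    fn reset(&mut self, initial_ms: u64) {
        self.next_ms = initial_ms;
    }
}
```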
Here is a better algorithm:
* Initialize a `RetryDelay`.
* Initialize `n_failures` to 0.
* While the directory is incomplete or too old:
  * Figure out what documents are missing.
  * Try to download them.
  * On success, try to validate and store them.
  * If that succeeded, and we learned something from it:
    * Set `n_failures` to 0, and continue.
  * If the download or validation failed, or if we learned nothing:
    * Increment `n_failures`.
    * If `n_failures` is above some threshold, and the current state is
      resettable♮, reset the directory (set it to empty).
    * Wait for the next delay from our `RetryDelay`.
* The directory is complete; declare victory.
♮ The "downloading certs" state is resettable; other states are not.
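The loop above can be sketched roughly as follows. All of the names here are illustrative, not arti's actual API: `SimDir` stands in for the real directory state, the threshold value is left to the caller, waits are counted rather than slept, and (since the issue doesn't say) a directory reset deliberately does not clear `n_failures`.

```rust
/// Stand-in for the real directory state (illustrative only).
struct SimDir {
    stages_done: u32, // 0..=3: consensus, then certs, then microdescs
    fail_next: u32,   // simulate this many failing attempts first
}

impl SimDir {
    fn complete(&self) -> bool {
        self.stages_done == 3
    }
    /// One download+validate round; true if we learned something.
    fn step(&mut self) -> bool {
        if self.fail_next > 0 {
            self.fail_next -= 1;
            false
        } else {
            self.stages_done += 1;
            true
        }
    }
    /// The ♮ rule: only "downloading certs" (we have a consensus but
    /// nothing else) is resettable.
    fn resettable(&self) -> bool {
        self.stages_done == 1
    }
    fn reset(&mut self) {
        self.stages_done = 0;
    }
}

/// The improved loop from the issue; returns how many times we would
/// have waited on the RetryDelay.
fn download_until_complete(dir: &mut SimDir, threshold: u32) -> u32 {
    let mut n_failures = 0u32;
    let mut waits = 0u32;
    while !dir.complete() {
        if dir.step() {
            n_failures = 0; // made progress: clear the failure count
            continue;
        }
        n_failures += 1;
        if n_failures >= threshold && dir.resettable() {
            dir.reset(); // only ever happens in the resettable state
        }
        waits += 1; // stand-in for `retry_delay.next_delay()` + sleep
    }
    waits
}
```

Note that unlike the current algorithm, the backoff delay keeps growing across failures; nothing here resets the `RetryDelay` mid-failure.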
This change would mostly involve refactoring the function `bootstrap::download()`.
Waiting for the next directory
Every consensus document has a "lifetime", defined using three times: "Valid-After", "Fresh-Until", and "Valid-Until". The idea is that the consensus can be used safely at all times from "Valid-After" through "Valid-Until", and that you shouldn't even think of replacing it until after "Fresh-Until".
Whenever we have a complete directory, we wait until a randomly chosen time between "Fresh-Until" and "Valid-Until" before we start a new download attempt. (The time is randomly chosen to avoid a "thundering herd" of clients all trying to download at once.)
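Choosing that time is just a uniform sample over the lifetime window. A sketch, with times as Unix seconds and the random fraction passed in explicitly so the function stays deterministic (real code would draw `frac` from an RNG):

```rust
/// Pick a time, uniformly at random in [fresh_until, valid_until], at
/// which to start the next download (Unix seconds). Spreading clients
/// across the window avoids a thundering herd at Fresh-Until.
/// `frac` is a uniform sample in [0, 1); illustrative, not arti's API.
fn pick_download_time(fresh_until: u64, valid_until: u64, frac: f64) -> u64 {
    let window = valid_until.saturating_sub(fresh_until);
    fresh_until + (window as f64 * frac) as u64
}
```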
When we have an incomplete directory, we should re-start trying to download a new directory if the consensus we have is one that we would want to replace anyway.
One consideration here: with these times, we're waiting until wallclock times, not until `Instant`s. That means we need to be prepared for the possibility of a clock jump (for example, if somebody resets their clock, or the computer sleeps and wakes up). We currently use `sleep_until_wallclock` for that; you could probably implement something similar with `TaskHandle`.
And there you have it; it's not too complicated, but it is a bit hairy. I think it ought to be feasible to do this with the `TaskHandle` API.