Closed
Created Apr 07, 2022 by Nick Mathewson (@nickm)

Smarter timing for dirmgr downloads and retries

Currently there are multiple RetryDelay timers and other timers used in dirmgr. We should document them better, and simplify how they work. I was thinking of doing this as part of #329 (closed), but @eta may get to it first as part of #90.

Here's how it should work:

So, there are two main things happening in dirmgr: first, we try to fetch a complete directory, and then, once that directory grows old, we try to fetch a new one.

Fetching a directory

For background, a directory consists of multiple documents: a consensus document, a set of certificates that sign the consensus, and a set of microdescriptors whose digests are listed in the consensus. We download them in that order.
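Schematically, the pieces might look like this in Rust (the type and field names here are illustrative stand-ins, not the real arti types):

// Illustrative stand-ins for the real document types.
struct Consensus;
struct AuthCert;
struct Microdesc;

/// The pieces of a complete directory, in download order.
struct Directory {
    /// The consensus: a signed list of relays and microdescriptor digests.
    consensus: Consensus,
    /// Authority certificates needed to check the consensus signatures.
    certificates: Vec<AuthCert>,
    /// One microdescriptor per digest listed in the consensus.
    microdescs: Vec<Microdesc>,
}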

To fetch a new directory, we currently use (approximately) this algorithm:

Initialize RetryDelay "outer" and RetryDelay "inner".

while the directory is incomplete or too old:
    * Figure out what documents are missing.
    * Try to download them.
    * On success, try to validate and store them.
    * If that succeeded, and we learned something from it: continue.
    * If the download or validation failed, or if we learned nothing:
       * If we have failed too many times since we last "reset" the directory,
         "reset" the directory (set it to empty), reset "inner", and wait for
         the next delay from "outer".
       * Otherwise, wait for the next delay from "inner".

The directory is complete; declare victory.

There are two main problems with that algorithm. First, it is too happy to reset. The only case in which we should ever consider resetting the directory is when we have a consensus, but we can't find the certificates to authenticate it. (This case probably means that the consensus was never valid to begin with, and the certificates it mentions don't exist, so we should get a new consensus.)

The second problem is that when things are going wrong, it is too happy to reset its RetryDelay objects, and so it doesn't back off correctly.

Here is a better algorithm:

Initialize a RetryDelay.
Initialize n_failures to 0.

while the directory is incomplete or too old: 
    * Figure out what documents are missing.
    * Try to download them.
    * On success, try to validate and store them.
    * If that succeeded, and we learned something from it:
      * Set "n_failures" to 0, and continue.
    * If the download or validation failed, or if we learned nothing:
       * Increment n_failures.
       * If n_failures is above some threshold, and the current state is
         resettable♮, reset the directory (set it to empty).
       * Wait for the next delay from our RetryDelay.

The directory is complete; declare victory.

♮ The "downloading certs" state is resettable; other states are not.

This change would mostly involve refactoring the function bootstrap::download().
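For concreteness, here is a minimal Rust sketch of what the refactored loop could look like. Everything below is a stand-in: the DirState trait, try_advance(), and FAILURE_THRESHOLD are hypothetical, and this RetryDelay is a simplified, jitter-free version of the real backoff type.

use std::time::Duration;

/// Simplified stand-in for dirmgr's RetryDelay: plain exponential
/// backoff, without the randomization the real type applies.
struct RetryDelay {
    cur: Duration,
}

impl RetryDelay {
    fn new() -> Self {
        RetryDelay { cur: Duration::from_secs(1) }
    }
    fn next_delay(&mut self) -> Duration {
        let d = self.cur;
        self.cur = self.cur.saturating_mul(2);
        d
    }
}

/// Hypothetical view of the dirmgr download state.
trait DirState {
    /// True once we have a complete, sufficiently fresh directory.
    fn is_complete(&self) -> bool;
    /// Figure out what documents are missing; try to download,
    /// validate, and store them. Returns true if we learned something.
    fn try_advance(&mut self) -> bool;
    /// Only the "downloading certs" state is resettable.
    fn is_resettable(&self) -> bool;
    /// Throw the directory away (set it to empty).
    fn reset(&mut self);
}

const FAILURE_THRESHOLD: u32 = 3; // assumed value, not from this issue

fn download(state: &mut dyn DirState) {
    let mut retry = RetryDelay::new();
    let mut n_failures = 0u32;

    while !state.is_complete() {
        if state.try_advance() {
            // We learned something: clear the failure count and continue.
            n_failures = 0;
            continue;
        }
        // The download or validation failed, or we learned nothing.
        n_failures += 1;
        if n_failures > FAILURE_THRESHOLD && state.is_resettable() {
            // Probably a consensus whose certificates don't exist;
            // throw it away so we can fetch a fresh one.
            state.reset();
        }
        // Crucially, the RetryDelay is never reset here, so delays
        // keep growing for as long as things keep going wrong.
        std::thread::sleep(retry.next_delay());
    }
    // The directory is complete; declare victory.
}

The key behavioral differences from the current code are that n_failures is cleared only on success, and the backoff delay is never reset mid-bootstrap.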

Waiting for the next directory

Every consensus document has a "lifetime", defined using three times: "Valid-After", "Fresh-Until", and "Valid-Until". The idea is that the consensus can be used safely at all times from "Valid-After" through "Valid-Until", and that you shouldn't even think of replacing it until after "Fresh-Until".
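In code, that contract might be expressed like this (a sketch; the field names are assumptions, not the real tor-netdoc API):

use std::time::SystemTime;

/// The lifetime of a consensus document.
struct Lifetime {
    valid_after: SystemTime,
    fresh_until: SystemTime,
    valid_until: SystemTime,
}

impl Lifetime {
    /// The consensus can be used safely at any time in
    /// [valid_after, valid_until].
    fn usable_at(&self, t: SystemTime) -> bool {
        self.valid_after <= t && t <= self.valid_until
    }
    /// We shouldn't even think of replacing it before fresh_until.
    fn replaceable_at(&self, t: SystemTime) -> bool {
        t > self.fresh_until
    }
}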

Whenever we have a complete directory, we wait until a randomly chosen time between "Fresh-Until" and "Valid-Until" before we start a new download attempt. (The time is randomly chosen to avoid a "thundering herd" of clients all trying to download at once.)
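Choosing that time could look like this (a sketch using the rand crate; none of this is the actual arti implementation):

use std::time::{Duration, SystemTime};
use rand::Rng;

/// Pick a uniformly random wallclock time in [fresh_until, valid_until]
/// at which to begin fetching the next directory.
fn pick_download_time(fresh_until: SystemTime, valid_until: SystemTime) -> SystemTime {
    let window = valid_until
        .duration_since(fresh_until)
        .unwrap_or_default(); // treat an out-of-order pair as an empty window
    let offset = rand::thread_rng().gen_range(0..=window.as_secs());
    fresh_until + Duration::from_secs(offset)
}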

When we have an incomplete directory, we should re-start trying to download a new directory if the consensus we have is one that we would want to replace anyway.

One consideration here: these are wallclock times, not Instants. That means we need to be prepared for the possibility of a clock jump (for example, if somebody resets their clock, or if the computer sleeps and wakes up). We currently use sleep_until_wallclock for that; you could probably implement something similar with TaskHandle.
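One common approach is to avoid a single long sleep: sleep in bounded increments and re-check the wallclock each time around. Here is a simple blocking sketch of the idea (MAX_SLEEP is an assumed value, and this is not the real dirmgr implementation):

use std::time::{Duration, SystemTime};

/// Upper bound on any single sleep; if the clock jumps (a manual reset,
/// or suspend/resume), we notice within this bound.
const MAX_SLEEP: Duration = Duration::from_secs(600); // assumed value

/// Block until the system clock reaches `when`, tolerating clock jumps.
fn sleep_until_wallclock(when: SystemTime) {
    loop {
        match when.duration_since(SystemTime::now()) {
            // `when` is already in the past (or the clock jumped over it).
            Err(_) => return,
            Ok(remaining) => std::thread::sleep(remaining.min(MAX_SLEEP)),
        }
    }
}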


And there you have it; it's not too complicated, but it is a bit hairy. I think it ought to be feasible to do this with the TaskHandle API.
