Skip to content

DirMgr: Revise error handling to better tolerate reset-able failures

Nick Mathewson requested to merge nickm/arti:cert_dl_recovery_v3 into main

Previously we had only two kinds of errors that the State implementations could return: fatal and nonfatal. The fatal ones aborted the bootstrapping process, whereas the nonfatal ones were ignored.

That's not so great, for two reasons:

  • Some nonfatal errors are a fine reason to stop using a given directory cache. (If we treat these as nonfatal, we just keep using the same cache over and over.)
  • Some nonfatal errors mean that we should restart the bootstrapping process with a new consensus download attempt. (If we treat these as fatal, we abort the process when it can actually continue.)

The motivating examples here are issue #438 (closed), #439 (closed), and #440: for all of them, we need a way to say, in the certificate-downloading stage, that the consensus that we're trying to validate was no good and we need to get a new one.

This branch addresses these issues with two main design changes:

  • A State can now report a "blocking error" that means that the bootstrapping process needs to reset.
  • The State::add_from_{cache,download} functions can now report nonfatal errors.

There are also several specific places that needed changing to get the right behavior here, and tamp down a few other errors I found along the way.

Closes #439 (closed).

Part of #329 (closed).

Assigning review to @eta since she's stared into this particular abyss in the past.

Merge request reports

Loading