Skip to content
Snippets Groups Projects

DirMgr: Revise error handling to better tolerate reset-able failures

Merged Nick Mathewson requested to merge nickm/arti:cert_dl_recovery_v3 into main
4 unresolved threads

Previously we had only two kinds of errors that the State implementations could return: fatal and nonfatal. The fatal ones aborted the bootstrapping process, whereas the nonfatal ones were ignored.

That's not so great, for two reasons:

  • Some nonfatal errors are a fine reason to stop using a given directory cache. (If we treat these as nonfatal, we just keep using the same cache over and over.)
  • Some nonfatal errors mean that we should restart the bootstrapping process with a new consensus download attempt. (If we treat these as fatal, we abort the process when it can actually continue.)

The motivating examples here are issue #438 (closed), #439 (closed), and #440: for all of them, we need a way to say, in the certificate-downloading stage, that the consensus that we're trying to validate was no good and we need to get a new one.

This branch addresses these issues with two main design changes:

  • A State can now report a "blocking error" that means that the bootstrapping process needs to reset.
  • The State::add_from_{cache,download} functions can now report nonfatal errors.

There are also several specific places that needed changing to get the right behavior here, and tamp down a few other errors I found along the way.

Closes #439 (closed).

Part of #329 (closed).

Assigning review to @eta since she's stared into this particular abyss in the past.

Merge request reports

Approved by

Merged by etaeta 2 years ago (May 19, 2022 12:32pm UTC)

Merge details

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Nick Mathewson requested review from @eta

    requested review from @eta

  • assigned to @nickm

  • mentioned in issue #438 (closed)

  • Nick Mathewson mentioned in issue #177

    mentioned in issue #177

  • Nick Mathewson mentioned in issue #440

    mentioned in issue #440

  • mentioned in issue #329 (closed)

  • Nick Mathewson added 1 commit

    added 1 commit

    • f83146be - Fix a portability issue with Rust 1.56

    Compare with previous version

  • Contributor

    Hey @nickm! I think this MR fixes some real problems that we previously overlooked (in particular, a way to assign blame based on DocSource is very valuable), but I'm a bit concerned about how convoluted the error-handling pathways in the bootstrapping process seem to be getting.

    It seems like we now have multiple ways to return errors from the state machine:

    • returning Ok((_, Some(err)), a "non-fatal error", which
      • will call note_cache_error, if it happened during a download
      • will be logged, if it happened adding from state
    • returning Err(e), which
      • will retry a download attempt, if it happened during a download (without calling note_cache_error!)
      • will abort the whole bootstrapping process, having the caller call reset(), if it happened adding from state
    • storing an error away and returning it as a "blocking error", which
      • will abort the whole bootstrapping process, no matter where it happens

    I had to trace through the code a fair deal to figure out what the three different kinds of error handling do, and I'm still not convinced that we handle every case properly, or indeed will continue to do so in the future. This also ties in with #433 a tad: we currently do retries on error at multiple different levels and it's a bit confusing. (I did eventually plan to refactor this, and the v1 dirmgr rework MR made some steps toward this, but that didn't land in v2.)

    As for how to fix this, my opinions are mostly "make returning Err(e) the path that all errors go through, and invest effort in handling that path well instead of strapping other forms of error handling around it as a bandaid". I think it should be possible to lean on the actual error types more in order to accomplish this; if we can't tell what action should be taken based on the error type, we could either create more specific error types, or maybe make the returned error something like struct MaybeRecoverableError<T> { inner: T, is_recoverable: bool } if we're really struggling.

    I'm conflicted as to whether this needs to be fixed in this MR, though. If you have spare bandwidth to work on it, it'd be nice, but if you'd rather I take a look at this myself, I'm happy to merge this and then follow it up with another MR!

  • Nick Mathewson added 4 commits

    added 4 commits

    • 6bacf3b3 - DirMgr: Start refactoring error handling.
    • d7a3fd2c - DirMgr: Remove special handling of "changed" boolean
    • 53ed5f40 - DirMgr: Unify error return paths
    • 5de1b931 - DirMgr: Remove blocking_error return path.

    Compare with previous version

  • Thanks for the feedback! As discussed on IRC, I've tried to refactor and unify the error handling of this logic. I'm afraid it got a bit involved, but it still works with my tests.

  • Nick Mathewson added 1 commit

    added 1 commit

    • c70916c0 - Fix compilation with Rust 1.56.

    Compare with previous version

  • eta
    eta @eta started a thread on an outdated change in commit 6bacf3b3
150 /// The error requires that we restart bootstrapping from scratch.
151 ///
152 /// This kind of error typically means that we've downloaded a consensus
153 /// that turned out to be useless at a later stage, and so we need to
154 /// restart the downloading process from the beginning, by downloading a
155 /// fresh one.
156 Reset,
157 /// The error indicates that we cannot bootstrap, and should stop trying.
158 ///
159 /// These are typically internal programming errors, filesystem access
160 /// problems, directory manager shutdown, and the like.
161 Fatal,
162 /// The error should not even be possible during the bootstrapping process.
163 ///
164 /// Having it occur there indicates a bug.
165 Impossible,
  • eta
    eta @eta started a thread on commit 53ed5f40
  • 321 330
    322 match state.add_from_cache(documents) {
    323 Err(Error::UntimelyObject(TimeValidityError::Expired(_))) => {
    324 // This is just an expired object from the cache; we don't need
    325 // to call that an error. Treat it as if it were absent.
    326 Ok(None)
    327 }
    328 other => other,
    329 }
    331 state.add_from_cache(documents)
    330 332 };
    331 333
    332 if outcome.is_ok() {
    333 dirmgr.update_status(state.bootstrap_status());
    334 }
    334 dirmgr.update_status(state.bootstrap_status());
    • Contributor

      This feels like you meant to get rid of outcome and didn't, but changed this part as if you had. (Even if that wasn't your intention, I still think you should get rid of outcome, because this is confusing to read.)

    • Actually, outcome is still necessary; I'm adding a comment to explain why.

    • Please register or sign in to reply
  • eta
    eta @eta started a thread on an outdated change in commit 53ed5f40
  • 588 605 warn!(n_attempts=retry_config.n_attempts(),
    589 606 state=%state.describe(),
    590 607 "Unable to advance downloading state");
    591 return Ok((state, Some(Error::CantAdvanceState)));
    608 return Err(Error::CantAdvanceState);
    592 609 }
    593 610 }
    594 611
    612 /// Replace `state` with `state.reset()`.
    613 fn reset(state: &mut Box<dyn DirState>) {
    614 let mut this_state: Box<dyn DirState> = Box::new(PoisonedState);
    615 std::mem::swap(&mut this_state, state);
    616 this_state = this_state.reset();
    617 std::mem::swap(&mut this_state, state);
  • eta approved this merge request

    approved this merge request

  • Contributor

    Nice! Feel free to merge this yourself after fixing the comments (only one is blocking, though).

  • Nick Mathewson added 3 commits

    added 3 commits

    • 6aebb18e - Remove BootstrapAction::Impossible
    • eab0046d - Explain why we call update_status unconditionally.
    • 5b5b4bbb - Simplify advance and reset functions with mem::replace.

    Compare with previous version

    • Hi! I've made the requested changes except for the one with outcome, where I've added a comment instead. Since I think that was the blocking comment, I'm checking in before I merge. Is this better now?

    • Contributor

      Yeah, makes sense! Merging...

    • Please register or sign in to reply
  • eta enabled an automatic merge when the pipeline for 5b5b4bbb succeeds

    enabled an automatic merge when the pipeline for 5b5b4bbb succeeds

  • merged

  • eta mentioned in commit b4e9e422

    mentioned in commit b4e9e422

  • Nick Mathewson mentioned in issue #402

    mentioned in issue #402

  • mentioned in issue #482 (closed)

  • Nick Mathewson mentioned in merge request !526 (closed)

    mentioned in merge request !526 (closed)

  • Nick Mathewson mentioned in merge request !527 (merged)

    mentioned in merge request !527 (merged)

  • Please register or sign in to reply
    Loading