DirMgr: Revise error handling to better tolerate reset-able failures
Previously we had only two kinds of errors that the `State` implementations could return: fatal and nonfatal. The fatal ones aborted the bootstrapping process, whereas the nonfatal ones were ignored.
That's not so great, for two reasons:
- Some nonfatal errors are a fine reason to stop using a given directory cache. (If we treat these as nonfatal, we just keep using the same cache over and over.)
- Some nonfatal errors mean that we should restart the bootstrapping process with a new consensus download attempt. (If we treat these as fatal, we abort the process when it can actually continue.)
The motivating examples here are issues #438 (closed), #439 (closed), and #440: for all of them, we need a way to say, in the certificate-downloading stage, that the consensus we're trying to validate was no good and we need to get a new one.
This branch addresses these issues with two main design changes (a minimal sketch of the resulting shape follows the list):
- A `State` can now report a "blocking error" that means that the bootstrapping process needs to reset.
- The `State::add_from_{cache,download}` functions can now report nonfatal errors.
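Very roughly, and with made-up names, this is the shape of the change; it is a sketch, not the actual arti `DirState`/`Error` API:

```rust
/// Placeholder error type standing in for dirmgr's real Error.
#[derive(Debug)]
pub struct Error(pub String);

/// What a "blocking error" asks the bootstrapping process to do.
pub enum Blockage {
    /// Restart bootstrapping with a fresh consensus download.
    Reset,
    /// Stop bootstrapping entirely.
    Fatal,
}

pub trait State {
    /// Add documents from the cache.  `Ok(Some(err))` reports a nonfatal
    /// error alongside whatever progress was made; `Err(_)` is fatal.
    fn add_from_cache(&mut self, docs: &[Vec<u8>]) -> Result<Option<Error>, Error>;

    /// Like `add_from_cache`, but for documents we just downloaded.
    fn add_from_download(&mut self, docs: &[Vec<u8>]) -> Result<Option<Error>, Error>;

    /// Report a stored blocking error, if any, so the caller can decide
    /// whether to reset the bootstrapping process or give up.
    fn blockage(&self) -> Option<Blockage>;
}
```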
There are also several specific places that needed changing to get the right behavior here, and to tamp down a few other errors I found along the way.
Closes #439 (closed).
Part of #329 (closed).
Assigning review to @eta since she's stared into this particular abyss in the past.
Activity
requested review from @eta
assigned to @nickm
mentioned in issue #438 (closed)
mentioned in issue #177
mentioned in issue #440
mentioned in issue #329 (closed)
Hey @nickm! I think this MR fixes some real problems that we previously overlooked (in particular, a way to assign blame based on `DocSource` is very valuable), but I'm a bit concerned about how convoluted the error-handling pathways in the bootstrapping process seem to be getting. It seems like we now have multiple ways to return errors from the state machine (sketched roughly in code right after this list):
- returning `Ok((_, Some(err)))`, a "non-fatal error", which
  - will call `note_cache_error`, if it happened during a download
  - will be logged, if it happened adding from state
- returning `Err(e)`, which
  - will retry a download attempt, if it happened during a download (without calling `note_cache_error`!)
  - will abort the whole bootstrapping process, having the caller call `reset()`, if it happened adding from state
- storing an error away and returning it as a "blocking error", which
  - will abort the whole bootstrapping process, no matter where it happens
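Roughly, here's how those three paths look from a caller's point of view. Everything in this sketch (the trait, `note_cache_error`, the error type) is an illustrative stand-in rather than the actual arti code:

```rust
#[derive(Debug)]
struct Error(&'static str);

trait DirState {
    /// `Ok(Some(err))` is the "non-fatal error" path; `Err(_)` is the
    /// ordinary error path.
    fn add_from_download(&mut self, docs: &[Vec<u8>]) -> Result<Option<Error>, Error>;
    /// The stored "blocking error", if any.
    fn blocking_error(&self) -> Option<Error>;
}

fn note_cache_error(source: &str, err: &Error) {
    eprintln!("problem with cache {source}: {err:?}");
}

fn handle_download_step(
    state: &mut dyn DirState,
    docs: &[Vec<u8>],
    source: &str,
) -> Result<(), Error> {
    match state.add_from_download(docs) {
        // Path 1: non-fatal error -> note_cache_error / logging, keep going.
        Ok(Some(nonfatal)) => note_cache_error(source, &nonfatal),
        Ok(None) => {}
        // Path 2: Err(e) -> retried during download, or propagated upward.
        Err(e) => return Err(e),
    }
    // Path 3: a blocking error aborts the whole bootstrapping process.
    if let Some(blocking) = state.blocking_error() {
        return Err(blocking);
    }
    Ok(())
}
```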
I had to trace through the code a fair amount to figure out what the three different kinds of error handling do, and I'm still not convinced that we handle every case properly, or indeed will continue to do so in the future. This also ties in with #433 a tad: we currently do retries on error at multiple different levels, and it's a bit confusing. (I did eventually plan to refactor this, and the v1 dirmgr rework MR made some steps toward this, but that didn't land in v2.)
As for how to fix this, my opinion is mostly "make returning `Err(e)` the path that all errors go through, and invest effort in handling that path well instead of strapping other forms of error handling around it as a bandaid". I think it should be possible to lean on the actual error types more in order to accomplish this; if we can't tell what action should be taken based on the error type, we could either create more specific error types, or maybe make the returned error something like `struct MaybeRecoverableError<T> { inner: T, is_recoverable: bool }` if we're really struggling (see the sketch below).

I'm conflicted as to whether this needs to be fixed in this MR, though. If you have spare bandwidth to work on it, it'd be nice, but if you'd rather I take a look at this myself, I'm happy to merge this and then follow it up with another MR!
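To make that last suggestion concrete, here's a tiny sketch; the struct is just the one named above, and the caller-side branching is my assumption:

```rust
/// The wrapper suggested above: an error plus a flag saying whether the
/// bootstrapping attempt can recover from it.
struct MaybeRecoverableError<T> {
    inner: T,
    is_recoverable: bool,
}

/// Assumed caller-side handling: recoverable errors reset the attempt,
/// everything else aborts bootstrapping.
fn handle<T: std::fmt::Debug>(err: MaybeRecoverableError<T>) {
    if err.is_recoverable {
        println!("recoverable, resetting: {:?}", err.inner);
    } else {
        println!("fatal, aborting: {:?}", err.inner);
    }
}
```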
```rust
    /// The error requires that we restart bootstrapping from scratch.
    ///
    /// This kind of error typically means that we've downloaded a consensus
    /// that turned out to be useless at a later stage, and so we need to
    /// restart the downloading process from the beginning, by downloading a
    /// fresh one.
    Reset,
    /// The error indicates that we cannot bootstrap, and should stop trying.
    ///
    /// These are typically internal programming errors, filesystem access
    /// problems, directory manager shutdown, and the like.
    Fatal,
    /// The error should not even be possible during the bootstrapping process.
    ///
    /// Having it occur there indicates a bug.
    Impossible,
```
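For context, a self-contained sketch of how a bootstrapping loop might dispatch on these variants; the enclosing enum isn't shown in the excerpt, so `BlockageKind` and `describe_reaction` are placeholders rather than the real names:

```rust
/// Illustrative stand-in for the enum containing the variants shown above.
enum BlockageKind {
    Reset,
    Fatal,
    Impossible,
}

/// How the bootstrapping loop might react to each kind (an assumption,
/// not the actual arti code).
fn describe_reaction(kind: &BlockageKind) -> &'static str {
    match kind {
        BlockageKind::Reset => "throw away the current consensus and restart bootstrapping",
        BlockageKind::Fatal => "stop bootstrapping and report the error to the caller",
        BlockageKind::Impossible => "treat as a bug: this should never happen while bootstrapping",
    }
}
```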
```diff
-        match state.add_from_cache(documents) {
-            Err(Error::UntimelyObject(TimeValidityError::Expired(_))) => {
-                // This is just an expired object from the cache; we don't need
-                // to call that an error. Treat it as if it were absent.
-                Ok(None)
-            }
-            other => other,
-        }
+        state.add_from_cache(documents)
     };

-    if outcome.is_ok() {
-        dirmgr.update_status(state.bootstrap_status());
-    }
+    dirmgr.update_status(state.bootstrap_status());
```

```diff
         warn!(n_attempts=retry_config.n_attempts(),
               state=%state.describe(),
               "Unable to advance downloading state");
-        return Ok((state, Some(Error::CantAdvanceState)));
+        return Err(Error::CantAdvanceState);
     }
 }

+/// Replace `state` with `state.reset()`.
+fn reset(state: &mut Box<dyn DirState>) {
+    let mut this_state: Box<dyn DirState> = Box::new(PoisonedState);
+    std::mem::swap(&mut this_state, state);
+    this_state = this_state.reset();
+    std::mem::swap(&mut this_state, state);
+}
```
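The swap dance in `reset` suggests that `DirState::reset` consumes the boxed state and returns a replacement; here is a self-contained sketch of that assumed shape, using simplified stand-ins rather than the real arti types:

```rust
/// Simplified stand-in for the real DirState trait: `reset` takes the
/// state by value (as a Box) and returns the state to restart from.
trait DirState {
    fn reset(self: Box<Self>) -> Box<dyn DirState>;
}

/// Placeholder left in `state` while we hold the real state by value;
/// it is only observable if `reset` panics partway through.
struct PoisonedState;
impl DirState for PoisonedState {
    fn reset(self: Box<Self>) -> Box<dyn DirState> {
        self
    }
}

/// Replace `state` with `state.reset()` even though we only have `&mut`
/// access to the Box, by temporarily swapping in `PoisonedState`.
fn reset(state: &mut Box<dyn DirState>) {
    let mut this_state: Box<dyn DirState> = Box::new(PoisonedState);
    std::mem::swap(&mut this_state, state);
    this_state = this_state.reset();
    std::mem::swap(&mut this_state, state);
}
```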
enabled an automatic merge when the pipeline for 5b5b4bbb succeeds
mentioned in commit b4e9e422
mentioned in issue #402
mentioned in issue #482 (closed)
mentioned in merge request !526 (closed)
mentioned in merge request !527 (merged)