Lox resources bypassing grace period
I think there may be an issue with how our `Working`/`NotWorking` resource dichotomy plays out in Lox. This was being tracked in lox#69, but I think the issue stems from rdsys, specifically these lines of code:
```go
	// Excerpt: the enclosing function signature is omitted.
	var resourceState = ResourceState{}
	for _, resource := range hashring.GetAll() {
		rTest := resource.TestResult()
		// If the context includes not-functional bridges or the resource's state is functional
		// AND if the context doesn't use the bandwidth ratio or the speed is not speed rejected
		if (!ctx.OnlyFunctional || rTest.State == StateFunctional) && (!ctx.UseBandwidthRatio || rTest.Speed != SpeedRejected) {
			resourceState.Working = append(resourceState.Working, resource)
		} else {
			resourceState.Notworking = append(resourceState.Notworking, resource)
		}
	}
	return resourceState
}
```
Because Lox sorts bridges into static buckets until they're either blocked or not working for an extended period, we try to be lenient when deciding (in Lox) that a bridge is not working by including a grace period. A previously working bridge (defined by its `lastPassed` time) currently has a grace period during which it can be dysfunctional for up to 18 hours after its last passed test. However, I'm noticing that no bridges are assigned to the grace period at all. There were some issues with bridgestrap over the last few days that seemed to exacerbate the problem. The logs showed that once bridges were detected as failing, they were immediately removed from their buckets, and when they came back online they were reassigned to a new bucket.
I think the problem is coming from the above logic. The `OnlyFunctional` flag changes with the number of functional resources, and while it is unset, dysfunctional bridges are sorted into the `working` resources. If these bridges are then suddenly determined to be `notworking`, their `lastPassed` time may already be outside of the grace period threshold, so they skip the grace period entirely.
One possible solution is to replace the `lastPassed` time with a `lastWorking` time that records when a resource was last sorted into the `working` category. This should mitigate the problem, so it will be the first thing I try.