Looking at weasel's dirauth we get plenty of related error messages:
[warn] Bw Weights error 1 for Case 3be (E scarce, Wee=1, Wmd == Wgd) v10. G=14793673 M=8310679 E=3428814 D=7040272 T=26698303 Wmd=-512 Wme=0 Wmg=2192 Wed=11025 Wee=10000 Wgd=-512 Wgg=7808 Wme=0 Wmg=2192 weight_scale=10000
That is, it seems like the bandwidth calculation of the dirauths got screwed a bit. This might be the result of using the guardfraction data during bandwidth calculation in compute_weighted_bandwidths() that might combine badly with legacy/trac#13297 (moved).
The negative Wmd/Wgd values seem weird here as well.
Let's disable the GuardFraction feature for now, till we figure out this issue.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
OK, I found a silly bug that might be causing this problem.
When we were calculating the total weights (T, G, M, E, D), while we were considering guardfraction information on the D/E weights and G/M weights, we didn't increase the total weight T by the right amount.
So for example, if a router with bandwidth 1000 is 75% guard, it means it has 750 bandwidth as a guard (should be assigned to G), and 250 bandwidth as a non-guard (should be assigned to M). The issue is that we only incremented T by 750, whereas we should increment it by 750 + 250.
This lead to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC.
This is probably why we ended up with negative weights in the end.
The above is an easy issue to fix. A harder issue to fix is that clients with disabled guardfraction, will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction they will end up with different values than the authorities and then they will complain with stuff like:
Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900
Also, because of legacy/trac#13297 (moved) the authorities themselves will also complain after parsing their own consensus.
OK, I found a silly bug that might be causing this problem.
When we were calculating the total weights (T, G, M, E, D), while we were considering guardfraction information on the D/E weights and G/M weights, we didn't increase the total weight T by the right amount.
So for example, if a router with bandwidth 1000 is 75% guard, it means it has 750 bandwidth as a guard (should be assigned to G), and 250 bandwidth as a non-guard (should be assigned to M). The issue is that we only incremented T by 750, whereas we should increment it by 750 + 250.
This lead to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC.
This is probably why we ended up with negative weights in the end.
The above is an easy issue to fix. A harder issue to fix is that clients with disabled guardfraction, will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction they will end up with different values than the authorities and then they will complain with stuff like:
{{{
Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900
}}}
Hm, that's not entirely true. Only authorities validate the bw weights in networkstatus_verify_bw_weights(). Hence we just need to introduce guardfraction logic to that function. Ideally, we would use update_total_bandwidth_weights() there as well.
OK, I now have a branch bug16255_027 which addresses the issues I found.
Specifically:
We used to mess up the calculation of total bandwidth T when guardfraction relays were considered. That's because we only updated T with the guard bandwidth, and ignored the non-guard bandwidth. This should be addressed in commit 79f6c49 and should fix the Bw Weight Failure errors that blocked consensus weight creation.
As authorities we used to throw warnings when verifying the bandwidths of our own consensus, because we did not first apply the guardfraction information. This commit fixes this by applying the guardfraction information when parsing consensuses as authorities.
We should probably get this applied to most of authorities before turning the feature on again.
I also prepared a backport branch for 0.2.6 diruahts bug16255_026 that only includes the first fix (which is very small), and can be used if we want to try this feature on faster. Consensus weights should work with the backport branch, but the authorities will issue a warning when verifying their own consensus.
In general, it's a PITA testing bandwidth-related features with chutney because the bandwidth values in a test network are nothing compared to reality. Even without the above fixes, sometimes my chutney network will generate consensus weights because the bandwidth values are so low it doesn't even notice the T weight being off.
I'd appreciate a good review here and also pointer to other places in the guardfraction design that could cause failures.
Taking a quick note, I need to double-check this, but:
Does this patch change how the input votes affect the output consensus? If so, it must use a new consensus-method.
This fix indeed changes how the input votes affect the output consensus. That's because authorities with this patch will calculate different total bandwidth weights and Wgg/Wgd/Wmg/etc. values than unpatched authorities.
Still, I decided to not add a new consensus method for the following reasons:
We already have a consensus method for guardfraction (20). I would have to add a second stronger one just for this fix which seemed ugly (but this should not stop us if it's the right thing to do).
If I added a new consensus method, I would also have to change the encoding format on the votes (GuardFraction=50, etc). Otherwise, unpatched authorities that support the old consensus method (20) would still parse the GuardFraction values of the patched authorities thinking that they understand them. That would be a mistake since at that point unpatched authorities and patched authorities would generate different consensuses. So if we add a new consensus method we should probably also change the encoding format to something like: GuardFractionVal=50 or Guardfraction=50 or something, so that unpatched authorities do not understand it.
For the two reasons above, I decided (perhaps wrongly) to not add a new consensus method. Instead I provided both 026 and 027 patches and my plan was to ask the authorities to turn on the feature again only after a majority of the authorities have patched themselves.
Do you think that's a dangerous plan, and instead I should just make a new consensus method?
I truly do think there should be a new consensus-method here.
After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.
I truly do think there should be a new consensus-method here.
After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.
OK that makes sense then.
So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?
And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?
I truly do think there should be a new consensus-method here.
After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.
OK that makes sense then.
So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?
This is a little tricky. Yes, but you'll need to add a 22 for 0.2.6, and a 23 for 0.2.7. And the code in 0.2.7 that checks for MIN_METHOD_FOR_ED25519_ID_* needs to make sure that it isn't including 22.
And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?
Whatever you like. "GuardPct" would be one option.
Trac: Status: needs_revision to assigned Keywords: TorCoreTeam201508 deleted, TorCoreTeam201509 PostFreeze027 added Owner: N/Ato asn Priority: normal to major