Guardfraction on dirauths screws up bandwidth weights

changed milestone to %Tor: unspecified in legacy/trac

added SponsorU-deferred in Legacy / Trac component::core tor/tor in Legacy / Trac mike-can in Legacy / Trac milestone::Tor: unspecified in Legacy / Trac owner::asn in Legacy / Trac parent::24456 in Legacy / Trac points::medium/large in Legacy / Trac priority::high in Legacy / Trac severity::normal in Legacy / Trac status::needs-revision in Legacy / Trac tor-dirauth in Legacy / Trac tor-guard in Legacy / Trac type::defect in Legacy / Trac labels

Ok, I disabled GuardfractionFile on moria1 for now.

OK, I found a silly bug that might be causing this problem.

When we were calculating the total weights (T, G, M, E, D), while we were considering guardfraction information on the D/E weights and G/M weights, we didn't increase the total weight T by the right amount.

So for example, if a router with bandwidth 1000 is 75% guard, it means it has 750 bandwidth as a guard (should be assigned to G), and 250 bandwidth as a non-guard (should be assigned to M). The issue is that we only incremented T by 750, whereas we should increment it by 750 + 250.

This lead to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC. This is probably why we ended up with negative weights in the end.

The above is an easy issue to fix. A harder issue to fix is that clients with disabled guardfraction, will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction they will end up with different values than the authorities and then they will complain with stuff like:

Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900

Also, because of legacy/trac#13297 (moved) the authorities themselves will also complain after parsing their own consensus.

Need to think more how to correctly deploy this.

Replying to asn:

OK, I found a silly bug that might be causing this problem.

When we were calculating the total weights (T, G, M, E, D), while we were considering guardfraction information on the D/E weights and G/M weights, we didn't increase the total weight T by the right amount.

So for example, if a router with bandwidth 1000 is 75% guard, it means it has 750 bandwidth as a guard (should be assigned to G), and 250 bandwidth as a non-guard (should be assigned to M). The issue is that we only incremented T by 750, whereas we should increment it by 750 + 250.

This lead to T being much smaller than the sum G + M + E + D, which shouldn't happen IIUC. This is probably why we ended up with negative weights in the end.

The above is an easy issue to fix. A harder issue to fix is that clients with disabled guardfraction, will try to validate the total weights in networkstatus_verify_bw_weights(). Since they don't understand guardfraction they will end up with different values than the authorities and then they will complain with stuff like: {{{ Jun 02 17:47:59.000 [warn] Bw Weight Failure for Case 3b: Etotal 10050.492000 != Mtotal 9436.164000. G=12930 M=0 E=0 D=21480 T=34410. Wgg=1.000000 Wgd=0.092800 Wmg=0.000000 Wme=0.000000 Wmd=0.439300 Wee=1.000000 Wed=0.467900 }}}

Hm, that's not entirely true. Only authorities validate the bw weights in networkstatus_verify_bw_weights(). Hence we just need to introduce guardfraction logic to that function. Ideally, we would use update_total_bandwidth_weights() there as well.

Trac:
Keywords: N/A deleted, https://trac.torproject.org/projects/tor/query?status=!closed&keywords=~TorCoreTeam201507 added

Trac:
Milestone: Tor: 0.2.6.x-final to Tor: 0.2.7.x-final

Trac:
Keywords: https://trac.torproject.org/projects/tor/query?status=!closed&keywords=~TorCoreTeam201507 deleted, TorCoreTeam201507 added

OK, I now have a branch bug16255_027 which addresses the issues I found.

Specifically:

We used to mess up the calculation of total bandwidth T when guardfraction relays were considered. That's because we only updated T with the guard bandwidth, and ignored the non-guard bandwidth. This should be addressed in commit 79f6c49 and should fix the Bw Weight Failure errors that blocked consensus weight creation.
As authorities we used to throw warnings when verifying the bandwidths of our own consensus, because we did not first apply the guardfraction information. This commit fixes this by applying the guardfraction information when parsing consensuses as authorities.

We should probably get this applied to most of authorities before turning the feature on again.

I also prepared a backport branch for 0.2.6 diruahts bug16255_026 that only includes the first fix (which is very small), and can be used if we want to try this feature on faster. Consensus weights should work with the backport branch, but the authorities will issue a warning when verifying their own consensus.

In general, it's a PITA testing bandwidth-related features with chutney because the bandwidth values in a test network are nothing compared to reality. Even without the above fixes, sometimes my chutney network will generate consensus weights because the bandwidth values are so low it doesn't even notice the T weight being off.

I'd appreciate a good review here and also pointer to other places in the guardfraction design that could cause failures.

Trac:
Status: new to needs_review

Trac:
Keywords: TorCoreTeam201507 deleted, TorCoreTeam201508 added

Taking a quick note, I need to double-check this, but:

Does this patch change how the input votes affect the output consensus? If so, it must use a new consensus-method.

Replying to nickm:

Taking a quick note, I need to double-check this, but:

Does this patch change how the input votes affect the output consensus? If so, it must use a new consensus-method.

This fix indeed changes how the input votes affect the output consensus. That's because authorities with this patch will calculate different total bandwidth weights and Wgg/Wgd/Wmg/etc. values than unpatched authorities.

Still, I decided to not add a new consensus method for the following reasons:

We already have a consensus method for guardfraction (20). I would have to add a second stronger one just for this fix which seemed ugly (but this should not stop us if it's the right thing to do).
If I added a new consensus method, I would also have to change the encoding format on the votes (GuardFraction=50, etc). Otherwise, unpatched authorities that support the old consensus method (20) would still parse the GuardFraction values of the patched authorities thinking that they understand them. That would be a mistake since at that point unpatched authorities and patched authorities would generate different consensuses. So if we add a new consensus method we should probably also change the encoding format to something like: GuardFractionVal=50 or Guardfraction=50 or something, so that unpatched authorities do not understand it.

For the two reasons above, I decided (perhaps wrongly) to not add a new consensus method. Instead I provided both 026 and 027 patches and my plan was to ask the authorities to turn on the feature again only after a majority of the authorities have patched themselves.

Do you think that's a dangerous plan, and instead I should just make a new consensus method?

Thanks!

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.

Replying to nickm:

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.

OK that makes sense then.

So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?

And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?

Replying to asn:

Replying to nickm:

I truly do think there should be a new consensus-method here.

After all, the entire point of having consensus-method support in Tor is so that we should never have to tell all the authorities "Okay, now everybody turn on the feature!" Using a consensus-method means that the feature turns itself on once everybody has it, and not before.

OK that makes sense then.

So, should I bump MIN_METHOD_FOR_GUARDFRACTION to 22?

This is a little tricky. Yes, but you'll need to add a 22 for 0.2.6, and a 23 for 0.2.7. And the code in 0.2.7 that checks for MIN_METHOD_FOR_ED25519_ID_* needs to make sure that it isn't including 22.

And also, should I change the GuardFraction= encoding format to something else, so that unpatched authorities don't use it? To what? Do you prefer GuardFractionPercentage= or Guardfraction= or GuardFractionVal= or what?

Whatever you like. "GuardPct" would be one option.

Trac:
Status: needs_review to needs_revision

Trac:
Status: needs_revision to assigned
Keywords: TorCoreTeam201508 deleted, TorCoreTeam201509 PostFreeze027 added
Owner: N/A to asn
Priority: normal to major

Trac:
Status: assigned to needs_revision

asn says this should be ready in mid-sep, which makes it 0.2.8 material

Trac:
Keywords: TorCoreTeam201509 PostFreeze027 deleted, TorCoreTeam201509 added
Milestone: Tor: 0.2.7.x-final to Tor: 0.2.8.x-final

Trac:
Keywords: N/A deleted, 0.2.8.x-triage added

Guardfraction on dirauths screws up bandwidth weights

Child items ...

Activity