When an authority has measured bandwidth values, it should set its thresholds for assigning the various flags, and the flags themselves, based on those values, not on advertised values.
Basically, every invocation of router_get_advertised_bandwidth*() in dirserv.c should be replaced if possible.
The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?
In case time is unsigned, the expiry check should be something like "e->as_of < now - MX_MEASUREMENT_TIME". This might also help the compiler raise the computation. (Not that it matters.)
We should be calling dirserv_clear_measured_bw_cache from dirserv_free_all.
Would DIGESTMAP_FOREACH_MODIFY and DIGESTMAP_FOREACH_MODIFY_END make the expiry function easier to write?
The node_id argument for dirserv_query_measured_bw_cache should probably be const. The documentation for that function should explain what the return value means, and what exactly happens to the *_out variables.
dirserv_get_bandwidth_for_router should take a const routerinfo
I think dirserv_cache_measured_bw should take a const parsed_line.
And an issue of medium trickiness:
It would be cool to have some unit tests for the new machinery here.
And a trickier issue:
When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should never call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)
Oh, one thing I should check before the merge: I didn't check whether there were any additional router_get_advertised_bandwidths that should get changed into dirserv_get_bandwidth_for_router.
Karsten (on #8356 (moved)) IIUC seems to think that "Run it on one authority and see what happens" will be an acceptable way to test this patch once it's revised.
The remaining question that you had (for Mike perhaps) was whether this is likely to set up some kind of horrible feedback loop. I don't think it would, assuming it just does what's on the carton and changes how flags are assigned, but a second opinion would sure be nice.
Hrmm.. I just checked the bandwidth authority source, and right now it appears we only bother to scan nodes already marked as "Fast". If we deploy this in combination with the "Unmeasured" cap from #2286 (moved), it might be possible that new nodes never get the Fast flag, and thus stay Unmeasured forever. This would be bad...
I suppose an even more important question right now is can this already happen with the current Fast cutoff vs the Unmeasured cap?
aagbsn: This might require a bw auth fix. See my comment above.
I don't think it will be too disasterous to change the bw auths to also scan non-Fast nodes, but it might slow down the scanning quite a bit. Is there a better way to handle this problem? One approach might be adding an additional OrNodeRestriction to select all nodes with either Fast or Unmeasured...
Trac: Cc: karsten, mikeperry to karsten, mikeperry, aagbsn
aagbsn: This might require a bw auth fix. See my comment above.
I don't think it will be too disasterous to change the bw auths to also scan non-Fast nodes, but it might slow down the scanning quite a bit. Is there a better way to handle this problem? One approach might be adding an additional OrNodeRestriction to select all nodes with either Fast or Unmeasured...
I think that's a fine idea. Another option is to scan all nodes which are Fast, plus all nodes which will meet the current Fast threshold if their advertised values turn out to be accurate. I don't have any sense which is better though.
I went with mikeperry's suggestion because it was easier; but it would be good to know how much longer the network takes to scan. Does anyone have current numbers on how long a complete measurement takes?
One problem with my suggestion vs Nick's is that mine might flap. I just came down with a cold, so my brain might be misfiring, but here's what I think can happen: If you're scanning Fast or Unmeasured, you see a slow unmeasured node and then scan it. If it turns out to be not-Fast, then the next time around, you don't measure it, and it becomes unmeasured again, at which point you measure it.
It's not as bad as never measuring, of course. But I wonder if it will be a substantial problem? My guess is "Probably not", except for the fact that this means there will be a lot of non-Fast nodes at that 1.0 ratio that we assign to Unmeasured nodes. This could mess up the measurement of nodes that are actually perfectly average (arithmetic mean) in terms of measuring them, because they will now be more likely to be paired with slow nodes that fell out of the Fast status and lost their measured flag.
So maybe an even simpler fix of scanning non-Fast nodes is the best one? K.I.S.S.? The one thing I would suggest if we decide to go that route is to perhaps make an extra entry in the data/bwfiles URL list. (Where the hell is this file btw? It's not in my git checkout? Did we forget to add it?)
The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?
It saves one extra block of memory for the map itself when it's empty.
In case time is unsigned, the expiry check should be something like "e->as_of < now - MX_MEASUREMENT_TIME". This might also help the compiler raise the computation. (Not that it matters.)
Done.
We should be calling dirserv_clear_measured_bw_cache from dirserv_free_all.
Done.
Would DIGESTMAP_FOREACH_MODIFY and DIGESTMAP_FOREACH_MODIFY_END make the expiry function easier to write?
Done.
The node_id argument for dirserv_query_measured_bw_cache should probably be const. The documentation for that function should explain what the return value means, and what exactly happens to the *_out variables.
Done.
dirserv_get_bandwidth_for_router should take a const routerinfo
Done.
I think dirserv_cache_measured_bw should take a const parsed_line.
Done.
And an issue of medium trickiness:
It would be cool to have some unit tests for the new machinery here.
Done; not really that tricky.
And a trickier issue:
When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should never call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)
That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?
The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?
It saves one extra block of memory for the map itself when it's empty.
Okay. I don't think this will make a big difference, but it can't hurt to leave it in.
[...]
And a trickier issue:
When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should never call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)
That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?
I think we should, for this ticket, just look at the flags. Other users of advertised bandwidth can get other tickets if they don't have them already. (Shall we open those tickets as we work on this one?)
I'll review the changes you mentioned above Friday (I hope); are you okay with implementing the "trickier issue" thing here? Let's talk online and open the new tickets together, unless you're feeling psyched to do them yourself.
The "/* Check if we cleared the whole thing and free if so */" seems unnecessary. dirserv_clear_measured_bw_cache will free this eventually, right?
It saves one extra block of memory for the map itself when it's empty.
Okay. I don't think this will make a big difference, but it can't hurt to leave it in.
[...]
And a trickier issue:
When we have a sufficient number/fraction of nodes with measured bandwidth, I think we should stop believing nodes' advertised bandwidths. That is, if the number/fraction of measured nodes is high enough, then we should never call unmeasured nodes Fast, or Guard, or HSDir, or anything else that depends on bandwidth. (Plausible?)
That sounds like a good idea potentially; what about other circumstances we use the advertised bandwidth? Should we consider those cases or just concern ourselves with the flags?
I think we should, for this ticket, just look at the flags. Other users of advertised bandwidth can get other tickets if they don't have them already. (Shall we open those tickets as we work on this one?)
I'll review the changes you mentioned above Friday (I hope); are you okay with implementing the "trickier issue" thing here? Let's talk online and open the new tickets together, unless you're feeling psyched to do them yourself.
I just created 8435; want me to go ahead and implement it? It seems like something we should do, but also like more of a feature than a bugfix, and the deadline for those is past.