See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw never recognized the change. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.
~~ Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is applies. ~~
Possibly strike this. What I may be seeing are stale max advertised values so old they seem like averages related to recent history.
Explain these bugs like we don't know what you're talking about.
If you don't provide good explanations, we don't know how serious the bug is. So we leave it for days or weeks until we have time to look at the details.
Here's a detailed explanation for this bug:
The MaxAdvertisedBandwidth was changed around 2019-05-23 00:30? to a lower value.
longclaw is running sbws. It shows no bandwidth change on 05-23, but does show a bandwidth change on 05-28, 5 days after the change. (sbws results expire after 5 days.)
This bug blocks deployment of sbws to more than half the bandwidth authorities.
We might be able to diagnose this bug better when we know the answers to these questions:
What was the MaxAdvertisedBandwidth value before and after the change?
What was the exact time of the change?
Do torflow and sbws report times in UTC?
What are the full sbws bandwidth file lines around the time of the change, and 5 days after the change?
Do we need to add more diagnostics to sbws relay lines?
Trac: Severity: Normal to Critical Priority: Medium to Very High Keywords: sbws-majority-blocker added Milestone: N/A to sbws: 1.1.x-final
Description: "SBWS longclaw never recognized the change" changed to "SBWS longclaw only recognised the change after 5 days"; the paragraph about SBWS averaging max advertised bandwidth was struck out.
Summary: SBWS logic related to max advertised bandwidth is broken to SBWS is using max advertised bandwidth from 5 day old descriptors
After the five-day delay, what the scanner actually detected was lower available bandwidth, a consequence of the reduced configured maximum: the limit was set with BandwidthRate rather than MaxAdvertisedBandwidth.
Trac: Description: "SBWS longclaw only recognised the change after 5 days" changed back to "SBWS longclaw never recognized the change"; the paragraph about averaging remains struck out.
Defect is extremely severe. SBWS should be removed from production until this is fixed. Longclaw has thirteen percent of listed relays with incorrect descriptor information more than 7% away from the published consensus.
Our acceptance criterion for sbws is a 50% variance from the consensus, because that's what torflow's variance was before we started deploying sbws. See legacy/trac#27339 (moved).
between 2019-05-25 16:00 and 2019-06-12 03:00 longclaw sbws shows descriptor bandwidth max/burst/observed for 4138 relays with values that match perfectly, out of 5593 relays present at both times
74% of relays with no update whatsoever to descriptor information for 17 days
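For anyone who wants to repeat this comparison, here is a minimal sketch of how two bandwidth files could be compared. It assumes relay lines are space-separated `key=value` pairs carrying `node_id`, `desc_bw_avg`, `desc_bw_bur` and `desc_bw_obs_last`; those key names and both function names are assumptions for illustration, not taken from the attached files.

```python
# Keys assumed to hold the descriptor max/burst/observed values on each relay line.
DESC_KEYS = ("desc_bw_avg", "desc_bw_bur", "desc_bw_obs_last")


def parse_relay_lines(text):
    """Map node_id -> tuple of descriptor bandwidth values for one bandwidth file."""
    relays = {}
    for line in text.splitlines():
        # Header lines (timestamp, version=..., =====) have no node_id and are skipped.
        pairs = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        if "node_id" not in pairs:
            continue
        relays[pairs["node_id"]] = tuple(pairs.get(key) for key in DESC_KEYS)
    return relays


def unchanged_fraction(old_text, new_text):
    """Return (unchanged, present_in_both, fraction) for two bandwidth files."""
    old, new = parse_relay_lines(old_text), parse_relay_lines(new_text)
    common = old.keys() & new.keys()
    same = sum(1 for node_id in common if old[node_id] == new[node_id])
    return same, len(common), (same / len(common) if common else 0.0)
```

With the numbers above, 4138 unchanged out of 5593 relays present in both files is a fraction of about 0.74.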
Trac: Summary: SBWS does not detect changes in max advertised bandwidth to SBWS does not detect changes in descriptor bandwidth values
implement the patch and the tests, but this is hard to test with the current test network we have
create PR and set to review
wait for it to be reviewed
release a new version
update debian package
install the new package in production when it's available (and announce we'll use new version)
then finally we can see if the patch is working
So, because it's a long process, i'm just going to add those options to longclaw's sbws, wait ~3 days (the time it takes to measure the whole network) and check if some descriptors' bandwidth changed.
If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.
perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1 along with Fetch* settings expressly for providing descriptors or b) connect directly to authorities or fallback directories when obtaining descriptors
Because sbws correctly picks up powerlay's descriptor bandwidth when it is restarted, what i did was change the advertised bandwidth of a relay i have access to after i restarted sbws (at ), and sbws detected the change at XX
I'm not sure whether i should change the sbws bugfix version now instead of waiting for more bugfixes, and whether i should have changed the one in longclaw's sbws, even if not released yet.
In tor, we call development versions "alpha-dev", and include the commit hash in the version.
Let's do the same thing with sbws?
So for example, longclaw should now be 1.1.1-alpha-dev.
See legacy/trac#30899 (moved) for a ticket, it is not urgent or important.
We don't have time to make major changes to the sbws design at this point.
Major changes are high risk, and they may introduce other bugs.
Let's just fix the bugs that we are seeing, with the smallest possible changes.
Right now sbws is 1.1.1-dev0, since you say that it should include commit hash, should it be 1.1.1-alpha-dev-commithash?
In this case, where i also changed longclaw's sbws, shouldn't i add one more commit to this PR to change the version, rather than doing so in a different ticket? And then also change the version in longclaw's sbws to match the one in this PR?
I'm not sure if there are python or Debian conventions. If there are, you should follow python or Debian, rather than the conventions that tor made up.
I think 1.1.1-dev0 is fine, because it says "dev".
But that's not what longclaw is showing in its bandwidth file: it says "software_version=1.1.0".
I think the commit hash should be dynamic, and be a different field in the bandwidth file.
I'll add details to legacy/trac#30899 (moved).
Just so you know, the tor version looks like:
"Tor 0.2.9.16 (git-9ef571339967c1e5)"
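As a rough sketch of what "a different field" could look like: the version string stays static and the commit hash is looked up at run time. The key name `software_commit` and both function names are hypothetical; only `software_version` appears in the bandwidth file today.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the short git hash of repo_dir, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Not a git checkout, or git is not installed: report no hash.
        return None
    return result.stdout.strip() or None


def version_header_lines(version, commit=None):
    """Build bandwidth file header lines for the version and, if known, the commit."""
    lines = ["software_version={}".format(version)]
    if commit:
        # Hypothetical key name, used here only for illustration.
        lines.append("software_commit={}".format(commit))
    return lines
```

Keeping the hash in its own field means the version string never has to change between commits, and an install from a package (no git checkout) simply omits the extra line.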
I think longclaw should say "software_version=1.1.1-dev0".
i've access to after i restarted sbws (at ), and sbws detected the change at XX
Because i was waiting for the next bandwidth file. Then i saw that the descriptor average didn't change, so i still need to check whether i'm making a mistake configuring the relay bandwidth values, the relay didn't publish the descriptor, or sbws didn't receive it yet.
in longclaw data between 20190615-05 and 20190617-05 only 11% of relay descriptor tuples match exactly--dramatic improvement, the change appears to correct the issue
my suggestion regarding sourcing descriptors from a relay tor instance rather than client-only instance was intended for longer term (unless the parameter change didn't work) -- the thought is dir-serve relay mode must always maintain a complete set of current descriptors and is less likely to regress than client modes; or sourcing from authorities has the advantage of obtaining descriptors with no propagation delay
documentation on DirCache is confusing, seems to imply relation to dir-server readiness where it really relates to minimizing client memory utilization on very small hardware; what matters is FetchDirInfoEarly=1 FetchUselessDescriptors=1 are in effect activated by relay mode operation
This improvement might be caused by restarting sbws, not just the Fetch* options i added.
I also think that when relaylist was changed to keep the number of consensuses (417ebfa9), the relay objects stopped being replaced in each consensus, but updating their descriptor attributes was forgotten.
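To illustrate the suspected failure mode, here is a simplified sketch. The classes `Relay` and `RelayList` and the `fetch_descriptor` callback are hypothetical stand-ins, not the real sbws code: once relay objects are reused across consensuses, the descriptor has to be refreshed explicitly, otherwise the object keeps the descriptor it was created with.

```python
class Relay:
    """Hypothetical relay object that is kept across consensuses."""

    def __init__(self, fingerprint, descriptor):
        self.fingerprint = fingerprint
        self.descriptor = descriptor
        self.consensus_timestamps = []

    def update(self, timestamp, descriptor):
        """Record that the relay was seen in a consensus and refresh its descriptor."""
        self.consensus_timestamps.append(timestamp)
        if descriptor is not None:
            self.descriptor = descriptor


class RelayList:
    """Hypothetical relay list that reuses Relay objects instead of replacing them."""

    def __init__(self, fetch_descriptor):
        self._fetch_descriptor = fetch_descriptor
        self.relays = {}

    def refresh(self, timestamp, fingerprints):
        for fingerprint in fingerprints:
            descriptor = self._fetch_descriptor(fingerprint)
            relay = self.relays.get(fingerprint)
            if relay is None:
                relay = self.relays[fingerprint] = Relay(fingerprint, descriptor)
            # Without this call on reused objects, the descriptor from the first
            # consensus would be kept until the process restarts.
            relay.update(timestamp, descriptor)
```

That would also be consistent with a restart "fixing" the values: new objects are created with whatever descriptor is current at startup.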
I've restarted longclaw's sbws with this change too (e2ee16e1c42a243dec419b9dcc7555b0621c3e8b) and the current sbws version (1.1.0-dev0). Again need to wait until the relay i changed gets measured, so not putting this in needs_review yet.
So maybe the descriptors were not being fetched with a lot of delay, but some Fetch* options will ensure that they are actually recent. After a conversation with arma in irc, i'm just not sure which Fetch* options we really need.
we use FetchUselessDescriptors in sbws to make sure tor keeps on downloading descriptors, even if sbws is idle
sbws does not need FetchDirInfoExtraEarly, but if we want it to respond to MaxAdvertisedBandwidth changes as soon as possible, we should use it
FetchDirInfoEarly is redundant when FetchDirInfoExtraEarly is set, so we can remove it
Now is a good time to change the comments in legacy/trac#30733 (moved), so we know why sbws is using each option, next time someone asks these questions.
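One way to keep the reasons next to the options is sketched below. The option names are the ones discussed in this ticket; the `TORRC_FETCH_OPTIONS` table and the `render_torrc` helper are hypothetical and only show the idea of recording a reason with each option.

```python
# (option, value, reason) triples; the reasons paraphrase the comments above.
TORRC_FETCH_OPTIONS = [
    ("FetchUselessDescriptors", "1",
     "keep tor downloading descriptors even while sbws is idle"),
    ("FetchDirInfoExtraEarly", "1",
     "optional: notice MaxAdvertisedBandwidth changes as soon as possible"),
    ("FetchDirInfoEarly", "1",
     "reported as redundant with FetchDirInfoExtraEarly; candidate for removal"),
]


def render_torrc(options):
    """Render the options as torrc text, with each reason as a comment line."""
    lines = []
    for name, value, reason in options:
        lines.append("# {}".format(reason))
        lines.append("{} {}".format(name, value))
    return "\n".join(lines) + "\n"
```

Before dropping FetchDirInfoEarly, it is probably worth checking that tor still accepts the resulting configuration.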
Also, you might want to check dormant mode and see if you need to disable that too, since if your tor thinks you are not using it, it will go dormant in 24h and stop downloading consensuses etc. The right way to disable dormant mode is to send the ACTIVE signal with the controller every once in a while. I'm not sure if this applies in your case, but I noticed the new torrc options you added are relevant.
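A minimal sketch of the "send ACTIVE every once in a while" idea follows. It assumes a stem-style controller object with a `signal()` method that accepts the signal name, and a tor version that understands the ACTIVE signal. The function name, the hourly interval and the `rounds`/`sleep` parameters are made up for illustration and testing.

```python
import time


def keep_tor_active(controller, interval_seconds=3600, rounds=None, sleep=time.sleep):
    """Send the ACTIVE signal periodically so tor does not go dormant.

    rounds=None loops forever; a number sends that many signals and returns.
    """
    sent = 0
    while rounds is None or sent < rounds:
        # Assumption: the controller forwards this as "SIGNAL ACTIVE".
        controller.signal("ACTIVE")
        sent += 1
        if rounds is None or sent < rounds:
            sleep(interval_seconds)
    return sent
```

Any interval comfortably below the 24h dormancy timeout mentioned above should do; in sbws this would more naturally be hooked into the existing main loop than run as its own loop.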
I'm marking this as merge_ready for the purposes of code review, but you might want to add unittests and check the dormant thing. If you do, let me know and i can review your changes as well.