sbws does not detect changes in descriptor bandwidth values

changed milestone to %sbws: 1.1.x-final in legacy/trac

Trac:
Parent Ticket: legacy/trac#33121 (moved)
Child Ticket(s): legacy/trac#30747 (moved), legacy/trac#30908 (moved), legacy/trac#30909 (moved)

Trac:
powerlay_longclaw_series.txt

Trac:
powerlay_bastet_series.txt

Trac:
powerlay_maatuska_series.txt

https://metrics.torproject.org/rs.html#details/C381468CDA4A90664F0889A6AE334547AEE4D9B4

Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples.

Possibly strike this. What I may be seeing are stale max advertised values so old they seem like averages related to recent history.

Trac:
Version: sbws: unspecified to sbws: 1.1.0

Hi, when you log bugs like this, please show us:

the full torflow and sbws bandwidth file lines,
the exact lines that show the bug, and
the relevant values in your torrc.

Explain these bugs like we don't know what you're talking about.

If you don't provide good explanations, we don't know how serious the bug is. So we leave it for days or weeks until we have time to look at the details.

Here's a detailed explanation for this bug:

The MaxAdvertisedBandwidth was changed around 2019-05-23 00:30? to a lower value.

longclaw is running sbws. It shows no bandwidth change on 05-23, but does show a bandwidth change on 05-28, 5 days after the change. (sbws results expire after 5 days.)

bw=56000 desc_bw_avg=1073741824 time=2019-05-22T18:46:34
bw=62000 desc_bw_avg=1073741824 time=2019-05-24T00:58:14

bw=62000 desc_bw_avg=1073741824 time=2019-05-27T20:17:40
bw=47000 desc_bw_avg=1073741824 time=2019-05-29T02:18:34

bastet and maatuska were running torflow. They change immediately, or after a few hours:

bastet:

bw=131000 updated_at=2019-05-23T00:42:57 pid_bw=86361445
bw=38300 updated_at=2019-05-23T17:05:28 pid_bw=86361445

maatuska:

bw=98300 updated_at=2019-05-21T02:33:18 pid_bw=91584752
bw=76700 updated_at=2019-05-23T06:07:45 pid_bw=91584752

Or possibly

bw=79000 updated_at=2019-05-23T06:07:45 pid_bw=91584752
bw=65900 updated_at=2019-05-25T00:35:22 pid_bw=91584752

bastet then switched to sbws, which appears to have picked up the MaxAdvertisedBandwidth value straight away:

bw=48700 updated_at=2019-05-25T09:47:17 pid_bw=86361445
bw=58000 desc_bw_avg=36700160 time=2019-05-27T23:51:27

This bug blocks deployment of sbws to more than half the bandwidth authorities.

We might be able to diagnose this bug better when we know the answers to these questions:

What was the MaxAdvertisedBandwidth value before and after the change?
What was the exact time of the change?
Do torflow and sbws report times in UTC?
What are the full sbws bandwidth file lines around the time of the change, and 5 days after the change?
- Do we need to add more diagnostics to sbws relay lines?

Trac:
Severity: Normal to Critical
Priority: Medium to Very High
Keywords: N/A deleted, sbws-majority-blocker added
Milestone: N/A to sbws: 1.1.x-final
Description: See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw never recognized the change. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.

Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples.

to

See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw ~~never recognized the change~~ only recognised the change after 5 days. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.

~~ Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples. ~~

Summary: SBWS logic related to max advertised bandwidth is broken to SBWS is using max advertised bandwidth from 5 day old descriptors

Stand by it; still true:

"SBWS longclaw never recognized the change"

desc_bw_avg=1073741824
nick=powerlay
node_id=$C381468CDA4A90664F0889A6AE334547AEE4D9B4
time=2019-06-03T05:50:32

What the scanner did was--after the protracted delay of five days--detect lower available bandwidth which resulted as a consequence of the reduced configured maximum, because BandwidthRate was used to establish it rather than MaxAdvertisedBandwidth.

The raw data is all here https://collector.torproject.org/archive/relay-descriptors/bandwidths/

A one-line perl script extracted the attachment data such that it is comprehensible in the context of the issue.

Trac:
Summary: SBWS is using max advertised bandwidth from 5 day old descriptors to SBWS does not detect some changes in max advertised bandwidth

Trac:
Description: See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw ~~never recognized the change~~ only recognised the change after 5 days. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.

~~ Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples. ~~

to

See attached files. Relay powerlay lowered max advertised bandwidth late 5/23. SBWS longclaw never recognized the change. SBWS bastet saw the value but was started after the change. Torflow recognized the change instantly.

~~ Have observed that SBWS, when it makes note of max advertised bandwidth, averages the value. This value should not be averaged, the current value whatever it is apples. ~~

We have already tried to implement this feature twice, in legacy/trac#28588 (moved) and GitHub ticket 155.

We need to create better tests for this feature, so we can be sure it is working correctly.

Defect is extremely severe. SBWS should be removed from production until this is fixed. Longclaw has thirteen percent of listed relays with incorrect descriptor information more than 7% away from the published consensus.

Trac:
sbws_longclaw_30733_fail.txt

list of relays where longclaw descriptor differs more than 7% from consensus

correction: neglected to take absolute value of the difference, longclaw has 33% of relays more than 7% away from the consensus.

Trac:
Summary: SBWS does not detect some changes in max advertised bandwidth to SBWS does not detect changes in max advertised bandwidth

Our acceptance criteria for sbws is a 50% variance from the consensus, because that's what torflow's variance was before we started deploying sbws. See legacy/trac#27339 (moved).

Replying to starlight:

Defect is extremely severe. SBWS should be removed from production until this is fixed.

This defect has a limited impact. So we won't remove sbws from production.

Longclaw has thirteen percent of listed relays with incorrect descriptor information more than 7% away from the published consensus.

Why pick 7%?

The total consensus weight of the bandwidth authorities now varies by up to +43%, -14% from the consensus. https://metrics.torproject.org/totalcw.html

between 2019-05-25 16:00 and 2019-06-12 03:00 longclaw sbws shows descriptor bandwidth max/burst/observed for 4138 relays with values that match perfectly, out of 5593 relays present at both times

74% of relays with no update whatsoever to descriptor information for 17 days

Trac:
Summary: SBWS does not detect changes in max advertised bandwidth to SBWS does not detect changes in descriptor bandwidth values

Trac:
longclaw_sbws_unchanged_bw_descriptors.txt

I realized we don't have these options: https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html#can-i-get-descriptors-from-the-tor-process

Looking at the log, found it was removed in version 0.1.1 and never added again: https://github.com/torproject/sbws/commit/816591fa67593d74d4facc63ed202a9cd1cf0d99.

I was quite surprised since we talked about the options several times, so searched for tickets and i didn't find anywhere were we said to remove them: https://github.com/torproject/sbws/issues/146#issuecomment-385974474, https://github.com/torproject/sbws/pull/147#discussion_r185492495, https://github.com/torproject/sbws/issues/144, https://trac.torproject.org/projects/tor/timeline?from=2018-05-02T01%3A56%3A25Z&precision=second.

The way i have work so far is:

implement the patch and the tests, but this is hard to test with the current test network we have
create PR and set to review
wait to get it review
release a new version
update debian package
install the new package in production when it's available (and announce we'll use new version)
then finally we can see if the patch is working

So, because it's a long process, i'm just going to add those options to longclaw's sbws, wait ~3 days (the time it takes to measure the whole network) and check if some descriptors' bandwidth changed.

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

For the record, options in https://stem.torproject.org/tutorials/mirror_mirror_on_the_wall.html#can-i-get-descriptors-from-the-tor-process added, and longclaw's sbws restarted.

We always referred to the scanner with lower letters.

Trac:
Summary: SBWS does not detect changes in descriptor bandwidth values to sbws does not detect changes in descriptor bandwidth values

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1 along with Fetch* settings expressly for providing descriptors or b) connect directly to authorities or fallback directories when obtaining descriptors

Replying to juga: [..]

So, because it's a long process, i'm just going to add those options to longclaw's sbws, wait ~3 days (the time it takes to measure the whole network) and check if some descriptors' bandwidth changed.

Because when restarting it, it takes correctly the descriptor bandwidth of powerlay, what i did was to change the advertised bandwidth of a relay i've access to after i restarted sbws (at ), and sbws detected the change at XX

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change the one in the longclaw's sbws, even if not released yet.

Trac:
Status: new to needs_review

Replying to starlight:

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1

This is the default

along with Fetch* settings expressly for providing descriptors

This is what i said in https://trac.torproject.org/projects/tor/ticket/30733?replyto=17#comment:14. However i don't understand what you mean by "client instances", this configuration affects the sbws' "client instance" to which communicates via socket.

or b) connect directly to authorities or fallback directories when obtaining descriptors

I considered this time ago (https://github.com/torproject/sbws/blob/master/sbws/lib/relaylist.py#L400), but i'm avoiding to do any request not via the Tor network, and stem's remote do that.

i need to check a couple of things before this get reviewed.

Trac:
Owner: N/A to juga
Status: needs_review to assigned

Replying to juga:

Replying to juga: [..]

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version. Let's do the same thing with sbws? So for example, longclaw should now be 1.1.1-alpha-dev. See legacy/trac#30899 (moved) for a ticket, it is not urgent or important.

Replying to juga:

Replying to starlight:

perhaps it would be best to not rely on scanning client instances for descriptor information and either a) run a low-limit relay with DirCache=1

This is the default

juga, sbws runs a client (no ORPort), so the DirCache setting is ignored.

starlight, no, we don't have time to make major changes to the sbws design at this point.

along with Fetch* settings expressly for providing descriptors

This is what i said in https://trac.torproject.org/projects/tor/ticket/30733?replyto=17#comment:14. However i don't understand what you mean by "client instances", this configuration affects the sbws' "client instance" to which communicates via socket.

This fix is the best way to get all the current relay descriptors on a tor client.

If it doesn't work, then sbws has a bug where it is not storing descriptor information for long enough, and we should fix that bug in sbws.

or b) connect directly to authorities or fallback directories when obtaining descriptors

I considered this time ago (https://github.com/torproject/sbws/blob/master/sbws/lib/relaylist.py#L400), but i'm avoiding to do any request not via the Tor network, and stem's remote do that.

We don't have time to make major changes to the sbws design at this point. Major changes are high risk, and they may introduce other bugs. Let's just fix the bugs that we are seeing, with the smallest possible changes.

I did a review of this pull request: the code looks good, but one of the comments is wrong.

See my comment on the pull request: https://github.com/torproject/sbws/pull/358/commits/2b3860678595dc8a1f26ab93bc6cef54fe796a7d#r294117174

Trac:
Status: assigned to needs_revision

Replying to teor:

Replying to juga:

Replying to juga: [..]

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version. Let's do the same thing with sbws? So for example, longclaw should now be 1.1.1-alpha-dev.

Right now sbws is 1.1.1-dev0, since you say that it should include commit hash, should it be 1.1.1-alpha-dev-commithash?

See legacy/trac#30899 (moved) for a ticket, it is not urgent or important.

In this case where i also changed the longclaw's sbws, should not i add 1 more commit to this PR to change the version rather than doing so in a different ticket?. And then change also the version in longclaw's sbws to the same in this PR?.

Replying to juga:

Replying to teor:

Replying to juga:

Replying to juga: [..]

If the descriptors bandwidth change, cool, it worked and i'll create the PR. If not, then i'll comment it and try to figure out what else is wrong.

PR: https://github.com/torproject/sbws/pull/358

I'm not sure whether i should change sbws bugfix version now instead of waiting for more bugfixes and whether i should have change the one in the longclaw's sbws, even if not released yet.

In tor, we call development versions "alpha-dev", and include the commit hash in the version. Let's do the same thing with sbws? So for example, longclaw should now be 1.1.1-alpha-dev.

Right now sbws is 1.1.1-dev0, since you say that it should include commit hash, should it be 1.1.1-alpha-dev-commithash?

I'm not sure if there are python or Debian conventions. If there are, you should follow python or Debian, rather than the conventions that tor made up. I think 1.1.1-dev0 is fine, because it says "dev". But that's not what longclaw is showing in its bandwidth file: it says "software_version=1.1.0".

I think the commit hash should be dynamic, and be a different field in the bandwidth file. I'll add details to legacy/trac#30899 (moved).

Just so you know, the tor version looks like: "Tor 0.2.9.16 (git-9ef571339967c1e5)"

See legacy/trac#30899 (moved) for a ticket, it is not urgent or important.

In this case where i also changed the longclaw's sbws, should not i add 1 more commit to this PR to change the version rather than doing so in a different ticket?. And then change also the version in longclaw's sbws to the same in this PR?.

I think longclaw should say "software_version=1.1.1-dev0".

Replying to juga:

i need to check a couple of things before this get reviewed.

I removed this in needs_review, because:

Initially CI failed (https://travis-ci.org/juga0/sbws/jobs/546378459#L1339), which is a bug in the test (legacy/trac#30902 (moved)) that only happens some times.
I forgot to mention the times in https://trac.torproject.org/projects/tor/ticket/30733?replyto=20#comment:18:

i've access to after i restarted sbws (at ), and sbws detected the change at XX

Because i was waiting for the next bandwidth file. Then saw that the descriptor average didn't change, so i need to still check whether i'm doing a mistake configuring the relay bandwidth values or the relay didn't publish the descriptor or sbws didn't not receive it yet.

However, thanks teor for the review.

in longclaw data between 20190615-05 and 20190617-05 only 11% of relay descriptor tuples match exactly--dramatic improvement, the change appears to correct the issue

my suggestion regarding sourcing descriptors from a relay tor instance rather than client-only instance was intended for longer term (unless the parameter change didn't work) -- the thought is dir-serve relay mode must always maintain a complete set of current descriptors and is less likely to regress than client modes; or sourcing from authorities has the advantage of obtaining descriptors with no propagation delay

documentation on DirCache is confusing, seems to imply relation to dir-server readiness where it really relates to minimizing client memory utilization on very small hardware; what matters is FetchDirInfoEarly=1 FetchUselessDescriptors=1 are in effect activated by relay mode operation

Replying to starlight:

in longclaw data between 20190615-05 and 20190617-05 only 11% of relay descriptor tuples match exactly--dramatic improvement, the change appears to correct the issue

This improvement might be caused by restarting sbws, not just the Fetch* options i added.

I think also that when changing relaylist to keep the number of consensuses (417ebfa9), the relays objects stop to be replaced in each consensus, but was forgotten to update also the descriptors attributes.

I've restarted longclaw's sbws with this change too (e2ee16e1c42a243dec419b9dcc7555b0621c3e8b) and the current sbws version (1.1.0-dev0). Again need to wait until the relay i changed gets measured, so not putting this in needs_review yet.

So maybe the descriptors where not being fetch with lot of delay, but some Fetch* options will ensure that they are actually recent. After a conversation with arma in irc, i'm just not sure that which Fetch* options we really need.

This code has a typo: descriptor = c.get_server_descriptor(r.pringerprint, should be: descriptor = c.get_server_descriptor(r.fingerprint,

Replying to juga:

So maybe the descriptors where not being fetch with lot of delay, but some Fetch* options will ensure that they are actually recent. After a conversation with arma in irc, i'm just not sure that which Fetch* options we really need.

we use FetchUselessDescriptors in sbws to make sure tor keeps on downloading descriptors, even if sbws is idle
sbws does not need FetchDirInfoExtraEarly, but if we want it to respond to MaxcAdvertisedBandwidth changes as soon as possible, we should use it
FetchDirInfoEarly is redundant when FetchDirInfoExtraEarly is set, so we can remove it

Now is a good time to change the comments in legacy/trac#30733 (moved), so we know why sbws is using each option, next time someone asks these questions.

Hi Juga,

chutney runs travis tests with all supported python versions between 2.7 and nightly. Here's a list of their names in Travis: https://github.com/torproject/chutney/blob/master/.travis.yml#L106

Also, python 3.5's end of life was last month, I think.

Trac:
Keywords: N/A deleted, sbws-roadmap-august added
Points: N/A to 1

Juga says they have some unpublished work on this ticket.

When legacy/trac#30909 (moved) is merged, this ticket can be closed, since the remaining child tickets, legacy/trac#30747 (moved) and legacy/trac#30908 (moved) have been solved. Other previous comments are part of other tickets, like legacy/trac#30067 (moved) (#comment:31) and not this one.

Replying to juga:

When legacy/trac#30909 (moved) is merged, this ticket can be closed, since the remaining child tickets, legacy/trac#30747 (moved) and legacy/trac#30908 (moved) have been solved.

I went through all the thread and realized there was a PR associated to this ticket:

Replying to teor:

I did a review of this pull request: the code looks good, but one of the comments is wrong.

See my comment on the pull request: https://github.com/torproject/sbws/pull/358/commits/2b3860678595dc8a1f26ab93bc6cef54fe796a7d#r294117174

I squashed that PR, changed comments, cherry-picked the commits created other PR based on maint-1.1: https://github.com/torproject/sbws/pull/362

Trac:
Status: needs_revision to needs_review

Changing keyword of roadmapped open sbws tickets to a general sbws-roadmap one.

Trac:
Keywords: N/A deleted, sbws-roadmap added

remove august keyword

Trac:
Keywords: sbws-roadmap-august deleted, N/A added

The goal is to deploy sbws in all bw authorities. We need to fix critical bugs to do this.

Trac:
Parent: N/A to legacy/trac#33121 (moved)

Trac:
Reviewer: N/A to asn

Code looks reasonable to me.

It would be nice if the new code was also tested.

Also, you might want to check dormant mode and see if you need to disable that too, since if your tor thinks you are not using it, it will go dormant in 24h and stop downloading consensuses etc. The right way to disable dormant mode is to send the ACTIVE signal with the controller every once in a while. I'm not sure if this applies in your case, but I noticed the new torrc options you added are relevant.

I'm marking this as merge_ready for the purposes of code review, but you might wan to add unittests and check the dormant thing. If you do, let me know and i can review your changes as well.

Trac:
Status: needs_review to merge_ready

sbws does not detect changes in descriptor bandwidth values

Designs

Child items ...

Activity