Sorry if this isn't the correct place to post this.
It appears that some dirauths, such as "longclaw", are reporting lower than normal bandwidth scanner values and consensus weights, especially for West Coast relays. This was not a problem earlier.
My two cents: it could be a peering problem. I know a fair amount about how routers/switches and the internet work (though I don't do that for a living), and this explanation gets technical, so apologies if networking isn't your area of knowledge or interest.
I am assuming the bandwidth scanner is on the same ISP as the dirauth longclaw, but on a different machine. I could be wrong.
Assuming that is true, it appears longclaw is hosted at "Koumbit", which I believe also hosts the bandwidth scanner.
Koumbit contracts "eStruxture" for colocation/bandwidth. Many routes going to eStruxture go through Hurricane Electric. The eStruxture->HE link probably had issues, especially for West Coast routes.
I contacted both HE and eStruxture. HE said there's no congestion. eStruxture claimed they "resolved" the issue. The latter didn't share details since I'm not a customer.
I'll wait and see if the values go up. It could also be a software issue, but since it didn't affect Europe it's probably a network issue.
Thanks @neel for helping to figure out why this is happening, and for the info.
Correct about Koumbit and longclaw, which is in Canada, but the bandwidth scanner is on a different machine, at a different provider and a different AS, in the US East.
I'm asking the US East provider whether they know about bandwidth issues towards the West Coast.
This software is being actively developed, and we're deploying new versions to the machine in the US East, which sends the data to longclaw.
Except for maatuska, all the other bwauths still run Torflow.
So if there are bandwidth issues only in longclaw, it's very likely an sbws bug.
Do you know when exactly this started to happen?
If you have time and experience with CollecTor and/or stem, it's possible to retrieve the past bandwidth files (and consensuses) to see when this started happening.
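For example, something like this could pull the archived bandwidth files for a date range and print what was measured for one relay over time. This is a rough sketch, assuming stem >= 1.8 (which added the stem.descriptor.collector module); the fingerprint is one of the relays discussed in this ticket, and the date range is a placeholder:

```python
# Rough sketch: fetch archived bandwidth files from CollecTor and print
# one relay's measured bandwidth over time. Assumes stem >= 1.8; note
# this downloads a lot of data, so keep the date range narrow.
import datetime

from stem.descriptor import collector

FP = '64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA'  # relay to track
start = datetime.datetime(2020, 5, 1)
end = datetime.datetime(2020, 5, 7)

for bw_file in collector.get_bandwidth_files(start=start, end=end):
    # Only sbws files (longclaw, maatuska) carry a 'software' header.
    if bw_file.header.get('software') != 'sbws':
        continue
    for fingerprint, attrs in bw_file.measurements.items():
        # node_id values sometimes carry a leading '$'.
        if fingerprint.lstrip('$') == FP and 'bw' in attrs:
            print(bw_file.timestamp, attrs['bw'])
```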
I'll have a look in the next days at which versions we deployed last, to try to find out if there's a bug.
I believe the slowness started to happen around June.
My Psychz server is relatively new; I replaced an older 300 Mbps server with my current Gigabit one. In the past, Psychz in Los Angeles was faster than it is now, but with different keys.
The Wave connection's relays were also rekeyed. Here, replacing Wave isn't an option: do I really want Comcast? No so-called "phone company" serves me with fiber or copper. Surprisingly, I also had faster measurements on Wave in the past, even on the crappy older switches in my area that have since been replaced.
In short, Wave Seattle and Psychz LA were faster than they are now, even on crappier hardware/routers.
I'm really more of a Core Tor person than a stem/CollecTor person, but I'm happy to help if you need it, even with coding. I do, however, have a $DAYJOB that isn't Tor-related.
Yes, I was told that the machine running longclaw's sbws in the US East is connected to the Hurricane Electric backbone, so I'm quite sure location or peering is not the issue.
So far I haven't been able to spot what the bug in sbws could be.
It could also be the CDN that is used as the web server from which the scanner downloads data to measure speed.
I remember Wave had an issue where their switches had performance degradation far greater than what's normal with TCP (fortunately resolved now). Your CDN could have a similar problem.
Major "cloud" providers like Amazon AWS, MS Azure, and Google Cloud offer "trial" credits and have their own respective CDNs. You could try one of these as well.
You could also use a better-connected CDN, like Akamai, Cloudflare, or EdgeCast, as opposed to what you're using right now, or a smaller provider like CDN77.
The issue still persists. Any updates on your part?
Whenever longclaw is absent from the consensus, the weights of West Coast relays go up, but whenever longclaw is added back, they go back down.
longclaw is the slowest bwauth to the West Coast in many cases (yes really), slower than the European bwauths.
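(For reference, here's roughly how one could compare what each dirauth's vote says about a single relay. A sketch, assuming stem >= 1.8; the fingerprint is one of my Psychz/LA relays.)

```python
# Sketch: fetch each directory authority's current vote and print the
# Measured value it assigned to one relay. Assumes stem >= 1.8.
import stem.descriptor.remote
import stem.directory
from stem.descriptor import DocumentHandler

FP = '64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA'  # one of my Psychz/LA relays
downloader = stem.descriptor.remote.DescriptorDownloader()

for name, authority in stem.directory.Authority.from_cache().items():
    if authority.v3ident is None:
        continue  # e.g. the bridge authority doesn't vote
    try:
        vote = downloader.get_vote(
            authority, document_handler=DocumentHandler.DOCUMENT).run()[0]
    except Exception as exc:
        print('%s: could not fetch vote (%s)' % (name, exc))
        continue
    entry = vote.routers.get(FP)
    if entry is not None:
        # 'measured' is only present for authorities running a bwauth.
        print('%s: Measured=%s' % (name, entry.measured))
```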
Have you contacted the CDN backing longclaw's bandwidth scanner about this issue?
If you need to replace the CDN, Amazon AWS and Microsoft Azure have nonprofit programs where they donate cloud credits ($2000 and $3500 yearly, respectively). This may not be enough for the bwauth, though. I don't know about other clouds like Google and Oracle, and I don't know whether the nonprofit Azure credit is already being used by meek.
Disclaimer: I work at Microsoft, but not on Azure. If you are interested in Azure's CDN (or Azure in general), I could try my best to find a contact (or you could use the link above).
> The issue still persists. Any updates on your part?
No, sorry. As I said in my first reply, it might take us some weeks, since a lot of the work on this topic is volunteer work. Apologies in advance.
> Whenever longclaw is absent from the consensus, the weights of West Coast relays go up, but whenever longclaw is added back, they go back down.
This is weird, since the consensus takes the median of all the bwauths, so if longclaw is the only bwauth measuring lower, it should not affect the median much.
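To illustrate (a toy example with made-up numbers; as far as I know the consensus takes the low-median, i.e. with an even count, the lower of the two middle values):

```python
# Toy illustration: one low vote among several barely moves the median.
def low_median(values):
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]

print(low_median([9800, 9500, 9700, 9600, 5000]))  # -> 9600, outlier ignored
print(low_median([9800, 5000, 5100]))  # -> 5100, few voters: low values dominate
```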
> longclaw is the slowest bwauth to the West Coast in many cases (yes really), slower than the European bwauths.
A few questions that would help us to solve this:
Do you think that longclaw's bwauth measures low bandwidth only on the West Coast?
Have you checked that this happens to all relays on the West Coast and not only yours?
What do you understand by West Coast?
How have you checked that longclaw's bwauth is not measuring lower in Europe?
Do you have scripts to check this?
I'm asking this because relays have information about IP, ASN, and country, but not "regions" such as the West Coast, so I'm not sure how we can check that.
Also, there are ~7000 relays, so we would need an automatic way to check whether the bandwidths are lower in some countries or ASNs, rather than going over the 7000 relays one by one.
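Something like this might be a starting point: map each measured relay to its country via Onionoo, then aggregate one bandwidth file per group. A rough sketch, assuming stem >= 1.8 and requests; 'longclaw.bwfile' is a hypothetical local copy of one of longclaw's bandwidth files:

```python
# Rough sketch: group one bandwidth file's measurements by country using
# Onionoo metadata. 'longclaw.bwfile' is a hypothetical local copy.
import statistics

import requests
from stem.descriptor.bandwidth_file import BandwidthFile

details = requests.get(
    'https://onionoo.torproject.org/details?fields=fingerprint,country,as'
).json()
country_of = {r['fingerprint']: r.get('country') for r in details['relays']}

with open('longclaw.bwfile') as f:
    bw_file = BandwidthFile.from_str(f.read())

by_country = {}
for fp, attrs in bw_file.measurements.items():
    country = country_of.get(fp.lstrip('$'))  # node_id may carry a '$'
    if country and 'bw' in attrs:
        by_country.setdefault(country, []).append(int(attrs['bw']))

# Median measured bandwidth per country, most-measured countries first.
for country, bws in sorted(by_country.items(), key=lambda kv: -len(kv[1])):
    print(country, len(bws), statistics.median(bws))
```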
> Have you contacted the CDN backing longclaw's bandwidth scanner about this issue?
Not yet. A CDN problem is a possibility, but I don't think it's the most probable one.
Most likely this is an issue in this software, sbws, which runs only on longclaw's and maatuska's bwauths; the other 4 bwauths run the Torflow software.
Thanks @neel! We are applying for an AWS grant so we can get OnionPerf instances running there.
And we just got Azure credits that we will use for meek.
> What do you understand by West Coast?
Parts of the US and Canada facing the Pacific Ocean, such as Los Angeles, the Bay Area, Seattle, Vancouver, etc.
This slowdown also happens in other areas like Texas, which isn't "West Coast" but is still far from the East Coast, but I don't live or have servers there.
> How have you checked that longclaw's bwauth is not measuring lower in Europe?
Yes, as mentioned earlier, longclaw measures European relays as normal.
The consensus weights of these two relays have been hovering around 6000-7000.
It could be longclaw's bandwidth scanner's host, or their transit providers. Test downloading Linux ISOs from the West Coast and from Europe: if the West Coast is slower, then there are likely peering problems you can report to your host.
It could also be your CDN (Fastly) having to do an origin pull from a faraway server instead of serving locally, reducing throughput. But if the same CDN is used by other bandwidth scanners, this is less of an issue.
Sorry for the delay. I was getting through things at $DAYJOB.
> Do you know how much it was before, and when?
I remember it did not go beyond that. Psychz earlier (back in April) had much higher values, like 12000-18000 on a Gigabit link, but on another IP/fingerprint.
Fortunately, longclaw measures 9800 now for the two Psychz/LA relays. Wave Broadband/Seattle is less lucky. But don't take things for granted; we should make sure it's consistently high before closing this.
I feel it's a network problem with the bwauth, but Wave in my area is slower for some reason (in the real world) despite having "Gigabit" service.
I'm moving, but will likely have to take Wave's subsidiary "Cascadelink", since the phone company (CenturyLink) only has VDSL where I'm moving to, and again, why would I want Comcast? I just hope the new subsidiary/equipment is better.
For the East Coast: Esslingen University of Applied Sciences, 10 Gbps:

```
$ curl -o ubuntu_esslingen.iso https://ftp-stud.hs-esslingen.de/pub/Mirrors/releases.ubuntu.com/groovy/ubuntu-20.10-beta-desktop-amd64.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2838M  100 2838M    0     0  26.4M      0  0:01:47  0:01:47 --:--:-- 14.0M
```

For the West Coast: Pacific Northwest National Laboratory, 10 Gbps:

```
$ curl -o ubuntu_westus.iso https://mirror.pnl.gov/releases/groovy/ubuntu-20.10-beta-desktop-amd64.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2838M  100 2838M    0     0  25.3M      0  0:01:51  0:01:51 --:--:-- 16.0M
```

So it's a bit slower on the West Coast, which makes sense, since the scanner is on the East Coast.
> Fortunately, longclaw measures 9800 now for the two Psychz/LA relays. Wave Broadband/Seattle is less lucky. But don't take things for granted; we should make sure it's consistently high before closing this.
Indeed, looking at another bwauth that doesn't run sbws, e.g. gabelmoo, it reports 5020, which is lower.
So, for the 2 fingerprints that already existed in May, they have approximately the same consensus weight, and 64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA has higher bandwidth in September.
I'm not going to close this ticket yet, but unless we create more complicated scripts to compare more months, more relays, and different bwauths, I don't see an easy way to demonstrate that longclaw is measuring the US West Coast lower.
What we know is that there was an sbws bug from January to May that caused maatuska to mismeasure relays in the same network, either too high or too low.
It's possible that because of that bug, your relays were getting a higher consensus weight than they should.