Sorry if this isn't the correct place to post this.
It appears that some dirauths, such as "longclaw", are reporting lower than normal bandwidth scanner values and consensus weights, especially for West Coast relays. This was not a problem earlier.
My two cents: it could be a peering problem. I know a fair amount about how routers/switches and the internet work (though I don't do that for a living), and this explanation gets technical, so apologies if networking isn't your area of knowledge or interest.
I am assuming the bandwidth scanner is on the same ISP as the dirauth longclaw, but on a different machine. I could be wrong.
Assuming that is true, it appears longclaw is hosted at "Koumbit", which I believe also hosts the bandwidth scanner.
Koumbit contracts "eStruxture" for colocation/bandwidth. Many routes going to eStruxture go through Hurricane Electric. The eStruxture->HE link probably had issues, especially for West Coast routes.
I contacted both HE and eStruxture. HE said there's no congestion. eStruxture claimed they "resolved" the issue. The latter didn't share details since I'm not a customer.
I'll wait and see if the values go up. It could also be a software issue, but since it didn't affect Europe it's probably a network issue.
Thanks @neel for helping to figure out why this is happening, and for the info.
Correct about Koumbit and longclaw, which is in Canada, but the bandwidth scanner is on a different machine, at a different provider and a different AS, in the US East.
I'm asking the US East provider whether they know about bandwidth issues towards the West Coast.
This software is being actively developed, and we're deploying new versions to the machine in the US East, which sends the data to longclaw.
Except for maatuska, all the other bwauths still run Torflow.
So if there are bandwidth issues only in longclaw, it's very likely an sbws bug.
Do you know when exactly this started to happen?
If you have time and experience with CollecTor and/or stem, it's possible to retrieve the past bandwidth files (and consensuses) to see when this started happening.
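For example, something like this could pull the archived bandwidth files for a date range and print what was measured for one relay over time. This is a rough sketch, assuming stem >= 1.8 (which added the stem.descriptor.collector module); the fingerprint is one of the relays discussed in this ticket, and the date range is a placeholder:

```python
# Rough sketch: fetch archived bandwidth files from CollecTor and print
# one relay's measured bandwidth over time. Assumes stem >= 1.8; note
# this downloads a lot of data, so keep the date range narrow.
import datetime

from stem.descriptor import collector

FP = '64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA'  # relay to track
start = datetime.datetime(2020, 5, 1)
end = datetime.datetime(2020, 5, 7)

for bw_file in collector.get_bandwidth_files(start=start, end=end):
    # Only sbws files (longclaw, maatuska) carry a 'software' header.
    if bw_file.header.get('software') != 'sbws':
        continue
    for fingerprint, attrs in bw_file.measurements.items():
        # node_id values sometimes carry a leading '$'.
        if fingerprint.lstrip('$') == FP and 'bw' in attrs:
            print(bw_file.timestamp, attrs['bw'])
```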
I'll have a look in the next days at which versions we deployed last, to try to find out if there's a bug.
I believe the slowness started to happen around June.
My Psychz server is relatively new; I replaced an older 300 Mbps server with my current Gigabit one. In the past, Psychz in Los Angeles was faster than it is now, but with different keys.
The Wave connection's relays were also rekeyed. Here, replacing Wave isn't an option: do I really want Comcast? No so-called "phone company" serves me with fiber or copper. Surprisingly, I also had faster measurements on Wave in the past, even on the crappy older switches in my area that have since been replaced.
In short, Wave Seattle and Psychz LA were faster than they are now, even on crappier hardware/routers.
I'm really more of a Core Tor person than a stem/CollecTor person, but I'm happy to help if you need it, even with coding. I do, however, have a $DAYJOB that isn't Tor-related.
Yes, I was told that the machine running longclaw's sbws in the US East is connected to the Hurricane Electric backbone, so I'm quite sure location or peering is not the issue.
So far I haven't been able to spot what the bug in sbws could be.
It could also be the CDN that is used as the web server from which the scanner downloads data to measure speed.
I remember Wave had an issue where their switches had performance degradation far greater than what's normal with TCP (fortunately resolved now). Your CDN could have a similar problem.
Major "cloud" providers like Amazon AWS, MS Azure, and Google Cloud offer "trial" credits and have their own respective CDNs. You could try one of these as well.
You could also use a better-connected CDN, like Akamai, Cloudflare, or EdgeCast, as opposed to what you're using right now, or a smaller provider like CDN77.
The issue still persists. Any updates on your part?
Whenever longclaw is absent from the consensus, the weights of West Coast relays go up, but whenever longclaw is added back, they go back down.
longclaw is the slowest bwauth to the West Coast in many cases (yes really), slower than the European bwauths.
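(For reference, here's roughly how one could compare what each dirauth's vote says about a single relay. A sketch, assuming stem >= 1.8; the fingerprint is one of my Psychz/LA relays.)

```python
# Sketch: fetch each directory authority's current vote and print the
# Measured value it assigned to one relay. Assumes stem >= 1.8.
import stem.descriptor.remote
import stem.directory
from stem.descriptor import DocumentHandler

FP = '64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA'  # one of my Psychz/LA relays
downloader = stem.descriptor.remote.DescriptorDownloader()

for name, authority in stem.directory.Authority.from_cache().items():
    if authority.v3ident is None:
        continue  # e.g. the bridge authority doesn't vote
    try:
        vote = downloader.get_vote(
            authority, document_handler=DocumentHandler.DOCUMENT).run()[0]
    except Exception as exc:
        print('%s: could not fetch vote (%s)' % (name, exc))
        continue
    entry = vote.routers.get(FP)
    if entry is not None:
        # 'measured' is only present for authorities running a bwauth.
        print('%s: Measured=%s' % (name, entry.measured))
```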
Have you contacted the CDN backing longclaw's bandwidth scanner about this issue?
If you need to replace the CDN, Amazon AWS and Microsoft Azure have nonprofit programs where they donate cloud credits ($2000 and $3500 yearly, respectively). This may not be enough for the bwauth, though. I don't know about other clouds like Google and Oracle, and I don't know whether the nonprofit Azure credit is already being used by meek.
Disclaimer: I work at Microsoft, but not on Azure. If you are interested in Azure's CDN (or Azure in general), I could try my best to find a contact (or you could use the link above).
> The issue still persists. Any updates on your part?
No, sorry. As I said in my first reply, it might take us some weeks, since a lot of the work on this topic is volunteer work. Apologies in advance.
> Whenever longclaw is absent from the consensus, the weights of West Coast relays go up, but whenever longclaw is added back, they go back down.
This is weird, since the consensus takes the median of all the bwauths, so if longclaw is the only bwauth measuring lower, it should not affect the median much.
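To illustrate (a toy example with made-up numbers; as far as I know the consensus takes the low-median, i.e. with an even count, the lower of the two middle values):

```python
# Toy illustration: one low vote among several barely moves the median.
def low_median(values):
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]

print(low_median([9800, 9500, 9700, 9600, 5000]))  # -> 9600, outlier ignored
print(low_median([9800, 5000, 5100]))  # -> 5100, few voters: low values dominate
```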
> longclaw is the slowest bwauth to the West Coast in many cases (yes really), slower than the European bwauths.
A few questions that would help us to solve this:
Do you think that longclaw's bwauth measures low bandwidth only on the West Coast?
Have you checked that this happens to all relays on the West Coast and not only yours?
What do you understand by West Coast?
How have you checked that longclaw's bwauth is not measuring lower in Europe?
Do you have scripts to check this?
I'm asking this because relays have information about IP, ASN, and country, but not "regions" such as the West Coast, so I'm not sure how we can check that.
Also, there are ~7000 relays, so we would need an automatic way to check whether the bandwidths are lower in some countries or ASNs, rather than going over the 7000 relays one by one.
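Something like this might be a starting point: map each measured relay to its country via Onionoo, then aggregate one bandwidth file per group. A rough sketch, assuming stem >= 1.8 and requests; 'longclaw.bwfile' is a hypothetical local copy of one of longclaw's bandwidth files:

```python
# Rough sketch: group one bandwidth file's measurements by country using
# Onionoo metadata. 'longclaw.bwfile' is a hypothetical local copy.
import statistics

import requests
from stem.descriptor.bandwidth_file import BandwidthFile

details = requests.get(
    'https://onionoo.torproject.org/details?fields=fingerprint,country,as'
).json()
country_of = {r['fingerprint']: r.get('country') for r in details['relays']}

with open('longclaw.bwfile') as f:
    bw_file = BandwidthFile.from_str(f.read())

by_country = {}
for fp, attrs in bw_file.measurements.items():
    country = country_of.get(fp.lstrip('$'))  # node_id may carry a '$'
    if country and 'bw' in attrs:
        by_country.setdefault(country, []).append(int(attrs['bw']))

# Median measured bandwidth per country, most-measured countries first.
for country, bws in sorted(by_country.items(), key=lambda kv: -len(kv[1])):
    print(country, len(bws), statistics.median(bws))
```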
> Have you contacted the CDN backing longclaw's bandwidth scanner about this issue?
Not yet. A CDN problem is a possibility, but I don't think it's the most probable one.
Most likely this is an issue in this software, sbws, which runs only on longclaw's and maatuska's bwauths; the other 4 bwauths run the Torflow software.
Thanks @neel! We are applying for an AWS grant so we can get OnionPerf instances running there.
And we just got Azure credits that we will use for meek.
> What do you understand by West Coast?
Parts of the US and Canada facing the Pacific Ocean, such as Los Angeles, the Bay Area, Seattle, Vancouver, etc.
This slowdown also happens in other areas like Texas, which isn't "West Coast" but is still far from the East Coast, but I don't live or have servers there.
> How have you checked that longclaw's bwauth is not measuring lower in Europe?
Yes, as mentioned earlier, longclaw measures European relays as normal.
The consensus weights of these two relays have been hovering around 6000-7000.
It could be longclaw's bandwidth scanner's host, or their transit providers. Test downloading Linux ISOs from the West Coast and from Europe: if the West Coast is slower, then there are likely peering problems you can report to your host.
It could also be your CDN (Fastly) having to do an origin pull from a faraway server instead of serving locally, reducing throughput. But if the same CDN is used by other bandwidth scanners, this is less of an issue.
Sorry for the delay. I was getting through things at $DAYJOB.
> Do you know how much it was before, and when?
I remember it did not go beyond that. Psychz earlier (back in April) had much higher values, like 12000-18000 on a Gigabit link, but on another IP/fingerprint.
Fortunately, longclaw measures 9800 now for the two Psychz/LA relays. Wave Broadband/Seattle is less lucky. But don't take things for granted; we should make sure it's consistently high before closing this.
I feel it's a network problem with the bwauth, but Wave in my area is slower for some reason (in the real world) despite having "Gigabit" service.
I'm moving, but will likely have to take Wave's subsidiary "Cascadelink", since the phone company (CenturyLink) only has VDSL where I'm moving to, and again, why would I want Comcast? I just hope the new subsidiary/equipment is better.
For the East Coast: Esslingen University of Applied Sciences, 10 Gbps:

```
$ curl -o ubuntu_esslingen.iso https://ftp-stud.hs-esslingen.de/pub/Mirrors/releases.ubuntu.com/groovy/ubuntu-20.10-beta-desktop-amd64.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2838M  100 2838M    0     0  26.4M      0  0:01:47  0:01:47 --:--:-- 14.0M
```

For the West Coast: Pacific Northwest National Laboratory, 10 Gbps:

```
$ curl -o ubuntu_westus.iso https://mirror.pnl.gov/releases/groovy/ubuntu-20.10-beta-desktop-amd64.iso
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2838M  100 2838M    0     0  25.3M      0  0:01:51  0:01:51 --:--:-- 16.0M
```

So it's a bit slower on the West Coast, which makes sense, since the scanner is on the East Coast.
> Fortunately, longclaw measures 9800 now for the two Psychz/LA relays. Wave Broadband/Seattle is less lucky. But don't take things for granted; we should make sure it's consistently high before closing this.
Indeed, looking at another bwauth that doesn't run sbws, e.g. gabelmoo, it reports 5020, which is lower.
So, for the 2 fingerprints that already existed in May, they have approximately the same consensus weight, and 64CB2B32C10ADD4B93D725B7DED238CCCD6D6DBA has higher bandwidth in September.
I'm not going to close this ticket yet, but unless we create more complicated scripts to compare more months, more relays, and different bwauths, I don't see an easy way to demonstrate that longclaw is measuring the US West Coast lower.
What we know is that there was an sbws bug from January to May that caused maatuska to mismeasure relays in the same network, either too high or too low.
It's possible that because of that bug, your relays were getting a higher consensus weight than they should.