One reason could be that it is doing uploads instead of downloads. Another could be an update of the Tor version. It could also be a bug. Either way, we should check it.
We didn't change anything during mid-December, but we did change the sbws version several times
around mid-November while trying several approaches to solve #40142 (closed) and #40150 (closed).
It's possible that we didn't notice these changes in the graphs until December because sbws keeps the data for a month.
I attach some CSVs created with bwauthealth to see the differences between a longclaw bandwidth file from the beginning of November and another from the end of December, and also between longclaw's and moria1's bandwidth files in December.
Apart from what we already know, that longclaw measures more relays and moria1 obtains a higher total bandwidth, I couldn't decipher anything else from the CSVs.
I think that the changes in the graphs could be caused by any combination of:
uploads don't retry several times to adjust the size and the duration (onbasca#128)
uploads don't retry when a circuit fails with an exit as entry
uploads always use the same size, 1.5 MiB, while downloads start with 16 MiB and change it depending on the time the transfer took.
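To make the third point concrete, here is a minimal sketch of a download-side size adaptation like the one described above. This is illustrative only, not sbws's actual code; the target duration and the bounds are assumptions:

```python
# Illustrative sketch (not sbws code): scale the next download size so the
# transfer takes roughly a target number of seconds, within fixed bounds.
TARGET_SECS = 6          # assumed target transfer duration
MIN_SIZE = 1 * 1024**2   # assumed lower bound (1 MiB)
MAX_SIZE = 16 * 1024**2  # starting/maximum size (16 MiB, per the point above)

def next_size(prev_size, elapsed_secs):
    """Return the size for the next download, scaled toward TARGET_SECS."""
    if elapsed_secs <= 0:
        return MAX_SIZE
    scaled = int(prev_size * TARGET_SECS / elapsed_secs)
    return max(MIN_SIZE, min(MAX_SIZE, scaled))
```

With a fixed 1.5 MiB upload there is no such feedback loop, so slow and fast relays get the same payload regardless of how long the previous transfer took.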
It seems the numbers are going down already and are close to the other bwauths. I find those three possible reasons you mentioned interesting, though, in particular wrt the measurement results we see. Without doing the same dance as in the download case we seem to get considerably more relays measured. Does that speak against following the download case then? Or do we get better measurements following the download case and the trade-off is less relays measured (and that trade-off is actually worth it)?
I'm not sure tbh. We might want to prioritize getting "better" measurements (and theoretically higher total weight) than more relays measured.
As we commented in our sync, let's wait until March 6 and see how things change with this branch.
Another thing that might be relevant here is figuring out why longclaw's totalcw did not rise as much as for the other bwauths between 03/02 and 03/04:
bwscanner_cc got set to 2 in the consensus 2023-03-03 11-00-00. I still don't see why this would have any effect on the non-longclaw bwauths, but I might be missing something here and it might be one thing to double-check anyway.
Actually, I guess we can scratch that, as it seems the rise is already visible between 2023-03-02 00-00-00 and 2023-03-03 00-00-00, before the consensus param changed.
I've been looking at the number of relays with the XOFF_RECV event in longclaw:
2022-11-03-15-29-36 (right after version with xoff in bwfiles was deployed)
#relays: 8340
#relays with xoff: 0
%relays with xoff: 0
2022-11-04-08-28-42 (right before 1.5.2 was deployed again)
#relays: 8295
#relays with xoff: 21
%relays with xoff: 0.25
2022-11-16-15-28-40 (right after version with xoff in bwfiles was deployed again)
#relays: 7962
#relays with xoff: 34
%relays with xoff: 0.43
2022-11-21-15-29-54 (right after restarting with 2 threads, the version that was attaching streams to the same circuits)
#relays: 7899
#relays with xoff: 2138
%relays with xoff: 27.07
2022-11-30-23-33-47 (without waiting for SS=0)
#relays: 7523
#relays with xoff: 2191
%relays with xoff: 29.12
2022-12-15-01-49-47
#relays: 7794
#relays with xoff: 2000
%relays with xoff: 25.67
2022-12-17-01-59-40
#relays: 7791
#relays with xoff: 1189
%relays with xoff: 15.26
2022-12-18-01-03-29
#relays: 7780
#relays with xoff: 580
%relays with xoff: 7.46
2022-12-19-01-11-34
#relays: 7744
#relays with xoff: 186
%relays with xoff: 2.42
No longclaw bwfiles on the 20th?
2022-12-21-01-10-16
#relays: 7798
#relays with xoff: 161
%relays with xoff: 2.06
2022-12-31-23-57-54
#relays: 7710
#relays with xoff: 119
%relays with xoff: 1.54
2023-02-23-15-46-32
#relays: 7721
#relays with xoff: 115
%relays with xoff: 1.49
2023-02-27-00-43-57
#relays: 7688
#relays with xoff: 100
%relays with xoff: 1.30
2023-02-28-23-43-15
#relays: 7651
#relays with xoff: 88
%relays with xoff: 1.15
2023-03-02-10-42-01
#relays: 7628
#relays with xoff: 77
%relays with xoff: 1.01
2023-03-02-11-41-57
#relays: 7627
#relays with xoff: 77
%relays with xoff: 1.01
2023-03-07-02-38-32
#relays: 7538
#relays with xoff: 60
%relays with xoff: 0.80
It can be observed that:
the number of relays that received XOFF_RECV has decreased from February until now, but the number of reported relays has been decreasing too. Still, the percentage seems to be decreasing. I don't know whether it's due to using uploads without checking for SS=0.
the number of relays that received XOFF_RECV increased a lot from the 16th to the 21st of November. It might be because of using branches that were attaching streams to the same circuit.
the number of relays that received XOFF_RECV decreased a lot from the 17th to the 21st of December, the same period in which the total weight decreased a lot. Maybe because, a month on from November, it lost the data about the high bandwidth measured after SS=0 and the high number of XOFF due to attaching streams to the same circuit.
in terms of the number of relays that received XOFF_RECV during the first days of March, there aren't big differences.
I think that, in any case, there are too many variables to really understand what's going on.
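For reference, the per-file counts above can be recomputed from the bandwidth files themselves. Here is a minimal sketch, assuming the v1.x bandwidth-file layout (a header terminated by a `=====` line, then one space-separated key=value line per relay) and an `xoff_recv` key on relay lines; the helper names are mine, not sbws's:

```python
# Sketch: count relays with a nonzero xoff_recv in a bandwidth-file text.
# Assumes the v1.x layout: header, "=====" terminator, then one
# space-separated key=value line per measured relay.

def parse_relay_lines(text):
    """Split off the header and parse each relay line into a dict."""
    _header, _, body = text.partition("=====\n")
    relays = []
    for line in body.splitlines():
        if not line.strip():
            continue
        kv = dict(pair.split("=", 1) for pair in line.split(" ") if "=" in pair)
        relays.append(kv)
    return relays

def xoff_stats(text):
    """Return (#relays, #relays with xoff_recv > 0, % with xoff, 2 decimals)."""
    relays = parse_relay_lines(text)
    with_xoff = [r for r in relays if int(r.get("xoff_recv", 0)) > 0]
    pct = 100 * len(with_xoff) / len(relays) if relays else 0.0
    return len(relays), len(with_xoff), round(pct, 2)
```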
Since the upload versions without waiting for SS=0 are bringing us new questions, what if we revert to the download version, but with the XOFF events?
Yes, reverting seems an OK move, to see if there is a sudden difference due to that. (Cue new DDoS attack.)
With respect to which XOFF values to monitor for what, here is the breakdown:
If upload is enabled:
An XOFF_RECV means that the destination webserver is too slow to receive the upload at full bandwidth, or the exit is having difficulty sending on its outbound connections as fast as the circuit processing can happen (i.e., an exit connection-flood attack)
An XOFF_SENT should be extremely rare/non-existent, because the client is not getting much data back for the socks port to deliver, other than HTTP response headers.
If download is enabled:
An XOFF_RECV should be extremely rare/non-existent, because the only data the client is sending to the webserver are the GET request and the HTTP request headers.
An XOFF_SENT means that sbws is having trouble reading from its SOCKS port fast enough for the circuit to deliver. This can happen if there are sync issues between the SOCKS read thread and other stuff, or if the machine it is running on is under heavy CPU load.
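The four cases above suggest a simple sanity check one could run against per-scan event counts: for each scan mode, one of the two event types should be extremely rare, so a high count for it is worth investigating. A minimal sketch; the function name and threshold are assumptions, not sbws code:

```python
# Sketch of a monitoring check based on the breakdown above: flag event
# counters that should be "extremely rare" for the active scan mode.
RARE_THRESHOLD = 5  # assumed: more than this many events is worth a look

def unexpected_events(mode, xoff_recv, xoff_sent):
    """Return the names of event counters that are anomalous for `mode`.

    Per the breakdown: XOFF_SENT should be rare when uploading (the client
    receives little data), XOFF_RECV should be rare when downloading (the
    client only sends the GET request and headers).
    """
    rare = "XOFF_SENT" if mode == "upload" else "XOFF_RECV"
    counts = {"XOFF_RECV": xoff_recv, "XOFF_SENT": xoff_sent}
    return [name for name, n in counts.items() if name == rare and n > RARE_THRESHOLD]
```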
It is odd that we did not see a spike in XOFF_RECV when enabling upload globally, and also that we measured relays slower in that case.
It is also odd that we saw high counts of XOFF_RECV before enabling upload, back in December, unless that was a test. Though there were reports of connection floods during that period. They may have been so severe that even the HTTP headers were enough to trigger an XOFF?
@mikeperry I'm sorry I didn't mention something important here: since the first deployment of the sbws version doing uploads in November, I've always been adding an extra commit hardcoding bwscanner_cc to 2.
So that explains why there isn't any change after enabling upload globally. Actually, that should not affect anything (except that we don't need the extra commit anymore, which was making the commit hash confusing to track).
Also, XOFF values have only been tested with uploads so far, and in my comment above all the xoff numbers refer to XOFF_RECV. So that's only the first case you mention.
We'll be able to see what happens with downloads and XOFF_SENT as soon as we revert to downloads, keeping the xoff tracking code.
!157 (closed) got enabled 2023-03-10 11-00-00. What it essentially does is disable upload. And since upload got disabled, the number of measured relays went up significantly:
As @gk and I discussed on IRC, I took longclaw's bandwidth files from before the 10th of March and after that date to see which relays were not measured by the upload branch but were measured by the download branch.
Specifically, I took files from 8-9 March and 11-12 March.
The number of relays measured after but not before is 186.
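That comparison amounts to a set difference over the relay fingerprints in the two groups of files. A minimal sketch, with hypothetical helper names, assuming relay lines carry a `node_id=$FINGERPRINT` key as bandwidth files do:

```python
# Sketch: which relays appear in the after-March-10 (download) files but
# not in the before-March-10 (upload) files.

def measured_fingerprints(bwfile_text):
    """Collect the node_id fingerprints from one bandwidth-file text."""
    fps = set()
    for line in bwfile_text.splitlines():
        for pair in line.split(" "):
            if pair.startswith("node_id="):
                fps.add(pair[len("node_id="):])
    return fps

def newly_measured(before_texts, after_texts):
    """Fingerprints measured in any 'after' file but in no 'before' file."""
    before = set().union(*map(measured_fingerprints, before_texts)) if before_texts else set()
    after = set().union(*map(measured_fingerprints, after_texts)) if after_texts else set()
    return after - before
```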
I modified sbws to measure only those relays with the upload branch and 76 of them were successfully measured.
The ones that failed, failed with these reasons:
circuit DESTROYED
stream: 'Remote end closed connection without response'
stream: 'TTL expired'
stream: 'Connection reset by peer'
The last 3 types of failures could be because I ran sbws on a residential Internet connection, not on the server.
There were a few XON_RECV and XOFF_RECV events (~5-10 in a loop).
Altogether, I don't see anything too weird measuring these relays, and I still can't explain why uploads would measure fewer relays in total.
As @gk and I discussed in our 1:1, we can continue to investigate this once we have some Prometheus/Grafana metrics from the new metrics database.
So closing for now.