As @mikeperry commented in the net-team meeting on the 20th of March, to decide
whether or not to increase the bridges' ratio threshold for stopping the
distribution of slow bridges, we should look at whether it's usually the same
subset of bridges that falls under the threshold.
To look at this, I parsed the onbrisca logs responding to the rdsys queries, which
have information about which bridges were requested and which of them are
functional or not.
Functional here means that the ratio is over the 0.75 threshold.
Then, as @mikeperry suggested, I compared each day's set of bridges with the
previous day's set.
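For reference, this is roughly the calculation; a minimal sketch, assuming the per-day sets of non-functional fingerprints have already been extracted from the logs (the data layout and names below are just for illustration, not the ones in the parsing branch):

```python
# Minimal sketch of the day-over-day comparison, assuming the per-day sets of
# non-functional bridge fingerprints were already extracted from the logs.
# "previous" follows the order used in the results below, i.e. each day is
# compared against the next-newer day.
def compare_days(non_functional_by_day):
    """non_functional_by_day maps 'YYYY-MM-DD' to a set of bridge fingerprints."""
    days = sorted(non_functional_by_day, reverse=True)  # newest first
    for newer_day, day in zip(days, days[1:]):
        newer = non_functional_by_day[newer_day]
        current = non_functional_by_day[day]
        common = current & newer  # fingerprints non-functional on both days
        print(f"number of non-functional bridges in {newer_day} and in {day}: {len(common)}")
        print(f"percent of the previous: {100 * len(common) / len(current):.2f}%")
```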
These are the results from 6 days of logs.
2023-03-20
number of functional bridges: 2041
number of non-functional bridges: 165
percent of non-functional bridges with respect to functional ones: 8.08%

2023-03-19
number of functional bridges: 2007
number of non-functional bridges: 152
percent of non-functional bridges with respect to functional ones: 7.57%
number of non-functional bridges in 2023-03-20 and in 2023-03-19: 147
percent of the previous: 96.71%

2023-03-18
number of functional bridges: 2028
number of non-functional bridges: 156
percent of non-functional bridges with respect to functional ones: 7.64%
number of non-functional bridges in 2023-03-19 and in 2023-03-18: 143
percent of the previous: 91.67%

2023-03-17
number of functional bridges: 2021
number of non-functional bridges: 182
percent of non-functional bridges with respect to functional ones: 9.00%
number of non-functional bridges in 2023-03-18 and in 2023-03-17: 145
percent of the previous: 79.67%

2023-03-16
number of functional bridges: 2031
number of non-functional bridges: 211
percent of non-functional bridges with respect to functional ones: 10.39%
number of non-functional bridges in 2023-03-17 and in 2023-03-16: 155
percent of the previous: 73.46%

2023-03-15
number of functional bridges: 2032
number of non-functional bridges: 260
percent of non-functional bridges with respect to functional ones: 12.78%
number of non-functional bridges in 2023-03-16 and in 2023-03-15: 178
percent of the previous: 68.46%
It looks like when the percent of non-functional bridges with respect to
functional ones tends to decrease, the percent of non-functional bridges that
were also non-functional the previous day increases.
This might be because the scanner is still measuring bridges that were not
measured yet, but I'm not sure.
In any case, the percent of non-functional bridges that are also non-functional
in the previous day is over 50%, so it looks like there's a set of bridges that
is usually slow?
Another question would be how much we should decrease the ratio threshold.
To try to answer that, I've created a histogram with the ratio distribution for the functional bridges (the intersection of all the dates' data):
[histogram: ratio distribution of the functional bridges]
Maybe we can increase it to 0.9?
I was then curious to see how the distribution of ratios looks for the non-functional ones:
[histogram: ratio distribution of the non-functional bridges]
The ones with a 0 ratio are bridges that failed to get measured. Bridges that haven't been measured yet are considered functional.
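For completeness, the histograms were produced along these lines; a minimal matplotlib sketch, assuming the ratios have already been collected into two lists (the function and variable names are illustrative, not the ones in the parsing branch):

```python
import matplotlib.pyplot as plt

def plot_ratio_histograms(functional_ratios, non_functional_ratios, threshold=0.75):
    """Plot the ratio distributions of functional and non-functional bridges.

    Both arguments are plain lists of floats parsed from the onbrisca logs;
    the names here are hypothetical.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.hist(functional_ratios, bins=50)
    ax1.set_title(f"Functional bridges (ratio >= {threshold})")
    ax2.hist(non_functional_ratios, bins=50)
    ax2.set_title(f"Non-functional bridges (ratio < {threshold})")
    for ax in (ax1, ax2):
        ax.set_xlabel("ratio")
        ax.set_ylabel("number of bridges")
    fig.tight_layout()
    fig.savefig("bridge_ratio_histograms.png")
```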
This is the branch I've used to parse the logs. I'm not creating an MR because we might
not need to calculate this again, and if we do, it'd be better to include
these statistics in the code so that they're calculated live, rather than parsing
logs afterwards.
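If we do want them live at some point, one option (just a sketch of the general idea, not how onbrisca or rdsys are actually structured, and the metric names are made up) would be to export the counts as Prometheus gauges so they show up in Grafana directly:

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; the real services may expose different ones.
FUNCTIONAL = Gauge("bridges_functional", "Bridges with ratio over the threshold")
NON_FUNCTIONAL = Gauge("bridges_non_functional", "Bridges with ratio under the threshold")

def update_metrics(ratio_by_fingerprint, threshold=0.75):
    """ratio_by_fingerprint maps bridge fingerprint -> measured ratio."""
    functional = sum(1 for r in ratio_by_fingerprint.values() if r >= threshold)
    FUNCTIONAL.set(functional)
    NON_FUNCTIONAL.set(len(ratio_by_fingerprint) - functional)

# start_http_server(9100) would expose the metrics for Prometheus to scrape.
```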
In any case, the percent of non-functional bridges that are also non-functional in the previous day is over 50%, so it looks like there's a set of bridges that is usually slow?
Did the "percent of the previous" line (and the one before that) move to the wrong date? I'd expect the first line to show up on 2023-03-16 and not 2023-03-15 as at 2023-03-15 there is no previous data set (and 2023-03-20 should get those lines as, clearly, there is previous data (2023-03-19).
That said, yes, this seems to indicate that the number of problematic bridges is not changing much. @juga: just to be clear, "percent of the previous" means by fingerprint? That is, 96.71% means that almost 97% of the bridges being non-functional today were non-functional yesterday as well? If so, then your numbers seem to suggest that this set is pretty stable.
Other question would be how much should we decrease the ratio threshold. [...] Maybe we can increase it to 0.9?
Did the "percent of the previous" line (and the one before that) move to the wrong date? I'd expect the first line to show up on 2023-03-16 and not 2023-03-15 as at 2023-03-15 there is no previous data set (and 2023-03-20 should get those lines as, clearly, there is previous data (2023-03-19).
I'm not sure I understand what you mean.
I started from the newest date, the 20th, and finished with the oldest, the 15th.
So I'm comparing the 15th with the previous date (newer), which is the 16th.
The "percent of the previous" means the percent of non-functional bridges in the current date that are also in the previous date (which in this case is a newer date), with respect to the current date's non-functional bridges.
So for 2023-03-15, the "percent of the previous" is calculated as:
|non-functional bridges in 2023-03-16 ∩ non-functional bridges in 2023-03-15| * 100 / |non-functional bridges in 2023-03-15| = 178 * 100 / 260 = 68.46%
@juga: just to be clear, "percent of the previous" means by fingerprint?
Yes, all the bridges are counted by unique fingerprint.
Ah, okay. I was not walking backwards in time but starting from the oldest, and "previous" for the 15th meant the 14th to me, as that is the date coming before the 15th.
Try 0.5 and 0.9, both with this kind of analysis, and verify that it does what we expect (0.5 should have less variance in the bridges marked slow, 0.9 would have more change), and also fewer total bridges marked with 0.5, obviously, and more with 0.9.
We'll try with the 0.9 threshold this week and 0.5 the next one, so I'm leaving this issue open.
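To compare the candidate thresholds against the same log data, a sketch like the following could be reused; again, the data layout and names are assumptions, not the ones in the parsing branch:

```python
# Illustrative sketch: for each candidate threshold, count how many bridges
# would be marked slow per day and how stable that set is across days.
# `ratio_by_day` maps 'YYYY-MM-DD' -> dict of fingerprint -> ratio.
def compare_thresholds(ratio_by_day, thresholds=(0.5, 0.75, 0.9)):
    if not ratio_by_day:
        return
    for threshold in thresholds:
        slow_per_day = {
            day: {fp for fp, r in ratios.items() if r < threshold}
            for day, ratios in ratio_by_day.items()
        }
        sizes = [len(s) for s in slow_per_day.values()]
        always_slow = set.intersection(*slow_per_day.values())
        print(f"threshold {threshold}: {min(sizes)}-{max(sizes)} bridges marked "
              f"slow per day, {len(always_slow)} marked slow on every day")
```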
We could see how the changes in the thresholds affected the number of rejected bridges by looking at the Grafana panel, though I'm not sure now which changes we made.
On March 27, at 6:47 UTC, @gk changed the onbrisca threshold to 0.9.
An increase of rejected bridges can be seen in the onbrisca graphs (~0.05?), but I'm not sure about rdsys. Was it changed in rdsys?
On April 3, at 9:19h, @gk changed the onbrisca threshold to 0.5.
The change is almost not noticeable in the onbrisca graphs.
@meskio changed the rdsys threshold at 15:06h and it looks like the number of rejected bridges decreased (~0.02?) in the rdsys graphs.
Maybe one of you can confirm this or interpret the graphs better than I can?
Thanks!
Where are these graphs in the dashboard? I am not seeing an obviously named onbrisca dashboard in grafana2. Is it hidden somewhere?
These results are a bit odd. I expected a bit more difference than this, especially given that your histogram in #152 (comment 2888840) showed many bridges below 0.9 (at least more than below 0.5, which looks like none in that comment's histogram).
It is also odd that the center/mean/mode of that comment's histogram seems to be around 1.5. How is it that, on average, most bridges tend to have 1.5 times faster stream capacity than the average? Is this due to filtering, or doing something special with 0-valued bridges?
Also, we have to change this value in two places? That is not ideal. Can both places listen for the same consensus parameter?
The threshold changes in onbrisca are ignored by rdsys; rdsys for now uses the threshold from its configuration file. Looking at the graph, the changes happen on March 27 1800 UTC and March 6 1530 UTC.
These results are a bit odd. I expected a bit more difference than this, especially given that your histogram in #152 (closed) (comment 2888840) showed many bridges below 0.9 (at least more than below 0.5, which looks like none in that comment's histogram).
The 1st histogram has only values for the "functional" bridges, i.e. over the 0.75 threshold in this case, and the 2nd for the non-functional ones, i.e. under the 0.75 threshold.
I can see now that I should have created the histogram with all of them, to make it easier to see the number of bridges with a ratio between 0 and 1; with the 2 current histograms it's a bit harder to compare.
I could repeat those histograms with all the bridges, though it takes way longer than just looking at Grafana.
It is also odd that the center/mean/mode of that comment's histogram seems to be around 1.5. How is it that, on average, most bridges tend to have 1.5 times faster stream capacity than the average? Is this due to filtering, or doing something special with 0-valued bridges?
Are the changes in the graph visible a bit later than the dates/times that were given on IRC because the bridges are not requested all at once?
I'm not sure. We do restart rdsys to change the threshold; that is why we see the spike of untested bridges. That spike should happen just at the moment of the restart; then rdsys will ask for bridges in batches of 25, so it might take some time to collect all of them.