Mike's performance work has shown that the smaller relays -- for example, the ones that set BandwidthRate and BandwidthBurst to 20k -- are never good news to have in your circuit.
Damon McCoy's HotPETs 2010 paper showed more details of how you could improve performance by dumping the bottom X% of the relays.
Of course, there's a network effect issue here: clearly you get better performance if you're the only one ignoring the slower relays.
But I think there's something to this even when everybody is doing it. Our load balancing makes a 500KB relay 10x more likely to be used than a 50KB relay, but given a whole lot of users building paths, the 50KB relay will get overloaded more often and show worse characteristics when overloaded than the 500KB relay -- in large part because we're load balancing by circuit rather than by byte.
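The 10x weighting claim is easy to see in a toy sketch (the relay names and weights here are invented for illustration): bandwidth-weighted selection picks the big relay ten times as often, but each pick is a circuit, not a byte, so one busy circuit can saturate the slow relay.

```python
import random

random.seed(1)

# Hypothetical two-relay network; weights are consensus bandwidths in KB/s.
relays = ["fast", "slow"]
weights = [500, 50]

# Bandwidth-weighted path selection, as in the load balancing described above.
counts = {"fast": 0, "slow": 0}
for _ in range(100_000):
    counts[random.choices(relays, weights=weights)[0]] += 1

ratio = counts["fast"] / counts["slow"]
print(round(ratio))  # roughly 10: the 500KB relay is used 10x as often

# The catch: this balances *circuits*, not *bytes*. A single busy circuit
# pushing 50KB/s saturates the slow relay completely, while the same
# circuit uses only 10% of the fast relay's capacity.
```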
So I'd like to do a series of performance experiments where the directory authorities take away the Fast flag from everybody whose consensus bandwidth is under X.
Ideally we'd do it while the network is under a variety of load conditions (removing capacity from the network when there's a lot of load seems like it would hurt us more, but then, using overloaded relays when there's a lot of load could hurt us a lot too).
This could even be a research task that we try to give to a research group that wants to work on simulated Tor network performance. But I think that's a separate project.
Along with the performance simulations we need to consider the anonymity implications of reducing the diversity of relays. How much anonymity do we lose if we treat anonymity as entropy? How much do we lose if we consider the location-based anonymity metrics of Feamster or Edman? Ideally we'd figure out some way to compare performance and anonymity so we can decide if we like various points in the tradeoff space. Really, we should be working on this piece already to analyze whether Mike's bwauth algorithm is worth it.
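To make the "anonymity as entropy" question concrete, here is a toy sketch (all weights invented) of the Shannon entropy of the bandwidth-weighted selection distribution, and how much of it a cutoff costs:

```python
import math

def selection_entropy(weights):
    """Shannon entropy (bits) of a bandwidth-weighted relay choice."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs)

# Invented network: 90 slow relays (50 KB/s) and 10 fast ones (500 KB/s).
weights = [50] * 90 + [500] * 10
full = selection_entropy(weights)
cut = selection_entropy([w for w in weights if w >= 100])  # drop the slow 90

print(round(full, 2), round(cut, 2))  # full ~5.82 bits, cut = log2(10) ~3.32
```

The slow relays carry little probability mass individually, so cutting all 90 of them costs only a couple of bits here; that is exactly the kind of tradeoff point the experiments would need to quantify.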
Finally, should we consider keeping them in the network if they have nice exit policies?
Relays that are too slow should be encouraged to become bridges. Even better, we should help people recognize when they ought to start out as a bridge rather than trying to be a relay.
What about using them for directory information still? Seems like once we have microdescriptors out, they could cache these and distribute the new descriptors to clients (though bootstrapping off them might be a bit painful).
Trac: Points: N/A to N/A
So I'd like to do a series of performance experiments where the directory authorities take away the Running flag from everybody whose consensus bandwidth is under X.
A cleaner way to run the experiment would be to take away their Fast flag. I could imagine putting in a consensus param that authorities look at (off by default), so we can easily modify the values over time and see how the torperf output changes.
We should also ponder if we really mean consensus bandwidth here, or if we mean relay descriptor bandwidth. Currently the Fast flag is assigned based on relay descriptor bandwidth.
Trac: Milestone: Deliverable-Mar2011 to N/A Owner: N/A to karsten Component: Tor Relay to Metrics
This sounds like an analysis task among many others we should work on. Removing the "Project: " part from the summary. If this is really a project, please change ticket type to "project."
Trac: Summary: Project: Raise the minimum bandwidth for being a relay? to Investigate raising the minimum bandwidth for being a relay Type: enhancement to task
A cleaner way to run the experiment would be to take away their Fast flag. I could imagine putting in a consensus param that authorities look at (off by default), so we can easily modify the values over time and see how the torperf output changes.
Modified this task to be more clearly about analyzing taking the Fast flag away.
Trac: Summary: Investigate raising the minimum bandwidth for being a relay to Investigate raising the minimum bandwidth for getting the Fast flag Description: changed "Running flag" to "Fast flag" in the experiment paragraph Keywords: performance added
Why is this ticket assigned to me? Am I supposed to do something here? If so, please tell me and re-assign to me. Assigning to ticket reporter for now.
Trac: Owner: karsten to arma Status: new to assigned
Based on Rob's CSET paper, I am now less optimistic that we can answer this question with simulations: messing with what relays make up a test network is among the least solved pieces of simulating Tor networks.
So I think we should proceed in two directions:
A) We should get gsathya or asn or whoever to confirm that dropping relays with bandwidth less than X doesn't change any of the diversity metrics much (because they're never picked often enough to matter). What's the largest X for which you can reasonably say that?
B) Then we should do an actual performance experiment on the live Tor network, using the FastFlagMinThreshold consensus param added in #3946, and see what we see on torperf.
Judging the performance experiment on the live Tor network will be especially messy because there are so many variables, but I think despite that it may still be the best route.
Based on Rob's CSET paper, I am now less optimistic that we can answer this question with simulations: messing with what relays make up a test network is among the least solved pieces of simulating Tor networks.
Unless we use rpw's machine to simulate all of the existing relays. Then we don't worry about downsampling problems. ;)
A) We should get gsathya or asn or whoever to confirm that dropping relays with bandwidth less than X doesn't change any of the diversity metrics much (because they're never picked often enough to matter).
So, is this ticket about dropping relays from the consensus, or taking away their Fast flag? I can see how we can graph the former, but I'm not sure about the latter.
What's the largest X for which you can reasonably say that?
Sounds like we want #6232 graphs with the minimum bandwidth to keep relays in the consensus on the x axis. For example, a graph similar to https://trac.torproject.org/projects/tor/attachment/ticket/6232/entropy-august.png would have its blue lines decreasing steadily, because we're taking away relays, but the red lines would stay at the same level and only drop in the last third or so, because we start taking away relays from the slowest ones.
Is that what you have in mind here?
gsathya, asn, is this something you want to look into?
So, is this ticket about dropping relays from the consensus, or taking away their Fast flag? I can see how we can graph the former, but I'm not sure about the latter.
Shouldn't matter much. I guess that leads to: do your consensus diversity analysis tools consider the Fast flag? They probably should, since clients do.
Sounds like we want #6232 graphs with the minimum bandwidth to keep relays in the consensus on the x axis. For example, a graph similar to https://trac.torproject.org/projects/tor/attachment/ticket/6232/entropy-august.png would have its blue lines decreasing steadily, because we're taking away relays, but the red lines would stay at the same level and only drop in the last third or so, because we start taking away relays from the slowest ones.
Is that what you have in mind here?
Sounds plausible. One nice way of looking at it might be: what's the highest bandwidth cutoff such that the red lines in your graph lose 1% or less? Then the same question for 2%, 3%, 4%, 5%.
Of course, that needs a definition of what it means for two lines to differ. We might try defining the difference as the point x where f1(x) and f2(x) differ the most. If there's noise, we might define it as the 10th percentile of these points x, which would let us say "90% of the time there was at most a 1% difference."
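The two difference definitions above can be sketched as follows (function names and the sampling convention are assumptions for illustration, not from any existing script):

```python
def max_difference(f1, f2):
    """Largest pointwise gap between two curves sampled at the same x values."""
    return max(abs(a - b) for a, b in zip(f1, f2))

def difference_at_quantile(per_consensus_diffs, quantile=0.90):
    """Noise-tolerant variant: given one max-difference value per consensus,
    report the value that 90% of consensuses stay at or below, i.e.
    '90% of the time there was at most this much difference'."""
    diffs = sorted(per_consensus_diffs)
    return diffs[max(0, round(quantile * len(diffs)) - 1)]

pristine = [1.00, 0.99, 0.97]   # e.g. diversity metric per cutoff, no cutoff
modified = [1.00, 0.98, 0.92]   # same metric with some bandwidth cutoff
print(round(max_difference(pristine, modified), 2))  # 0.05
print(difference_at_quantile([0.01, 0.01, 0.02, 0.02, 0.02,
                              0.03, 0.03, 0.04, 0.04, 0.30]))  # 0.04
```

The quantile variant deliberately ignores the worst 10% of consensuses, which is what lets the noisy outlier (0.30) drop out of the headline number.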
In that output, min_cw is the minimum consensus weight of relays that we keep in the consensus. That value would start at the smallest consensus weight in the network, and we'd calculate entropy values for all relays in the consensus. Then we'd raise the minimum to the second-smallest value in the network, throw out all relays below that value, and compute new entropy values. Continue until we're at the relay with highest consensus weight.
The first column, validafter, is the consensus valid-after time. The third column, relays, contains the number of relays left. The other columns (all, max_all, etc.) are defined similarly to #6232.
Roger, please note that I assumed you want to cut out relays based on consensus weight, not advertised bandwidth. Please correct me if that assumption is wrong. (Writing the analysis script for consensus weights is probably easier, so we could later extend it to advertised bandwidth if required.)
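The sweep described above can be sketched like this (toy consensus weights; the real script reads actual consensuses and also tracks the validafter column):

```python
import math

def entropy(weights):
    """Shannon entropy (bits) of bandwidth-weighted relay selection."""
    total = sum(weights)
    # "+ 0.0" normalizes the -0.0 that appears for a single-relay list
    return -sum(w / total * math.log2(w / total) for w in weights) + 0.0

# Hypothetical consensus: fingerprint -> consensus weight.
consensus = {"A": 10, "B": 10, "C": 50, "D": 200, "E": 800}

# Raise min_cw through every consensus weight present in the network,
# throw out the relays below it, and recompute entropy each time.
rows = []
for min_cw in sorted(set(consensus.values())):
    kept = [cw for cw in consensus.values() if cw >= min_cw]
    rows.append((min_cw, len(kept), entropy(kept)))

for min_cw, relays, h in rows:
    print(min_cw, relays, round(h, 3))
```

The last row always has a single relay and zero entropy, since only the relay with the highest consensus weight survives.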
So, is this ticket about dropping relays from the consensus, or taking away their Fast flag? I can see how we can graph the former, but I'm not sure about the latter.
Shouldn't matter much.
Really?
I guess that leads to: do your consensus diversity analysis tools consider the Fast flag? They probably should, since clients do.
Our tools don't consider the Fast flag. They're only based on relays' consensus weights, their Exit and Guard flags, and the bandwidth-weights line.
Simulating what clients would do, including considering the Fast flag, is almost impossible. There are too many variables in how clients pick relays -- what other relays are already in the circuit, family settings, same /16's -- for us to model this reasonably. If we want results this precise, we'll have to run simulations with the actual Tor code.
Sounds plausible. One nice way of looking at it might be: what's the highest bandwidth cutoff such that the red lines in your graph lose 1% or less? Then the same question for 2%, 3%, 4%, 5%.
Sure, that's something that the CDF I suggested above should show. We could put percent values on the y axis and start with current diversity at 100%. Then you could read what x value corresponds to 99% (98%, ...).
Of course, that needs a definition of what it means for two lines to differ. We might try defining the difference as the point x where f1(x) and f2(x) differ the most. If there's noise, we might define it as the 10th percentile of these points x, which would let us say "90% of the time there was at most a 1% difference."
Ah, my idea was to start with a single consensus. Combining multiple consensuses would be step 2. (The data format I suggested above should support the graphs you suggest here.)
In that output, min_cw is the minimum consensus weight of relays that we keep in the consensus. That value would start at the smallest consensus weight in the network, and we'd calculate entropy values for all relays in the consensus. Then we'd raise the minimum to the second-smallest value in the network, throw out all relays below that value, and compute new entropy values. Continue until we're at the relay with highest consensus weight.
The first column, validafter, is the consensus valid-after time. The third column, relays, contains the number of relays left. The other columns (all, max_all, etc.) are defined similarly to #6232.
There seem to be quite a few relays with "None" bandwidth; should we consider such relays? (They count when calculating the number of relays but don't provide any bandwidth.)
Trac: Cc: karsten, gsathya, asn, robgjansen to karsten, gsathya, asn, robgjansen, aaron.m.johnson@nrl.navy.mil Status: assigned to needs_review
Roger, please note that I assumed you want to cut out relays based on consensus weight, not advertised bandwidth. Please correct me if that assumption is wrong. (Writing the analysis script for consensus weights is probably easier, so we could later extend it to advertised bandwidth if required.)
The Fast and Guard flags look at descriptor bandwidth, not consensus bandwidth. So yes, eventually we should do a version of this analysis that looks at descriptor bandwidth.
There seem to be quite a few relays with "None" bandwidth; should we consider such relays? (They count when calculating the number of relays but don't provide any bandwidth.)
I think this was caused by your code looking at position-dependent consensus weights, not raw consensus weights. Should be fixed.
Roger, please note that I assumed you want to cut out relays based on consensus weight, not advertised bandwidth. Please correct me if that assumption is wrong. (Writing the analysis script for consensus weights is probably easier, so we could later extend it to advertised bandwidth if required.)
The Fast and Guard flags look at descriptor bandwidth, not consensus bandwidth. So yes, eventually we should do a version of this analysis that looks at descriptor bandwidth.
A version of this analysis that looks at descriptor bandwidth would sort relays by advertised bandwidth and cut off the slowest relays based on that. In the graphs, the x axis would say "Minimum advertised bandwidth" instead of "Minimum consensus weight", and of course the lines might be slightly different. But everything else would remain the same, including how we calculate guard entropy for the "All guards" sub graph.
We'll mostly have to change router.bandwidth to router.advertised_bw a few times in the code. Shouldn't be too hard.
Sathya, do you want to look into this, or shall I?
I talked to Ian and Aaron a bit more about this analysis. What we'd like to see, for a given consensus, is a graph with bandwidth cutoff on the x axis and L_\inf on the y axis. L_\inf is the largest distance between the two probability distributions -- one being the probability distribution of which relay you'd pick from the pristine consensus, and the other the distribution in the modified consensus. "largest distance" means the element (i.e. relay) with the largest difference.
Then we should consider time: looking at C consensuses over the past year or something, for a given cutoff, we should graph the cdf of these C data points where each data point is the L_\inf of that consensus for that cutoff. The hope is that for some cutoffs, the cdf has very high area-under-the-curve.
I talked to Ian and Aaron a bit more about this analysis. What we'd like to see, for a given consensus, is a graph with bandwidth cutoff on the x axis and L_\inf on the y axis. L_\inf is the largest distance between the two probability distributions -- one being the probability distribution of which relay you'd pick from the pristine consensus, and the other the distribution in the modified consensus. "largest distance" means the element (i.e. relay) with the largest difference.
Sounds doable. I'd say let's start with plain consensus weight fractions and postpone exit, guard, country, and AS probabilities until we have a better handle on this type of analysis.
Here, validafter is the consensus valid-after time, min_advbw is the minimum advertised bandwidth of relays kept in the modified consensus, relays is the number of those relays, and linf is the largest difference between consensus weight fractions of all relays. The probability in the pristine consensus is always the consensus weight fraction. The probability in the modified consensus is 0 if the relay was excluded, or the consensus weight fraction relative to the new consensus weight sum (which is lower than the original consensus weight sum, because we cut out some relays). We'll want to compare probabilities of all relays, including those that we excluded, because they have non-zero probability in the modified consensus.
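A minimal sketch of that linf computation (all relay data here is invented; the real script reads full consensuses and server descriptors):

```python
def linf(consensus_weights, min_advbw, advertised_bws):
    """Largest per-relay change in selection probability when relays
    with advertised bandwidth below min_advbw are excluded."""
    total = sum(consensus_weights.values())
    kept = {fp for fp, bw in advertised_bws.items() if bw >= min_advbw}
    kept_total = sum(consensus_weights[fp] for fp in kept)
    worst = 0.0
    for fp, cw in consensus_weights.items():
        pristine = cw / total                               # pristine consensus
        modified = cw / kept_total if fp in kept else 0.0   # renormalized, or 0
        worst = max(worst, abs(modified - pristine))
    return worst

cws = {"A": 50, "B": 150, "C": 800}       # consensus weights
advbws = {"A": 40, "B": 120, "C": 900}    # advertised bandwidths (KB/s)
print(round(linf(cws, 100, advbws), 3))   # 0.05: relay A loses its 5%
```

Note that excluded relays are compared too: relay A contributes the largest difference precisely because its probability drops from 5% to 0, which outweighs the small renormalization gains of B and C.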
Then we should consider time: looking at C consensuses over the past year or something, for a given cutoff, we should graph the cdf of these C data points where each data point is the L_\inf of that consensus for that cutoff. The hope is that for some cutoffs, the cdf has very high area-under-the-curve.
Sure, we should be able to plot those graphs from the file format above.
Sathya, want to look into modifying pyentropy.py for the linf stuff?
Done. I just monkey patched router.bandwidth to router.advertised_bw, I think this is fine for now.
Hmm. I didn't look very closely, but I think this doesn't work. We'll want to exclude relays based on descriptor bandwidth but calculate the various metrics based on consensus weight. With your patch we're using descriptor bandwidth for everything.
This specific patch is probably moot, now that we're going to change the analysis from entropy values to L_\inf. But we'll want to have a similar patch for the L_\inf stuff, too.
Here, validafter is the consensus valid-after time, min_advbw is the minimum advertised bandwidth of relays kept in the modified consensus, relays is the number of those relays, and linf is the largest difference between consensus weight fractions of all relays. The probability in the pristine consensus is always the consensus weight fraction. The probability in the modified consensus is 0 if the relay was excluded, or the consensus weight fraction relative to the new consensus weight sum (which is lower than the original consensus weight sum, because we cut out some relays). We'll want to compare probabilities of all relays, including those that we excluded, because they have non-zero probability in the modified consensus.
"The probability in the modified consensus is 0 if the relay was excluded," and "including those that we excluded, because they have non-zero probability in the modified consensus" seem to be contradicting?
Sathya, want to look into modifying pyentropy.py for the linf stuff?
I'm ignoring the probabilities of relays that we excluded because they have 0 probability. Please check my bug_1854_v2 branch. Thanks!
"The probability in the modified consensus is 0 if the relay was excluded," and "including those that we excluded, because they have non-zero probability in the modified consensus" seem to be contradicting?
What I meant is "including those that we excluded, because they have non-zero probability in the pristine consensus". Sorry.
Sathya, want to look into modifying pyentropy.py for the linf stuff?
I'm ignoring the probabilities of relays that we excluded because they have 0 probability. Please check my bug_1854_v2 branch. Thanks!
Can you change the above? I just had a quick look, but I'd want to look closer once it's doing the thing that I think arma et al. had in mind.
And can you either remove pyentropy.py, or move your changes to pyentropy.py and remove pylinf.py, so that there's only the code file that we're actually using in the repository?
"The probability in the modified consensus is 0 if the relay was excluded," and "including those that we excluded, because they have non-zero probability in the modified consensus" seem to be contradicting?
What I meant is "including those that we excluded, because they have non-zero probability in the pristine consensus". Sorry.
Sathya, want to look into modifying pyentropy.py for the linf stuff?
I'm ignoring the probabilities of relays that we excluded because they have 0 probability. Please check my bug_1854_v2 branch. Thanks!
Can you change the above? I just had a quick look, but I'd want to look closer once it's doing the thing that I think arma et al. had in mind.
Done
And can you either remove pyentropy.py, or move your changes to pyentropy.py and remove pylinf.py, so that there's only the code file that we're actually using in the repository?
Over the past few days, the minimum bandwidth for the Fast flag looks like it moved from 32KB/s up to 50KB/s and then back down. So maybe there is data to analyze even without explicitly doing the experiment. :)
Some background from Karsten's email:
I usually take another approach for combining network statuses and server descriptors in an analysis: parse all server descriptors, extract the relevant parts, keep them in memory stored under their descriptor digest, parse consensuses, use server descriptor parts from memory. This is faster, because we only have to parse a server descriptor once, not every time it's referenced from a consensus, which can be 12 times or more. There's also the option to store intermediate results from parsing server descriptors in a temp file and only read that when re-running the analysis, which typically happens quite often. This approach is also more efficient, because we can parse server descriptors contained in tarballs without extracting them.
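The descriptor-indexing approach described in the email can be sketched as follows (function names and the tuple formats here are assumptions for illustration):

```python
def build_descriptor_index(descriptors):
    """descriptors: iterable of (descriptor digest, advertised bandwidth)
    pairs, e.g. extracted once while streaming through tarballs without
    unpacking them."""
    return dict(descriptors)

def resolve_consensus(consensus_entries, index):
    """consensus_entries: (fingerprint, descriptor digest) pairs. A digest
    may be referenced from 12 or more consensuses, but the parse cost was
    paid only once when the index was built."""
    return {fp: index.get(digest) for fp, digest in consensus_entries}

index = build_descriptor_index([("d1", 500), ("d2", 50)])
bws = resolve_consensus([("A", "d1"), ("B", "d2"), ("C", "d3")], index)
print(bws)  # {'A': 500, 'B': 50, 'C': None} -- 'd3' was never parsed
```

The `None` entry shows the other property worth keeping: a consensus entry whose referenced descriptor is missing stays visible instead of silently disappearing.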
I've changed pylinf to be able to read a single tarball or a bunch of server descriptor tarballs and store them in memory. I haven't had the chance to test it much; let me know if you find any bugs.
Sounds good. Did the code produce meaningful output? I won't be able to review the code today, but I could try tomorrow or Friday. Knowing that the code probably works as expected would be good though. Thanks!
Sounds good. Did the code produce meaningful output? I won't be able to review the code today, but I could try tomorrow or Friday. Knowing that the code probably works as expected would be good though. Thanks!
Made more changes here - https://github.com/gsathya/metrics-tasks/compare/bug_1854_v2 It's been running on my tiny vps for more than 4 hours, processing 2 months of server descriptor data and 1 month of consensus data. I'm going to run this on lemmonni now.
Sounds good. Did the code produce meaningful output? I won't be able to review the code today, but I could try tomorrow or Friday. Knowing that the code probably works as expected would be good though. Thanks!
Made more changes here - https://github.com/gsathya/metrics-tasks/compare/bug_1854_v2 It's been running on my tiny vps for more than 4 hours, processing 2 months of server descriptor data and 1 month of consensus data. I'm going to run this on lemmonni now.
Code looks good, merged. I also graphed your 1 month of data. I'm currently running your script in an EC2 instance on 1 year of data. Will let you know once I have results.
What's going on at the right end of the linf graph there?
Other than that, the plot shows that setting the cutoff to 1 MB/s (using only the top 400 relays or so) would affect the choice of relays by a tiny amount that I can't read from the graph. (Can you make that graph log/log?)
What is the linf comparison to? A cutoff of 20 KB/s? No cutoff? There are relays appearing in the upper plot with speeds < 20 KB/s.
Can you also plot total advertised bandwidth with the same x-axis? A rough eyeing of the top figure suggests that the ~2100 relays with bandwidths below 1 MB/s contribute a total of ~500 MB/s, but it seems to me that that should produce more than a negligible change in probability distribution.
What's going on at the right end of the linf graph there?
You mean why is it skyrocketing and then dropping to almost zero? I think when there's only 1 relay left, the probability of it being picked grows to 100%, so linf is 100% minus its previous probability of being picked. And when no relay is left, linf goes down to the maximum probability of a relay being picked in the pristine consensus that now cannot be picked anymore.
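A quick toy check of those two endpoint values, assuming the script's convention that excluded relays get probability 0 and the survivors are renormalized (the weights below are invented):

```python
# 100 small relays plus one fast one; all weights are made up.
weights = [10] * 100 + [30]
p_max = max(weights) / sum(weights)

spike = 1.0 - p_max   # only the fastest relay left: its probability -> 100%
tail = p_max          # no relays left: the largest lost pristine probability

print(round(spike, 3), round(tail, 3))  # 0.971 0.029
```

Because the fastest relay's pristine probability is small in a large network, linf first spikes toward 1 and then collapses to that small probability once nothing is left, matching the shape in the graph.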
> Other than that, the plot shows that setting the cutoff to 1 MB/s (using only the top 400 relays or so) would affect the choice of relays by a tiny amount that I can't read from the graph. (Can you make that graph log/log?)
Attached. (I left the original graph in and added another graph for log/log, because number of relays looks funny on a log scale and there's no easy way to use different scales for both sub graphs.)
> What is the linf comparison to? A cutoff of 20 KB/s? No cutoff? There are relays appearing in the upper plot with speeds < 20 KB/s.
No cutoff, that is, comparing to the pristine consensus where any relay could be picked. That's how linf is defined in the script right now. We can change that, but new results would be at least 9 hours away.
> Can you also plot total advertised bandwidth with the same x-axis? A rough eyeing of the top figure suggests that the ~2100 relays with bandwidths below 1 MB/s contribute a total of ~500 MB/s, but it seems to me that that should produce more than a negligible change in probability distribution.
Attached. Please note that the first graphs were previously wrongly labeled. They showed data from 2011-11-19 23:00:00, not 2012-10-31 23:00:00. The new PDF shows the correct data, including total excluded advertised bandwidth.
OK, so my eyeballing of ~500 MB/s excluded at a 1 MB/s cutoff turned out to be pretty darned close. ;-)
So at that cutoff, about 15% of the network bandwidth disappears. But that 15% was spread (highly unevenly) over 2100 relays. Each of those relays, according to the linf figure, contributed a maximum of about 0.5% of the bandwidth, and in turn, the remaining relays see at most 0.5% extra users. (NOTE: that's 0.5% of all the users, not 0.5% of what it had before.)
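That back-of-the-envelope reasoning can be checked with a small sketch (toy numbers, and again assuming selection is proportional to advertised bandwidth):

```python
def excluded_fraction(bandwidths, cutoff):
    # Fraction of total advertised bandwidth lost when relays below
    # the cutoff are dropped.
    return sum(b for b in bandwidths if b < cutoff) / float(sum(bandwidths))

def max_extra_users(bandwidths, cutoff):
    # Largest absolute increase in selection probability that any
    # remaining relay sees after renormalization -- a fraction of all
    # users, not of what that relay had before.
    total = float(sum(bandwidths))
    kept = float(sum(b for b in bandwidths if b >= cutoff))
    return max(b / kept - b / total for b in bandwidths if b >= cutoff)

# Toy network: dropping the relays below 150 removes 20% of the
# bandwidth, and the biggest remaining relay absorbs at most an extra
# 20% of all users.
bw = [800.0, 100.0, 100.0]
print(round(excluded_fraction(bw, 150.0), 6))  # prints 0.2
print(round(max_extra_users(bw, 150.0), 6))    # prints 0.2
```

In the real data the 15% of excluded bandwidth is spread over ~2100 relays, so the per-relay shift stays under the ~0.5% noted above.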
OK, here's the plot I'm interested in now: x-axis: bandwidth of relay (log scale). y-axis: one line showing the probability distribution of relay selection with a 20 KB/s cutoff, and one with a 1 MB/s cutoff. Feel free to throw other intermediate values in there as well. We'll probably need a version with a linear y-axis and one with a log y-axis.
> OK, here's the plot I'm interested in now: x-axis: bandwidth of relay (log scale). y-axis: one line showing the probability distribution of relay selection with a 20 KB/s cutoff, and one with a 1 MB/s cutoff. Feel free to throw other intermediate values in there as well. We'll probably need a version with a linear y-axis and one with a log y-axis.
This PDF (accidentally uploaded twice, *-c.2.pdf is the same file) now contains two new graphs of cumulative probability distributions, along with the existing graphs.
Attached. It's not an actual probability distribution function though, because multiple relays can have exactly the same advertised bandwidth, and I figured you don't want a graph with probabilities of those relays being summed up. (Unless you actually wanted such a graph, in which case I could easily make one.)
Same as above, but the y-axis is the ratio (prob with 1 MB cutoff / prob with 20 KB cutoff).
I expect 0 below 1 MB/s and a fairly constant value (~ 1.15 I think?) above 1 MB/s. Is that what we see?
Attached. I cut out relays below 1 MiB/s, because we set them to exactly 0.0, so we'd run into div/0 there. 1.087879 is the exact constant value for all relays above 1 MiB/s.
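That constant falls out of the weighting itself: if selection probability is proportional to advertised bandwidth, then for every relay kept under both cutoffs the ratio is (total bandwidth above the low cutoff) / (total bandwidth above the high cutoff), independent of which relay you look at. A toy sketch with hypothetical bandwidth values:

```python
def selection_probs(bandwidths, cutoff):
    # Selection probability per relay, assuming weighting purely by
    # advertised bandwidth among relays at or above the cutoff.
    kept = float(sum(b for b in bandwidths if b >= cutoff))
    return [b / kept if b >= cutoff else 0.0 for b in bandwidths]

bw = [2000.0, 1500.0, 1200.0, 500.0, 100.0, 30.0]  # hypothetical KB/s values
p_low = selection_probs(bw, 20.0)     # low cutoff keeps everything
p_high = selection_probs(bw, 1000.0)  # high cutoff keeps the top three
ratios = [h / l for h, l in zip(p_high, p_low) if h > 0.0]
# Every ratio equals sum of all kept-low bandwidth / sum of kept-high
# bandwidth -- a single constant for all surviving relays.
assert all(abs(r - ratios[0]) < 1e-12 for r in ratios)
```

This is of course a simplified model that ignores bandwidth weights, guard/exit flags, and the like, but it shows why a flat ratio above the cutoff is exactly what one should expect.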
Post by Paul, who still doesn't have a proper account:
Been talking to Ian about this today here at Dagstuhl. I don't think all the effects of significantly shrinking the set of nodes that are ever chosen have been considered. If the network shrinks to c. 1/4 its current size, this has the potential for tremendous psychological impact on users, relay volunteers, some adversaries, funders, etc. There is thus a big difference between switching a lot of nodes to never being chosen vs. changing the distributions to make them more rarely chosen. Instead of changing the Fast flag, it would then make more sense to alter the bandwidth weighting. And the more gradual the change in probability of being chosen, the less any nodes will naturally count as a group that has simply been excluded. If performance is best served by more of a step function, then perhaps something in between will still significantly improve performance statistics without, e.g., resulting in graphs showing a 75% drop in the number of nodes with the Fast flag when the change is rolled out.
In Canada at least, the assumption that home relay users can provide less than 500 KB/s of bandwidth may be obsolete by next year. Cable and DSL carriers are pushing connections with either 2 Mbit/s or 10 Mbit/s upload as their "standard" package, albeit with total-bytes-transferred restrictions resulting in hibernation.
http://www.rogers.com/web/link/hispeedBrowseFlowDefaultPlans
http://www.bell.ca/Bell_Internet/Internet_access
However, that might mean users' expectations of speed from Tor will also be higher, at least in Canada.
Please never let users who have run Vidalia for years on slow connections somehow find out in the press: your connection is considered too slow; you wasted a year of electricity running Vidalia with the volunteer option without volunteering anything.
In any case where someone's relay won't be used anymore: please show them a big fat warning, so they won't waste their uptime.
As Paul suggested: please keep the numbers. Try to talk them into becoming bridges or just select them very rarely.
And finally, please don't tell people that residential connections aren't of big help; keep the community spirit alive!
Encouraging users on slow connections to be bridges seems to make much more sense than encouraging them to be relays, no? Even "all (stable?) clients are automatically bridges" makes plausible sense, whereas "all (stable?) clients are automatically relays" will make the consensus melt.
> Please never let users who have run Vidalia for years on slow connections somehow find out in the press: your connection is considered too slow; you wasted a year of electricity running Vidalia with the volunteer option without volunteering anything.
> In any case where someone's relay won't be used anymore: please show them a big fat warning, so they won't waste their uptime.
We ought to be mindful of the future - it may bring exciting possibilities for even low-bandwidth relays:
parallel (torrent-like, i2p-like) pathways/cells
distributed data store possibilities
every relay a small encrypted (opt-in or opt-out) data store provider perhaps - I guess ala freenet, but different
distributed redundant hash table(s) - greater redundancy is usually not a bad thing for a censorship-resistant distributed data store, I thought...
build a network-of-trust (GPG sort of style) on Tor network - certain data models may depend on greater number of nodes in future, and be weakened by reduction of node count
The point is - Tor is not finished! We have a long way to go to fully decentralise communications authority amongst the broader community. Let's definitely not pre-empt our future by reducing possibilities.
> As Paul suggested: please keep the numbers. Try to talk them into becoming bridges or just select them very rarely.
Those who have a spirit of contribution will be grateful they can contribute, even if only a little. In the future, they may be able to make bigger contributions.
> And finally, please don't tell people that residential connections aren't of big help; keep the community spirit alive!
A big Ack!
A small help is a big help, is what we ought to say!
In addition: building the community also builds the future's "more significant contributors" - contributions might come in many forms: financial, bandwidth, useful hidden services, brilliant ideas for future development, or even development itself. Every contributor starts somewhere!
Emphasize the genuine contribution directions, as said above, such as running a bridge. I believe the website does this pretty well now. Anecdotally, I recently became an exit relay, and the website's encouragement steered me there with "this is needed, especially full exits" (the wording can still be improved here: less fear, more reference to established legal precedents regarding 'carriers' in various jurisdictions). That's how to steer people in the most useful directions.
Parallel paths may make those slow relays useful in the future. So discouraging people from doing something towards their freedom and others' freedom is counter-productive to the long-term health of the broader community.
Contributor graduation - building the community of those who wish to contribute; in time, some will graduate to become bigger contributors.
Again, future thinking.
A small help is a big help.
Everyone doing a little bit makes the future jobs easier.
We have not yet solved all the problems, so we definitely benefit from more people putting attention, action, and in many cases future intention towards our broader goals.
Optimising bandwidth for current tor facilities is good.
Maximising our community base from which future development and technology can build is very good.