I'm still not sure if this is a good metric for #2681, but Sathya asked me about this idea in a metrics ticket some days ago:
Just brainstorming here, but I wonder if some kind of metric on how quickly the Tor network changes would help us decide if 3 days is a better interval than 5 days.
By "how quickly the Tor network changes", I mean that if you take a consensus X from 3 days ago and a consensus Y from today, what's the percentage of routers in Y that are also in X (based on identity key)?
Such a metric could be a set of probability distributions that describe how likely it is for the Tor network to change by a specific amount in X days.
So, for example, the probability distributions would tell us things like "Based on previous data, the Tor network has a 40% chance of changing by 20% in five days," or "The Tor network has an 80% chance of changing by less than 5% in one day," or "The Tor network has a 40% chance of changing by 35% in two months."
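A minimal sketch of that comparison, assuming two consensus files on disk and stem's consensus parsing (the file names here are placeholders):

```python
from stem.descriptor import parse_file

def fingerprints(path):
    # Identity fingerprints of all routers listed in one consensus file.
    return set(router.fingerprint for router in
               parse_file(path, 'network-status-consensus-3 1.0'))

x = fingerprints('consensus-3-days-ago')  # placeholder file names
y = fingerprints('consensus-today')
print('%.1f%% of routers in Y are also in X' % (100.0 * len(x & y) / len(y)))
```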
Sounds like a fun analysis for someone who wants to play with stem's new consensus-parsing module.
In addition to the metric you suggest, which is based on absolute relay numbers, I'd probably add another one: "If you take a consensus X from 3 days ago and a consensus Y from today, what's the fraction of total consensus weights that is available for building circuits?"
The output could be a CSV file like this (example data):
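For instance (an illustrative sketch only; the column names and timestamp are placeholders, not real measurements):

```
valid-after,hours,frac_relays,frac_cw
2012-01-01 00:00:00,72,0.554,0.768
```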
This would mean that, when using a 72-hour-old consensus at the stated date and time, you'd know 55.4% of relays from the recent consensus, which together have 76.8% of total consensus weights.
Useful numbers of hours might be 1, 2, 3, 4, 5, 6, 12, 24, 36, 48, 72, 96, 120, 144, and 168.
As a special case, if no consensus was published exactly 72 hours ago, we'd look at the one 72.5 or 73 hours ago, and so on.
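A sketch of how such a CSV could be produced, assuming stem's consensus parsing and that consensus weights are the values stem exposes as `bandwidth`; the file names and column names are placeholders:

```python
import csv
from stem.descriptor import parse_file

def load_consensus(path):
    # Map fingerprint -> consensus weight for one consensus file.
    return dict((r.fingerprint, r.bandwidth) for r in
                parse_file(path, 'network-status-consensus-3 1.0'))

def compare(old_path, new_path):
    old, new = load_consensus(old_path), load_consensus(new_path)
    common = set(old) & set(new)
    frac_relays = float(len(common)) / len(new)
    frac_cw = float(sum(new[f] for f in common)) / sum(new.values())
    return frac_relays, frac_cw

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['hours', 'frac_relays', 'frac_cw'])
    for hours in (1, 2, 3, 4, 5, 6, 12, 24, 36, 48, 72, 96, 120, 144, 168):
        # Placeholder file naming; real consensuses are archived by date/hour.
        writer.writerow((hours,) + compare('consensus-%dh-ago' % hours,
                                           'consensus-now'))
```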
Having data for the past 12 months would be good. I could help with graphing results.
Changing component to Analysis, because this task is about a one-time analysis, not a metrics utility that we'd want to write and then maintain.
The data and graphs are from a first-pass analysis. s2012.csv should be a complete view of 2012 and was generated using data from 2012 and December 2011.
Definitions
Let Y be the consensus listed (now) and X the consensus some hours ago (now - hours).
frac_relay is the number of routers in Y that are also in X based on fingerprint, divided by the total number of routers in Y.
frac_cw is the ratio of the bandwidth sum in X over routers present in both X and Y to the bandwidth sum in Y over routers present in both X and Y.
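In code terms, with routers_x and routers_y as hypothetical fingerprint-to-weight dicts and common as their shared fingerprints, this first-pass frac_cw is `sum(routers_x[f] for f in common) / float(sum(routers_y[f] for f in common))`: the numerator uses X's weights while the denominator uses Y's, so the ratio is not bounded by 1. That is one way to read the odd values in the Notes below.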
Notes
Some hourly consensus documents were missing, and comparisons involving those documents were ignored (no special handling).
A router that is present in both Y and X could still be missing from a consensus document between now - hours and now, but this situation is currently ignored due to processing time.
Some ratios in the frac_cw graphs are odd. A quick look at the data shows:
Each row is prefixed with (sum of cw for routers in both X and Y, taken from X; the same sum, taken from Y):

    (cw in X, cw in Y)   consensus,hours,frac_relays,frac_cw,month,day,day_of_week
    (9819752, 7841626)   2012-04-17-13-00-00-consensus,1,0.969956,1.252260,4,17,2
    (9764530, 2363373)   2012-04-17-13-00-00-consensus,2,0.955540,4.131608,4,17,2
    (9398785, 2323009)   2012-04-17-13-00-00-consensus,3,0.952431,4.045953,4,17,2
    (9206152, 8152643)   2012-04-17-13-00-00-consensus,4,0.946181,1.129223,4,17,2
    (9519503, 9373338)   2012-04-17-13-00-00-consensus,5,0.933105,1.015594,4,17,2
    (9727357, 9408589)   2012-04-17-13-00-00-consensus,6,0.934238,1.033881,4,17,2
    (9375476, 7251736)   2012-04-17-13-00-00-consensus,12,0.897784,1.292860,4,17,2
    (9758498, 7935133)   2012-04-17-13-00-00-consensus,24,0.896715,1.229784,4,17,2
    (9674191, 6843889)   2012-04-17-13-00-00-consensus,36,0.872363,1.413552,4,17,2
    (9141303, 8071610)   2012-04-17-13-00-00-consensus,48,0.848475,1.132525,4,17,2
    (9591979, 8984097)   2012-04-17-13-00-00-consensus,72,0.839209,1.067662,4,17,2
    (9586237, 8260454)   2012-04-17-13-00-00-consensus,96,0.849177,1.160498,4,17,2
    (9061210, 6923951)   2012-04-17-13-00-00-consensus,120,0.836865,1.308676,4,17,2
    (9306138, 8460224)   2012-04-17-13-00-00-consensus,144,0.821573,1.099987,4,17,2
    (9184215, 8564116)   2012-04-17-13-00-00-consensus,168,0.812981,1.072407,4,17,2
Questions
What can be done to parse and operate on the consensus data more quickly? s2012.csv took around 12 hours to generate with pypy first_pass.py and limited memory.
Any insight into the missing hourly consensus documents?
Anyone want to try working with the first pass data?
> What can be done to parse and operate on the consensus data more quickly? s2012.csv took around 12 hours to generate with pypy first_pass.py and limited memory.
At first glance:
You seem to be reading and parsing the same consensus file multiple times. Loading all (or most) of the consensuses into memory first would speed it up.
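For example, a memoizing loader along these lines (a sketch assuming stem's parse_file, not the actual first_pass.py code) would parse each file only once:

```python
from stem.descriptor import parse_file

_cache = {}

def load_consensus(path):
    # Parse each consensus file at most once; later comparisons reuse
    # the fingerprint -> bandwidth mapping from memory.
    if path not in _cache:
        _cache[path] = dict((r.fingerprint, r.bandwidth) for r in
                            parse_file(path, 'network-status-consensus-3 1.0'))
    return _cache[path]
```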
Also, this:

    if router.fingerprint in base_routers:
        router_overlap.append(router.fingerprint)
        current_router_overlap_count += 1
        current_router_overlap_bandwidth += router.bandwidth
    for fingerprint in router_overlap:
        base_router_overlap_bandwidth += base_routers[fingerprint]
can be changed to
    if router.fingerprint in base_routers:
        router_overlap.append(router.fingerprint)
        base_router_overlap_bandwidth += base_routers[router.fingerprint]
        current_router_overlap_count += 1
        current_router_overlap_bandwidth += router.bandwidth
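This folds the base-weight accumulation into the main pass, so the separate loop over router_overlap goes away and each matching fingerprint is handled exactly once.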
2012_frac_relays.png looks correct. I like how the fraction goes down at 12 and 36 hours and back up at 24 and 48 hours; it makes sense: there are quite a few relays on a 24-hour cycle, and even if they're offline 12 hours later, they can be back online 24 hours later.
2012_frac_cw.png looks wrong though. The definition should be: "frac_cw is the sum of consensus weights of routers in Y that are also in X based on fingerprint, divided by the total sum of consensus weights of all routers in Y." The result should be a graph similar to 2012_frac_relays.png, that is, distributions in [0..1], with fractions dropping more slowly, based on the assumption that most of the fast relays run 24/7.
Do you mind if we add your code, possibly after making the changes that gsathya suggests, to metrics-tasks.git? Do you want to clone that repository, commit your code, and let me know from where to pull? It would be good to have all analysis code in a single repository for others who want to run similar analyses.
To answer your question about missing hourly consensuses: it happens from time to time that the 9 directory authorities fail to generate a consensus, and it's also possible that the descriptor-fetching service fails temporarily. Your approach, to ignore cases when a consensus is missing, is perfectly reasonable.
Also, the decision not to look at routers that are missing at some point strictly between now - hours and now is perfectly fine. That would be a different research question, so please leave this as is.
Thanks for the comments and suggestions; I will look into a Git repo.
first_pass.py has been updated with gsathya's suggestion and has been modified to load and store fingerprints and bandwidth values.
Total processing time has been reduced to about one hour. Loading four months of data takes around 1.5 GB of RAM. More time is spent loading the consensus files than iterating over the entries.
Revised definitions:
Let Y be the consensus listed (now) and X the consensus some hours ago (now - hours).
Let intersection(X,Y) be the routers in both X and Y based on fingerprint.
frac_relay is count(intersection(X,Y))/count(Y).
frac_cw is the sum of consensus weights in Y over intersection(X,Y) divided by the sum of consensus weights in Y.
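In code, assuming routers_x and routers_y are fingerprint-to-consensus-weight dicts for X and Y (a sketch of the revised definitions, not the literal first_pass.py code):

```python
def fractions(routers_x, routers_y):
    # routers_x, routers_y: {fingerprint: consensus_weight} for X and Y.
    common = set(routers_x) & set(routers_y)          # intersection(X, Y)
    frac_relay = float(len(common)) / len(routers_y)  # count(int.)/count(Y)
    frac_cw = (float(sum(routers_y[f] for f in common))
               / sum(routers_y.values()))             # weights from Y only
    return frac_relay, frac_cw
```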
Trac: Username: peer; Status: needs_revision to needs_review
Looks great, both code and results. Nice work! Please let us know if you need help with the Git repo. And if you need help finding another fun analysis task to work on, let me know, too! Thanks!
Looks good, but can you put a real email address in the commit (git config --global user.email "your@address.com" && git commit --amend --reset-author), and can you include the R code, too? Thanks!
asn, does that answer your question? Can we close this ticket then?
I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
> asn, does that answer your question? Can we close this ticket then?
It does. That was a great analysis!
> I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
Hm, would it make sense to add such graphs somewhere in:
https://metrics.torproject.org/graphs.html
maybe in the 'Network' page or in a new 'Miscellaneous' page, and auto-generate them every so often? That would make the graphs more easily findable by people that might be interested in them.
> > I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
>
> Hm, would it make sense to add such graphs somewhere in:
>
> https://metrics.torproject.org/graphs.html
>
> maybe in the 'Network' page or in a new 'Miscellaneous' page, and auto-generate them every so often? That would make the graphs more easily findable by people that might be interested in them.
Unfortunately not. There's a big difference between writing a command-line tool that processes a bunch of data and draws some graphs and extending a database to continually pre-process those data to allow users to draw customized graphs from them.
Thanks for the suggestion. Documenting the analysis results at doc/AnalysisTicketResults, starting with metrics-tasks, would be a good way to understand the current research questions.