I'm still not sure if this is a good metric for #2681, but Sathya asked me about this idea in a metrics ticket some days ago:
Just brainstorming here, but I wonder if some kind of metric on how quickly the Tor network changes would help us decide if 3 days is a better interval than 5 days.
By "how quickly the Tor network changes", I mean that if you take a consensus X from 3 days ago and a consensus Y from today, what's the percentage of routers in Y that are also in X (based on identity key)?
Such a metric could be a set of probability distributions that describe how likely it is for the Tor network to change by a specific amount in X days.
So, for example, the probability distributions would tell us things like "Based on previous data, the Tor network has a 40% chance of changing by 20% in five days," or "The Tor network has an 80% chance of changing by less than 5% in one day," or "The Tor network has a 40% chance of changing by 35% in two months."
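A minimal sketch of that comparison, assuming two consensus files on disk and stem's consensus parsing (the file names here are placeholders):

```python
from stem.descriptor import parse_file

def fingerprints(path):
    # Identity fingerprints of all routers listed in one consensus file.
    return set(router.fingerprint for router in
               parse_file(path, 'network-status-consensus-3 1.0'))

x = fingerprints('consensus-3-days-ago')  # placeholder file names
y = fingerprints('consensus-today')
print('%.1f%% of routers in Y are also in X' % (100.0 * len(x & y) / len(y)))
```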
Sounds like a fun analysis for someone who wants to play with stem's new consensus-parsing module.
In addition to the metric you suggest, which is based on absolute relay numbers, I'd probably add another one: "If you take a consensus X from 3 days ago and a consensus Y from today, what's the fraction of total consensus weights that is available for building circuits?"
The output could be a CSV file like this (example data):
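For instance (an illustrative sketch only; the column names and timestamp are placeholders, not real measurements):

```
valid-after,hours,frac_relays,frac_cw
2012-01-01 00:00:00,72,0.554,0.768
```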
This would mean that, when using a 72-hour-old consensus at the stated date and time, you'd know 55.4% of relays from the recent consensus, which together have 76.8% of total consensus weights.
Useful numbers of hours might be 1, 2, 3, 4, 5, 6, 12, 24, 36, 48, 72, 96, 120, 144, and 168.
As a special case, if no consensus was published exactly 72 hours ago, we'd look at the one 72.5 or 73 hours ago, and so on.
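A sketch of how such a CSV could be produced, assuming stem's consensus parsing and that consensus weights are the values stem exposes as `bandwidth`; the file names and column names are placeholders:

```python
import csv
from stem.descriptor import parse_file

def load_consensus(path):
    # Map fingerprint -> consensus weight for one consensus file.
    return dict((r.fingerprint, r.bandwidth) for r in
                parse_file(path, 'network-status-consensus-3 1.0'))

def compare(old_path, new_path):
    old, new = load_consensus(old_path), load_consensus(new_path)
    common = set(old) & set(new)
    frac_relays = float(len(common)) / len(new)
    frac_cw = float(sum(new[f] for f in common)) / sum(new.values())
    return frac_relays, frac_cw

with open('out.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['hours', 'frac_relays', 'frac_cw'])
    for hours in (1, 2, 3, 4, 5, 6, 12, 24, 36, 48, 72, 96, 120, 144, 168):
        # Placeholder file naming; real consensuses are archived by date/hour.
        writer.writerow((hours,) + compare('consensus-%dh-ago' % hours,
                                           'consensus-now'))
```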
Having data for the past 12 months would be good. I could help with graphing results.
Changing component to Analysis, because this task is about a one-time analysis, not a metrics utility that we'd want to write and then maintain.
The data and graphs are from a first-pass analysis. s2012.csv should be a complete view of 2012 and was generated using data from 2012 and December 2011.
Definitions
Let Y be the consensus listed (now) and X the consensus some hours ago (now - hours).
frac_relay is the number of routers in Y that are also in X based on fingerprint, divided by the total number of routers in Y.
frac_cw is the ratio of the bandwidth sum in X over routers present in both X and Y to the bandwidth sum in Y over routers present in both X and Y.
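In code terms, with routers_x and routers_y as hypothetical fingerprint-to-weight dicts and common as their shared fingerprints, this first-pass frac_cw is `sum(routers_x[f] for f in common) / float(sum(routers_y[f] for f in common))`: the numerator uses X's weights while the denominator uses Y's, so the ratio is not bounded by 1. That is one way to read the odd values in the Notes below.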
Notes
Some hourly consensus documents were missing, and comparisons involving those documents were ignored (no special handling).
A router that is present in both Y and X could still be missing from a consensus document between now - hours and now, but this situation is currently ignored due to processing time.
Some ratios in the frac_cw graphs are odd. A quick look at the data shows:
Each row is prefixed with (sum of cw for routers in both X and Y, taken from X; the same sum, taken from Y):

    (cw in X, cw in Y)   consensus,hours,frac_relays,frac_cw,month,day,day_of_week
    (9819752, 7841626)   2012-04-17-13-00-00-consensus,1,0.969956,1.252260,4,17,2
    (9764530, 2363373)   2012-04-17-13-00-00-consensus,2,0.955540,4.131608,4,17,2
    (9398785, 2323009)   2012-04-17-13-00-00-consensus,3,0.952431,4.045953,4,17,2
    (9206152, 8152643)   2012-04-17-13-00-00-consensus,4,0.946181,1.129223,4,17,2
    (9519503, 9373338)   2012-04-17-13-00-00-consensus,5,0.933105,1.015594,4,17,2
    (9727357, 9408589)   2012-04-17-13-00-00-consensus,6,0.934238,1.033881,4,17,2
    (9375476, 7251736)   2012-04-17-13-00-00-consensus,12,0.897784,1.292860,4,17,2
    (9758498, 7935133)   2012-04-17-13-00-00-consensus,24,0.896715,1.229784,4,17,2
    (9674191, 6843889)   2012-04-17-13-00-00-consensus,36,0.872363,1.413552,4,17,2
    (9141303, 8071610)   2012-04-17-13-00-00-consensus,48,0.848475,1.132525,4,17,2
    (9591979, 8984097)   2012-04-17-13-00-00-consensus,72,0.839209,1.067662,4,17,2
    (9586237, 8260454)   2012-04-17-13-00-00-consensus,96,0.849177,1.160498,4,17,2
    (9061210, 6923951)   2012-04-17-13-00-00-consensus,120,0.836865,1.308676,4,17,2
    (9306138, 8460224)   2012-04-17-13-00-00-consensus,144,0.821573,1.099987,4,17,2
    (9184215, 8564116)   2012-04-17-13-00-00-consensus,168,0.812981,1.072407,4,17,2
Questions
What can be done to parse and operate on the consensus data more quickly? s2012.csv took around 12 hours to generate with pypy first_pass.py and limited memory.
Any insight into the missing hourly consensus documents?
Anyone want to try working with the first pass data?
> What can be done to parse and operate on the consensus data more quickly? s2012.csv took around 12 hours to generate with pypy first_pass.py and limited memory.
At first glance:
You seem to be reading and parsing the same consensus file multiple times. Loading all (or most) of the consensuses into memory first would speed it up.
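For example, a memoizing loader along these lines (a sketch assuming stem's parse_file, not the actual first_pass.py code) would parse each file only once:

```python
from stem.descriptor import parse_file

_cache = {}

def load_consensus(path):
    # Parse each consensus file at most once; later comparisons reuse
    # the fingerprint -> bandwidth mapping from memory.
    if path not in _cache:
        _cache[path] = dict((r.fingerprint, r.bandwidth) for r in
                            parse_file(path, 'network-status-consensus-3 1.0'))
    return _cache[path]
```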
Also, this:

    if router.fingerprint in base_routers:
        router_overlap.append(router.fingerprint)
        current_router_overlap_count += 1
        current_router_overlap_bandwidth += router.bandwidth
    for fingerprint in router_overlap:
        base_router_overlap_bandwidth += base_routers[fingerprint]
can be changed to
    if router.fingerprint in base_routers:
        router_overlap.append(router.fingerprint)
        base_router_overlap_bandwidth += base_routers[router.fingerprint]
        current_router_overlap_count += 1
        current_router_overlap_bandwidth += router.bandwidth
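This folds the base-weight accumulation into the main pass, so the separate loop over router_overlap goes away and each matching fingerprint is handled exactly once.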
2012_frac_relays.png looks correct. I like how the fraction goes down at 12 and 36 hours and back up at 24 and 48 hours; it makes sense: there are quite a few relays on a 24-hour cycle, and even if they're offline 12 hours later, they can be back online 24 hours later.
2012_frac_cw.png looks wrong though. The definition should be: "frac_cw is the sum of consensus weights of routers in Y that are also in X based on fingerprint, divided by the total sum of consensus weights of all routers in Y." The result should be a graph similar to 2012_frac_relays.png, that is, distributions in [0..1], with fractions dropping more slowly, based on the assumption that most of the fast relays run 24/7.
Do you mind if we add your code, possibly after making the changes that gsathya suggests, to metrics-tasks.git? Do you want to clone that repository, commit your code, and let me know from where to pull? It would be good to have all analysis code in a single repository for others who want to run similar analyses.
To answer your question about missing hourly consensuses: it happens from time to time that the 9 directory authorities fail to generate a consensus, and it's also possible that the descriptor-fetching service fails temporarily. Your approach, to ignore cases when a consensus is missing, is perfectly reasonable.
Also, the decision not to look at routers that are missing at some point strictly between now - hours and now is perfectly fine. That would be a different research question, so please leave this as is.
Thanks for the comments and suggestions; I will look into a Git repo.
first_pass.py has been updated with gsathya's suggestion and has been modified to load and store fingerprints and bandwidth values.
Total processing time has been reduced to about one hour. Loading four months of data takes around 1.5 GB of RAM. More time is spent loading the consensus files than iterating over the entries.
Revised definitions:
Let Y be the consensus listed (now) and X the consensus some hours ago (now - hours).
Let intersection(X,Y) be the routers in both X and Y based on fingerprint.
frac_relay is count(intersection(X,Y))/count(Y).
frac_cw is the sum of consensus weights in Y over intersection(X,Y) divided by the sum of consensus weights in Y.
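In code, assuming routers_x and routers_y are fingerprint-to-consensus-weight dicts for X and Y (a sketch of the revised definitions, not the literal first_pass.py code):

```python
def fractions(routers_x, routers_y):
    # routers_x, routers_y: {fingerprint: consensus_weight} for X and Y.
    common = set(routers_x) & set(routers_y)          # intersection(X, Y)
    frac_relay = float(len(common)) / len(routers_y)  # count(int.)/count(Y)
    frac_cw = (float(sum(routers_y[f] for f in common))
               / sum(routers_y.values()))             # weights from Y only
    return frac_relay, frac_cw
```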
Trac: Username: peer; Status: needs_revision to needs_review
Looks great, both code and results. Nice work! Please let us know if you need help with the Git repo. And if you need help finding another fun analysis task to work on, let me know, too! Thanks!
Looks good, but can you put a real email address in the commit (git config --global user.email "your@address.com" && git commit --amend --reset-author), and can you include the R code, too? Thanks!
asn, does that answer your question? Can we close this ticket then?
I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
> asn, does that answer your question? Can we close this ticket then?
It does. That was a great analysis!
> I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
Hm, would it make sense to add such graphs somewhere in:
https://metrics.torproject.org/graphs.html
maybe in the 'Network' page or in a new 'Miscellaneous' page, and auto-generate them every so often? That would make the graphs more easily findable by people that might be interested in them.
> > I wonder what we should do with these results, or with results from Analysis tickets in general. It would be sad if we closed this ticket and forgot that we ever worked on it. Would it make sense to set up a wiki page (maybe doc/AnalysisTicketResults) for Analysis ticket results that has short descriptions of what we analyzed and links to the relevant files? peer, is this maybe something you'd want to look into?
>
> Hm, would it make sense to add such graphs somewhere in:
>
> https://metrics.torproject.org/graphs.html
>
> maybe in the 'Network' page or in a new 'Miscellaneous' page, and auto-generate them every so often? That would make the graphs more easily findable by people that might be interested in them.
Unfortunately not. There's a big difference between writing a command-line tool that processes a bunch of data and draws some graphs and extending a database to continually pre-process those data to allow users to draw customized graphs from them.
Thanks for the suggestion. Documenting the analysis results at doc/AnalysisTicketResults, starting with metrics-tasks, would be a good way to understand the current research questions.