I'd like to know what fraction of total Tor usage is hidden service usage, so we have a sense of whether hidden services matter now, and so we can track trends into the future.
For example, it would have been nice in August 2013 to have some metric of hidden service fraction that told us the spike in load and users had to do with hidden services.
Such statistics would also be useful to counter (or who knows, confirm) the analysts who say statements like "97% of Tor use is silk road".
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items ...
Show closed items
Linked items 0
Link issues together to show that they're related.
Learn more.
My hs-stats branch implements this. I made it print out aggregate stats every thirty minutes, of the form
Sep 23 22:30:00.851 [notice] Heartbeat: 0|0 exit / 9|229 hidden service / 504|22963 middle circuits / 0|1000 other circuits handled.
That's 9 circuits that were hidden service related (and 229 cells on those), and 504 circuits that we saw an extend cell on (and 22963 cells on those).
The first graph shows all cell types in a bar chart. While this sounds great for comparing cells by type, this graph works really poorly for the tiny fraction of hidden-service cells in the network (yes, there's a green line near the x axis, at least for bolobolo1 and wannabe). Maybe it's useful for comparing middle vs. exit.
The second graph shows the fraction of hidden-service cells as compared to all cells. moria5 might be somewhat misleading here, because moria5 handles a tiny number of cells compared to the other two relays. The network-wide average is probably much closer to bolobolo1's or wannabe's fraction than moria5's fraction.
I also attached the R code which comes with easy-to-follow instructions.
asn, I just pushed more commits to my task-13192 branch and ran some tests in Chutney. Please take a look if you like my changes.
The next step would be to rewrite history. So far, only you and I have read this code. But the goal would be to have many more people read it. What we need are a few carefully crafted commits that are trivial to review even for people who are not C programmers. I have the following commit series in mind:
Your constify crypto_pk_get_digest patch.
My obfuscation functions including unit tests.
The two hidden-service statistics including everything from config option to writing stats to disk.
Your log statements and any remaining XXX, including the commented-out code that would include stats in extra-info descriptors. This commit shall not be given out to relay operators who we ask to help with gathering statistics, because they shouldn't accidentally run it. It's just for our testing and further development.
I can change history as described. Just let me know if this sounds reasonable to you.
I need to think more about the overflow protection in round_long_to_next_multiple_of(). I'm not sure what happens if there is actually an overflow there and that line is skipped.
Where can I find the formula you are using in transform_uniform_random_to_laplace()? Also, maybe we could rename that function to sample_laplace_distribution() or something.
When you do digestmap_free(hs_stats->onions_seen_this_period, NULL) do the elements of the digestmap get freed? I think you need to pass the tor_free_() func as the second argument?
This value, multiplied with EPSILON, is Laplace parameter b. */ I think this is divided not multiplied.
An interesting consequence of rounding up negative numbers is that a result of 0 doesn't mean that the underlying value was also 0. In our case it can be anything from -7 to 0. This is not something bad but something to keep in mind.
BTW, wrt to this unittest tt_assert(round_long_to_next_multiple_of(LONG_MIN,2) == LONG_MIN). It is the case in all platforms that LONG_MIN is even, right?
Yes, I think we wrote some good code there together.
Small review:
I need to think more about the overflow protection in round_long_to_next_multiple_of(). I'm not sure what happens if there is actually an overflow there and that line is skipped.
Yes, please do! I spent quite some time with paper and pencil here to go through the possible edge cases. And then I wrote unit tests for all of them that came to mind. But it's not at all unrealistic that I missed an important case.
Where can I find the formula you are using in transform_uniform_random_to_laplace()? Also, maybe we could rename that function to sample_laplace_distribution() or something.
From Wikipedia! :) I'll include a comment. I'm also renaming the function as suggested.
When you do digestmap_free(hs_stats->onions_seen_this_period, NULL) do the elements of the digestmap get freed? I think you need to pass the tor_free_() func as the second argument?
I think freeing elements would actually be harmful, because we're just storing void pointers ((void*)(uintptr_t)1) in the map. We cannot dereference those pointers and free whatever we find at that address. Should we ask a C programmer to get this confirmed?
This value, multiplied with EPSILON, is Laplace parameter b. */ I think this is divided not multiplied.
Err, yes. You mentioned that on IRC and totally wanted to change it. Changed now.
An interesting consequence of rounding up negative numbers is that a result of 0 doesn't mean that the underlying value was also 0. In our case it can be anything from -7 to 0. This is not something bad but something to keep in mind.
Agreed that it's potentially confusing, but it's correct, AFAICT.
BTW, wrt to this unittest tt_assert(round_long_to_next_multiple_of(LONG_MIN,2) == LONG_MIN). It is the case in all platforms that LONG_MIN is even, right?
Interesting question. I'm pretty sure. But even if not, the worst thing that can happen is that our unit tests break.
I made two more changes: I documented the new config option in the man page as discussed in #tor-dev, and I changed the default config value to 0 for two reasons: we should avoid that somebody accidentally turns on these statistics when running our branch; and we must avoid that we accidentally merge the wrong default value into master.
Branch task-13192 contains the new changes as additional commits, and branch task-13192-2 is heavily rebased following the plan described above.
Yes, I think we wrote some good code there together.
Small review:
When you do digestmap_free(hs_stats->onions_seen_this_period, NULL) do the elements of the digestmap get freed? I think you need to pass the tor_free_() func as the second argument?
I think freeing elements would actually be harmful, because we're just storing void pointers ((void*)(uintptr_t)1) in the map. We cannot dereference those pointers and free whatever we find at that address. Should we ask a C programmer to get this confirmed?
I'm sorry, I meant the key not the value. They key in our case is memdupped but I think it shouldn't be.
BTW, wrt to this unittest tt_assert(round_long_to_next_multiple_of(LONG_MIN,2) == LONG_MIN). It is the case in all platforms that LONG_MIN is even, right?
Interesting question. I'm pretty sure. But even if not, the worst thing that can happen is that our unit tests break.
I made two more changes: I documented the new config option in the man page as discussed in #tor-dev, and I changed the default config value to 0 for two reasons: we should avoid that somebody accidentally turns on these statistics when running our branch; and we must avoid that we accidentally merge the wrong default value into master.
Branch task-13192 contains the new changes as additional commits, and branch task-13192-2 is heavily rebased following the plan described above.
task-13192-2 looks OK to me. I don't see any new commits to the other branch.
FWIW, we should soon (in a week or so) give this code to people to run it in their relays. At that point, the commented code of extrainfo_dump_to_string() should be included and enabled so that stats actually get to us. What do we need to do to enable that piece of code? Revise the proposal and post it to [tor-dev]?
I'm sorry, I meant the key not the value. They key in our case is memdupped but I think it shouldn't be.
Oh, you're right. Fixed in new fixup commit on branch task-13192-2, I think.
task-13192-2 looks OK to me. I don't see any new commits to the other branch.
Looks like I forgot to push task-13192 earlier. Pushed now.
FWIW, we should soon (in a week or so) give this code to people to run it in their relays. At that point, the commented code of extrainfo_dump_to_string() should be included and enabled so that stats actually get to us. What do we need to do to enable that piece of code? Revise the proposal and post it to [tor-dev]?
Yes, let's revise the proposal and post the code to tor-dev@, possibly even with the commented-out code in it. Then let's give people a week, or at least a couple of days, to review everything. After that, we could publish the code with the working extra-info code and ask people to run it. Of course, there's always the possibility that people will have feedback that we need to incorporate, or that will force us to change our plans.
add_laplace_noise() uses sample_laplace_distribution() with a "long" value which might be very problematic if a double is expected. You basically loose the float part. If this is the intended goal, I would put a comment explaining why since "long" and "double" are quite different.
NAN is a GNU extension which might be problematic without a fallback on BSD or/and Windows. I'm not to familiar with that macro outside Linux.
+round_long_to_next_multiple_of(): probably a typo but "unsigned divisor", is a "long" or "int" is missing here? It's allowed but I really think that types should be always specified, just for the sake of the reviewers :).
round_long_to_next_multiple_of() will "Floating point exception" if "divisor" is 0. Modulo with 0 is not really a good thing :). In this case, I was able to make it coredump with number == LONG_MIN and divisor set to 0.
commit 6c3263b5b56e0b6e953226af240f2f607a2fa415
I see that there are multiple places where "hs_stats" is checked and if NULL, it's allocated. Shouldn't we call "rep_hist_hs_stats_init()" instead? Seems to me that having the start interval time set is important when initializing the digestmap.
Hm, also maybe the epsilon should not be in the hidserv-stats-end line and should be in both of the stat lines. There might be a good reason for having different epsilons between different stats (for example, because we might care more about HSDir stat obfuscation than HS cell obfuscation))
add_laplace_noise() uses sample_laplace_distribution() with a "long" value which might be very problematic if a double is expected. You basically loose the float part. If this is the intended goal, I would put a comment explaining why since "long" and "double" are quite different.
Yes, it's intended. Added a short comment.
NAN is a GNU extension which might be problematic without a fallback on BSD or/and Windows. I'm not to familiar with that macro outside Linux.
As discussed in #tor-dev, let's just do tor_assert() instead of returning NAN.
+round_long_to_next_multiple_of(): probably a typo but "unsigned divisor", is a "long" or "int" is missing here? It's allowed but I really think that types should be always specified, just for the sake of the reviewers :).
In fact, it's probably even better to use the same type for number and divisor. Changed to long.
round_long_to_next_multiple_of() will "Floating point exception" if "divisor" is 0. Modulo with 0 is not really a good thing :). In this case, I was able to make it coredump with number == LONG_MIN and divisor set to 0.
Added another tor_assert().
commit 6c3263b5b56e0b6e953226af240f2f607a2fa415
I see that there are multiple places where "hs_stats" is checked and if NULL, it's allocated. Shouldn't we call "rep_hist_hs_stats_init()" instead? Seems to me that having the start interval time set is important when initializing the digestmap.
Indeed. But we can't initialize stats there, because we don't have a timestamp. The better idea is probably to ignore observations when we're not collecting stats. Changed.
rep_hist_hs_stats_init(): double space:
{{{
if (!hs_stats) {
}}}
Fixed.
The rest seems fine to me for now.
Great! Thanks for the review, and feel free to review the new changes. Pushing new commits to task-13192-2 after replying to asn below.
Hm, also maybe the epsilon should not be in the hidserv-stats-end line and should be in both of the stat lines. There might be a good reason for having different epsilons between different stats (for example, because we might care more about HSDir stat obfuscation than HS cell obfuscation))
As discussed in #tor-dev, I changed this.
There, updated task-13192-2 contains all the changes.
Quick comment also, I was worried about the hs_stats->rp_relay_cells_seen overflow possibility. I did an easy calculation to see how much data needs to go through the RP for that:
This is an unsigned long where on x86 64 bits it's 64 bit thus humongeous. However, on 32 bit it goes up to 2^32 and with each cell at 512 bytes.
((2^32) * 512) = 2199023255552 bytes of total data before overflow.(((2199023255552/1024)/1024)/1024) = 2048 Gigabytes ^K ^M ^G
Thus, one would have to send through ~2TB of data to make that stat goes round and around (on 32 bit system) :).
Code looks good to me! There are still log statement though that I'm not sure we want upstream like below but I'll wait for the big squash :).
+ log_warn(LD_GENERAL, "Saw new RP cell in %u.",+ TO_OR_CIRCUIT(circ)->p_circ_id);
Regarding overflow, I think we'd be in trouble after 1 TiB, because we're casting the unsigned long into signed long for adding noise. That's 1 TiB/day or 12.7 MB/s on average. We could switch to int64_t. It's just 32 more bits per relay. And we'll need a new set of obfuscation functions. Should we do that?
The log statement you refer to is in the "CLASSIFIED" commit that's only there for debugging purposes and that should go away before we give out the branch to relay operators. Unless you found another log statement in the other commits?
Regarding overflow, I think we'd be in trouble after 1 TiB, because we're casting the unsigned long into signed long for adding noise. That's 1 TiB/day or 12.7 MB/s on average. We could switch to int64_t. It's just 32 more bits per relay. And we'll need a new set of obfuscation functions. Should we do that?
I would definitely go for uint64_t/int64_t here, it's pretty cheap memory wise and can prevent issues that we think are improbable now but are the next day :).
The log statement you refer to is in the "CLASSIFIED" commit that's only there for debugging purposes and that should go away before we give out the branch to relay operators. Unless you found another log statement in the other commits?
Sorry for that, I'll know next time that CLASSIFIED means will disapear :)
Even my Oracle NetBeans IDE - not an especially privacy-focused piece of software - asks the user beforehand whether they would like to opt in to statistics.
Changed to int64_t, switched obfuscation methods, and applied them in two separate calls. task-13192-3 contains squash-commits, task-13192-4 is a clean rebase.
For testing purposes, yes please. But the actual measurements won't start in the next 7 to 10 days, and you'll probably have to switch to another branch then.
I enabled the part where we're including hidden-service statistics in extra-info descriptors. task-13192-4 contains additional commits, task-13192-5 is a rebased version of task-13192-4.
I also removed the log messages, because we don't want people to run them without knowing. I'll attach the patch to this ticket in a minute.
I enabled the part where we're including hidden-service statistics in extra-info descriptors. task-13192-4 contains additional commits, task-13192-5 is a rebased version of task-13192-4.
I also removed the log messages, because we don't want people to run them without knowing. I'll attach the patch to this ticket in a minute.
I like the code. I wonder if the additive noise should be applied to the right side of the bin, or to the center of the bin. Or if it matters at all.
So for example if binsize is 8. And the binned value is 16. Should the additive noise be applied to 16, or to 12? I'm asking because 12 (the center) might be a more fair value than 16 (which is the max).
My intuition is that it shouldn't matter whether we add noise to the right side of a bin (which is at the same time the left side of the next bin) or the center of a bin. In either case there's exactly one predefined value in each bin that we add noise to. The adversary knows that, and given any obfuscated value she can say how likely it is that we started at a given bin, regardless of the definition we picked before. The distance between bins stays exactly the same, so this doesn't become harder or easier if we start at bin centers. I'd say let's not change this part in the code and keep using the right side of the bin.
Please see my branch karsten-task-13192-5 in my repository for the respective fixes and additions. Feel free to tweak changes file as you see appropriate.
Indeed, thanks for the review, nickm! And thanks for the almost instant fixes, asn! I merged asn's changes into my task-13192-5 branch with one trivial change: I turned the changes-file-and-man-page commit into a --squash commit rather than a commit of its own to avoid back-and-forth in Git history.
ok, but long long vs int64 is a gcc vs standards issue. INT64_MIN is the right thig to look at, and we need to make sure it's defined. I've merged that, and another patch.