See #6537 (moved) for discussion. There's a branch in the middle of our node selection algorithm, which is bad for time-invariance. Furthermore, a sufficiently clever compiler might decide to reinsert the break statement that 6537 so carefully removed. (I have not yet found a compiler this clever, but better safe than sorry!)
We can probably do better by using a bit-twiddling approach to setting i_chosen. For example, rransom wrote:
I think I'd argue this can wait until 0.2.4. (I might have argued #6537 (moved) could wait until 0.2.4 too.)
Trac: Summary: Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invairant to Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant
I'm not sure I'm comfortable with that. Honestly, you may have to talk me out of doing this in 0.2.2.
That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.
That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.
That would work, although you'll still need to switch to using integer math (rather than operating on the floating-point bandwidth values directly).
Okay. Branch "bug6538_023" in my public repository has the minimal fix for branch 0.2.3. Branch "bug6538" has additional code to refactor those functions a little more and extract the common logic into a testable function; it should be considered only for 0.2.4.
Oops. Hadn't actually pushed "bug6538_023" to my public repo. Done. (If you started looking at "bug6538", please don't worry -- it contains all of "bug6538_023" and a little more.)
arma: I am still liable to merge bug6538_023 into maint-0.2.3 if nobody stops me.
log_warn(LD_BUG, "Round-off error in computing bandwidth had an effect on " " which router we chose. Please tell the developers. "
is more likely to trigger now, since we make it all the way through the loop each time -- even though in nearly none of these cases does it actually have an effect on which router we chose? I recognize this is a potential problem with the original fix, not this one (unless I'm wrong).
"incancations"
More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.
More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.
The risk is that somebody on your LAN (or a local eavesdropper, or your entry node), can figure out about what your path is. Conservative wisdom is that a fairly local attacker can detect 10-100ns timing leaks, and a remote one can detect 1-10ms timing leaks. I'm especially worried about slow mobile devices here.
Instead of calling the risk "low", do you mean "unproven" or something?
Instead of calling the risk "low", do you mean "unproven" or something?
I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).
The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?
rransom points out that he has some integer scaling code that's better than what I wrote:
13:48 < rransom> Was there a reason to use llround (and add autocrap goo to check for it) instead of my approach to int64_t-izing the bandwidths?13:50 < nickm> Argh. Context-switching. I had thought that my approach was to int64_t-ize: that is, to move the bandwidth array to uint64_t earlier in the computation and get "double" out of the picture entirely by the time that we did the "choose one randomly by weight" computation.13:51 < nickm> The reason I want the computation to be (array of uint64_t, length, entropy) -> index is that I trust myself to reasoning about this stuff much better when all values are integers.13:51 < nickm> *to reason13:52 < rransom> You used llround on (approximately) the values that would have been in the array of doubles before. I rescaled them to a total bandwidth around 2^61 and justcasted to int64_t .13:54 < nickm> A decent idea; maybe we should do that too. Why 2^61?13:55 < rransom> Because it's big, but not close enough to 2^63 that I would have to think carefully about whether the roundoff errors could overflow and really break things.14:01 < nickm> mind if I C&P the above to the bug report?14:01 < rransom> Go ahead.
The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?
So the offending code in Tor today is:
for (i=0; i < (unsigned)smartlist_len(sl); i++) { tmp += bandwidths[i]; if (tmp >= rand_bw && !i_has_been_chosen) { i_chosen = i; i_has_been_chosen = 1; } }
Note that in the part of the loop before we choose the router, the tmp >= rand_bw part will be false and the !i_has_been_chosen part will never be executed. When we choose the router, both tests are executed and we run the conditional block. After we choose the route, tmp >= rand_bw will be true and !i_has_been chosen will be executed and will turn out to be false.
I wouldn't expect any cache misses to be different in these cases. So what we need to worry about is branch mispredictions. In the worst case, let's say that in the first half of the loop, we correctly predict the first test, and in the second part of the loop, we mispredict both tests every time (because our architecture sucks hard). So, the biggest difference in timing will be on the order of N_ROUTERS * branch_misprediction_cost * 2. Something like 3-10 cycles is a reasonable order of magnitude for branch_misprediction_cost on a modern CPU. I don't know where you'll find an embedded device with less than 400 MHz running Tor these days, so let's say our minimum clock speed is 200 MHz. Let's say there are 5000 routers. So in the worst imaginable case, that's 0.25 ms, which will definitely be detectable locally, and might be detectable remotely depending on assumptions, prevailing conditions, and the phase of the moon.
Making some more favorable assumptions, and assuming no huge branch mispredictions, assuming that it's only one extra cycle to check !i_has_been_chosen, assuming that our clock speed is a nice hefty 1 GHz: that's still 5 microseconds, which should be detectable locally.
Instead of calling the risk "low", do you mean "unproven" or something?
I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).
I have been using the patch that I wrote to fix this issue for over a month, with not even a scary log message.
The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?
The current code (after #6537 (moved)) loops at a different frequency after it selects a relay, which will be quite noticeable to someone picking up a smartphone's RF emissions.
The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?
Ok. I'm fine having this go into 0.2.3 if you come up with something you still want to put into 0.2.3.
Looks like we're not doing an 0.2.3-plus-extra series.
Do you still think this should go in mainline 0.2.3 (i.e. is it a huge security flaw that must be fixed or we're screwing a large number of our users)?