Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant

changed milestone to %Tor: 0.2.4.x-final

added 023backport component::core tor/tor milestone::Tor: 0.2.4.x-final priority::medium resolution::fixed status::closed tor-client type::enhancement labels

I think I'd argue this can wait until 0.2.4. (I might have argued #6537 (moved) could wait until 0.2.4 too.)

Trac:
Summary: Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invairant to Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant

I'm not sure I'm comfortable with that. Honestly, you may have to talk me out of doing this in 0.2.2.

That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.

Replying to nickm_mobile:

That said, there is a much simpler fix, if we can trust the compiler not to get too smart. Instead of setting i_has_been_chosen to 1 (as we do in the current code), set rand_bw t the o INT64_MAX (or whatever its type's max is). Remove i_has_been_chosen. Should work I think.

That would work, although you'll still need to switch to using integer math (rather than operating on the floating-point bandwidth values directly).

I'll bite; what's wrong with using DBL_MAX ?

ah; because you can't set rand_bw to DBL_MAX, because rand_bw is a uint64.

Okay. Branch "bug6538_023" in my public repository has the minimal fix for branch 0.2.3. Branch "bug6538" has additional code to refactor those functions a little more and extract the common logic into a testable function; it should be considered only for 0.2.4.

Trac:
Status: new to needs_review

Oops. Hadn't actually pushed "bug6538_023" to my public repo. Done. (If you started looking at "bug6538", please don't worry -- it contains all of "bug6538_023" and a little more.)

arma: I am still liable to merge bug6538_023 into maint-0.2.3 if nobody stops me.

Looking at bug6538_023:

     log_warn(LD_BUG, "Round-off error in computing bandwidth had an effect on "
              " which router we chose. Please tell the developers. "

is more likely to trigger now, since we make it all the way through the loop each time -- even though in nearly none of these cases does it actually have an effect on which router we chose? I recognize this is a potential problem with the original fix, not this one (unless I'm wrong).

"incancations"

More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.

Replying to arma:

More generally, this appears to be a complex tweak to a theoretical issue that is theoretically mostly solved in what we already merged? It includes messings to our autoconf and lots of new code. Why is it scheduled for 0.2.3? Or said another way, is the risk/reward ratio here suitable for forcing it on all wheezy users at the next rc? I see the risk as being somewhat low and the reward as being very low.

The risk is that somebody on your LAN (or a local eavesdropper, or your entry node), can figure out about what your path is. Conservative wisdom is that a fairly local attacker can detect 10-100ns timing leaks, and a remote one can detect 1-10ms timing leaks. I'm especially worried about slow mobile devices here.

Instead of calling the risk "low", do you mean "unproven" or something?

Replying to nickm:

Instead of calling the risk "low", do you mean "unproven" or something?

I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

rransom points out that he has some integer scaling code that's better than what I wrote:

13:48 < rransom> Was there a reason to use llround (and add autocrap goo to 
                 check for it) instead of my approach to int64_t-izing the 
                 bandwidths?
13:50 < nickm> Argh.  Context-switching.  I had thought that my approach was to 
               int64_t-ize: that is, to move the bandwidth array to uint64_t 
               earlier in the computation and get "double" out of the picture 
               entirely by the time that we did the "choose one randomly by 
               weight" computation.
13:51 < nickm> The reason I want the computation to be (array of uint64_t, 
               length, entropy) -> index is that I trust myself to reasoning 
               about this stuff much better when all values are integers.
13:51 < nickm> *to reason
13:52 < rransom> You used llround on (approximately) the values that would have 
                 been in the array of doubles before. I rescaled them to a 
                 total bandwidth around 2^61 and justcasted to int64_t .
13:54 < nickm> A decent idea; maybe we should do that too.  Why 2^61?
13:55 < rransom> Because it's big, but not close enough to 2^63 that I would 
                 have to think carefully about whether the roundoff errors 
                 could overflow and really break things.
14:01 < nickm> mind  if I C&P the above to the bug report?
14:01 < rransom> Go ahead.

Replying to arma:

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

So the offending code in Tor today is:

  for (i=0; i < (unsigned)smartlist_len(sl); i++) {
    tmp += bandwidths[i];
    if (tmp >= rand_bw && !i_has_been_chosen) {
      i_chosen = i;
      i_has_been_chosen = 1;
    }
  }

Note that in the part of the loop before we choose the router, the tmp >= rand_bw part will be false and the !i_has_been_chosen part will never be executed. When we choose the router, both tests are executed and we run the conditional block. After we choose the route, tmp >= rand_bw will be true and !i_has_been chosen will be executed and will turn out to be false.

I wouldn't expect any cache misses to be different in these cases. So what we need to worry about is branch mispredictions. In the worst case, let's say that in the first half of the loop, we correctly predict the first test, and in the second part of the loop, we mispredict both tests every time (because our architecture sucks hard). So, the biggest difference in timing will be on the order of N_ROUTERS * branch_misprediction_cost * 2. Something like 3-10 cycles is a reasonable order of magnitude for branch_misprediction_cost on a modern CPU. I don't know where you'll find an embedded device with less than 400 MHz running Tor these days, so let's say our minimum clock speed is 200 MHz. Let's say there are 5000 routers. So in the worst imaginable case, that's 0.25 ms, which will definitely be detectable locally, and might be detectable remotely depending on assumptions, prevailing conditions, and the phase of the moon.

Making some more favorable assumptions, and assuming no huge branch mispredictions, assuming that it's only one extra cycle to check !i_has_been_chosen, assuming that our clock speed is a nice hefty 1 GHz: that's still 5 microseconds, which should be detectable locally.

Replying to arma:

Replying to nickm:

Instead of calling the risk "low", do you mean "unproven" or something?

I mean the risk of applying it. That is, what are the chances it will produce strange crashes, strange path selection bugs, or a spew of "Sky is falling tell the developers!" warnings that forcce us to rush out a quick new release (which may or may not get picked up by distros like Wheezy, depending on timing).

I have been using the patch that I wrote to fix this issue for over a month, with not even a scary log message.

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

The current code (after #6537 (moved)) loops at a different frequency after it selects a relay, which will be quite noticeable to someone picking up a smartphone's RF emissions.

wrt the scale issue, I'm seeing typical values that suggest that the integer roundoff doesn't matter much here. Is their counterveiling evidence?

Replying to nickm:

wrt the scale issue, I'm seeing typical values that suggest that the integer roundoff doesn't matter much here. Is their counterveiling evidence?

Whether integer roundoff matters or not depends on the current value of weight_scale.

Also, are you sure that the C comparison operator you want to leave in (tmp > rand_bw) will be implemented in constant time on 32-bit systems?

Replying to arma:

The slow mobile devices question is a good point though. Do you seriously think that setting a variable 1000 times vs 1100 times is going to be noticeable to even a local network attacker?

Ok. I'm fine having this go into 0.2.3 if you come up with something you still want to put into 0.2.3.

Branch revised. Now targeting 0.2.4 with some possibility for backport.

Still looks good to me. Still passes unit tests. Going to merge soon.

The one to look at is "bug6538" in my public repository. I think it answers all of rransom's notes.

Merged this to master. I'm going to stick it in 023backport too; if I do an 0.2.3-plus-extras series, then assuming a lack of bugs, it goes in.

Trac:
Keywords: N/A deleted, 023backport added

Trac:
Keywords: 023backport deleted, 023backport tor-client added

Trac:
Component: Tor Client to Tor

Looks like we're not doing an 0.2.3-plus-extra series.

Do you still think this should go in mainline 0.2.3 (i.e. is it a huge security flaw that must be fixed or we're screwing a large number of our users)?

Marking a batch of tickets that had been under consideration for 0.2.3 backport as fixed-in-0.2.4. There is no plan for further 0.2.3 releases.

Trac:
Milestone: Tor: 0.2.3.x-final to Tor: 0.2.4.x-final
Status: needs_review to closed
Resolution: N/A to fixed

closed

mentioned in issue #6805 (moved)

moved to tpo/core/tor#6538 (closed)

mentioned in issue tpo/core/tor#6805 (closed)

Use bit-twiddling tricks to make choose-by-bandwith algorithm even more time-invariant

Child items ...

Activity