So instead of raising the timeout, let's make normal circuits behave more like hidden service circuits: keep resetting their timestamp_dirty every time a new stream is attached. This has the effect that a user will never suddenly get a new circuit in the middle of actively using a website, which will be a huge usability improvement. Still not ideal, but good enough to leave the actual circuit dirtiness timeout alone.
I am going to do this with a TBB-specific Tor patch for now so we can test this in TBB 4.5a5. We can then decide if we want to make this a torrc option after that.
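For concreteness, here is a minimal sketch in C of the dirtiness-reset idea (illustrative struct, constant, and function names, not the actual tor code):

```c
/* Sketch: measure MaxCircuitDirtiness from the *last* stream attach
 * rather than the first, the way hidden service circuits already
 * behave. All names here are illustrative. */
#include <time.h>

typedef struct circuit_t {
  time_t timestamp_dirty;   /* last time a stream was attached */
} circuit_t;

#define MAX_CIRCUIT_DIRTINESS 600  /* tor's default: 10 minutes */

/* Called whenever a new stream is attached to the circuit:
 * keep resetting the dirty timestamp. */
void circuit_note_stream_attached(circuit_t *circ, time_t now) {
  circ->timestamp_dirty = now;
}

/* A circuit is "too dirty" for new streams only once it has been
 * idle past the limit, not merely old. */
int circuit_too_dirty(const circuit_t *circ, time_t now) {
  return now - circ->timestamp_dirty > MAX_CIRCUIT_DIRTINESS;
}
```

With this, a circuit expires only after ten minutes of idle time, so an actively used website never switches circuits mid-session.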
I question whether this is something to do unconditionally, or whether it's only something to do when we have an external application managing circuit isolation for us.
I also wonder whether there shouldn't be some kind of randomized upper limit to how long this can keep a circuit alive.
GeKo and I both agree that the SOCKS isolation check is important here. However, I felt the randomness was definitely not the best plan in its current form: circuits could still end up very short-lived, which would still allow TBB to surprise the user with a new circuit at random intervals.
The way I thought I implemented the randomness didn't allow for short-lived circuits. It wasn't choosing randomly between [0,X], but instead between [X, X*9/8].
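A tiny sketch of that draw, with a hypothetical helper name (`randomized_dirtiness` is not a real tor function); the low end is X itself, so a freshly chosen value can never be shorter than the base timeout:

```c
/* Draw uniformly from [X, X*9/8], per the description above.
 * Hypothetical helper for illustration only. */
#include <stdlib.h>

long randomized_dirtiness(long base, unsigned seed) {
  srand(seed);
  long spread = base / 8;                 /* X*9/8 - X */
  return base + (rand() % (spread + 1)); /* uniform in [X, X + X/8] */
}
```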
For me, the idea of no maximum at all is a bit scary; it means that keepalive-type stuff will keep a circuit open forever even if the user doesn't expect that it would.
Finally -- did somebody test out the isolation thing, to make sure that only authentication-isolated circuits get the new behavior? I only wrote it; it does need some testing.
> The way I thought I implemented the randomness didn't allow for short-lived circuits. It wasn't choosing randomly between [0,X], but instead between [X, X*9/8].
You're right. In my haste I misread this.
> For me, the idea of no maximum at all is a bit scary; it means that keepalive-type stuff will keep a circuit open forever even if the user doesn't expect that it would.
Are there specific attacks or types of attacks that make you nervous here?
I believe there is a potential concern that we're making e2e correlation easier with longer circuit lifespans, but I think I am more worried about the case where I leave my browser open overnight and some website keeps pinging itself every so often to cause a new circuit to get built in order to discover my guard node. I am also worried about non-malicious but dumb websites that do this anyway, and end up exposing me to the entire Tor exit node population over the course of a few hours (during which time malicious exits get more chances to try to own my browser, sniff my cookies, perform e2e correlation, mount their own guard discovery attacks, or otherwise mess with me).
Guard discovery and exit node churn were the main security reasons I wanted a huge circuit dirtiness timeout in legacy/trac#13766 (moved), until Roger pointed out that a fixed-length custom circuit dirtiness would provide a 100% accurate distinguisher for Tor Browser users at the Guard node.
In my ideal world, we would find some way to make this distinguisher statistical (instead of absolute, instantaneous certainty) while still ensuring really long circuit lifetimes for websites that would otherwise cause circuit churn. Plenty of long-term semi-accurate statistical classifiers likely exist for Tor Browser traffic at the guard node, so this tradeoff seems like the right one to me.
> Finally -- did somebody test out the isolation thing, to make sure that only authentication-isolated circuits get the new behavior? I only wrote it; it does need some testing.
That's what alphas are for :). In fact, I'm revisiting this ticket right now because I'm still personally experiencing cases where my exit IP has changed in the middle of actively interacting with a website, and this change has disrupted my browsing activity. I'd like to brainstorm ways of making the circuit dirtiness larger without exposing us to attacks. However, I'm also heavily biased in the usability and experimentation direction here, due to the uncertain (and conflicting) nature of the security issues at hand...
My current (admittedly completely haphazard and half-baked) inclination is to do something like randomizing the initial per-circuit dirtiness between 10 minutes and a much larger value, reset the dirtiness to rand(0,10min) upon stream close (in circuit_detach_stream()), and still omit any maximum value on dirtiness. I'm updating this ticket to give you a chance to talk me down from this cliff, ideally within the next few days before I jump and commit a patch to do something insane like this for 4.5-stable ;).
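To make that half-baked proposal concrete, a hypothetical sketch (all names, and the 3-hour upper bound, are made up for illustration; the real reset would live in `circuit_detach_stream()`):

```c
/* Sketch of the floated scheme: pick the initial dirtiness in
 * [10 min, MUCH_LARGER], then on every stream close re-arm a
 * shorter fuse in [0, 10 min], with no overall maximum on total
 * circuit lifetime. All values hypothetical. */
#include <stdlib.h>

#define TEN_MIN        600L
#define MUCH_LARGER  10800L   /* hypothetical upper bound: 3 hours */

static long rand_between(long lo, long hi) {
  return lo + (rand() % (hi - lo + 1));
}

/* Seconds of dirtiness granted when the circuit is first used. */
long initial_dirtiness(void) {
  return rand_between(TEN_MIN, MUCH_LARGER);
}

/* Fuse re-armed when a stream closes (circuit_detach_stream()). */
long dirtiness_after_stream_close(void) {
  return rand_between(0, TEN_MIN);
}
```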
Ok, I attached a new version that also does the stream detach hack (which I added primarily because we're boosting our HTTP keep-alive back up to 2 minutes thanks to solving legacy/trac#4100 (moved)). This version also adds some debug loglines, and I tested it with TBB and torsocks 1.2 and 2.0. The SOCKS u+p check works, but I noticed that the old torsocks 1.x actually sends the UNIX username as the SOCKS username by default, but the new torsocks 2.0 has options for SOCKS u+p that are off by default.
I also noticed that there is still a subtle distinguisher here. If a non-hidserv circuit has been alive for more than 10 minutes after first use, the only way this could happen without this patch is if a stream was still open on this circuit. In that case, a normal Tor client would close that circuit immediately after receiving the RELAY_END cell from upstream. However, clients running any version of this patch will keep non-hidserv circuits open past the 10 minute mark, and then close them without necessarily receiving a cell from upstream.
I am not sure what to do about this. I think it would require a way to reliably differentiate hidserv from non-hidserv circuits to use effectively, but it might be pretty accurate after that. Does this distinguisher trump the usability win here?
I decided to create a SocksAuthCircuitRenewPeriod torrc option that governs how long we extend the lifetime of socksauth-isolated circuits each time a new stream arrives on them. I set the default value to 1 hour.
I also added code in circuit_is_acceptable() to allow us to keep using a circuit with SOCKS u+p auth even if it was otherwise too dirty. I did this because when we enable HTTP/2 (legacy/trac#14952 (moved)), we'll have super-long-lived connections that may actually exceed even the 1 hour circuit lifetime extension. To preserve our circuit UI and isolation model, we'll want to keep using the same circuit for new connections with the same auth in this case.
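A hedged sketch of those two behaviors together, with simplified made-up types (the real logic in tor's circuit-selection code is considerably more involved):

```c
/* Sketch: (1) each new SOCKS-auth stream extends the circuit's
 * lifetime by the renew period; (2) an otherwise-too-dirty circuit
 * stays acceptable for streams presenting the same SOCKS username.
 * Types and names are illustrative. */
#include <time.h>
#include <string.h>

#define SOCKS_AUTH_RENEW_PERIOD 3600  /* default: 1 hour */

typedef struct {
  time_t timestamp_dirty;
  char socks_username[256];  /* isolation credential, "" if none */
} circ_sketch_t;

/* Called when a new stream with SOCKS auth is attached:
 * extends lifetime by the renew period. */
void renew_on_stream(circ_sketch_t *c, time_t now) {
  c->timestamp_dirty = now;
}

/* circuit_is_acceptable()-style check: a dirty circuit remains
 * usable for streams that carry the same SOCKS username. */
int acceptable_for_stream(const circ_sketch_t *c, time_t now,
                          const char *stream_user) {
  int too_dirty = (now - c->timestamp_dirty) > SOCKS_AUTH_RENEW_PERIOD;
  int same_auth = c->socks_username[0] != '\0' &&
                  strcmp(c->socks_username, stream_user) == 0;
  return !too_dirty || same_auth;
}
```

The second check is what keeps a long-lived HTTP/2 session pinned to one circuit even past the renew period.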
Because the circuit_detach_stream() hack makes it easier to differentiate TBB users (since it removes any possibility of a circuit closing immediately after RELAY_END), I placed that in its own commit. I don't think I'll actually use this commit in 4.5, though I do think it will improve behavior once HTTP/2 is enabled.
Along the way, I noticed that circuit_is_better() has a serious bug where the circuit purpose value was actually being obtained incorrectly, causing the majority of that function body to be skipped, so I fixed that. When this is fixed, we also need to ensure that we actually keep using SOCKS auth circuits if a stream arrives with that same SOCKS auth, otherwise we'll actually increase circuit churn.
I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
> I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
I'd like to have a review for this one. I think this is the least we can do here given that this
a) is intended for a stable release
b) is coming without any testing in an alpha release, while we dropped other, less risky changes from the upcoming stable for exactly that reason
c) is potentially affecting all Tor users
d) deserves a much more thorough investigation/discussion about its possible impact wrt known attacks
e) ...
> I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
After running this through the weekend, I am in agreement. This branch hasn't seemed to cause any additional problems yet, but the circuit selection code is still making churn-heavy choices in places I can't track down, and it seems I only improved a very small number of cases of churn over the 1500 stream requests that happened in my browsers over the weekend.
Most frustratingly, the fixed circuit_is_better() in this branch was involved in roughly 15 of the 1500 stream requests, and it turns out it never made a decision that the buggy circuit_is_better() wouldn't also have made. It also somehow was not called in at least one case where two open circuits were available for a site with the same SOCKS auth; Tor was deciding between them somewhere else (and still annoying the shit out of me by causing me to suddenly get banned while in the middle of something).
The changes in circuit_is_acceptable() reduced circuit churn 4 times out of 1500, and circuit_detach_stream() made circuits live longer 6 times out of 1500, perhaps due to me also running legacy/trac#4100 (moved). The changes in circuit_detach_stream() make the distinguisher worse if everyone doesn't run them, though, so I don't think they are worth it.
> I decided to create a SocksAuthCircuitRenewPeriod torrc option that governs how long we extend the lifetime of socksauth-isolated circuits each time a new stream arrives on them.
Would it be an option to allow users to set this with a per-SOCKSPort flag, instead of a global torrc option? E.g. Tails has a dedicated SOCKSPort for Tor Browser, and I would feel slightly more comfortable enabling this longer circuit lifetime for that port only, and leaving the default untouched for other applications: my gut feeling is that other applications using SOCKS u+p may have different needs, and might be negatively impacted by longer circuit lifetime (be it in terms of functionality, UX, or anonymity).
Is this something the tails people need us to merge for 4.5.1, or is it OK if they pick up this patch independently?
I like the approach of adding a new SOCKSPort parameter IsolateKeepAlive, but it isn't strictly needed for TBB right now. I will want to pick it up when we prepare this for a mainline tor merge (though I also want to get to the bottom of where the hell circuit selection is happening that seems to avoid circuit_is_better()/circuit_get_best()).
The idea was that it might be possible to merge the IsolateKeepAlive approach to stable tor sooner than the previous approach, because it remains backward compatible with longstanding IsolateSOCKSAuth semantics.
One problem I didn't think through with IsolateKeepAlive is that if uncredentialed circuits (not associated with any URL bar domain) must not be kept alive, then this flag wouldn't really be suitable for TB.
What do you think about IsolateKeepAliveSOCKSAuth.patch? It's based on bug15482.patch, with all the logic kept in place (IsolateSOCKSAuth set + non-empty SOCKS credentials sent), but also an additional required condition (IsolateKeepAliveSOCKSAuth set).
(Other than that, I've only unified the `if` and `else if` branches, because they were identical in the original bug15482.patch: `circ->base_.timestamp_dirty = approx_time();`)
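The combined condition described above can be sketched as follows (the flag bit values are illustrative, not tor's actual isolation flag encoding):

```c
/* Sketch: refresh timestamp_dirty only when IsolateSOCKSAuth is set,
 * non-empty SOCKS credentials were sent, AND the additional
 * IsolateKeepAliveSOCKSAuth flag is also set. Bit positions are
 * made up for illustration. */
#define ISO_SOCKS_AUTH            (1u << 0)
#define ISO_KEEPALIVE_SOCKS_AUTH  (1u << 1)

int should_refresh_dirty_ts(unsigned iso_flags, const char *socks_user) {
  return (iso_flags & ISO_SOCKS_AUTH) &&
         (iso_flags & ISO_KEEPALIVE_SOCKS_AUTH) &&
         socks_user && socks_user[0] != '\0';
}
```

This keeps the longstanding IsolateSOCKSAuth semantics untouched unless the new flag is explicitly enabled.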
(TLDR: Someone is running a relay on a gigabit residential line behind an ISP supplied awful router, and it ends up being underutilized because the NAT table fills up.)
Since channels are torn down after the last circuit is destroyed, extending circuit lifespans will increase the number of open channels at any given time, so this new behavior is probably increasing the damage that relays on residential lines do to network health.
IMO it's not a reason to kill this feature since this problem will crop up as the network and our userbase grows (As the number of users tends towards infinity, the number of channels required at any given time gets closer and closer to the total number of relays in the network for a middle relay).
Metrics on the number of open circuits/channels on a given relay may be useful to get an idea about what we should ask of relay operators.
As it is deployed today, I don't think that this patch should be changing the total number of circuits significantly. If you are continuously using a website, you will have some circuit open for it. Really the only thing this patch does is make you tend to use the same circuit for that activity rather than using a new one.
In aggregate, I guess this might cause connections to live longer between relays, but I doubt it.
What is more likely the culprit is that in Tor Browser 4.5, we also reset our connection keepalive to match to Firefox default of 2 minutes (up from 20 seconds), since we solved legacy/trac#4100 (moved). That would cause more connections to stay open, since the HTTP connections are now being kept open for longer. I think this in combination with more users is the real problem.
SPDY and HTTP/2 will make this even worse, since its keep-alive duration is potentially infinite.
> As it is deployed today, I don't think that this patch should be changing the total number of circuits significantly. If you are continuously using a website, you will have some circuit open for it. Really the only thing this patch does is make you tend to use the same circuit for that activity rather than using a new one.
>
> In aggregate, I guess this might cause connections to live longer between relays, but I doubt it.
Metrics needed here, a bit too late to really characterize the old behavior, but I believe that number of channels/circuits per relay metrics will be useful information to have.
> What is more likely the culprit is that in Tor Browser 4.5, we also reset our connection keepalive to match to Firefox default of 2 minutes (up from 20 seconds), since we solved legacy/trac#4100 (moved). That would cause more connections to stay open, since the HTTP connections are now being kept open for longer. I think this in combination with more users is the real problem.
>
> SPDY and HTTP/2 will make this even worse, since its keep-alive duration is potentially infinite.
I will reiterate that this isn't a fundamental problem with this patch, since relays SHOULD (MUST) be able to support talking to any other relay (and to lots of clients or servers if they happen to be Guards/Exits). You brought up testing inter-relay connectivity and delisting (or reducing the consensus weight of) relays that fail to build circuits. I think this will be a good idea moving forward, but such tests are likely to be quite time-consuming. I seem to recall someone running this sort of experiment in the past, but I don't recall the trac number off the top of my head. I'd be willing to bet that, due to internet quirks, not all relays can talk to every other relay (even if they aren't stuck behind some crappy NAT box).
I think any form of max lifespan opens up the user to both guard discovery attacks as well as increased exit node and correlation exposure (because a max lifespan allows an application to be induced to continually reconnect until a compromised middle or exit node is chosen on a new circuit).
Beyond the security concerns (which should be sufficient by themselves), it is also terrible for usability. The limited lifespan of HTTP connections is a relic of the shittiness of HTTP/1.x. Both HTTP/2 and QUIC fix this and keep connections open forever, because that is how sessions actually work on the web. To drive home the usability impact of enforcing this max lifespan: would we ever force people to reconnect to their SSH servers every X minutes/hours/days through Tor? If we're not willing to do that, we shouldn't do the equivalent to the web.
I read that this is included in TBB and want to apply it to a local build, but I can't figure out which variation to use, and I can't find a TBB-tor branch in git. I'll try one of the more experimental ones if that would help, but otherwise I'd prefer the one in TBB.
It's basically mike's patch, made configurable, though not in the way rustybird did it, because that's broken (`isolation_flags` is a `uint8_t`). I could have extended the width of the flag field instead of carving out a separate thing for this, but it really feels like something that should be separate, and I didn't want to dig through all the code that assumes it's an 8-bit value.
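A tiny illustration of the width problem (hypothetical helper, not tor code): an eight-bit field can hold flags `1 << 0` through `1 << 7`, and a ninth flag silently truncates to zero when stored.

```c
/* Illustration only: storing a flag wider than 8 bits into a
 * uint8_t field loses the bit entirely. */
#include <stdint.h>

int fits_in_isolation_flags(unsigned flag) {
  uint8_t stored = (uint8_t)flag;  /* what the assignment would keep */
  return stored == flag;
}
```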
To consider: Looks pretty plausible, but I'm still worried by having circuits that are potentially immortal, where dirtiness simply never matters. We'd be losing the property that, after enough time has passed, you can be sure that old stuff isn't going on the same circuits you're still using.
Minor tweaks needed: the documentation for timestamp_dirty and MaxCircuitDirtiness need to get fixed to match the new behavior.
Oh, and I'm not sure it's right to ditch the randomness here either; I think it helps. (Rationale: in unpatched Tor, any adversary on the circuit can only learn, from the circuit close time, that the circuit is finally being used for no streams, and the time at which the first stream was attached to the circuit. But this is trivial for them to infer from traffic patterns anyway. With this patch, such an adversary can also learn a latest bound for the time at which the last stream opened, which wasn't visible to them before unless the streams are pretty isolated in time. I don't know if we should be worrying about this or not, but it deserves a moment's contemplation IMO.)
> To consider: Looks pretty plausible, but I'm still worried by having circuits that are potentially immortal, where dirtiness simply never matters. We'd be losing the property that, after enough time has passed, you can be sure that old stuff isn't going on the same circuits you're still using.
This property does not make sense for Tor Browser, because it's not how web sessions work (see comment:31). You don't just get to "wait a while" and suddenly your browser sessions are unlinkable. They are only unlinkable insofar as we actively enforce it by identifier management in the browser (which is identical to our socks auth usage). Any more surprise partial unlinkability you try to randomly sprinkle on the user is just usability failure.
If you insist on having a max if/when this merges, please ensure that Tor Browser can turn that completely off via another flag/parameter, or we're going to have to keep a silly patch around to disable it ourselves :/.
As for the randomness, I'm indifferent to it. It could prove useful, but I should point out that whatever you do there, you should also do to the timestamp_dirty updates for rend circs in connection_ap_handshake_attach_circuit(), otherwise you may create another distinguisher there. They have long since behaved exactly like this patch makes normal circuits behave, so we might as well keep them identical in whatever we decide.
So: I have no idea what the behavior is supposed to be, since I implemented what the Tor Browser people wanted, off by default and possible to disable, in an attempt to fix the fact that they're distributing a forked tor with radically different circuit dirtiness behavior than every single other tor (unless otherwise patched).
FWIW I think this being possible to disable/off by default is also kind of dumb, since it fragments behavior (even if Tor Browser isn't the only client usage of tor, it is by far the largest). I'm going to ignore this patch entirely till some sort of consensus is reached between mike and nickm.
Fixup pushed with the documentation changes. Same branch. I still think this should be on by default (and quite possibly mandatory) at least until Tor Browser ceases to be the primary Tor client implementation out there.