So instead of raising the timeout, let's make normal circuits behave more like hidden service circuits: keep resetting their timestamp_dirty every time a new stream is attached. This has the effect that a user will never suddenly get a new circuit in the middle of actively using a website, which will be a huge usability improvement. Still not ideal, but good enough to leave the actual circuit dirtiness timeout alone.
I am going to do this with a TBB-specific Tor patch for now so we can test this in TBB 4.5a5. We can then decide if we want to make this a torrc option after that.
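For concreteness, here is a minimal sketch in C of the dirtiness-reset idea (illustrative struct, constant, and function names, not the actual tor code):

```c
/* Sketch: measure MaxCircuitDirtiness from the *last* stream attach
 * rather than the first, the way hidden service circuits already
 * behave. All names here are illustrative. */
#include <time.h>

typedef struct circuit_t {
  time_t timestamp_dirty;   /* last time a stream was attached */
} circuit_t;

#define MAX_CIRCUIT_DIRTINESS 600  /* tor's default: 10 minutes */

/* Called whenever a new stream is attached to the circuit:
 * keep resetting the dirty timestamp. */
void circuit_note_stream_attached(circuit_t *circ, time_t now) {
  circ->timestamp_dirty = now;
}

/* A circuit is "too dirty" for new streams only once it has been
 * idle past the limit, not merely old. */
int circuit_too_dirty(const circuit_t *circ, time_t now) {
  return now - circ->timestamp_dirty > MAX_CIRCUIT_DIRTINESS;
}
```

With this, a circuit expires only after ten minutes of idle time, so an actively used website never switches circuits mid-session.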
I question whether this is something to do unconditionally, or whether it's only something to do when we have an external application managing circuit isolation for us.
I also wonder whether there shouldn't be some kind of randomized upper limit to how long this can keep a circuit alive.
GeKo and I both agree that the SOCKS isolation check is important here. However, I felt the randomness was definitely not the best plan in its current form: circuits could still end up very short-lived, which would still allow TBB to surprise the user with a new circuit at random intervals.
The way I thought I implemented the randomness didn't allow for short-lived circuits. It wasn't choosing randomly between [0,X], but instead between [X, X*9/8].
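A tiny sketch of that draw, with a hypothetical helper name (`randomized_dirtiness` is not a real tor function); the low end is X itself, so a freshly chosen value can never be shorter than the base timeout:

```c
/* Draw uniformly from [X, X*9/8], per the description above.
 * Hypothetical helper for illustration only. */
#include <stdlib.h>

long randomized_dirtiness(long base, unsigned seed) {
  srand(seed);
  long spread = base / 8;                 /* X*9/8 - X */
  return base + (rand() % (spread + 1)); /* uniform in [X, X + X/8] */
}
```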
For me, the idea of no maximum at all is a bit scary; it means that keepalive-type stuff will keep a circuit open forever even if the user doesn't expect that it would.
Finally -- did somebody test out the isolation thing, to make sure that only authentication-isolated circuits get the new behavior? I only wrote it; it does need some testing.
> The way I thought I implemented the randomness didn't allow for short-lived circuits. It wasn't choosing randomly between [0,X], but instead between [X, X*9/8].
You're right. In my haste I misread this.
> For me, the idea of no maximum at all is a bit scary; it means that keepalive-type stuff will keep a circuit open forever even if the user doesn't expect that it would.
Are there specific attacks or types of attacks that make you nervous here?
I believe there is a potential concern that we're making e2e correlation easier with longer circuit lifespans, but I think I am more worried about the case where I leave my browser open overnight and some website keeps pinging itself every so often to cause a new circuit to get built in order to discover my guard node. I am also worried about non-malicious but dumb websites that do this anyway, and end up exposing me to the entire Tor exit node population over the course of a few hours (during which time malicious exits get more chances to try to own my browser, sniff my cookies, perform e2e correlation, mount their own guard discovery attacks, or otherwise mess with me).
Guard discovery and exit node churn were the main security reasons I wanted a huge circuit dirtiness timeout in legacy/trac#13766 (moved), until Roger pointed out that a fixed-length custom circuit dirtiness would provide a 100% accurate distinguisher for Tor Browser users at the Guard node.
In my ideal world, we would find some way to make this distinguisher statistical (instead of absolute, instantaneous certainty) while still ensuring really long circuit lifetimes for websites that would otherwise cause circuit churn. Plenty of long-term semi-accurate statistical classifiers likely exist for Tor Browser traffic at the guard node, so this tradeoff seems like the right one to me.
> Finally -- did somebody test out the isolation thing, to make sure that only authentication-isolated circuits get the new behavior? I only wrote it; it does need some testing.
That's what alphas are for :). In fact, I'm revisiting this ticket right now because I'm still personally experiencing cases where my exit IP has changed in the middle of actively interacting with a website, and this change has disrupted my browsing activity. I'd like to brainstorm ways of making the circuit dirtiness larger without exposing us to attacks. However, I'm also heavily biased in the usability and experimentation direction here, due to the uncertain (and conflicting) nature of the security issues at hand...
My current (admittedly completely haphazard and half-baked) inclination is to do something like randomizing the initial per-circuit dirtiness between 10 minutes and a much larger value, reset the dirtiness to rand(0,10min) upon stream close (in circuit_detach_stream()), and still omit any maximum value on dirtiness. I'm updating this ticket to give you a chance to talk me down from this cliff, ideally within the next few days before I jump and commit a patch to do something insane like this for 4.5-stable ;).
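To make that half-baked proposal concrete, a hypothetical sketch (all names, and the 3-hour upper bound, are made up for illustration; the real reset would live in `circuit_detach_stream()`):

```c
/* Sketch of the floated scheme: pick the initial dirtiness in
 * [10 min, MUCH_LARGER], then on every stream close re-arm a
 * shorter fuse in [0, 10 min], with no overall maximum on total
 * circuit lifetime. All values hypothetical. */
#include <stdlib.h>

#define TEN_MIN        600L
#define MUCH_LARGER  10800L   /* hypothetical upper bound: 3 hours */

static long rand_between(long lo, long hi) {
  return lo + (rand() % (hi - lo + 1));
}

/* Seconds of dirtiness granted when the circuit is first used. */
long initial_dirtiness(void) {
  return rand_between(TEN_MIN, MUCH_LARGER);
}

/* Fuse re-armed when a stream closes (circuit_detach_stream()). */
long dirtiness_after_stream_close(void) {
  return rand_between(0, TEN_MIN);
}
```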
Ok, I attached a new version that also does the stream detach hack (which I added primarily because we're boosting our HTTP keep-alive back up to 2 minutes thanks to solving legacy/trac#4100 (moved)). This version also adds some debug loglines, and I tested it with TBB and torsocks 1.2 and 2.0. The SOCKS u+p check works, but I noticed that the old torsocks 1.x actually sends the UNIX username as the SOCKS username by default, but the new torsocks 2.0 has options for SOCKS u+p that are off by default.
I also noticed that there is still a subtle distinguisher here. If a non-hidserv circuit has been alive for more than 10 minutes after first use, the only way this could happen without this patch is if a stream was still open on this circuit. In that case, a normal Tor client would close that circuit immediately after receiving the RELAY_END cell from upstream. However, clients running any version of this patch will keep non-hidserv circuits open past the 10 minute mark, and then close them without necessarily receiving a cell from upstream.
I am not sure what to do about this. I think it would require a way to reliably differentiate hidserv from non-hidserv circuits to use effectively, but it might be pretty accurate after that. Does this distinguisher trump the usability win here?
I decided to create a SocksAuthCircuitRenewPeriod torrc option that governs how long we extend the lifetime of socksauth-isolated circuits each time a new stream arrives on them. I set the default value to 1 hour.
I also added code in circuit_is_acceptable() to allow us to keep using a circuit with SOCKS u+p auth even if it was otherwise too dirty. I did this because when we enable HTTP/2 (legacy/trac#14952 (moved)), we'll have super-long-lived connections that may actually exceed even the 1 hour circuit lifetime extension. To preserve our circuit UI and isolation model, we'll want to keep using the same circuit for new connections with the same auth in this case.
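A hedged sketch of those two behaviors together, with simplified made-up types (the real logic in tor's circuit-selection code is considerably more involved):

```c
/* Sketch: (1) each new SOCKS-auth stream extends the circuit's
 * lifetime by the renew period; (2) an otherwise-too-dirty circuit
 * stays acceptable for streams presenting the same SOCKS username.
 * Types and names are illustrative. */
#include <time.h>
#include <string.h>

#define SOCKS_AUTH_RENEW_PERIOD 3600  /* default: 1 hour */

typedef struct {
  time_t timestamp_dirty;
  char socks_username[256];  /* isolation credential, "" if none */
} circ_sketch_t;

/* Called when a new stream with SOCKS auth is attached:
 * extends lifetime by the renew period. */
void renew_on_stream(circ_sketch_t *c, time_t now) {
  c->timestamp_dirty = now;
}

/* circuit_is_acceptable()-style check: a dirty circuit remains
 * usable for streams that carry the same SOCKS username. */
int acceptable_for_stream(const circ_sketch_t *c, time_t now,
                          const char *stream_user) {
  int too_dirty = (now - c->timestamp_dirty) > SOCKS_AUTH_RENEW_PERIOD;
  int same_auth = c->socks_username[0] != '\0' &&
                  strcmp(c->socks_username, stream_user) == 0;
  return !too_dirty || same_auth;
}
```

The second check is what keeps a long-lived HTTP/2 session pinned to one circuit even past the renew period.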
Because the circuit_detach_stream() hack makes it easier to differentiate TBB users (since it removes any possibility of a circuit closing immediately after RELAY_END), I placed that in its own commit. I don't think I'll actually use this commit in 4.5, though I do think it will improve behavior once HTTP/2 is enabled.
Along the way, I noticed that circuit_is_better() has a serious bug where the circuit purpose value was actually being obtained incorrectly, causing the majority of that function body to be skipped, so I fixed that. When this is fixed, we also need to ensure that we actually keep using SOCKS auth circuits if a stream arrives with that same SOCKS auth, otherwise we'll actually increase circuit churn.
I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
> I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
I'd like to have a review for this one. I think this is the least we can do here given that this
a) is intended for a stable release
b) is coming without any testing in an alpha release, while we dropped other, less risky changes from the upcoming stable for exactly that reason
c) is potentially affecting all Tor users
d) deserves a much more thorough investigation/discussion about its possible impact wrt known attacks
e) ...
> I'm going to let that branch run on my TBB through the weekend and keep an eye on the loglines, and if it still seems good to me by Monday, I'll probably apply everything but the circuit_detach_stream() commit to the 4.5-stable release.
After running this through the weekend, I am in agreement. This branch hasn't seemed to cause any additional problems yet, but the circuit selection code is still making churn-heavy choices in places I can't track down, and it seems I only improved a very small number of cases of churn over the 1500 stream requests that happened in my browsers over the weekend.
Most frustratingly, the fixed circuit_is_better() in this branch was involved in roughly 15 of the 1500 stream requests, and it turns out it never made a decision that the buggy circuit_is_better() wouldn't also have made. It also somehow was not called in at least one case where two open circuits were available for a site with the same SOCKS auth; Tor was deciding between them somewhere else (and still annoying the shit out of me by causing me to suddenly get banned while in the middle of something).
The changes in circuit_is_acceptable() reduced circuit churn 4 times out of 1500, and circuit_detach_stream() made circuits live longer 6 times out of 1500, perhaps due to me also running legacy/trac#4100 (moved). The changes in circuit_detach_stream() make the distinguisher worse if everyone doesn't run them, though, so I don't think they are worth it.
> I decided to create a SocksAuthCircuitRenewPeriod torrc option that governs how long we extend the lifetime of socksauth-isolated circuits each time a new stream arrives on them.
Would it be an option to allow users to set this with a per-SOCKSPort flag, instead of a global torrc option? E.g. Tails has a dedicated SOCKSPort for Tor Browser, and I would feel slightly more comfortable enabling this longer circuit lifetime for that port only, and leaving the default untouched for other applications: my gut feeling is that other applications using SOCKS u+p may have different needs, and might be negatively impacted by longer circuit lifetime (be it in terms of functionality, UX, or anonymity).
Is this something the tails people need us to merge for 4.5.1, or is it OK if they pick up this patch independently?
I like the approach of adding a new SOCKSPort parameter IsolateKeepAlive, but it isn't strictly needed for TBB right now. I will want to pick it up when we prepare this for a mainline tor merge (though I also want to get to the bottom of where the hell circuit selection is happening that seems to avoid circuit_is_better()/circuit_get_best()).
The idea was that it might be possible to merge the IsolateKeepAlive approach to stable tor sooner than the previous approach, because it remains backward compatible with longstanding IsolateSOCKSAuth semantics.
One problem I didn't think through with IsolateKeepAlive is that if uncredentialed circuits (not associated with any URL bar domain) must not be kept alive, then this flag wouldn't really be suitable for TB.
What do you think about IsolateKeepAliveSOCKSAuth.patch? It's based on bug15482.patch, with all the logic kept in place (IsolateSOCKSAuth set + non-empty SOCKS credentials sent), but also an additional required condition (IsolateKeepAliveSOCKSAuth set).
(Other than that, I've only unified the `if` and `else if` branches, because they were identical in the original bug15482.patch: `circ->base_.timestamp_dirty = approx_time();`)
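The combined condition described above can be sketched as follows (the flag bit values are illustrative, not tor's actual isolation flag encoding):

```c
/* Sketch: refresh timestamp_dirty only when IsolateSOCKSAuth is set,
 * non-empty SOCKS credentials were sent, AND the additional
 * IsolateKeepAliveSOCKSAuth flag is also set. Bit positions are
 * made up for illustration. */
#define ISO_SOCKS_AUTH            (1u << 0)
#define ISO_KEEPALIVE_SOCKS_AUTH  (1u << 1)

int should_refresh_dirty_ts(unsigned iso_flags, const char *socks_user) {
  return (iso_flags & ISO_SOCKS_AUTH) &&
         (iso_flags & ISO_KEEPALIVE_SOCKS_AUTH) &&
         socks_user && socks_user[0] != '\0';
}
```

This keeps the longstanding IsolateSOCKSAuth semantics untouched unless the new flag is explicitly enabled.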
(TLDR: Someone is running a relay on a gigabit residential line behind an ISP supplied awful router, and it ends up being underutilized because the NAT table fills up.)
Since channels are torn down after the last circuit is destroyed, extending circuit lifespans will increase the number of open channels at any given time, so this new behavior is probably increasing the damage that relays on residential lines do to network health.
IMO it's not a reason to kill this feature since this problem will crop up as the network and our userbase grows (As the number of users tends towards infinity, the number of channels required at any given time gets closer and closer to the total number of relays in the network for a middle relay).
Metrics on the number of open circuits/channels on a given relay may be useful to get an idea about what we should ask of relay operators.
As it is deployed today, I don't think that this patch should be changing the total number of circuits significantly. If you are continuously using a website, you will have some circuit open for it. Really the only thing this patch does is make you tend to use the same circuit for that activity rather than using a new one.
In aggregate, I guess this might cause connections to live longer between relays, but I doubt it.
What is more likely the culprit is that in Tor Browser 4.5, we also reset our connection keepalive to match to Firefox default of 2 minutes (up from 20 seconds), since we solved legacy/trac#4100 (moved). That would cause more connections to stay open, since the HTTP connections are now being kept open for longer. I think this in combination with more users is the real problem.
SPDY and HTTP/2 will make this even worse, since its keep-alive duration is potentially infinite.
> As it is deployed today, I don't think that this patch should be changing the total number of circuits significantly. If you are continuously using a website, you will have some circuit open for it. Really the only thing this patch does is make you tend to use the same circuit for that activity rather than using a new one.
>
> In aggregate, I guess this might cause connections to live longer between relays, but I doubt it.
Metrics needed here, a bit too late to really characterize the old behavior, but I believe that number of channels/circuits per relay metrics will be useful information to have.
> What is more likely the culprit is that in Tor Browser 4.5, we also reset our connection keepalive to match to Firefox default of 2 minutes (up from 20 seconds), since we solved legacy/trac#4100 (moved). That would cause more connections to stay open, since the HTTP connections are now being kept open for longer. I think this in combination with more users is the real problem.
>
> SPDY and HTTP/2 will make this even worse, since its keep-alive duration is potentially infinite.
I will reiterate that this isn't a fundamental problem with this patch, since relays SHOULD (MUST) be able to support talking to any other relay (and to lots of clients or servers if they happen to be Guards/Exits). You brought up testing inter-relay connectivity and delisting (or reducing the consensus weight of) relays that fail to build circuits. I think this will be a good idea moving forward, but such tests are likely to be quite time-consuming. I seem to recall someone running this sort of experiment in the past, but I don't recall the trac number off the top of my head. I'd be willing to bet that, due to internet quirks, not all relays can talk to every other relay (even if they aren't stuck behind some crappy NAT box).
I think any form of max lifespan opens up the user to both guard discovery attacks as well as increased exit node and correlation exposure (because a max lifespan allows an application to be induced to continually reconnect until a compromised middle or exit node is chosen on a new circuit).
Beyond the security concerns (which should be sufficient by themselves), it is also terrible for usability. The limited lifespan of HTTP connections is a relic of the shittiness of HTTP/1.x. Both HTTP/2 and QUIC fix this and keep connections open forever, because that is how sessions actually work on the web. To drive home the usability impact of enforcing this max lifespan: would we ever force people to reconnect to their SSH servers every X minutes/hours/days through Tor? If we're not willing to do that, we shouldn't do the equivalent to the web.
I read that this is included in TBB and want to apply it to a local build, but I can't figure out which variation to use, and I can't find a TBB-tor branch in git. I'll try one of the more experimental ones if that would help, but otherwise I'd prefer the one in TBB.
It's basically mike's patch, made configurable, though not in the way rustybird did it, because that's broken (`isolation_flags` is a `uint8_t`). I could have extended the width of the flag field instead of carving out a separate thing for this, but it really feels like something that should be separate, and I didn't want to dig through all the code that assumes it's an 8-bit value.
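A tiny illustration of the width problem (hypothetical helper, not tor code): an eight-bit field can hold flags `1 << 0` through `1 << 7`, and a ninth flag silently truncates to zero when stored.

```c
/* Illustration only: storing a flag wider than 8 bits into a
 * uint8_t field loses the bit entirely. */
#include <stdint.h>

int fits_in_isolation_flags(unsigned flag) {
  uint8_t stored = (uint8_t)flag;  /* what the assignment would keep */
  return stored == flag;
}
```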
To consider: Looks pretty plausible, but I'm still worried by having circuits that are potentially immortal, where dirtiness simply never matters. We'd be losing the property that, after enough time has passed, you can be sure that old stuff isn't going on the same circuits you're still using.
Minor tweaks needed: the documentation for timestamp_dirty and MaxCircuitDirtiness need to get fixed to match the new behavior.
Oh, and I'm not sure it's right to ditch the randomness here either; I think it helps. (Rationale: in unpatched Tor, any adversary on the circuit can only learn, from the circuit close time, that the circuit is finally being used for no streams, and the time at which the first stream was attached to the circuit. But this is trivial for them to infer from traffic patterns anyway. With this patch, such an adversary can also learn a latest bound for the time at which the last stream opened, which wasn't visible to them before unless the streams are pretty isolated in time. I don't know if we should be worrying about this or not, but it deserves a moment's contemplation IMO.)
> To consider: Looks pretty plausible, but I'm still worried by having circuits that are potentially immortal, where dirtiness simply never matters. We'd be losing the property that, after enough time has passed, you can be sure that old stuff isn't going on the same circuits you're still using.
This property does not make sense for Tor Browser, because it's not how web sessions work (see comment:31). You don't just get to "wait a while" and suddenly your browser sessions are unlinkable. They are only unlinkable insofar as we actively enforce it by identifier management in the browser (which is identical to our socks auth usage). Any more surprise partial unlinkability you try to randomly sprinkle on the user is just usability failure.
If you insist on having a max if/when this merges, please ensure that Tor Browser can turn that completely off via another flag/parameter, or we're going to have to keep a silly patch around to disable it ourselves :/.
As for the randomness, I'm indifferent to it. It could prove useful, but I should point out that whatever you do there, you should also do to the timestamp_dirty updates for rend circs in connection_ap_handshake_attach_circuit(), otherwise you may create another distinguisher there. They have long since behaved exactly like this patch makes normal circuits behave, so we might as well keep them identical in whatever we decide.
So: I have no idea what the behavior is supposed to be, since I implemented what the Tor Browser people wanted, off by default and possible to disable, in an attempt to fix the fact that they're distributing a forked tor with radically different circuit dirtiness behavior than every single other tor (unless otherwise patched).
FWIW I think this being possible to disable/off by default is also kind of dumb, since it fragments behavior (even if Tor Browser isn't the only client usage of tor, it is by far the largest). I'm going to ignore this patch entirely till some sort of consensus is reached between mike and nickm.
Fixup pushed with the documentation changes. Same branch. I still think this should be on by default (and quite possibly mandatory) at least until Tor Browser ceases to be the primary Tor client implementation out there.