Following #5752 (moved), all web content for a page is requested on a circuit devoted to the page's URL bar domain. OCSP requests, however, are being incorrectly sent on a separate circuit. Favicons and DNS queries, etc, should also be checked for violations.
Probably we need to fix ThirdPartyUtil::GetFirstPartyUri to return the parent page's domain for OCSP requests.
Trac: Description: Following #5752 (moved), all web content for a page is requested on a circuit devoted to the page's URL bar domain. OCSP requests, however, are being sent on a separate circuit. Favicons and DNS queries, etc, should also be checked for violations.
Probably we need to fix ThirdPartyUtil::GetFirstPartyUri to return the parent page's domain for OCSP requests.
Following #5752 (moved), all web content for a page is requested on a circuit devoted to the page's URL bar domain. OCSP requests, however, are being incorrectly sent on a separate circuit. Favicons and DNS queries, etc, should also be checked for violations.
Probably we need to fix ThirdPartyUtil::GetFirstPartyUri to return the parent page's domain for OCSP requests.
Yes, right now I imagine update checks and updates will run on the default circuit (no first party URI). But maybe we should put these on a special circuit?
Just looking at the commit messages: what happened to the update and blocklist checks?
These patches don't address updates and blocklist checks. My feeling is these are probably OK on the default circuit, but I'm open to other suggestions.
Let's see how it goes if I add the missing aIsolationKey.
Sorry -- I seem to have failed to re-push that fix. It should be correct now.
That one fixes it but the build is still failing on Linux systems (have not tested the other ones yet):
pkix_pl_ocspresponse.c: In function 'pkix_pl_OcspResponse_Create':pkix_pl_ocspresponse.c:493:30: warning: passing argument 5 of 'hcv1->createFcn' makes pointer from integer without a cast rv = (*hcv1->createFcn)(serverSession, "http", ^pkix_pl_ocspresponse.c:493:30: note: expected 'const char *' but argument is of type 'PRIntervalTime'pkix_pl_ocspresponse.c:493:30: warning: passing argument 6 of 'hcv1->createFcn' makes integer from pointer without a castpkix_pl_ocspresponse.c:493:30: note: expected 'PRIntervalTime' but argument is of type 'void **'pkix_pl_ocspresponse.c:493:30: error: too few arguments to function 'hcv1->createFcn'make[9]: *** [/home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspresponse.o] Error 1
Let's see how it goes if I add the missing aIsolationKey.
Sorry -- I seem to have failed to re-push that fix. It should be correct now.
That one fixes it but the build is still failing on Linux systems (have not tested the other ones yet):
{{{
pkix_pl_ocspresponse.c: In function 'pkix_pl_OcspResponse_Create':
pkix_pl_ocspresponse.c:493:30: warning: passing argument 5 of 'hcv1->createFcn' makes pointer from integer without a cast
rv = (*hcv1->createFcn)(serverSession, "http",
^
pkix_pl_ocspresponse.c:493:30: note: expected 'const char *' but argument is of type 'PRIntervalTime'
pkix_pl_ocspresponse.c:493:30: warning: passing argument 6 of 'hcv1->createFcn' makes integer from pointer without a cast
pkix_pl_ocspresponse.c:493:30: note: expected 'PRIntervalTime' but argument is of type 'void **'
pkix_pl_ocspresponse.c:493:30: error: too few arguments to function 'hcv1->createFcn'
make[9]: *** [/home/ubuntu/build/tor-browser/obj-i686-pc-linux-gnu/security/nss/lib/libpkix/pkix_pl_nss/pki/pkix_pl_ocspresponse.o] Error 1
}}}
I'll work on fixing this. I developed the code on a Mac and had not tested linux.
For some reason, the linux version of tor-browser.git wanted to build libpkix, whereas OS X did not. libpkix is off by default (security.use_mozillapkix_verification = true) in FF31 and removed completely by FF34. A new library, mozilla::pkix, is used by default instead. So I implemented first-party isolation in mozilla::pkix only. In libpkix a few isolationKey arguments are inserted for compatibility with headers, but all isolationKeys are set to NULL.
This one is crashing for me on a 64bit Linux right after the browser window gets up. Probably when the first OCSP request happens. I had the locale patch as well in that build but I doubt it is the culprit. To be super sure I am just rebuilding without it.
The stacktrace gives just a crash somewhere deep down in libxul. We'll see if I can find more out by using our broken debug symbols.
Marking this as needs_revision in order to not lose time for looking for the bug (I'll change that again if the JS locale patch caused this).
There is supposed to be something wrong with our debug symbols (#13917 (moved)) but I think the following might be helpful, though:
{{{
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdaafe700 (LWP 3114)]
0x00007ffff3287941 in mozilla::psm::CertIDHash(unsigned char (&) [48], CERTCertificateStr const*, CERTCertificateStr const*, char const*) ()
at /home/ubuntu/build/tor-browser/security/certverifier/OCSPCache.cpp:79
[snip]
This was indeed helpful. I think the issue is that strlen(aIsolationKey) at security/certverifier/OCSPCache.cpp:79 is segfaulting when aIsolationKey is null.
Unfortunately my linux build of tor-browser.git is taking absolutely forever inside VirtualBox, so I haven't had a chance to check this directly myself. But I did confirm with a small test C program that strlen((char *) NULL) segfaults.
The favicon portion of this patch checks and sets an nsINode attribute that specifies the first party. I believe this can be abused by content to set its own attributes to circumvent our domain isolation.
I also feel that the OCSP cache isolation is too invasive - it touches too many pieces of the code. This patch seems very unlikely to be taken by Mozilla. We need to find a less invasive way of isolating the OCSP cache and requests.
So that #13766 (moved) can still move forward, I pushed a Torbutton commit that keeps the circuit dirty timeout at 10 minutes for requests that we can't find a first party for.
We can perhaps revisit this during/after the ff38-esr rebase, but it is too large, untested, and risky for 4.5.
The favicon portion of this patch checks and sets an nsINode attribute that specifies the first party. I believe this can be abused by content to set its own attributes to circumvent our domain isolation.
This is a good observation that I had missed. One alternative might be to add a SetIsolationKey method to nsINodes, that would only be accessible to C++ and chrome javascript.
The favicon portion of this patch checks and sets an nsINode attribute that specifies the first party. I believe this can be abused by content to set its own attributes to circumvent our domain isolation.
I also feel that the OCSP cache isolation is too invasive - it touches too many pieces of the code. This patch seems very unlikely to be taken by Mozilla. We need to find a less invasive way of isolating the OCSP cache and requests.
So that #13766 (moved) can still move forward, I pushed a Torbutton commit that keeps the circuit dirty timeout at 10 minutes for requests that we can't find a first party for.
That's very unfortunate. While I agree we should not take the patches in this bug having different timeouts for different circuits depending on the load might reveal quite a bit of information which makes me quite uncomfortable. Not sure how much yet though and how serious this is as I currently can't think clearly about it due to last minute stress :-/
The favicon portion of this patch checks and sets an nsINode attribute that specifies the first party. I believe this can be abused by content to set its own attributes to circumvent our domain isolation.
It's probably useful to note here that there are two pathways for requesting favicons. The nsINode fix deals with the part of the code that injects an ordinary content image into the chrome tab object.
I also feel that the OCSP cache isolation is too invasive - it touches too many pieces of the code. This patch seems very unlikely to be taken by Mozilla. We need to find a less invasive way of isolating the OCSP cache and requests.
I think the OCSP patch may be less invasive than it looks. While a lot of files are involved, most of the changes are minimal, and involve simply adding an "isolationKey" parameter to an internal API call. The only potentially breaking change to an outward-facing API is in nsISocketProvider.idl. But this API is native-only, so it is unlikely to interfere with extensions.
I've been trying to find ways to make this patch smaller, and I'm open to suggestions, but given the existing code path from new HTTPS connection to OCSP request, I haven't succeeded in finding a good way to shorten it. (Mozilla did accept another patch from us which changes substantially more lines of code: https://hg.mozilla.org/mozilla-central/rev/04c452764339)
In case it's useful to understanding this patch, the code path of the isolation key string is as follows:
The browser is making an original HTTPS request using an nsHttpChannel. nsHttpChannel::BeginConnect() calls GetFirstPartyIsolationURI and passes it to the nsHttpConnectionInfo constructor (1).
Soon, nsHalfOpenSocket::SetupStreams, called by HttpConnectionManager, gets the isolation key using nsHttpConnectionInfo->GetIsolationKey() and passes it to a new nsSocketTransport instance (2).
nsSocketTransport::BuildSocket passes it to nsISocketProvider::NewSocket and nsISocketProvider::AddToSocket(3).
The three implementations of these latter two methods in turn pass the isolationKey to nsSSLIOLayerNewSocket and nsSSLIOLayerAddToSocket(4), which in turn assign the isolationKey to a new TransportSecurityInfoinstance (5).
The socket now needs to be verified: SSLServerCertVerificationJob AuthCertificate is called, gets the isolationKey from the TransportSecurityInfo instance (6). The isolationKey is then passed to CertVerifier::VerifySSLServerCert(7), to CertVerifier::VerifyCert(8), and then to CertVerifier::MozillaPKIXVerifyCert(9), the last of which instantiates an NSSCertDBTrustDomain containing the isolation key (10).
When NSSCertDBTrustDomain::CheckRevocation is called, it passes the isolationKey to OCSPCache::Get, OCSPCache::Put and OCSPRequestor::DoOCSPRequest()(11).
Up to this point, all code has simply been passing the isolationKey down from the original http channel to the OCSP code.
Now OCSPCache.Get() and OCSPCache.Put() call OCSPCache::CertIDHash, which uses the isolationKey in its generation of the hash. (12).
The DoOCSPRequest call stores the isolation key in new nsNSSHttpRequestSession instance (in a createFcn call). (13)
The nsNSSHttpRequestSession instance creates a new nsHttpDownloadEvent, which uses the nsNSSHttpRequestSession's isolation key to set the document URI for a new nsHttpChannel for the OCSP request (14). The channel's document URI set to ensure that when GetFirstPartyIsolationURI is later on the channel by torbutton's domain isolator, the same first party URI will be returned as that for the original HTTPS channel (16).
I tested the latest patches and the crashes are gone + the isolation seems to work as expected. Mike could you review the changes? I'd like to include the patches into 5.0a1. Oh, and in addition to Mozilla taking patches that are larger they declined to take a patch we ship due to complexity issues. Thus, I think we should ship the ones in this ticket while looking in parallel for a better solution after Mozilla officially declined to take our code.
Trac: Keywords: N/Adeleted, TorBrowserTeam201505R, MikePerry201505R added Status: needs_revision to needs_review
It seems that in nsHttpChannel::BeginConnect(), the isolationKey is the full URI spec. Shouldn't it be using ThirdPartyUtil::GetFirstPartyHostForIsolation() instead of the full spec?
nsHttpConnectionInfo::SetOriginServer() is also using the isolationKey here. This is going to cause much stronger HTTP Keep-Alive isolation than we originally had, esp if the isolationKey differs from how we use SOCKS u+p.
I am made very nervous by the use of bare char pointers rather than nsCStrings for passing (and in NSSCertDBTrustDomain, holding) the isolationKey string. Are we absolutely sure the lifetime of the objects using the bare pointer is always shorter than wherever this isolationKey's pointer memory is held? This seems especially tricky because there are a couple of different places that hold a copy of the isolationKey, and it is hard to untangle which bare pointer is really held by which object. Is the additional string allocation+copy really a serious perf concern here? Otherwise, I think we risk some really annoying-to-debug UAFs, especially if the OCSP request handling is refactored, made more async, or otherwise further decoupled from the HTTP layer in future Firefox releases.
In fact, I suspect that it may be the case that our longer HTTP keep-alive is just causing us not to see more cases where issue 3 would cause crashes. If the original HTTPS connection gets torn down long before the OCSP request completes, I suspect we'll again see issues...
I think my concerns wrt to issue 3 at the moment are limited to NSSCertDBTrustDomain holding the bare pointer. If we make that be an nsCString, I think we may be OK, since the rest is just arg passing. But if any of these function calls suddenly become async in FF38 or later, we'll be sad again.
In the interest of getting us closer to 5.0a1, I will fix up my concerns in a fixup commit.. But I'd still like this to have more eyes (mcs+brade, ideally), and I'd like us to think about how we can protect against future issues.
In the interest of getting us closer to 5.0a1, I will fix up my concerns in a fixup commit..
Testing the 5.0a1 the OCSP isolation does not work anymore (it worked while testing the arthur's latest patches in comment:34). The catch-all-curcuit is used for me as without the patch. So I guess either your fixups are causing this or some interference with #15933 (moved). Either way this needs revision.
Kathy and I did our best to review these patches. We are not 100% sure, but it seems like the change Mike made to use a hostname for isolationKey will break the code inside nsHTTPDownloadEvent::Run() that passes that string to NS_NewURI(). It would be good to add a check for failure there in any case.
With respect to memory ownership and lifetime issues, it is difficult to be certain but Kathy and I think the code is OK. From an auditability point of view, using string classes would make things easier.
A few more comments:
The GUID for nsISocketTransport.idl should be updated.
The second parameter to ReportFailedToProcess() is always an empty string. Why add the parameter?
Inside TransportSecurityInfo, the getter that has this signature is not called:
nsresult GetIsolationKey(char **aIsolationKey);
Can we remove it? Also, it might be good to follow the example of GetHostNameRaw() and rename the other GetIsolationKey() method to GetIsolationKeyRaw().
Here's a new fixup patch that addresses issues mentioned in comment:38 and comment:40. I fixed the code in nsHTTPDownloadEvent::Run by prepending an "https://" scheme to the isolation domain.
I tested this patch and confirmed that most OCSP requests are isolated to the first party domain. However, some OCSP requests go on the No-First-Party circuit, apparently because they are prompted by favicon requests or Tor Browser update requests.
Here's a new fixup patch that addresses issues mentioned in comment:38 and comment:40. I fixed the code in nsHTTPDownloadEvent::Run by prepending an "https://" scheme to the isolation domain.
I tested this patch and confirmed that most OCSP requests are isolated to the first party domain. However, some OCSP requests go on the No-First-Party circuit, apparently because they are prompted by favicon requests or Tor Browser update requests.
This last case should be solved bu #16448 (moved), right? Or is there an additional issue here?
Here's a new fixup patch that addresses issues mentioned in comment:38 and comment:40. I fixed the code in nsHTTPDownloadEvent::Run by prepending an "https://" scheme to the isolation domain.
I tested this patch and confirmed that most OCSP requests are isolated to the first party domain. However, some OCSP requests go on the No-First-Party circuit, apparently because they are prompted by favicon requests or Tor Browser update requests.
This last case should be solved bu #16448 (moved), right?
Yes, in my manual tests, it is solved by #16448 (moved).