Confidential: TROVE-2021-009: Improved DNS cache oracle

Hi,

This is a new confidential ticket regarding the improved DNS cache oracle, to be presented (hopefully, conditionally accepted at time of writing) at USENIX Security 2023 in the paper "Timeless Timing Attacks and Preload Defenses in Tor’s DNS Cache".

Below is the original report, lightly reformatted.

A timeless side-channel in tor's DNS cache

Exit-relays currently maintain a DNS cache for all domains directly (with RESOLVE) or indirectly (with CONNECT) resolved by the exit. Domains are cached for five or 60 minutes, based on their clipped TTL. All circuits at an exit share the same DNS cache.

One known way of determining if a domain is cached or not is to use time: a domain is considered cached or not depending on if the exit responds "fast" or "slow" to a request to resolve it. An attacker that learns if a domain is cached or not learns about destinations of earlier exit-traffic at the exit. Such information can be a privacy-harm, e.g., see the work on "Website Fingerprinting with Website Oracles" where the DNS cache is used to confirm traffic classification by constructing a so-called website oracle [1]. In general, it is more harmful if an attacker can confirm a website classification to an unpopular website on the Tor network (because, per definition, learning that an unpopular website was visited within some time frame is more information than learning that a popular website, like say YouTube, was visited within that same time frame).

While prior attacks have used timing, we here report a timeless side channel attack in tor's DNS cache. This makes the attack much more robust than if based on time. We have made in the order of 10^7 queries (only for our own domain) in the network with zero false positives. Not having any false positives is consistent with our understanding of the attack, see description below.

The attack

The attack is relatively straightforward:

Create a one-hop circuit to an exit.

Send a RESOLVE cell for an arbitrary domain name.

Send two RESOLVE cells in the same TLS record. The first RESOLVE cell is for a domain name that we want to check if it is (un)cached. The second RESOLVE cell is for a domain name that we know is cached. I.e., use domain name from step 2.

Observe the order in which the two RESOLVED cells are returned. The first RESOLVE query was uncached if the relative cell order changed, otherwise cached.

The trick here follows from tor's handling of incoming RESOLVE cells. First of all, the two cells will be processed immediately after one another without any noise because they arrived as one unit. Secondly, responses to cached RESOLVE queries can be and are queued for sending immediately. Third, an uncached RESOLVE query has no response and so is scheduled with an event callback. This is where the change in order happens if the first RESOLVE cell is uncached. In contrast, the change in order does not happen if both RESOLVE cells are cached.

Based on suggestions from Roger, we focused above on using the TLS record to force cells to arrive at an exit at the same time in a given order. Likely it's just as feasible to pack two small TLS records (one cell each) in one TCP packet.

This setup can be run in parallel to create a timeless website oracle ("tlwo") for the entire Tor network without false positives. It works about as well with a multi-hop circuit: our two cells often get propagated in the same TLS records.

See the attached tor-0.4.6.6--tlwo.patch for additional details on how to reproduce the attack. You will need to build a modified tor proxy and optionally run carml.

The improved attack

At a first glance the attacker can only learn whether a domain name was uncached or cached within the past five or 60 minutes (depending on clipped TTL). It is possible to infer the exact time that a domain was cached as follows:

Determine the expected clip value. For example, many domains are always clipped to five minutes because of how authoritative DNS servers set TTLs.

Repeatedly perform step 3 from the above attack, until the first RESOLVE request is no longer cached. Note down the time spent doing this as t seconds.

If the expected clip was five minutes, the entry was cached 300-t seconds ago. This means that the leaked information is finer-grained than one might expect and not particularly time-sensitive: an attacker has (at least) five minutes to query exits in the network.

Defenses

Our take on a defense is based on the assumption that the underlying issue leading to the timeless side-channel is unfixable / undesirable to sufficiently address: packing several cells into the same TLS record / TCP packet is a must for performance, adding latency to every DNS query is prohibitively expensive, and an attacker can always pretend being one or more relays providing cells packed in such a way that the attacker can infer if a domain was cached or not (if anything because of difference in time).

A short-term mitigation

One relatively easy mitigation is to randomize the TTL in clip_dns_ttl(). This would randomize the expiration time of the entry in the cache (see set_expiry()), which would make the cached/uncached state of a domain be within a time frame determined by the distribution that the randomness is sampled from. The goal is to at least get away from exact-second precision of the attack as-is today. Unfortunately, a website visit typically consists of multiple domains resolved at the same time the website is visited (with TTLs clipped long or short), so the randomness will cancel out to a degree. Regardless, this is something relatively easy to implemented with few downsides.

When communicating to the wider community about the short-term mitigation, one could also emphasize the value of periodically injecting visits to particularly sensitive website in the network: "regularly check that the websites you care deeply about are reachable over Tor to add noise into the network's caches".

Long-term defense

After some thinking for a long-term defense, we landed on focusing on the purpose of the cache at exits in the first place. We consider only web-traffic here to explain the defense. Extending the defense to consider essential non-web domains is straightforward (e.g., base on the Cisco Umbrella list instead).

The main purpose of the DNS cache is performance, with the added bonus of leaking less to DNS resolvers on cache hits. The DNS cache is only useful if there's a cache hit. In general, there are two cases when we can expect cache hits: (1) subsequent or concurrent streams connecting to the same domain in the same circuit, and (2) someone else visited a website with the same (or subset of) domains at the same exit within the time that domains are cached.

For (1), we believe that the DNS cache could be per circuit: add circuit ID to each entry in the DNS cache, and only use cached entries for streams that were connected via the same circuit that injected the entry into the cache. These per circuit entries should be removed as soon as the circuit is closed. This leaves inter-circuit resolution performance identical to today. It also saves memory by not keeping entries around for longer than necessary, and more importantly, removes the privacy harm caused by caching rarely visited domains.

For (2), we believe that the key is to determine which domains are likely to be useful to cache. Our initial analysis shows that there are surprisingly few domains that we can expect to be useful to cache (somewhere a bit more than top 100 most popular websites should be enough). This is because of the long-tail nature of website popularity (e.g., Alexa or Tranco lists), the size and usage of the Tor-network, TTLs used by popular websites today (Task% have TTLs below 300 seconds), and that most domains are unique to individual websites (~85% of domains on Tranco top 1k). Basically, for a domain, the question to cache or not boils down to how likely it is that another circuit will request the same domain (=website the domain is on) within its TTL (=five minutes).

We are in luck here, because domains that are popular are those that are commonly visited in the network, so the privacy harm of inferring that they're cached is small (because we expected them to be cached somewhere already). On the flip side, rarely visited domains are those that may cause the most harm, but they're useless to cache, so we shouldn't.

In gist, we think tor's DNS cache should only cache entries per-circuit and maintain an allowlist of domains that may be globally cached. At the cost of more complexity, this should result in negligible resolver performance impact on the network, remove a source of privacy harms, and reduce memory usage at exits.

For the allowlist, we have submitted in parallel a research safety board request on network-wide measurements we could do to gain confidence in such a list. With some luck, we think it's possible to derive it from Alexa/Tranco/Cisco Umbrella lists ~monthly and still get most of the performance gain of the DNS cache.

Extra: also for confirmation attacks?

Recall that our attack also works with multi-hop circuits. If the way in which cells are packed into TLS records propagate somewhat intact through the Tor network, it would be trivial for an exit to introduce patterns that a guard could use to confirm where a circuit exited. For example, imagine an exit relay that always send one cell per TLS record. A guard relay could look for that.

We tried a few variations of the above in Chutney - both sending fewer and more cells per TLS record when compared to a normal Tor relay. As long as an unmodified middle relay processes these cells in between, the introduced cell patterns are destroyed. For example, seven cells that arrived in separate TLS records get merged into a single TLS record by the middle relay. This is the reason why our attack still works across multiple hops: two cells come in, and they fit in a single TLS record and so are sent as such. Other channel cells may fill up a TLS record though, in which case our two cells could potentially be packed in two intermediate TLS records (thereby risking a false positive).

We were able to preserve the introduced patterns if packet timings are also changed. This is a different unrelated attack vector, however [2].

References

Pulls and Dahlberg, Website Fingerprinting with Website Oracles, PETS 2020

Luo et al., Exposing invisible timing-based traffic watermarks with BACKLIT, CSAC 2011

Best regards, Rasmus Dahlberg and Tobias Pulls

Edited Sep 08, 2022 by micah