Redesign the HS client descriptor cache

changed milestone to %Tor: 0.2.8.x-final in legacy/trac

added component::core tor/tor in Legacy / Trac milestone::Tor: 0.2.8.x-final in Legacy / Trac parent::16381 in Legacy / Trac priority::high in Legacy / Trac resolution::fixed in Legacy / Trac sponsor::R in Legacy / Trac status::closed in Legacy / Trac tor-hs in Legacy / Trac type::defect in Legacy / Trac labels

Trac:
Description: Bug legacy/trac#16381 (moved) vs legacy/trac#14219 (moved) demonstarted an important issue with the current design of the HS client descriptor cache.

In a nutshell, we can't rely on the timestamp to order descriptors because the timestamp is rounded down to the nearest hour thus any change in that time period will never be seen by the client (legacy/trac#16381 (moved)). Furthermore, we can't rely on comparing the descriptor content because we have two replicas with the same content but have different descriptor ID (legacy/trac#14219 (moved)).

One solution proposed by arma (and +1 by special) is to have an intro point failure cache. Every time we encounter this "we keep state about the previous HS desc by keeping a copy of it, and then seeing if the new one is different".

to

Bug legacy/trac#16381 (moved) vs legacy/trac#14219 (moved) demonstarted an important issue with the current design of the HS client descriptor cache.

In a nutshell, we can't rely on the timestamp to order descriptors because the timestamp is rounded down to the nearest hour thus any change in that time period will never be seen by the client (legacy/trac#16381 (moved)). Furthermore, we can't rely on comparing the descriptor content because we have two replicas with the same content but have different descriptor ID (legacy/trac#14219 (moved)).

One solution proposed by arma (and +1 by special) is to redesign the logic to be cleaner, and keep an intro point failure cache. It seems like every time we encounter this "we keep state about the previous hs desc by keeping a copy of it, and then seeing if the new one is different, get it?" logic, somebody finds it confusing. Maybe now is a good time for it to die? :)

Ok I took a stab at it so we have a starting point!

https://people.torproject.org/~dgoulet/volatile/client-hs-desc-cache.txt

I'm unsure how best to proceed thus I put it on people.tpo for now, maybe a pad at some point if the above is just off track :). It just felt a bit too big/much for a ticket and discussion.

Once we start the implementation, we MUST add the cache behavior as a fat comment, I would actually be incline to have seperate files with proper documentation on top for this to stop adding to the massive C files we currently have :).

Trac:
Cc: N/A to asn

Here is an attempt at this new cache. Feedback is very welcome on both the code and cache behavior.

See branch: bug16389_027_01

Trac:
Status: new to needs_review

Suggestions:

I would be more cormfortable if cache_failure_intro_lookup set *entry to NULL on lookup failure. Leaving it unset seems error-prone to me.
use uint8_t* for pointers to bytes, and char* for pointers to text. So, if you're going to pass it as a key for a strmap, have it be char*; if you're going to pass it as a key for a digestmap, have it be uint8_t. And document what format each argument is in in the documentation, how long it should be, etc.
Make a rend_cache_failure_entry_new() and rend_cache_failure_entry_free(), even if they're just wrappers around tor_malloc and tor_free. This will make the structure easier to maintain if you add anything else to it.
As above but for rend_cache_failure_intro_t
In cache_failure_intro_add, maybe only add fail_entry to rend_cache_failure if it's new?
should "unsigned int failure" be an enum?
Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

Bigger issues:

Should "signal NEWNYM" affect this cache? Guessing yes.
Should the entries in this cache ever expire? I think probably yes.
Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

Trac:
Status: needs_review to needs_revision

Replying to nickm:

Suggestions:

All suggestions are great! I will work on those!

should "unsigned int failure" be an enum?

It should! but currently the failures are #define hence the current unsigned int. I had a plan (volatile) to make a patch to change those to an enum eventually.

Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.

Bigger issues:

Should "signal NEWNYM" affect this cache? Guessing yes.

I thought about clearing this cache on a NEWNYM but I'm unsure... Failures are quite linked to the consensus and current "state" of the network that is if IPs aren't usuable, it goes beyond the scope of the cache and circuit state. A NEWNYM will clean the descriptor cache but I think we need to keep the IP failure so if we download again that same old descriptor, we won't initiate anything. I can be convinced that clearing this cache would be desirable like for example if we decide to link the IP expiry time to the descriptor expiry time. (see next question)

Should the entries in this cache ever expire? I think probably yes.

That was also a concern of mine. So this patch has a "natural rotation" for the cache entries meaning that when we download a new descriptor, it removes every IP that are not listed in the descriptor (for a service ID). But if an IP is re-used we can end up not using it even if it's suppose to work with that new descriptor.

So, what would be a wise expiry time? Here are some I can think of:

Remove the service ID entry (meaning all the IP associated with an onion address) when the descriptor expires (rend_cache_clean). Currently a valid period of 24 hours.
When we get a new consensus, we could consider that it indicates a "new era" of connectivity :)
Expire them at each RendPostPeriod default value (what most HS will use?). (1 hour)

We could also expire entries at a different rate based on their failure type. A NACK is much more different than a UNREACHABLE for instance but that's maybe a future enhancement.

In terms of simplicity for now, I'm incline to use 1. especially since the branch in legacy/trac#4862 (moved) makes it that the HS only uploads its descriptor when all IPs have been established. We avoid the race of fetching the descriptor before an IP is ready and thus not using it for 24 hours.

Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

Hrm, I thought it was harmless to remove it because an HSDir doesn't accept an older descriptor but then that does fix the fact that an evil HSDir could simply serve older descriptor non stop and break clients for a long time making them add lots of IPs to their failure cache. Good catch, I'll put it back!

New branch coming soon.

Replying to dgoulet:

Replying to nickm:

Suggestions:

All suggestions are great! I will work on those!

should "unsigned int failure" be an enum?

It should! but currently the failures are #define hence the current unsigned int. I had a plan (volatile) to make a patch to change those to an enum eventually.

Okay. Let's try to get that in with this?

Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.

Hm. I think I meant connection_edge connections from the client?

Bigger issues:

Should "signal NEWNYM" affect this cache? Guessing yes.

I thought about clearing this cache on a NEWNYM but I'm unsure... Failures are quite linked to the consensus and current "state" of the network that is if IPs aren't usuable, it goes beyond the scope of the cache and circuit state. A NEWNYM will clean the descriptor cache but I think we need to keep the IP failure so if we download again that same old descriptor, we won't initiate anything. I can be convinced that clearing this cache would be desirable like for example if we decide to link the IP expiry time to the descriptor expiry time. (see next question)

I think that clearing the cache is probably a good idea; otherwise there's a way to correlate access across a NEWNYM, right?

Should the entries in this cache ever expire? I think probably yes.

That was also a concern of mine. So this patch has a "natural rotation" for the cache entries meaning that when we download a new descriptor, it removes every IP that are not listed in the descriptor (for a service ID). But if an IP is re-used we can end up not using it even if it's suppose to work with that new descriptor.

So, what would be a wise expiry time? Here are some I can think of:

Remove the service ID entry (meaning all the IP associated with an onion address) when the descriptor expires (rend_cache_clean). Currently a valid period of 24 hours.

I think that seems right. When we remove the descriptor entirely, we shouldn't remember the IP statuses in it.

When we get a new consensus, we could consider that it indicates a "new era" of connectivity :)

Expire them at each RendPostPeriod default value (what most HS will use?). (1 hour)

We could also expire entries at a different rate based on their failure type. A NACK is much more different than a UNREACHABLE for instance but that's maybe a future enhancement.

In terms of simplicity for now, I'm incline to use 1. especially since the branch in legacy/trac#4862 (moved) makes it that the HS only uploads its descriptor when all IPs have been established. We avoid the race of fetching the descriptor before an IP is ready and thus not using it for 24 hours.

Removing the timestamp check in rend_cache_store_v2_desc_as_client() makes me worry about replacing a newer descriptor with an older one. Should I be worried about that?

Hrm, I thought it was harmless to remove it because an HSDir doesn't accept an older descriptor but then that does fix the fact that an evil HSDir could simply serve older descriptor non stop and break clients for a long time making them add lots of IPs to their failure cache. Good catch, I'll put it back!

New branch coming soon.

Yaaaay.

Replying to nickm:

Replying to dgoulet:

Replying to nickm:

Suggestions: [snip]

Does the new "goto_err" in rend_cache_store_v2_desc_as_client() correctly make us close all the pending connections?

There shouldn't be any pending connections to those intro points because we are parsing the fetched descriptor and making sure the IP entries in there are in the failure cache or not. After validation, if the intro node list is empty it means we can't use any thus we goto err. At that point, no new connections have been established to those intro points nor are already established.

Hm. I think I meant connection_edge connections from the client?

On error, if the store function fails, the edge connection is closed so that extra new goto err is fine. RCS_BADDESC is sent back.

Ok I fixed everything else in branch: bug16389_027_02.

Trac:
Status: needs_revision to needs_review

Roger's status: I think we need to ignore past intro point failures if they're from more than 5 minutes ago. Only very recent intro point failures should be remembered.

This is especially important if we merge legacy/trac#4862 (moved) too, since we're going to have onion services that go offline for a little while (during which Alice fails to reach it), and then it comes back and re-establishes its intro points, and then Alice tries to visit it again and preemptively gives up because her intro point failure cache says they will all fail

Trac:
Keywords: N/A deleted, TorCoreTeam201507 added

dgoulet: do you agree with Roger's comments? Does that make this needs_revision?

Replying to nickm:

dgoulet: do you agree with Roger's comments? Does that make this needs_revision?

Yup I do agree, moving this back to needs_revision and will start working on a fix right now.

Trac:
Keywords: TorCoreTeam201507 deleted, TorCoreTeam201508 added
Status: needs_review to needs_revision

New branch that expires cache failure entry after 5 minutes. See top two commits for this feature.

Branch: bug16389_027_03

Trac:
Status: needs_revision to needs_review

Seems okay. Merging!

Trac:
Status: needs_review to closed
Resolution: N/A to implemented

Merged, but needs a changes file. I can try to write one later, but it would be cool if dgoulet could give it a try.

Trac:
Resolution: implemented to N/A
Status: closed to reopened

Did everybody just assume I picked the perfect number "5 minutes", or has anybody thought through the time period required here?

Basically, any time interval is suboptimal, in that waiting a long time will result in more instances of unhappy users who can't reach their HS, and waiting a short time will result in duplicate attempts.

It seems like we want a nonce in the hidden service descriptor, and in the cache we tag which descriptor we had when we tried it, and then expire -- no, that wouldn't work either, since the hidden service doesn't need to publish a new version of its descriptor if it's just reusing all its old intro points. So there isn't any explicit signal by the HS that it was gone for a while and has now returned so maybe you want to retry.

In sum: with this patch are we still going to have cases where Alice tries to visit a hidden service, finds it unavailable, then the HS goes back online and tells her to try again, and for five minutes she fails?

Maybe this argues for hidden services putting the nonce into their HS descs anyway, and publishing an updated HS desc every time they lose contact with their intro points, to give clients a way to recognize when the HS has acknowledged that things changed? Yuck.

(added a changes file from dgoulet. Still pondering arma's question on and off)

Trac:
Milestone: Tor: 0.2.7.x-final to Tor: 0.2.8.x-final
Keywords: TorCoreTeam201508 deleted, N/A added
Description: Bug legacy/trac#16381 (moved) vs legacy/trac#14219 (moved) demonstarted an important issue with the current design of the HS client descriptor cache.

In a nutshell, we can't rely on the timestamp to order descriptors because the timestamp is rounded down to the nearest hour thus any change in that time period will never be seen by the client (legacy/trac#16381 (moved)). Furthermore, we can't rely on comparing the descriptor content because we have two replicas with the same content but have different descriptor ID (legacy/trac#14219 (moved)).

One solution proposed by arma (and +1 by special) is to redesign the logic to be cleaner, and keep an intro point failure cache. It seems like every time we encounter this "we keep state about the previous hs desc by keeping a copy of it, and then seeing if the new one is different, get it?" logic, somebody finds it confusing. Maybe now is a good time for it to die? :)

to

We merged a patch here, but see arma's comment below.

Bug legacy/trac#16381 (moved) vs legacy/trac#14219 (moved) demonstarted an important issue with the current design of the HS client descriptor cache.

In a nutshell, we can't rely on the timestamp to order descriptors because the timestamp is rounded down to the nearest hour thus any change in that time period will never be seen by the client (legacy/trac#16381 (moved)). Furthermore, we can't rely on comparing the descriptor content because we have two replicas with the same content but have different descriptor ID (legacy/trac#14219 (moved)).

One solution proposed by arma (and +1 by special) is to redesign the logic to be cleaner, and keep an intro point failure cache. It seems like every time we encounter this "we keep state about the previous hs desc by keeping a copy of it, and then seeing if the new one is different, get it?" logic, somebody finds it confusing. Maybe now is a good time for it to die? :)

Redesign the HS client descriptor cache

Child items ...

Activity