prop224: Onion service rev counters are useless and actually harmful for scalability

changed milestone to %Tor: 0.3.5.x-final in legacy/trac

Trac:
Description: Armadev discovered that hsv3 revision counters are harmful to scalability since if an onion service is hosted by multiple servers (like the fb one), every server should have visibility of the revision counter if they want to publish a descriptor.

We should figure out whether there is an easy way around that, or whether this is actually a big problem for scalable v3s.

Rev counters are there so that HSDirs (and other actors) cannot replay old HS descriptors. However, they are not really needed since now HS descriptors are only replayable for a day (before the blinded key gets refreshed), and also HSDirs could keep a replay cache of the descriptor assigned to a blinded key.

If we decide to rip them off, the way to do it is in two painful steps: a) Remove rev counter checking from HSDirs, and do a replay cache or something. b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

to

Armadev discovered that hsv3 revision counters are harmful to scalability since if an onion service is hosted by multiple servers (like the fb one), every server should have visibility of the revision counter if they want to publish a descriptor.

We should figure out whether there is an easy way around that, or whether this is actually a big problem for scalable v3s. We should also consider how this works out with onionbalance-based designs.

Rev counters are there so that HSDirs (and other actors) cannot replay old HS descriptors. However, they are not really needed since now HS descriptors are only replayable for a day (before the blinded key gets refreshed), and also HSDirs could keep a replay cache of the descriptor assigned to a blinded key.

If we decide to rip them off, the way to do it is in two painful steps: a) Remove rev counter checking from HSDirs, and do a replay cache or something. b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

Trac:
Username: franklin
Cc: dgoulet to dgoulet, franklin

WRT the replay cache idea of step (a) above, we probably do need a replay cache on the HSDirs, because there is a 24 hour window (before we change blinded pubkey) where HSDirs can replace the descriptors on other HSDirs with older versions of the descriptor. We probably want to avoid this and we should use a replay cache for this.

The right way to use a replay cache here is to store the hash of the HSV3 descriptor on the replay cache. We should investigate whether we need to hash the whole descriptor, or the whole descriptor minus the signature (in case the signature is malleable and an attacker can tweak it to bypass replay cache). If we need to do the latter approach, we should add the right data in hs_cache_dir_descriptor_t as part of cache_dir_desc_new().

Implementation plan for step (a) above:

Introduce a global replay_cache_t *hs_cache_replay_cache in hs_cache.c next to hs_cache_v3_dir. We should index entries to this replay cache by blinded key, or maybe add an insertion timestamp so that we know when to clean it up. 2 In cache_store_v3_as_dir, remove the revision counter check, and instead query the replay cache for whether we already have seen this descriptor before. If we have seen this descriptor before we should treat it the same way we treat descriptors with a smaller or equal revision counter right now, that is, reject them and log_info.
We should clean up the replay cache when we are sure that the blinded key for a descriptor is now useless and will never be used by clients again. We should look in rend-spec-v3.txt to make sure when that is; probably at 24 or 48 hours or so.

Replying to asn:

b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

As I understand it, HSDirs learned how to handle v3 descriptors in 0.3.0.1-alpha, as part of legacy/trac#17238 (moved).

According to https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/CoreTorReleases#Listofreleases Tor 0.3.0 is already dead. And importantly, the feature isn't in our long-term-stable 0.2.9.

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

This issue was originally reported in legacy/trac#25124 (closed) as a bug in Stem.

Trac:
Keywords: tor-hs prop224 deleted, tor-hs prop224 034-roadmap-proposed added

Replying to arma:

Replying to asn:

b) In the far future, when all HSDirs have upgraded to (a), rip out the rev counter code from onion services.

As I understand it, HSDirs learned how to handle v3 descriptors in 0.3.0.1-alpha, as part of legacy/trac#17238 (moved).

According to https://trac.torproject.org/projects/tor/wiki/org/teams/NetworkTeam/CoreTorReleases#Listofreleases Tor 0.3.0 is already dead. And importantly, the feature isn't in our long-term-stable 0.2.9.

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

But hang on, do clients require descriptors to have revision counters? If so, we can't rip out revision counters on services for a long time.

But we can make HSDirs and clients ignore them,

Latest plan: According to RFC 8032:

   Some systems assume signatures are not malleable: that is, given a
   valid signature for some message under some key, the attacker can't
   produce another valid signature for the same message and key.

   Ed25519 and Ed448 signatures are not malleable due to the
   verification check that decoded S is smaller than l.  Without this
   check, one can add a multiple of l into a scalar part and still pass
   signature verification, resulting in malleable signatures.

We should check if our ed25519 implementations do the above check. If they do, it should be possible to just replay cache the ED25519_SIG_LEN bytes of our ed25519_signature_t. I plan to look at our implementation this week to see if the aboce check is done, then send an email to Ian Goldberg, and if he agrees that it's legit, proceed with this plan.

Replying to asn:

Latest plan: According to RFC 8032: {{{ Some systems assume signatures are not malleable: that is, given a valid signature for some message under some key, the attacker can't produce another valid signature for the same message and key.

Ed25519 and Ed448 signatures are not malleable due to the verification check that decoded S is smaller than l. Without this check, one can add a multiple of l into a scalar part and still pass signature verification, resulting in malleable signatures. }}}

We should check if our ed25519 implementations do the above check. If they do, it should be possible to just replay cache the ED25519_SIG_LEN bytes of our ed25519_signature_t. I plan to look at our implementation this week to see if the aboce check is done, then send an email to Ian Goldberg, and if he agrees that it's legit, proceed with this plan.

Progressed with plan: https://lists.torproject.org/pipermail/tor-dev/2018-April/013074.html

Trac:
Cc: dgoulet, franklin to dgoulet, franklin, isis

This is already on the roadmap so marking it for 034.

Trac:
Owner: N/A to asn
Keywords: tor-hs prop224 034-roadmap-proposed deleted, tor-hs prop224 034-roadmap-master added
Status: new to assigned
Milestone: Tor: unspecified to Tor: 0.3.4.x-final

Trac:
Parent: N/A to legacy/trac#25955 (moved)

Hey @cypherpunks, we really appreciate you trying to help organise and triage our tickets, but in this case the HS team has a flow that works for them, so please don't set parent tickets on their stuff. Thanks!

Trac:
Parent: legacy/trac#25955 (moved) to N/A

Taking over this.

To summarize, after a discussion with teor and asn, we'll go with a replaycache that stores a hash of the descriptor without the signature. And this would be done after desc signature validation.

Trac:
Owner: asn to dgoulet
Cc: dgoulet, franklin, isis to franklin, isis

Here is a fun fact. We use the revision counter in the computation of the descriptor encryption keys. See spec section HS-DESC-ENCRYPTION-KEYS.

So bottom line, this means that we have to remove it from secret_input computation but only if we can't find the counter in the plaintext data of the descriptor ("revision-counter" SP Integer NL).

Code wise, this isn't very complicated but I thought it would be wise to just throw it out there since it affects our crypto construction.

Replying to teor:

So as long as we get the new functionality into HSDirs before the next long-term-stable, the "far future" will just be a matter of waiting some months for intermediate stable versions to die out.

But hang on, do clients require descriptors to have revision counters?

So here is the fun part of this. Client do look at the revision counter when caching but only to decide if they have a newer one in their cache or not. Thus, a revision counter always to 0 for instance wouldn't affect the client cache.

As for HSDir, they won't accept a descriptor with a revision counter that is lower or equal with the one they have in their cache. Meaning that 034+ services will still need to put the revision counter in their descriptor for a while until <= 032 is phased out. Not putting the revision counter in the descriptor for specific HSDir is not trivial amount of engineering work.

Now, this is where it gets messy. When decoding the plaintext part of a descriptor (where the rev. counter is), we hard require the counter to be in it (notice the T1()):

  T1(str_rev_counter, R3_REVISION_COUNTER, EQ(1), NO_OBJ),

Thus, as long as we have 032 and 033 HSDirs and clients on the network, we can't remove the counter from the descriptor else they won't be able to store/access 034+ services as every descriptor will fail to be decoded.

Thus to summarize, the only thing we can do for now is make HSDir use a replaycache instead of revision counter, make client ignore revision counter and make the revision counter optional in the descriptor when decoding it.

But we MUST make the service put the rev counter regardless, with the current mechanism, in the descriptor for a while else it will break client and HSDir <= 033.

Or a more nuclear option, we bump the descriptor version to 4 which won't have the revision counter which will effectively make 034+ the birth of the onion service v4 :P...... not ideal ;).

Possible branch: ticket25552_034_01.

To be honest, I currently don't see a way for service to stop putting the revision counter without breaking many clients because of comment:15.

Seems like once all HSDir <= 032 are phased out, we can then make the service stop putting it in the descriptor (which will break <= 032 clients...). This means that right now (it is in the branch) we have to either make the client ignore the revision counter in the secret in put construction or always use 0 if no rev counter in the descriptor (which I did for simplicity).

Anyway, see the attempt above.

Trac:
Status: assigned to needs_review
Reviewer: N/A to asn

I'm going to suggest a different strategy:

Make v3 onion services use the descriptor generation timestamp for the revision counter
Backport this change to all tor versions with v3 onion services (0.3.2 and later)

This fix will make v3 onion services scaleable, by allowing multiple services to submit descriptors with a very small probability of revision number collisions. It also retains the property that newer descriptors replace older ones.

We can make a separate decision about replay caches on HSDirs. We can make a separate decision about removing the revision counter entirely. If we decide to keep it, we should check that it's a 64-bit field, so it lasts past 2038.

Trac:
Status: needs_review to needs_revision

Oh, wait, that doesn't work. It leaks the service's clock / clock skew to HSDirs and clients.

We could add noise and round, but suddenly the patch isn't simple any more.

Trac:
Status: needs_revision to needs_review

HSDirs that allow descriptors with no revision counter should declare a new HSDir protocol version. We should make services leave out the revision counter if a consensus parameter is set.

Then our upgrade path is:

Wait until most HSDirs support the new protocol
Stop giving the HSDir flag to relays that don't support the new protocol
Make services uses the new protocol by setting the consensus parameter

Trac:
Status: needs_review to needs_revision

Some discussion happened on IRC with teor, asn and I. Here is an updated branch:

https://github.com/torproject/tor/pull/80 Branch: ticket25552_034_02

Trac:
Status: needs_revision to needs_review

Looks better!

I did another review and pushed some changes to ticket25552_034_02. Changes are in different commits, so if you don't like some of them feel free to not cherry-pick them! I also added a comment on github.

Do you think we can run this on your relay for a few days (perhaps with increased logging) to ensure that we get no replay situations? I'm mainly afraid of the case where the HS cache clears its descriptor cache, and the HS tries to send us the same descriptor, and we reject it even tho we dont have one. I looked at the code and I believe that this cant happen since each HS descriptor is different (because of encryption salt, etc.), but still want to make sure :)

Ok I've squashed your commit so we can have a clean git history! Your github comment doesn't show so I think you forgot to click Submit review ;).

All good stuff. Everything is untouched except the hs_descriptor_is_replay() that I changed so we don't use the hs_descriptor.c namespace and also I identify the function with as_dir because the hs_cache.c file contains many caches. I'll run this on my relay but it will take at the very least 5+ days since takes 96h to get HSDir flag ;).

But I think in the meantime we can merge upstream else we'll be way to short from the closing of the merge window here...

Branch: ticket25552_034_03

https://github.com/torproject/tor/pull/81 Branch: ticket25552_034_03

Trac:
Status: needs_review to merge_ready

Travis says the unit tests are failing here. I confirm:

hs_cache/directory: 
  FAIL src/test/test_hs_cache.c:112: expected log to contain "Possible descriptor replay detected"  Captured logs:

  [directory FAILED]
1/1086 TESTS FAILED. (31 skipped)

I'll also start reviewing the branch.

Trac:
Status: merge_ready to needs_revision

I've reviewed the PR. The biggest issue is related to the use of "\n" signature_str, which I believe should be "\n" signature_str " " instead.

Other issues not on the PR:

How hard is it to DoS this into an OOM condition? Do we need to tie it into the OOM system? And by doing so, do we subject ourselves to replay attacks once again?
On point 1: perhaps the replay cache should be thought of, not as a complete replacement for revision counters, but as an alternative to them when they cannot be used? That is, we could enforce the rule that revision counters are non-decreasing, but allow revision counters to remain equal, and use the replay cache to handle only the "equal counter" case.

That way if we need to rip out the cache because of OOM issues in the future, or if we solve the problem of coordinating revision counters between distributed HS providers, we aren't stuck with the cache forever.

As above, the unit tests need fixing.

Replying to nickm:

I've reviewed the PR. The biggest issue is related to the use of "\n" signature_str, which I believe should be "\n" signature_str " " instead.

Yes. Fixed on the PR with fixup 23a4c09edb2c9d46.

How hard is it to DoS this into an OOM condition? Do we need to tie it into the OOM system? And by doing so, do we subject ourselves to replay attacks once again?

Right so hooking it up to the OOM could affect the replay attack protection. Thus, this cache can potentially grow to infinite. However, entries are in bytes (32 byte hash + some extra byte). For example, one would need to upload this amount of descriptors in a 36 hour period to reach the size:

100k - ~3MB
1M - ~32MB
5M - ~153MB

What if we add a warning when a certain threshold is reached so we can maybe learn if it is happening in the wild at some point?

Usually, when the OOM goes off on HSDir descriptor, we learn about it pretty quickly.

On point 1: perhaps the replay cache should be thought of, not as a complete replacement for revision counters, but as an alternative to them when they cannot be used? That is, we could enforce the rule that revision counters are non-decreasing, but allow revision counters to remain equal, and use the replay cache to handle only the "equal counter" case.

This won't work with the use case we are trying to fix with this patch.

Basically, multiple services competing (for instance what FB does), the last one uploading needs to be the true one winner. If we still support revision counter, the service that restarted the last will always be the winner.

So we need to dump the rev. counter usage both at the HSDir and Client side for caches all together.

That way if we need to rip out the cache because of OOM issues in the future, or if we solve the problem of coordinating revision counters between distributed HS providers, we aren't stuck with the cache forever.

Other problem with revision counter. We can't simply dump them because it is also used in the encryption key computation by service/client...

So we need this transition period were (1) we make all client/hsdir make it optional in the descriptor and set it to 0 which means if not present, client and service uses 0. Then (2) at some point far in time, we can make service stop putting it because we believe all client/hsdir have updated......

Really not ideal situation but the best we came up with.

As above, the unit tests need fixing.

Done in fixup fc45a82b7bceb4ca

Trac:
Status: needs_revision to needs_review

The changes here look good. Let's keep talking about revision counters, though: the DoS OOM opportunity makes me nervous.

There's an alternative approach we've been talking about. For preliminaries, see my torspec branch ope_spec, with some corresponding backend implementation code in ope_hax

Also see bug25552_blinding for an alternative design suggested by Nick, which blinds the time(NULL) timestamp with a hash derived from the ephemeral blinded key as such:

    uint32_t BLINDING_FACTOR = SHA3(ephemeral_blinding_key)[4]
    uint64_t REV_COUNTER = now + BLINDING_FACTOR

IIUC, this offers the same properties as the OPE approach: monotonically increasing rev counter, with no state file needed, and with obfucated local time, but it requires time sync between load balancing nodes. It also seems easier to understand/review than the OPE approach.

Nick what you think?

OK approach from comment:30 is busted after all! We are going with the OPE design from comment:29, however this needs some more work. Pushing to early 035.

Trac:
Milestone: Tor: 0.3.4.x-final to Tor: 0.3.5.x-final
Status: needs_review to new

Oooff! I have a branch that implements the OPE design for HSv3! It seems to work on my onion service for the past week just fine, and tests pass. This was much harder than I thought, because the hs_service unittests are a mess that depend on too many time sources. I fixed them all and tested them with various kinds of local time using a crazy bash script I hacked up.

You can find a fixup for the spec branch at ope_spec in my github torspec repo. It explains why we cannot use "seconds since TP", and why we have to use "seconds since SRV".

You can find the code at my branch bug25552_ope_draft. It seems to work fine and the code is decent, but I labeled it as draft because there might be some trivial code-style improvements that can be done (in particular im thinking the is_current func argument). I also need to make sure that HSes will not burp if they have an obsolete rev-counter left in their state file. Please check it out and if you like it I will make it real proper: https://github.com/torproject/tor/pull/150

Trac:
Status: new to needs_review
Points: 4 to 6+++

Trac:
Reviewer: asn to dgoulet

Left comments on the PR. No show stoppper but a couple comments I would like to see answered or/and addressed. Thanks!

Trac:
Status: needs_review to needs_revision

Replying to dgoulet:

Left comments on the PR. No show stoppper but a couple comments I would like to see answered or/and addressed. Thanks!

Thanks for the review David. I pushed two fixups that should address your comments.

Let me know if you like them and I'm gonna rebase everything into a proper branch.

Trac:
Status: needs_revision to needs_review

Hi. I just wanted to chime in that the bug25552_ope_draft appears to be a working solution to the issues with UPLOAD_REJECTED during repeated descriptor publishing that were first reported in ticket legacy/trac#25124 (closed). Here is the code I'm using to test the fix.

Trac:
Username: maqp

ok so this branch seems good to go although now Github is complaining of merge conflict (because of the huge ongoing refactoring!).

I'm putting this one in merge_ready but @asn if you have a chance to provide a new one rebased on latest master, might help nickm if the conflicts are too complicated.

Trac:
Status: needs_review to merge_ready

Please see branch bug25552_ope in my repo with the rebased branch!

Here is the relevant PR: https://github.com/torproject/tor/pull/185

Does this ticket solve legacy/trac#26549 (moved)?

Replying to teor:

Does this ticket solve legacy/trac#26549 (moved)?

Yes it does. Replied on that ticket.

If it's possible, could you please backport this to 0.3.2 and later?

Trac:
Username: maqp

Replying to maqp:

If it's possible, could you please backport this to 0.3.2 and later?

This is a very large patch, so we might not backport it. What would stop you from upgrading to 0.3.5?

Trac:
Keywords: tor-hs prop224 034-roadmap-master deleted, tor-hs prop224 034-roadmap-master 032-backport-maybe 033-backport-maybe 034-backport-maybe added

Replying to teor:

This is a very large patch, so we might not backport it. What would stop you from upgrading to 0.3.5?

Nothing really! I just noticed the 0.3.5 milestone was set to December and since you mentioned something about backporting the fix in comment:18, I thought I'd ask if that was possible. If it's a lot of work then I'll of course wait until Tails etc. upgrade to 0.3.5.

Trac:
Username: maqp

Trac:
Keywords: tor-hs prop224 034-roadmap-master 032-backport-maybe 033-backport-maybe 034-backport-maybe deleted, tor-hs prop224 035-roadmap-master added

Marking this as 035-must. Adding the regression as suggested by the triage guidelines, meaning that this is a regression behavior from v2 to v3.

Trac:
Keywords: tor-hs prop224 035-roadmap-master deleted, tor-hs prop224 035-roadmap-master 035-must regression added

I left a few small comments on the branch; looking very good. Only one of the comments (saving the OPE key) should take any engineering effort.

Trac:
Status: merge_ready to needs_revision

Did the fixes requested, but the branch fails in chutney. Need some more hammer probably.

prop224: Onion service rev counters are useless and actually harmful for scalability

Child items ...

Activity