prop289: Implement authenticated SENDME

changed milestone to %Tor: 0.4.1.x-final in legacy/trac

Trac:
Owner: N/A to dgoulet
Status: new to assigned

Trac:
Keywords: prop289 035-roadmap-master deleted, 035-roadmap-master, 035-triaged-in-20180711, prop289 added

Trac:
Cc: N/A to dmr

I think all prop#289 tickets are Sponsor V.

Trac:
Sponsor: N/A to SponsorV
Keywords: N/A deleted, prop289-assigned-sponsor-v added

Deferring various feature-y things to 0.3.6. If one of these is actually happening in 0.3.5, please let me know!

Trac:
Milestone: Tor: 0.3.5.x-final to Tor: 0.3.6.x-final

Tor 0.3.6.x has been renamed to 0.4.0.x.

Trac:
Milestone: Tor: 0.3.6.x-final to Tor: 0.4.0.x-final

prop289 won't make it in 040. Feature freeze is in 7 days.

Trac:
Milestone: Tor: 0.4.0.x-final to Tor: 0.4.1.x-final

Let's review these tickets at the next meeting using our 041-proposed process. They're on the roadmap, so the review should focus on ticket size and team capacity (and sponsor expectations).

Trac:
Keywords: N/A deleted, 041-proposed-on-roadmap added

On going work of the entire prop289 implementation is in: ticket26288_041_01. It contains all the child ticket code since they all depend on each other.

Trac:
Points: N/A to 21
Keywords: N/A deleted, network-team-roadmap-2019-Q1Q2 added

I'm waiting on legacy/trac#29536 (moved) to be merged before submitting a PR. I've rebased the latest branch on legacy/trac#29536 (moved) so it can have a full integration: See ticket26288_041_02.

PR will arrive once everything is in master.

There will be one last thing to discuss: how to do this with protover. See the spec change that I propose to the proposal about this: ticket26288_01 (torspec).

Ok! We got all the pieces!

PR: https://github.com/torproject/tor/pull/725 Branch: ticket26288_041_02 Spec: ticket26288_01

So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.

Trac:
Status: assigned to needs_review
Reviewer: N/A to ahf

Replying to dgoulet:

...

So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.

SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:

9.3. "Relay"

   The "relay" protocols are those used to handle CREATE/CREATE2
   cells, and those that handle the various RELAY cell types received
   after a CREATE/CREATE2 cell.  (Except, relay cells used to manage
   introduction and rendezvous points are managed with the "HSIntro"
   and "HSRend" protocols respectively.)

https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n2012

Replying to teor:

Replying to dgoulet:

...

So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.

SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:

Hmmmm the only reason I created a SendMe here is because it would have made Relay a bit messier... but I guess overall that is what we've designed Protover to support anyway.

Edit: After some discussions with Nick on IRC, problem with Relay is that we would need two new versions, that is "Auth. SENDME + tap" and "Auth. SENDME + ntor"... and that means using Relay implies a large matrix of versions every time we change a different cell type.

So the suggestion would be something like FlowCtrl=, have an implicit "1" that is current situation and add the value for 2 that would be for prop289.

We already have a SENDME version (0) that all tor supports. And now we want to support v1. In order for protover to "stop" the use of v0, we then need to introduce two new versions to Relay which right now would be 3 and 4.

~~Then to remove the usage of v0, we would advertise Relay=1-2,4 which should effectively exit() every client that does NOT support v1 that is Relay=4.~~

Doable!

Currently there is a compilation failure on AppVeyor:

                 from ../src/core/or/or.h:32,
                 from ../src/core/or/sendme.c:12:
../src/core/or/sendme.c: In function 'sendme_connection_edge_consider_sending':
../src/core/or/sendme.c:326:27: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'size_t {aka unsigned int}' [-Werror=format=]
     log_debug(log_domain, "Outbuf %lu, queuing stream SENDME.",
                           ^
../src/lib/log/log.h:243:48: note: in definition of macro 'log_debug'
       log_fn_(LOG_DEBUG, domain, __FUNCTION__, args, ##__VA_ARGS__); \
                                                ^~~~

Looks like the test failures on Travis isn't related to this code.

Trac:
Status: needs_review to needs_revision

Trac:
Status: needs_revision to needs_review

The code and spec changes looks good to me. Moving this over to Nick for some additional review.

Trac:
Reviewer: ahf to nickm

Replying to dgoulet:

Replying to teor:

Replying to dgoulet:

...

So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.

SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:

Hmmmm the only reason I created a SendMe here is because it would have made Relay a bit messier... but I guess overall that is what we've designed Protover to support anyway.

Edit: After some discussions with Nick on IRC, problem with Relay is that we would need two new versions, that is "Auth. SENDME + tap" and "Auth. SENDME + ntor"... and that means using Relay implies a large matrix of versions every time we change a different cell type.

So the suggestion would be something like FlowCtrl=, have an implicit "1" that is current situation and add the value for 2 that would be for prop289.

You can do it this way: just like HSIntro etc.

We already have a SENDME version (0) that all tor supports. And now we want to support v1. In order for protover to "stop" the use of v0, we then need to introduce two new versions to Relay which right now would be 3 and 4.

~~Then to remove the usage of v0, we would advertise Relay=1-2,4 which should effectively exit() every client that does NOT support v1 that is Relay=4.~~

I think there's some confusion here.

The current Relay protocols are:

TAP and all the features in Tor 0.2.3 (including whatever flow control was in 0.2.3)
ntor and all the features in Tor 0.2.4.19, including TAP and all the features in 0.2.3 (including whatever flow control was in 0.2.4.19) https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n2012

But I think you're right overall: we don't know if we want to turn off TAP first, or the old flow control first. So a new protocol is a good idea.

I reviewed the protocol parts of this patch:

Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.

Here's what I think we need to do:

always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:

authority certificates,
relay descriptors (for bridge clients),
anything else?

Revise the transition plan, so it includes the protover changes and the consensus parameter changes
Don't remove the section about extensive testing using chutney:

-   We'll want to do a bunch of testing in chutney before flipping the
-   switches in the real network: I've long suspected we still have bugs
-   in our sendme timing, and this proposal might expose some of them.

Do the chutney tests now, and do them again when we want to implement each phase on the public network

The spec and the code are also out of sync: the spec talks about FlowCtrl, but the code doesn't have FlowCtrl.

Here are the changes I think we need to make:

Add FlowCtrl=1 to the protocols advertised by relays in C
Add FlowCtrl=1 to the protocols advertised by relays in Rust
Clarify "FlowCtrl" in the spec:

   Tor clients and relays that don't support this protover version from the
   consensus "required-client-protocols" or "required-relay-protocols" lines
   will exit and thus not try to join the network. Here is the proposed value:

      "FlowCtrl"

      Describes the flow control protocol at the circuit and stream level.
      If there is no FlowCtrl protocol version, tor supports the unauthenticated
      flow control features from its supported Relay protocols.

Trac:
Status: needs_review to needs_revision

Replying to teor:

I reviewed the protocol parts of this patch:

Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.

This is what the new protover version is all about. We would flip FlowCtrl=1 before the consensus param sendme_accept_min_version=1 is set. This would effectively exit() every clients/relays that can't support v1.

Then we can enforce only accepting v1 through the consensus (if that param is needed at all).

Here's what I think we need to do:

always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:

authority certificates,

relay descriptors (for bridge clients),

anything else?

I'm quite reluctant to keep legacy SENDMEs even when the protover and/or the consensus param is flipped. The whole point of that protover is that we don't have to deal with clients not supporting v1. And we can NOT flip the "min accept" param before that protover.

Don't remove the section about extensive testing using chutney: {{{

We'll want to do a bunch of testing in chutney before flipping the

switches in the real network: I've long suspected we still have bugs

in our sendme timing, and this proposal might expose some of them. }}}

Do the chutney tests now, and do them again when we want to implement each phase on the public network

I don't see the point of that in a proposal to be honest. Chutney tests have been done extensively to develop that branch so I'm not sure what this (4) is about here?...

Any switch we flip in the consensus, especially something like this, needs to be tested a lot. Not doing so is a bit reckless on our part and I doubt having it in the proposal saying "we need to test this" is useful at all...

What I recommend we do is actually describe the "ordering" of all the pieces in the transition plan. And then document what we should NOT do so in 10 years, we have a reminder of what to do (because I don't see Phase 3 being done until many years!).

The spec and the code are also out of sync: the spec talks about FlowCtrl, but the code doesn't have FlowCtrl.

Yes, that is what the original comment mean on my part is that I didn't do this yet because I wasn't sure it was the right approach with the SendMe.

Replying to dgoulet:

Replying to teor:

I reviewed the protocol parts of this patch:

Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.

This is what the new protover version is all about. We would flip FlowCtrl=1 before the consensus param sendme_accept_min_version=1 is set. This would effectively exit() every clients/relays that can't support v1.

You're right, old clients and relays that download a consensus while FlowCtrl=1 is required, and while v1 sendmes are accepted, will cache that consensus. Then they will refuse to connect to the network as long as they have that cached consensus.

But when we start rejecting v1 sendmes, old clients will fail to download a consensus. (Most old relays will still be able to fetch consensuses, because they use DirPorts.) These old clients may be new installs of an old version, clients or relays which clear their data directories, or clients that are only launched occasionally.

How will we handle them? We'll shut down most old clients that are already installed. Is that enough? How much time do we need between setting the protocol version requirement, and rejecting v1 sendmes?

What will an old client do if we reject its sendmes? How many old clients will it take to bring down the network? Do we need to keep testing old clients without a cached consensus, but with v1 sendmes rejected?

We need to talk about these issues in the proposal.

Then we can enforce only accepting v1 through the consensus (if that param is needed at all).

Here's what I think we need to do:

always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:

authority certificates,

relay descriptors (for bridge clients),

anything else?

I'm quite reluctant to keep legacy SENDMEs even when the protover and/or the consensus param is flipped. The whole point of that protover is that we don't have to deal with clients not supporting v1. And we can NOT flip the "min accept" param before that protover.

I agree: having to keep legacy code is a really bad outcome.

Don't remove the section about extensive testing using chutney: {{{

We'll want to do a bunch of testing in chutney before flipping the

switches in the real network: I've long suspected we still have bugs

in our sendme timing, and this proposal might expose some of them. }}}

Do the chutney tests now, and do them again when we want to implement each phase on the public network

I don't see the point of that in a proposal to be honest. Chutney tests have been done extensively to develop that branch so I'm not sure what this (4) is about here?...

Future Tor changes can break our transition plans.

Any switch we flip in the consensus, especially something like this, needs to be tested a lot. Not doing so is a bit reckless on our part and I doubt having it in the proposal saying "we need to test this" is useful at all...

What I recommend we do is actually describe the "ordering" of all the pieces in the transition plan. And then document what we should NOT do so in 10 years, we have a reminder of what to do (because I don't see Phase 3 being done until many years!).

You're right, writing it down doesn't help that much.

Here's one way to regularly test future transition plans:

We write a script that runs chutney tests that test each part of the transition plan, for old and new clients, and clients with and without cached consensuses
We set up a repository on Travis with a cron job that runs once a week, to make sure that future changes don't break the transition plan

Are there other ways to make sure we don't break our future transition plans?

Replying to teor:

We need to talk about these issues in the proposal.

Indeed. I'll make the changes as soon as I can to the proposal.

Are there other ways to make sure we don't break our future transition plans?

This current issue reminds me of the tap vs ntor problem. Once we stop using tap, what will old client do? I recall analysis being done by nickm there and I'm expecting a very similar conclusion(s) about clients who just can't use the network.

Trac:
Keywords: N/A deleted, tor-spec added

(Only spec change, no code for now)

I've force pushed quite a big fix on the last commit. I recommend a complete re-read of section 4 (Deployment).

I've expanded it considering teor's comment and a discussion I had with Nick this morning on IRC. Basically, keeping v0 support on directory one-hop circuit could be desirable as a buffer once v0 is not accepted anymore.

I've also added a Timeline section to try to give us a better idea of how it looks.

Trac:
Status: needs_revision to needs_review

wow, this is a big one. I've reviewed the first N commits here. I'll do the next N commits soon.

Woo, I've finished reviewing. Looks nice! I've made some suggestions on the branch.

In addition to the review comments, here is something I'd like to see: I'd like to see a unit test that makes sure that the calculated SENDME values match SENDME values that are calculated by a different process, like a python script or something. This will help us make sure that we're calculating the right function over the correct portions of the correct cells. (I didn't see a test like that, but I could have missed it.) What do you think?

Also, this branch needs a changes file.

Trac:
Status: needs_review to needs_revision

Two more requests:

travis seems to be failing on the streamwrap unit test. That's probably related to changes here.
there's no coverage information; I'd like to know how the coverage is on the tests.

See legacy/trac#29668 (moved) for a branch to make the tests pass here.

Replying to nickm:

Woo, I've finished reviewing. Looks nice! I've made some suggestions on the branch.

All pushed now.

In addition to the review comments, here is something I'd like to see: I'd like to see a unit test that makes sure that the calculated SENDME values match SENDME values that are calculated by a different process, like a python script or something. This will help us make sure that we're calculating the right function over the correct portions of the correct cells. (I didn't see a test like that, but I could have missed it.) What do you think?

So essentially, you would like a test that makes sure the digest we compute and match is valid outside of Tor code? That is basically the cell rolling digest validation?

Also, this branch needs a changes file.

I'll add it once the protover commit is done. We seems to have reached a consensus with FlowCtrl so that is what I'll go for.

travis seems to be failing on the streamwrap unit test. That's probably related to changes here.

As discussed on IRC, legacy/trac#29668 (moved) should be merged upstream instead of within this branch.

there's no coverage information; I'd like to know how the coverage is on the tests.

I'll get you that once I have the protover commit.

Trac:
Status: needs_revision to needs_review

Looks good! There are a few more changes I think we should do, mostly around encapsulation and efficiency.

Trac:
Status: needs_review to needs_revision

(I've commented on the PR)

Replying to nickm:

(I've commented on the PR)

Responded. Some more commits were pushed there changing many things. My integration tests still pass successfully after this.

Trac:
Status: needs_revision to needs_review

As a reminder:

PR: https://github.com/torproject/tor/pull/725 Branch: ticket26288_041_02 Spec: ticket26288_01

Okay! I have some tiny requests about function names, and I still think that code outside relay_crypto shouldn't access its fields.

Trac:
Status: needs_review to needs_revision

Replying to nickm:

Okay! I have some tiny requests about function names, and I still think that code outside relay_crypto shouldn't access its fields.

All fixed!

Trac:
Status: needs_revision to needs_review

lgtm now; let's squash & rebase, then re-run all the tests we can think of :)

Replying to nickm:

lgtm now; let's squash & rebase, then re-run all the tests we can think of :)

Here we go! New PR based on latest master (206d28ff152f2df5). Branch has been squashed also thus free of fixups:

PR: https://github.com/torproject/tor/pull/986 Branch: ticket26288_041_03

does this mean, we can no longer cheat bandwidth tokens by spamming

SendMe

's ?

cypherpunks: for more info proposal 289.

Dgoulet: this looks good; before we merge, can you re-verify your chutney tests?

Integration tests went well! Now, this lacks two things:

FlowCtrl protover value.
The one-hop circuit to a directory cache (download consensus) and its consensus param.

nickm and I discussed it on IRC and we'll do that once this is merged so we don't pile up more things on this already big/non-trivial branch. However, we will not release both features apart, they need to all go in the same release. I'll open these tickets and made them child of this parent ticket.

Which means that once this is merged, keep this parent ticket open until we merge the children.

Final PR: https://github.com/torproject/tor/pull/986 Branch: ticket26288_041_03 Spec: ticket26288_01 <-- be careful, has a fixup commit.

Once everything is good, we'll need a tor-spec.txt (or maybe a new sendme-spec.txt?) patch to integrate the proposal officially into the spec.

Trac:
Keywords: tor-spec deleted, tor-spec asn-merge added
Status: needs_review to merge_ready