Let's review these tickets at the next meeting using our 041-proposed process.
They're on the roadmap, so the review should focus on ticket size and team capacity (and sponsor expectations).
On going work of the entire prop289 implementation is in: ticket26288_041_01. It contains all the child ticket code since they all depend on each other.
There will be one last thing to discuss: how to do this with protover. See the spec change that I propose to the proposal about this: ticket26288_01 (torspec).
So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.
Trac: Status: assigned to needs_review Reviewer: N/Ato ahf
So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.
SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:
9.3. "Relay" The "relay" protocols are those used to handle CREATE/CREATE2 cells, and those that handle the various RELAY cell types received after a CREATE/CREATE2 cell. (Except, relay cells used to manage introduction and rendezvous points are managed with the "HSIntro" and "HSRend" protocols respectively.)
So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.
SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:
Hmmmm the only reason I created a SendMe here is because it would have made Relay a bit messier... but I guess overall that is what we've designed Protover to support anyway.
Edit: After some discussions with Nick on IRC, problem with Relay is that we would need two new versions, that is "Auth. SENDME + tap" and "Auth. SENDME + ntor"... and that means using Relay implies a large matrix of versions every time we change a different cell type.
So the suggestion would be something like FlowCtrl=, have an implicit "1" that is current situation and add the value for 2 that would be for prop289.
We already have a SENDME version (0) that all tor supports. And now we want to support v1. In order for protover to "stop" the use of v0, we then need to introduce two new versions to Relay which right now would be 3 and 4.
Then to remove the usage of v0, we would advertise Relay=1-2,4 which should effectively exit() every client that does NOT support v1 that is Relay=4.
Currently there is a compilation failure on AppVeyor:
from ../src/core/or/or.h:32, from ../src/core/or/sendme.c:12:../src/core/or/sendme.c: In function 'sendme_connection_edge_consider_sending':../src/core/or/sendme.c:326:27: error: format '%lu' expects argument of type 'long unsigned int', but argument 5 has type 'size_t {aka unsigned int}' [-Werror=format=] log_debug(log_domain, "Outbuf %lu, queuing stream SENDME.", ^../src/lib/log/log.h:243:48: note: in definition of macro 'log_debug' log_fn_(LOG_DEBUG, domain, __FUNCTION__, args, ##__VA_ARGS__); \ ^~~~
Looks like the test failures on Travis isn't related to this code.
So a piece missing is the integration with protover. I'm not entirely sure how to proceed code wise because what I've tried with SendMe=1 and it was not working. Basically, what I need is a confirmation that what is proposed makes sense and is doable that way. If so, I'll push the commit that implements this and will ask nickm to hunt down why it is failing.
SENDMEs are part of circuits and streams, so we could increment the Relay protocol version:
Hmmmm the only reason I created a SendMe here is because it would have made Relay a bit messier... but I guess overall that is what we've designed Protover to support anyway.
Edit: After some discussions with Nick on IRC, problem with Relay is that we would need two new versions, that is "Auth. SENDME + tap" and "Auth. SENDME + ntor"... and that means using Relay implies a large matrix of versions every time we change a different cell type.
So the suggestion would be something like FlowCtrl=, have an implicit "1" that is current situation and add the value for 2 that would be for prop289.
You can do it this way: just like HSIntro etc.
We already have a SENDME version (0) that all tor supports. And now we want to support v1. In order for protover to "stop" the use of v0, we then need to introduce two new versions to Relay which right now would be 3 and 4.
Then to remove the usage of v0, we would advertise Relay=1-2,4 which should effectively exit() every client that does NOT support v1 that is Relay=4.
I think there's some confusion here.
The current Relay protocols are:
TAP and all the features in Tor 0.2.3 (including whatever flow control was in 0.2.3)
Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.
Here's what I think we need to do:
always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:
authority certificates,
relay descriptors (for bridge clients),
anything else?
Revise the transition plan, so it includes the protover changes and the consensus parameter changes
Don't remove the section about extensive testing using chutney:
- We'll want to do a bunch of testing in chutney before flipping the- switches in the real network: I've long suspected we still have bugs- in our sendme timing, and this proposal might expose some of them.
Do the chutney tests now, and do them again when we want to implement each phase on the public network
The spec and the code are also out of sync: the spec talks about FlowCtrl, but the code doesn't have FlowCtrl.
Here are the changes I think we need to make:
Add FlowCtrl=1 to the protocols advertised by relays in C
Add FlowCtrl=1 to the protocols advertised by relays in Rust
Clarify "FlowCtrl" in the spec:
Tor clients and relays that don't support this protover version from the consensus "required-client-protocols" or "required-relay-protocols" lines will exit and thus not try to join the network. Here is the proposed value: "FlowCtrl" Describes the flow control protocol at the circuit and stream level. If there is no FlowCtrl protocol version, tor supports the unauthenticated flow control features from its supported Relay protocols.
Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.
This is what the new protover version is all about. We would flip FlowCtrl=1before the consensus param sendme_accept_min_version=1 is set. This would effectively exit() every clients/relays that can't support v1.
Then we can enforce only accepting v1 through the consensus (if that param is needed at all).
Here's what I think we need to do:
always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:
authority certificates,
relay descriptors (for bridge clients),
anything else?
I'm quite reluctant to keep legacy SENDMEs even when the protover and/or the consensus param is flipped. The whole point of that protover is that we don't have to deal with clients not supporting v1. And we can NOT flip the "min accept" param before that protover.
Don't remove the section about extensive testing using chutney:
{{{
We'll want to do a bunch of testing in chutney before flipping the
switches in the real network: I've long suspected we still have bugs
in our sendme timing, and this proposal might expose some of them.
}}}
Do the chutney tests now, and do them again when we want to implement each phase on the public network
I don't see the point of that in a proposal to be honest. Chutney tests have been done extensively to develop that branch so I'm not sure what this (4) is about here?...
Any switch we flip in the consensus, especially something like this, needs to be tested a lot. Not doing so is a bit reckless on our part and I doubt having it in the proposal saying "we need to test this" is useful at all...
What I recommend we do is actually describe the "ordering" of all the pieces in the transition plan. And then document what we should NOT do so in 10 years, we have a reminder of what to do (because I don't see Phase 3 being done until many years!).
The spec and the code are also out of sync: the spec talks about FlowCtrl, but the code doesn't have FlowCtrl.
Yes, that is what the original comment mean on my part is that I didn't do this yet because I wasn't sure it was the right approach with the SendMe.
Phase 3 of the transition plan requires old clients and relays to download a consensus so they learn that they should stop trying to connect to the network. But since 0.2.8, clients (and censored relays that can't access any DirPorts) will try to use the ORPort to download a consensus. But ORPort circuits from legacy clients will fail during phase 3.
This is what the new protover version is all about. We would flip FlowCtrl=1before the consensus param sendme_accept_min_version=1 is set. This would effectively exit() every clients/relays that can't support v1.
You're right, old clients and relays that download a consensus while FlowCtrl=1 is required, and while v1 sendmes are accepted, will cache that consensus. Then they will refuse to connect to the network as long as they have that cached consensus.
But when we start rejecting v1 sendmes, old clients will fail to download a consensus. (Most old relays will still be able to fetch consensuses, because they use DirPorts.) These old clients may be new installs of an old version, clients or relays which clear their data directories, or clients that are only launched occasionally.
How will we handle them?
We'll shut down most old clients that are already installed. Is that enough?
How much time do we need between setting the protocol version requirement, and rejecting v1 sendmes?
What will an old client do if we reject its sendmes?
How many old clients will it take to bring down the network?
Do we need to keep testing old clients without a cached consensus, but with v1 sendmes rejected?
We need to talk about these issues in the proposal.
Then we can enforce only accepting v1 through the consensus (if that param is needed at all).
Here's what I think we need to do:
always allow legacy sendmes for BEGINDIR for the consensus, and everything that is required to validate a consensus:
authority certificates,
relay descriptors (for bridge clients),
anything else?
I'm quite reluctant to keep legacy SENDMEs even when the protover and/or the consensus param is flipped. The whole point of that protover is that we don't have to deal with clients not supporting v1. And we can NOT flip the "min accept" param before that protover.
I agree: having to keep legacy code is a really bad outcome.
Don't remove the section about extensive testing using chutney:
{{{
We'll want to do a bunch of testing in chutney before flipping the
switches in the real network: I've long suspected we still have bugs
in our sendme timing, and this proposal might expose some of them.
}}}
Do the chutney tests now, and do them again when we want to implement each phase on the public network
I don't see the point of that in a proposal to be honest. Chutney tests have been done extensively to develop that branch so I'm not sure what this (4) is about here?...
Future Tor changes can break our transition plans.
Any switch we flip in the consensus, especially something like this, needs to be tested a lot. Not doing so is a bit reckless on our part and I doubt having it in the proposal saying "we need to test this" is useful at all...
What I recommend we do is actually describe the "ordering" of all the pieces in the transition plan. And then document what we should NOT do so in 10 years, we have a reminder of what to do (because I don't see Phase 3 being done until many years!).
You're right, writing it down doesn't help that much.
Here's one way to regularly test future transition plans:
We write a script that runs chutney tests that test each part of the transition plan, for old and new clients, and clients with and without cached consensuses
We set up a repository on Travis with a cron job that runs once a week, to make sure that future changes don't break the transition plan
Are there other ways to make sure we don't break our future transition plans?
We need to talk about these issues in the proposal.
Indeed. I'll make the changes as soon as I can to the proposal.
Are there other ways to make sure we don't break our future transition plans?
This current issue reminds me of the tap vs ntor problem. Once we stop using tap, what will old client do? I recall analysis being done by nickm there and I'm expecting a very similar conclusion(s) about clients who just can't use the network.
I've force pushed quite a big fix on the last commit. I recommend a complete re-read of section 4 (Deployment).
I've expanded it considering teor's comment and a discussion I had with Nick this morning on IRC. Basically, keeping v0 support on directory one-hop circuit could be desirable as a buffer once v0 is not accepted anymore.
I've also added a Timeline section to try to give us a better idea of how it looks.
Woo, I've finished reviewing. Looks nice! I've made some suggestions on the branch.
In addition to the review comments, here is something I'd like to see: I'd like to see a unit test that makes sure that the calculated SENDME values match SENDME values that are calculated by a different process, like a python script or something. This will help us make sure that we're calculating the right function over the correct portions of the correct cells. (I didn't see a test like that, but I could have missed it.) What do you think?
Woo, I've finished reviewing. Looks nice! I've made some suggestions on the branch.
All pushed now.
In addition to the review comments, here is something I'd like to see: I'd like to see a unit test that makes sure that the calculated SENDME values match SENDME values that are calculated by a different process, like a python script or something. This will help us make sure that we're calculating the right function over the correct portions of the correct cells. (I didn't see a test like that, but I could have missed it.) What do you think?
So essentially, you would like a test that makes sure the digest we compute and match is valid outside of Tor code? That is basically the cell rolling digest validation?
Also, this branch needs a changes file.
I'll add it once the protover commit is done. We seems to have reached a consensus with FlowCtrl so that is what I'll go for.
travis seems to be failing on the streamwrap unit test. That's probably related to changes here.
As discussed on IRC, legacy/trac#29668 (moved) should be merged upstream instead of within this branch.
there's no coverage information; I'd like to know how the coverage is on the tests.
I'll get you that once I have the protover commit.
Integration tests went well! Now, this lacks two things:
FlowCtrl protover value.
The one-hop circuit to a directory cache (download consensus) and its consensus param.
nickm and I discussed it on IRC and we'll do that once this is merged so we don't pile up more things on this already big/non-trivial branch. However, we will not release both features apart, they need to all go in the same release. I'll open these tickets and made them child of this parent ticket.
Which means that once this is merged, keep this parent ticket open until we merge the children.