Actually, here are some observations from me related to #40243 (closed):
Snowflake is currently using network resources in such a suboptimal way that I think it would make sense to also consider a protocol-level change to how kcp interacts with webrtc before considering adding forward error correction. This would take the form of enabling webrtc's unreliable mode and making the changes necessary to get it to work.
Right now, kcp packets are sent over the webrtc data channel in reliable mode, which delivers all packets in order and retransmits any lost message repeatedly. However, kcp also retransmits its packets itself, and as a result all those retransmitted packets get queued somewhere, such as in webrtc's buffer.
This means lost packets end up being retransmitted more than once, in different protocols, and retransmissions and timeouts get compounded. More retransmissions result in more lost packets and yet more retransmissions, eventually leading to connection meltdown.
Back pressure like the one introduced in !144 (merged) only moves the problem, and blocks kcp's send in an unexpected way: kcp doesn't expect send to block, since it usually runs over udp.
Here is my proposal for addressing this issue by letting KCP deal with packet loss instead of letting webrtc do it:
Protocol Negotiation
Since there will be two versions of the protocol (the existing protocol and the proposed protocol), the proxy and client will need a way to find a common protocol to communicate with. This will be done with an ALPN-style negotiation over the broker's signaling channel.
The client and proxy both send their supported protocols to the broker, and the broker forwards each supported-protocol list to the other side without processing or modification. The client and proxy will communicate over the first protocol in the client's preference list that the proxy supports. If the client or server did not send its preferred protocols, tcp-like will be assumed.
Protocol Design
The client and proxy will turn on unordered, non-retransmitting mode if the udp-like mode has been selected.
In order to facilitate the transfer of the turbotunnel ID and other metadata, the following protocol will be used for communication between client and proxy:
The first byte of each message will be used to indicate the message type:
0xff: Message with clientID header
0xfe: Message without header
The client will send type-0xff messages (with the clientID header) until it receives a type-0xfe message, and then it will switch to sending type-0xfe messages. The proxy will always send type-0xfe messages (and will only send its first message after receiving a message from the client).
The proxy forwards messages to the server without modification or processing, and indicates the protocol kind in the websocket URL. (It relies on websocket to preserve message boundaries.)
For the client and server, this protocol replaces the clientID header and the encapsulation layer.
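As a rough illustration (not final code), the client-side framing could look something like the following in Go. The 8-byte Client ID size and the helper names are assumptions of this sketch, not part of the spec above.

```go
package framing

// Proposed message type tags (first byte of each message).
const (
	msgWithClientID    = 0xff // carries the clientID header
	msgWithoutClientID = 0xfe // payload only, no header
)

// frameWithClientID builds a 0xff message: tag + 8-byte Client ID + payload.
func frameWithClientID(clientID [8]byte, payload []byte) []byte {
	msg := make([]byte, 0, 1+len(clientID)+len(payload))
	msg = append(msg, msgWithClientID)
	msg = append(msg, clientID[:]...)
	return append(msg, payload...)
}

// frameWithoutClientID builds a 0xfe message: tag + payload.
func frameWithoutClientID(payload []byte) []byte {
	return append([]byte{msgWithoutClientID}, payload...)
}
```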
Client configuration
A new bridge line configuration option, mod, will be added to the client configuration. Its valid values are strings representing a protocol preference; each character represents one protocol and will be expanded during processing. u represents udp-like and t represents tcp-like. The default is ut.
Evaluation
A simulation environment built using Shadow will be used to evaluate the relative performance improvement of this protocol.
> The first byte of each message will be used to indicate the message type:
> 0xff: Message with clientID header
> 0xfe: Message without header
> The client will send type-0xff messages (with the clientID header) until it receives a type-0xfe message, and then it will switch to sending type-0xfe messages. The proxy will always send type-0xfe messages (and will only send its first message after receiving a message from the client).
Let me suggest some alternatives that may be simpler and easier.
What they have in common is transmitting the ClientID reliably,
out-of-band, before the data channel is established.
Send the ClientID in the data channel label property. I.e., call pc.CreateDataChannel in the client with the label parameter set to the ClientID. The label is sent reliably in the DATA_CHANNEL_OPEN which is sent behind the scenes when a data channel is created. The proxy will get the label in the ondatachannel callback.
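For example, a rough client-side sketch in Go, assuming pion/webrtc and a hex-encoded ClientID as the label (both choices are illustrative, not prescriptive):

```go
package main

import (
	"encoding/hex"
	"log"

	"github.com/pion/webrtc/v3"
)

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical ClientID; in Snowflake it is a short random identifier.
	clientID := []byte{1, 2, 3, 4, 5, 6, 7, 8}

	// Carry the ClientID in the data channel's label. The label travels
	// reliably in DATA_CHANNEL_OPEN; the proxy reads it from channel.label
	// in its ondatachannel callback.
	dc, err := pc.CreateDataChannel(hex.EncodeToString(clientID), nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created data channel with label", dc.Label())
}
```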
> Since there will be two versions of the protocol (the existing protocol and the proposed protocol), the proxy and client will need a way to find a common protocol to communicate with. This will be done with an ALPN-style negotiation over the broker's signaling channel.
I think it can be even simpler.
Data channels have a protocol field
that is intended for exactly this purpose.
Like label, protocol is sent reliably and out-of-band
behind the scenes in DATA_CHANNEL_OPEN
as part of the process of creating a data channel.
If we assume that a client that supports the new protocol
will continue to support the old protocol,
then the client doesn't even need to include its protocol support in its registration message,
and the broker does not need to consider protocol support when matching clients with proxies.
Here is how it could work:
The proxy sends its protocol support in the
ProxyPollRequest message:
"protocols":["","u"],
(I recommend using an empty string to represent the old protocol,
not "t", because our existing data channels are in fact already using
protocol = "", which is the default.)
The broker forwards the proxy's protocol list to the client in the
ClientPollResponse
message.
If the "protocols" field is present and contains "u",
then the client uses the new protocol.
If the "protocols" field contains "" or is absent,
then the client uses the old protocol.
The client sets the Protocol field of
RTCDataChannelInit
to either "" or "u" when calling pc.CreateDataChannel (see the Go sketch below).
The proxy, in its
ondatachannel callback,
checks the data channel's protocol and sets up message callbacks appropriately
(as in this example):
```js
dbg(`Data Channel established, protocol ${JSON.stringify(channel.protocol)}...`);
switch (channel.protocol) {
  case "": // empty string is old default
    // reliable, ordered mode
    this.prepareDataChannel(channel);
    break;
  case "u":
    // unreliable, unordered mode
    // do the equivalent of prepareDataChannel, but for unreliable mode
    break;
  default:
    // error: disconnect
    break;
}
```
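For reference, a rough client-side counterpart in Go might look like the following, assuming pion/webrtc; the data channel label and helper name are placeholders of this sketch, not the actual Snowflake code.

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// createDataChannel opens the data channel, requesting the "u" (unreliable,
// unordered) protocol variant when the proxy advertised support for it,
// and the default "" protocol otherwise.
func createDataChannel(pc *webrtc.PeerConnection, proxySupportsU bool) (*webrtc.DataChannel, error) {
	proto := "" // existing reliable, ordered protocol
	if proxySupportsU {
		proto = "u"
	}
	return pc.CreateDataChannel("snowflake", &webrtc.DataChannelInit{
		// The protocol string is carried out-of-band in DATA_CHANNEL_OPEN,
		// so the proxy can read it in its ondatachannel callback.
		Protocol: &proto,
	})
}

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}
	dc, err := createDataChannel(pc, true)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("requested subprotocol:", dc.Protocol())
}
```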
If the client is going to drop support for the old protocol,
then the client will have to send its supported protocols to the broker
and the broker will have to use that information in matching,
but that will make things more complicated.
> The proxy forwards messages to the server without modification or processing, and indicates the protocol kind in the websocket URL.
This is a reasonable idea,
to forward each data channel message unmodified through the WebSocket,
relying on WebSocket message boundaries
to keep KCP packets separate.
The first message will contain the ClientID,
which was learned using one of the mechanisms above.
Using a different WebSocket URL path is defensible,
but be aware that WebSocket also has built-in support for a
"subprotocol" string
that could hold this information.
In the proxy, you set the subprotocol using the
optional second argument to the WebSocket
constructor.
On the bridge, you set the Subprotocols field of the
Upgrader struct
and then find which protocol was selected using
Conn.Subprotocol.
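For illustration, here is a rough bridge-side sketch assuming the gorilla/websocket package (which provides the Upgrader and Conn.Subprotocol mentioned above); the subprotocol names are placeholders, not decided values.

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// Hypothetical subprotocol names; "" keeps the old behavior as the default.
var upgrader = websocket.Upgrader{
	// The bridge advertises both variants; the client/proxy picks one.
	Subprotocols: []string{"snowflake-u", "snowflake"},
}

func handler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	// Conn.Subprotocol reports which subprotocol was negotiated.
	switch conn.Subprotocol() {
	case "snowflake-u":
		// unreliable/unordered framing: raw KCP packets, one per WebSocket message
	default:
		// old reliable framing (existing encapsulation protocol)
	}
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```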
An alternative, possibly more incremental,
is to continue using the existing encapsulation protocol
on the proxy–bridge WebSocket link.
The proxy will first send the ClientID to the bridge.
Then, when you forward a message from the client to the bridge,
prepend a 2-byte length prefix of the form 11xxxxxx 0xxxxxxx,
which supports data lengths up to 14 bits.
The benefit of this approach is that no
synchronized change is required at bridges;
the disadvantage is more processing at the proxy.
One thing that is missing here is padding/shaping
of the client–proxy link when using unreliable data channels.
We have support for arbitrary padding/shaping
of the existing reliable data channels,
even if we do not use it.
But with all the recent talk of
TLS-in-TLS tunnel detection by the GFW
(ex. 1,
ex. 2),
we should not design a protocol that does not support
adding padding to messages or sending messages that consist entirely of padding.
One question is whether the padding should be end-to-end
between the client and bridge (as it is now with reliable data channels),
or if the proxy should remove the padding and only forward
plain, unpadded packets to the bridge.
> A new bridge line configuration option, mod, will be added to the client configuration. Its valid values are strings representing a protocol preference; each character represents one protocol and will be expanded during processing. u represents udp-like and t represents tcp-like. The default is ut.
I recommend against this part.
There are no advantages to making this configurable at the client.
Just make clients always prefer the "u" mode when supported by the proxy.
I agree that sending the negotiated protocol with the channel opening message is simpler and achieves the same purpose and functionality. I unnecessarily treated WebRTC as a plain UDP connection.
About Protocol Design
I agree that sending the Channel ID with the channel opening message is a lot simpler than having a separate protocol to transfer it. However, I purposefully engineered a protocol with a new message type header to make it possible to add padding/message-fragmentation features in the future, and this includes letting the proxy send traffic to the server without any processing. This ensures we can roll out packet-length-related countermeasures without asking proxy operators to deploy a new version of the proxy (which is time-consuming).
BTW, the implementation-specific reason I prefer not to transfer the Channel ID in the channel opening message is that it breaks the abstraction and encapsulation of the Go implementation of Snowflake: the Channel ID is not accessible to the part of the code that creates the WebRTC connection. It would be necessary to refactor the code to move the Channel ID generation into func (t *Transport) Dial() and modify NewPeers to accept an additional Channel ID argument (which needs special care to avoid breaking API stability), and to propagate that ID with cascading API changes to exported functions. I was unable to determine whether any library user would need to update their code to use the more udp-like transport.
I personally don't think sending the Channel ID via the signaling channel is a good idea, since this would allow an attacker to easily collect a lot of channel IDs and drain the corresponding server -> client messages at large scale.
Client configuration
The intended use case for this configuration is debugging and benchmarking, and, when packet loss is high, turning off "tcp"-like transport support on the client side so that a match with a server that doesn't support the udp-like transport is treated as a failure and the client tries again until one with udp-like support is matched.
> the Channel ID is not accessible to the part of the code that creates the WebRTC connection. It would be necessary to refactor the code to move the Channel ID generation into func (t *Transport) Dial() and modify NewPeers to accept an additional Channel ID argument (which needs special care to avoid breaking API stability)
The refactoring you propose does not seem that bad.
Functions like NewPeers arguably should not even be
exposed as part of the public API.
IPtProxy does not use it.
Current library users never need to interact with the ClientID,
and I don't see any reason why that should change.
Of course I will let you consider and come to a decision.
What I am worried about is the ongoing costs of complexity,
which may seem small now but will continue to accrue over time.
The 0xff and 0xfe message type tags already create a state machine with two states:
the client has to act differently depending on whether it
has yet received a 0xfe-type message from the server on its current temporary data channel,
and the bridge has to act differently depending on whether it
has yet received a 0xff-type message from the client on its current WebSocket.
Now, you have to make sure that the client's behavior
is appropriate, in both of those states,
for every message type the client may receive,
including invalid or unknown ones.
The more possibilities there are,
the more opportunities for errors
in specification or implementation.
Every little bit of complexity you can squeeze out
during the design phase will repay itself many times
during implementation and, especially, maintenance.
If you do intend to implement a state machine,
you should draw a diagram or write it out,
and state exhaustively what each peer should do
for each message type in each state.
For example, suppose the bridge receives a 0xff with a Client ID,
then later receives another 0xff with a different Client ID—what should happen?
What if the bridge receives a 0xfe before receiving a 0xff?
What if a 0xff message is too short to contain a complete Client ID?
Does an unknown message type get ignored or does it cause
the connection to terminate?
> I purposefully engineered a protocol with a new message type header to make it possible to add padding/message-fragmentation features in the future, and this includes letting the proxy send traffic to the server without any processing. This ensures we can roll out packet-length-related countermeasures without asking proxy operators to deploy a new version of the proxy (which is time-consuming).
Okay, this is a good point about enabling the proxies
to always pass messages through unmodified.
I agree that is a pretty compelling advantage,
and argues for characteristics like padding being end-to-end
rather than being interpreted and stripped by the proxy.
I still think, though,
that using message type tags to send the Client ID
in-band is likely to make things harder than necessary.
Suppose you want to introduce a new message type tag
to represent a packet with additional padding.
You effectively have to add two new message type tags,
one corresponding to the 0xff state and one to the 0xfe state.
> I personally don't think sending the Channel ID via the signaling channel is a good idea, since this would allow an attacker to easily collect a lot of channel IDs and drain the corresponding server -> client messages at large scale.
That is a fair consideration.
> The intended use case for this configuration is debugging and benchmarking, and, when packet loss is high, turning off "tcp"-like transport support on the client side so that a match with a server that doesn't support the udp-like transport is treated as a failure and the client tries again until one with udp-like support is matched.
I don't know.
To me, it looks like a large gain in complexity
for a very small gain in utility.
Additional configuration options contribute to complexity
multiplicatively, not additively.
We do not intend to keep running the old and new protocols
in parallel forever, correct?
We expect the proxy population to upgrade and start supporting
the new protocol, so that the fallback
to the reliable-channel mode happens rarely
and can eventually be removed?
If there remains a large fraction of proxies
that fail to support the new protocol,
that's a problem we need to fix directly,
not by adding more logic to clients to work around it, IMO.
Thanks for your good work on this.
Not everything needs to come to a decision now;
I'm sure some choices will become more clear
as you work through prototyping.
The proxy sends its supported protocols to the broker. The client receives the proxy's supported protocols and decides which protocol to use. The chosen protocol name will be sent to the server via WebRTC SCTP's channel opening message.
Protocol Negotiation: Client Preference
An option will be defined to force the client to choose a mode during development. We will see whether it is necessary to publish it in the final version.
Client - Server Protocol
A new message format will be defined to allow the client and proxy to communicate the Client ID.
There are 2 kinds of messages: messages with a Client ID and messages without a Client ID.
0xff: Message with clientID header
0xfe: Message without header
The client has 2 states:
C-a: No Client ID acknowledgement.
C-b: Client ID acknowledgement received.
The client is initially in state C-a, and transitions to state C-b when it receives a message from the server. Whenever there is a connection interruption, the connection will be recreated with the same Client ID. It drops any received packet whose type header is unexpected.
The server has only one state. Once it receives a message with a Client ID, it replies to the client with a message without header. Every inbound connection forwarded by a proxy starts a new proxy-server protocol connection. It drops any received packet whose type header is unexpected.
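As a non-final sketch, the server-side parsing could look roughly like this in Go, assuming an 8-byte Client ID and the drop-unexpected behavior described above:

```go
package main

import (
	"errors"
	"fmt"
)

// Message type tags of the proposed client-server protocol.
const (
	tagWithClientID    = 0xff
	tagWithoutClientID = 0xfe
)

var errDrop = errors.New("unexpected message: drop")

// parseMessage splits one message into an optional 8-byte Client ID and the
// KCP payload. Malformed or unknown messages are simply dropped.
func parseMessage(msg []byte) (clientID []byte, payload []byte, err error) {
	if len(msg) == 0 {
		return nil, nil, errDrop
	}
	switch msg[0] {
	case tagWithClientID:
		if len(msg) < 1+8 {
			return nil, nil, errDrop // too short to contain a complete Client ID
		}
		return msg[1:9], msg[9:], nil
	case tagWithoutClientID:
		return nil, msg[1:], nil
	default:
		return nil, nil, errDrop
	}
}

func main() {
	id, payload, err := parseMessage(append([]byte{0xff, 1, 2, 3, 4, 5, 6, 7, 8}, "kcp packet"...))
	fmt.Println(id, string(payload), err)
}
```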
Proxy - Server Protocol
A query parameter will be added to the WebSocket connection URL to signal the udp-like protocol.
WebSocket preserves message boundaries on its own. The proxy does not process messages; all messages are forwarded without modification.
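As a rough illustration (the query parameter name is not decided yet and is only a placeholder here), the server's HTTP handler could check the parameter before upgrading the connection:

```go
package main

import (
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Hypothetical query parameter name; the spec only says that a query
	// parameter in the WebSocket URL signals the udp-like protocol, e.g.
	// wss://bridge.example/?transport=udp-like
	udpLike := r.URL.Query().Get("transport") == "udp-like"
	log.Println("udp-like protocol requested:", udpLike)
	// ... choose the framing accordingly, then upgrade to WebSocket ...
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```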
Evaluation
A Shadow test environment will be created to test and evaluate the performance improvement.
I think this specification of the state machine is incomplete.
It's missing a transition back from C-b to C-a,
which happens whenever the client reconnects with a new proxy.
Every time the bridge receives a new WebSocket connection,
it needs to know the Client ID associated with the new incoming connection,
so it can start feeding the packets into KCP.
(P.S. please say "Client ID", not "Channel ID", so your terminology matches the source code.)
Strictly speaking, the state is not a single global variable
on the client:
every one of the client's current WebRTC connections
can be in either state C-a or C-b,
independent of all the others.
Even though we currently only use one connection at a time,
that state variable needs to be stored with the connection,
not global to the client.
Similarly, every server connection has its own state variable,
depending on whether it has received a correctly formatted
0xff message yet or not.
I think you will be happier if you write down your plan more formally.
I mean like a table where you say what happens with every event in every state.
| actor | state | event | action |
|---|---|---|---|
| client conn | N/A | connection open | transition to state C-a |
| client conn | C-a | outgoing packet queued | prefix with \xff + Client ID + data and send |
| client conn | C-a | receive message with \xfe prefix | transition to state C-b, strip prefix and queue incoming packet |
| client conn | C-a | receive any other message (empty or non-\xfe prefix) | ??? |
| client conn | C-a | local close | ??? |
| client conn | C-a | remote close | ??? |
| client conn | C-b | outgoing packet queued | prefix with \xfe + data and send |
| client conn | C-b | receive message with \xfe prefix | strip prefix and queue incoming packet |
| client conn | C-b | receive any other message (empty or non-\xfe prefix) | ??? |
| client conn | C-b | local close | ??? |
| client conn | C-b | remote close | ??? |
| server conn | N/A | connection received | transition to state S-a |
| server conn | S-a | outgoing packet queued | cannot happen, packets can only be queued for a specific Client ID |
| server conn | S-a | receive message with \xff prefix that is at least 9 bytes long | associate the connection with Client ID data[1:9] and queue incoming packet data[9:], transition to state S-b |
| server conn | S-a | receive any other message (empty, prefix other than \xff, \xff prefix but shorter than 9 bytes) | ??? |
| server conn | S-a | local close | ??? |
| server conn | S-a | remote close | ??? |
| server conn | S-b | outgoing packet queued | prefix with \xfe and send |
| server conn | S-b | receive message with \xff prefix that is at least 9 bytes and data[1:9] matches this connection's Client ID | remove first 9 bytes, queue incoming packet |
| server conn | S-b | receive message with \xff prefix that is at least 9 bytes and data[1:9] does not match this connection's Client ID | ??? |
| server conn | S-b | receive message with \xfe prefix | strip prefix and queue incoming packet |
| server conn | S-b | receive any other message (empty, prefix other than \xff or \xfe, \xff prefix but shorter than 9 bytes) | ??? |
> I think this specification of the state machine is incomplete. It's missing a transition back from C-b to C-a, which happens whenever the client reconnects with a new proxy.
Yes! I only fully understood this issue when I began to implement the protocol. I have amended the spec to make sure it notes that whenever there is a connection interruption with the proxy, the connection will be created again with the same Client ID, effectively resetting the entire protocol state.
> (P.S. please say "Client ID", not "Channel ID", so your terminology matches the source code.)
Fixed.
> I think you will be happier if you write down your plan more formally. I mean like a table where you say what happens with every event in every state.
I have amended the spec with "drop any unexpected message". Please allow me to work on the implementation first, as I can usually fix issues that surface during implementation (and often don't realize mistakes in the details until actually writing the code).
Some unexpected issues were discovered during implementation that require a more difficult choice. I have thought about it for a while and there is no easy winner, but luckily we can have a discussion about it:
The issue is that a data channel must be created before sending the offer, and the client doesn't know what protocols the proxy supports when creating the data channel, as it has yet to be matched with a proxy.
Here are some routes we could try:
change proxy logic to expect more than one data channel
rewrite the client offer manually to allow creating data channel later
I recommend that you not try to solve this problem now.
Instead, I recommend that you first get a branch working
where you pay no attention to backward compatibility.
Completely replace the current data channel code with your new implementation—don't try
to make the new and old work side by side.
Assume that every proxy supports your new protocol by default, and only the new protocol.
You should think about compatibility with the existing deployment
only after that non–backward compatible branch is working satisfactorily.
The reason I recommend this is that the changes and design decisions
needed to support unreliable data channels are non-trivial in themselves.
They deserve to have their own focused review.
If I am reviewing, I want to see exactly what changes when reliable data channels
are converted to unreliable data channels, and only that.
I don't want to see it as a parallel code path to the existing reliable data channels,
and I really don't want to see it mixed with backward compatibility glue code.
The situation I want to avoid is that all the changes—data channels, rendezvous, backward compatibility—get wrapped
up in one merge request that is too big to effectively review,
and/or you feel that you have sunk too much work into it to want to make changes during review.
The diff that implements unreliable data channels needs to be small and reviewable.
So what I suggest now is: continue making the data channel before creating the offer,
but set Ordered=false and MaxRetransmits=0 in DataChannelInit.
Make whatever other changes you need in the client,
and make the minimal changes in the proxy and server needed to make the system work
in a local deployment.
Ideally, no changes to rendezvous are needed at this point.
Once the local deployment is working, let's do a review before making further changes.
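For concreteness, a rough sketch of that change in Go, assuming the pion/webrtc client; the data channel label and error handling here are placeholders, not the actual Snowflake code:

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}

	// Unreliable, unordered data channel: no ordering guarantee and no
	// SCTP-level retransmissions, leaving loss recovery entirely to KCP.
	ordered := false
	maxRetransmits := uint16(0)
	dc, err := pc.CreateDataChannel("snowflake", &webrtc.DataChannelInit{
		Ordered:        &ordered,
		MaxRetransmits: &maxRetransmits,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created unreliable data channel:", dc.Label())
}
```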
Yes! Thanks for the suggestion. I will try to get it working locally (assuming all old proxies are rejected by the broker) and request a review of it. I agree it will become too big to fully review once backward compatibility comes into play.
It should be easier to implement backwards-compatibility-wise, since there are only 2 servers; plus, with this MR it is already expected that the server <-> proxy connection is unreliable.
Hmm I'm looking at their code and it appears that they still use WebSocket (see this snippet), which is also evident by looking at the "network" tab when running their extension. They only have an MR to add WebTransport.