Actually, here are some observations from me related to #40243 (closed):
Snowflake is currently using network resources in such a suboptimal way that I think it would make sense to also consider a protocol-level change to how kcp interacts with webrtc before considering adding forward error correction. This would take the form of enabling webrtc's unreliable mode and making the changes necessary to get it to work.
Right now, kcp packets are sent over the webrtc data channel in reliable mode, which delivers all packets in order and retransmits any lost message repeatedly. However, kcp also retransmits its packets itself, and as a result all those retransmitted packets get queued somewhere, such as in webrtc's buffer.
This means lost packets end up being retransmitted more than once, in different protocols, and retransmissions and timeouts get compounded. More retransmissions result in more lost packets and yet more retransmissions, eventually leading to connection meltdown.
Back pressure like the one introduced in !144 (merged) only moves the problem, and blocks kcp's send in an unexpected way: kcp doesn't expect send to block, since it usually runs over udp.
Here is my proposal for addressing this issue by letting KCP deal with packet loss instead of letting webrtc do it:
Protocol Negotiation
Since there will be two versions of the protocol (the existing protocol and the proposed protocol), the proxy and client will need a way to find a common protocol to communicate with. This will be done with an ALPN-style negotiation over the broker's signaling channel.
The client and proxy both send their supported protocols to the broker, and the broker forwards each supported-protocol list to the other side without processing or modification. The client and proxy will communicate over the first protocol in the client's preference list that the proxy supports. If the client or server did not send its preferred protocols, tcp-like will be assumed.
Protocol Design
The client and proxy will turn on unordered, non-retransmitting mode if the udp-like mode has been selected.
In order to facilitate the transfer of the turbotunnel ID and other metadata, the following protocol will be used for communication between client and proxy:
The first byte of each message will be used to indicate the message type:
0xff: Message with clientID header
0xfe: Message without header
The client will send type-0xff messages (with the clientID header) until it receives a type-0xfe message, and then it will switch to sending type-0xfe messages. The proxy will always send type-0xfe messages (and will only send its first message after receiving a message from the client).
The proxy forwards messages to the server without modification or processing, and indicates the protocol kind in the websocket URL. (It relies on websocket to preserve message boundaries.)
For the client and server, this protocol replaces the clientID header and the encapsulation layer.
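As a rough illustration (not final code), the client-side framing could look something like the following in Go. The 8-byte Client ID size and the helper names are assumptions of this sketch, not part of the spec above.

```go
package framing

// Proposed message type tags (first byte of each message).
const (
	msgWithClientID    = 0xff // carries the clientID header
	msgWithoutClientID = 0xfe // payload only, no header
)

// frameWithClientID builds a 0xff message: tag + 8-byte Client ID + payload.
func frameWithClientID(clientID [8]byte, payload []byte) []byte {
	msg := make([]byte, 0, 1+len(clientID)+len(payload))
	msg = append(msg, msgWithClientID)
	msg = append(msg, clientID[:]...)
	return append(msg, payload...)
}

// frameWithoutClientID builds a 0xfe message: tag + payload.
func frameWithoutClientID(payload []byte) []byte {
	return append([]byte{msgWithoutClientID}, payload...)
}
```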
Client configuration
A new bridge line configuration option, mod, will be added to the client configuration. Its valid values are strings representing a protocol preference; each character represents one protocol and will be expanded during processing. u represents udp-like and t represents tcp-like. The default is ut.
Evaluation
A simulation environment built using Shadow will be used to evaluate the relative performance improvement of this protocol.
> The first byte of each message will be used to indicate the message type:
> 0xff: Message with clientID header
> 0xfe: Message without header
> The client will send type-0xff messages (with the clientID header) until it receives a type-0xfe message, and then it will switch to sending type-0xfe messages. The proxy will always send type-0xfe messages (and will only send its first message after receiving a message from the client).
Let me suggest some alternatives that may be simpler and easier.
What they have in common is transmitting the ClientID reliably,
out-of-band, before the data channel is established.
Send the ClientID in the data channel label property. I.e., call pc.CreateDataChannel in the client with the label parameter set to the ClientID. The label is sent reliably in the DATA_CHANNEL_OPEN which is sent behind the scenes when a data channel is created. The proxy will get the label in the ondatachannel callback.
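For example, a rough client-side sketch in Go, assuming pion/webrtc and a hex-encoded ClientID as the label (both choices are illustrative, not prescriptive):

```go
package main

import (
	"encoding/hex"
	"log"

	"github.com/pion/webrtc/v3"
)

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical ClientID; in Snowflake it is a short random identifier.
	clientID := []byte{1, 2, 3, 4, 5, 6, 7, 8}

	// Carry the ClientID in the data channel's label. The label travels
	// reliably in DATA_CHANNEL_OPEN; the proxy reads it from channel.label
	// in its ondatachannel callback.
	dc, err := pc.CreateDataChannel(hex.EncodeToString(clientID), nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created data channel with label", dc.Label())
}
```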
> Since there will be two versions of the protocol (the existing protocol and the proposed protocol), the proxy and client will need a way to find a common protocol to communicate with. This will be done with an ALPN-style negotiation over the broker's signaling channel.
I think it can be even simpler.
Data channels have a protocol field
that is intended for exactly this purpose.
Like label, protocol is sent reliably and out-of-band
behind the scenes in DATA_CHANNEL_OPEN
as part of the process of creating a data channel.
If we assume that a client that supports the new protocol
will continue to support the old protocol,
then the client doesn't even need to include its protocol support in its registration message,
and the broker does not need to consider protocol support when matching clients with proxies.
Here is how it could work:
The proxy sends its protocol support in the
ProxyPollRequest message:
"protocols":["","u"],
(I recommend using an empty string to represent the old protocol,
not "t", because our existing data channels are in fact already using
protocol = "", which is the default.)
The broker forwards the proxy's protocol list to the client in the
ClientPollResponse
message.
If the "protocols" field is present and contains "u",
then the client uses the new protocol.
If the "protocols" field contains "" or is absent,
then the client uses the old protocol.
The client sets the Protocol field of
RTCDataChannelInit
to either "" or "u" when calling pc.CreateDataChannel (see the Go sketch below).
The proxy, in its
ondatachannel callback,
checks the data channel's protocol and sets up message callbacks appropriately
(as in this example):
```js
dbg(`Data Channel established, protocol ${JSON.stringify(channel.protocol)}...`);
switch (channel.protocol) {
  case "": // empty string is old default
    // reliable, ordered mode
    this.prepareDataChannel(channel);
    break;
  case "u":
    // unreliable, unordered mode
    // do the equivalent of prepareDataChannel, but for unreliable mode
    break;
  default:
    // error: disconnect
    break;
}
```
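For reference, a rough client-side counterpart in Go might look like the following, assuming pion/webrtc; the data channel label and helper name are placeholders of this sketch, not the actual Snowflake code.

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// createDataChannel opens the data channel, requesting the "u" (unreliable,
// unordered) protocol variant when the proxy advertised support for it,
// and the default "" protocol otherwise.
func createDataChannel(pc *webrtc.PeerConnection, proxySupportsU bool) (*webrtc.DataChannel, error) {
	proto := "" // existing reliable, ordered protocol
	if proxySupportsU {
		proto = "u"
	}
	return pc.CreateDataChannel("snowflake", &webrtc.DataChannelInit{
		// The protocol string is carried out-of-band in DATA_CHANNEL_OPEN,
		// so the proxy can read it in its ondatachannel callback.
		Protocol: &proto,
	})
}

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}
	dc, err := createDataChannel(pc, true)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("requested subprotocol:", dc.Protocol())
}
```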
If the client is going to drop support for the old protocol,
then the client will have to send its supported protocols to the broker
and the broker will have to use that information in matching,
but that will make things more complicated.
> The proxy forwards messages to the server without modification or processing, and indicates the protocol kind in the websocket URL.
This is a reasonable idea,
to forward each data channel message unmodified through the WebSocket,
relying on WebSocket message boundaries
to keep KCP packets separate.
The first message will contain the ClientID,
which was learned using one of the mechanisms above.
Using a different WebSocket URL path is defensible,
but be aware that WebSocket also has built-in support for a
"subprotocol" string
that could hold this information.
In the proxy, you set the subprotocol using the
optional second argument to the WebSocket
constructor.
On the bridge, you set the Subprotocols field of the
Upgrader struct
and then find which protocol was selected using
Conn.Subprotocol.
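For illustration, here is a rough bridge-side sketch assuming the gorilla/websocket package (which provides the Upgrader and Conn.Subprotocol mentioned above); the subprotocol names are placeholders, not decided values.

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/websocket"
)

// Hypothetical subprotocol names; "" keeps the old behavior as the default.
var upgrader = websocket.Upgrader{
	// The bridge advertises both variants; the client/proxy picks one.
	Subprotocols: []string{"snowflake-u", "snowflake"},
}

func handler(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	// Conn.Subprotocol reports which subprotocol was negotiated.
	switch conn.Subprotocol() {
	case "snowflake-u":
		// unreliable/unordered framing: raw KCP packets, one per WebSocket message
	default:
		// old reliable framing (existing encapsulation protocol)
	}
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```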
An alternative, possibly more incremental,
is to continue using the existing encapsulation protocol
on the proxy–bridge WebSocket link.
The proxy will first send the ClientID to the bridge.
Then, when you forward a message from the client to the bridge,
prepend a 2-byte length prefix of the form 11xxxxxx 0xxxxxxx,
which supports data lengths up to 14 bits.
The benefit of this approach is that no
synchronized change is required at bridges;
the disadvantage is more processing at the proxy.
One thing that is missing here is padding/shaping
of the client–proxy link when using unreliable data channels.
We have support for arbitrary padding/shaping
of the existing reliable data channels,
even if we do not use it.
But with all the recent talk of
TLS-in-TLS tunnel detection by the GFW
(ex. 1,
ex. 2),
we should not design a protocol that does not support
adding padding to messages or sending messages that consist entirely of padding.
One question is whether the padding should be end-to-end
between the client and bridge (as it is now with reliable data channels),
or if the proxy should remove the padding and only forward
plain, unpadded packets to the bridge.
> A new bridge line configuration option, mod, will be added to the client configuration. Its valid values are strings representing a protocol preference; each character represents one protocol and will be expanded during processing. u represents udp-like and t represents tcp-like. The default is ut.
I recommend against this part.
There are no advantages to making this configurable at the client.
Just make clients always prefer the "u" mode when supported by the proxy.
I agree that sending the negotiated protocol with the channel opening message is simpler and achieves the same purpose and functionality. I unnecessarily treated WebRTC as a plain UDP connection.
About Protocol Design
I agree that sending the Channel ID with the channel opening message is a lot simpler than having a separate protocol to transfer it. However, I purposefully engineered a protocol with a new message type header to make it possible to add padding/message-fragmentation features in the future, and this includes letting the proxy send traffic to the server without any processing. This ensures we can roll out packet-length-related countermeasures without asking proxy operators to deploy a new version of the proxy (which is time-consuming).
BTW, the implementation-specific reason I prefer not to transfer the Channel ID in the channel opening message is that it breaks the abstraction and encapsulation of the Go implementation of Snowflake: the Channel ID is not accessible to the part of the code that creates the WebRTC connection. It would be necessary to refactor the code to move the Channel ID generation into func (t *Transport) Dial() and modify NewPeers to accept an additional Channel ID argument (which needs special care to avoid breaking API stability), and to propagate that ID with cascading API changes to exported functions. I was unable to determine whether any library user would need to update their code to use the more udp-like transport.
I personally don't think sending the Channel ID via the signaling channel is a good idea, since this would allow an attacker to easily collect a lot of channel IDs and drain the corresponding server -> client messages at large scale.
Client configuration
The intended use case for this configuration is debugging and benchmarking, and, when packet loss is high, turning off "tcp"-like transport support on the client side so that a match with a server that doesn't support the udp-like transport is treated as a failure and the client tries again until one with udp-like support is matched.
> the Channel ID is not accessible to the part of the code that creates the WebRTC connection. It would be necessary to refactor the code to move the Channel ID generation into func (t *Transport) Dial() and modify NewPeers to accept an additional Channel ID argument (which needs special care to avoid breaking API stability)
The refactoring you propose does not seem that bad.
Functions like NewPeers arguably should not even be
exposed as part of the public API.
IPtProxy does not use it.
Current library users never need to interact with the ClientID,
and I don't see any reason why that should change.
Of course I will let you consider and come to a decision.
What I am worried about is the ongoing costs of complexity,
which may seem small now but will continue to accrue over time.
The 0xff and 0xfe message type tags already create a state machine with two states:
the client has to act differently depending on whether it
has yet received a 0xfe-type message from the server on its current temporary data channel,
and the bridge has to act differently depending on whether it
has yet received a 0xff-type message from the client on its current WebSocket.
Now, you have to make sure that the client's behavior
is appropriate, in both of those states,
for every message type the client may receive,
including invalid or unknown ones.
The more possibilities there are,
the more opportunities for errors
in specification or implementation.
Every little bit of complexity you can squeeze out
during the design phase will repay itself many times
during implementation and, especially, maintenance.
If you do intend to implement a state machine,
you should draw a diagram or write it out,
and state exhaustively what each peer should do
for each message type in each state.
For example, suppose the bridge receives a 0xff with a Client ID,
then later receives another 0xff with a different Client ID—what should happen?
What if the bridge receives a 0xfe before receiving a 0xff?
What if a 0xff message is too short to contain a complete Client ID?
Does an unknown message type get ignored or does it cause
the connection to terminate?
> I purposefully engineered a protocol with a new message type header to make it possible to add padding/message-fragmentation features in the future, and this includes letting the proxy send traffic to the server without any processing. This ensures we can roll out packet-length-related countermeasures without asking proxy operators to deploy a new version of the proxy (which is time-consuming).
Okay, this is a good point about enabling the proxies
to always pass messages through unmodified.
I agree that is a pretty compelling advantage,
and argues for characteristics like padding being end-to-end
rather than being interpreted and stripped by the proxy.
I still think, though,
that using message type tags to send the Client ID
in-band is likely to make things harder than necessary.
Suppose you want to introduce a new message type tag
to represent a packet with additional padding.
You effectively have to add two new message type tags,
one corresponding to the 0xff state and one to the 0xfe state.
> I personally don't think sending the Channel ID via the signaling channel is a good idea, since this would allow an attacker to easily collect a lot of channel IDs and drain the corresponding server -> client messages at large scale.
That is a fair consideration.
> The intended use case for this configuration is debugging and benchmarking, and, when packet loss is high, turning off "tcp"-like transport support on the client side so that a match with a server that doesn't support the udp-like transport is treated as a failure and the client tries again until one with udp-like support is matched.
I don't know.
To me, it looks like a large gain in complexity
for a very small gain in utility.
Additional configuration options contribute to complexity
multiplicatively, not additively.
We do not intend to keep running the old and new protocols
in parallel forever, correct?
We expect the proxy population to upgrade and start supporting
the new protocol, so that the fallback
to the reliable-channel mode happens rarely
and can eventually be removed?
If there remains a large fraction of proxies
that fail to support the new protocol,
that's a problem we need to fix directly,
not by adding more logic to clients to work around it, IMO.
Thanks for your good work on this.
Not everything needs to come to a decision now;
I'm sure some choices will become more clear
as you work through prototyping.
The proxy sends its supported protocols to the broker. The client receives the proxy's supported protocols and decides which protocol to use. The chosen protocol name will be sent to the server via WebRTC SCTP's channel opening message.
Protocol Negotiation: Client Preference
An option will be defined to force the client to choose a mode during development. We will see whether it is necessary to publish it in the final version.
Client - Server Protocol
A new message format will be defined to allow the client and proxy to communicate the Client ID.
There are 2 kinds of messages: messages with a Client ID and messages without a Client ID.
0xff: Message with clientID header
0xfe: Message without header
The client has 2 states:
C-a: No Client ID acknowledgement.
C-b: Client ID acknowledgement received.
The client is initially in state C-a, and transitions to state C-b when it receives a message from the server. Whenever there is a connection interruption, the connection will be recreated with the same Client ID. It drops any received packet whose type header is unexpected.
The server has only one state. Once it receives a message with a Client ID, it replies to the client with a message without header. Every inbound connection forwarded by a proxy starts a new proxy-server protocol connection. It drops any received packet whose type header is unexpected.
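As a non-final sketch, the server-side parsing could look roughly like this in Go, assuming an 8-byte Client ID and the drop-unexpected behavior described above:

```go
package main

import (
	"errors"
	"fmt"
)

// Message type tags of the proposed client-server protocol.
const (
	tagWithClientID    = 0xff
	tagWithoutClientID = 0xfe
)

var errDrop = errors.New("unexpected message: drop")

// parseMessage splits one message into an optional 8-byte Client ID and the
// KCP payload. Malformed or unknown messages are simply dropped.
func parseMessage(msg []byte) (clientID []byte, payload []byte, err error) {
	if len(msg) == 0 {
		return nil, nil, errDrop
	}
	switch msg[0] {
	case tagWithClientID:
		if len(msg) < 1+8 {
			return nil, nil, errDrop // too short to contain a complete Client ID
		}
		return msg[1:9], msg[9:], nil
	case tagWithoutClientID:
		return nil, msg[1:], nil
	default:
		return nil, nil, errDrop
	}
}

func main() {
	id, payload, err := parseMessage(append([]byte{0xff, 1, 2, 3, 4, 5, 6, 7, 8}, "kcp packet"...))
	fmt.Println(id, string(payload), err)
}
```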
Proxy - Server Protocol
A query parameter will be added to the WebSocket connection URL to signal the udp-like protocol.
WebSocket preserves message boundaries on its own. The proxy does not process messages; all messages are forwarded without modification.
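As a rough illustration (the query parameter name is not decided yet and is only a placeholder here), the server's HTTP handler could check the parameter before upgrading the connection:

```go
package main

import (
	"log"
	"net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Hypothetical query parameter name; the spec only says that a query
	// parameter in the WebSocket URL signals the udp-like protocol, e.g.
	// wss://bridge.example/?transport=udp-like
	udpLike := r.URL.Query().Get("transport") == "udp-like"
	log.Println("udp-like protocol requested:", udpLike)
	// ... choose the framing accordingly, then upgrade to WebSocket ...
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}
```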
Evaluation
A Shadow test environment will be created to test and evaluate the performance improvement.
I think this specification of the state machine is incomplete.
It's missing a transition back from C-b to C-a,
which happens whenever the client reconnects with a new proxy.
Every time the bridge receives a new WebSocket connection,
it needs to know the Client ID associated with the new incoming connection,
so it can start feeding the packets into KCP.
(P.S. please say "Client ID", not "Channel ID", so your terminology matches the source code.)
Strictly speaking, the state is not a single global variable
on the client:
every one of the client's current WebRTC connections
can be in either state C-a or C-b,
independent of all the others.
Even though we currently only use one connection at a time,
that state variable needs to be stored with the connection,
not global to the client.
Similarly, every server connection has its own state variable,
depending on whether it has received a correctly formatted
0xff message yet or not.
I think you will be happier if you write down your plan more formally.
I mean like a table where you say what happens with every event in every state.
| actor | state | event | action |
|---|---|---|---|
| client conn | N/A | connection open | transition to state C-a |
| client conn | C-a | outgoing packet queued | prefix with \xff + Client ID + data and send |
| client conn | C-a | receive message with \xfe prefix | transition to state C-b, strip prefix and queue incoming packet |
| client conn | C-a | receive any other message (empty or non-\xfe prefix) | ??? |
| client conn | C-a | local close | ??? |
| client conn | C-a | remote close | ??? |
| client conn | C-b | outgoing packet queued | prefix with \xfe + data and send |
| client conn | C-b | receive message with \xfe prefix | strip prefix and queue incoming packet |
| client conn | C-b | receive any other message (empty or non-\xfe prefix) | ??? |
| client conn | C-b | local close | ??? |
| client conn | C-b | remote close | ??? |
| server conn | N/A | connection received | transition to state S-a |
| server conn | S-a | outgoing packet queued | cannot happen, packets can only be queued for a specific Client ID |
| server conn | S-a | receive message with \xff prefix that is at least 9 bytes long | associate the connection with Client ID data[1:9] and queue incoming packet data[9:], transition to state S-b |
| server conn | S-a | receive any other message (empty, prefix other than \xff, \xff prefix but shorter than 9 bytes) | ??? |
| server conn | S-a | local close | ??? |
| server conn | S-a | remote close | ??? |
| server conn | S-b | outgoing packet queued | prefix with \xfe and send |
| server conn | S-b | receive message with \xff prefix that is at least 9 bytes and data[1:9] matches this connection's Client ID | remove first 9 bytes, queue incoming packet |
| server conn | S-b | receive message with \xff prefix that is at least 9 bytes and data[1:9] does not match this connection's Client ID | ??? |
| server conn | S-b | receive message with \xfe prefix | strip prefix and queue incoming packet |
| server conn | S-b | receive any other message (empty, prefix other than \xff or \xfe, \xff prefix but shorter than 9 bytes) | ??? |
> I think this specification of the state machine is incomplete. It's missing a transition back from C-b to C-a, which happens whenever the client reconnects with a new proxy.
Yes! I only fully understood this issue when I began to implement the protocol. I have amended the spec to make sure it notes that whenever there is a connection interruption with the proxy, the connection will be created again with the same Client ID, effectively resetting the entire protocol state.
> (P.S. please say "Client ID", not "Channel ID", so your terminology matches the source code.)
Fixed.
> I think you will be happier if you write down your plan more formally. I mean like a table where you say what happens with every event in every state.
I have amended the spec with "drop any unexpected message". Please allow me to work on the implementation first, as I can usually fix issues that surface during implementation (and often don't realize mistakes in the details until actually writing the code).
Some unexpected issues were discovered during implementation that require a more difficult choice. I have thought about it for a while and there is no easy winner, but luckily we can have a discussion about it:
The issue is that a data channel must be created before sending the offer, and the client doesn't know what protocols the proxy supports when creating the data channel, as it has yet to be matched with a proxy.
Here are some routes we could try:
change proxy logic to expect more than one data channel
rewrite the client offer manually to allow creating data channel later
I recommend that you not try to solve this problem now.
Instead, I recommend that you first get a branch working
where you pay no attention to backward compatibility.
Completely replace the current data channel code with your new implementation—don't try
to make the new and old work side by side.
Assume that every proxy supports your new protocol by default, and only the new protocol.
You should think about compatibility with the existing deployment
only after that non–backward compatible branch is working satisfactorily.
The reason I recommend this is that the changes and design decisions
needed to support unreliable data channels are non-trivial in themselves.
They deserve to have their own focused review.
If I am reviewing, I want to see exactly what changes when reliable data channels
are converted to unreliable data channels, and only that.
I don't want to see it as a parallel code path to the existing reliable data channels,
and I really don't want to see it mixed with backward compatibility glue code.
The situation I want to avoid is that all the changes—data channels, rendezvous, backward compatibility—get wrapped
up in one merge request that is too big to effectively review,
and/or you feel that you have sunk too much work into it to want to make changes during review.
The diff that implements unreliable data channels needs to be small and reviewable.
So what I suggest now is: continue making the data channel before creating the offer,
but set Ordered=false and MaxRetransmits=0 in DataChannelInit.
Make whatever other changes you need in the client,
and make the minimal changes in the proxy and server needed to make the system work
in a local deployment.
Ideally, no changes to rendezvous are needed at this point.
Once the local deployment is working, let's do a review before making further changes.
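For concreteness, a rough sketch of that change in Go, assuming the pion/webrtc client; the data channel label and error handling here are placeholders, not the actual Snowflake code:

```go
package main

import (
	"log"

	"github.com/pion/webrtc/v3"
)

func main() {
	pc, err := webrtc.NewPeerConnection(webrtc.Configuration{})
	if err != nil {
		log.Fatal(err)
	}

	// Unreliable, unordered data channel: no ordering guarantee and no
	// SCTP-level retransmissions, leaving loss recovery entirely to KCP.
	ordered := false
	maxRetransmits := uint16(0)
	dc, err := pc.CreateDataChannel("snowflake", &webrtc.DataChannelInit{
		Ordered:        &ordered,
		MaxRetransmits: &maxRetransmits,
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("created unreliable data channel:", dc.Label())
}
```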
Yes! Thanks for the suggestion. I will try to get it working locally (assuming all old proxies are rejected by the broker) and request a review of it. I agree it will become too big to fully review once backward compatibility comes into play.
It should be easier to implement backwards-compatibility-wise, since there are only 2 servers; plus, with this MR it is already expected that the server <-> proxy connection is unreliable.
Hmm I'm looking at their code and it appears that they still use WebSocket (see this snippet), which is also evident by looking at the "network" tab when running their extension. They only have an MR to add WebTransport.