Rate-limiting streams for flow control, and stream buffer bloat

In arti, data from an application is read from a socket and written to a DataWriter. This DataWriter buffers up to a cell's worth of bytes, and when flushed packages these bytes into an AnyRelayMsg and pushes it to a 128-cell mpsc queue (the stream queue). Later the circuit reactor will take a cell from this stream queue and push it to a channel queue.

A problem with this 128-cell stream queue is that it (1) contributes to buffer bloat, and (2) makes rate-limiting flow control more difficult. If we implement rate-limiting in the circuit reactor, then the user can completely fill this 128-cell queue when the stream has a low rate limit (or a rate limit of 0 due to an XOFF). If we implement rate-limiting in the DataWriter, then a change in the rate limit will immediately affect the user (since the DataWriter will block on the rate limit), which is good, but the up to 128 cells that are already queued in the stream queue will not be rate limited (they've already passed the rate limiter). So if we were to receive an XOFF (or an XON with a low rate limit), then the up to 64 KiB of data that is already queued in this stream queue would be sent unimpeded.
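
To make the "already past the rate limiter" arithmetic concrete, here is a small standalone Rust sketch (not arti code). It assumes a full cell carries 498 bytes of stream data, uses a std `sync_channel` as a stand-in for the stream queue, and the function name `bytes_queued_past_limiter` is made up for illustration. It fills the queue the way a fast writer would, then counts the bytes the reactor would still send after the writer stops on an XOFF.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Assumed number of stream data bytes in one full relay cell.
const CELL_PAYLOAD: usize = 498;

/// Fill a bounded queue of `queue_cells` cells, then return how many bytes
/// are already past a writer-side rate limiter when an XOFF arrives.
fn bytes_queued_past_limiter(queue_cells: usize) -> usize {
    // Stand-in for the stream queue between the writer and the reactor.
    let (tx, rx) = sync_channel::<Vec<u8>>(queue_cells);

    // The writer keeps pushing full cells until the queue refuses more.
    loop {
        match tx.try_send(vec![0u8; CELL_PAYLOAD]) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is alive"),
        }
    }

    // The XOFF arrives here: the writer stops sending, but the reactor
    // still drains everything that was queued before the limit changed.
    drop(tx);
    rx.iter().map(|cell| cell.len()).sum()
}
```

Under these assumptions a 128-cell queue holds 63,744 bytes (just under 64 KiB) that would bypass the limiter, and a 64-cell queue holds 31,872 bytes.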

In c-tor things work a bit differently, since bytes read from an edge socket are processed immediately and pushed into the circuit queue. I think the circuit queue size is limited by the cwnd (circuit_consider_stop_edge_reading, although conflux complicates this)? There is no stream queue like we have in arti.

My questions are:

Is this 128-cell queue too large under DoS conditions? For example, a client could start many download streams through a single exit but stop reading on those streams, so that the exit would buffer ~64 KiB of data for each stream. Maybe this is fine and we expect an arti OOM manager to deal with this? I think the same applies to onion services?
If we implement rate-limiting in the DataWriter, is it really that bad if we continue sending up to an additional 64 KiB of data after receiving an XOFF (not including bytes already in the channel queue, kernel, or in-flight on the network)? We could just consider these bytes "in-flight" once they pass through the DataWriter.
Do we want to reduce the size of this 128-cell queue anyway, just to reduce buffer bloat?
1. Would we reduce it to some smaller fixed size?
2. Would we want it to change dynamically according to the circuit's cwnd? And if so, how would we even do this?

My thinking is that we should perform rate-limiting stream flow control in the DataWriter, and reduce the cell queue to 64 cells rather than 128 (but this change is a bit arbitrary).
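
As a rough sketch of what rate-limiting in the DataWriter could look like, here is a minimal token bucket in standalone Rust. Everything in it is hypothetical: `TokenBucket`, its fields, and its methods are not arti types, and dropping all saved tokens on an XOFF is just one possible design choice. Time is passed in explicitly so the behaviour is easy to reason about.

```rust
use std::time::Duration;

/// Hypothetical writer-side token bucket (illustration only, not an arti type).
struct TokenBucket {
    /// Allowed rate in bytes per second; 0 models an XOFF.
    rate: u64,
    /// Maximum number of bytes that may accumulate while idle.
    burst: u64,
    /// Bytes that may be sent right now.
    tokens: u64,
}

impl TokenBucket {
    /// Start with a full bucket.
    fn new(rate: u64, burst: u64) -> Self {
        TokenBucket { rate, burst, tokens: burst }
    }

    /// Apply a new rate from an XON or XOFF. A rate of 0 also discards
    /// saved tokens so that the writer stops immediately.
    fn set_rate(&mut self, rate: u64) {
        self.rate = rate;
        if rate == 0 {
            self.tokens = 0;
        }
    }

    /// Credit the bytes earned over `elapsed`, capped at `burst`.
    fn refill(&mut self, elapsed: Duration) {
        let earned = u128::from(self.rate).saturating_mul(elapsed.as_micros()) / 1_000_000;
        let earned = earned.min(u128::from(u64::MAX)) as u64;
        self.tokens = self.tokens.saturating_add(earned).min(self.burst);
    }

    /// Return how many of the `want` bytes may be written now, and spend them.
    fn take(&mut self, want: u64) -> u64 {
        let granted = want.min(self.tokens);
        self.tokens -= granted;
        granted
    }
}
```

In this sketch the writer would call `take` before packaging a cell and wait when it returns 0. Cells that were granted earlier and are still sitting in the stream queue are unaffected by a later `set_rate`, which is why the queue size still matters.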

/cc @dgoulet @gabi-250 @mikeperry (Sorry for all the pings, I'm not sure who might be interested in this.)