tls: `buf_read_from_tls()` is exiting its loop too early
This is a bit of a deep dive so you've been warned.
## Context

The `buf_read_from_tls()` function is called when a connection is ready to read data from a TLS socket into its inbuf (`buf_t`).

That function loops over the total length of data we are allowed to consume (a value affected by bandwidth limitation). In the general case, that is 16k bytes, or 32 cells, which is the hard-coded maximum a connection may read per mainloop round. The parameter is `at_most`, and it is the maximum amount of data we are asking to read.
## Wow OpenSSL

Within the read loop, we call `read_to_chunk_tls()`, which essentially boils down to calling `SSL_read(at_most)` (with tor using OpenSSL). It is given the `at_most` bytes value, and thus the expectation is that we'll get that amount of bytes back from the TLS socket.

Now, it turns out that `SSL_read()` only returns a single TLS record and will buffer the rest, up to `at_most`, if that limit is higher than a record size (in our case a record size is 4096 bytes).

See the man page for a better explanation. I might not have fully understood it, but experimental data shows that we get 4k bytes every time: https://www.openssl.org/docs/man1.0.2/man3/SSL_read.html
And thus the takeaway here is that `read_to_chunk_tls()` called with the upper limit of `16k` bytes will always return only `4k` bytes.
## The Bug

Putting everything together, let's look at the reading loop in `buf_read_from_tls()` that calls `read_to_chunk_tls()`; see if you can spot the problem:

```c
while (at_most > total_read) {
  [...]
  r = read_to_chunk_tls(buf, chunk, tls, readlen);
  if (r < 0)
    return r; /* Error */
  tor_assert(total_read+r <= BUF_MAX_LEN);
  total_read += r;
  if ((size_t)r < readlen) /* eof, block, or no more to read. */
    break;
}
```
Notice the `if` condition at the end of the loop: it assumes that if the returned length is smaller than the one asked for, we've reached EOF (exactly like `read()` behaves). But in the case of `SSL_read()`, this is not true, and so we bail every time after reading only a fraction of `at_most`, that is, one single TLS record.
I was able to confirm this by adding custom `log_debug` calls:

```
Jun 19 13:53:13.551 [debug] buf_read_from_tls(): tls: read len: 14858, at_most: 14858, total: 0
Jun 19 13:53:13.551 [debug] tor_tls_read(): tls read: at_most 14858, read: 4048
```

You can see that the read is only `4048` bytes (the TLS record overhead is removed, hence less than 4096).
## Solution

The obvious solution here would be to remove that `if` condition and let tor consume up to `at_most` bytes.
I did run an experiment with this and was able to see my onion service consuming 32 cells per round instead of 8.
## Performance

Now, the naive thinking here would be: "Ok wow, if we make tor read 4 times more cells, will tor's cell relaying improve 4-fold?". In practice, probably not.

But this will obviously improve performance in some ways. Take the KIST scheduler, which runs every 10 msec: it will now be able to make a decision based on 4 times the number of cells it used to see. If the TCP window allows it, more cells will go on the wire. I also strongly believe EWMA priority will get better here, because circuits will see more cells, and thus the back-offs will get sharper per circuit instead of constantly hovering around at most 8-16 cells per round.
It remains unclear if this will trigger more congestion on the network, bufferbloat, or anything unexpected. Because of that, we might want to not backport this right away and instead watch for a progression from newer relays.
And of course, we need the shadow people to help us measure this quantitatively :).
Note: I was able to find this issue because I was measuring, with the tracing branch #32910 (closed), the onion service timings to handle lots of INTRO2 cells under DDoS, and noticed that we would never process more than 8 cells per mainloop round.