Trace cell queue times in Tor to measure Tor application "congestion"

changed milestone to %Tor: unspecified

added component::core tor/tor kist milestone::Tor: unspecified owner::pastly parent::12541 points::4 priority::medium resolution::wontfix severity::normal sponsor::Q-can status::closed type::enhancement labels

Here are my notes about the instrumentation we need to properly test KIST:

What to record

read from kernel input buffer into a cell struct

connection_or_process_cells_from_inbuf()

Either a cell_t or a var_cell_t get created in this func. This is where we can set the unique id and add a call site.

change cell into a packed_cell

connection_or_write_cell_to_buf() connection_or_write_var_cell_to_buf()

Here the cell_t or var_cell_t gets "packed" into a packed_cell_t that can be written onto the network. The callsite is after the cell is packed but before it is written with connection_write_to_buf().

write network bytes from tor output buffer to kernel output buffer

In 2 above, the cell gets written to Tor output buffer in connection_write_to_buf(), and now we want to add a callsite when that cell gets written from Tor to the kernel.

Since the tor output buffer is just a buffer of raw bytes, we won't actually know when that cell gets written since we have no struct information. So we use a heuristic of counting bytes to guess instead.

After writing e.g. cell 123 in connection_write_to_buf(), check the length of the output buffer for that connection. Lets say the length is 2048 (4 cells). Now we know that cell 123 is in the 4th position, i.e. at buf[1536:2048]. A soon as that connection flushes the 2048th byte from the buffer, we know that cell 123 will have been completely transferred to the kernel.

We need to keep state in the meantime, keeping track of:

cell 123
conn XXX
bytes_remaining 2048

Then every time that connection XXX flushes some data, subtract the bytes_remaining counter. When the bytes_remaining counter is <=0, then call the final call site to log the "kernel write" event, and delete the state for cell 123.

Connection bytes written get reported in connection_buckets_decrement(). I think the bytes reported there might include the TLS overhead, so you need to make sure to subtract off that overhead to make sure the counter lines up correctly with bytes stored in the tor connection buffer.

Output format

Let's start by writing with Tor's log interface. If we could configure a log level at build time that would be helpful, otherwise log_notice.

Format is something like " " and of course the event name will be needed also.

Trac:
Component: - Select a component to Tor
Milestone: N/A to Tor: 0.2.8.x-final

So I want to fully understand here. The goal is:

Record when a cell is created with the amount of bytes

--> Keep in state the cell ID and size for a specific connection X.

Record the amount of bytes once it's packed (??)

--> Update the state with the packed size for the cell ID and connection X. --> Record the position of the packed cell in the output buf just after we wrote in it.

Record when that specific cell is written to the kernel

--> For connection X, record when the cell ID has been written to the kernel.

Is that correct?

Replying to dgoulet:

Is that correct?

Pretty much, though I don't think you need the cell size. You just need the position in the Tor outbuf in order to compute when the cell gets written to the kernel.

The flow will be like this:

Read bytes from kernel input socket to Tor incoming connection X buffer
Turn bytes from Tor incoming connection X buffer into a cell C
LOG: Record 'socket_read' event for Cell C ID and timestamp
Cell C gets processed
Cell C gets packed for next hop
Cell C gets written to Tor output connection Y buffer
STATE: Store Cell C ID and its position in the Tor output connection Y buffer in memory
STATE: Update buffer position of Cell C whenever output connection Y writes more bytes to kernel output socket
LOG: when position indicates Cell C is now in the kernel, record 'socket_write' event for Cell ID and timestamp
STATE: Clear state for Cell C

Does that help?

These seem like features, or like other stuff unlikely to be possible this month. Bumping them to 0.2.9

Trac:
Milestone: Tor: 0.2.8.x-final to Tor: 0.2.9.x-final

Trac:
Points: N/A to medium/large
Reviewer: N/A to N/A

This is linked to #13802 (moved) that is the need for an instrumentation framework so this cell measurement can use it. I already have some code prepared for this in branch ticket13802_028_01. Not working yet but at least a good start.

Trac:
Points: medium/large to 4
Keywords: N/A deleted, kist added
Milestone: Tor: 0.2.9.x-final to Tor: 0.2.???
Cc: dgoulet to N/A

Updated #13802 (moved) with this already.

Public branch with working shadow traces as we need them today is here.

https://github.com/pastly/public-tor/tree/ticket13802_028_01

Milestone renamed

Trac:
Milestone: Tor: 0.2.??? to Tor: 0.3.???

Finally admitting that 0.3.??? was a euphemism for Tor: unspecified all along.

Trac:
Milestone: Tor: 0.3.??? to Tor: unspecified
Keywords: N/A deleted, tor-03-unspecified-201612 added

New public branch. Ready for review.

https://github.com/pastly/public-tor/tree/ticket17598

Trac:
Owner: N/A to pastly
Status: new to assigned

Trac:
Status: assigned to needs_review

David's tracing branch is great, and I think it will be extremely useful for a variety of research and testing purposes. We have already implemented and are using a feature that uses the tracing branch to measure congestion times of cells in Tor.

If there are concerns about merging such individual "tracing modules" like this, I'd like to offer some reasons why I think it is a good idea to merge. Here are some advantages to merging tracepoint modules, in this case specifically our cell tracing module:

make sure the tracepoints are in the correct spot - future researchers/devs that want to measure congestion don't have to make the mistakes we first made about where to place the tracepoints
very few lines of code added to Tor for cell tracing (just a few tracepoints and a cell id) - most logic is in the cell tracing module
the cell tracing feature doesn't get compiled in unless manually setting a configure option
the cell tracing module is self contained and private and therefore not vulnerable to code rot, the only code that needs to be maintained are the tracepoint locations, which are indeed valuable to maintain by those who know the codebase and not by researchers/devs who want cell congestion data but are not Tor code experts
currently, our cell congestion tracing implementation minimizes the lines of code changed in Tor, but we could instead opt to provide more fine grained tracepoints - we could trace cell queue times in all of the inbuf and outbuf queues separately - more data could mean more useful to more researchers

Assigning this to 0.3.1 because it has an implementation that's ready for review, and 0.3.1 is the next release. (0.3.0 was frozen a few weeks ago.)

robgjansen, pastly: a hint about how we use trac: tickets that aren't in a milestone (or that are in an outdated or far future milestone) don't get looked at.

Trac:
Milestone: Tor: unspecified to Tor: 0.3.1.x-final

Needs revision.

Log lines like the following are wrong and likely the source of problems I had many many months ago.

"%lu.%lu id=%" PRIu32, cell->ts.tv_sec, cell->ts.tv_nsec, cell->id

This would print a timestamp of 1.9 seconds when we're actually 1.000000009 seconds.

There's also an incrementing dummy variable that needs to go away.

Trac:
Status: needs_review to needs_revision

This will not make it into 031. In addition to the above bug, it does not trace cells that originate at an exit (and I suspect cells that originate at the OP).

It only traces cells that come in from connection_or_process_cells_from_inbuf. It probably should trace cells that are created in relay_send_command_from_edge_ too.

This is a big deal for our needs.

Maybe tracing cells that don't go through the inbuf and start already halfway through Tor is different enough that they should be logged differently. For us right now, I don't think the difference is important and we really want the data from the additional cells.

Trac:
Milestone: Tor: 0.3.1.x-final to Tor: 0.3.2.x-final

Remove an old triaging keyword.

Trac:
Keywords: tor-03-unspecified-201612 deleted, N/A added

Trac:
Sponsor: N/A to SponsorQ-can

Trace cell queue times in Tor to measure Tor application "congestion"

Child items 0

Activity