Multithreaded crypto on Tor
The master bug for this task is #1749.
So, there are three main kinds of crypto that servers do a lot of:
- onionskin processing. The server side of this is already parallelized in cpuworker.c
- SSL handshakes and crypto.
- AES-crypto on cells.
Parallelizing cell crypto seems like the easiest to do as a start. Here are my current thoughts on that.
Parallelizing cell crypto
Right now, we do relay cell crypto in relay_crypt_one_payload, which gets called in a few places, all of which are fortunately in relay.c. In circuit_package_relay_cell, it gets called 1..N times for a cell that we're packaging from an edge connection and about to stick on a cell queue. In relay_crypt, we just got a relay cell on a circuit, and we are about to decrypt it, see whether it should get handled by us, and maybe pass it on or maybe process it. If we handle it, we are going to pass it to connection_edge_process_relay_cell. If we pass it on, we are going to append it to a cell_queue_t.
For pure relays, the relay_crypt() case dominates. For exit nodes, the circuit_package_relay_cell() case matters too.
I think that the right solution here involves adding another pair of cell queue structures (not necessarily cell_queue_t) to each circuit, containing cells going in each direction that have not yet been encrypted. We can then have a few worker threads whose job it is to do the relay crypto for such circuits.
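As a rough illustration, the per-circuit addition might look something like the following sketch; the type and field names here are invented and don't exist in Tor today:
{{{
/* Sketch only: hypothetical per-circuit queues of not-yet-crypted cells.
 * None of these type or field names exist in Tor today. */

struct cell_s;                       /* stands in for Tor's cell type */

typedef struct uncrypted_cell_s {
  struct uncrypted_cell_s *next;     /* singly linked FIFO */
  struct cell_s *cell;               /* the cell awaiting relay crypto */
} uncrypted_cell_t;

typedef struct uncrypted_cell_queue_s {
  uncrypted_cell_t *head;
  uncrypted_cell_t *tail;
  int n_cells;                       /* for the priority logic later on */
} uncrypted_cell_queue_t;

/* Each circuit would grow one such queue per direction, alongside the
 * existing queues of already-crypted cells waiting for a connection. */
typedef struct circuit_crypt_queues_s {
  uncrypted_cell_queue_t uncrypted_toward_origin;
  uncrypted_cell_queue_t uncrypted_away_from_origin;
} circuit_crypt_queues_t;
}}}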
Cell crypto complications
We will want some kind of a work-queue implementation for passing info to the worker threads. To implement this, we'll probably need a condition-variable implementation. On pthreads this is trivial: just use pthread_cond_t. For pre-Vista Windows, it's less so. Fortunately, I needed to write a windows condition variable implementation for Libevent, so we can just use that one.
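For reference, here is a minimal sketch of the lock-plus-condition-variable pattern for such a work queue, using raw pthreads; in Tor the direct pthread calls would be hidden behind tor_mutex_t/tor_cond_t (plus the Windows condition-variable shim), and all the names below are hypothetical:
{{{
#include <pthread.h>
#include <stddef.h>

/* Minimal sketch of a condition-variable-backed work queue. */

typedef struct work_item_s {
  struct work_item_s *next;
  void *payload;                      /* e.g. a (circuit, direction) job */
} work_item_t;

typedef struct work_queue_s {
  pthread_mutex_t lock;
  pthread_cond_t nonempty;
  work_item_t *head, *tail;
} work_queue_t;

/* Producer side: append an item and "alert" a sleeping worker. */
void
work_queue_push(work_queue_t *q, work_item_t *item)
{
  item->next = NULL;
  pthread_mutex_lock(&q->lock);
  if (q->tail)
    q->tail->next = item;
  else
    q->head = item;
  q->tail = item;
  pthread_cond_signal(&q->nonempty);
  pthread_mutex_unlock(&q->lock);
}

/* Consumer side: the blocking "extract" operation.  The while loop
 * guards against spurious wakeups. */
work_item_t *
work_queue_pop_blocking(work_queue_t *q)
{
  work_item_t *item;
  pthread_mutex_lock(&q->lock);
  while (q->head == NULL)
    pthread_cond_wait(&q->nonempty, &q->lock);
  item = q->head;
  q->head = item->next;
  if (q->head == NULL)
    q->tail = NULL;
  pthread_mutex_unlock(&q->lock);
  return item;
}
}}}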
We need to keep multiple workers from messing with the same circuit at once.
Our current circuit priority logic (which picks which circuits get their cells delivered first) favors circuits with small queues. This will need to start looking not only at the number of crypted cells waiting to be put onto connection buffers, but also at the number of the not-yet crypted cells on the circuit.
We don't necessarily want to crypt all circuits' cells with the same priority. Rather, we will probably want to use calculations similar to those used by the circuit priority logic.
We should design our data structures carefully and avoid like the plague anything that will force us to iterate over all circuits, or over all connections. That's really slow.
Alerting the main thread is a little nontrivial with pre-2.0 Libevent, but we can always fake it with a socketpair.
This won't do a lick of good on OpenBSD, which still doesn't have kernel multithreading. I don't care, though.
Parallelizing SSL: Design 1
This gets lots easier, I think, if we're using Libevent 2.0 and bufferevents, since Libevent 2.0 already has support for locking its buffers, and keeps its SSL code nice and isolated much better than Tor does with its existing buffer implementation.
See bufferevent_openssl.c in the current libevent code for the ugly details.
One of the main tasks for Libevent 2.1 is better multithreading support. (Right now, Libevent 2.0 lets you use it in a multithreaded way, but doesn't actually help you out by giving you a worker thread pool implementation, or automatic support for running stuff in an alternate thread.)
Assuming the following changes in Libevent, we get multithreaded SSL "for free":
- Built-in worker thread pool support, with the ability to say that a given callback can be invoked in a worker thread.
- Change be_openssl_enable so that rather than calling consider_reading and consider_writing on its own, it schedules them to get run by a worker thread.
Once we have these, we can mark the callbacks used to implement bufferevent_openssl as parallelizable, and we should be basically set.
Of course, there is at least a little bit of design-and-implementation work to get through here.
Parallelizing SSL: Design 2
This version only requires Libevent 2.0, and a bit of chutzpah and hacking.
Instead of hoping to get some as-yet-undesigned Libevent thread-pool stuff to work, we just run a few Libevent event loops in different threads, and divide up our SSL bufferevents over them "evenly". Libevent can handle this just fine.
Of course, Tor would need to have some more code in it. We'd need to make sure that when a read or write callback happened on an SSL bufferevent in its loop thread, our connection callback instead got activated in the main thread. (No biggie.)
We'd need to balance and maybe rebalance connections fairly. (How hard could that be, really?)
We'd need to avoid race conditions when inspecting things about bufferevents, adding to them, closing them, etc.
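A very rough sketch of the per-thread event loop idea, using Libevent 2.0's real evthread_use_pthreads()/event_base_dispatch() calls but leaving out the bufferevent assignment, the rebalancing, and the "bounce the callback to the main thread" plumbing; N_SSL_THREADS and the function names are placeholders:
{{{
#include <pthread.h>
#include <event2/event.h>
#include <event2/thread.h>

#define N_SSL_THREADS 4

static struct event_base *ssl_bases[N_SSL_THREADS];
static pthread_t ssl_threads[N_SSL_THREADS];

static void *
ssl_loop_thread(void *arg)
{
  struct event_base *base = arg;
  /* With Libevent 2.0 we'd keep a persistent dummy event on the base so
   * the loop doesn't exit when it briefly has nothing to do. */
  event_base_dispatch(base);
  return NULL;
}

int
launch_ssl_loops(void)
{
  int i;
  if (evthread_use_pthreads() < 0)     /* turn on Libevent's locking */
    return -1;
  for (i = 0; i < N_SSL_THREADS; ++i) {
    ssl_bases[i] = event_base_new();
    if (!ssl_bases[i])
      return -1;
    if (pthread_create(&ssl_threads[i], NULL, ssl_loop_thread,
                       ssl_bases[i]) != 0)
      return -1;
  }
  return 0;
}

/* Round-robin is the simplest way to divide connections up "evenly". */
struct event_base *
pick_ssl_base(void)
{
  static int next = 0;
  struct event_base *base = ssl_bases[next];
  next = (next + 1) % N_SSL_THREADS;
  return base;
}
}}}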
Coding plan
Phase 1: Cell crypto
Cell crypto: The design
We'll need a set of worker threads. We'll pass data to worker threads on a basic unbounded work queue implementation. For starters, all threads can share a single queue, though we should keep the implementation abstract so that we can support other modalities in the future. For communicating back to the main thread, we'll also use an unbounded queue. This means that we will need two work queue implementations.
NOTE that work queues are not the same as cell queues. Work queues are how one thread tells another, "Here's something to do." Cell queues are how we keep all the cells waiting on a given circuit in order.
Each circuit gets two new cell queues for not-yet-crypted cells in each direction. When we decide to send a cell on a circuit, instead of crypting it immediately, we put it in the appropriate outgoing not-yet-crypted cell queue. When we receive a RELAY cell on a circuit, instead of crypting it immediately and deciding whether we recognize it or not, we also put it in the appropriate not-yet-crypted queue. This means that a cell can go on the not-yet-crypted queue on a circuit for a direction that the circuit doesn't have a connection in: if we're the client, or the last hop, we'll stick all our incoming cells in the appropriate not-yet-crypted queue as if we were going to send them on to the nonexistent next hop, though really we'll expect to recognize all of them after crypting.
The basic unit of work for a circuit will be, conceptually, "Process some cells on this circuit going in direction X". That means you don't add cells to the work queues: you add "circuit,direction" tuples instead. To avoid starvation, let's do only N cells per circuit at a time, then send it back to the end of the work queue.
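Concretely, a work item might be no more than this; the names and the batch size are placeholders for this sketch:
{{{
/* Sketch: the unit of work handed to a crypto worker is a circuit plus a
 * direction, not an individual cell.  All names are hypothetical. */

struct circuit_s;                   /* stands in for Tor's circuit type */

typedef enum {
  CRYPT_DIR_TOWARD_ORIGIN,
  CRYPT_DIR_AWAY_FROM_ORIGIN
} crypt_direction_t;

/* How many cells a worker crypts per pass before putting the circuit
 * back at the end of the work queue, so one busy circuit can't starve
 * the rest.  The value is a placeholder. */
#define CELLS_PER_CRYPT_PASS 16

typedef struct crypt_job_s {
  struct crypt_job_s *next;         /* FIFO link inside the work queue */
  struct circuit_s *circ;           /* whose cells need crypto */
  crypt_direction_t direction;      /* which uncrypted queue to drain */
} crypt_job_t;
}}}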
To do the crypto on a cell on a circuit, all you need is:
- The cell
- The appropriate key or keys
- The appropriate digest object or objects
- A direction
- Whether it is an origin circuit
- The target layer for the cell (if sending from origin)
State changes in:
- The key(s)
- The digest object(s)
The output is:
- The resulting crypted cell
- A "recognized" flag.
- If we're the origin, a "layer hint" telling us which hop sent us this cell.
We want to do all of our crypto in-place, so this should basically all be done with packed_cell_t or equivalent. We should add to each packed_cell_t enough bits to hold the recognized flag, and the layer info. (Q: What do we do about the layer hint? If we give a pointer to the thing, we need to worry about it going away. If we give a numeric layer number, we need to worry about truncate/reextend cases, right?)
Locks need to protect all of the above fields. That includes the cell queues, the keys, the digest objects, and the cpath.
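A sketch of the extra per-cell fields and the crypt/handle entry points from the plan below; every name here is hypothetical (today's packed_cell_t is essentially just a queue pointer plus the packed body), so treat this as an illustration of the in-place idea rather than a real interface:
{{{
#include <stdint.h>

#define PACKED_CELL_BODY_LEN 512    /* placeholder; use Tor's CELL_NETWORK_SIZE */

typedef struct mt_packed_cell_s {
  struct mt_packed_cell_s *next;
  char body[PACKED_CELL_BODY_LEN];  /* AES-CTR runs in place on this */
  unsigned int recognized : 1;      /* output: the cell was addressed to us */
  uint8_t layer;                    /* origin circuits: which hop this cell
                                     * came from (output) or is destined for
                                     * (input, when packaging).  Kept as a
                                     * hop number rather than a crypt_path_t
                                     * pointer; see the truncate/re-extend
                                     * question above. */
} mt_packed_cell_t;

struct circuit_s;
typedef enum { MT_DIR_TOWARD_ORIGIN, MT_DIR_AWAY_FROM_ORIGIN } mt_direction_t;

/* Crypt one cell in place with the circuit's keys and digests for the
 * given direction, updating cell->recognized and cell->layer.
 * Meant to be callable from a worker thread under the crypto lock(s). */
int mt_relay_crypt_packed_cell(struct circuit_s *circ,
                               mt_packed_cell_t *cell,
                               mt_direction_t direction);

/* Main-thread step: queue the crypted cell onward, process it locally if
 * recognized, or drop it. */
int mt_relay_handle_crypted_cell(struct circuit_s *circ,
                                 mt_packed_cell_t *cell,
                                 mt_direction_t direction);
}}}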
When a worker is done crypting a cell, it puts it on a "crypted, not 'handled'" queue. 'Handling' a cell is the main thread's job. It will consist of putting the cell on the appropriate outgoing queue (if it's outgoing), handling it locally (if it's recognized), or dropping it.
Q: How many locks do we want here? It would be bad if doing crypto on a circuit (which is slow) blocked adding/removing cells on a circuit's uncrypted and unhandled queues (which is and needs to be fast). Maybe we can get away with one lock for cell queues, and one for each direction's worth of crypto.
Whenever you add the first cell to a circuit for crypto, or for handling, you need to stick the circuit on the appropriate work queue.
Closing a circuit gets a little complicated in this case. If the circuit is marked, you never add it to a work queue. If a worker sees a marked circuit, it should simply remove it from the work queue and do no processing on it. (Right?) The only trick is that you can't actually close a marked circuit while it is still on a work queue or getting handled by a worker thread.
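One way to make that safe is a small per-circuit reference count, as sketched below; the names are hypothetical, and this is only meant to illustrate the "never free while a worker or queue still holds it" rule:
{{{
#include <pthread.h>

struct circuit_s;
int  circuit_is_marked_for_close(const struct circuit_s *circ);
void circuit_really_free(struct circuit_s *circ);

typedef struct circ_crypt_ref_s {
  pthread_mutex_t lock;
  int refcount;               /* main thread + each queue/worker holding it */
  struct circuit_s *circ;
} circ_crypt_ref_t;

/* Main thread: take a reference before queueing the circuit for crypto.
 * Marked circuits never get queued, so this fails for them. */
int
circ_crypt_ref_acquire(circ_crypt_ref_t *ref)
{
  int ok = 0;
  pthread_mutex_lock(&ref->lock);
  if (!circuit_is_marked_for_close(ref->circ)) {
    ++ref->refcount;
    ok = 1;
  }
  pthread_mutex_unlock(&ref->lock);
  return ok;
}

/* Worker or main thread: drop a reference.  The circuit is freed only
 * once it is marked and nobody else still holds it. */
void
circ_crypt_ref_release(circ_crypt_ref_t *ref)
{
  int do_free;
  pthread_mutex_lock(&ref->lock);
  do_free = (--ref->refcount == 0) &&
            circuit_is_marked_for_close(ref->circ);
  pthread_mutex_unlock(&ref->lock);
  if (do_free)
    circuit_really_free(ref->circ);
}
}}}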
Cell crypto: The plan
- Crypto abstractions
  a. Add needed fields to packed_cell_t
  a. Refactor the relay crypto code so that it plays nicely with these cell types.
     i. Add a function to crypt a packed_cell_t on a circuit.
     i. Add a function to handle a crypted packed_cell_t on a circuit.
     i. Rewrite the functions that currently call relay crypt (directly or indirectly) so that they use these new functions instead.
  a. Add locks
  a. Write a basic "crypting cell queue" type to hold uncrypted cells.
  a. Write a basic "unhandled cell queue" type to hold unhandled cells.
  a. Refactor the relay crypto code so that instead of immediately crypting/handling a cell, it puts the cell on a crypting queue.
     i. '''For now''' [***] the code should immediately dispatch to the "process the crypting queue" function, and the "process the needs-handling queue" function.
  a. Testing for all of the above. If we designed it right, we can now unit-test individual functions except (maybe) for the handle function. Also, we should make sure that our refactorings didn't break Tor as a client or server.
- Threading abstractions (can happen in parallel with crypto abstractions)
  a. Conditions. We have some initial tor_cond_t code for pthreads in compat.c; we should add a Windows implementation (maybe using the one Nick wrote for Libevent).
  a. Work Queues. The implementation here is pretty simple: keep a queue of objects (using a doubly-linked list, a ring buffer, a pair of smartlists, or whatever). Lock it when you add or remove something, or inspect the queue. When you add something to a previously empty queue, you need to "alert". There needs to be a blocking and a nonblocking "extract" operation. There should be a "clear" operation.
     i. Condition-based alerting
     i. Event/socketpair-based alerting
  a. Thread pools. Code to launch an appropriate number of threads that do work from a work queue. This should be abstracted a little; from the main thread's POV, it just wants a "launch", "add work unit", "sanity-check", and "shutdown."
  a. Unit tests for all of the above.
- Integration
  a. Write the worker thread code. The algorithm is more or less "Remove a circuit/direction from the work queue. Grab N uncrypted cells. Crypt them. Stick them on the to-handle queue. If the to-handle queue was empty, stick the circuit on the needs-handling work queue. If there are still cells waiting to get crypted on this circuit, stick this circuit at the end of the work queue." (See the sketch after this list.)
  a. Write the handler-loop code. The algorithm is more or less "Remove a circuit from the work queue. Process all of its unhandled cells."
  a. Clean up the code to handle closing circuits to make sure that we can't cause races. A refcnt would do, but we're talking a refcnt that doesn't get over 2 or 3. It would need locking though. Hm.
  a. Write the code to stick circuits on the work queue. This replaces the [***] logic above with an algorithm more or less like "If this circuit was previously empty, stick it on the work queue," except for non-multithreaded environments, where we just call the "crypt" and "handle" functions ourselves.
- Gilding the lily
  a. We should optimize the thread pool/work queue implementation a bit. Maybe we're better off with separate work queues for each thread or something.
  a. Probe automatically for the number of CPUs when NumCPUs is 0.
  a. Make queue-EWMA code (maybe) consider uncrypted/unhandled cell counts.
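Here is a sketch of the worker loop and handler loop described in the integration step above. Every helper named here is hypothetical; only the control flow matters (batch N cells, alert the main thread when the to-handle queue goes non-empty, re-queue the circuit if work remains):
{{{
#include <stddef.h>

typedef struct crypt_job_s crypt_job_t;   /* a (circuit, direction) tuple */

#define CELLS_PER_PASS 16                 /* placeholder batch size */

/* Hypothetical helpers, defined elsewhere in this design: */
crypt_job_t *crypt_workqueue_pop_blocking(void);
void crypt_workqueue_push(crypt_job_t *job);
int  job_tohandle_queue_is_empty(const crypt_job_t *job);
void job_crypt_some_cells(crypt_job_t *job, int max);  /* appends crypted
                                                        * cells to the
                                                        * to-handle queue */
int  job_has_uncrypted_cells(const crypt_job_t *job);
void handling_queue_push(crypt_job_t *job);
crypt_job_t *handling_queue_pop_nonblocking(void);
void job_handle_all_crypted_cells(crypt_job_t *job);
void alert_main_thread(void);             /* socketpair or Libevent event */

/* Worker thread body. */
void *
crypto_worker_main(void *arg)
{
  (void) arg;
  for (;;) {
    crypt_job_t *job = crypt_workqueue_pop_blocking();
    int was_empty = job_tohandle_queue_is_empty(job);

    job_crypt_some_cells(job, CELLS_PER_PASS);

    if (was_empty) {
      /* The main thread didn't yet know this circuit needs handling. */
      handling_queue_push(job);
      alert_main_thread();
    }
    if (job_has_uncrypted_cells(job)) {
      /* More to do: go to the back of the line so others get a turn. */
      crypt_workqueue_push(job);
    }
  }
  return NULL;
}

/* Main thread: runs when alerted; processes all unhandled cells. */
void
handle_crypted_cells(void)
{
  crypt_job_t *job;
  while ((job = handling_queue_pop_nonblocking()) != NULL)
    job_handle_all_crypted_cells(job);    /* queue onward, process, or drop */
}
}}}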
Athena's Notes
On cell_t vs. packed_cell_t:
- Why this talk of adding things to packed_cell_t in https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto ?
- Cells get packed in cell_queue_append_packed_copy(), called from append_cell_to_circuit_queue()
- The append_cell_to_circuit_queue() function is called from circuit_receive_relay_cell() which does a relay_crypt() earlier on, from circuit_package_relay_cell() (in the endpoint case), which uses relay_crypt_one_payload() with cell->payload of a cell_t. There are also two calls to append_cell_to_circuit_queue() in circuitbuild.c which are not involved in relaying and won't be involved in parallel crypto.
- So, we need to either fold the layer_hint and recognized output parameters of relay_crypt() into fields of cell_t, or the job queue has to associate those fields with the cells somehow. Since jobs will be at a granularity larger than one cell, the former is probably simpler.
- I don't understand the "We want to do all of our crypto in-place, so this should basically all be done with packed_cell_t or equivalent."; relay_crypt() and relay_crypt_one_payload() both operate on cell_t, and ultimately come down to crypto_cipher_crypt_inplace() on cell->payload. Since we always work in terms of cell_t and only turn it into packed_cell_t on the way out after the crypto is done, using cell_t seems more natural.
- I do not think we should use cell_queue_t, since appending a cell_t to it involves introducing an extra copy. It would be acceptable if we were working in terms of packed_cell_t, but to do that we would have to do the copy-and-pack before the crypto, then unpack again in the worker thread to figure out what the payload we're supposed to be crypting actually is, then re-pack it. We should only ever be doing crypto ops in place on cell_t and passing around pointers to cell_t like we do now.
- So, we need a queue of cell_ts. Do we want to add a next field to cell_t as well as the layer_hint and recognized ones, or do we want some other means of queueing them?
On threading structures:
- We should have a relaycrypt_dispatcher_t or something like that keeping track of what circuits need work (a rough struct/API sketch follows this list). It should have a set of (circuit_t, direction) tuples, and for each:
- A queue of crypted cells to send out on this circuit
- A queue of cells to work on
- These should be here rather than on the circuit because, for efficiency, we will want the jobs to be at a larger granularity than one cell at a time, so attaching them to the circuit means either the worker thread holds a lock on the circuit for a relatively long period of time, or we need a complicated means of notifying the workers in the middle of a job if the circuit is going away. Keeping the queues associated with the job object rather than the circuit means we can handle closing down a circuit by just having the main thread mark the associated job object as dead; the dispatcher won't dispatch it any more and will free it once whatever thread is working on it finishes. We avoid the complexity of interrupting jobs at the expense of sometimes leaving worker threads to finish work we'll just throw away.
- It follows that the job size should not be so large as to take more than perhaps a few hundred milliseconds for the worker thread: without an interrupt mechanism, the natural way to shut workers down is to return an "exit now" indication from the function they call to get another job, so the length of time a job takes to finish defines the latency we can achieve when killing them as Tor exits.
- An indication of which worker thread this is assigned to, if any
- A flag to indicate if this is a dead circuit and should be removed from the dispatcher as soon as it is safe to do so.
- The relaycrypt_dispatcher_t should also have a list of worker threads, and which (circuit_t, direction) is currently assigned to a thread, if any, and a flag signalling if the thread should exit (if we're reducing the number of worker threads or shutting down Tor).
- We should support the following operations on relaycrypt_dispatcher_t
- By the main thread: add a (circuit_t, direction) tuple to the list if it is not already present and queue a cell on it.
- By a worker thread: get a (circuit_t, direction) tuple to work on
- How should we choose? Round-robin? By queue length (I think this is a risk for starvation)? By which has the oldest cell (seems fairest and most starvation-free, but is heaviest on book-keeping, since it needs a timestamp for every cell_t in the in-queue of when it was queued)?
- This is also where we should block if no work is currently available, or return a signal that the thread should exit.
- By a worker thread: stop working on a job and return it to the dispatch queue if necessary
- Locking:
- The relaycrypt_dispatcher_t needs to track jobs and worker threads in ways that both the workers and the main thread can read and modify, so these shared structures will need a lock. These manipulations will typically be quite fast, things like picking a new job to work on or adding one, and in all cases bounded-time, with nothing ever blocking while holding the lock. Is it acceptable for the main thread to block on a mutex here, or do we need to design a mechanism that's lock-free for the main thread?
- If so, maybe use a queue that only the main thread ever modifies, with a bit per job indicating if it's been taken up by the workers; the workers then poll it, copy those jobs into the main, lock-protected structure when they check for work, and set that bit. The main thread would clean up entries with the taken-up bit set whenever it next adds something to the queue.
- Otherwise, the dispatch structures are only accessed by worker threads and normal locking that might block them briefly is acceptable.
- The input cell queue is enqueued to by the main thread and dequeued from by worker threads. Should we use a similar mechanism?
- Reverse case for the output queue
- Notifications:
- There needs to be a way for the main thread to unblock a blocked worker thread when new work becomes available. Easiest option is a condition variable/mutex pair with pthreads, but how does this work on Win32?
- There should be a way for a worker thread to schedule an invocation on the main thread of a handler for the crypted output.
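Pulling the notes above together, the dispatcher might look roughly like this; every type, field, and function name is invented for this sketch, and the queue types are left opaque:
{{{
#include <pthread.h>

struct circuit_s;
typedef struct rc_cell_queue_s rc_cell_queue_t;     /* stand-in queue type */
typedef enum { RC_DIR_TOWARD_ORIGIN, RC_DIR_AWAY_FROM_ORIGIN } rc_direction_t;

typedef struct relaycrypt_job_s {
  struct relaycrypt_job_s *next;
  struct circuit_s *circ;          /* which circuit this job is for */
  rc_direction_t dir;              /* which direction's cells */
  rc_cell_queue_t *input;          /* cells waiting to be crypted */
  rc_cell_queue_t *output;         /* crypted cells to hand back */
  int assigned_worker;             /* worker index, or -1 if unassigned */
  unsigned int dead : 1;           /* circuit closing; free once idle */
} relaycrypt_job_t;

typedef struct relaycrypt_dispatcher_s {
  pthread_mutex_t lock;            /* protects everything below */
  pthread_cond_t work_available;
  relaycrypt_job_t *jobs;          /* all known (circuit, direction) jobs */
  int n_workers;
  pthread_t *workers;
  relaycrypt_job_t **current;      /* current[i] = job worker i holds */
  unsigned int shutting_down : 1;  /* tells workers to exit */
} relaycrypt_dispatcher_t;

/* Main thread: find or create the job for (circ, dir), enqueue a cell on
 * its input queue, and wake a worker if the job was idle. */
void relaycrypt_queue_cell(relaycrypt_dispatcher_t *d,
                           struct circuit_s *circ, rc_direction_t dir,
                           void *cell);

/* Worker thread: block until a job is available; returns NULL when the
 * worker should exit (shutdown, or shrinking the pool). */
relaycrypt_job_t *relaycrypt_get_work(relaycrypt_dispatcher_t *d,
                                      int worker_index);

/* Worker thread: done with (or giving up on) a job; put it back so it
 * can be dispatched again if it still has input cells, or freed if it
 * is dead. */
void relaycrypt_return_work(relaycrypt_dispatcher_t *d,
                            relaycrypt_job_t *job, int worker_index);
}}}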
Changes to existing files needed:
- Changes to relay.c:
- We will have to refactor circuit_receive_relay_cell() and possibly circuit_package_relay_cell(); they'll be split into two functions. The first will be called where the existing function is called now, and do everything before the crypto op, then put the cell to crypt in the appropriate queue and set up a job object on the relay worker dispatcher. The second will be invoked (how?) when the crypted cells are available and perform everything the old version did after the crypto ops.
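A sketch of what that split might look like at the signature level; both function names are made up, and the real arguments would mirror whatever circuit_receive_relay_cell() needs today:
{{{
struct cell_s;
struct circuit_s;
typedef enum { SPLIT_DIR_IN, SPLIT_DIR_OUT } split_direction_t;

/* First half: called where circuit_receive_relay_cell() is called today.
 * Does the pre-crypto checks, then queues the cell with the relaycrypt
 * dispatcher instead of crypting it inline. */
int circuit_receive_relay_cell_precrypt(struct cell_s *cell,
                                        struct circuit_s *circ,
                                        split_direction_t dir);

/* Second half: invoked from the main thread (via the handling queue and
 * the alert mechanism) once a worker has crypted the cell.  Does what
 * the tail of the current function does: recognition handling, onward
 * queueing, or dropping. */
int circuit_receive_relay_cell_postcrypt(struct cell_s *cell,
                                         struct circuit_s *circ,
                                         split_direction_t dir,
                                         int recognized);
}}}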
- Changes to config.c, main.c:
- We'll have to create the relay job dispatcher on startup, and close it and kill all the worker threads on shutdown.
- We'll need a config option for the number of worker threads, and we'll need to create more threads or shut some down on the fly if it gets changed.
- There should be an option to disable all threading and make circuit_receive_relay_cell() and circuit_package_relay_cell() have the old behavior, if someone has some weird old machine perhaps.
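For the "probe for the number of CPUs when the option is 0" idea (also mentioned under "Gilding the lily" above), something like the following would do; the option name NumRelayCryptWorkers and both functions are made up for this sketch, and _SC_NPROCESSORS_ONLN, while not strictly POSIX, is available on Linux, the BSDs, and OS X:
{{{
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>
#endif

static int
probe_num_cpus(void)
{
#ifdef _WIN32
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  return (int) info.dwNumberOfProcessors;
#else
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  return n > 0 ? (int) n : 1;
#endif
}

/* NumRelayCryptWorkers > 0: use that many worker threads.
 * NumRelayCryptWorkers == 0: one worker per detected CPU. */
int
compute_num_crypto_workers(int configured)
{
  return configured > 0 ? configured : probe_num_cpus();
}
}}}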