Improve CGO performance one way or another
The current C tor CGO performance isn't great. For reference, here is how our current cell encryption performs on my CPU:
Inbound cells: 58.36 ns per cell. (0.11 ns per byte of payload)
Outbound cells: 60.66 ns per cell. (0.12 ns per byte of payload)
relay encrypt inbound: 301.01 ns per cell.
And here is how my branch in !879 (merged) performs:
CGO outbound at relay: 634.88 ns per cell.
CGO inbound at relay: 626.55 ns per cell.
After some research and hacking, and reading a lot of source code, I managed to get it down to:
CGO outbound at relay: 178.68 ns per cell.
CGO inbound at relay: 170.08 ns per cell.
CGO originate at relay: 657.40 ns per cell.
So, I sped it up by a factor of 3.5...
... but the current performance is still not great. In the case where a cell is not recognized (recognized field=0, and our current code doesn't need to touch sha1), we're something like 3x slower than the current cell encryption. In the case where we do need to originate or receive a cell, we're something like 2.2x slower.
(I'll need to clean this up!)
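For reference, the ratios quoted above can be recomputed from the benchmark figures in this ticket. This is only a quick arithmetic sketch: the dictionary keys are made-up labels, and the pairing of the current "relay encrypt inbound" figure with the CGO "originate" figure is an assumption about which numbers are being compared.

```python
# Benchmark figures copied from this ticket, in ns per cell.
# The key names are hypothetical labels, not identifiers from the tor source.
current = {
    "relay_inbound": 58.36,
    "relay_outbound": 60.66,
    # Assumption: this is the figure the CGO "originate" case is compared to.
    "originate": 301.01,
}
cgo_before = {"relay_outbound": 634.88, "relay_inbound": 626.55}
cgo_after = {
    "relay_outbound": 178.68,
    "relay_inbound": 170.08,
    "originate": 657.40,
}

# How much faster the optimized CGO code is than the first version.
speedup = {k: cgo_before[k] / cgo_after[k] for k in cgo_before}

# How much slower the optimized CGO code is than the current encryption.
slowdown = {k: cgo_after[k] / current[k] for k in cgo_after}

for name, value in speedup.items():
    print(f"speedup  {name}: {value:.2f}x")
for name, value in slowdown.items():
    print(f"slowdown {name}: {value:.2f}x")
```

Under those assumptions, the speedups come out slightly above 3.5, the relay-only slowdowns just under 3, and the originate slowdown close to 2.2, which matches the rounded figures in the text.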
Next possibilities include:
- Accepting this performance cost for now.
- Looking into optimizations to speed up key reinitialization.
- Looking into even more optimized polyval implementations.
- Looking into pipelining multiple cells, like the CGO authors do in their optimized code.
- Looking into whether there is any reasonable way to avoid the AES re-keying expense. (Another block cipher? Nonce evolution only? Another layer of TBC with an evolving tweak?)
- Looking into a different construction?
- We should probably at least benchmark the more straightforward wideblock cipher constructions to confirm that we're doing better than those. I don't think we have AEZ beat, but we should compare AES+Polyval HHFHFH, as well as Kravatte-WBC and maybe even Xoofff-WBC.
- Also we should sanity-check it versus AES-GCM, just to see how far off we are from current practice. (The speed of raw CTR mode for non-terminal hops is not something we'll be able to replicate for any replacement.)
 
(We'll need an analogous ticket for Arti once we've figured out how to go fast.)
Edited  by Nick Mathewson