Latest profiling at commit 99e915dbfea9b2b985a10368721d9c79b358abbe
{{{
+  15.86%  tor  [.] fmonty
+   3.79%  tor  [.] circuitmux_find_map_entry
+   2.81%  tor  [.] connection_bucket_refill
+   2.57%  tor  [.] connection_handle_read_impl
+   1.99%  tor  [.] buf_datalen
+   1.79%  tor  [.] assert_connection_ok
+   1.52%  tor  [.] connection_or_process_cells_from_inbuf
}}}
I can see a couple of ways of reducing circuitmux_find_map_entry() CPU time by calling it less. For instance, update_circuit_on_cmux_() calls it three times in the same function, querying the same (cmux, circ) pair each time.
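To illustrate the shape of that change with a runnable toy (the struct and helper names below are stand-ins, not Tor's actual circuitmux types): do the lookup once, keep the pointer, and reuse it for every subsequent check in the same function.

{{{
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the cmux hashtable entry and its lookup; in Tor the
 * equivalent probe is circuitmux_find_map_entry(cmux, circ). */
struct muxinfo { int cell_count; int active; };
static struct muxinfo table[4];

static struct muxinfo *
find_map_entry(int cmux_id, int circ_id)
{
  /* Pretend this is the hashtable probe showing up in the profile. */
  return &table[(cmux_id * 3 + circ_id) & 3];
}

/* Sketch of the caching pattern: one probe per call instead of three. */
static void
update_circuit(int cmux_id, int circ_id, int n_cells)
{
  struct muxinfo *ent = find_map_entry(cmux_id, circ_id); /* single lookup */

  ent->cell_count += n_cells;          /* reuse the cached pointer... */
  ent->active = (ent->cell_count > 0); /* ...instead of probing again */
}

int
main(void)
{
  memset(table, 0, sizeof(table));
  update_circuit(1, 2, 5);
  printf("cells=%d active=%d\n", table[1].cell_count, table[1].active);
  return 0;
}
}}}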
Clearly the biggest part of the CPU time is crypto. Here is another view, without a procname "tor" filter. The first value, 37.04%, is kernel work.
{{{
+  37.04%  tor  [ctr]                  [k] 0xffffffff8104f45a
+   3.85%  tor  tor                    [.] fmonty
+   2.85%  tor  libcrypto.so.1.0.0     [.] gcm_ghash_clmul
+   2.82%  tor  libcrypto.so.1.0.0     [.] aesni_ctr32_encrypt_blocks
+   2.14%  tor  libcrypto.so.1.0.0     [.] aesni_cbc_sha1_enc_avx
+   2.01%  tor  libcrypto.so.1.0.0     [.] bn_sqr4x_mont
+   1.62%  tor  libc-2.19.so           [.] _int_malloc
+   1.60%  tor  libcrypto.so.1.0.0     [.] sha1_block_data_order_avx
+   1.04%  tor  libc-2.19.so           [.] __memcpy_sse2_unaligned
+   0.95%  tor  libevent-2.0.so.5.1.9  [.] event_base_loop
+   0.95%  tor  libz.so.1.2.8          [.] 0x0000000000002a87
+   0.92%  tor  tor                    [.] circuitmux_find_map_entry
}}}
Time to dust off #8897 (moved). It's also worth noting (maybe this belongs in the other ticket) that ed25519-donna (also by Floodberry) has code for the faster scalar basepoint multiplication, so we should give serious consideration to ditching the ref0 code and folding it in if it's up to par.
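For reference, the entry point I have in mind looks roughly like this; I'm assuming ed25519-donna's curved25519_scalarmult_basepoint() declaration from its ed25519.h, so verify the name and semantics against the actual tree before relying on it:

{{{
#include <stdio.h>
#include <string.h>
#include "ed25519.h"  /* ed25519-donna header (assumed layout) */

int
main(void)
{
  curved25519_key secret, pub;

  /* Toy scalar; a real secret key comes from a CSPRNG. */
  memset(secret, 0, sizeof(secret));
  secret[0] = 9;

  /* Fixed-basepoint curve25519 scalarmult using donna's precomputed
   * tables -- the part that should beat the reference implementation. */
  curved25519_scalarmult_basepoint(pub, secret);

  for (int i = 0; i < 32; i++)
    printf("%02x", pub[i]);
  printf("\n");
  return 0;
}
}}}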
The reference to __memcpy_sse2_unaligned() above reminds me that data should always be aligned for more efficient reads and writes.
There are tools (Valgrind?) that can report this. For x86(_64), buffers should always be aligned to at least a 16-byte boundary.
Note: Discussing Intel and compatible processors here.
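As a concrete illustration of what "aligned to at least a 16-byte boundary" means in code (the buffer size and names here are arbitrary; posix_memalign is POSIX.1-2001):

{{{
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
  void *buf = NULL;

  /* Request a 16-byte-aligned buffer explicitly rather than relying on
   * whatever alignment malloc happens to give back. */
  if (posix_memalign(&buf, 16, 4096) != 0)
    return 1;

  /* A pointer is 16-byte aligned iff its low four bits are zero. */
  printf("16-byte aligned: %s\n",
         ((uintptr_t)buf & 15) == 0 ? "yes" : "no");

  free(buf);
  return 0;
}
}}}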
To be pedantic:
Assembly/Compiler Coding Rule 46. //(H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.//
The performance hit for non-vector access comes when an access straddles a cache-line boundary (64 bytes), unless the processor is an iPotato86 that should have been retired a long time ago (P1, PMMX, AMD <= K8). Vector access needs to be 16-byte aligned, unless you are using AVX, where it doesn't matter (so, looking toward the bright future, this will matter less and less).
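To make "straddles a cache-line boundary" concrete, here is a tiny helper (the 64-byte line size is the assumption stated above; the alignment attribute is GCC/Clang syntax):

{{{
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if an access of `len` bytes starting at `p` crosses a
 * 64-byte cache line boundary, 0 otherwise. */
static int
straddles_cache_line(const void *p, size_t len)
{
  uintptr_t a = (uintptr_t)p;
  return (a / 64) != ((a + len - 1) / 64);
}

int
main(void)
{
  static unsigned char buf[128] __attribute__((aligned(64)));

  /* An 8-byte load at offset 8 stays inside one line; at offset 60 it
   * spans bytes 60..67 and crosses into the next line. */
  printf("offset 8:  %d\n", straddles_cache_line(buf + 8, 8));   /* 0 */
  printf("offset 60: %d\n", straddles_cache_line(buf + 60, 8));  /* 1 */
  return 0;
}
}}}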
The memcpy that I imagine dominates memcpy's runtime would be the one in buffers.c:write_to_buf(), invoked from connection_write_to_buf() via the RELAY_COMMAND_DATA handler logic (a quick skim shows that everything else is infrequent, should be reasonably aligned, or copies too little data to be interesting).
Here the destination will only be nicely aligned when a new chunk is allocated for the buffer (or when the data already in the buffer happens to end on a 16-byte boundary), and the source will never be correctly aligned (cell->payload + RELAY_HEADER_SIZE).
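A quick demonstration of that source misalignment (RELAY_HEADER_SIZE is 11 bytes: command, recognized, stream ID, digest, length; so the relay data can never start on a 16-byte boundary even when the payload buffer itself does):

{{{
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define RELAY_HEADER_SIZE 11  /* 1 cmd + 2 recognized + 2 stream + 4 digest + 2 len */

int
main(void)
{
  /* Stand-in for cell->payload; assume the best case where it is
   * 16-byte aligned to begin with. */
  unsigned char *payload = NULL;
  if (posix_memalign((void **)&payload, 16, 512) != 0)
    return 1;

  const unsigned char *src = payload + RELAY_HEADER_SIZE;

  /* The copy source sits at offset 11, so it is never 16-byte aligned. */
  printf("payload mod 16 = %u, src mod 16 = %u\n",
         (unsigned)((uintptr_t)payload & 15),
         (unsigned)((uintptr_t)src & 15));

  free(payload);
  return 0;
}
}}}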
I'm currently of the opinion that, before messing with this, faster crypto will get us more mileage for our development time.
I tend to agree there, though if we do someday decide to optimize memcpy, I'd like to see call-graph info showing which memcpy really matters; I bet we might be surprised.
Regarding crypto speed improvements, that's a 0.2.7 issue at the earliest.
The other functions that appear above in the profile don't matter much individually, and they don't look like bottlenecks per se, but I bet there's some opportunity to make them faster. Not for 0.2.6, though.
Two questions:
 1. Let's think about the elephant in ring 0: the 37% of our time that we're spending in the kernel. I don't think this is a regression, but we should check, and at some point we should investigate where that time is going.
 2. Can anybody profile... an exit? A busy hidden service? Some other relevant configuration?
Interesting, but not FWICT a regression. I say we close this. Could I ask you to open a couple of new tickets in 0.2.7 or later for doing less smartlist_remove() and for moving client-side ntor into workers, and then close this one when you're done?