Latest profiling at commit 99e915dbfea9b2b985a10368721d9c79b358abbe
{{{
+  15.86%  tor  [.] fmonty
+   3.79%  tor  [.] circuitmux_find_map_entry
+   2.81%  tor  [.] connection_bucket_refill
+   2.57%  tor  [.] connection_handle_read_impl
+   1.99%  tor  [.] buf_datalen
+   1.79%  tor  [.] assert_connection_ok
+   1.52%  tor  [.] connection_or_process_cells_from_inbuf
}}}
I can see a couple of ways of reducing circuitmux_find_map_entry() CPU time by calling it less. For instance, update_circuit_on_cmux_() calls it three times in the same function, querying the same (cmux, circ) pair each time.
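To illustrate the shape of that change with a runnable toy (the struct and helper names below are stand-ins, not Tor's actual circuitmux types): do the lookup once, keep the pointer, and reuse it for every subsequent check in the same function.

{{{
#include <stdio.h>
#include <string.h>

/* Toy stand-in for the cmux hashtable entry and its lookup; in Tor the
 * equivalent probe is circuitmux_find_map_entry(cmux, circ). */
struct muxinfo { int cell_count; int active; };
static struct muxinfo table[4];

static struct muxinfo *
find_map_entry(int cmux_id, int circ_id)
{
  /* Pretend this is the hashtable probe showing up in the profile. */
  return &table[(cmux_id * 3 + circ_id) & 3];
}

/* Sketch of the caching pattern: one probe per call instead of three. */
static void
update_circuit(int cmux_id, int circ_id, int n_cells)
{
  struct muxinfo *ent = find_map_entry(cmux_id, circ_id); /* single lookup */

  ent->cell_count += n_cells;          /* reuse the cached pointer... */
  ent->active = (ent->cell_count > 0); /* ...instead of probing again */
}

int
main(void)
{
  memset(table, 0, sizeof(table));
  update_circuit(1, 2, 5);
  printf("cells=%d active=%d\n", table[1].cell_count, table[1].active);
  return 0;
}
}}}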
Clearly the biggest part of the CPU time is crypto. Here is another view, without a procname "tor" filter. The first value, 37.04%, is kernel work.
{{{
+  37.04%  tor  [ctr]                  [k] 0xffffffff8104f45a
+   3.85%  tor  tor                    [.] fmonty
+   2.85%  tor  libcrypto.so.1.0.0     [.] gcm_ghash_clmul
+   2.82%  tor  libcrypto.so.1.0.0     [.] aesni_ctr32_encrypt_blocks
+   2.14%  tor  libcrypto.so.1.0.0     [.] aesni_cbc_sha1_enc_avx
+   2.01%  tor  libcrypto.so.1.0.0     [.] bn_sqr4x_mont
+   1.62%  tor  libc-2.19.so           [.] _int_malloc
+   1.60%  tor  libcrypto.so.1.0.0     [.] sha1_block_data_order_avx
+   1.04%  tor  libc-2.19.so           [.] __memcpy_sse2_unaligned
+   0.95%  tor  libevent-2.0.so.5.1.9  [.] event_base_loop
+   0.95%  tor  libz.so.1.2.8          [.] 0x0000000000002a87
+   0.92%  tor  tor                    [.] circuitmux_find_map_entry
}}}
Time to dust off #8897 (moved). It's also worth noting (maybe this belongs in the other ticket) that ed25519-donna (also by Floodberry) has code for the faster scalar basepoint multiplication, so we should give serious consideration to ditching the ref0 code and folding it in if it's up to par.
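For reference, the entry point I have in mind looks roughly like this; I'm assuming ed25519-donna's curved25519_scalarmult_basepoint() declaration from its ed25519.h, so verify the name and semantics against the actual tree before relying on it:

{{{
#include <stdio.h>
#include <string.h>
#include "ed25519.h"  /* ed25519-donna header (assumed layout) */

int
main(void)
{
  curved25519_key secret, pub;

  /* Toy scalar; a real secret key comes from a CSPRNG. */
  memset(secret, 0, sizeof(secret));
  secret[0] = 9;

  /* Fixed-basepoint curve25519 scalarmult using donna's precomputed
   * tables -- the part that should beat the reference implementation. */
  curved25519_scalarmult_basepoint(pub, secret);

  for (int i = 0; i < 32; i++)
    printf("%02x", pub[i]);
  printf("\n");
  return 0;
}
}}}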
The reference to __memcpy_sse2_unaligned() above reminds me that data should always be aligned for more efficient reads and writes.
There are tools (Valgrind?) that can report this. For x86(_64), buffers should always be aligned to at least a 16-byte boundary.
Note: Discussing Intel and compatible processors here.
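As a concrete illustration of what "aligned to at least a 16-byte boundary" means in code (the buffer size and names here are arbitrary; posix_memalign is POSIX.1-2001):

{{{
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
  void *buf = NULL;

  /* Request a 16-byte-aligned buffer explicitly rather than relying on
   * whatever alignment malloc happens to give back. */
  if (posix_memalign(&buf, 16, 4096) != 0)
    return 1;

  /* A pointer is 16-byte aligned iff its low four bits are zero. */
  printf("16-byte aligned: %s\n",
         ((uintptr_t)buf & 15) == 0 ? "yes" : "no");

  free(buf);
  return 0;
}
}}}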
To be pedantic:
Assembly/Compiler Coding Rule 46. //(H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.//
The performance hit for non-vector access comes when an access straddles a cache-line boundary (64 bytes), unless the processor is an iPotato86 that should have been retired a long time ago (P1, PMMX, AMD <= K8). Vector access needs to be 16-byte aligned, unless you are using AVX, where it doesn't matter (so, looking toward the bright future, this will matter less and less).
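To make "straddles a cache-line boundary" concrete, here is a tiny helper (the 64-byte line size is the assumption stated above; the alignment attribute is GCC/Clang syntax):

{{{
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if an access of `len` bytes starting at `p` crosses a
 * 64-byte cache line boundary, 0 otherwise. */
static int
straddles_cache_line(const void *p, size_t len)
{
  uintptr_t a = (uintptr_t)p;
  return (a / 64) != ((a + len - 1) / 64);
}

int
main(void)
{
  static unsigned char buf[128] __attribute__((aligned(64)));

  /* An 8-byte load at offset 8 stays inside one line; at offset 60 it
   * spans bytes 60..67 and crosses into the next line. */
  printf("offset 8:  %d\n", straddles_cache_line(buf + 8, 8));   /* 0 */
  printf("offset 60: %d\n", straddles_cache_line(buf + 60, 8));  /* 1 */
  return 0;
}
}}}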
The memcpy that I imagine dominates memcpy's runtime would be the one in buffers.c:write_to_buf(), invoked from connection_write_to_buf() via the RELAY_COMMAND_DATA handler logic (a quick skim shows that everything else is infrequent, should be reasonably aligned, or copies too little data to be interesting).
Here the destination will only be nicely aligned when a new chunk is allocated for the buffer (or when the data already in the buffer happens to end on a 16-byte boundary), and the source will never be correctly aligned (cell->payload + RELAY_HEADER_SIZE).
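A quick demonstration of that source misalignment (RELAY_HEADER_SIZE is 11 bytes: command, recognized, stream ID, digest, length; so the relay data can never start on a 16-byte boundary even when the payload buffer itself does):

{{{
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define RELAY_HEADER_SIZE 11  /* 1 cmd + 2 recognized + 2 stream + 4 digest + 2 len */

int
main(void)
{
  /* Stand-in for cell->payload; assume the best case where it is
   * 16-byte aligned to begin with. */
  unsigned char *payload = NULL;
  if (posix_memalign((void **)&payload, 16, 512) != 0)
    return 1;

  const unsigned char *src = payload + RELAY_HEADER_SIZE;

  /* The copy source sits at offset 11, so it is never 16-byte aligned. */
  printf("payload mod 16 = %u, src mod 16 = %u\n",
         (unsigned)((uintptr_t)payload & 15),
         (unsigned)((uintptr_t)src & 15));

  free(payload);
  return 0;
}
}}}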
I'm currently of the opinion that, before messing with this, faster crypto will get us more mileage for our development time.
I tend to agree there, though if we do someday decide to optimize memcpy, I'd like to see call-graph info showing which memcpy really matters; I bet we might be surprised.
Regarding crypto speed improvements, that's a 0.2.7 issue at the earliest.
The other functions that appear above in the profile don't matter much individually, and they don't look like bottlenecks per se, but I bet there's some opportunity to make them faster. Not for 0.2.6, though.
Two questions:
 1. Let's think about the elephant in ring 0: the 37% of our time that we're spending in the kernel. I don't think this is a regression, but we should check, and at some point we should investigate where that time is going.
 2. Can anybody profile... an exit? A busy hidden service? Some other relevant configuration?
Interesting, but not FWICT a regression. I say we close this. Could I ask you to open a couple of new tickets in 0.2.7 or later for doing less smartlist_remove() and for moving client-side ntor into workers, and then close this one when you're done?