Trac issues
https://gitlab.torproject.org/legacy/trac/-/issues

Issue #13738: Make worker handle introduction point crypto
https://gitlab.torproject.org/legacy/trac/-/issues/13738
Reported by David Goulet (dgoulet@torproject.org), last updated 2020-06-13

Linked to #13737 to send crypto-related tasks to the worker.

Milestone: Tor: unspecified

Issue #22210: circuit_is_acceptable is slow due to IP and fingerprint parsing
https://gitlab.torproject.org/legacy/trac/-/issues/22210
Reported by teor, last updated 2020-06-13

(I received this bug report from special. Please feel free to contact me (teor) with questions.)

Using Tor version 0.3.1.0-alpha-dev (git-04f1ddaa2a33194b); this also applies to at least all 0.2.8.x and above.

Testcase is a ricochet client with a very large number of contacts. Effectively, it's a client that attempts to build circuits to more than 100 (but fewer than 500) hidden services almost simultaneously, and retries over time with increasing delays.
This tor client will constantly use 100% of a CPU core for >15 minutes after start. There’s a variety of buggy behavior going on, and a lot of circuits, so it’s not entirely surprising to see some high load. More interesting is where that load actually comes from.
The vast majority of CPU time — 2.1 minutes out of a 3 minute trace — is spent under `circuit_is_acceptable`, as called from `connection_ap_attach_pending` (via `circuit_get_open_circ_or_launch` and `circuit_get_best`). The breakdown from there is:
```
2.10 min  100.0%    6.26 s  circuit_is_acceptable
 58.93 s   46.7%    1.05 s    tor_addr_parse
 56.34 s   44.7%    1.84 s      tor_inet_pton
 52.53 s   41.7%    1.68 s        tor_inet_aton
 50.71 s   40.2%  914.00 ms         tor_sscanf
 49.74 s   39.4%   13.78 s            tor_vsscanf
 29.30 s   23.2%   20.11 s              scan_unsigned
  5.29 s    4.2%    5.29 s                TOR_ISDIGIT
  1.13 s    0.8%    1.13 s                digit_to_num
 43.76 s   34.7%    1.30 s    connection_ap_can_use_exit
 39.51 s   31.3%  574.00 ms     node_get_by_nickname
 38.93 s   30.9%  764.00 ms       node_get_by_hex_id
 23.82 s   18.9%  869.00 ms         hex_digest_nickname_decode
 21.61 s   17.1%   10.95 s            base16_decode
 10.07 s    7.9%   10.07 s              hex_decode_digit_
 14.32 s   11.3%  156.00 ms         node_get_by_id
 16.39 s   13.0%    1.47 s    build_state_get_exit_node
 14.93 s   11.8%  119.00 ms     node_get_by_id
```
That is super hard to read in this format, but essentially a majority of time is spent on:
1. Parsing IP addresses from strings out of node descriptors
2. Looking up nodes by ID
3. Decoding hex digests to get node IDs
This implies to me that these values could be cached in a much more useful way in the node struct, that the algorithmic complexity of some of the circuit-selection functions is out of control, and I'm also guessing that the circuit-attach functions are being called __way__ too often.
Without circuit_is_acceptable being considered, the hottest path is under choose_good_middle_server for node selection, which seems much more reasonable.
The profile quoted here had --enable-expensive-hardening (including ASAN) and a slightly older tor, but I've rerun with 04f1ddaa and without hardening with the same conclusions, including 2 out of 3 minutes of CPU time being spent under `circuit_is_acceptable`.

Milestone: Tor: 0.4.1.x-final
Assigned to: Neel Chauhan (neel@neelc.org)

Issue #30291: Optimize our path selection code
https://gitlab.torproject.org/legacy/trac/-/issues/30291
Reported by David Goulet (dgoulet@torproject.org), last updated 2020-06-13

(This is a more complete ticket than #13739 and thus supersedes it.)

An onion service can be asked to open many rendezvous circuits by a client. With a vanilla Tor, the resource consumption there is roughly 1:2 (service:client), because the client needs to open two circuits, one to the RP and one to the IP, and then send one cell on that circuit to introduce. The service does a bit less than that because its introduction circuit is already established, so it simply needs to open a rendezvous circuit.
However, an attacker (DoS) can make a client skip the RP circuit creation, pin the relays in the path to the introduction point, and simply open an introduction circuit each time through that path (which in theory can be made very fast, and is also controlled by the attacker).
This means that the work ratio goes from 1:2 to 1:1, because the client only opens one circuit, sends a single cell, closes it, and repeats.
But, in reality, it gets worse for the service because of the performance of our path selection code. It is heavy. For starters, we need to randomly select 3 nodes from the consensus for each rendezvous circuit. But to select those nodes, we go over all of them, for each node, to exclude some and include others, going through a series of tests such as `tor_addr_family()` and `node_has_preferred_descriptor()` that are quite heavy.
The CPU profile of an overloaded service shows that the majority of CPU is spent in selecting these nodes. Here is a perf report output for the top 5:
```
+ 11.10% tor tor [.] smartlist_remove
+ 10.96% tor tor [.] fmonty
+ 8.77% tor tor [.] tor_memeq
+ 6.56% tor tor [.] tor_addr_family
+ 5.40% tor tor [.] tor_addr_is_null
```
The top `smartlist_remove()` comes from:
```
- smartlist_remove
- 11.10% smartlist_subtract
router_choose_random_node
choose_good_middle_server
onion_extend_cpath
onion_populate_cpath
circuit_establish_circuit
- circuit_launch_by_extend_info
- 10.93% launch_rendezvous_point_circuit
```
Amazingly enough, we spend more time selecting paths than doing crypto on a service that is being DDoSed with introduction requests (#29607).
There aren't many straightforward approaches here. Optimizing the code path bit by bit could be one. Most likely, our first step is to build a benchmark for our current path selection code and then try to improve on it incrementally.
There are also more hackish approaches, like pre-building a list of RP candidate nodes at each consensus (or when torrc options such as `ExcludeNodes` change); randomly picking nodes from that list would then be fast, because the nodes wouldn't need to go through the long series of eligibility tests, it would already have been done.
One example is `router_choose_random_node()`, which goes over the entire node list just to pick the nodes to exclude, then again to test each node in `router_add_running_nodes_to_smartlist()`, and then potentially does a series of `smartlist_subtract()` calls that iterate over the entire list yet again. This is in theory `O(n) + O(n) + O(n)`, which is still `O(n)`, but in practice the constant factors dominate the CPU consumption ;).

Milestone: Tor: unspecified

Issue #13739: Optimize the functions called in circuit_launch_by_extend_info()
https://gitlab.torproject.org/legacy/trac/-/issues/13739
Reported by David Goulet (dgoulet@torproject.org), last updated 2020-06-13

Ref: https://trac.torproject.org/projects/tor/ticket/8902#comment:10

You can see that on a loaded HS, some functions are called quite a lot and they take a lot of CPU:

https://people.torproject.org/~dgoulet/tor-hs-perf-100-circ.png

They should be optimized to bring that CPU usage down.

Milestone: Tor: unspecified