Tor issues: https://gitlab.torproject.org/tpo/core/tor/-/issues (updated 2023-09-21T13:03:40Z)

Issue #40858: Failing to initialize sendme_inc when building the descriptor structure causes hidden service performance and availability issues
https://gitlab.torproject.org/tpo/core/tor/-/issues/40858
2023-09-21T13:03:40Z · hyunsoo.kim676

I believe I have found a bug with quite a serious impact in the Tor relay code, and my PI suggested that I report it.
The bug affects Tor 0.4.7.13 and 0.4.8.5 (at least).
Line numbers referenced below are based on release-0.4.7.
The details are below.
Bug:
Failing to initialize sendme_inc when building the descriptor structure causes hidden service performance and availability issues.
Details:
In hs_service.c, in the function build_service_desc_encrypted, the hidden service initializes the encrypted structure of the descriptor structure (of type hs_service_descriptor_t). In this function, the variables sendme_inc and flow_control_pv are not initialized, although they should be.
At first glance, there’s no impact, as these fields of the descriptor structure are not used when encoding the descriptor before sending it to the client. Instead, when encoding the descriptor, the correct sendme_inc and protocol version values from the consensus are used (see hs_descriptor.c line 778), so that the client receives correct values, hence there’s no impact on the sendme mechanism.
However, there is another place where one of these fields, sendme_inc, IS used by the hidden service. In the function hs_service_new_consensus_params(), at hs_service.c line 3716, the code checks whether the sendme values have changed in the new consensus. The value of current_sendme_inc is fixed at 31, but the value of desc->desc->encrypted_data.sendme_inc is erroneously 0 (as it has never been initialized), so the condition always returns true. The impact is that all introduction points from both descriptors are immediately expired every time a new consensus is received.
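To make the failure mode concrete, here is a paraphrased sketch of that check; the structure follows the description above, not the exact upstream code:

```c
/* Paraphrased sketch of the check in hs_service_new_consensus_params()
 * (hs_service.c, around line 3716 per this report); not the exact code. */
uint8_t current_sendme_inc = congestion_control_sendme_inc(); /* 31, from the consensus */

/* desc->desc->encrypted_data.sendme_inc was never initialized, so it is
 * still 0 and this comparison is always true: every new consensus expires
 * all introduction points from both descriptors. */
if (desc->desc->encrypted_data.sendme_inc != current_sendme_inc) {
  /* ... treat the sendme parameters as changed: expire the intro points and
   * rebuild/re-upload the descriptor ... */
}
```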
Impact analysis:
According to our analysis, expiring ALL introduction points, from both descriptors, all at once causes an availability issue. Every time a consensus is received (roughly every hour on average), the service is unavailable for up to two minutes: the clients' cached descriptor is valid for two minutes, and until they fetch a new descriptor the service cannot be reached.
In addition, the expiry of introduction points may lead to unnecessary hidden service descriptor uploads, which may affect the performance of both Tor clients and relays.
Fix:
We suggest the following fix.
In hs_service.c, at line 1768, add the line:
`encrypted->sendme_inc = congestion_control_sendme_inc();`
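For context, a rough sketch of where that assignment would sit inside `build_service_desc_encrypted()`; the surrounding lines are abridged and partly assumed:

```c
/* Sketch of build_service_desc_encrypted() in hs_service.c with the proposed
 * fix; surrounding code abridged/assumed, only the last line is the fix. */
hs_desc_encrypted_data_t *encrypted = &desc->desc->encrypted_data;

/* ... existing initialization of the encrypted part of the descriptor ... */

/* Proposed fix: record the consensus sendme increment in the descriptor
 * structure so the later consensus-change check compares like with like. */
encrypted->sendme_inc = congestion_control_sendme_inc();
```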
We have downloaded the source code and built it with the fix, and it seems to solve the problem.

Milestone: Tor: 0.4.7.x-post-stable · Assignee: Mike Perry

Issue #40812: socks: Add a new extended error to indicate when client is unable to solve the PoW
https://gitlab.torproject.org/tpo/core/tor/-/issues/40812
2024-01-30T15:30:21Z · David Goulet <dgoulet@torproject.org>

We can use `X'F8'` for this one; it would indicate that solving the PoW for a specific onion connection failed.

Milestone: Tor: 0.4.8.x-post-stable

Issue #40767: Investigate high circuit build error rates in simulation
https://gitlab.torproject.org/tpo/core/tor/-/issues/40767
2023-04-12T14:46:43Z · gabi-250

We ran some shadow simulations to debug/repro the issue from #40570, and @jnewsome noticed the onion service clients have consistently high [circuit build failure rates](https://gitlab.torproject.org/tpo/core/tor/-/issues/40570#note_2883257).
We should figure out what causes these circuit build failures.

Issue #40766: Introduce additional HS client timeouts
https://gitlab.torproject.org/tpo/core/tor/-/issues/40766
2023-04-12T14:46:37Z · gabi-250

Today tor terminates any circuits that take too long to build (`circuit_build_times_handle_completed_hop`, `circuit_expire_building`). In addition to this circuit build timeout, we might want to introduce timeouts for circuits that were built successfully but are stuck waiting for one of the following (a rough sketch follows the list):
* `INTRODUCE_ACK` (for intro circuits)
* `RENDEZVOUS_ESTABLISHED` (for rend circuits)
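A minimal sketch of the idea, with assumed helper and cutoff names (the real implementation would likely live next to `circuit_expire_building`):

```c
/* Sketch only: expire client-side HS circuits that finished building but
 * have been stuck waiting for the next protocol cell for too long.
 * stuck_cutoff_secs is an assumed, to-be-tuned value. */
static void
expire_stuck_hs_client_circuits(time_t now, int stuck_cutoff_secs)
{
  SMARTLIST_FOREACH_BEGIN(circuit_get_global_list(), circuit_t *, circ) {
    const int waiting_on_intro_ack =
      circ->purpose == CIRCUIT_PURPOSE_C_INTRODUCE_ACK_WAIT;
    const int waiting_on_rend_established =
      circ->purpose == CIRCUIT_PURPOSE_C_ESTABLISH_REND;
    if ((waiting_on_intro_ack || waiting_on_rend_established) &&
        now - circ->timestamp_began.tv_sec > stuck_cutoff_secs) {
      circuit_mark_for_close(circ, END_CIRC_REASON_TIMEOUT);
    }
  } SMARTLIST_FOREACH_END(circ);
}
```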
cc @dgoulet who suggested this potential improvement for c-tor/arti

Issue #40758: Follow-up from "metrics: Add service side metrics for REND/INTRO circuit failures."
https://gitlab.torproject.org/tpo/core/tor/-/issues/40758
2023-03-08T15:51:34Z · gabi-250

The following discussion from !695 should be addressed:
- [x] @gabi-250 started a [discussion](https://gitlab.torproject.org/tpo/core/tor/-/merge_requests/695#note_2878450):
> Hmm, maybe it's worth adding a `reason` label so we can break these down by cause (when visualizing in e.g. grafana).

Assignee: gabi-250

Issue #40757: Add timing metrics for onion service handshakes
https://gitlab.torproject.org/tpo/core/tor/-/issues/40757
2023-03-14T11:46:00Z · gabi-250

Our [metrics library](https://gitlab.torproject.org/tpo/core/tor/-/blob/455471835da35d8ee64e6a2c0a70acb89a003bf4/src/lib/metrics/metrics_common.h#L25-31) only supports [counters](https://prometheus.io/docs/concepts/metric_types/#counter) and [gauges](https://prometheus.io/docs/concepts/metric_types/#gauge). For time measurements, we're going to need e.g. [histograms](https://prometheus.io/docs/concepts/metric_types/#histogram), so we'll have to update the metrics library accordingly.

Assignee: gabi-250

Issue #40756: Add client side failure metrics for the INTRO/REND stages
https://gitlab.torproject.org/tpo/core/tor/-/issues/40756
2023-03-13T15:56:40Z · gabi-250

Assignee: gabi-250

Issue #40755: Add service side failure metrics for the intro request/rendezvous stages
https://gitlab.torproject.org/tpo/core/tor/-/issues/40755
2023-03-07T14:50:33Z · gabi-250

Assignee: gabi-250

Issue #40751: `hs_metrics_close_established_intro` should decrement the `hs_intro_established_count` metric, not increment it
https://gitlab.torproject.org/tpo/core/tor/-/issues/40751
2023-02-13T15:18:20Z · gabi-250

`hs_metrics_close_established_intro` should decrease `hs_intro_established_count` by 1, not [increase it](https://gitlab.torproject.org/tpo/core/tor/-/blob/c98d78c95c198dd513c9cc446ed09d430b49566c/src/feature/hs/hs_metrics.h#L65-68).

Assignee: gabi-250

Issue #40717: Additional MetricsPort stats for various stages of onion service handshake
https://gitlab.torproject.org/tpo/core/tor/-/issues/40717
2023-12-07T14:41:35Z · Mike Perry

If we export additional onion service metrics, such as time measurements on the HSDIR, INTRO, and REND stages of circuit setup for both client and service side, and the number of timeouts/failures there, it would help to uncover the root cause of issues like https://gitlab.torproject.org/tpo/core/tor/-/issues/40570 and related reliability and connectivity issues with onion services.
We can also export the congestion control info from https://gitlab.torproject.org/tpo/core/tor/-/issues/40708 to the onion service metrics set, which can help us with tuning congestion control for onion services.
We can then hook up the onionperf onion service instances to our grafana dashboard and gather more detailed stats that way, as a supplement to the metrics that get graphed on the metrics website.

Issue #40716: Implement conflux for onion services
https://gitlab.torproject.org/tpo/core/tor/-/issues/40716
2022-11-28T14:01:05Z · Mike Perry
Conflux is traffic splitting, and will result in increased throughput and reduced latency for onion services after a connection has been established, by routing traffic over multiple paths, or via the lowest latency path to a service.
This ticket is for the onion service pieces of conflux (https://gitlab.torproject.org/tpo/core/tor/-/issues/40593).
We will not be implementing the onion service pieces of conflux as part of that ticket. It can be done later, if any onion service sponsors care about latency or throughput.
The pieces for onion services are:
- **Negotiation**
- [ ] Protover Advertisement for Onions (24h)
- [ ] Rend circuit linking (40h)
This is specified in https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/329-traffic-splitting.txt, but we probably want to allow onion services to configure their scheduler by manually choosing either BLEST or LowRTT, since different kinds of onion services may want to optimize for either throughput or latency.
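As a sketch of that configuration surface, something like the following could work (purely illustrative; the enum and its name are not from the proposal):

```c
/* Hypothetical sketch: let an onion service operator pin the conflux
 * scheduler rather than letting tor pick, since some services care about
 * throughput (BLEST) and others about latency (LowRTT). */
typedef enum {
  CONFLUX_SCHED_AUTO = 0,  /* let tor decide */
  CONFLUX_SCHED_BLEST,     /* optimize for throughput */
  CONFLUX_SCHED_LOWRTT,    /* optimize for latency */
} conflux_sched_choice_t;
```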
There may be some additional work wrt making sure linked edge conns work properly, if they are handled differently for the onion service case.
Also, some shadow validation and performance testing will be needed. Maybe 40h or so of dev time (though much longer wall-clock time).

Issue #40702: Single Onion Service Rends become 7 hop after retry
https://gitlab.torproject.org/tpo/core/tor/-/issues/40702
2023-02-09T16:22:00Z · Mike Perry

In `retry_service_rendezvous_point()`, if a rend connect fails for a non-anonymous rend, we promote it to a 7-hop slow rend for some reason.
This will impact non-anonymous onions who want performance, especially during the DoS.
David notes that this decision to fall back to full anonymous mode in the event of timeout or failure was explicitly written just in case a non-anonymous onion service was also behind a restrictive firewall, and that firewall was the thing that happened to cause a timeout. There is also a comment that explains this, believe it or not. Back then, decision making in C-Tor was a bit more...special.
I bet if we get funders who actually care about single onion performance, they would prefer that their single onions not randomly double in latency on a timeout or failure, just to support the case where some single onion out there might be behind a firewall that they don't know about. Such a funder might suggest that we provide some other option for people behind firewalls to use, instead of this madness.
But I look forward to more research.

Issue #40696: hs: Service rendezvous circuit can be repurposed
https://gitlab.torproject.org/tpo/core/tor/-/issues/40696
2022-11-10T15:09:26Z · David Goulet <dgoulet@torproject.org>

The `circuit_build_times_handle_completed_hop()` function is called at every hop built and can decide to repurpose a circuit without the HS subsystem knowing about it.
For clients, it is not such a big deal because they notice if an intro or rendezvous circuit goes away and they will simply relaunch one.
For services, this is A-bad. If the rendezvous circuit gets repurposed that way, the service will not retry, leading to the client timing out the rendezvous circuit and redoing the whole HS introduction dance. So in theory, 20% of all service RP requests are currently failing, making the client retry after a CBT for the RP.
Essentially, we need the service to retry the rendezvous circuit if it gets repurposed. The proper, easy approach here is to simply launch a retry in `hs_circ_cleanup_on_repurpose()` if the purpose is `S_CONNECT_REND`.

Milestone: Tor: 0.4.8.x-freeze · Assignee: David Goulet <dgoulet@torproject.org>

Issue #40695: hs: Don't retry to open a service RP circuit while iterating on the circuit list
https://gitlab.torproject.org/tpo/core/tor/-/issues/40695
2022-11-01T13:31:05Z · David Goulet <dgoulet@torproject.org>

Turns out that `hs_circ_retry_service_rendezvous_point()` is being called in the circuit loop of `circuit_expire_building()`, and this is no good at all because it means we are adding an item to the circuit global list while iterating on it.
Instead, we should call it when `hs_service_circuit_cleanup_on_close()` is called: if the closing circuit was in the `CIRCUIT_PURPOSE_S_CONNECT_REND` state, retry.

Milestone: Tor: 0.4.8.x-freeze · Assignee: David Goulet <dgoulet@torproject.org>

Issue #40694: hs: Client rendez-vous circuit expiry is a mess
https://gitlab.torproject.org/tpo/core/tor/-/issues/40694
2022-11-10T15:09:25Z · David Goulet <dgoulet@torproject.org>

It turns out that once a rendezvous circuit is ready and waiting for the `RENDEZVOUS2` to arrive (`CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED`), we expire that circuit with `general_cutoff`, which is a very, very low value. We then flip `circ->hs_circ_has_timed_out`, indicating that it timed out "but we'll keep it around in case it works". However, in `circuit_is_acceptable()` we return false if that flag is set, meaning that once flagged, the circuit is as good as dead until it finalizes, which can be "never".
So, in the meantime, other rendezvous circuits get opened until one finally finalizes (all to the same RP). This can lead to a bomb of circuits being opened by the client because it isn't capped, and if the service never replies (because it is under DoS ;), the client will just keep on trying.
There are many issues here:
1. The cutoff for such a circuit should be much higher, because we end up in that circuit purpose when the `INTRODUCE_ACK` is received, which could be *before* the service receives the `INTRODUCE2` cell. Thus, the worst case is a full 3-hop latency, a 4-hop circuit creation latency (service -> RP), and then a 7-hop latency for the `RENDEZVOUS2` cell to arrive. After talking to Mike, we'll apply a large value, as in taking the basic CBT and multiplying it by 3 (see the sketch after this list).
2. We should get rid of `hs_circ_has_timed_out` because it is extremely fragile in its current logic and we should simply apply a much longer CBT for the `CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED` purpose.
3. It turns out that we can't have RP circuits in parallel to the same RP. The RP relay will discard any old circuits if a new one shows up with the same cookie. And so this whole dance is pointless.
4. Unrelated: the service rendezvous establish timeout cutoff is too low; it should be the "four hop" circuit cutoff and not the general cutoff, so this also needs to be bumped.
5. I'm sure I'll find more problems in this logic.
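A sketch of what item 1 could look like (variable names assumed; this is not the actual patch):

```c
/* Illustrative only: give client rend circuits that are waiting on
 * RENDEZVOUS2 their own, much larger cutoff instead of general_cutoff.
 * base_timeout_ms is assumed to be the CBT-derived build timeout in ms. */
struct timeval rend_ready_cutoff;
uint64_t ms = (uint64_t)base_timeout_ms * 3; /* "basic CBT multiplied by 3" */

rend_ready_cutoff.tv_sec = ms / 1000;
rend_ready_cutoff.tv_usec = (ms % 1000) * 1000;

if (circ->purpose == CIRCUIT_PURPOSE_C_REND_READY_INTRO_ACKED)
  cutoff = rend_ready_cutoff;
```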
Again, we need to backport this for the sake of HS client UX.

Milestone: Tor: 0.4.8.x-freeze · Assignee: David Goulet <dgoulet@torproject.org>

Issue #40692: hs: Client intro failure cache being polluted by circuit closing without a NACK from the intro
https://gitlab.torproject.org/tpo/core/tor/-/issues/40692
2022-11-10T15:09:26Z · David Goulet <dgoulet@torproject.org>
Under current network conditions, a lot of introduction circuits fail for clients because the intro points are overloaded and can't process the `ntor` handshake, leading to the circuit being closed with a `RESOURCELIMIT`.
When this trickles down to the HS client, `hs_client_circuit_cleanup_on_free()` is called, which flags the intro point with a generic failure error, which in turn makes the client stop using that intro point for another 2 minutes (the time the intro will stay in the failure cache).
I have observed a situation where all intro points get flagged this way; the client re-downloads the descriptor, fails again, and then has to wait 120 seconds before being able to do anything. This conveniently aligns with the `SocksTimeout` of 120 seconds as well, so inevitably the socks connection hangs up.
This has really bad UX/reachability consequences because, in theory, we could simply retry the intro point a couple of times and get it to work, instead of failing and falling back to other intro points, which in turn overloads them and has a snowball effect.
The solution here is to mark the failure as "unreachable", so the HS client will retry up to 5 times before giving up (`MAX_INTRO_POINT_REACHABILITY_FAILURES`).
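A rough sketch of the idea; the helper and enum names below are hypothetical, since the exact call site in `hs_client_circuit_cleanup_on_free()` isn't quoted here:

```c
/* Sketch only (hypothetical helper/enum names): when an intro circuit is
 * closed without a NACK from the intro point (e.g. a relay-side
 * RESOURCELIMIT), record the failure as "unreachable" rather than generic,
 * so the client retries this intro point up to
 * MAX_INTRO_POINT_REACHABILITY_FAILURES times before skipping it. */
if (!received_intro_nack) {
  client_intro_failure_note(intro_point, INTRO_FAILURE_UNREACHABLE);
} else {
  client_intro_failure_note(intro_point, INTRO_FAILURE_GENERIC);
}
```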
We should definitely backport this.

Milestone: Tor: 0.4.8.x-freeze · Assignee: David Goulet <dgoulet@torproject.org>

Issue #40634: prop327: Implement PoW over Introduction Circuits
https://gitlab.torproject.org/tpo/core/tor/-/issues/40634
2024-03-28T18:02:50Z · David Goulet <dgoulet@torproject.org>

This is the implementation ticket for prop327.

Milestone: Tor: 0.4.8.x-freeze · Assignee: Micah Elizabeth Scott

Issue #40621: 'Hidden' Authenticated Onion Services
https://gitlab.torproject.org/tpo/core/tor/-/issues/40621
2022-06-16T15:21:04Z · richard

# The Problem
Ricochet-Refresh has an issue where anybody who knows a user's onion service id (e.g. the Ricochet-Refresh user name/handle) can cyber-stalk said user by attempting to connect to the onion service (see https://github.com/blueprint-freespeech/ricochet-refresh/issues/73 for a more in-depth description).
Gosling is meant to partially rectify this problem by using multiple onion services (see https://github.com/blueprint-freespeech/gosling/blob/main/docs/protocol.md for architecture info), but it still has this issue in the case where a previously allowed contact is blocked and their onion auth key is revoked. The blocked contact can *still* cyberstalk the user so long as the endpoint server they were given is still in use. This *can* be mitigated by giving every single contact their own endpoint onion service, but past discussions have indicated that maintaining so many onion services would be bad for the tor relays.
# What I'd like
If an onion service is running but requires an auth key to decrypt the service descriptor, the tor daemon helpfully notifies the client of that (via a custom SOCKS5 error). This is great UX for connecting to authenticated onion services in Tor Browser (as with OnionShare, for example), but less than ideal for the above-described scenario.
It would be nice if it were possible to have a 'hidden' authenticated onion service (lol I know) which a user would only be able to confirm the existence of if they are allowed access via onion auth. So if a user does not have a valid auth key, they should not be able to determine if the service is running.
I have no idea if this is possible given how v3 works, but something to think about for the future maybe?

Issue #40586: hs: Set congestion control onto crypt_path
https://gitlab.torproject.org/tpo/core/tor/-/issues/40586
2022-03-25T19:26:32Z · David Goulet <dgoulet@torproject.org>

Don't set up the congestion control on the circuit but rather on the cpath. Onion service circuits have a special last hop for the handshake, and the CC object should be set there for the subsystem to properly function.

Milestone: Tor: 0.4.7.x-stable · Assignee: David Goulet <dgoulet@torproject.org>

Issue #40581: prometheus issues label name "port" is not unique: invalid sample when scraping
https://gitlab.torproject.org/tpo/core/tor/-/issues/40581
2022-03-25T19:26:31Z · Alex Xu

example output:
```
tor_hs_app_write_bytes_total{onion="duwq3shurnywtxq5z76dbcy7gbqyjgel4vzauxupuc4v773tiyxif5qd",port="80",port="443"} 1000
```
While I am new to Prometheus, my understanding is that each sample of a metric can have at most one value per label; therefore, `port="80",port="443"` is invalid. I think this makes sense in this context: either tor is able to count the bytes per port, in which case there should be two lines, one with `port="80"` and one with `port="443"`, or tor can only count the bytes per service, in which case there should be one line and no port labels.
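For illustration, the per-port variant would presumably emit one sample per port, along these lines (the values here are made up):

```
tor_hs_app_write_bytes_total{onion="duwq3shurnywtxq5z76dbcy7gbqyjgel4vzauxupuc4v773tiyxif5qd",port="80"} 600
tor_hs_app_write_bytes_total{onion="duwq3shurnywtxq5z76dbcy7gbqyjgel4vzauxupuc4v773tiyxif5qd",port="443"} 400
```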
Assuming the former is true, I think https://gitlab.torproject.org/tpo/core/tor/-/blob/455471835da35d8ee64e6a2c0a70acb89a003bf4/src/feature/hs/hs_metrics.c#L46-59 should look like:
```
  if (base_metrics[i].port_as_label && service->config.ports) {
    SMARTLIST_FOREACH_BEGIN(service->config.ports,
                            const hs_port_config_t *, p) {
      metrics_store_entry_t *entry =
        metrics_store_add(store, base_metrics[i].type, base_metrics[i].name,
                          base_metrics[i].help);
      /* Add labels to the entry. */
      metrics_store_entry_add_label(entry,
                    metrics_format_label("onion", service->onion_address));
      metrics_store_entry_add_label(entry,
                    metrics_format_label("port", port_to_str(p->virtual_port)));
    } SMARTLIST_FOREACH_END(p);
  } else {
    metrics_store_entry_t *entry =
      metrics_store_add(store, base_metrics[i].type, base_metrics[i].name,
                        base_metrics[i].help);
    /* Add labels to the entry. */
    metrics_store_entry_add_label(entry,
                    metrics_format_label("onion", service->onion_address));
  }
```
Possibly with refactoring, and maybe adjustment of the condition, although I'm not sure it is allowed to have an onion service without ports? I have not submitted this as an MR because I have not tested it at all.
Strangely, however, the case of multiple ports per service seems to already be handled at https://gitlab.torproject.org/tpo/core/tor/-/blob/455471835da35d8ee64e6a2c0a70acb89a003bf4/src/feature/hs/hs_metrics.c#L80-91; it is just not initialized properly. Possibly this change was intended during development and not completed?

Milestone: Tor: 0.4.7.x-stable · Assignee: David Goulet <dgoulet@torproject.org>