Exporting additional onion service metrics, such as time measurements for the HSDIR, INTRO, and REND stages of circuit setup on both the client and service side, plus the number of timeouts/failures at each stage, would help uncover the root cause of issues like #40570 (closed) and related onion service reliability and connectivity problems.
We can also export the congestion control info from #40708 (closed) as part of the onion service metrics set, which would help us tune congestion control for onion services.
We can then hook the OnionPerf onion service instances up to our Grafana dashboard and gather more detailed stats that way, as a supplement to the metrics graphed on the metrics website.
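For instance, assuming the OnionPerf hosts run a recent C Tor, the service-side metrics can be exposed to a Prometheus scraper (which Grafana then reads from) with something like the following torrc snippet; the port number here is arbitrary:

```
## Expose Tor's metrics in Prometheus format, on localhost only.
MetricsPort 127.0.0.1:9035 prometheus
MetricsPortPolicy accept 127.0.0.1
```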
- we currently don't have any metrics for the (hidden service) client side
- our metrics library only supports counters and gauges. For time measurements, we're going to need e.g. histograms, so we'll have to update the metrics library accordingly
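To illustrate the kind of histogram support the metrics library would need, here is a minimal sketch of a Prometheus-style histogram with fixed bucket bounds. All type and function names here are hypothetical, not C Tor's actual metrics API:

```c
/* Hypothetical sketch of a Prometheus-style histogram metric:
 * a counter per bucket (with fixed upper bounds), plus a running
 * sum and total count of observations. Not C Tor's real API. */
#include <stddef.h>
#include <stdint.h>

#define HIST_NBUCKETS 5

typedef struct {
  /* Upper bounds for each bucket, in milliseconds, ascending.
   * Observations above the last bound land in an implicit +Inf
   * bucket, stored at index HIST_NBUCKETS. */
  double bounds[HIST_NBUCKETS];
  uint64_t buckets[HIST_NBUCKETS + 1];
  double sum;
  uint64_t count;
} hs_time_histogram_t;

/* Record one time measurement (e.g. a rend circuit build time). */
static void
hist_observe(hs_time_histogram_t *h, double value_ms)
{
  size_t i;
  for (i = 0; i < HIST_NBUCKETS; i++) {
    if (value_ms <= h->bounds[i]) {
      h->buckets[i]++;
      goto done;
    }
  }
  h->buckets[HIST_NBUCKETS]++; /* +Inf bucket */
 done:
  h->sum += value_ms;
  h->count++;
}
```

A scraper can then derive averages (`sum / count`) and approximate percentiles from the per-bucket counts, which plain counters and gauges can't provide.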
Given the above, initially I'm only going to focus on adding timeout/failure metrics for the service side (for the HSDIR/INTRO/REND stages), and tackle timings and client side metrics afterwards.
For core/tor#40570, I managed to gather enough information from the Stem client logs, so I didn't end up adding many new metrics to tor. The ones I added are:
- `hs_intro_rejected_intro_req_count` - the number of introduction requests rejected by the hidden service. This metric has a `reason` label
- `hs_rdv_error_count` - the number of rendezvous errors as seen by the hidden service (this includes circuit establishment failures, failed retries, and end-to-end circuit setup failures). This metric has a `reason` label
- `tor_hs_rend_circ_build_time` - the rendezvous circuit build time in milliseconds
- `tor_hs_intro_circ_build_time` - the introduction circuit build time in milliseconds
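Scraped over the MetricsPort in Prometheus text format, these would look roughly like the lines below. The sample values, the `reason` label values, and the bucket bound are all illustrative; only the existence of a `reason` label is stated above:

```
tor_hs_intro_rejected_intro_req_count{reason="bad_token"} 2
tor_hs_rdv_error_count{reason="rend_circ_timeout"} 1
tor_hs_rend_circ_build_time_bucket{le="1000"} 14
tor_hs_rend_circ_build_time_sum 4860
tor_hs_rend_circ_build_time_count 15
```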
I'm not planning on adding any other metrics to C Tor in the near future, so I'm going to unassign this and put it back in the backlog. @mikeperry feel free to close if you think we can declare this done.
Hrmm, a stat for the hsdir time and failure count would also be useful.
Additionally, stats for the same thing on the service side will be useful too (hsdir post time, fail rate; intro fail/collapse/rotation rate, rend success rate and build time).
These metrics should not send us on a wild goose chase after every possible issue in C Tor, but long term we will want the same metrics, in some form, from Arti: we will want to compare Arti to C Tor and make sure that Arti onion services are more reliable at these things.
This came up with the Onion Service Resource Coalition (a second time) because the folks at Quiet report 'time to availability' as a fairly large issue for them: specifically, how long it takes for an onion address to become reachable by others, either when it's first created or after the host loses its connection to the Tor network and reconnects (Quiet has a mobile app, and network transitions can be frequent on mobile).
It's unclear where in the stack the issue is, and uncovering it without a reproducible case plus logs, or metrics for those layers, is difficult. I've asked them for a reproducible case so we can try to find out, but I agree with @mikeperry here that these stats would be very useful. If we don't want to commit to adding these to C Tor because of the relative complexity, we definitely need to make sure they get added to Arti.
Do we have a good way of tracking things like this so that they are going to be picked up in Arti?
> It's unclear where in the stack the issue is, and uncovering it without a reproducible case plus logs, or metrics for those layers, is difficult. I've asked them for a reproducible case so we can try to find out, but I agree with @mikeperry here that these stats would be very useful. If we don't want to commit to adding these to C Tor because of the relative complexity, we definitely need to make sure they get added to Arti.
I haven't looked at C-tor in quite a while, so I can't comment on the complexity of the task. Waiting until they share some logs/repro steps sounds sensible to me.
In Arti we currently don't have metrics of any kind, so things like this will likely be part of a bigger project where we tackle observability more generally.
Note that while this is not currently on our roadmap, we're certainly not opposed to having service-side metrics (but the topic of adding metrics is somewhat broad, so we will need to discuss it as a team, define the scope of the project, etc.).
> Do we have a good way of tracking things like this so that they are going to be picked up in Arti?
I can open a ticket and link it to this one. Tickets are the only way (that I know of) to track feature requests in Arti.