Exporting additional onion service metrics, such as time measurements for the HSDIR, INTRO, and REND stages of circuit setup on both the client and service side, plus the number of timeouts/failures at each stage, would help uncover the root cause of issues like #40570 (closed) and related onion service reliability and connectivity problems.
We can also export the congestion control info from #40708 (closed) as part of the onion service metrics set, which would help us tune congestion control for onion services.
We can then hook the OnionPerf onion service instances up to our Grafana dashboard and gather more detailed stats that way, as a supplement to the metrics graphed on the metrics website.
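For instance, assuming the OnionPerf hosts run a recent C Tor, the service-side metrics can be exposed to a Prometheus scraper (which Grafana then reads from) with something like the following torrc snippet; the port number here is arbitrary:

```
## Expose Tor's metrics in Prometheus format, on localhost only.
MetricsPort 127.0.0.1:9035 prometheus
MetricsPortPolicy accept 127.0.0.1
```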
- we currently don't have any metrics for the (hidden service) client side
- our metrics library only supports counters and gauges. For time measurements, we're going to need e.g. histograms, so we'll have to update the metrics library accordingly
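To illustrate the kind of histogram support the metrics library would need, here is a minimal sketch of a Prometheus-style histogram with fixed bucket bounds. All type and function names here are hypothetical, not C Tor's actual metrics API:

```c
/* Hypothetical sketch of a Prometheus-style histogram metric:
 * a counter per bucket (with fixed upper bounds), plus a running
 * sum and total count of observations. Not C Tor's real API. */
#include <stddef.h>
#include <stdint.h>

#define HIST_NBUCKETS 5

typedef struct {
  /* Upper bounds for each bucket, in milliseconds, ascending.
   * Observations above the last bound land in an implicit +Inf
   * bucket, stored at index HIST_NBUCKETS. */
  double bounds[HIST_NBUCKETS];
  uint64_t buckets[HIST_NBUCKETS + 1];
  double sum;
  uint64_t count;
} hs_time_histogram_t;

/* Record one time measurement (e.g. a rend circuit build time). */
static void
hist_observe(hs_time_histogram_t *h, double value_ms)
{
  size_t i;
  for (i = 0; i < HIST_NBUCKETS; i++) {
    if (value_ms <= h->bounds[i]) {
      h->buckets[i]++;
      goto done;
    }
  }
  h->buckets[HIST_NBUCKETS]++; /* +Inf bucket */
 done:
  h->sum += value_ms;
  h->count++;
}
```

A scraper can then derive averages (`sum / count`) and approximate percentiles from the per-bucket counts, which plain counters and gauges can't provide.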
Given the above, initially I'm only going to focus on adding timeout/failure metrics for the service side (for the HSDIR/INTRO/REND stages), and tackle timings and client side metrics afterwards.
For core/tor#40570, I managed to gather enough information from the Stem client logs, so I didn't end up adding many new metrics to tor. The ones I added are:
- `hs_intro_rejected_intro_req_count` - the number of introduction requests rejected by the hidden service. This metric has a `reason` label
- `hs_rdv_error_count` - the number of rendezvous errors as seen by the hidden service (this includes circuit establishment failures, failed retries, and end-to-end circuit setup failures). This metric has a `reason` label
- `tor_hs_rend_circ_build_time` - the rendezvous circuit build time in milliseconds
- `tor_hs_intro_circ_build_time` - the introduction circuit build time in milliseconds
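Scraped over the MetricsPort in Prometheus text format, these would look roughly like the lines below. The sample values, the `reason` label values, and the bucket bound are all illustrative; only the existence of a `reason` label is stated above:

```
tor_hs_intro_rejected_intro_req_count{reason="bad_token"} 2
tor_hs_rdv_error_count{reason="rend_circ_timeout"} 1
tor_hs_rend_circ_build_time_bucket{le="1000"} 14
tor_hs_rend_circ_build_time_sum 4860
tor_hs_rend_circ_build_time_count 15
```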
I'm not planning on adding any other metrics to C Tor in the near future, so I'm going to unassign this and put it back in the backlog. @mikeperry feel free to close if you think we can declare this done.
Hrmm, a stat for the hsdir time and failure count would also be useful.
Additionally, stats for the same thing on the service side will be useful too (hsdir post time, fail rate; intro fail/collapse/rotation rate, rend success rate and build time).
These metrics should not send us on a wild goose chase after every possible issue in C Tor, but long term we will want the same metrics, in some form, from Arti: we will want to compare Arti to C Tor and make sure that Arti onion services are more reliable at these things.
This came up with the Onion Service Resource Coalition (a second time) because the folks at Quiet report 'time to availability' as a fairly large issue for them: specifically, how long it takes for an onion address to become reachable by others, either when it's first created or after the host loses its connection to the Tor network and reconnects (Quiet has a mobile app, and network transitions can be frequent on mobile).
It's unclear where in the stack the issue is, and uncovering it without a reproducible case plus logs, or metrics for those layers, is difficult. I've asked them for a reproducible case so we can try to find out, but I agree with @mikeperry here that these stats would be very useful. If we don't want to commit to adding these to C Tor because of the relative complexity, we definitely need to make sure they get added to Arti.
Do we have a good way of tracking things like this so that they are going to be picked up in Arti?
> It's unclear where in the stack the issue is, and uncovering it without a reproducible case plus logs, or metrics for those layers, is difficult. I've asked them for a reproducible case so we can try to find out, but I agree with @mikeperry here that these stats would be very useful. If we don't want to commit to adding these to C Tor because of the relative complexity, we definitely need to make sure they get added to Arti.
I haven't looked at C-tor in quite a while, so I can't comment on the complexity of the task. Waiting until they share some logs/repro steps sounds sensible to me.
In Arti we currently don't have metrics of any kind, so things like this will likely be part of a bigger project where we tackle observability more generally.
Note that while this is not currently on our roadmap, we're certainly not opposed to having service-side metrics (but the topic of adding metrics is somewhat broad, so we will need to discuss it as a team, define the scope of the project, etc.).
> Do we have a good way of tracking things like this so that they are going to be picked up in Arti?
I can open a ticket and link it to this one. Tickets are the only way (that I know of) to track feature requests in Arti.