I think that the "DEBUG" log might not be the root cause here. It's what would happen if the task to build a channel closes without having actually built a channel or given us an error. That's something we should investigate but I don't know if it's the root cause.
Thinking over possible causes here:
Maybe Arti itself stops doing well after 12 hours or so, and this has nothing to do with onion services.
Maybe when channels go down we don't re-open them well?
Maybe if a guard becomes unreliable we don't reconnect so well?
Maybe the problem is specific to onion services:
maybe we are forgetting to publish over time
maybe we are not keeping our intro circuits up over time
maybe we aren't handling time period/SRV transitions correctly
...?
Let's figure out how we can best reproduce and diagnose this issue, and how we can figure out what the root cause is.
@jnewsome Is it easy in shadow to have a test that simulates a day or two worth of time passing, but runs fairly quickly on its own?
I hacked up the shadow CI test a bit to repro this. I changed the simulation to run for 5h instead of 30m, and added arti clients to the arti hidden server that try to use it at t=2h and t=3h (in addition to the original that uses it at t=25m t=20m).
Neither of the later clients are able to successfully contact the server. It could be that the bug surfaces earlier in the simulation due to TestingTorNetwork settings (e.g. hsdir_interval = 120)
All arti logs below are with -l debug,tor_proto=trace.
Here is the arti server's log. The last entry is at 1h51m, which seems surprising.arti-extra.1001.stdout
Here is the arti log for the successful client at t=25m t=20m:arti.1000.stdout
And here's the arti log for the unsuccessful client at t=2h:arti.1000.stdout
I'm still poking around; LMK if you'd like me to rerun with any additional tweaks or to push a branch and instructions for repro'ing.
It does look the HS only uploads its descriptors twice. Once @ t=15m when it starts, and then again at t=28m:
$ grep descriptor fileserver-onion-arti/arti-extra.1001.stdout 2000-01-01T00:15:00Z DEBUG tor_hsservice::svc::publish::reactor: starting descriptor publisher reactor nickname=tgen_hs2000-01-01T00:15:00Z INFO tor_dirmgr::bootstrap: 1: Downloading microdescriptors (we are missing 7). attempt=12000-01-01T00:15:00Z DEBUG tor_dirmgr::state: Consensus now usable, with 0 microdescriptors missing. The current consensus is fresh until 2000-01-01 0:20:00.0 +00:00:00, and valid until 2000-01-01 0:30:00.0 +00:00:00. I've picked 2000-01-01 0:27:59.460564431 +00:00:00 as the earliest time to replace it.2000-01-01T00:15:00Z DEBUG arti_client::status: 100%: connecting successfully; directory is fetching microdescriptors (7/7)2000-01-01T00:15:01Z DEBUG arti_client::status: 100%: connecting successfully; directory is fetching microdescriptors (7/7)2000-01-01T00:15:01Z DEBUG arti_client::status: 100%: connecting successfully; directory is fetching microdescriptors (7/7)2000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$0a9b1b207fd13a6f117f95cafa358eee2234f19a hsdir_rsa_id=$0a9b1b207fd13a6f117f95cafa358eee2234f19a2000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$3baeac8e24c87b4e536484837b67966487a93214 hsdir_rsa_id=$3baeac8e24c87b4e536484837b67966487a932142000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::backoff: failed to upload a hidden service descriptor: circuit failed attempt=1 can_retry=true2000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::backoff: failed to upload a hidden service descriptor: circuit failed attempt=1 can_retry=true2000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$a52ca5b56c64d864f6ae43e56f29acbd5706dda1 hsdir_rsa_id=$a52ca5b56c64d864f6ae43e56f29acbd5706dda12000-01-01T00:15:04Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$4ebb385c80a2ca5d671e16f1c722fbfb5f176891 hsdir_rsa_id=$4ebb385c80a2ca5d671e16f1c722fbfb5f1768912000-01-01T00:15:06Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$3fb0bd7827c760fe7f9dd810fcb10322d63ab4cf hsdir_rsa_id=$3fb0bd7827c760fe7f9dd810fcb10322d63ab4cf2000-01-01T00:15:06Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$ff197204099fa0e507fa46d41fed97d3337b4baa hsdir_rsa_id=$ff197204099fa0e507fa46d41fed97d3337b4baa2000-01-01T00:15:06Z DEBUG tor_hsservice::svc::publish::reactor: descriptor uploaded successfully to 6/6 HSDirs nickname=tgen_hs time_period=TimePeriod { interval_num: 131483, length: IntegerMinutes { value: 120 }, epoch_offset_in_sec: 3600 }2000-01-01T00:27:59Z DEBUG tor_dirmgr::state: Consensus now usable, with 0 microdescriptors missing. The current consensus is fresh until 2000-01-01 0:30:00.0 +00:00:00, and valid until 2000-01-01 0:40:00.0 +00:00:00. I've picked 2000-01-01 0:38:40.13421484 +00:00:00 as the earliest time to replace it.2000-01-01T00:27:59Z DEBUG arti_client::status: 100%: connecting successfully; directory is fetching microdescriptors (7/7); next directory is fetching authority certificates (1/1)2000-01-01T00:28:00Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$3baeac8e24c87b4e536484837b67966487a93214 hsdir_rsa_id=$3baeac8e24c87b4e536484837b67966487a932142000-01-01T00:28:00Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$a52ca5b56c64d864f6ae43e56f29acbd5706dda1 hsdir_rsa_id=$a52ca5b56c64d864f6ae43e56f29acbd5706dda12000-01-01T00:28:00Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$7073525d6a7b97e4708ce8b712baf21049a72168 hsdir_rsa_id=$7073525d6a7b97e4708ce8b712baf21049a721682000-01-01T00:28:00Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$4ebb385c80a2ca5d671e16f1c722fbfb5f176891 hsdir_rsa_id=$4ebb385c80a2ca5d671e16f1c722fbfb5f1768912000-01-01T00:28:01Z DEBUG tor_hsservice::svc::publish::reactor: successfully uploaded descriptor to HSDir nickname=tgen_hs hsdir_id=$ff197204099fa0e507fa46d41fed97d3337b4baa hsdir_rsa_id=$ff197204099fa0e507fa46d41fed97d3337b4baa2000-01-01T00:28:01Z DEBUG tor_hsservice::svc::publish::reactor: descriptor uploaded successfully to 5/5 HSDirs nickname=tgen_hs time_period=TimePeriod { interval_num: 131483, length: IntegerMinutes { value: 120 }, epoch_offset_in_sec: 3600 }2000-01-01T00:38:40Z DEBUG tor_dirmgr::state: Consensus now usable, with 0 microdescriptors missing. The current consensus is fresh until 2000-01-01 0:40:00.0 +00:00:00, and valid until 2000-01-01 0:50:00.0 +00:00:00. I've picked 2000-01-01 0:45:24.380732721 +00:00:00 as the earliest time to replace it.2000-01-01T00:38:40Z DEBUG arti_client::status: 92%: connecting successfully; directory is fetching microdescriptors (7/7); next directory is fetching authority certificates (1/1)2000-01-01T00:45:24Z DEBUG tor_dirmgr::state: Consensus now usable, with 0 microdescriptors missing. The current consensus is fresh until 2000-01-01 0:45:00.0 +00:00:00, and valid until 2000-01-01 0:55:00.0 +00:00:00. I've picked 2000-01-01 0:52:07.41431165 +00:00:00 as the earliest time to replace it.2000-01-01T00:45:24Z DEBUG arti_client::status: 92%: connecting successfully; directory is fetching microdescriptors (7/7); next directory is fetching authority certificates (1/1)2000-01-01T00:52:07Z DEBUG tor_dirmgr::state: Consensus now usable, with 0 microdescriptors missing. The current consensus is fresh until 2000-01-01 0:55:00.0 +00:00:00, and valid until 2000-01-01 1:05:00.0 +00:00:00. I've picked 2000-01-01 1:04:12.824566134 +00:00:00 as the earliest time to replace it.2000-01-01T00:52:07Z DEBUG arti_client::status: 92%: connecting successfully; directory is fetching microdescriptors (7/7); next directory is fetching authority certificates (1/1)
I don't recall the expected interval here wrt hsdir_interval = 120 (e.g. is that minutes or seconds?), but given that the 2nd upload happens 13m after the 1st upload, it seems odd that there are no more after that.
Hmm, by design the publisher only republishes the descriptor if there's a consensus change, or if the IPT manager says the intro points have changed. (FWIW, the last time the IPT mgr asked the publisher to publish a new descriptor is at 15m: 2000-01-01T00:15:03Z DEBUG tor_hsservice::ipt_mgr: HS service tgen_hs: 4 good IPTs, >= target 3, publishing). Thinking about it some more, this seems wrong...
hsdir_interval = 120 means ~2h(ish?, depending at what time the first time period began at) marks the start of a new time period (which haven't published a descriptor for, presumably). Note: the publisher only publishes the descriptor for the next time period if the publishing task happens to run between the time a new SRV appears and the time period change (because SRV#2's "range" includes TP#2: #1166 (comment 2984189))
(just yesterday we chatted about this in #1201 (closed))
Here, it looks like:
this wasn't the case (so the descriptor was only published for the current/previous TPs)
the publisher never republished the descriptor (because there was no consensus change)
This would explain why the clients are failing to connect after 2h (though I don't see the descriptor lookup corresponding failure in the client logs?..)
I suspect the behaviour from #1241 (closed) (which we haven't implemented yet) is meant to prevent this sort of thing from happening.
I need to validate this theory before I start trying to fix the publisher. Could you share the branch you're using? I'd like to try it out for myself.
Made a draft PR including some notes for running it in the description. (I think I need to update the README re the arti build(s) that the test expects)
It does look the HS only uploads its descriptors twice. Once @ t=15m when it starts, and then again at t=28m:
This is very strange. I get similar results if I run the simulation in CI, but not if I run it locally (I've run the simulation numerous times while debugging this, using the instructions from !1890).
@jnewsome I wonder if it's because I'm running an older version of shadow? (Shadow v3.0.0-0-ge502d20e 2023-05-18--18:20:51 and TGen v1.1.1 running GLib v2.74.5 and IGraph v0.9.10)
Here's what I have in the logs from the (local) simulated run (I added some extra logs to the publisher to see how many consensus changes it finds out about):
So from the logs above, it looks like the publisher works as expected. However, starting from 1h, the service reports a bunch of errors:
gabi-tor@goose ~/shadow-test-logs4 > rg "Introduction handshake was invalid" arti-extra.1001.stdout400:2000-01-01T01:05:03Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed402:2000-01-01T01:06:34Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed404:2000-01-01T01:08:05Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed407:2000-01-01T01:09:37Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed433:2000-01-01T01:11:08Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed445:2000-01-01T02:05:01Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed480:2000-01-01T02:06:33Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed482:2000-01-01T02:08:04Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed509:2000-01-01T02:09:36Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed513:2000-01-01T02:11:07Z WARN tor_hsservice::helpers: Problem while accepting rendezvous request: error: Could not process INTRODUCE request: Introduction handshake was invalid: Circuit-extension handshake authentication failed
FWIW, I see these errors in trace.log.2024-01-12, as well as in the logs from my own (non-shadow) experimental run. I think this is the root cause of the bug reported by @stefani.
I instrumented the code in handshake/hs_ntor.rs to see what's going on, and it turns out this error happens because the service can't find the decryption key to decrypt the ENCRYPTED_DATA from the INTRODUCE2 cell. It happens because none of the service's subcredentials from the RequestContext of the IptEstablisher match the subcredential used by the client. This is not because the correct subcredential doesn't exist in the key store, but rather because the snapshot of subcredentials cached in RequestContext becomes outdated after a while. We even have a TODO about it:
// TODO HSS: perhaps the subcredentials should be retrieved in // server_receive_intro_no_keygen instead? See also the TODO in HsNtorServiceInput
Addressing the TODO fixes the Problem while accepting rendezvous request issues for me (i.e. all the simulated clients successfully connect to the onion service). I have a WIP patch here: gabi-250/arti@695acaa3
I think there are some more problems and unanswered questions here:
for some reason, in CI, this shadow simulation ends at 1:51 (despite there
being simulated clients that try to connect at 2:05).
in CI, there are no consensus changes(?) in that simulation
we might need to republish periodically, even if there aren't any
HsDir/consensus changes (see #1241 (comment 2985928)). But I don't think this is what's causing the bug reported here..
Also, it's amazing how easy this was to debug using shadow .
(while I did manage to repro with a long-running service, eventually the bug stopped happening.. without shadow it would've taken me so much longer to figure out what's going on)
I wonder if it's because I'm running an older version of shadow? (Shadow v3.0.0-0-ge502d20e 2023-05-18--18:20:51 and TGen v1.1.1 running GLib v2.74.5 and IGraph v0.9.10)
Could be. That's the tagged 3.0.0 release; the CI is using a substantially newer version (193924aae0dab30ffda0abe29467f552949849fa Aug 25) and I'm using a slightly newer version than that. Glancing through the git log between 3.0.0 and the CI version I don't see any obvious suspects so far; I'll see if I can repro the 3.0.0 behavior you're seeing and if so probably do a binary search...
Strange that I can't repro @gabi-250 's result. My glib (and likely glibc) versions are different, which can be enough that our results won't be completely deterministically identical; it's possible that some trivial difference in e.g. syscall order is causing the simulation to get a "lucky" execution that doesn't exercise the bug.
This looks a little bit like #1170, wherein the tokio reactor gets "stuck" and stops processing events. In #1170 using async-std instead of tokio, or enabling tracing in tokio, works around the bug though; in this case using async-std gives me the same behavior, which hints it may be a different bug. Maybe that also means I can do a run with tokio tracing though to help dig in a bit further in this case...
The last descriptor upload happens at 38m this time (arti-extra-shadow-3.1.0.stdout was copied from shadow.data/hosts/fileserver-onion-arti/arti-extra.1001.stdout):