tor-chanmgr: `ChanMgr::get_or_launch` is not cancellation-safe, but is used with a timeout

TL;DR: ChanMgr::get_or_launch is not cancellation-safe, which means combining it with a timeout (like is done in tor-circmgr) can cause the pending channel to be left in the channel map forever. This prevents new channels from being built.

The ChanMgr::get_or_launch method may build a new channel. The process is generally:

Add a "pending" entry to the channel map.
Build the channel.
Swap the pending channel with the new open channel in the channel map.

There was some concern (see #1654) about bugs caused by the above returning early between steps (1) and (3), which could the pending entry to be left in the channel map forever, preventing new channels from being built. This was mitigated by adding a PendingChannelHandle, which was intended to make sure the pending entry was cleaned up properly. Originally it was written to clean up the entry automatically when the handle was dropped (see !2538 (merged)), but there were concerns about the "mutex lock in drop handler" making it easier to accidentally cause deadlocks in the future. Instead this handle behaviour was changed to only warn (and panic in debug builds) if the handle was dropped (see !2566 (merged)). (Edit: The debug-panic behaviour was removed in !2625 (merged)).

This has helped identify the likely cause of #1290 thanks to the backtrace from @wesleyac:

2024-10-28T16:35:56Z ERROR arti::logging: Panic at crates/tor-chanmgr/src/mgr/state.rs:616:13: Dropped the 'PendingChannelHandle' without removing the channel
[...]
  12: core::ptr::drop_in_place<tor_chanmgr::ChanMgr<tor_rtcompat::tokio::TokioNativeTlsRuntime>::get_or_launch<tor_linkspec::owned::OwnedCircTarget>::{{closure}}>
             at ./crates/tor-chanmgr/src/lib.rs:295:80
  13: core::ptr::drop_in_place<tor_circmgr::build::create_common<tor_rtcompat::tokio::TokioNativeTlsRuntime,tor_linkspec::owned::OwnedCircTarget>::{{closure}}>
             at ./crates/tor-circmgr/src/build.rs:90:59
  14: core::ptr::drop_in_place<<tor_proto::circuit::ClientCirc as tor_circmgr::build::Buildable>::create<tor_rtcompat::tokio::TokioNativeTlsRuntime>::{{closure}}>
             at ./crates/tor-circmgr/src/build.rs:154:76
  15: core::ptr::drop_in_place<alloc::boxed::Box<dyn core::future::future::Future+Output = core::result::Result<alloc::sync::Arc<tor_proto::circuit::ClientCirc>,tor_circmgr::err::Error>+core::marker::Send>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs:542:1
  16: core::ptr::drop_in_place<core::pin::Pin<alloc::boxed::Box<dyn core::future::future::Future+Output = core::result::Result<alloc::sync::Arc<tor_proto::circuit::ClientCirc>,tor_circmgr::err::Error>+core::marker::Send>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs:542:1
  17: core::ptr::drop_in_place<tor_circmgr::build::Builder<tor_rtcompat::tokio::TokioNativeTlsRuntime,tor_proto::circuit::ClientCirc>::build_notimeout::{{closure}}>
             at ./crates/tor-circmgr/src/build.rs:294:22
  18: core::ptr::drop_in_place<tor_rtcompat::timer::Timeout<tor_circmgr::build::Builder<tor_rtcompat::tokio::TokioNativeTlsRuntime,tor_proto::circuit::ClientCirc>::build_notimeout::{{closure}},tokio::time::sleep::Sleep>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ptr/mod.rs:542:1
  19: tor_circmgr::build::double_timeout::{{closure}}::{{closure}}
             at ./crates/tor-circmgr/src/build.rs:563:51

Here you can see that tor_circmgr::build::double_timeout is running, which runs drops the future returned by tor_circmgr::build::Builder::build_notimeout, which contains the future returned by ChanMgr::get_or_launch, which contains the PendingChannelHandle.

So it turns out that it's not enough for ChanMgr::get_or_launch to only protect against early returns. get_or_launch must also expect that its future may be dropped at any time. Since get_or_launch isn't designed to handle this, it's not cancellation-safe and cannot be used with an external timeout.

This is probably the cause of #1290, but I'm making this a separate issue to keep this one more focused. And then we can close #1290 if a fix for this issue does indeed fix that issue too.

I'm tagging this as "Project 141", but it affects clients too, not just relays.

Edited Nov 12, 2024 by opara