Be more verbose when warning about not finding a circuit ID

changed milestone to %Tor: 0.2.4.x-final

added 023-backport 024-backport 025-triaged andrea-review-0254 component::core tor/tor milestone::Tor: 0.2.4.x-final nickm-backport-02422 priority::high resolution::fixed status::closed tor-relay type::enhancement version::tor 0.2.5.3-alpha labels

Trac:
no_circuit_ids_left_verbose.patch

More verbose warning messages.

Patch has not been tested!

"(Possible DoS from client)."

~~More likely "to" client, if someone tries to DDoS some old HS.~~ eh. wrong. Don't know how it can be triggered for chan->is_client either.

These warnings need to be verbose only if you want to check for code bugs. If you are sure the cause is an evil attacker rather than more bugs, then it is better to mute these warnings.

Agreed; my best guess for an explanation here would be that on some relay-to-relay connection, you're getting close to the limit of 2^15^ circuit IDs in each direction that Tor 0.2.3 and earlier had.

We should never be calling this function on a connection with a client, since we should never be trying to extend a circuit towards a client.

Also I bet this is a very expensive check if we're really exhausting the circID space here. So, upgrading this to "major" and marking 024-backportable.

Trac:
Milestone: N/A to Tor: 0.2.5.x-final
Keywords: N/A deleted, tor-relay 024-backport added
Priority: trivial to major

Trac:
Keywords: tor-relay 024-backport deleted, tor-relay 024-backport 023-backport added

The other possibility here is that our circuit_id_in_use code that we added to solve #7912 (moved) has a flaw somewhere, and circuitIDs are under some circumstances being marked as unusable but never getting made available again.

My branch bug11553_024 has an improved warning, plus an improved (?) randomized algorithm to handle exhaustion scenarios. This algorithm makes its users distinguishable by guards from older clients, so if it goes in a stable release, it should go in the same release as something like #11438 (moved).

My branch bug11553_025 merges that forward to 0.2.5, and includes an attempt at diagnosing whether the #7912 (moved) fix could be responsible.

Trac:
Status: new to needs_review

Btw, proposal 214 and/or ticket #7351 (moved) was never mentioned in the Changelog or ReleaseNotes for the new 0.2.4.x. It was implemented silently.

(Whoops. Looks like we forgot a changes file? Opening another ticket for that.)

Regarding "My branch bug11553_025": what about this warning:

log_warn(LD_CIRC,"failed to get unique circID.")

Thanks; added a rate-limiter.

A few comments/questions from my side:

a) The new log message should be fine.

b) Limiting the number of iterations to find a usable circuit ID in the worst case to 64 instead of 2^15^ or 2^31^ also sounds good. Having non-predictable circuit IDs sounds like a wise move to me anyway. Nevertheless, although the chance of not finding a usable circuit ID for a specific channel within 64 tries is quite small, it will eventually happen and people will end up having warning messages in their log files, which is perfectly fine as long as this happens very infrequently. I think it is really bad to have ignorable warning messages, but unfortunately I have no better suggestion at this time on how to handle this other than stating in the log message that the warning can be ignored if it happens only very rarely.

c) Defining MAX_CIRCID_ATTEMPTS 64 seems like a good choice to me. But there is a comment missing in the code explaining why the constant is 64 and not 23, 42, 128 or 1024.

d) Is a warning only printed once for a channel (chan->warned_circ_ids_exhausted)? If so, why?

b) I think that if this does turn out to be ignorable, we can downrate it even further.

c) Hm. I don't have a great justification. For any good N, N/2 and N*2 are probably good choices too. Basically, I wanted a number that wasn't too high, but which still gave a pretty small chance of failure when the space started to fill up. For N=64, there's a one-in-a-million failure chance when the space is 80% full, and a one-in-850 chance when the space is 90% full, and a one-in-26 chance when the space is 95% full. That seemed okay.
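The figures quoted above can be checked with a short calculation. This is only a sketch under a simplifying assumption: each of the N tries picks an ID uniformly and independently, so a try fails with probability equal to the fraction of the ID space already in use. The function name `circid_failure_probability` is made up for this illustration and is not part of Tor.

```c
#include <assert.h>

/* Hypothetical helper (not a Tor function): probability that `attempts`
 * independent, uniformly random picks ALL land on circuit IDs that are
 * already in use, when a fraction `fill` of the ID space is occupied.
 * Under that model the answer is simply fill^attempts. */
double circid_failure_probability(double fill, int attempts)
{
  double p = 1.0;
  int i;

  assert(fill >= 0.0 && fill <= 1.0);
  assert(attempts >= 0);

  /* Repeated multiplication avoids needing libm's pow(). */
  for (i = 0; i < attempts; ++i)
    p *= fill;
  return p;
}
```

With 64 attempts this model gives roughly 1 in 1.6 million at 80% full, about 1 in 850 at 90% full, and between 1 in 26 and 1 in 27 at 95% full, which is in line with the numbers given for N=64.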

d) So as to avoid spamming the log with junk. I guess we could refine it to rate-limit it per channel.
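One way the per-channel rate limit could look, as a rough sketch only: instead of a one-shot flag, each channel remembers when it last warned and warns again once a minimum interval has passed. The struct `chan_warn_state`, its fields, and the function `should_warn_circ_ids_exhausted` are all hypothetical names for this illustration; they are not the fields or helpers the branch actually uses.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical per-channel state; in a real channel struct this would sit
 * next to (or replace) a one-shot "already warned" flag. */
struct chan_warn_state {
  time_t last_warned;  /* when this channel last logged the warning */
  int ever_warned;     /* 0 until the first warning is logged */
};

/* Return 1 if the caller should log the exhaustion warning now, 0 if it
 * should stay quiet. Warns at most once per `min_interval` seconds for
 * each channel, so the warning can reappear if the problem persists or
 * comes back later. */
int should_warn_circ_ids_exhausted(struct chan_warn_state *st,
                                   time_t now, time_t min_interval)
{
  assert(st != NULL);

  if (st->ever_warned && now - st->last_warned < min_interval)
    return 0;

  st->ever_warned = 1;
  st->last_warned = now;
  return 1;
}
```

A caller would check this just before logging, for example with a one-hour interval, and skip the log call when it returns 0.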

ra, are you able to test this patch over the weekend? I'd like to know what kind of results it reports, and whether it indicates a possible bug in the #7351 (moved) code.

ra, are you able to test this patch over the weekend?

And by "this patch" I mean the bug11553_025 version of it.

Begin code review:

New logging in bb9b4c37f8e7f5cf78918f382e90d8b11ff42551 looks okay, but perhaps we should consider resetting warned_circ_ids_exhausted if the number of circuits drops or if sufficient time elapses since it was set, so we don't block warnings on that channel forever even if the exhaustion condition is relieved later?
0d75344b0e0eafc89db89a974e87b16564cd8f0a:
- This code looks okay, but the possibility of occasional false positives on circuit ID exhaustion amplifies my above-stated concern about never letting the blockage on the log message age out.
- We could be smarter about this with better data structures; suppose, for example, the circuit IDs in use per channel lived in a balanced tree with each node knowing how many nodes its subtree had. Then we could know how many circuit IDs are available to assign (N = max_circ_id - num_circ_ids from the root), pick a random i with 0 <= i < N, and then walk down the tree adjusting i as appropriate for which circuit IDs are already in use. This would assign a uniformly distributed random unused circuit ID in deterministic log time in the number of circuit IDs already assigned, and never fail.
985deaaaf7b7397857e02206e89392e0ee101077 looks okay to me
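The selection step from the tree idea above can be sketched without writing a full balanced tree. The sketch below swaps in a sorted array of in-use IDs plus a binary search: the rank adjustment is the same (skip past every in-use ID below the target), and lookup is logarithmic, but insertion into an array is linear, so this is not the proposed data structure itself. The function name `kth_unused_circ_id` and the assumption that valid IDs run from 1 to `max_id` are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Tor code): return the k-th (0-based) circuit ID
 * in [1, max_id] that does NOT appear in `used`, where `used` is sorted in
 * strictly increasing order and holds `n_used` IDs from that range.
 * Picking k uniformly in [0, max_id - n_used) therefore yields a uniformly
 * random unused ID and cannot fail while any ID is free. */
uint32_t kth_unused_circ_id(const uint32_t *used, size_t n_used,
                            uint32_t max_id, uint32_t k)
{
  size_t lo = 0, hi = n_used;

  assert(n_used <= (size_t)max_id);
  assert((size_t)k < (size_t)max_id - n_used);

  /* Find the first index whose entry has more than k unused IDs below it.
   * The count of unused IDs below used[mid] is (used[mid] - 1) - mid, and
   * it never decreases as mid grows, so binary search applies. */
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    uint32_t unused_below = used[mid] - 1 - (uint32_t)mid;
    if (unused_below > k)
      hi = mid;
    else
      lo = mid + 1;
  }

  /* Exactly `lo` in-use IDs lie below the answer, so shift k past them. */
  return k + 1 + (uint32_t)lo;
}
```

For example, with in-use IDs {2, 3, 7} and `max_id` 10, the unused IDs in order are 1, 4, 5, 6, 8, 9, 10, so k = 1 selects 4 and k = 4 selects 8.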

What is the chance that crypto_rand fills test_circ_id with zeroes? Non-zero. So there is a non-zero chance of getting a zero circ_id and violating the specification.

Replying to cypherpunks:

What is the chance that crypto_rand fills test_circ_id with zeroes? Non-zero. So there is a non-zero chance of getting a zero circ_id and violating the specification.

Good near-catch. Not actually going to cause it to try to use a zero circ ID since get_unique_circ_id_by_chan() returns zero to indicate failure, but that amounts to a lucky coincidence. Hitting the 2^-15^ or 2^-31^ probability to zero circ_id would cause it to exit the loop without actually making MAX_CIRCID_ATTEMPTS tries, so it increases the probability to fail to find a circuit ID. For the narrow circuit ID case, this is the dominant source of failures to find a circuit ID when N/max_range is less than about 0.85, if my analysis is correct. This code should check for the zero case in the loop condition, I think.
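A minimal sketch of a loop that handles the zero case explicitly, so that a zero draw costs one attempt instead of ending the search early. Everything here is illustrative: `pick_circ_id`, the callback types, and the `scripted_rand` / `id_is_odd` demo helpers are hypothetical stand-ins (for crypto_rand and the real in-use lookup) so the logic can be exercised deterministically; only the constant name MAX_CIRCID_ATTEMPTS and the "zero means failure" convention come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CIRCID_ATTEMPTS 64

/* Hypothetical callbacks standing in for the random source and for the
 * per-channel "is this circuit ID in use?" check. */
typedef uint32_t (*rand_fn)(void *ctx);
typedef int (*in_use_fn)(uint32_t id, void *ctx);

/* Try up to MAX_CIRCID_ATTEMPTS random candidates. `mask` limits the
 * candidate to the ID width (for example 0x7fff for 15 usable bits).
 * Returns a free non-zero ID, or 0 if every attempt failed. */
uint32_t pick_circ_id(uint32_t mask, rand_fn next_random, void *rand_ctx,
                      in_use_fn is_in_use, void *use_ctx)
{
  int attempts;

  assert(next_random != NULL && is_in_use != NULL);

  for (attempts = 0; attempts < MAX_CIRCID_ATTEMPTS; ++attempts) {
    uint32_t candidate = next_random(rand_ctx) & mask;
    if (candidate == 0)
      continue;  /* zero is reserved: treat it as a failed try, keep going */
    if (!is_in_use(candidate, use_ctx))
      return candidate;
  }
  return 0;  /* zero signals "no ID found", never a usable ID */
}

/* Demo-only helpers: a scripted "random" source that replays an array,
 * and an in-use check that pretends every odd ID is taken. */
struct scripted_rand {
  const uint32_t *values;
  size_t len;
  size_t pos;
};

uint32_t scripted_next(void *ctx)
{
  struct scripted_rand *s = ctx;
  return s->values[s->pos++ % s->len];
}

int id_is_odd(uint32_t id, void *ctx)
{
  (void)ctx;
  return (int)(id & 1u);
}
```

With the scripted draws 0, 3, 0x8000, 6 and a mask of 0x7fff, the first and third draws mask to zero and are skipped, 3 counts as in use, and 6 is returned on the fourth attempt.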
