"[WARN] failed to get unique circID. [99 duplicates hidden]
[WARN] No unused circ IDs. Failing.
This has been going on 24/7 for days now, filling up the logs quite fast. It doesn't seem to have any effect on the relay's speed. I didn't find any solid explanation for what causes this error.
[WARN] failed to get unique circID.
then this
[WARN] No unused circ IDs. Failing."
It is hard to say why this error occurs without more verbose logging output.
Agreed; my best guess for explanation here would be that on some relay-to-relay connection, you're getting close to the limit of 2^15 circuit IDs in each direction that Tor 0.2.3 and earlier had.
We should never be calling this function on a connection with a client, since we should never be trying to extend a circuit towards a client.
Also I bet this is a very expensive check if we're really exhausting the circID space here. So, upgrading this to "major" and marking 024-backportable.
Trac: Milestone: N/A to Tor: 0.2.5.x-final; Keywords: N/A deleted, tor-relay 024-backport added; Priority: trivial to major
The other possibility here is that our circuit_id_in_use code that we added to solve #7912 (moved) has a flaw somewhere, and circuitIDs are under some circumstances being marked as unusable but never getting made available again.
My branch bug11553_024 has an improved warning, plus an improved (?) randomized algorithm to handle exhaustion scenarios. This algorithm makes its users distinguishable by guards from older clients, so if it goes in a stable release, it should go in the same release as something like #11438 (moved).
My branch bug11553_025 merges that forward to 0.2.5, and includes an attempt at diagnosing whether the #7912 (moved) fix could be responsible.
A few comments/questions from my side:
a) The new log message should be fine.
b) Limiting the number of iterations to find a usable circuit ID in the worst case to 64 instead of 2^15 or 2^31 also sounds good. Having non-predictable circuit IDs sounds like a wise move to me anyway. Nevertheless, although the chance of not finding a usable circuit ID for a specific channel within 64 tries is quite small, it will eventually happen and people will end up with warning messages in their log files, which is perfectly fine as long as it happens very infrequently. I think it is really bad to have ignorable warning messages, but unfortunately I have no better suggestion at this time than stating in the log message that the warning can be ignored if it happens only very rarely.
c) Defining MAX_CIRCID_ATTEMPTS 64 seems like a good choice to me. But there is a comment missing in the code explaining why the constant is 64 and not 23, 42, 128 or 1024.
d) Is a warning only printed once for a channel (chan->warned_circ_ids_exhausted)? If so, why?
b) I think that if this does turn out to be ignorable, we can downrate it even further.
c) Hm. I don't have a great justification. For any good N, N/2 and N*2 are probably good choices too. Basically, I wanted a number that wasn't too high, but which still gave a pretty small chance of failure when the space started to fill up. For N=64, there's a one-in-a-million failure chance when the space is 80% full, and a one-in-850 chance when the space is 90% full, and a one-in-26 chance when the space is 95% full. That seemed okay.
d) So as to avoid spamming the log with junk. I guess we could refine it to rate-limit it per channel.
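The failure odds quoted in (c) can be sanity-checked with a few lines of Python. This is only a sketch under a simplifying assumption: each of the N attempts is treated as an independent uniform pick that lands on a used ID with probability equal to the fraction of the space in use. The function name `failure_probability` is made up for illustration.

```python
def failure_probability(fraction_full: float, attempts: int = 64) -> float:
    """Chance that every one of `attempts` independent picks hits a used ID.

    Assumes each pick is uniform over the ID space and independent of the
    others, so the per-pick miss chance is simply `fraction_full`.
    """
    return fraction_full ** attempts


if __name__ == "__main__":
    for fraction in (0.80, 0.90, 0.95):
        p = failure_probability(fraction)
        print(f"{fraction:.0%} full: about 1 in {1 / p:,.0f}")
```

Under that assumption the 90% and 95% cases come out close to the one-in-850 and one-in-26 figures, and the 80% case is on the order of one in a million.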
ra, are you able to test this patch over the weekend? I'd like to know what kind of results it reports, and whether it indicates a possible bug in the #7351 (moved) code.
New logging in bb9b4c37f8e7f5cf78918f382e90d8b11ff42551 looks okay, but perhaps we should consider resetting warn_circ_ids_exhausted if the number of circuits drops or if sufficient time elapses since it was set, so we don't block warnings on that channel forever even if the exhaustion condition is relieved later?
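One way the "age out" idea could look, as a rough Python sketch rather than the actual C: instead of a one-shot boolean, the channel remembers when it last warned and allows another warning after an interval. `ChannelWarnState`, `should_warn`, and the one-hour `WARN_INTERVAL_SECONDS` are all hypothetical names and values chosen for illustration, not anything from the Tor source.

```python
import time

# Hypothetical interval; the ticket does not propose a specific value.
WARN_INTERVAL_SECONDS = 3600.0


class ChannelWarnState:
    """Per-channel record of when the exhaustion warning was last emitted."""

    def __init__(self):
        self.last_warned = None  # None means "never warned on this channel"

    def should_warn(self, now=None):
        """Return True (and record the time) if a warning may be logged now."""
        if now is None:
            now = time.monotonic()
        if self.last_warned is None or now - self.last_warned >= WARN_INTERVAL_SECONDS:
            self.last_warned = now
            return True
        return False
```

A caller would log the warning only when `should_warn()` returns True, so a channel that recovers and later exhausts its IDs again is not silenced forever.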
0d75344b0e0eafc89db89a974e87b16564cd8f0a:
This code looks okay, but the possibility of occasional false positives on circuit ID exhaustion amplifies my above-stated concern about never letting the blockage on the log message age out.
We could be smarter about this with better data structures; suppose, for example, the circuit IDs in use per channel lived in a balanced tree with each node knowing how many nodes its subtree had. Then we could know how many circuit IDs are available to assign (N = max_circ_id - num_circ_ids from the root), pick a random i with 0 <= i < N, and then walk down the tree adjusting i as appropriate for which circuit IDs are already in use. This would assign a uniformly distributed random unused circuit ID in deterministic log time in the number of circuit IDs already assigned, and never fail.
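To make the selection step concrete, here is a small Python sketch. It swaps the balanced tree for a sorted list plus binary search, which gives the same "pick the i-th unused ID" lookup in log time but does not give log-time inserts and deletes the way a tree with subtree counts would. IDs are assumed to run from 1 to `max_circ_id`, and the names `kth_unused_id` and `pick_unused_id` are invented for this example.

```python
import random


def kth_unused_id(used_sorted, k):
    """Return the k-th (0-based) unused ID, counting IDs upward from 1.

    `used_sorted` must be a strictly increasing list of IDs already in use.
    """
    lo, hi = 0, len(used_sorted)
    while lo < hi:
        mid = (lo + hi) // 2
        # used_sorted[mid] - 1 - mid is the number of unused IDs below
        # used_sorted[mid]; it never decreases as mid grows.
        if used_sorted[mid] - 1 - mid <= k:
            lo = mid + 1
        else:
            hi = mid
    # `lo` used IDs sit below the answer, so skip past them.
    return k + 1 + lo


def pick_unused_id(used_sorted, max_circ_id, rng=random):
    """Pick a uniformly random unused ID in 1..max_circ_id, or 0 if none."""
    available = max_circ_id - len(used_sorted)
    if available <= 0:
        return 0  # space exhausted
    # Real code would want a cryptographically strong source here.
    return kth_unused_id(used_sorted, rng.randrange(available))
```

For example, with IDs 2, 3 and 5 in use out of 1..6, the three unused IDs are 1, 4 and 6, and each is returned with equal probability.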
985deaaaf7b7397857e02206e89392e0ee101077 looks okay to me
What is the chance that crypto_rand fills test_circ_id with zeroes? Nonzero.
So there is a nonzero chance of getting a zero circ_id and violating the specification.
Good near-catch. It is not actually going to try to use a zero circ ID, since get_unique_circ_id_by_chan() returns zero to indicate failure, but that amounts to a lucky coincidence. Hitting the 2^-15 or 2^-31 probability of a zero circ_id would cause it to exit the loop without actually making MAX_CIRCID_ATTEMPTS tries, so it increases the probability of failing to find a circuit ID. For the narrow circuit ID case, this is the dominant source of failures to find a circuit ID when N/max_range is less than about 0.85, if my analysis is correct. This code should check for the zero case in the loop condition, I think.
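A minimal Python sketch of the loop shape being suggested, not the actual C in get_unique_circ_id_by_chan(): a zero candidate is treated like any other unusable ID, so it uses up one attempt instead of ending the search early. `pick_circ_id`, `in_use`, and `rand_below` are hypothetical names for this example; only MAX_CIRCID_ATTEMPTS and its value of 64 come from the discussion above.

```python
import secrets

MAX_CIRCID_ATTEMPTS = 64


def pick_circ_id(in_use, max_range, rand_below=secrets.randbelow):
    """Try up to MAX_CIRCID_ATTEMPTS random IDs; return 0 to signal failure.

    `in_use` is any container of IDs already taken on the channel, and
    `rand_below(n)` must return a uniform integer in [0, n).
    """
    for _ in range(MAX_CIRCID_ATTEMPTS):
        candidate = rand_below(max_range)
        if candidate == 0:
            continue  # zero is reserved: skip it, but keep trying
        if candidate not in in_use:
            return candidate
    return 0
```

With this shape, drawing a zero no longer cuts the search short, and zero is only ever returned as the failure value.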