Prop328
mentioned in merge request tor!334 (closed)
mentioned in issue tor#40222 (closed)
requested review from @dgoulet
assigned to @mikeperry
````diff
-    SP write-rate-count SP write-burst-count NL
+    SP read-overload-count SP write-overload-count NL
     [At most once.]
 ```

 The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate
 and BandwidthBurst found in the torrc configuration file.

-The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
-how many times the reported limits were exhausted and thus the maximum between
-the read and write count occurances.
+The "{read|write}-overload-count" are the counts of how many times the reported
+limits of burst/rate were exhausted and thus the maximum between the read and
+write count occurances.
+
+The 'version' field is set to '1' for the initial implementation of this
````

Good catch. Fixed in cccf29c7.
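For concreteness, a line following the grammar quoted in this hunk might appear in an extra-info document like this (all values are made up for the example):

```
overload-ratelimits 2021-04-01 12:00:00 12582912 25165824 3 0
```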
````diff
 ```
 "overload-ratelimits" SP YYYY-MM-DD SP HH:MM:SS
     SP rate-limit SP burst-limit
-    SP read-rate-count SP read-burst-count
-    SP write-rate-count SP write-burst-count NL
+    SP read-overload-count SP write-overload-count NL
     [At most once.]
 ```

 The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate
 and BandwidthBurst found in the torrc configuration file.

-The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
-how many times the reported limits were exhausted and thus the maximum between
-the read and write count occurances.
+The "{read|write}-overload-count" are the counts of how many times the reported
````

I'm a bit confused about what "how many times" means here. It seems like this will get incremented whenever any connection wants to read/write and finds that the bandwidth is exhausted. That means that this will trigger a lot of times when we're out of bandwidth -- once for each reading/writing connection. So this count won't tell us how overloaded the relay is: instead it will be a product of how often the relay is overloaded and how many connections the relay had.
Here is a suggestion for a not-too-hard improvement, if you think it makes sense: when you're recording this event, remember the time when it happened, and only increment these counters at most once per second.
I like the at-most-once-per-second idea. In fact, for the timescales we're talking about, even at-most-once-per-minute is plenty, and may be better due to side channel risk. We're planning on using this to reduce instances of long-term overload via sbws, which has a reaction time of hours/days.
There is also the related idea that dgoulet brought up: for DNS, maybe what we count once per minute is the DNS timeout rate exceeding X%. But even if we use percentages in that case, the threshold should still be very low, 1% or less, especially if we also use this time-based counter in combination.
I implemented the "once-per-minute" rate limit counter in a0ec61169b.
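For illustration, here is a minimal sketch of that kind of guard, assuming per-direction timestamps and a 60-second window (the names and structure are invented for the example, not taken from a0ec61169b):

```c
/* Sketch only: not the actual tor code. The idea is that hitting the
 * BandwidthRate/BandwidthBurst limit bumps the overload counters at most
 * once per minute per direction, no matter how many connections stall
 * in that window. */

#include <stdint.h>
#include <time.h>

#define OVERLOAD_COUNT_INTERVAL 60 /* seconds */

static time_t last_read_overload = 0;
static time_t last_write_overload = 0;
static uint64_t read_overload_count = 0;
static uint64_t write_overload_count = 0;

/* Call this whenever a connection finds the relevant token bucket empty. */
static void
note_bandwidth_exhausted(int is_reading, time_t now)
{
  time_t *last = is_reading ? &last_read_overload : &last_write_overload;
  uint64_t *count = is_reading ? &read_overload_count : &write_overload_count;

  if (now < *last + OVERLOAD_COUNT_INTERVAL)
    return; /* Already counted an event for this direction this minute. */
  *last = now;
  (*count)++;
}
```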
@mikeperry Do you want me to also implement the 1% logic for DNS timeouts? Right now it will register an `overload-generic` if libevent reports a single timeout (see 13d6610f94).
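To make the 1% idea concrete, a rough per-minute check might look like the following sketch (the threshold, counters, and housekeeping hook are assumptions for the example, not existing tor interfaces):

```c
/* Sketch only: not the actual tor code. Instead of flagging overload on a
 * single libevent DNS timeout, keep per-minute counters and only flag
 * overload if the timeout rate for that minute exceeds a small threshold. */

#include <stdint.h>

#define DNS_TIMEOUT_OVERLOAD_PERCENT 1 /* "1% or less" from the discussion */

static uint64_t dns_requests_this_minute = 0;
static uint64_t dns_timeouts_this_minute = 0;

/* Call once per minute from a housekeeping callback; returns 1 if this
 * minute's DNS timeout rate should count as an overload signal. */
static int
dns_overload_check_and_reset(void)
{
  int overloaded = 0;
  if (dns_requests_this_minute > 0) {
    uint64_t pct = (100 * dns_timeouts_this_minute) / dns_requests_this_minute;
    overloaded = (pct >= DNS_TIMEOUT_OVERLOAD_PERCENT);
  }
  /* Reset the window for the next minute. */
  dns_requests_this_minute = 0;
  dns_timeouts_this_minute = 0;
  return overloaded;
}
```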
````diff
 
-- Any OOMkiller invocation due to memory pressure
-- Any onionskins are dropped
-- CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+- Any OOM invocation due to memory pressure
+- Any ntor onionskins are dropped
 - TCP port exhaustion
+- DNS timeout reached
+- CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+- Control port overload (too many messages queued)

 The format of the overloaded line added in the extra-info document is as
 follow:

 ```
-"overload-reached" YYYY-MM-DD HH:MM:SS NL
````

What is your reasoning for separating them?
We're trying to balance a few things here. Revealing the specific type of overload condition increases the ability for the adversary to use this information for various attacks. I'd rather it be combined in all cases, so that the specific cause is not visible during any attacks. In all cases, the reaction of our systems should be the same: direct less load to relays with this line. If we need to dig, that's what MetricsPort is for.
If we find out that we were wrong about the specific method of overload sensitivity, and that makes this line too common, I think we should address that kind of thing in the reporting, rather than breaking it out and dealing with it later. Again, no matter what, if a relay is overloaded over long timescales, it should get less load from sbws. That is the primary goal of this line. For analysis or short-term issues, we should ask operators to dig into MetricsPort on a per-relay basis, because high-res stats are risky to publish.
I agree with the rest of your comments, though.
Vague thoughts:

- We might contact the operators letting them know that they need more RAM / a caching DNS resolver.
- We might treat different kinds of overload as invalidating relays for a different amount of time, if for example we discover that DNS issues are usually one-off but memory exhaustion is clustered.
- We might discount one or another kind of overload if we find that it's too easy for an attacker to exploit it in order to move targeted relays out of the network.
Is the MetricsPort insufficient for these purposes?
Do we expect to debug this a lot?
Do we expect relays to refuse to answer, should we need to debug this?
Are ones that refuse actually good relays? Does anyone else ask them for this kind of data, or already obtain it via other methods?
I was biasing for conservatism, and to learn more about risk here. Maybe that is not the right bias, and we expect a lot of issues with false positives here. I could see that. It all depends on whether relays will answer requests for MetricsPort, and if that is enough, I think.
If we don't trust the system itself, let's just make sure sbws will have a consensus parameter to turn overload usage off, in the event of attack or malfunction. It will be many months before sbws makes any use of this field anyway; we can watch it until then.
I agree that it is wise to think about how this could go wrong, but we should plan for that holistically. Right now, by breaking these things back out, we're just reverting a decision that was made for specific reasons. Instead, let's discover and synthesize root concerns so they are all covered.
Trying again from another angle, here is my list of the high-level concerns we're trying to balance:
- We want to react to overloaded relays, to give them less load.
- We want relay operators that experience fixable kinds of overload to fix those issues.
- We want to diagnose and fix Tor bugs that are causing overload.
- We don't want to give out high-res info about the specific kinds of load being experienced, as this may create side channels and aid attacks that trigger more of that specific kind of overload to achieve various ends.
- We don't want attackers to significantly bias relay usage with attacks.
Okay, I am persuaded that we might be doing the right thing in keeping all this stuff rolled up into one kind of overload event. This shouldn't be a merge blocker.
I am not persuaded that we will in fact decide to keep all this stuff rolled up into one. But that's okay, since the system supports future migration.
IOW I'm okay with the current version of this.
added 1 commit
- cccf29c7 - prop328: Rate-limit counter should increase once per minute.