Commit b57743b9 authored by George Kadianakis

Update overload proposal (prop#328)

- Rename 'overload-reached' to 'overload-general'.
- Simplify 'overload-ratelimits' for engineering reasons.
- Add a few more reports (most importantly the DNS ones).
parent 46f0bb63
@@ -37,19 +37,20 @@ The general overload line indicates that a relay has reached an "overloaded
state" which can be one or many of the following load metrics:
   - Any OOMkiller invocation due to memory pressure
-  - Any onionskins are dropped
-  - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+  - Any ntor onionskins are dropped
   - TCP port exhaustion
+  - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+  - Control port overload (too many messages queued)
The format of the overloaded line added in the extra-info document is as
follows:
```
-"overload-reached" YYYY-MM-DD HH:MM:SS NL
+"overload-general" YYYY-MM-DD HH:MM:SS NL
[At most once.]
```
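As a rough illustration of the new line format, the sketch below builds an "overload-general" line from a detection time. The function name `overload_general_line` is hypothetical, and two points are assumptions rather than statements from the proposal: that the timestamp is in UTC, and that "at the hour" means rounding down to the start of the hour.

```python
from datetime import datetime, timezone


def overload_general_line(detected_at: datetime) -> str:
    """Format an "overload-general" line for an extra-info document.

    Assumptions (not stated in the proposal): the timestamp is UTC and is
    rounded down to the start of the hour. `detected_at` must be
    timezone-aware.
    """
    at_hour = detected_at.astimezone(timezone.utc).replace(
        minute=0, second=0, microsecond=0
    )
    # The quoted keyword in the format is a literal; NL is a newline.
    return "overload-general " + at_hour.strftime("%Y-%m-%d %H:%M:%S") + "\n"


line = overload_general_line(
    datetime(2020, 1, 10, 13, 42, 7, tzinfo=timezone.utc)
)
```

Under these assumptions, a metric detected at 13:42:07 is reported with the timestamp "2020-01-10 13:00:00".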
-The timestamp is when a at least one metrics was detected. It should always be
+The timestamp is when at least one metric was detected. It should always be
at the hour and thus, as an example, "2020-01-10 13:00:00" is an expected
timestamp. Because this is a binary state, if the line is present, we consider
that it was hit at the very least once somewhere between the provided
@@ -70,17 +71,16 @@ down to the hour.
```
"overload-ratelimits" SP YYYY-MM-DD SP HH:MM:SS
SP rate-limit SP burst-limit
-SP read-rate-count SP read-burst-count
-SP write-rate-count SP write-burst-count NL
+SP read-overload-count SP write-overload-count NL
[At most once.]
```
The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate
and BandwidthBurst found in the torrc configuration file.
-The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
-how many times the reported limits were exhausted and thus the maximum between
-the read and write count occurances.
+The "{read|write}-overload-count" are the counts of how many times the reported
+limits of burst/rate were exhausted and thus the maximum between the read and
+write count occurrences.
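To make the simplified field layout concrete, here is a minimal parsing sketch for the new "overload-ratelimits" line. The names `OverloadRatelimits` and `parse_overload_ratelimits` are hypothetical, and the sketch assumes that all four numeric fields are plain decimal integers, which the proposal text does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OverloadRatelimits:
    timestamp: datetime
    rate_limit: int
    burst_limit: int
    read_overload_count: int
    write_overload_count: int


def parse_overload_ratelimits(line: str) -> OverloadRatelimits:
    """Parse the simplified line: keyword, date, time, two limits, two counts."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 7 or fields[0] != "overload-ratelimits":
        raise ValueError("not a well-formed overload-ratelimits line")
    # The date and time are two SP-separated fields in the format.
    timestamp = datetime.strptime(
        fields[1] + " " + fields[2], "%Y-%m-%d %H:%M:%S"
    )
    # Assumption: limits and counts are decimal integers.
    rate, burst, reads, writes = (int(f) for f in fields[3:])
    return OverloadRatelimits(timestamp, rate, burst, reads, writes)


parsed = parse_overload_ratelimits(
    "overload-ratelimits 2020-01-10 13:00:00 1048576 2097152 3 5\n"
)
```

The example values (a 1048576 rate limit, 2097152 burst limit, and counts of 3 and 5) are made up for illustration.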
# 1.3. File Descriptor Exhaustion
@@ -102,6 +102,28 @@ This overload field should remain in place for 72 hours since last triggered.
If the limits are reached again in this period, the timestamp is updated, and
this 72 hour period restarts.
+# 1.4. DNS Server Issues
+Relays should report DNS-related failures so that we can potentially find
+relays that use bad DNS servers or are misconfigured and that are causing
+performance issues on the network.
+If the relay sees more than 'threshold' % of its DNS requests failing with
+timeouts, it should add the following line to its extra-info descriptor:
+```
+"dns-timeouts-theshold-reached" SP YYYY-MM-DD HH:MM:SS SP threshold NL
+[At most once.]
+```
+If the relay sees more than 'threshold' % of its DNS requests failing because
+of server errors, it should add the following line to its extra-info descriptor:
+```
+"dns-server-failure-theshold-reached" SP YYYY-MM-DD HH:MM:SS SP threshold NL
+[At most once.]
+```
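The threshold rule for both DNS lines can be sketched as below. The function `dns_threshold_line` and its parameters are hypothetical; the keywords are copied verbatim from the format lines in this commit, including their spelling. The sketch assumes that the threshold is an integer percentage and that the timestamp is written as given, since this section does not say whether it is rounded to the hour.

```python
from datetime import datetime
from typing import Optional


def dns_threshold_line(
    keyword: str,
    total_requests: int,
    failed_requests: int,
    threshold_pct: int,
    now: datetime,
) -> Optional[str]:
    """Return the descriptor line if strictly more than threshold_pct percent
    of DNS requests failed, otherwise None."""
    if total_requests <= 0:
        return None  # nothing measured, nothing to report
    # Integer comparison avoids floating-point rounding at the boundary.
    if failed_requests * 100 <= threshold_pct * total_requests:
        return None
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return f"{keyword} {stamp} {threshold_pct}\n"


when = datetime(2020, 1, 10, 13, 0, 0)
over = dns_threshold_line("dns-timeouts-theshold-reached", 100, 11, 10, when)
at_limit = dns_threshold_line("dns-timeouts-theshold-reached", 100, 10, 10, when)
```

With a threshold of 10, eleven timeouts out of 100 requests produce a line, while exactly ten do not, because the text says "more than".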
# 2. Load Metrics
This section proposes a series of metrics that should be collected and