Update overload proposal (prop#328)

- Rename 'overload-reached' to 'overload-general'.
- Simplify 'overload-ratelimits' for engineering reasons.
- Add a few more reports (more importantly the DNS ones).

b57743b9 · George Kadianakis · 46f0bb63 · b57743b9
Commit b57743b9 authored 4 years ago by George Kadianakis
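
To illustrate the line formats this commit settles on, here is a minimal parsing sketch for the `overload-general` and `overload-ratelimits` extra-info lines. It is not code from Tor or from the proposal. The class and function names are hypothetical, fields are assumed to be separated by single spaces, and the limit and count fields are assumed to be decimal integers.

```python
from datetime import datetime
from typing import NamedTuple, Optional, Union


class OverloadGeneral(NamedTuple):
    """Hypothetical container for an "overload-general" line."""
    timestamp: datetime


class OverloadRatelimits(NamedTuple):
    """Hypothetical container for an "overload-ratelimits" line."""
    timestamp: datetime
    rate_limit: int
    burst_limit: int
    read_overload_count: int
    write_overload_count: int


def _parse_timestamp(date: str, time: str) -> datetime:
    # The proposal writes timestamps as YYYY-MM-DD HH:MM:SS.
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


def parse_overload_line(
    line: str,
) -> Optional[Union[OverloadGeneral, OverloadRatelimits]]:
    """Parse one extra-info line; return None for any other keyword."""
    # Assumption: fields are separated by single spaces.
    fields = line.strip().split(" ")
    keyword, args = fields[0], fields[1:]
    if keyword == "overload-general" and len(args) == 2:
        return OverloadGeneral(_parse_timestamp(args[0], args[1]))
    if keyword == "overload-ratelimits" and len(args) == 6:
        timestamp = _parse_timestamp(args[0], args[1])
        # Assumption: the four remaining fields are decimal integers.
        rate, burst, reads, writes = (int(a) for a in args[2:])
        return OverloadRatelimits(timestamp, rate, burst, reads, writes)
    return None


def is_on_the_hour(timestamp: datetime) -> bool:
    """Check the proposal's expectation that timestamps are hour-aligned."""
    return timestamp.minute == 0 and timestamp.second == 0
```

For example, `parse_overload_line("overload-general 2020-01-10 13:00:00")` yields an `OverloadGeneral` whose timestamp passes `is_on_the_hour`. A line carrying the older six-counter `overload-ratelimits` layout has a different field count and is returned as `None` instead of being misread.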
--- a/proposals/328-relay-overload-report.md
+++ b/proposals/328-relay-overload-report.md
@@ -37,19 +37,20 @@ The general overload line indicates that a relay has reached an "overloaded
 state" which can be one or many of the following load metrics:

   - Any OOMkiller invocation due to memory pressure
-   - Any onionskins are dropped
-   - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+   - Any ntor onionskins are dropped
   - TCP port exhaustion
+   - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+   - Control port overload (too many messages queued)

 The format of the overloaded line added in the extra-info document is as
 follow:

 ```
-"overload-reached" YYYY-MM-DD HH:MM:SS NL
+"overload-general" YYYY-MM-DD HH:MM:SS NL
   [At most once.]
 ```

-The timestamp is when a at least one metrics was detected. It should always be
+The timestamp is when at least one metric was detected. It should always be
 at the hour and thus, as an example, "2020-01-10 13:00:00" is an expected
 timestamp. Because this is a binary state, if the line is present, we consider
 that it was hit at the very least once somewhere between the provided
@@ -70,17 +71,16 @@ down to the hour.
 ```
 "overload-ratelimits" SP YYYY-MM-DD SP HH:MM:SS
                      SP rate-limit SP burst-limit
-                      SP read-rate-count SP read-burst-count
-                      SP write-rate-count SP write-burst-count NL
+                      SP read-overload-count SP write-overload-count NL
  [At most once.]
 ```

 The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate
 and BandwidthBurst found in the torrc configuration file.

-The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
-how many times the reported limits were exhausted and thus the maximum between
-the read and write count occurances.
+The "{read|write}-overload-count" are the counts of how many times the reported
+limits of burst/rate were exhausted and thus the maximum between the read and
+write count occurrences.

 # 1.3. File Descriptor Exhaustion

@@ -102,6 +102,28 @@ This overload field should remain in place for 72 hours since last triggered.
 If the limits are reached again in this period, the timestamp is updated, and
 this 72 hour period restarts.

+# 1.4. DNS Server Issues
+
+Relays should report DNS-related failures so that we can find relays that
+use bad DNS servers or are misconfigured, and are thereby causing
+performance issues for the network.
+
+If the relay sees more than 'threshold' % of its DNS requests failing with
+timeouts it should add the following line in its extra info descriptor:
+
+```
+"dns-timeouts-theshold-reached" SP YYYY-MM-DD HH:MM:SS SP threshold NL
+  [At most once.]
+```
+
+If the relay sees more than 'threshold' % of its DNS requests failing because
+of server errors it should add the following line in its extra info descriptor:
+
+```
+"dns-server-failure-theshold-reached" SP YYYY-MM-DD HH:MM:SS SP threshold NL
+  [At most once.]
+```
+
 # 2. Load Metrics

 This section proposes a series of metrics that should be collected and