Commit 421bea40 authored by George Kadianakis

Improve some parts of prop#328.

- Rename 'overload-reached' to 'overload-general'
- Simplify 'overload-ratelimits' for engineering reasons
- Add versioning on the extra-info fields
- Add a few more metrics
parent 46f0bb63
+26 −13
@@ -36,20 +36,22 @@ the future and thus this is not an exhaustive list.
The general overload line indicates that a relay has reached an "overloaded
state" which can be one or many of the following load metrics:

   - Any OOMkiller invocation due to memory pressure
   - Any onionskins are dropped
   - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
   - Any OOM invocation due to memory pressure
   - Any ntor onionskins are dropped
   - TCP port exhaustion
   - DNS timeout reached
   - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
   - Control port overload (too many messages queued)

The format of the overloaded line added in the extra-info document is as
follows:

```
"overload-reached" YYYY-MM-DD HH:MM:SS NL
"overload-general" SP version SP YYYY-MM-DD HH:MM:SS NL
   [At most once.]
```

The timestamp is when a at least one metrics was detected. It should always be
The timestamp is when at least one metric was detected. It should always be
at the hour and thus, as an example, "2020-01-10 13:00:00" is an expected
timestamp. Because this is a binary state, if the line is present, we consider
that it was hit at the very least once somewhere between the provided
@@ -60,27 +62,35 @@ The overload field should remain in place for 72 hours since last triggered.
If the limits are reached again in this period, the timestamp is updated, and
this 72 hour period restarts.

The 'version' field is set to '1' for the initial implementation of this
proposal which includes all the above overload metrics except from the CPU and
control port overload. The first version also uses a primitive logic for
detecting DNS timeouts (only if libevent failed a set of 3 DNS requests/retries
in a row).
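
The timestamp and 72 hour rules above can be sketched as follows. This is a
minimal illustration in Python and not tor's implementation: the class name
`OverloadGeneralTracker` and its methods are hypothetical. It assumes that the
timestamp is rounded down to the hour and that the 72 hour period is measured
from that rounded timestamp; the text does not settle either point.

```python
from datetime import datetime, timedelta

# From the text: the field remains in place for 72 hours since last triggered.
OVERLOAD_HOLD = timedelta(hours=72)


class OverloadGeneralTracker:
    """Hypothetical helper that tracks the 'overload-general' state."""

    def __init__(self, version=1):
        self.version = version
        self.last_triggered = None  # hour-aligned time of the latest trigger

    def note_overload(self, now):
        # Assumption: round down to the hour so that the published value looks
        # like "2020-01-10 13:00:00". A new trigger restarts the 72 hour period.
        self.last_triggered = now.replace(minute=0, second=0, microsecond=0)

    def line(self, now):
        """Return the extra-info line, or None if no line should be published."""
        if self.last_triggered is None:
            return None
        # Assumption: the period is counted from the rounded timestamp.
        if now - self.last_triggered >= OVERLOAD_HOLD:
            return None
        stamp = self.last_triggered.strftime("%Y-%m-%d %H:%M:%S")
        return f"overload-general {self.version} {stamp}"
```

A caller would invoke `note_overload()` whenever any of the listed metrics is
hit and `line()` when building the extra-info document.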

# 1.2. Token bucket size

Relays should report the 'BandwidthBurst' and 'BandwidthRate' limits in their
descriptor, as well as the number of times these limits were reached, for read
and write, in the past 24 hours starting at the provided timestamp rounded
down to the hour.
and write, in the past 24 hours starting at the provided timestamp rounded down
to the hour.

```
"overload-ratelimits" SP YYYY-MM-DD SP HH:MM:SS
                      SP rate-limit SP burst-limit
                      SP read-rate-count SP read-burst-count
                      SP write-rate-count SP write-burst-count NL
                      SP read-overload-count SP write-overload-count NL
  [At most once.]
```

The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate
and BandwidthBurst found in the torrc configuration file.

The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
how many times the reported limits were exhausted and thus the maximum between
the read and write count occurances.
The "{read|write}-overload-count" are the counts of how many times the reported
limits of burst/rate were exhausted and thus the maximum between the read and
write count occurrences.

The 'version' field is set to '1' for the initial implementation of this
proposal.
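
As an illustration of the simplified layout, a consumer could parse the line as
sketched below. This is a hedged Python example, not code from tor or from any
descriptor parsing library: `OverloadRatelimits` and
`parse_overload_ratelimits` are hypothetical names. It assumes the layout
exactly as written in the format block above, which carries no version token
even though the text mentions a 'version' field.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OverloadRatelimits:
    timestamp: datetime
    rate_limit: int            # raw BandwidthRate value from torrc
    burst_limit: int           # raw BandwidthBurst value from torrc
    read_overload_count: int
    write_overload_count: int


def parse_overload_ratelimits(line):
    """Parse one line in the assumed layout:

    overload-ratelimits YYYY-MM-DD HH:MM:SS rate burst read-count write-count
    """
    parts = line.strip().split(" ")
    if len(parts) != 7 or parts[0] != "overload-ratelimits":
        raise ValueError("not an overload-ratelimits line in the assumed layout")
    timestamp = datetime.strptime(parts[1] + " " + parts[2], "%Y-%m-%d %H:%M:%S")
    rate, burst, read_count, write_count = (int(p) for p in parts[3:])
    return OverloadRatelimits(timestamp, rate, burst, read_count, write_count)
```

If the published line does include a version token, the field count and the
indices used here would need to shift by one.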

# 1.3. File Descriptor Exhaustion

@@ -91,7 +101,7 @@ notice which relay has a value too small and we can notify them.
This should be published in this format:

```
"overload-fd-exhausted" YYYY-MM-DD HH:MM:SS NL
"overload-fd-exhausted" SP version YYYY-MM-DD HH:MM:SS NL
  [At most once.]
```

@@ -102,6 +112,9 @@ This overload field should remain in place for 72 hours since last triggered.
If the limits are reached again in this period, the timestamp is updated, and
this 72 hour period restarts.

The 'version' field is set to '1' for the initial implementation of this
proposal which detects fd exhaustion only when a socket open fails.
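
Detecting exhaustion "only when a socket open fails" can be sketched as below.
This is an illustrative Python example, not tor's C implementation: the
function name, the `on_exhausted` callback and the `opener` parameter are
hypothetical. It assumes that exhaustion shows up as an `EMFILE` or `ENFILE`
error from the socket call.

```python
import errno
import socket

# Assumption: these two errno values indicate that no descriptor was available
# (per-process limit and system-wide limit, respectively).
FD_EXHAUSTION_ERRNOS = {errno.EMFILE, errno.ENFILE}


def open_socket_noting_exhaustion(on_exhausted, opener=socket.socket):
    """Open a socket and report fd exhaustion if the open fails because of it.

    on_exhausted: callback that records the overload-fd-exhausted state.
    opener: callable creating the socket; replaceable so failures can be simulated.
    """
    try:
        return opener()
    except OSError as exc:
        if exc.errno in FD_EXHAUSTION_ERRNOS:
            on_exhausted()
        raise  # the caller still has to handle the failed open
```

Other failures, such as a refused connection, pass through without marking the
relay as fd-exhausted.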

# 2. Load Metrics

This section proposes a series of metrics that should be collected and