Report CPU overload
In http://meetbot.debian.net/tor-meeting/2021/tor-meeting.2021-03-08-16.59.log.html, it was agreed to drop CPU overload reporting in part on the basis that:
17:12:38 <asn> i have no idea how to measure CPU utilization in a multi-platform way
While it's true that there's no cross-platform way to measure CPU utilization, we don't actually want to know the percentage of CPU utilization anyway. Even if we had it, it wouldn't be useful: suppose there are 2 CPUs, each at 50% utilization. Is tor overloaded? There's no way to tell from this information alone: maybe tor is using 50% of one CPU and something else is using 50% of the other, so tor is not overloaded; or maybe tor is pegging one CPU's worth of time while the scheduler bounces it back and forth between the two, so tor is actually overloaded.
What we really want to know is whether tor is bottlenecked by the CPU instead of by the network interface or by other nodes. What this means in the main loop is that normally, when tor goes to check for events, tor would like to block in poll or equivalent for a little while, not constantly be processing event callbacks. If poll always returns immediately, that means that lots of events are queueing up that tor isn't processing expeditiously; in other words, either the CPU is too busy or tor is swapping excessively.
The main problems with implementing this are:
- How do we actually determine whether poll returned "immediately"? A hardcoded duration will incorrectly classify very fast relays (with short intervals between FD activations) as overloaded, incorrectly classify very slow relays (with high syscall overhead) as not overloaded, or both. In the raw API, I think the best way to do this is to call poll with a timeout of 0 and see whether any FDs are ready. In libevent, I think the least worst way to do this is to occasionally call event_base_loop with EVLOOP_NONBLOCK, then use a prepare watcher to call event_base_get_num_events with EVENT_BASE_COUNT_ACTIVE. If it is non-zero, record potential overload and continue calling event_base_loop with EVLOOP_NONBLOCK; if it is zero, call event_base_loop without EVLOOP_NONBLOCK. Alternatively, it might be possible to just use the prepare watcher without EVLOOP_NONBLOCK and check that the number of activated events is "small".
- How many times does "excessive events" need to happen before tor is considered overloaded? 1? 10? 100? Should it be time-based, e.g. we called event_base_loop with EVLOOP_NONBLOCK continuously for 60 seconds and it kept invoking callbacks?
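
To make the raw-API variant and the time-based threshold concrete, here is a minimal sketch in C, not a proposed patch. It uses only POSIX poll; the names overload_tracker_t, probe_ready_fds, and overload_tracker_note are hypothetical and do not exist in tor, and the 60-second threshold is just the example figure from above. The caller passes in the current time (ideally from a monotonic clock), which keeps the logic independent of any particular clock API.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical state for tracking an unbroken run of "busy" probes. */
typedef struct overload_tracker {
  bool in_streak;        /* true while every probe so far found ready FDs */
  time_t streak_start;   /* time (in seconds) at which the streak began */
  time_t threshold_secs; /* assumed cutoff, e.g. 60 seconds */
} overload_tracker_t;

/* Ask the kernel how many FDs are ready right now. A timeout of 0 means
 * poll never blocks. Returns the ready count, or -1 on error. */
static int
probe_ready_fds(struct pollfd *fds, nfds_t nfds)
{
  return poll(fds, nfds, 0);
}

/* Record one probe result taken at time `now`. Returns true once probes
 * have found work waiting continuously for at least threshold_secs.
 * A probe with no ready FDs (or an error) ends the streak. */
static bool
overload_tracker_note(overload_tracker_t *t, int n_ready, time_t now)
{
  assert(t != NULL);
  if (n_ready <= 0) {
    t->in_streak = false;
    return false;
  }
  if (!t->in_streak) {
    t->in_streak = true;
    t->streak_start = now;
  }
  return (now - t->streak_start) >= t->threshold_secs;
}
```

In a main loop, the idea would be to call probe_ready_fds first; if it returns 0, nothing is pending and the loop can fall back to a normal blocking poll, and otherwise it handles the ready FDs and probes again. Whether a single idle probe should reset the streak, as it does here, is exactly the open question in the second bullet.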