Towards an Exception Reports framework

Currently, a number of pulse-checks are conducted on the network and its components. IRL's tickets #24070, #24071, #24073 propose a few more.

However, I think we have to step back and start looking at an organized framework for this.

Exception reports are basically overviews of significant changes in some routine or activity. We determine a baseline, say, the consensus weight of each bandwidth authority, check it on a schedule (maybe daily or twice a day), and notify the relevant parties if there's a drastic change.

The basics would be this:

We determine the areas to address, such as public relays, exit-only relays, dirauths, bwauths, bridges, censorship, guards, etc.
We determine the metrics we need to see, e.g., changes in consensus weight (CW), advertised bandwidth, versions, TTL... and determine a baseline range for each, maybe within a standard deviation or so.
Then we figure out who needs to know when something is outside the baseline range.
We could also develop some automated or human-driven 'next steps', e.g., call bwauth X and tell them to ping their upstream, file a Trac ticket, or email some alias@ of people.
Another more interesting, yet vital, direction would be to incorporate the OONI data, which would give a much more detailed baseline of network health.
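
To make the baseline-and-notify steps above concrete, here is a minimal Python sketch. It is only an illustration under assumptions, not a proposed implementation: the names `check_metric`, `ExceptionReport`, `recipients` and `CONTACTS`, the example.org addresses, and the one-standard-deviation default are all hypothetical, and the history values would in practice come from whatever data source we settle on.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional, Sequence


@dataclass
class ExceptionReport:
    """One out-of-baseline observation (hypothetical structure)."""
    area: str            # e.g. "bwauths", "bridges"
    metric: str          # e.g. "consensus_weight"
    latest: float
    baseline_mean: float
    baseline_stdev: float


def check_metric(area: str, metric: str, history: Sequence[float],
                 latest: float, max_sigma: float = 1.0) -> Optional[ExceptionReport]:
    """Return a report if `latest` is more than `max_sigma` sample standard
    deviations away from the mean of `history`, otherwise None."""
    if len(history) < 2:
        # Not enough data points to compute a sample standard deviation.
        return None
    mu = mean(history)
    sigma = stdev(history)
    if abs(latest - mu) <= max_sigma * sigma:
        return None
    return ExceptionReport(area, metric, latest, mu, sigma)


# Hypothetical routing table: which alias hears about which area.
CONTACTS = {
    "bwauths": ["bwauth-operators@example.org"],
}
DEFAULT_CONTACTS = ["network-health@example.org"]


def recipients(report: ExceptionReport) -> list:
    """Look up who should be notified for the report's area."""
    return CONTACTS.get(report.area, DEFAULT_CONTACTS)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    past_weights = [100, 102, 98, 101, 99]
    report = check_metric("bwauths", "consensus_weight", past_weights, 60)
    if report is not None:
        print(report, "->", recipients(report))
```

A fixed standard-deviation threshold is just the simplest option here. Each area or metric could get its own `max_sigma`, and the 'next steps' could hang off the same routing table.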
