Towards an Exception Reports framework
Currently, a number of pulse-checks of the network and its components are being conducted. IRL's tickets #24070, #24071, #24073 raise a few more.
However, I think we have to step back and start looking at an organized framework for this.
Exception reports are basically overviews of significant changes in some routine/activity. We determine some baseline, say, the consensus weight of each bandwidth authority, check against it on a schedule, maybe daily or twice a day, and notify the relevant parties if there's a drastic change.
The basics would be this:
- We determine the areas to address, such as public relays, exit-only relays, dirauths, bwauths, bridges, censorship, guards, etc.
- We determine the metrics we need to see, e.g., changes in CW, advertised bandwidth, versions, TTL..., and determine a baseline for each, maybe within a standard deviation or so.
- Then we figure out who needs to know when something is outside the baseline range.
- We could also develop some automated or human-driven 'next steps', e.g., call X bwauth and tell them to ping their upstream, file a Trac ticket, email some alias@ of people.
- Another more interesting, yet vital, direction would be to incorporate the OONI data, which would give a much more detailed baseline of network health.
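
To make the baseline and notification steps above a bit more concrete, here is a rough Python sketch, not a proposed implementation. It assumes we already have a history of values per metric (e.g. the consensus weight reported by each bwauth) and flags anything whose latest value falls more than a chosen number of standard deviations from its historical mean. The function names, the (area, subject) keys, the contact addresses and the one-sigma threshold are all made up for illustration.

```python
from statistics import mean, stdev

# Hypothetical routing table: area -> who needs to know.
# The addresses are placeholders, not real aliases.
CONTACTS = {
    "bwauth": ["bwauth-alias@example.org"],
    "dirauth": ["dirauth-alias@example.org"],
}


def find_exceptions(history, latest, max_sigma=1.0):
    """Return metrics whose latest value is outside the baseline range.

    history: {(area, subject): [past values]}
    latest:  {(area, subject): newest value}
    max_sigma: allowed distance from the mean, in standard deviations
               (an assumed threshold; tune per metric).
    """
    exceptions = []
    for key, past in history.items():
        if key not in latest or len(past) < 2:
            continue  # not enough data to form a baseline
        mu, sigma = mean(past), stdev(past)
        value = latest[key]
        if sigma == 0:
            # Perfectly flat history: any change at all is an exception.
            distance = 0.0 if value == mu else float("inf")
        else:
            distance = abs(value - mu) / sigma
        if distance > max_sigma:
            exceptions.append({
                "area": key[0],
                "subject": key[1],
                "value": value,
                "baseline": mu,
                "sigmas": distance,
            })
    return exceptions


def route(exceptions, contacts=CONTACTS):
    """Group one-line messages by the contacts registered for each area."""
    outbox = {}
    for exc in exceptions:
        message = (
            f"{exc['area']}/{exc['subject']}: {exc['value']} "
            f"vs baseline {exc['baseline']:.1f}"
        )
        for address in contacts.get(exc["area"], []):
            outbox.setdefault(address, []).append(message)
    return outbox
```

Run daily or twice a day, `find_exceptions` would produce the report and `route` would decide who gets it; the 'next steps' (filing a ticket, sending the email) would hang off the returned outbox. A plain mean and standard deviation is only the simplest possible baseline, and some metrics may need something more robust.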