Towards an Exception Reports framework

Currently, a number of pulse-checks of the network and its components are conducted. IRL's tickets legacy/trac#24070, legacy/trac#24071, and legacy/trac#24073 suggest a few more.

However, I think we have to step back and start looking at an organized framework for this.

Exception reports are basically overviews of significant changes in some routine or activity. We determine a baseline, say, the consensus weight of each bandwidth authority, check it regularly (maybe daily or twice a day), and notify the relevant parties if there's a drastic change.

The basics would be this:

  • we determine the areas to address, such as public relays, exits-only, dirauths, bwauths, bridges, censorship, guards, etc.

  • we determine the metrics we need to see, e.g., changes in CW (consensus weight), advertised bandwidth, versions, TTL... and determine a baseline range for each, maybe within a standard deviation or so.

  • then we figure out who needs to know when something is outside the baseline range.

  • we could also develop some automated or human-driven 'next steps', e.g., call X bwauth and tell them to ping their upstream, file a Trac ticket, email some alias@ of people.

  • another more interesting, and vital, direction would be to incorporate the OONI data, which would give a much more detailed baseline of network health.
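
To make the basics above a bit more concrete, here is a rough sketch of how one rule might tie an area, a metric, a baseline range, and the people to notify together. Everything in it is an assumption for illustration: the `Rule` fields, the `check` function, the example values, and the address are all made up, and a plain mean and sample standard deviation is only one possible way to define the baseline.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Rule:
    """One exception-report rule (hypothetical structure)."""
    area: str          # e.g. "bwauths", "bridges", "exits-only"
    metric: str        # e.g. "consensus_weight"
    max_sigma: float   # allowed distance from the baseline mean, in standard deviations
    notify: list       # who needs to know when the value is out of range
    next_step: str     # suggested automated or human-driven follow-up


def check(rule, history, latest):
    """Return a report dict if `latest` is outside the baseline range, else None.

    `history` is a list of earlier measurements of the metric; the baseline
    is their mean, and the allowed range is max_sigma sample standard
    deviations around it.
    """
    if len(history) < 2:
        return None  # not enough data to compute a standard deviation
    baseline = mean(history)
    sigma = stdev(history)
    allowed = rule.max_sigma * sigma
    if abs(latest - baseline) <= allowed:
        return None  # within the baseline range, nothing to report
    return {
        "area": rule.area,
        "metric": rule.metric,
        "baseline": baseline,
        "latest": latest,
        "notify": rule.notify,
        "next_step": rule.next_step,
    }


if __name__ == "__main__":
    # Made-up rule and numbers, purely for illustration.
    rule = Rule(
        area="bwauths",
        metric="consensus_weight",
        max_sigma=1.0,
        notify=["alias@example.org"],
        next_step="file a ticket and contact the bwauth operator",
    )
    report = check(rule, history=[100, 102, 98, 101, 99], latest=60)
    if report is not None:
        print(report)
```

A daily or twice-a-day job could run `check` for every rule and hand any non-empty report to whatever does the notifying. Where the measurements come from, and how the reports get delivered, is left open here.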