
immediately trigger an alert on Django exceptions

anarcat requested to merge stricter-exception-rules into main

We make a special case for this, now that most exceptions have been squashed. Those exceptions are often transient, and we never seem to have been able to reliably trigger an actual alert notification on them, even when they had been sustained for 5+ minutes.

It seems the combination of increase() over a 5m range vector and the for: clause is playing tricks on us.

So let's try to be a little more sensitive on this.

To keep this from flapping like mad, we switch to looking at the increase over a longer period of time (1h). This means that, to flap, this alert would need to have exceptions for a while, then no exceptions at all for an hour, then exceptions again. That is possible, but it is more likely that exceptions will be sustained after a funked deployment.
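To illustrate the flapping argument, here is a rough Python sketch. It is not the actual rule and not how Prometheus computes increase() (which extrapolates between scraped samples). It assumes one sample per minute and treats the alert as firing whenever at least one exception falls inside the lookback window. The burst pattern and the function name are made up for the example.

```python
def firing_episodes(exception_minutes, window, horizon):
    """Count separate firing episodes of an alert that fires whenever
    at least one exception occurred in the last `window` minutes.

    Simplified stand-in for `increase(counter[window]) > 0` with no `for:`.
    """
    events = set(exception_minutes)
    episodes = 0
    was_firing = False
    for t in range(horizon):
        # Firing if any of the minutes t, t-1, ..., t-window+1 saw an exception.
        firing = any((t - d) in events for d in range(window))
        if firing and not was_firing:
            episodes += 1  # a new notification would go out here
        was_firing = firing
    return episodes


# Hypothetical pattern: three short bursts of exceptions, 20 minutes apart.
bursts = [0, 1, 2, 20, 21, 22, 40, 41, 42]

episodes_5m = firing_episodes(bursts, window=5, horizon=180)
episodes_1h = firing_episodes(bursts, window=60, horizon=180)
print(episodes_5m, episodes_1h)
```

Under these assumptions the 5m window resolves and re-fires between each burst, while the 1h window stays firing as one continuous episode until an hour after the last exception.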

At least the alerting history of the TypeError stuff (tpo/web/donate-neo#122 (closed)) seems to confirm this.

We do not take the removal of for: lightly here. We do this because this is the django site, which TPA will be exclusively responsible for in the long run.

Contrast this with the kill switch monitoring, where we want to give the CiviCRM contractor (who will stick around) a chance to look at things before we alert TPA (and where we still have a generous for: 1h).

/cc @lavamind @lelutin
