immediately trigger an alert on Django exceptions
We make a special case for this, now that most exceptions have been squashed. Those exceptions are often transient, and we seem never to have been able to reliably trigger an actual alert notification on them, even when they were sustained for 5+ minutes.
It seems there's a combination of the increase() with a 5m vector and
the for: clause that's playing tricks on us.
So let's try to be a little more sensitive on this.
To keep this from flapping like mad, we switch to looking at the increase over a longer period of time (1h). This means that, to flap, this alert would need to have exceptions for a while, then no exception at all for an hour, then again have exceptions. It's possible, but more likely exceptions will be sustained after a funked deployment.
At least the alerting history of the TypeError stuff (tpo/web/donate-neo#122) seems to confirm this.
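
For illustration, the resulting rule could look roughly like the sketch below. This is not the actual rule from this change: the group name, alert name, metric name (django_exceptions_total), severity label, and annotation text are all hypothetical placeholders. Only the 1h increase() window and the absence of a for: clause come from the description above.

```yaml
groups:
  - name: django            # hypothetical group name
    rules:
      - alert: DjangoExceptions   # hypothetical alert name
        # Hypothetical metric name; substitute whatever counter the
        # Django exporter actually exposes for exceptions.
        # Looking at a 1h window means the alert only resolves after a
        # full hour without any new exception, which limits flapping.
        expr: increase(django_exceptions_total[1h]) > 0
        # No "for:" clause on purpose: the alert fires on the first
        # rule evaluation where the expression is true.
        labels:
          severity: warning       # hypothetical label value
        annotations:
          summary: "Django exceptions in the last hour on {{ $labels.instance }}"
```

With no for: clause, the long window is what does the debouncing: a single exception keeps the alert firing for up to an hour instead of having to stay true across several consecutive evaluations.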
We do not take the removal of for: lightly here. We do this because
this is the Django site, which TPA will be exclusively responsible for
in the long run.
Contrast this with the kill switch monitoring, where we want to give
the CiviCRM contractor (who will stick around) a chance to look at
things before we alert TPA (and where we have a generous for: 1h,
still).