check for the cost of those failed transactions
So in #116 (closed), we added metrics for failed transactions, yay! Now I need to add alerting for that (#75 (closed)), but I'm not sure which threshold I should set.
Looking at the past 4 days, we had pretty nasty spikes in there:
https://grafana.torproject.org/d/f36842c2-af41-48c2-ab71-442307ba2f75/donate-neo-donations?orgId=1&from=now-4d&to=now&refresh=5s&viewPanel=28
... including one hour (2024-09-21 21:00 UTC) where we had 65 failed transactions, ouch!
@mattlav can you check on your end what those correspond to? were we billed for those?
In tpo/tpa/prometheus-alerts@459356c4, I set the threshold to 10 failed transactions in 10 minutes (basically one per minute), which works out to 60 per hour, or 1.20/hr at 0.02 per failed transaction...
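For the record, here is a rough sketch of that arithmetic, so it's easy to redo with a different threshold. The function names are just made up for illustration, the 0.02 fee is the figure quoted above (the currency isn't stated here, so it stays unitless), and the last part assumes the alert fires as soon as any 10-minute window reaches the threshold, which may not match how the actual rule is written.

```python
import math
from decimal import Decimal

# Per-failure fee quoted in this issue; currency is not stated, so left unitless.
FEE_PER_FAILURE = Decimal("0.02")


def failures_per_hour(count: int, window_minutes: int) -> Decimal:
    """Scale a 'count failures per window' rate to an hourly rate."""
    return Decimal(count) * 60 / Decimal(window_minutes)


def hourly_cost(count: int, window_minutes: int, fee: Decimal = FEE_PER_FAILURE) -> Decimal:
    """Cost per hour if failures keep arriving at the given rate."""
    return failures_per_hour(count, window_minutes) * fee


# Alert threshold: 10 failures in 10 minutes, sustained for an hour.
threshold_cost = hourly_cost(10, 10)

# Worst hour seen so far: 65 failures in 60 minutes.
spike_cost = hourly_cost(65, 60)

# Split that hour into six 10-minute windows: at least one of them
# must hold this many failures, however they were spread out.
min_busiest_window = math.ceil(65 / 6)

print(f"threshold rate: {threshold_cost}/hr")
print(f"65-failure hour: {spike_cost}")
print(f"busiest 10-minute window had at least {min_busiest_window} failures")
```

So the worst hour cost about 1.30 at that fee, and at least one 10-minute window in it had 11 or more failures, which is above the threshold of 10.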
So the alert would definitely have rung there, but now I wonder: why? Why did we get all those failures, and why did none of our mechanisms catch this?
The point of alerting is to detect anomalous conditions so we can deal with them, but we've barely launched and are already getting a lot of those, so perhaps we're doing something wrong, either in the metrics or in the various protections (rate limiting, captchas, CSRF, etc.) we've set up.
@stephen, what do you think?