document new metrics stuff introduced in tpo/web/donate-neo#75 and tpo/web/civicrm#78, authored by anarcat
@@ -722,8 +722,28 @@ developing the Django app after @kez had gone.
## Monitoring and metrics
<!-- describe how this service is monitored, how security issues and -->
<!-- upgrades are tracked, see also "Upgrades" above. -->
The donate site is monitored from [Prometheus](howto/prometheus), both
at the system level (normal metrics like disk, CPU, memory, etc.) and
at the application level.

A couple of alerts are set in Alertmanager, all at the "warning"
level, which notify on IRC if problems come up with the service. All
of them have runbooks that link to the [pager playbook](#pager-playbook)
section of this page.

Monitoring does not currently cover failed transactions correctly, see
[tpo/web/donate-neo#116](https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/116).

The [donate neo donations](https://grafana.torproject.org/d/f36842c2-af41-48c2-ab71-442307ba2f75/donate-neo-donations) dashboard is the main view of the
service in Grafana. It shows the state of the CiviCRM kill switch,
transaction rates, errors, the rate limiter, and exception counts. It
also includes an excerpt of system-level metrics from related servers,
to help correlate them with issues in the service.

There are also links, at the top right, to Django-specific dashboards
that can be used to diagnose performance issues.
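
When Grafana is not at hand, the values behind a dashboard panel can
also be read from the Prometheus HTTP API. The following is a minimal
sketch, not the service's actual tooling: the server URL and the
metric name `donate_civicrm_kill_switch` are placeholders invented for
illustration (take the real query from the panel's definition), and
authentication is left out. To use it, fetch the URL built by
`instant_query_url` (for example with `urllib.request.urlopen`) and
pass the response body to `parse_instant_vector`.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Hypothetical values: neither the server URL nor the metric name comes
# from this page. Replace them with the real ones before use.
PROMETHEUS_URL = "https://prometheus.example.org"
KILL_SWITCH_METRIC = "donate_civicrm_kill_switch"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of a Prometheus instant query (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> Dict[str, float]:
    """Map the `instance` label of each returned series to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected an instant vector, got {data['resultType']}")
    values = {}
    for series in data["result"]:
        # Each sample is a [timestamp, "value"] pair; the value is a string.
        _timestamp, raw_value = series["value"]
        values[series["metric"].get("instance", "")] = float(raw_value)
    return values
```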

See [tpo/web/donate-neo#75](https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/75) for follow-up on missing metrics.
## Tests