document new metrics stuff introduced in tpo/web/donate-neo#75 and tpo/web/civicrm#78 authored by anarcat's avatar anarcat
......@@ -433,7 +433,58 @@ There's also [Prometheus](howto/prometheus) monitoring with graphs rendered by
[Grafana](howto/grafana). This includes an elaborate [Postfix dashboard](https://grafana.torproject.org/d/Ds5BxBYGk/postfix-mtail?orgId=1&from=now-24h&to=now&var-node=eugeni.torproject.org&var-node=crm-int-01.torproject.org)
watching to two mailservers.
We have a task open to [monitor the CiviCRM health better](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/78).
We started working on [monitoring the CiviCRM health better](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/78). So
far we collect metrics that look like this:
```
# HELP civicrm_jobs_timestamp_seconds Timestamp of the last CiviCRM jobs run
# TYPE civicrm_jobs_timestamp_seconds gauge
civicrm_jobs_timestamp_seconds{jobname="civicrm_update_check"} 1726143300
civicrm_jobs_timestamp_seconds{jobname="send_scheduled_mailings"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="fetch_bounces"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_inbound_emails"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="clean_up_temporary_data_and_files"} 1725821100
civicrm_jobs_timestamp_seconds{jobname="rebuild_smart_group_cache"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_delayed_civirule_actions"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="civirules_cron"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="delete_unscheduled_mailings"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="call_sumfields_gendata_api"} 1726201800
civicrm_jobs_timestamp_seconds{jobname="update_smart_group_snapshots"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="torcrm_resque_processing"} 1726203600
# HELP civicrm_jobs_status_up CiviCRM Scheduled Job status
# TYPE civicrm_jobs_status_up gauge
civicrm_jobs_status_up{jobname="civicrm_update_check"} 1
civicrm_jobs_status_up{jobname="send_scheduled_mailings"} 1
civicrm_jobs_status_up{jobname="fetch_bounces"} 1
civicrm_jobs_status_up{jobname="process_inbound_emails"} 1
civicrm_jobs_status_up{jobname="clean_up_temporary_data_and_files"} 1
civicrm_jobs_status_up{jobname="rebuild_smart_group_cache"} 1
civicrm_jobs_status_up{jobname="process_delayed_civirule_actions"} 1
civicrm_jobs_status_up{jobname="civirules_cron"} 1
civicrm_jobs_status_up{jobname="delete_unscheduled_mailings"} 1
civicrm_jobs_status_up{jobname="call_sumfields_gendata_api"} 1
civicrm_jobs_status_up{jobname="update_smart_group_snapshots"} 1
civicrm_jobs_status_up{jobname="torcrm_resque_processing"} 1
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up 1
```
Those show the last timestamp of various jobs, the status of those
jobs (`1` means ok), and whether the "kill switch" has been raised
(`1` means ok, that is: not raised).
Authentication to the CiviCRM server was particularly problematic:
there's an open issue to convert the HTTP-layer authentication system
to basic auth ([tpo/web/civicrm#147](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/147)).
We're hoping to get more metrics from CiviCRM, like detailed status of
job failures, mailing run times and other statistics, see
[tpo/web/civicrm#148](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/148). Other options were discussed in [this
comment](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/78#note_3076161) as well.
Only the last metric above is hooked up to alerting for now, see
[tpo/web/donate-neo#75](https://gitlab.torproject.org/tpo/web/donate-neo/-/issues/75) for a deeper discussion.
## Tests
......
......