Changes

This is so we have a runbook to link to in a new alert about this.
anarcat · c4076d60
--- a/service/crm.md
+++ b/service/crm.md
@@ -156,6 +156,54 @@ to the underlying storage from the attacker.
 Then API keys secrets should probably be rotated, follow the [Rotating
 API tokens procedure](#rotating-api-tokens).

+### Jobs not running
+
+If you get an alert about a "CiviCRM job failure", for example:
+
+        The CiviCRM job send_scheduled_mailings on crm-int-01.torproject.org
+        has been marked as failed for more than 4h. This could be that
+        it has not run fast enough, or that it failed.
+
+... it means a CiviCRM job (in this case `send_scheduled_mailings`)
+has either failed or has not run in its configured time frame. (Note
+that we currently can't distinguish those states, but hopefully [will
+have metrics to do so soon](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/148).)
+
+The "scheduled job failures" section will also show more information
+about the error:
+
+![](crm/torcrm-sample-sched-failure.png)
+
+To debug this, first find the "Scheduled Job Logs":
+
+ 1. Go to Administer > System Settings > Scheduled Jobs
+ 2. Find the affected job (`send_scheduled_mailings` in the example above)
+ 3. Click "view log"
+
+Here's a screenshot of such a log:
+
+![](crm/torcrm-job-log-example.png)
+
+This will show the error that triggered the alert:
+
+ - If it's an exception, it should be investigated in the source code.
+
+ - If the job just hasn't run in a timely manner, the systemd timer
+   should be investigated with `systemctl status civicron@prod.timer`.
+
+There's also the global CiviCRM on-disk log. It's not perfect, because
+on this server there are sometimes 2 different logs. It can also be
+rather noisy, with deprecation alerts, civirules chatter, etc.
+
+Those logs are available in "Administer > Administration Console >
+View Log" in the web interface, and are stored on disk in:
+
+    ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log
+
+Note that it's also possible to run the jobs by hand, but we don't
+have specific examples of how to do this for all jobs. See the
+Resque Processor Job, below, for a more specific example.
+
 ### Kill switch enabled

 If the [Resque Processor Job](#queues) gets stuck because it failed to
@@ -167,6 +215,10 @@ switch:

 ![](crm/torcrm-sample-kill-switch.png)

+Note that this is a special case of the more general job failure
+above. It's documented separately here because it's important
+enough to warrant its own section.
+
 The "scheduled job failures" section will also show more information
 about the error: