Verified commit c4076d60, authored by anarcat

document more generic job failures in CiviCRM

This is so we have a runbook to link to in a new alert about this.
parent 342971af
+52 −0
@@ -156,6 +156,54 @@ to the underlying storage from the attacker.
Then API keys and secrets should probably be rotated; follow the
[Rotating API tokens procedure](#rotating-api-tokens).

### Jobs not running

If you get an alert about a "CiviCRM job failure", for example:

        The CiviCRM job send_scheduled_mailings on crm-int-01.torproject.org
        has been marked as failed for more than 4h. This could be that
        it has not run fast enough, or that it failed.

... it means a CiviCRM job (in this case `send_scheduled_mailings`)
has either failed or has not run in its configured time frame. (Note
that we currently can't distinguish those states, but hopefully [will
have metrics to do so soon](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/148).)

The "scheduled job failures" section will also show more information
about the error:

![](crm/torcrm-sample-sched-failure.png)

To debug this, first find the "Scheduled Job Logs":

 1. Go to Administer > System Settings > Scheduled Jobs
 2. Find the affected job (`send_scheduled_mailings` in the example above)
 3. Click "view log"

Here's a screenshot of such a log:

![](crm/torcrm-job-log-example.png)

This will show the error that triggered the alert:

 - If it's an exception, it should be investigated in the source code.

 - If the job just hasn't run in a timely manner, the systemd timer
   should be investigated with `systemctl status civicron@prod.timer`.
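
As a rough sketch of what that timer investigation could look like,
the helper below groups three standard systemd queries. The function
name is hypothetical, and the `civicron@prod.service` unit name is an
assumption based on systemd's default of a timer activating the
same-named service; check the actual unit files before relying on it.

```shell
# Hypothetical helper, not part of the deployed tooling.
# Assumption: the timer civicron@<instance>.timer activates a unit
# named civicron@<instance>.service (the systemd default pairing).
civicron_timer_report() {
    instance="${1:-prod}"
    # When did the timer last fire, and when is it due next?
    systemctl list-timers --all "civicron@${instance}.timer"
    # Is the timer itself loaded and active?
    systemctl status "civicron@${instance}.timer"
    # What did the job print during its recent runs?
    journalctl --unit "civicron@${instance}.service" --since "1 day ago" --no-pager
}
```

Calling `civicron_timer_report prod` on the CiviCRM server would then
print the timer schedule, its status, and the last day of job output.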

There's also the global CiviCRM on-disk log. It's not perfect, because
on this server there are sometimes 2 different logs. It can also be
rather noisy, with deprecation alerts, civirules chatter, etc.

Those logs are available under "Administer > Administration Console >
View Log" in the web interface, and are also stored on disk, in:

    ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log
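
Since those logs are noisy, it may help to look only at the newest
file and filter it. The following is a minimal sketch under a few
assumptions: both function names are hypothetical, and the
`error|exception` pattern is a generic guess rather than a documented
CiviCRM log format. The directory is the one from the `ls` command
above.

```shell
# Directory taken from the ls command above; override for testing.
CIVICRM_LOG_DIR="${CIVICRM_LOG_DIR:-/srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog}"

# Print the most recently modified CiviCRM log file (hypothetical helper).
latest_civicrm_log() {
    # ls -t sorts newest first; head keeps only the first entry.
    ls -t "$CIVICRM_LOG_DIR"/CiviCRM.1.*.log 2>/dev/null | head -n 1
}

# Show the last matching lines from the newest log (hypothetical helper).
# Assumption: interesting entries mention "error" or "exception".
recent_civicrm_errors() {
    log="$(latest_civicrm_log)"
    [ -n "$log" ] || { echo "no log found in $CIVICRM_LOG_DIR" >&2; return 1; }
    grep -i -E 'error|exception' "$log" | tail -n "${1:-20}"
}
```

For example, `recent_civicrm_errors 50` would print up to the last 50
matching lines from the newest log file.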

Note that it's also possible to run the jobs by hand, but we don't
have specific examples of how to do this for all jobs. See the
Resque Processor Job, below, for a more specific example.

### Kill switch enabled

If the [Resque Processor Job](#queues) gets stuck because it failed to
@@ -167,6 +215,10 @@ switch:

![](crm/torcrm-sample-kill-switch.png)

Note that this is a special case of the more general job failure
above. It's documented explicitly and separately here because it's
such an important part that it warrants its own documentation.

The "scheduled job failures" section will also show more information
about the error: