document more generic job failures in CiviCRM
authored Sep 18, 2024 by anarcat
This is so we have a runbook to link to in a new alert about this.
service/crm.md
View page @
c4076d60
...
...
@@ -156,6 +156,54 @@ to the underlying storage from the attacker.
Then API keys and secrets should probably be rotated: follow the
[Rotating API tokens procedure](#rotating-api-tokens).
### Jobs not running
If you get an alert about a "CiviCRM job failure", for example:
    The CiviCRM job send_scheduled_mailings on crm-int-01.torproject.org
    has been marked as failed for more than 4h. This could be that
    it has not run fast enough, or that it failed.
... it means a CiviCRM job (in this case `send_scheduled_mailings`)
has either failed or has not run in its configured time frame. (Note
that we currently can't distinguish those states, but hopefully
[will have metrics to do so soon](https://gitlab.torproject.org/tpo/web/civicrm/-/issues/148).)
The "scheduled job failures" section will also show more information
about the error:

To debug this, first find the "Scheduled Job Logs":
1. Go to Administer > System Settings > Scheduled Jobs
2. Find the affected job (`send_scheduled_mailings` in the example above)
3. Click "view log"
Here's a screenshot of such a log:

This will show the error that triggered the alert:
- If it's an exception, it should be investigated in the source code.
- If the job just hasn't run in a timely manner, the systemd timer
  should be investigated with `systemctl status civicron@prod.timer`.
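As a rough sketch of that timer investigation, the helper below groups a few standard `systemctl` and `journalctl` calls. The function name `civicron_timer_report` is made up for illustration, and it assumes the timer triggers a service of the same name (`civicron@prod.service`), which is the systemd default but has not been checked against this server's unit files.

```shell
# Sketch only: summarize the state of the timer driving the CiviCRM jobs.
# Assumption: civicron@<instance>.timer activates civicron@<instance>.service
# (the systemd default when no Unit= is set in the timer).
civicron_timer_report() {
    instance="${1:-prod}"
    # Is the timer loaded and active?
    systemctl status "civicron@${instance}.timer" --no-pager
    # When did it last fire, and when will it fire next?
    systemctl list-timers --all "civicron@${instance}.timer" --no-pager
    # What did the triggered service log recently? 4h matches the alert window.
    journalctl --unit "civicron@${instance}.service" --since "4 hours ago" --no-pager
}
```

Running `civicron_timer_report prod` on the CRM server should show whether the timer is active, when it last elapsed, and any recent output from the service it starts.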
There's also the global CiviCRM on-disk log. It's not perfect, because
on this server there are sometimes 2 different logs. It can also be
rather noisy, with deprecation alerts, civirules chatter, etc.
Those logs are also available in "Administer > Administration Console >
View Log" in the web interface and are stored on disk, in:

    ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log
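Since that log can be noisy, a small filter can help narrow things down. The following is a minimal sketch: the function name `civicrm_log_errors` is hypothetical, and the `error|exception|fatal` pattern is only a guess at what interesting lines look like, not a documented CiviCRM log format.

```shell
# Sketch only: print error-looking lines from the newest CiviCRM log file.
# Assumption: the grep patterns below are guesses and may need tuning.
civicrm_log_errors() {
    logdir="${1:-/srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog}"
    # There can be more than one log on this server: pick the newest one.
    newest=$(ls -t "$logdir"/CiviCRM.1.*.log 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no CiviCRM log found in $logdir" >&2
        return 1
    fi
    echo "==> $newest"
    # Keep likely errors, drop deprecation noise, show the last 20 matches.
    grep -iE 'error|exception|fatal' "$newest" | grep -vi 'deprecat' | tail -n 20
}
```

Calling `civicrm_log_errors` without arguments uses the production path above; passing another directory as the first argument inspects that one instead.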
Note that it's also possible to run the jobs by hand, but we don't
have specific examples of how to do this for all jobs. See the
Resque Processor Job, below, for a more specific example.
### Kill switch enabled
If the [Resque Processor Job](#queues) gets stuck because it failed to
...
...
@@ -167,6 +215,10 @@ switch:

Note that this is a special case of the more general job failure
above. It's documented explicitly and separately here because it's
such an important part that it warrants its own documentation.
The "scheduled job failures" section will also show more information
about the error:
...
...
...
...