CRM stands for "Customer Relationship Management" but we actually use it to manage contacts and donations. It is how we send our massive newsletter once in a while.
Tutorial
Basic access
The main website is at:
It is protected by HTTP authentication and the site's login as well, so you actually need two sets of passwords to get in.
To set up HTTP authentication for a new user, the following command must be executed on the CiviCRM server:
htdigest /etc/apache2/htdigest 'Tor CRM' <username>
Once HTTP authentication is in place, the Drupal/CiviCRM login page can be accessed at: https://crm.torproject.org/user/login
Howto
Monitoring mailings
The CiviCRM server can generate large mailings, in the order of hundreds of thousands of unique email addresses. Those can create significant load on the server if mishandled, and worse, trigger blocking at various providers if not correctly rate-limited.
For this, we have various knobs and tools:
- Grafana dashboard watching the two main mail servers
- Place to enable/disable mailing (grep for Send sched...)
- Where the batches are defined
- The CiviMail interface should show the latest mailings (when clicking twice on "STARTED"); from there, click the Report button to see how many mails have been sent, bounced, etc.
The Grafana dashboard is based on metrics from Prometheus, which can be inspected live with the following command:
curl -s localhost:3903/metrics | grep -v -e ^go_ -e '^#' -e '^mtail' -e ^process -e _tls_; postfix-queues-sizes
Using lnav can also be useful to monitor logs in real time, as it provides per-queue ID navigation, marks warnings (deferred messages) in yellow and errors (bounces) in red.
A few commands to inspect the email queue:
- List the queue, with more recent entries first:
postqueue -j | jq -C '.recipients[]' | tac
- Find how many emails are in the queue, per domain:
postqueue -j | jq -r '.recipients[].address' | sed 's/.*@//' | sort | uniq -c | sort -n
Note that the qshape deferred command gives a similar (and actually better) output.
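The per-domain count above can be wrapped in a small helper. This is only a sketch: the function name count_by_domain is hypothetical, and lowercasing the domains is an addition that the original one-liner does not do.

```shell
# Count email addresses per domain, least frequent first.
# Reads one address per line on stdin and prints "COUNT DOMAIN" lines.
# (count_by_domain is a hypothetical helper name, not an existing tool.)
count_by_domain() {
    sed 's/.*@//' |                  # keep only the part after the last @
        tr '[:upper:]' '[:lower:]' | # fold case so Example.org == example.org
        sort | uniq -c | sort -n |   # count and order by frequency
        awk '{print $1, $2}'         # strip the padding added by uniq -c
}
```

It would typically be fed from the queue listing, for example: postqueue -j | jq -r '.recipients[].address' | count_by_domain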
In case of a major problem, you can stop the mailing in CiviCRM and put all emails on hold with:
postsuper -h ALL
Then the postfix-trickle script can be used to slowly release emails:
postfix-trickle 10 5
When an email bounces, it should go to civicrm@crm.torproject.org, which is an IMAP mailbox periodically checked by CiviCRM. It will ingest bounces landing in that mailbox and disable them for the next mailings. It's also how users can unsubscribe from those mailings, so it is critical that this service runs correctly.
A lot of those notes come from the issue where we enabled CiviCRM to receive its bounces.
Handling abuse complaints
Our postmaster alias can receive emails like this:
Subject: Abuse Message [AbuseID:809C16:27]: AbuseFBL: UOL Abuse Report
Those emails usually contain enough information to figure out which email address filed a complaint. The action to take is to remove them from the mailing. Here's an example email sample:
Received: by crm-int-01.torproject.org (Postfix, from userid 33)
id 579C510392E; Thu, 4 Feb 2021 17:30:12 +0000 (UTC)
[...]
Message-Id: <20210204173012.579C510392E@crm-int-01.torproject.org>
[...]
List-Unsubscribe: <mailto:civicrm+u.2936.7009506.26d7b951968ebe4b@crm.torproject.org>
job_id: 2936
Precedence: bulk
[...]
X-CiviMail-Bounce: civicrm+b.2936.7009506.26d7b951968ebe4b@crm.torproject.org
[...]
Your bounce might have only some of those. Possible courses of action to find the victim's email:
- Grep for the queue ID (579C510392E) in the mail logs
- Grep for the Message-Id (20210204173012.579C510392E@crm-int-01.torproject.org) in mail logs (with postfix-trace)
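The identifiers to grep for can be pulled out of the headers mechanically. The following is a sketch based only on the sample above: both function names are hypothetical, the Message-Id layout (timestamp, dot, queue ID) is taken from that one sample, and reading the second number of the bounce address as an event queue ID is an assumption (only the job ID is confirmed by the job_id header).

```shell
# Extract the Postfix queue ID from a Message-Id value such as
# <20210204173012.579C510392E@crm-int-01.torproject.org>
# (hypothetical helper; assumes the "timestamp.queueid@host" layout)
queue_id_from_message_id() {
    printf '%s\n' "$1" |
        sed -n 's/^<\{0,1\}[0-9]*\.\([0-9A-Za-z]*\)@.*$/\1/p'
}

# Split a CiviMail address such as
# civicrm+b.2936.7009506.26d7b951968ebe4b@crm.torproject.org
# into its dot-separated fields. The event_queue_id label is an assumption.
parse_civimail_address() {
    printf '%s\n' "$1" |
        sed -n 's/^[^+]*[+]\([a-z]*\)\.\([0-9]*\)\.\([0-9]*\)\.\([0-9a-f]*\)@.*$/action=\1 job_id=\2 event_queue_id=\3 hash=\4/p'
}
```

Both functions print nothing when the input does not match the expected layout, so an empty result means the header should be inspected by hand.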
Once you have the email address:
- Head for the CiviCRM search interface to find that user
- Remove them from the "Tor News" group, in the Group tab
Another option is to go in Donor record > Edit communication preferences > check do not email.
Alternatively, you can just send an email to the List-Unsubscribe address or click the "unsubscribe" links at the bottom of the email.
The handle-abuse.py script in fabric-tasks.git automatically handles the CiviCRM bounces that way. Support for other bounces should be added there as we can.
Special cases should be reported to the CiviCRM admin by forwarding the email to the Giving queue in RT.
Sometimes complaints come in about Mailman lists. Those are harder to handle because they do not have individual bounce addresses...
Granting access to the CiviCRM backend
The main CiviCRM is protected by Apache-based authentication, accessible only by TPA. To add a user, on the backend server (currently crm-int-01):
htdigest /etc/apache2/htdigest 'Tor CRM' $USERNAME
Granting a new admin access
When onboarding a new TPA member, the new member can create their own admin user as soon as they have root access to the server. To achieve this they can use the following commands:
sudo -i -u torcivicrm
cd /srv/crm.torproject.org/htdocs-prod && drush uli toradmin
Once logged in, a personal account should be created with administrator privileges to facilitate future logins.
Notes:
- The URL produced by drush needs to be manually modified for it to lead to the right place: https should be used instead of http, and the hostname needs to be changed from default to crm.torproject.org
- drush uli without a user will produce URLs that give out an Access Denied error since the user with uid 1 is disabled.
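The URL rewrite described in the first note can be done with a one-line filter. This is a sketch: fix_uli_url is a hypothetical helper name, and it assumes the generated URL starts with http://default/ as described above.

```shell
# Rewrite a one-time login URL printed by "drush uli" so that it points
# at the production site over HTTPS. Reads the URL on stdin.
# (fix_uli_url is a hypothetical helper, not part of drush.)
fix_uli_url() {
    sed 's|^http://default/|https://crm.torproject.org/|'
}
```

Usage would look like: drush uli toradmin | fix_uli_url. Drush also has a --uri option that might make the rewrite unnecessary, but that has not been verified on this setup.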
Rotating API tokens
See the donate site docs for this.
Pager playbook
Security breach
If there's a major security breach on the service, the first thing to do is probably to shut down the CiviCRM server completely. Halt the crm-int-01 and donate-01 machines completely, and remove access to the underlying storage from the attacker.
Then API keys and secrets should probably be rotated; follow the Rotating API tokens procedure.
Job failures
If you get an alert about a "CiviCRM job failure", for example:
The CiviCRM job send_scheduled_mailings on crm-int-01.torproject.org
has been marked as failed for more than 4h. This could be that
it has not run fast enough, or that it failed.
... it means a CiviCRM job (in this case send_scheduled_mailings) has either failed or has not run in its configured time frame. (Note that we currently can't distinguish those states, but hopefully will have metrics to do so soon.)
The "scheduled job failures" section will also show more information about the error:
To debug this, first find the "Scheduled Job Logs":
- Go to Administer > System Settings > Scheduled Jobs
- Find the affected job (above send_scheduled_mailings)
- Click "view log"
Here's a screenshot of such a log:
This will show the error that triggered the alert:
- If it's an exception, it should be investigated in the source code.
- If the job just hasn't run in a timely manner, the systemd timer should be investigated with systemctl status civicron@prod.timer
There's also the global CiviCRM on-disk log. It's not perfect, because on this server there are sometimes 2 different logs. It can also be rather noisy, with deprecation alerts, civirules chatter, etc.
Those are also available in "Administer > Administration Console > View Log" in the web interface and stored on disk, in:
ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log
Note that it's also possible to run the jobs by hand, but we don't have specific examples on how to do this for all jobs. See the Resque process job, below, for a more specific example.
Kill switch enabled
If the Resque Processor Job gets stuck because it failed to process an item, it will stop processing completely (assuming it's a bug, or something is wrong). It raises a "kill switch" that will show up as a red "Resque Off" message in Administer > Administration Console > System Status. Here's a screenshot of an enabled kill switch:
Note that this is a special case of the more general job failure above. It's documented explicitly and separately here because it's such an important part that it warrants its own documentation.
The "scheduled job failures" section will also show more information about the error:
To debug this, first find the "Scheduled Job Logs":
- Go to Administer > System Settings > Scheduled Jobs
- Find "TorCRM Resque Processing"
- Click "view log"
Here's a screenshot of such a log:
This will show the error (typically a PHP exception) that triggered the kill switch. This should be investigated in the source code.
There's also the global CiviCRM on-disk log. It's not perfect, because on this server there are sometimes 2 different logs (it's in my pipeline to debug that). It can also be rather noisy, with deprecation alerts, civirules chatter, etc.
Those are also available in "Administer > Administration Console > View Log" in the web interface and stored on disk, in:
ls -altr /srv/crm.torproject.org/htdocs-prod/sites/default/files/civicrm/ConfigAndLog/CiviCRM.1.*.log
The items in the queue can be seen by searching for "TorCRM - Resque" in the above status page, or with the Redis command LRANGE "resque:queue:prod_web_donations" 0 -1 in the redis-cli shell.
The job can be run from the command-line manually with:
sudo -i -u torcivicrm
cd /srv/crm.torproject.org/htdocs-prod/
cv api setting.create torcrm_resque_off=0
cv api Job.Torcrm_Resque_Process
You can also get a backtrace with:
cv api Job.Torcrm_Resque_Process -vvv
Once the problem is fixed, the kill switch can be reset by going to "CiviCRM > Administer > Tor CRM Settings" in the web interface. Note that there's somewhat of a double-negative in the kill switch configuration. The form is:
Resque Off Switch [0]
Set to 0 to disable the off/kill switch. This gets set to 1 by the "Resque" Scheduled Job when an error is detected. When that happens, check the CiviCRM "ConfigAndLog" logs, or under Administer > Console > View Log
The "Resque Off Switch" is the kill switch. When it's set to zero ("0", as above), it's disabled, which means normal operation and the queue is processed. It's set to "1" when an error is raised, and should be set back to "0" when the issue is fixed.
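Because of the double negative, it can help to spell out what each value means. The helper below is purely illustrative (describe_resque_off_switch is a hypothetical name) and only encodes the two values described above.

```shell
# Translate the "Resque Off Switch" setting value into plain language.
# 0 and 1 are the only two values documented; anything else is rejected.
# (describe_resque_off_switch is a hypothetical helper name.)
describe_resque_off_switch() {
    case "$1" in
        0) echo "switch off: queue is processed (normal operation)" ;;
        1) echo "switch raised: queue processing is stopped" ;;
        *) echo "unexpected value: $1" >&2; return 1 ;;
    esac
}
```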
See tpo/web/civicrm#144 for an example of such a kill switch debugging session.
Disaster recovery
If Redis dies, we might lose in-process donations. But otherwise, it is disposable and data should be recreated as needed.
If the entire database gets destroyed, it needs to be restored from backups, by TPA.
Reference
Installation
Full documentation on the installation of this system is somewhat out of scope for TPA: sysadmins only installed the servers and set up basic services like a VPN (using IPsec) and an Apache, PHP, MySQL stack.
The Puppet class used on the CiviCRM server is role::civicrm_int. That naming convention reflects the fact that, before donate-neo, there used to be another role named roles::civicrm_ext for the frontend, retired in tpo/tpa/team#41511.
Upgrades
As stated above, a new donation campaign involves changes to both the donate-neo site (donate.tpo) and the CiviCRM server.
Changes to the CiviCRM server and donation middleware can be deployed progressively through the test/staging/production sites, which all have their own databases. See the donate-neo docs for deployments of the frontend.
TODO: clarify the deployment workflow. They seem to have one branch per environment, but what does that include? Does it matter for us?
There's a drush script that edits the dev/stage databases to replace PII in general, and in particular change the email of everyone to dummy aliases so that emails sent by accident wouldn't end up in real people's mail boxes.
Upgrades are typically handled by the CiviCRM consultant.
See also the CiviCRM upgrade guide.
SLA
This service is critical, as it is used to host donations, and should be as highly available as possible. Unfortunately, its design has multiple single points of failure, which, in practice, makes this target difficult to fulfill at this point.
Design and architecture
CiviCRM is a relatively "classic" PHP application: it's made of a collection of .php files scattered cleverly around various directories. There's one catch: it's actually built as a drop-in module for other CMSes. Traditionally, Joomla, Wordpress and Drupal are supported, and our deployment uses Drupal.
(There's actually a standalone version in development we are interested in as well, as we do not need the features from the Drupal site.)
Most code lives in a torcrm module that processes Redis messages through CiviCRM jobs.
CiviCRM is isolated from the public internet through HTTP authentication. Communication with the donation frontend happens through a Redis queue. See also the donation site architecture for more background.
Services
The CiviCRM service runs on the crm-int-01 server, with the following layers:
- Apache: TLS decapsulation, HTTP authentication and reverse proxy
- PHP FPM: PHP runtime which Apache connects to over FastCGI
- Drupal: PHP entry point, loads CiviCRM code as a module
- CiviCRM: core of the business logic
- MariaDB (MySQL) database (Drupal and CiviCRM storage backend)
- Redis server: communication between CiviCRM and the donate frontend
- Dovecot: IMAP server to handle bounces
Apache answers to the following virtual hosts:
- crm.torproject.org: production CiviCRM site
- staging.crm.torproject.org: staging site
- test.crm.torproject.org: testing site
The monthly newsletter is configured on CiviCRM and archived on the https://newsletter.torproject.org static site.
Storage
CiviCRM stores most of its data in a MySQL database. There are separate databases for the dev/staging/prod sites.
TODO: does CiviCRM also write to disk?
Queues
CiviCRM can hold a large queue of emails to send when a new newsletter is generated. This, in turn, can result in large Postfix email queues when CiviCRM releases those mails into the email system.
The donate-neo frontend uses Redis to queue up transactions for CiviCRM. See the queue documentation in donate-neo. Queued jobs are de-queued by CiviCRM's Resque Scheduled Job, and crons, logs, monitoring, etc, all use standard CiviCRM tooling.
See also the kill switch enabled playbook.
Interfaces
Most operations with CiviCRM happen over a web interface, in a web browser. There is a CiviCRM API but it's rarely used by Tor's operators.
The torcivicrm user has a command-line CiviCRM tool called cv in its $PATH which talks to that API to perform various functions.
Drupal also has its own shell tool called drush.
Authentication
The crm-int-01 server doesn't talk to the outside internet and can be accessed only via HTTP Digest authentication. We are considering changing this to basic auth.
Users that need to access the CRM must be added to the Apache htdigest file on crm-int-01.tpo and have a CiviCRM account created for them.
To extract a list of CiviCRM accounts and their roles, the following drush command may be executed at the root of the Drupal installation:
drush uinf $(drush sqlq "SELECT GROUP_CONCAT(uid) FROM users")
The SSH server is firewalled (rules defined in Puppet, profile::civicrm). To get access to the port, ask TPA.
Implementation
CiviCRM is a PHP application licensed under the AGPLv3, supporting PHP 8.1 and later at the time of writing. We are currently running CiviCRM 5.73.4, released on May 30th, 2024 (as of 2024-08-28); the current version can be found in /srv/crm.torproject.org/htdocs-prod/sites/all/modules/civicrm/release-notes.md on the production server (crm-int-01). See also the upstream release announcements, the GitHub tags page and the release management policy.
Upstream also has their own GitLab instance.
CiviCRM has a torcrm extension under sites/all/civicrm_extensions/torcrm which includes most of the CiviCRM customization, including the Resque Processor job. It replaces the old tor_donate Drupal module, which is being phased out.
Related services
CiviCRM only holds donor information; actual transactions are processed by the donation site, donate-neo.
Issues
Since there are many components, here's a table outlining the known projects and issue trackers for the different sites.
Site | Project | Issues |
---|---|---|
https://crm.torproject.org | project | issues |
https://donate.torproject.org | project | issues |
https://newsletter.torproject.org | project | issues |
Server-level issues should be filed in the TPA team issue tracker.
Upstream CiviCRM has their own StackExchange site and uses GitLab issue queues.
Maintainer
CiviCRM, the PHP application and the Javascript component on donate-static are all maintained by the external CiviCRM contractors.
Users
Direct users of this service are mostly the fundraising team.
Upstream
Upstream is a healthy community of free software developers producing regular releases. Our consultant is part of the core team.
Monitoring and metrics
Like other TPA servers, the CRM servers are monitored by Prometheus. The Redis server (and the related IPsec tunnel) is particularly monitored, using a blackbox check, to make sure both ends can talk to each other.
There are also graphs rendered by Grafana. This includes an elaborate Postfix dashboard watching the two mail servers.
We started working on monitoring the CiviCRM health better. So far we collect metrics that look like this:
# HELP civicrm_jobs_timestamp_seconds Timestamp of the last CiviCRM jobs run
# TYPE civicrm_jobs_timestamp_seconds gauge
civicrm_jobs_timestamp_seconds{jobname="civicrm_update_check"} 1726143300
civicrm_jobs_timestamp_seconds{jobname="send_scheduled_mailings"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="fetch_bounces"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_inbound_emails"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="clean_up_temporary_data_and_files"} 1725821100
civicrm_jobs_timestamp_seconds{jobname="rebuild_smart_group_cache"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="process_delayed_civirule_actions"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="civirules_cron"} 1726203600
civicrm_jobs_timestamp_seconds{jobname="delete_unscheduled_mailings"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="call_sumfields_gendata_api"} 1726201800
civicrm_jobs_timestamp_seconds{jobname="update_smart_group_snapshots"} 1726166700
civicrm_jobs_timestamp_seconds{jobname="torcrm_resque_processing"} 1726203600
# HELP civicrm_jobs_status_up CiviCRM Scheduled Job status
# TYPE civicrm_jobs_status_up gauge
civicrm_jobs_status_up{jobname="civicrm_update_check"} 1
civicrm_jobs_status_up{jobname="send_scheduled_mailings"} 1
civicrm_jobs_status_up{jobname="fetch_bounces"} 1
civicrm_jobs_status_up{jobname="process_inbound_emails"} 1
civicrm_jobs_status_up{jobname="clean_up_temporary_data_and_files"} 1
civicrm_jobs_status_up{jobname="rebuild_smart_group_cache"} 1
civicrm_jobs_status_up{jobname="process_delayed_civirule_actions"} 1
civicrm_jobs_status_up{jobname="civirules_cron"} 1
civicrm_jobs_status_up{jobname="delete_unscheduled_mailings"} 1
civicrm_jobs_status_up{jobname="call_sumfields_gendata_api"} 1
civicrm_jobs_status_up{jobname="update_smart_group_snapshots"} 1
civicrm_jobs_status_up{jobname="torcrm_resque_processing"} 1
# HELP civicrm_torcrm_resque_processor_status_up Resque processor status
# TYPE civicrm_torcrm_resque_processor_status_up gauge
civicrm_torcrm_resque_processor_status_up 1
Those show the last timestamp of various jobs, the status of those jobs (1 means OK), and whether the "kill switch" has been raised (1 means OK, that is: not raised).
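As an illustration of how the timestamp metrics could be used, the sketch below lists jobs whose last run is older than a given age. The function name stale_jobs is hypothetical, and a single threshold is a simplification, since some jobs legitimately run less often than others.

```shell
# List jobs whose last run is older than MAX_AGE seconds.
# Usage: stale_jobs NOW MAX_AGE < metrics.txt
# Reads Prometheus text-format metrics on stdin and prints
# "JOBNAME AGE_IN_SECONDS" for each stale job.
# (stale_jobs is a hypothetical helper, not an existing tool.)
stale_jobs() {
    awk -v now="$1" -v max_age="$2" '
        /^civicrm_jobs_timestamp_seconds\{/ {
            name = $1
            sub(/^.*jobname="/, "", name)   # drop everything before the job name
            sub(/".*$/, "", name)           # drop the closing quote and brace
            age = now - $2                  # $2 is the last-run Unix timestamp
            if (age > max_age) print name, age
        }'
}
```

For example, stale_jobs "$(date +%s)" 14400 would use the same four-hour window as the alert quoted in the pager playbook, reading the metrics from wherever they are exported.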
Authentication to the CiviCRM server was particularly problematic: there's an open issue to convert the HTTP-layer authentication system to basic authentication (tpo/web/civicrm#147).
We're hoping to get more metrics from CiviCRM, like detailed status of job failures, mailing run times and other statistics, see tpo/web/civicrm#148. Other options were discussed in this comment as well.
Only the last metric above is hooked up to alerting for now, see tpo/web/donate-neo#75 (closed) for a deeper discussion.
Note that the donate front-end also exports its own metrics, see the Donate Monitoring and metrics documentation for details.
Tests
TODO: what to test on major CiviCRM upgrades, specifically in CiviCRM?
There's a test procedure in donate.torproject.org that should likely be followed when there are significant changes performed on CiviCRM.
Logs
The CRM side (crm-int-01.torproject.org) has a similar configuration and sends production environment errors via email.
The logging configuration is in: crm-int-01:/srv/crm.torproject.org/htdocs-prod/sites/all/modules/custom/tor_donation/src/Donation/ErrorHandler.php.
Resque processor logs are in the CiviCRM Scheduled Jobs logs under Administer > System Settings > Scheduled Jobs, then find the "Torcrm Resque Processing" job, then view the logs. There may also be fatal errors logged in the general CiviCRM log, under Administer > Admin Console > View Log.
Backups
Backups are done with the regular backup procedures except for the MariaDB/MySQL databases, which are backed up in /var/backups/local/mysql/. See also the MySQL section in the backup documentation.
Other documentation
Upstream has a documentation portal where our users will find:
Discussion
This section is reserved for future large changes proposed to this infrastructure. It can also be used to perform an audit on the current implementation.
Overview
CiviCRM's deployment has simplified a bit since the launch of the new donate-neo frontend. We inherited a few of the complexities of the original design, in particular the fragility of the coupling between frontend and backend through the Redis / IPsec tunnel.
We also inherited the "two single points of failure" design from the original implementation, and actually made that worse by removing the static frontend.
The upside is that software has been updated to use more upstream, shared code, in the form of Django. We plan on using renovate to keep dependencies up to date. Our deployment workflow has improved significantly as well, by hooking up the project with containers and GitLab CI, although CiviCRM itself has failed to benefit from those changes unfortunately.
Next steps include improvements to monitoring and perhaps having proper dev/stage/prod environments, with a fully separate virtual server for production.
Original "donate-paleo" review
The CiviCRM deployment is complex and feels a bit brittle. The separation between the CiviCRM backend and the middleware API evolved from an initial strict, two-server setup, into the current three-part setup after the static site frontend was added around 2020. The original two-server separation was performed out of a concern for security. We were worried about exposing CiviCRM to the public, because we felt the attack surface of both Drupal and CiviCRM was too wide to be reasonably defended against a determined attacker.
The downside is, obviously, a lot of complexity, which also makes the service more fragile. The Redis monitoring, for example, was added after we discovered the ipsec tunnel would sometimes fail, which would completely break donations.
Obviously, if either the donation middleware or CiviCRM fails, donations go down as well, so we actually have two single points of failure in that design.
A security review should probably be performed to make sure React, Drupal, its modules, CiviCRM, and other dependencies are all up to date. Other components like Apache, Redis, or MariaDB are managed through Debian packages and supported by the Debian security team, so they should be fairly up to date in terms of security issues.
Note that this section refers to the old architecture, based on a custom middleware now called "donate-paleo".