email deliverability monitoring
As part of the DKIM/SPF/etc plan (#40363 (closed)) and the %improve mail services OKR, it would be critical to have metrics that show whether or not mail is actually getting delivered to major providers, which is a key problem we're having right now with email delivery (e.g. #40484, #34134, #40149 (closed), https://gitlab.torproject.org/tpo/tpa/team/-/issues/40170).
There are a few parts to this:
- end-to-end deliverability tests
- feedback loops
- blocklist checks
Deliverability tests
A simple monitoring system we might want to implement is an end-to-end deliverability test, which would send email from point X (say lists.tpo, eugeni, the submission server, or CiviCRM) and check a mailbox on providers Y and Z (say Hotmail, Gmail, etc) to see if the email arrives.
To implement this, create accounts on:
- hotmail/live.com/microsoft
- yahoo.com (also covers Verizon now, bizarrely)
- gmail.com
Nowadays, this might require an actual phone number, so we could get one from a VoIP provider like voip.ms. There, the prices currently are:
- toll-free: 1.25$/mth and 0.027$/min
- Canada: 0.85$/mth and 0.009$/min
I mention toll-free numbers because those could eventually be useful if we want to provide support over the phone. This is something that @irl suggested because they use it to help people with censorship circumvention (or at least reporting): when the internet is down, no one can send email to tell you, but they can send text messages or voicemail sometimes... I also looked at toll-free numbers in Europe and Africa (Germany and Egypt, in particular), and both are somewhat expensive (15-25$/mth) so maybe not worth it for now.
In any case, at this stage the phone service would be strictly to register the accounts.
Once we have an account, we need to set up monitoring. This can take a few forms:
- Nagios/Icinga: check_email_delivery (in Debian nagios-plugins-contrib), check_email_loop
- Prometheus: a/i service-prober
- Manual: https://code.mayfirst.org/mfmt/filter-check
There might be more.
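To make the idea concrete, here is a minimal sketch of what such an end-to-end check could look like, using only the Python standard library. It is not how any of the tools listed above work; it only illustrates the "send a tagged message, then look for it over IMAP" loop. The function names, the subject format, the port, and the polling defaults are all assumptions made up for this example, and some providers may require app passwords or OAuth rather than a plain IMAP login.

```python
import imaplib
import smtplib
import time
import uuid
from email.message import EmailMessage


def build_probe(sender, recipient):
    """Build a probe message tagged with a unique token in the subject."""
    token = uuid.uuid4().hex
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # The subject format is arbitrary (an assumption of this sketch).
    msg["Subject"] = f"deliverability-probe {token}"
    msg.set_content("Automated deliverability probe, safe to delete.")
    return token, msg


def send_probe(msg, smtp_host, smtp_port=587, user=None, password=None):
    """Submit the probe through the mail server under test (point X)."""
    with smtplib.SMTP(smtp_host, smtp_port, timeout=30) as smtp:
        smtp.starttls()
        if user:
            smtp.login(user, password)
        smtp.send_message(msg)


def probe_arrived(token, imap_host, user, password, folder="INBOX"):
    """Return True if a message carrying the token is in the given folder."""
    with imaplib.IMAP4_SSL(imap_host) as imap:
        imap.login(user, password)
        imap.select(folder, readonly=True)
        status, data = imap.search(None, "SUBJECT", f'"{token}"')
        return status == "OK" and bool(data[0].split())


def wait_for_probe(check, timeout=600, interval=30,
                   sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A caller would build a probe, send it, then poll with something like `wait_for_probe(lambda: probe_arrived(token, host, user, password))`. Checking the provider's spam folder as well (by passing a different `folder`) would distinguish "delivered to spam" from "not delivered at all"; the folder name differs per provider and is not covered here.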
Feedback loops
The point of this is to have a place where we collect failure reports from various providers. Those can take many forms:
- DMARC reports
- Hetzner email feedback loops
- TLS failure reports
This is tricky. I have enabled DMARC on my personal domain and regularly receive DMARC delivery reports. They are not human-readable and, even with a parser like dmarc-cat, it's hard to figure out what is a legitimate misconfiguration on our end and what is an active spoofing attack from the outside. I suspect the reports coming out of torproject.org would be monstrous, so they would necessarily need to be somehow aggregated. Here are some aggregation tools:
- dmarcts-report-parser: can parse an IMAP mailbox, stores results in a database, has a web UI, Perl, in Debian but out of date, no upstream release
- lafayette: deployment unclear, documentation rather minimal
- parsedmarc: Python, IMAP inbox input, JSON/CSV output, integrates with ElasticSearch, Splunk, Kafka, Grafana (but not Prometheus)
- dmarc-metrics-exporter: Prometheus exporter, scrapes IMAP inbox, good-looking metrics, some Grafana dashboard included
- reports-collector: supports DMARC, but also (SMTP) TLS-RPT, HTTP (CSP, etc), digests reports by HTTP or SMTP, turns reports into JSON, they use Kibana to process it
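For a sense of what the aggregation involves, here is a rough sketch that sums message counts per source IP and DMARC outcome from a single aggregate report. It is not taken from any of the tools above. It assumes the report has already been extracted from its gzip or zip attachment, that it follows the RFC 7489 aggregate layout (`record`/`row`/`policy_evaluated`), and that it does not use an XML namespace. The function name and the sample report (with documentation-range IP addresses) are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample report, trimmed to the fields this sketch reads.
SAMPLE = """<feedback>
  <record>
    <row>
      <source_ip>192.0.2.1</source_ip>
      <count>12</count>
      <policy_evaluated><dkim>pass</dkim><spf>fail</spf></policy_evaluated>
    </row>
  </record>
  <record>
    <row>
      <source_ip>198.51.100.7</source_ip>
      <count>3</count>
      <policy_evaluated><dkim>fail</dkim><spf>fail</spf></policy_evaluated>
    </row>
  </record>
</feedback>"""


def summarize_dmarc_report(xml_text):
    """Sum message counts per (source IP, DMARC result) for one report."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for record in root.iter("record"):
        row = record.find("row")
        if row is None:
            continue  # skip malformed records
        count = int(row.findtext("count", default="0"))
        source_ip = row.findtext("source_ip", default="unknown")
        dkim = row.findtext("policy_evaluated/dkim", default="unknown")
        spf = row.findtext("policy_evaluated/spf", default="unknown")
        # DMARC passes when either the aligned DKIM or aligned SPF check passes.
        result = "pass" if "pass" in (dkim, spf) else "fail"
        totals[(source_ip, result)] += count
    return totals
```

Calling `summarize_dmarc_report(SAMPLE)` yields one counter entry per source IP and result. Telling our own misconfigured servers apart from spoofing would still require comparing the failing source IPs against the list of hosts we expect to send mail, which this sketch does not attempt.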
We already process the Hetzner feedback loops with a handle-abuse.py script which we run by hand. It doesn't cover mailing list complaints yet but it can deal with CiviCRM messages that are filed as spam instead of bounced.
Block list checks
Finally, we need to make sure we're not listed on major block lists. In Nagios, there's check_rbl, part of the monitoring-plugins-contrib Debian package.
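For reference, the underlying check is a plain DNS lookup: reverse the IP address, append the block list's zone, and see whether the name resolves. The sketch below illustrates that mechanism only and is not how check_rbl is implemented. The function names are made up, and any zone passed in (such as `dnsbl.example.org` in the note below) is a placeholder, not a real list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for `ip` in the block list `zone`."""
    addr = ipaddress.ip_address(ip)
    # reverse_pointer is e.g. "1.2.0.192.in-addr.arpa" (or nibbles plus
    # "ip6.arpa" for IPv6); drop the two trailing labels and keep the rest.
    reversed_part = addr.reverse_pointer.rsplit(".", 2)[0]
    return f"{reversed_part}.{zone}"


def is_listed(ip, zone):
    """Return True if the block list answers with an address for `ip`."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # Caveat: this also swallows resolver failures, so "not listed"
        # here can also mean "could not check".
        return False
    return True
```

For example, `dnsbl_query_name("192.0.2.1", "dnsbl.example.org")` gives `1.2.0.192.dnsbl.example.org`. RFC 5782 recommends that IPv4 lists keep a test entry for 127.0.0.2, which is a handy way to confirm that a given zone can actually be queried from our resolvers, since some lists refuse or rate-limit queries from shared public resolvers.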