now that certificate transparency (CT) is a thing, we should probably start monitoring those logs. specifically, because major web browsers now require CAs to publish their certs in CT, there's a good chance that a hostile actor who manages to get a certificate issued for us would show up on that radar.
(it wouldn't keep a rogue CA from generating a fake cert though, as they could, in theory, disobey those policies at some point.)
The goal here is to have an alert in monitoring (typically nagios/icinga right now) when a rogue certificate gets issued so we can act quickly on it. What to do next is out of scope, but should also be considered eventually.
This is actually a surprisingly hard problem, because the CT logs are huge and constantly changing, so we can't just "run a log and watch it". The tooling around this isn't well settled either. Finally, we need to consider that we do issue certificates regularly and those shouldn't trigger an alert.
Known solutions
- Cert Spotter is both a commercial service from sslmate.com and a free software log monitor written in Go; it outputs matching certs to stdout and doesn't seem actively maintained
- fetchallcerts.py is a homegrown proof-of-concept script from @linus that writes a JSON representation of the Merkle tree, zips up matching certs, and warns about inconsistencies in the log
- Let's Encrypt runs ct-woodpecker, which involves running a full log and is more useful for actual CAs that want to monitor their own logs
- DSA uses this nagios plugin to monitor certspotter, and it does check against the existing cert bundle; see this YAML config for how it's called ( remotecheck: "/usr/lib/nagios/plugins/dsa-check-ct-logs --domain debian.org --dir /srv/letsencrypt.debian.org/var/result/ --cert-bundle /etc/ssl/ca-global/ca-certificates.crt --subdomains --ignore-re '.*\\.acc\\.umu\\.se' --ignore-fp 90b1c027ff49c22e1dfbde6dcd4e3ef99d795ffe02e61e5ef3850896a33a430b")
the key trick, however, is to not warn when one of our own certs gets issued or renewed. therefore we need to be somewhat clever, recognize our own certificates in there, and filter those out.
I've tried to run the certspotter free software version before, and it was very heavy: it's like running a Bitcoin node in terms of constant heavy I/O and ever-growing disk space requirements. I was never able to sync to the latest "head" of the append-only log and was stuck constantly downloading certs, forever. I suspect it may be possible to run this properly, with a dedicated SSD/NVMe and maybe some kind of trimming of the logs so you don't need ever-expanding disk capacity, but it would require a bit more work than simply firing up the Docker container.
The nice thing about certspotter's commercial service is that they are basically running the above and exposing an API endpoint for you to query. They take the over 10 million new certs a day from 40+ different CT logs, index them by domain name, and then you have a simple JSON API available. Their API gives you clear programmatic results you can act on (and it actually responds quickly), e.g. https://api.certspotter.com/v1/issuances?domain=riseup.net&expand=dns_names&expand=issuer&expand=cert.
As a service, it's paid if you go over 100 hostname queries / hour, 10 full-domain queries / hour, 75 queries / minute, or 5 queries / second. That can work for some use cases, but probably not for the 'sauteed onions' project (@rhatto).
it seems they count "SCTs" from ... some logs, a bit like web browsers do right now. i didn't know what SCTs were, so I looked around and figured out those are Signed Certificate Timestamps, basically a promise from a CT log that it will add the cert to its Merkle tree within a standard timeframe (the log's maximum merge delay).
anyway, it seems like an interesting approach. i'm not sure how well it maps to the requirements here, but it at least seems relevant to the discussion, if only because it covers the opposite use case: explicitly monitoring certificates we did ask to have issued, to make sure they show up in CT logs, as opposed to monitoring for certs we did not ask for.
he wrote something called silent-ct that does what we need and that he demoed to me in Lisbon; it's awesome. here are my raw notes:
- MAC to pull valid certs
- single file with all certs concatenated, separated by magic, HMAC in heading
- each node can be restricted to a single domain
- can ignore certain logs, based on public key
- wrote CT client stuff
- follows the Google Chrome CT log list, but can follow others
sample Prometheus metrics idea:

```
# maybe not?
silent_ct_alerting_certs{cn="torproject.org",foo=bar} 1
# type counter
silent_ct_alerts_total 2
# type gauge
silent_ct_certs{state="pending"} 0
silent_ct_certs{state="alerting"} 1
silent_ct_certs{state="legitimate"} 120
# type gauge
silent_ct_log_age 1716385624
# type gauge
silent_ct_log_lag 100
```
I'm a prometheus noobie so please scream if anything looks odd or not great for generating alerts!
I'm in particular looking for input on if you need to know all possible "id" and "stored_at" values in order to generate alerts. If the answer is yes I need to go back and redo.. :)
Nice! I'm not sure I understand why the stored_at label is a JSON file
instead of the .crt, but that's probably something we can live with. Is
that path name human-readable? If not, that would certainly make alerts
harder to handle by humans...
> The intent on how to use the metrics is shown with bash here:
> I'm a prometheus noobie so please scream if anything looks odd or not great for generating alerts!
Oh dear, that script is quite something! :) I kind of have trouble
parsing your advanced bash/awk-fu there ... from what I can tell, it
fetches metrics from http://localhost:8080/metrics and ... uh... does
things? :)
Maybe it would help if I brain dump a little "prometheus alerting
primer" here?
Normally, in a Prometheus setup, you'd have prometheus scraping the
/metrics endpoint and start collecting that data in its internal TSDB
(time-series database). Then, periodically, Prometheus will evaluate
"alerting rules", which are basically Prometheus expressions written in the
query language (PromQL). When a rule's expression returns results, the Alertmanager
kicks in: it regroups and deduplicates alerts and pings the relevant
alerting endpoints (email, IRC, pager, whatever).
I think, with the metrics provided, we'd alert on time() - silentct_log_timestamp > 24*60*60 to check for stale logs, on silentct_error_counter > 0, and on silentct_need_restart > 0.
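Concretely, those could become alerting rules along these lines (just a sketch: the alert names, "for" durations, and annotations are placeholders I made up, only the expressions come from the metrics above):

```yaml
groups:
  - name: silentct
    rules:
      - alert: SilentCTLogStale
        # no fresh CT log data for over a day
        expr: time() - silentct_log_timestamp > 24*60*60
        for: 1h
        annotations:
          summary: "silent-ct has not seen fresh CT log data in over a day"
      - alert: SilentCTErrors
        expr: silentct_error_counter > 0
        annotations:
          summary: "silent-ct reported errors"
      - alert: SilentCTNeedsRestart
        expr: silentct_need_restart > 0
        annotations:
          summary: "silent-ct needs a restart"
```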
I'm not sure how to handle silentct_certificate_alert... I am not sure
having the timestamp here is useful. What I would prefer seeing is the
number of certs monitored according to their state. Earlier, I
suggested:
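```
# reconstructed sketch of the state-based gauges suggested earlier
silent_ct_certs{state="pending"} 0
silent_ct_certs{state="alerting"} 0
silent_ct_certs{state="legitimate"} 120
```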
In the above state, there are 120 certs that are found in the CT log,
and all of them are in the allow list. Also, all certs from the allow
list are found in the CT log, and no unexpected entries are in the CT
log.
An error condition I would send an alert on would be:
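```
# sketch: a cert that is in the allow list but has not been seen in the CT log
silent_ct_certs{state="pending"} 1
```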
which would mean a cert was put in the allow list but not found in the
CT log, which probably means someone forgot to clean up the allow
list. It's not a critical condition of course, but something that might
be cleaned up in the future.
In the above, of course, we don't know which cert was found in the CT
log (or allow list) that shouldn't be there, which is not necessarily
the best from an alerting perspective, as it requires the operator to do
extra work to figure out what's going on. But we typically solve this by
adding playbooks that tell the operator to look in a log file or
somewhere to find such information. When this is too complicated for the
operator, we automate it with a script.
Alternatively, we could do what you did in the current metrics and
actually say something like:
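```
# sketch of the renamed, per-certificate metric; label names and values
# are illustrative only
silentct_unexpected_certificate_count{id="...",stored_at="..."} 1
```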
So I would change the name from silentct_certificate_alert to
silentct_unexpected_certificate_count because "alert" is a bit weird as
a prometheus metric: a metric is a metric, it's not an alert. ;) In
general, we try to keep the "business logic" of alerting out of metrics:
the software records and reports metrics, and the alerting system is
where that business logic resides.
We also use a plain "1" here instead of a timestamp, because it's
slightly more obvious what we're doing and it allows us to sum up the
metrics.
With a timestamp as the value, doing a PromQL query to count how many
such alerts we have is kind of difficult. It can be done with some
casting, but I can't think of a query off the top of my head, which is a
bit of a smell...
Compare this with my approach, which would basically be:
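```
# sketch: same per-certificate labels, but a constant 1 as the value
silentct_certificate_alert{id="...",stored_at="..."} 1
```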
(i.e. 1 instead of a timestamp), and the query just becomes
sum(silentct_certificate_alert) to report how many certs are failing.
I suspect I might be misunderstanding what the metric means though...
> I'm in particular looking for input on if you need to know all possible "id" and "stored_at" values in order to generate alerts. If the answer is yes I need to go back and redo.. :)
We do not need to know all possible values. Those are labels that can
then be used in the alerting templates, so they're useful in their own right as well.
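For example (a sketch using the metric and label names from this discussion), an alerting rule can interpolate those labels straight into the alert text:

```yaml
- alert: SilentCTUnexpectedCertificate
  expr: silentct_certificate_alert > 0
  annotations:
    # the metric's labels are available as $labels in the rule templates
    summary: "unexpected cert {{ $labels.id }} (stored at {{ $labels.stored_at }})"
```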
Thanks for the very thoughtful comments and link! I'll make a new revision, current thinking based on the above would be:
Keep the non-certificate metrics as is, looks like they will work for you (but I'll expand the metrics.md file with hints on how to use them rather than linking bash/awk kungfu).
Redo the "certificate alert" metric, new stab something like this:
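```
# sketch of the shape only; label names and values are illustrative
silentct_unexpected_certificate_count{id="...",stored_at="..."} 1
```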
Do I understand correctly that alert manager can then both sum (how many unexpected certificates are there), and it could also (when generating alerts) include information from the respective labels? Is this what you want, or do you prefer the simple version and then just log onto the machine to find out what's wrong?
> Thanks for the very thoughtful comments and link! I'll make a new revision, current thinking based on the above would be:
> Keep the non-certificate metrics as is, looks like they will work for you (but I'll expand the metrics.md file with hints on how to use them rather than linking bash/awk kungfu).
> Redo the "certificate alert" metric, new stab something like this:
I would also avoid keeping the stored_at label altogether; that can be
part of the playbook.
> Do I understand correctly that alert manager can then both sum (how many unexpected certificates are there), and it could also (when generating alerts) include information from the respective labels? Is this what you want, or do you prefer the simple version and then just log onto the machine to find out what's wrong?
I am not sure! It really depends on the variability of the labels in the
output. My hunch is your latter suggestion will be fine. Let me expand
on that one a bit:
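```
# sketch: one sample per unexpected certificate, value 1, with whatever
# labels end up identifying the certificate
silentct_unexpected_certificate_count{...} 1
```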
with the above. I'd start alerting on individual matches in the
unexpected state, but if that's too noisy, I'd do the aggregation. The
point is that this is business logic that doesn't need to be in the
exporter, it can be done on our side.
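To make that concrete, a sketch of the two options (assuming the silentct_unexpected_certificate_count name):

```yaml
# per-certificate alerting: one alert per label set, so the notification
# says exactly which cert is unexpected
- alert: SilentCTUnexpectedCertificate
  expr: silentct_unexpected_certificate_count > 0
# aggregated fallback if that turns out too noisy: a single alert that
# only carries the total count
- alert: SilentCTUnexpectedCertificates
  expr: sum(silentct_unexpected_certificate_count) > 0
```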
Thanks!
I opted for no state label and just "silentct_unexpected_certificate_count". Motivated by: I'm not sure we need metrics for the other states, because the different states are mostly used internally for the monitor to know (and report) when a certificate that is worth your attention is spotted.
I also added label "crt_sans" for "silentct_unexpected_certificate_count".
I also removed label "stored_at", and replaced it with two labels "log_id" and "log_index" which are helpful for your playbook to find more information about the unexpected certificate. I elaborated slightly more on this in the metrics.md file.
Does the updated version seem reasonable to you now?
(I also tried to redeem myself and refactor the bash script without awk kung-fu. But I think the above metrics.md should draw the picture of how to use the metrics without it so it's not linked. I'd be happy to get better/other examples of how you're using the metrics in the contrib/ directory sometime in the future!)
For context: I'll poke you when there is a beta.1 tag that I recommend experimenting with. There are 1-2 more (unrelated to metrics) fixes I'd like to get merged as well. And I also plan to type up some kind of brief docs/handbook or similar that I think should be sufficient to get started.
> I opted for no state label and just "silentct_unexpected_certificate_count". Motivated by: I'm not sure we need metrics for the other states, because the different states are mostly used internally for the monitor to know (and report) when a certificate that is worth your attention is spotted.
i think i would worry about missing a situation where no certs are monitored at all. in such a misconfiguration, we'd have no cert monitored, so no alert, but it would still be a failure mode we should cover.
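for example, if the exporter exposed a gauge for the total number of certs being watched (hypothetical name, just to illustrate the failure mode), we could catch that misconfiguration with a PromQL expression like:

```
# silentct_monitored_certificates is hypothetical, not in the current metric set
silentct_monitored_certificates == 0 or absent(silentct_monitored_certificates)
```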
> I also added label "crt_sans" for "silentct_unexpected_certificate_count".
> I also removed label "stored_at", and replaced it with two labels "log_id" and "log_index" which are helpful for your playbook to find more information about the unexpected certificate. I elaborated slightly more on this in the metrics.md file.
nice!
> (I also tried to redeem myself and refactor the bash script without awk kung-fu. But I think the above metrics.md should draw the picture of how to use the metrics without it so it's not linked. I'd be happy to get better/other examples of how you're using the metrics in the contrib/ directory sometime in the future!)
> For context: I'll poke you when there is a beta.1 tag that I recommend experimenting with. There are 1-2 more (unrelated to metrics) fixes I'd like to get merged as well. And I also plan to type up some kind of brief docs/handbook or similar that I think should be sufficient to get started.