now that certificate transparency (CT) is a thing, we should probably start monitoring those logs. specifically, because major web browsers now require CAs to publish their certs in CT, there's a good chance that a hostile actor who manages to get a certificate issued for us would show up on that radar.
(it wouldn't keep a rogue CA from generating a fake cert though, as they could, in theory, disobey those policies at some point.)
The goal here is to have an alert in monitoring (typically nagios/icinga right now) when a rogue certificate gets issued so we can act quickly on it. What to do next is out of scope, but should also be considered eventually.
This is actually a surprisingly hard problem, because the CT logs are huge and constantly changing, so we can't just "run a log and watch it". The tooling around this isn't well settled either. Finally, we need to consider that we do issue certificates regularly and those shouldn't trigger an alert.
Known solutions
- Cert Spotter is both a commercial service from sslmate.com and a free software log monitor written in Go; it outputs matching certs to stdout and doesn't seem actively maintained
- fetchallcerts.py is a homegrown proof-of-concept script from @linus that writes a JSON representation of the Merkle tree, zips up matching certs, and warns about inconsistencies in the log
- Let's Encrypt runs ct-woodpecker, which involves running a full log and is more useful for actual CAs that want to monitor their own logs
- DSA uses this nagios plugin to monitor certspotter, and it does check against the existing cert bundle; see this YAML config for how it's called ( remotecheck: "/usr/lib/nagios/plugins/dsa-check-ct-logs --domain debian.org --dir /srv/letsencrypt.debian.org/var/result/ --cert-bundle /etc/ssl/ca-global/ca-certificates.crt --subdomains --ignore-re '.*\\.acc\\.umu\\.se' --ignore-fp 90b1c027ff49c22e1dfbde6dcd4e3ef99d795ffe02e61e5ef3850896a33a430b")
the key trick, however, is to not warn when one of our own certs gets issued or renewed. therefore we need to be somewhat clever, recognize our own certificates in there, and filter those out.
I've tried to run the certspotter free software version before, and it was very heavy: it's like running a Bitcoin node in terms of constant heavy I/O and ever-growing disk space requirements. I was never able to sync to the latest "head" of the append-only log and was stuck constantly downloading certs, forever. I suspect it may be possible to run this properly, with a dedicated SSD/NVMe and maybe some kind of trimming of the logs so you don't need ever-expanding disk capacity, but it would require a bit more work than simply firing up the Docker container.
The nice thing about certspotter's commercial service is that they are basically running the above and exposing an API endpoint for you to query. They take the over 10 million new certs a day from 40+ different CT logs, index them by domain name, and then you have a simple JSON API available. Their API gives you clear programmatic results you can act on (and it actually responds quickly), e.g. https://api.certspotter.com/v1/issuances?domain=riseup.net&expand=dns_names&expand=issuer&expand=cert.
As a service, it's paid if you go over 100 hostname queries / hour, 10 full-domain queries / hour, 75 queries / minute, or 5 queries / second. That can work for some use cases, but probably not for the 'sauteed onions' project (@rhatto).
it seems they count "SCTs" from ... some logs, a bit like web browsers do right now. i didn't know what SCTs were, so I looked around and figured out those are Signed Certificate Timestamps, basically a promise from a CT log that it will add the cert to its Merkle tree within a standard timeframe (the log's maximum merge delay).
anyway, it seems like an interesting approach. i'm not sure how well it maps to the requirements here, but it at least seems relevant to the discussion, if only because it covers the opposite use case: explicitly monitoring certificates we did ask to have issued, to make sure they show up in CT logs, as opposed to monitoring for certs we did not ask for.
he wrote something called silent-ct that does what we need and that he demoed to me in Lisbon; it's awesome. here are my raw notes:
- MAC to pull valid certs
- single file with all certs concatenated, separated by magic, HMAC in heading
- each node can be restricted to a single domain
- can ignore certain logs, based on public key
- wrote CT client stuff
- follows the Google Chrome CT log list, but can follow others
sample Prometheus metrics idea:

```
# maybe not?
silent_ct_alerting_certs{cn="torproject.org",foo=bar} 1
# type counter
silent_ct_alerts_total 2
# type gauge
silent_ct_certs{state="pending"} 0
silent_ct_certs{state="alerting"} 1
silent_ct_certs{state="legitimate"} 120
# type gauge
silent_ct_log_age 1716385624
# type gauge
silent_ct_log_lag 100
```
I'm a prometheus noobie so please scream if anything looks odd or not great for generating alerts!
I'm in particular looking for input on if you need to know all possible "id" and "stored_at" values in order to generate alerts. If the answer is yes I need to go back and redo.. :)
Nice! I'm not sure I understand why the stored_at label is a JSON file
instead of the .crt, but that's probably something we can live with. Is
that path name human-readable? If not, that would certainly make alerts
harder to handle by humans...
> The intent on how to use the metrics is shown with bash here:
> I'm a prometheus noobie so please scream if anything looks odd or not great for generating alerts!
Oh dear, that script is quite something! :) I kind of have trouble
parsing your advanced bash/awk-fu there ... from what I can tell, it
fetches metrics from http://localhost:8080/metrics and ... uh... does
things? :)
Maybe it would help if I brain dump a little "prometheus alerting
primer" here?
Normally, in a Prometheus setup, you'd have prometheus scraping the
/metrics endpoint and start collecting that data in its internal TSDB
(time-series database). Then, periodically, Prometheus will evaluate
"alerting rules", which are basically Prometheus expressions written in the
query language (PromQL). When a rule's expression returns results, the Alertmanager
kicks in: it regroups and deduplicates alerts and pings the relevant
alerting endpoints (email, IRC, pager, whatever).
I think, with the metrics provided, we'd alert on time() - silentct_log_timestamp > 24*60*60 to check for stale logs, on silentct_error_counter > 0, and on silentct_need_restart > 0.
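Concretely, those could become alerting rules along these lines (just a sketch: the alert names, "for" durations, and annotations are placeholders I made up, only the expressions come from the metrics above):

```yaml
groups:
  - name: silentct
    rules:
      - alert: SilentCTLogStale
        # no fresh CT log data for over a day
        expr: time() - silentct_log_timestamp > 24*60*60
        for: 1h
        annotations:
          summary: "silent-ct has not seen fresh CT log data in over a day"
      - alert: SilentCTErrors
        expr: silentct_error_counter > 0
        annotations:
          summary: "silent-ct reported errors"
      - alert: SilentCTNeedsRestart
        expr: silentct_need_restart > 0
        annotations:
          summary: "silent-ct needs a restart"
```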
I'm not sure how to handle silentct_certificate_alert... I am not sure
having the timestamp here is useful. What I would prefer seeing is the
number of certs monitored according to their state. Earlier, I
suggested:
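```
# reconstructed sketch of the state-based gauges suggested earlier
silent_ct_certs{state="pending"} 0
silent_ct_certs{state="alerting"} 0
silent_ct_certs{state="legitimate"} 120
```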
In the above state, there are 120 certs that are found in the CT log,
and all of them are in the allow list. Also, all certs from the allow
list are found in the CT log, and no unexpected entries are in the CT
log.
An error condition I would send an alert on would be:
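```
# sketch: a cert that is in the allow list but has not been seen in the CT log
silent_ct_certs{state="pending"} 1
```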
which would mean a cert was put in the allow list but not found in the
CT log, which probably means someone forgot to clean up the allow
list. It's not a critical condition of course, but something that might
be cleaned up in the future.
In the above, of course, we don't know which cert was found in the CT
log (or allow list) that shouldn't be there, which is not necessarily
the best from an alerting perspective, as it requires the operator to do
extra work to figure out what's going on. But we typically solve this by
adding playbooks that tell the operator to look in a log file or
somewhere to find such information. When this is too complicated for the
operator, we automate it with a script.
Alternatively, we could do what you did in the current metrics and
actually say something like:
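```
# sketch of the renamed, per-certificate metric; label names and values
# are illustrative only
silentct_unexpected_certificate_count{id="...",stored_at="..."} 1
```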
So I would change the name from silentct_certificate_alert to
silentct_unexpected_certificate_count because "alert" is a bit weird as
a prometheus metric: a metric is a metric, it's not an alert. ;) In
general, we try to keep the "business logic" of alerting out of metrics:
the software records and reports metrics, and the alerting system is
where that business logic resides.
We also use a plain "1" here instead of a timestamp, because it's
slightly more obvious what we're doing and it allows us to sum up the
metrics.
With a timestamp as the value, doing a PromQL query to count how many
such alerts we have is kind of difficult. It can be done with some
casting, but I can't think of a query off the top of my head, which is a
bit of a smell...
Compare this with my approach, which would basically be:
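```
# sketch: same per-certificate labels, but a constant 1 as the value
silentct_certificate_alert{id="...",stored_at="..."} 1
```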
(i.e. 1 instead of a timestamp), and the query just becomes
sum(silentct_certificate_alert) to report how many certs are failing.
I suspect I might be misunderstanding what the metric means though...
> I'm in particular looking for input on if you need to know all possible "id" and "stored_at" values in order to generate alerts. If the answer is yes I need to go back and redo.. :)
We do not need to know all possible values. Those are labels that can
then be used in the alerting templates, so they're useful in their own right as well.
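For example (a sketch using the metric and label names from this discussion), an alerting rule can interpolate those labels straight into the alert text:

```yaml
- alert: SilentCTUnexpectedCertificate
  expr: silentct_certificate_alert > 0
  annotations:
    # the metric's labels are available as $labels in the rule templates
    summary: "unexpected cert {{ $labels.id }} (stored at {{ $labels.stored_at }})"
```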
Thanks for the very thoughtful comments and link! I'll make a new revision, current thinking based on the above would be:
Keep the non-certificate metrics as is, looks like they will work for you (but I'll expand the metrics.md file with hints on how to use them rather than linking bash/awk kungfu).
Redo the "certificate alert" metric, new stab something like this:
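```
# sketch of the shape only; label names and values are illustrative
silentct_unexpected_certificate_count{id="...",stored_at="..."} 1
```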
Do I understand correctly that alert manager can then both sum (how many unexpected certificates are there), and it could also (when generating alerts) include information from the respective labels? Is this what you want, or do you prefer the simple version and then just log onto the machine to find out what's wrong?
> Thanks for the very thoughtful comments and link! I'll make a new revision, current thinking based on the above would be:
> Keep the non-certificate metrics as is, looks like they will work for you (but I'll expand the metrics.md file with hints on how to use them rather than linking bash/awk kungfu).
> Redo the "certificate alert" metric, new stab something like this:
I would also avoid keeping the stored_at label altogether; that can be
part of the playbook.
> Do I understand correctly that alert manager can then both sum (how many unexpected certificates are there), and it could also (when generating alerts) include information from the respective labels? Is this what you want, or do you prefer the simple version and then just log onto the machine to find out what's wrong?
I am not sure! It really depends on the variability of the labels in the
output. My hunch is your latter suggestion will be fine. Let me expand
on that one a bit:
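```
# sketch: one sample per unexpected certificate, value 1, with whatever
# labels end up identifying the certificate
silentct_unexpected_certificate_count{...} 1
```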
with the above. I'd start alerting on individual matches in the
unexpected state, but if that's too noisy, I'd do the aggregation. The
point is that this is business logic that doesn't need to be in the
exporter, it can be done on our side.
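To make that concrete, a sketch of the two options (assuming the silentct_unexpected_certificate_count name):

```yaml
# per-certificate alerting: one alert per label set, so the notification
# says exactly which cert is unexpected
- alert: SilentCTUnexpectedCertificate
  expr: silentct_unexpected_certificate_count > 0
# aggregated fallback if that turns out too noisy: a single alert that
# only carries the total count
- alert: SilentCTUnexpectedCertificates
  expr: sum(silentct_unexpected_certificate_count) > 0
```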
Thanks!
I opted for no state label and just "silentct_unexpected_certificate_count". Motivated by: I'm not sure we need metrics for the other states, because the different states are mostly used internally for the monitor to know (and report) when a certificate that is worth your attention is spotted.
I also added label "crt_sans" for "silentct_unexpected_certificate_count".
I also removed label "stored_at", and replaced it with two labels "log_id" and "log_index" which are helpful for your playbook to find more information about the unexpected certificate. I elaborated slightly more on this in the metrics.md file.
Does the updated version seem reasonable to you now?
(I also tried to redeem myself and refactor the bash script without awk kung-fu. But I think the above metrics.md should draw the picture of how to use the metrics without it so it's not linked. I'd be happy to get better/other examples of how you're using the metrics in the contrib/ directory sometime in the future!)
For context: I'll poke you when there is a beta.1 tag that I recommend experimenting with. There are 1-2 more (unrelated to metrics) fixes I'd like to get merged as well. And I also plan to type up some kind of brief docs/handbook or similar that I think should be sufficient to get started.
> I opted for no state label and just "silentct_unexpected_certificate_count". Motivated by: I'm not sure we need metrics for the other states, because the different states are mostly used internally for the monitor to know (and report) when a certificate that is worth your attention is spotted.
i think i would worry about missing a situation where no certs are monitored at all. in such a misconfiguration, we'd have no cert monitored, so no alert, but it would still be a failure mode we should cover.
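for example, if the exporter exposed a gauge for the total number of certs being watched (hypothetical name, just to illustrate the failure mode), we could catch that misconfiguration with a PromQL expression like:

```
# silentct_monitored_certificates is hypothetical, not in the current metric set
silentct_monitored_certificates == 0 or absent(silentct_monitored_certificates)
```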
> I also added label "crt_sans" for "silentct_unexpected_certificate_count".
> I also removed label "stored_at", and replaced it with two labels "log_id" and "log_index" which are helpful for your playbook to find more information about the unexpected certificate. I elaborated slightly more on this in the metrics.md file.
nice!
> (I also tried to redeem myself and refactor the bash script without awk kung-fu. But I think the above metrics.md should draw the picture of how to use the metrics without it so it's not linked. I'd be happy to get better/other examples of how you're using the metrics in the contrib/ directory sometime in the future!)
> For context: I'll poke you when there is a beta.1 tag that I recommend experimenting with. There are 1-2 more (unrelated to metrics) fixes I'd like to get merged as well. And I also plan to type up some kind of brief docs/handbook or similar that I think should be sufficient to get started.