while we have some way for people to add exporters to our prometheus/grafana setup, it's not clear how they should actually export those metrics to us.
here are the questions that need answering in this proposal:
how to expose metrics (port number? dedicated vhost? different subpath? we do all of those right now)
(how) to encrypt the communication between the scraper and exporter (TLS? plain http? either? we do both right now)
(how) to restrict access to the exporters? (IP-based allow lists? HTTP user/pass auth? bearer tokens? TLS client certs? nothing? i believe we do everything but TLS client certs and, hopefully, "nothing" here; a config sketch below shows these axes)
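To make the option space concrete, here is a minimal prometheus.yml sketch of the different combinations (hostnames, paths, and credentials are made up for illustration, not our actual setup):

```yaml
scrape_configs:
  # option 1: dedicated port, plain HTTP, no auth
  - job_name: example-port
    static_configs:
      - targets: ['service.example.torproject.org:9100']

  # option 2: dedicated vhost, default /metrics path, TLS, basic auth
  - job_name: example-vhost
    scheme: https
    basic_auth:
      username: prometheus
      password: CHANGEME  # hypothetical credential
    static_configs:
      - targets: ['service-metrics.example.torproject.org']

  # option 3: shared vhost, custom subpath, TLS, bearer token
  - job_name: example-subpath
    scheme: https
    metrics_path: /rdsys-backend-metrics
    authorization:
      credentials: CHANGEME  # hypothetical token
    static_configs:
      - targets: ['bridges.torproject.org']
```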
The question of time series privacy is out of scope here and is handled in #40755 (closed), where we are currently heading towards tor-internal-level privacy and merging the Prometheus servers into a single cluster.
this came up in operations with the anti-censorship team (e.g. #41265 (closed)) but also with the network-health folks, who are all stakeholders.
Currently in anti-censorship we have all exporters public. Maybe @cohosh has some historical background on why it is like that. But we only expose metrics that can be public, and we prefer that others can have access to them. I think we'd like to continue having the metrics publicly reachable. But if our prometheus server were public we might not care so much about having the metrics public.
I can't remember the historical reasons exactly. I think we want these metrics to be public ultimately, and we've taken steps to make them safe for that, but since the prometheus server and the grafana instances aren't public, making the exporter public was the best option? But I agree that it would be better to make the prometheus server public, since that would give the community access to historical data.
We have evolved the rules on exporters a few times; the ones I know are:
First we exposed multiple exporters under the same domain name, with paths like /rdsys-backend-metrics (see all the metrics at https://bridges.torproject.org/)
Then we decided to have one hostname per service, with the metrics exposed at /metrics on that hostname (#40789 (comment 2815880))
I just noticed that the last option, exposing the port to the internet, will not work with some of the ACT services as they are now. We've been using the same port for both metrics and internal APIs that we don't want to expose. If we decide to go this way we'll need to change that and use different ports for different things, but that should not be a blocker.
@dgoulet @hiro i would love to hear your thoughts around here as well... how do you think prometheus should scrape exporters? what kind of access control? can we make metrics public?
some background...
originally, we initiated the separation between prom1 and prom2 on the basis that those metrics should not be publicly available. the rationale for that design is here:
The "external" server, on the other hand, is more restrictive and does not allow public access. This is out of concern that specific metrics might lead to timing attacks against the network and/or leak sensitive information. The external server also explicitly does not scrape TPA servers automatically: it only scrapes certain services that are manually configured by TPA.
Turns out it was the anti-censorship team who requested such isolation, as far as I can tell. It's even @cohosh who got the budget approved for the second box. But who did what at this point is largely irrelevant. :)
Now, I feel we let the cat out of the bag and it's going to be hard to put it back in.
In other words, we don't quite know if we can reopen that server publicly at all. I would actually love to stop that separation and merge the two servers into one. It would make my job much easier (one less server to manage!), remove a lot of confusion ("wait, is this on prom1 or prom2?", "why don't i have the node exporter stats here?"), and allow for interesting extra features (e.g. high availability, long-term storage through remote read resampling, etc).
But that's kind of a huge deal. Last person I remember worrying about this is @dgoulet, but I might be wrong about that...
I am against making the metrics dashboards public. The collected data is public but I do not want the data from victoria metrics or our metrics db exposed because I am afraid people would overload our infrastructure. Both those services are protected for that exact reason.
prom2 and graf2 have our relay metrics in there. They CANNOT be made public. I don't care about moving them to a priv-prom.tpo or whatever, but in this current format, the "2" servers can't be public, for serious safety reasons.
After that, I don't have any opinions on policy about these exporters.
Setting aside the private metrics that shouldn't be exposed... I would caution you against making prometheus (or grafana) public: with either of those you can construct arbitrary queries, which can be used to overwhelm the server(s). Some metrics with very high cardinality, combined with the wrong query period, could result in a huge query that will OOM the server. (The number of times a small number of friendly admins have constructed dashboards in grafana that they leave open in their tabs and cause the server to collapse is amusing, but it would be horrifying if it were open to the internet for any troll to mess with.)
I'd make exporters for metrics public, if the data is not private, and serve them over TLS (so that you have authenticity of the data). For private data, I wouldn't try to 'redact' it in some way, because that will be difficult to get right and too dangerous to get wrong.
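For what it's worth, a sketch of what "public, over TLS" could look like on the exporter side: exporters built on the Prometheus exporter-toolkit (which includes most official ones) take a web config file via --web.config.file. Paths and the user entry here are placeholders:

```yaml
# web-config.yml, passed as: some_exporter --web.config.file=web-config.yml
tls_server_config:
  cert_file: /etc/exporter/tls/server.crt  # placeholder paths
  key_file: /etc/exporter/tls/server.key
# optionally require basic auth on top of TLS:
basic_auth_users:
  prometheus: $2y$10$...  # bcrypt hash of the scraper's password
```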
We should curate public dashboards from grafana and make those available (without http basic auth restrictions).
As far as how often to scrape metrics: I think we should keep no more than 1-2 weeks of high-resolution data (e.g. scraping every ~10 seconds), and at least a year of lower-resolution metrics.
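Assuming two Prometheus instances, that could look roughly like this (retention is a command-line flag, not part of prometheus.yml):

```yaml
# high-resolution tier, run with:
#   prometheus --storage.tsdb.retention.time=15d
global:
  scrape_interval: 10s
# the long-term tier would use something like scrape_interval: 60s and
#   prometheus --storage.tsdb.retention.time=1y
```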
We should curate public dashboards from grafana and make those available (without http basic auth restrictions).
could you expand on that? how do we expose grafana securely? i was under the impression someone can just (ab)use grafana to inspect all metrics in the first place...
that would require changing authentication to remove the apache-level authentication system (#30023). I'm also not super clear on the security implications; that page says:
Arbitrary queries cannot be run against your data sources through public dashboards. Public dashboards can only execute the queries stored on the original dashboard.
I'm not sure we're worried about "arbitrary queries"; we're worried about actual queries that give de-aggregated results. I'm not sure I understand enough about how Prom and Grafana (and mostly Grafana) work to tell how the data travels around there. I think the way this works is: the browser requests the query set in the panel, grafana passes the query to the datasource, takes the result, and sends it back to the browser, which displays it in a graph. But isn't that exactly the kind of stuff we're worried about, samples leaking out of Prometheus?
I guess we'd make public only dashboards we're confident are doing proper aggregation?
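One way to make "proper aggregation" concrete would be to only expose series produced by recording rules that aggregate away the sensitive dimensions, e.g. per-instance labels. A hypothetical rule (metric names invented):

```yaml
groups:
  - name: public-aggregates
    rules:
      # sum across all instances, so no per-instance series is exposed
      - record: job:bridge_requests:sum
        expr: sum without (instance) (bridge_requests_total)
```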
I absolutely am worried about "arbitrary queries", as I mentioned in my original message:
I would caution you against making prometheus (or grafana) public: with either of those you can construct arbitrary queries, which can be used to overwhelm the server(s). Some metrics with very high cardinality, combined with the wrong query period, could result in a huge query that will OOM the server. (The number of times a small number of friendly admins have constructed dashboards in grafana that they leave open in their tabs and cause the server to collapse is amusing, but it would be horrifying if it were open to the internet for any troll to mess with.)
I think if we construct a graph, or a dashboard that is doing a query on data that we are fine with making public, then there is no concern about samples leaking out of Prometheus. I don't understand where the worry about samples leaking out would be here.
To take an example, @meskio has a set of metrics that are exported and publicly available for anyone to scrape into their own Prometheus already. We scrape those, and then in Grafana @meskio makes a graph from that data, a graph that represents it in an interesting way and isn't doing some kind of insane query, and then makes that graph public. That public grafana graph is now available for the public to look at. They cannot modify the queries involved in that graph; they can only see the results of the query/queries that @meskio determined to be safe and useful to make public.
Relay metrics scraped from the MetricsPort are NOT meant to be made public. This would allow insight into public relays in the network and thus open an attack vector for de-anonymization attacks, for instance by making it possible to correlate load, spikes, or traffic patterns.
Got it. Sorry for the many questions, but what are your requirements for those metrics? Do you need Grafana? How long a retention period? How frequent a scrape interval?
This would help me plan the architecture: we could, for example, have a short-term scraper with a higher scrape frequency and a shorter retention time, but private, and a public server with longer retention, a longer scrape interval, and your metrics redacted.
Is there a namespace for your metrics, e.g. a common prefix on all sensitive metrics? (Just curious: we could just slap a label on the job and exclude metrics matching that label, i think; see the sketch below.)
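To sketch the second half of that architecture (hypothetical hostname; the relay_ prefix is made up for illustration): the public server could federate from the private one at a lower resolution and drop the sensitive namespace before storing anything:

```yaml
scrape_configs:
  - job_name: federate-private
    honor_labels: true
    metrics_path: /federate
    scrape_interval: 60s  # lower resolution on the public tier
    params:
      'match[]':
        - '{job=~".+"}'  # pull everything, then filter below
    metric_relabel_configs:
      # drop anything under the (hypothetical) sensitive prefix
      - source_labels: [__name__]
        regex: 'relay_.*'
        action: drop
    static_configs:
      - targets: ['prom-private.torproject.org:9090']  # placeholder host
```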
Retention: we should bring in @gk on this, but I would say at least 90 days, even 180 days if possible. Since these are network-wide stats, the longer the better, essentially, to be able to compare points in time in a relatively short term.
Scrape interval can be 60 seconds imo, no problem there. Nothing urgent there.