I would like to use StatsD/Graphite to collect metrics for bridgestrap and rdsys. @anarcat asked on IRC if Prometheus would be a suitable alternative. I'm not sure, but I'll start with what I like about StatsD/Graphite:
I like StatsD because it accepts metrics over UDP and is therefore decoupled from the application that generates metrics. Adding a new metric to the code is as simple as adding:
metrics.Inc("rdsys.foo.bar",1)
I like Graphite because it enables powerful comparison and composition of time series. For example, I can take the time series "# of functional bridges" and "# of dysfunctional bridges", plot them against each other, etc.
There is a Docker image that contains StatsD/Graphite and allows for easy setup.
I still have a poor understanding of what Prometheus can do. Is the above something that Prometheus/Grafana can do? If so, how would we send metrics to Prometheus? Note that I prefer a push-based model because it reduces the complexity of our code.
What are your thoughts on the above?
> I like StatsD because it accepts metrics over UDP and is therefore decoupled from the application that generates metrics. Adding a new metric to the code is as simple as adding:
> `metrics.Inc("rdsys.foo.bar", 1)`
I'm not familiar with the code you're talking about, but it kind of looks like Prometheus. :) Prometheus is "pull" (i.e. the Prometheus server scrapes metrics from endpoints, which are called exporters) instead of "push" (if I understand correctly, endpoints push their data to StatsD over UDP?), so that's a significant design difference. But I don't think it's a big problem: even in the above code, the "exporter" still needs to keep track of the metric...
Prometheus is written in golang, by the way, so it's a natural fit in case golang is the language for your project here. It should be fairly easy to add an exporter to your project.
Prometheus talks over TCP/HTTP instead of UDP, also.
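To make that concrete, here is a minimal sketch of what the "exporter" side could look like with the official client_golang library (the metric name and port are made up):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// The application keeps the counter in memory; Prometheus scrapes it.
var fooBar = promauto.NewCounter(prometheus.CounterOpts{
	Name: "rdsys_foo_bar_total", // hypothetical metric name
	Help: "Number of foo bar events.",
})

func main() {
	fooBar.Inc() // the statsd one-liner becomes this

	// Expose all registered metrics on /metrics for scraping.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":6000", nil) // port is arbitrary
}
```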
> I like Graphite because it enables powerful comparison and composition of time series. For example, I can take the time series "# of functional bridges" and "# of dysfunctional bridges", plot them against each other, etc.
This is definitely possible with Grafana. You can plot arbitrary metrics from the backend, regardless of provenance. e.g. you could plot # of functional bridges vs network load, even if the former comes from the tor daemon (say) and the latter comes from Linux traffic stats.
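For example, assuming hypothetical metric names, a single Grafana panel could graph both series, or a ratio derived from them, with plain PromQL:

```
# plot both series against each other (metric names are made up)
bridgestrap_functional_bridges
bridgestrap_dysfunctional_bridges

# or a derived series: the fraction of bridges that are functional
bridgestrap_functional_bridges
  / (bridgestrap_functional_bridges + bridgestrap_dysfunctional_bridges)
```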
> There is a Docker image that contains StatsD/Graphite and allows for easy setup.
We already have Prometheus set up, and it's well integrated in our infrastructure. That said, there are packages for statsd/graphite in Debian, which we would probably use if we really wanted to go that route.
All that being said, I don't think we should diverge from the existing monitoring infrastructure. We have put a lot of effort into deploying what is slowly becoming the industry standard for monitoring (prometheus), and it works, and it works well enough for your needs, I believe. :)
Assuming we're moving forward with Prometheus: How can we test this? I would like to add a bunch of metrics, have them scraped by Prometheus, and then play with its UI.
I filed an issue to create a new username on polyanthum (see #40081 (closed)), and once that's done, I'll set up our tool with its new exporter. It will be available at bridges.torproject.org:6000/metrics.
Note that some of the metrics are sensitive, so bridges.tp.o:6000/metrics should not be publicly accessible. Do you have thoughts on how to best accomplish this?
Also, how do I get a Prometheus/Grafana user account? Should I file a ticket for that?
> Note that some of the metrics are sensitive, so bridges.tp.o:6000/metrics should not be publicly accessible. Do you have thoughts on how to best accomplish this?
Normally, servers are firewalled and ports stay closed until they are explicitly opened; this would include that port. As part of the prometheus configuration, we open the firewall only to the Prometheus server. So it's IP-based blocking, which can be fooled of course, but it's good enough for our other metrics. Would it be for yours?
Keep in mind that the actual metrics end up being public on the primary, public prometheus server: https://prometheus.torproject.org/ is protected by what we call "a trivial password", as a measure to protect from bots more than from prying eyes. We do have a secondary prometheus server, however, which is designed to hold more private metrics, so we could use that.
> Also, how do I get a Prometheus/Grafana user account? Should I file a ticket for that?
I guess we can do this here or in another ticket, as you wish. I don't think we'd grant you access directly to the Prometheus interface as it's not very useful; Grafana does everything the Prometheus web interface does and more.
> Normally, servers are firewalled and ports stay closed until they are explicitly opened; this would include that port. As part of the prometheus configuration, we open the firewall only to the Prometheus server. So it's IP-based blocking, which can be fooled of course, but it's good enough for our other metrics. Would it be for yours?
Yes, that sounds good to me.
> I guess we can do this here or in another ticket, as you wish.
Ok, let's do it here. Can you please create a Grafana account for me?
In the past, a point of friction between our prometheus server and other people using it was that I (or someone with puppet access) had to add the endpoints that prometheus needs to scrape in puppet.
I think we should invest some time planning for a system where people can send a pull request instead.
I have created issue #40089 (closed) to track this in parallel to this ticket.
Brief update to keep you in the loop: my service now supports Prometheus metrics via a /metrics page. I am now considering having my service spin up two Web servers: one implements the service API and is bound to 127.0.0.1; the other serves the metrics and is bound to 0.0.0.0. I'll let you know once I'm done with this and the changes are live.
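In Go terms, the idea is roughly this (ports and handler are placeholders):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func handleBridgeState(w http.ResponseWriter, r *http.Request) {
	// ... bridgestrap's actual API logic would go here ...
}

func main() {
	// Private API: bound to loopback, so only local clients (or a
	// local reverse proxy) can reach it. Port is a placeholder.
	api := http.NewServeMux()
	api.HandleFunc("/bridge-state", handleBridgeState)
	go http.ListenAndServe("127.0.0.1:5001", api)

	// Metrics: bound to all interfaces so Prometheus can scrape them.
	metrics := http.NewServeMux()
	metrics.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe("0.0.0.0:6000", metrics)
}
```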
Things turned out to be more complicated than I thought because bridgestrap's /metrics handler (which your Prometheus will scrape) is semi-public while its API (accessible via /bridge-state) is not. After thinking about this more, I see two sensible options:
1. Our service binds its Web server (which serves both /metrics and /bridge-state) to 127.0.0.1. We add a new ProxyPass directive to the Apache config which forwards, say, /bridgestrap-metrics to 127.0.0.1/metrics (sketched below). Bridgestrap's metrics aren't sensitive, so having its metrics page be public is fine.
2. Our service binds its Web server to 0.0.0.0 and adds an authentication mechanism to /bridge-state but not to its /metrics page. This will require work on your side because we'll have to obtain TLS certificates for the service. We could do that with Let's Encrypt, but that requires that the bridgestrap executable can bind to either port 80 or 443, so I'll need extra permissions for that.
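For the first option, the Apache change would presumably be a couple of lines like these (the public path and local port are made up for illustration):

```apache
# Forward a public path to the loopback-only bridgestrap listener.
ProxyPass        /bridgestrap-metrics http://127.0.0.1:5000/metrics
ProxyPassReverse /bridgestrap-metrics http://127.0.0.1:5000/metrics
```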
Does the above make sense? Either option would be fine with me. What are your thoughts @anarcat?
i think the first option makes sense. we already have special configuration for your service in apache, i don't see why we couldn't add that as well.
we could even make a special vhost that listens on a special port for that /metrics endpoint and use the firewall to block it from everything but prometheus.
i like to think of apache as a gatekeeper/router anyways, it makes sense. and i definitely don't like the idea of you starting to mess around with let's encrypt, that's TPA stuff for now that you shouldn't have to bother with.
A heads-up: At some point in the future, we will expose the Prometheus metrics of our second service, rdsys. These are sensitive, so we won't be able to expose them the way we do with bridgestrap.
> A heads-up: At some point in the future, we will expose the Prometheus metrics of our second service, rdsys. These are sensitive, so we won't be able to expose them the way we do with bridgestrap.
As I said, this could also be done with Apache: we'd just set up a listener on a separate port. Or the second service could just listen on a public port, since those are firewalled by default anyways... Maybe it would be simpler that way? :)
(In fact, you could have set up that /metrics endpoint yourself, without apache, if you wanted - then I would have had to add a firewall rule instead of an apache one, but it's all the same to me, really. :)
btw, i haven't forgotten about creating you a new user, but it's still more complicated than i thought, so follow #40102 (closed) for progress on that.
It would be great if you could start scraping the metrics – assuming that it's fine that 1) the endpoint will occasionally be offline and 2) we're still experimenting with our metrics, so they are likely to change over time?
And yes, the URL is correct and is already serving metrics.
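For reference, I imagine the scrape job on your side would look something like this (job name and interval are guesses on my part):

```yaml
scrape_configs:
  - job_name: bridgestrap
    scrape_interval: 1m
    metrics_path: /metrics
    static_configs:
      - targets: ['bridges.torproject.org:6000']
```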
I would also argue that some of those metrics could be simplified as labels. For example, you could have a single bridgestrap_bridges_total metric with a status label, whose scrape output would look something like this:
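```
# hypothetical scrape output; the numbers are made up
bridgestrap_bridges_total{status="functional"} 100
bridgestrap_bridges_total{status="dysfunctional"} 20
```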
Same with the API requests: you could even shove the HTTP status code in there, and then you get much more granularity in your monitoring. You can tell what "invalid" actually means: is it a 503? A 404?
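Something along these lines (again hypothetical):

```
bridgestrap_api_requests_total{endpoint="/bridge-state",code="200"} 1234
bridgestrap_api_requests_total{endpoint="/bridge-state",code="404"} 7
bridgestrap_api_requests_total{endpoint="/bridge-state",code="503"} 2
```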
... and so on. You could synthesize a status code if you never get anything from the other end.
Then you can add percentile-like metrics for all of those much more easily as well, because you hide the cardinality explosion inside the labels instead of in the metric names...
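For instance, a sketch with client_golang (names and buckets are arbitrary):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One histogram, with the breakdown hidden in labels. Prometheus can
// then compute percentiles at query time, e.g.:
//   histogram_quantile(0.9, rate(bridgestrap_api_duration_seconds_bucket[5m]))
var apiDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "bridgestrap_api_duration_seconds",
		Help:    "Time spent answering API requests.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"endpoint", "code"},
)
```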
Anyways: take a look at those docs! They can make your life much easier in the long term. :)
i believe this is complete now, right? you have access to grafana (since #40102 (closed) is basically done) and metrics are pulled into prometheus.
if there's anything else, i think we could have a separate ticket now. :) please note i'll be away for about two weeks so someone else (@hiro probably) will have to deal with anything urgent.