Prometheus
==========
Prometheus is a monitoring system that is designed to process a large
number of metrics, centralize them on one (or multiple) servers and
serve them with a well-defined API. That API is queried through a
domain-specific language (DSL) called "PromQL" or "Prometheus Query
Language". Prometheus also supports basic graphing capabilities
although those are limited enough that we use a separate graphing
layer on top (Grafana).
The Prometheus web interface is available at:
<https://prometheus.torproject.org>
A simple query you can try is to pick any metric in the list and click
`Execute`. For example, [this link](https://prometheus1.torproject.org/graph?g0.range_input=2w&g0.expr=node_load5&g0.tab=0) will show the 5-minute load
over the last two weeks for the known servers.
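The query behind that link is just the bare metric name. As a
minimal PromQL sketch (the `instance` label value below is a made-up
example, not a real host):

```
# 5-minute load average for every scraped machine, as in the link above:
node_load5

# The same metric for a single (hypothetical) host, averaged over one hour:
avg_over_time(node_load5{instance="example.torproject.org:9100"}[1h])
```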
All machines configured through Puppet are scraped by the central
server every 15 seconds.
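On the server side, that scrape interval corresponds to a
`scrape_configs` entry roughly like the following (a minimal sketch
with a hypothetical job name and target; the actual configuration is
generated by Puppet):

```yaml
scrape_configs:
  - job_name: node            # hypothetical job name
    scrape_interval: 15s      # the interval mentioned above
    static_configs:
      - targets: ['example.torproject.org:9100']  # hypothetical target
```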
Munin expatriates
-----------------
Here's a quick cheat sheet from people used to Munin and switching to
Prometheus:
| What | Munin | Prometheus |
| --- | ----- | ---------- |
| Scraper | munin-update | prometheus |
| Agent | munin-node | prometheus node-exporter and others |
| Graphing | munin-graph | prometheus or grafana |
| Alerting | munin-limits | prometheus alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, [text-based][] |
| Storage format | RRD | custom TSDB |
| Downsampling | yes | no |
| Default interval | 5 minutes | 15 seconds |
| Authentication | no | no |
| Federation | no | yes (can fetch from other servers) |
| High availability | no | yes (alert-manager gossip protocol) |
[text-based]: https://prometheus.io/docs/instrumenting/exposition_formats/
Basically, Prometheus is similar to Munin in many ways:
* it "pulls" metrics from the nodes, although it does it over HTTP
(to <http://host:9100/metrics>) instead of a custom TCP protocol
like Munin
* the agent running on the nodes is called `prometheus-node-exporter`
  instead of `munin-node`. It exposes only a set of built-in
  metrics like CPU and disk space; different exporters are
  necessary for different applications (like
  `prometheus-apache-exporter`), and any application can easily
  implement an exporter by exposing a Prometheus-compatible
  `/metrics` endpoint (see the sample output after this list)
* like Munin, the node exporter doesn't have any form of
  authentication built in. We rely on IP-level firewalls to avoid
  leakage
* the central server is simply called `prometheus` and runs as a
  daemon that wakes up on its own, instead of `munin-update`, which
  is run by `munin-cron`, itself triggered by `cron`
* graphs are generated on the fly through the crude Prometheus web
  interface or by frontends like Grafana, instead of being constantly
  regenerated by `munin-graph`
* samples are stored in a custom "time series database" (TSDB) in
Prometheus instead of the (ad-hoc) RRD standard
* unlike RRD, Prometheus performs *no* downsampling; it relies on
  smart compression to spare disk space, but still uses more of it
  than Munin
* Prometheus scrapes samples much more aggressively than Munin by
default, but that interval is configurable
* Prometheus can scale horizontally (by sharding different services
  to different servers) and vertically (by aggregating different
  servers into a central one with a different sampling frequency)
  natively - `munin-update` and `munin-graph` can only run on a
  single (and the same) server
* Prometheus can act as a high-availability alerting system thanks
  to its `alertmanager`, which can run multiple copies in parallel
  without sending duplicate alerts - `munin-limits` can only run on
  a single server
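For reference, the [text-based][] format served on the `/metrics`
endpoint mentioned above looks like this (a hand-written sample, not
output captured from one of our hosts):

```
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.21
```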
Puppet implementation
---------------------
Every node is configured as a `node-exporter` through the
`roles::monitored` role, which is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes
`profile::prometheus::client`, which configures each client correctly,
with the right firewall rules.
The firewall rules are exported from the server, defined in
`profile::prometheus::server`. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
`roles::monitoring`.
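As a rough sketch, the layering described above amounts to something
like the following (illustrative only; the actual class bodies live
in our Puppet source tree):

```puppet
# Included on every machine: pulls in the node exporter and firewall rules.
class roles::monitored {
  include profile::prometheus::client
}

# Included only on the monitoring server itself.
class roles::monitoring {
  include profile::prometheus::server
}
```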