- Tutorial
  - Looking at pretty graphs
- How-to
  - Pager playbook
  - Disaster recovery
  - Migrating from Munin
- Reference
  - Installation
    - Puppet implementation
    - Manual node configuration
  - SLA
  - Design
  - Issues
  - Monitoring and testing
- Discussion
  - Overview
  - Goals
    - Must have
    - Nice to have
    - Non-Goals
  - Approvals required
  - Proposed Solution
  - Cost
  - Alternatives considered
Prometheus is a monitoring system that is designed to process a large number of metrics, centralize them on one (or multiple) servers and serve them with a well-defined API. That API is queried through a domain-specific language (DSL) called "PromQL" or "Prometheus Query Language". Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see howto/Grafana).
Tutorial
Looking at pretty graphs
The Prometheus web interface is available at:
https://prometheus.torproject.org
A simple query you can try is to pick any metric in the list and click `Execute`. For example, this link will show the 5-minute load over the last two weeks for the known servers.
The Prometheus web interface is crude: it's better to use howto/grafana dashboards for most purposes other than debugging.
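If you prefer the command line, the same queries can be run against the HTTP API mentioned above. This is a minimal sketch, assuming the standard node exporter metric `node_load5` is being collected and the API endpoint is reachable without authentication:

```sh
# instant vector: the current 5-minute load average of every monitored host
curl -sG 'https://prometheus.torproject.org/api/v1/query' \
  --data-urlencode 'query=node_load5'

# the 5-minute load averaged over the last two weeks, for a single host
curl -sG 'https://prometheus.torproject.org/api/v1/query' \
  --data-urlencode 'query=avg_over_time(node_load5{alias="alberti.torproject.org"}[2w])'
```

The same PromQL expressions can be pasted into the query box of the web interface and run with `Execute`.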
How-to
Pager playbook
TBD.
Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely rebuildable from Puppet. Non-configuration data should be restored from backup, with `/var/lib/prometheus/` being sufficient to reconstruct history. If even the backups are destroyed, history will be lost, but the server should still recover and start tracking new metrics.
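A rough recovery sketch, assuming the machine has already been rebuilt from Puppet and the backup system can restore files to their original location (the service and user names are the Debian package defaults; the actual restore step depends on the backup tooling):

```sh
# stop the daemon so the time series database is not written to during the restore
systemctl stop prometheus

# restore /var/lib/prometheus/ from the most recent backup (tool-specific step),
# then make sure the restored data is owned by the prometheus user
chown -R prometheus:prometheus /var/lib/prometheus

# start the daemon again; it replays its write-ahead log and resumes scraping
systemctl start prometheus
```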
Migrating from Munin
Here's a quick cheat sheet from people used to Munin and switching to Prometheus:
| What | Munin | Prometheus |
|------|-------|------------|
| Scraper | munin-update | prometheus |
| Agent | munin-node | prometheus node-exporter and others |
| Graphing | munin-graph | prometheus or grafana |
| Alerting | munin-limits | prometheus alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, text-based |
| Storage format | RRD | custom TSDB |
| Downsampling | yes | no |
| Default interval | 5 minutes | 15 seconds |
| Authentication | no | no |
| Federation | no | yes (can fetch from other servers) |
| High availability | no | yes (alertmanager gossip protocol) |
Basically, Prometheus is similar to Munin in many ways:
- it "pulls" metrics from the nodes, although it does so over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin
- the agent running on the nodes is called `prometheus-node-exporter` instead of `munin-node`. It scrapes only a set of built-in parameters like CPU, disk space and so on; different exporters are necessary for different applications (like `prometheus-apache-exporter`) and any application can easily implement an exporter by exposing a Prometheus-compatible `/metrics` endpoint
- like Munin, the node exporter doesn't have any form of authentication built in; we rely on IP-level firewalls to avoid leakage
- the central server is simply called `prometheus` and runs as a daemon that wakes up on its own, instead of `munin-update`, which is called from `munin-cron` and, before that, `cron`
- graphs are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by `munin-graph`
- samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad hoc) RRD standard
- Prometheus performs no downsampling like RRD does; it relies on smart compression to spare disk space, but still uses more than Munin
- Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable
- Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers into a central one with a different sampling frequency) natively, while `munin-update` and `munin-graph` can only run on a single (and the same) server; a federation sketch is shown after this list
- Prometheus can act as a high availability alerting system thanks to its `alertmanager`, which can run multiple copies in parallel without sending duplicate alerts, while `munin-limits` can only run on a single server
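To illustrate the federation point above, here is roughly what a scrape job on an aggregating central server could look like. This is only a sketch: the job name, interval and target port are assumptions for the example, not taken from our actual configuration.

```yaml
# hypothetical federation job on a central Prometheus server, pulling an
# aggregated subset of metrics from another Prometheus at a lower frequency
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 5m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
      - targets:
          - 'prometheus2.torproject.org:9090'
```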
Reference
Installation
Puppet implementation
Every node is configured with the node exporter through the `roles::monitored` role that is included everywhere. The role might eventually be expanded to cover alerting and other monitoring resources as well. This role, in turn, includes the `profile::prometheus::client` profile, which configures each client correctly, with the right firewall rules.
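In practice, monitoring a TPA machine therefore boils down to the node including that role somewhere in its classification; a minimal, purely illustrative Puppet sketch:

```puppet
# on any TPA node: pulls in profile::prometheus::client, which installs the
# node exporter and exports the matching firewall rule to the Prometheus server
include roles::monitored
```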
The firewall rules are exported from the server, defined in `profile::prometheus::server`. We hacked around limitations of the upstream Puppet module to install Prometheus using backported Debian packages. The monitoring server itself is defined in `roles::monitoring`.
The Prometheus Puppet module was patched to allow scrape job collection and use of Debian packages for installation. Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.
Manual node configuration
External services can be monitored by Prometheus as long as they comply with the OpenMetrics protocol, which simply means exposing metrics like this over HTTP:
metric{label=label_val} value
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node alberti has the device `/dev/sda1` mounted on `/`, formatted as an `ext4` filesystem, which has 16160059392 bytes (~16GB) free.
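To see what an exporter actually exposes, you can fetch its metrics endpoint directly. For example, assuming the firewall lets you reach the node exporter on its default port:

```sh
# dump the raw metrics page and look at the metric used in the example above
curl -s http://alberti.torproject.org:9100/metrics | grep '^node_filesystem_avail_bytes'
```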
System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:
- On Debian buster and later: `apt install prometheus-node-exporter`
- On Debian stretch: `apt install -t stretch-backports prometheus-node-exporter` ... assuming that backports is already configured. If it isn't, such a line in `/etc/apt/sources.list.d/backports.debian.org.list` should suffice: `deb https://deb.debian.org/debian/ stretch-backports main contrib non-free` ... followed by an `apt update`, naturally.
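Once the package is installed, the exporter should be listening on port 9100; a quick way to confirm it responds locally:

```sh
# the first few lines of the metrics page, fetched from the machine itself
curl -s http://localhost:9100/metrics | head
```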
The firewall on the machine needs to allow traffic on the exporter port from the server `prometheus2.torproject.org`. Then open a ticket for TPA to configure the target. Make sure to mention:
- the hostname for the exporter
- the port of the exporter (varies according to the exporter, 9100 for the node exporter)
- how often to scrape the target, if non-default (default: 15s)
Then TPA needs to hook those targets up as part of a new `node` job in the `scrape_configs` section of `prometheus.yml`, managed from Puppet in `profile::prometheus::server`.
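The result looks roughly like the following excerpt. This is an illustrative sketch only: the actual job definitions and the file_sd path are generated by Puppet and may differ.

```yaml
# excerpt of a possible prometheus.yml scrape job (illustrative, not the real file)
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s
    file_sd_configs:
      - files:
          # hypothetical path; the real one is managed by the Puppet module
          - '/etc/prometheus/file_sd/node_*.yaml'
```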
SLA
Prometheus is currently not doing alerting, so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time, so that we can do proper long-term resource planning.
Design
Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:
As you can see, Prometheus is somewhat tailored towards Kubernetes, but it can be used without it. We're deploying it with the `file_sd` discovery mechanism, where Puppet collects all exporters into the central server, which then scrapes those exporters every `scrape_interval` (by default 15 seconds). The architecture graph also shows the Alertmanager, which could be used to (eventually) replace our Nagios deployment.
The diagram does not show that Prometheus can federate to multiple instances, nor that the Alertmanager can be configured for high availability.
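For reference, the files consumed by the `file_sd` mechanism are simple lists of targets and labels, one entry per exported node. A sketch of what such a file could contain (the exact layout is up to the Puppet module):

```yaml
# one scrape target exported by a node's profile::prometheus::client (illustrative)
- targets:
    - 'alberti.torproject.org:9100'
  labels:
    alias: 'alberti.torproject.org'
```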
Issues
There is no issue tracker specifically for this project; file or search for issues in the team issue tracker component.
Monitoring and testing
Prometheus doesn't have specific tests, but there is a test suite in the upstream prometheus Puppet module.
The server is monitored by Nagios for basic system-level metrics. It also monitors itself, both for system-level metrics and for application-specific metrics.
Discussion
Overview
The prometheus and howto/grafana services were set up after anarcat realized that there was no "trending" service running inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-March 2019 (ticket 29683) and the remaining traces of Munin were removed in early April 2019 (ticket 29682).
Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244), with the hope that it could eventually be expanded further with a downsampling server.
Eventually, a second Prometheus/Grafana server was set up to monitor external resources (ticket 31159), because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns within the metrics team about exposing those metrics publicly.
It was originally thought Prometheus could completely replace howto/nagios as well (ticket 29864), but this turned out to be more difficult than planned. The main difficulty is that Nagios checks come with built-in thresholds of acceptable performance, whereas Prometheus metrics are just that: metrics, without thresholds. This makes it more difficult to replace Nagios, because a ton of alerts need to be written to replace the existing checks. A lot of functionality built into Nagios, like availability reports and acknowledgements, would need to be reimplemented as well.
Goals
This section didn't exist when the project was launched, so this is really just second-guessing...
Must have
- Munin replacement: long-term trending metrics to predict resource allocation, with graphing
- free software, self-hosted
- Puppet automation
Nice to have
- possibility of eventual Nagios phase-out (ticket 29864)
Non-Goals
- 1 year data retention
Approvals required
The primary Prometheus server was decided on at the Brussels 2019 dev meeting, before anarcat joined the team (ticket 29389). The secondary Prometheus server was approved in meeting/2019-04-08. The storage expansion was approved in meeting/2019-11-25.
Proposed Solution
Prometheus was chosen; see also howto/grafana.
Cost
N/A.
Alternatives considered
No alternatives research was performed, as far as we know.