[Prometheus][] is a monitoring system designed to process a large
number of metrics, centralize them on one (or multiple) servers and
serve them with a well-defined API. That API is queried through a
domain-specific language (DSL) called "PromQL" or "Prometheus Query
Language". Prometheus also supports basic graphing capabilities,
although those are limited enough that we use a separate graphing
layer on top (see [howto/Grafana](howto/Grafana)).

[Prometheus]: https://prometheus.io/

[[_TOC_]]

# Tutorial

## Looking at pretty graphs

The Prometheus web interface is available at:

<https://prometheus.torproject.org>

A simple query you can try is to pick any metric in the list and click
`Execute`. For example, [this link][] will show the 5-minute load
over the last two weeks for the known servers.

[this link]: https://prometheus1.torproject.org/graph?g0.range_input=2w&g0.expr=node_load5&g0.tab=0

The Prometheus web interface is crude: it's better to use
[howto/grafana](howto/grafana) dashboards for most purposes other
than debugging.
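
The same data can also be fetched programmatically through the
standard Prometheus HTTP API. A minimal sketch, assuming the server is
reachable from where you run it (it may be firewalled or require
credentials):

    # evaluate a PromQL expression through the HTTP API; this is an
    # "instant" query, returning one current sample per machine as JSON
    curl -sG 'https://prometheus1.torproject.org/api/v1/query' \
      --data-urlencode 'query=node_load5'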

# How-to

## Pager playbook

TBD.

## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new
metrics.
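
A minimal sketch of such a restore, assuming the backup system has
already placed a copy of the data directory under `/var/tmp/restore`
(that staging path is hypothetical, adjust to taste):

    # stop the scraper so the TSDB is not written to during the restore
    systemctl stop prometheus
    # put the historical data back in place and fix ownership
    rsync -a /var/tmp/restore/var/lib/prometheus/ /var/lib/prometheus/
    chown -R prometheus:prometheus /var/lib/prometheus
    systemctl start prometheus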

## Migrating from Munin

Here's a quick cheat sheet for people used to Munin and switching to
Prometheus:

| What | Munin | Prometheus |
| --- | ----- | ---------- |
| Scraper | munin-update | prometheus |
| Agent | munin-node | prometheus node-exporter and others |
| Graphing | munin-graph | prometheus or grafana |
| Alerting | munin-limits | prometheus alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, [text-based][] |
| Storage format | RRD | custom TSDB |
| Downsampling | yes | no |
| Default interval | 5 minutes | 15 seconds |
| Authentication | no | no |
| Federation | no | yes (can fetch from other servers) |
| High availability | no | yes (alertmanager gossip protocol) |

[text-based]: https://prometheus.io/docs/instrumenting/exposition_formats/

Basically, Prometheus is similar to Munin in many ways:

* it "pulls" metrics from the nodes, although it does so over HTTP
  (to <http://host:9100/metrics>) instead of a custom TCP protocol
  like Munin (see the example after this list)

* the agent running on the nodes is called `prometheus-node-exporter`
  instead of `munin-node`. It scrapes only a set of built-in
  parameters like CPU, disk space and so on; different exporters are
  necessary for different applications (like
  `prometheus-apache-exporter`) and any application can easily
  implement an exporter by exposing a Prometheus-compatible
  `/metrics` endpoint

* like Munin, the node exporter doesn't have any form of
  authentication built in. We rely on IP-level firewalls to avoid
  leakage

* the central server is simply called `prometheus` and runs as a
  daemon that wakes up on its own, instead of `munin-update`, which
  is called from `munin-cron` and, before that, `cron`

* graphs are generated on the fly through the crude Prometheus web
  interface or by frontends like Grafana, instead of being constantly
  regenerated by `munin-graph`

* samples are stored in a custom "time series database" (TSDB) in
  Prometheus instead of the (ad hoc) RRD standard

* Prometheus performs *no* downsampling, unlike RRD: it relies on
  smart compression to spare disk space, but still uses more than
  Munin

* Prometheus scrapes samples much more aggressively than Munin by
  default, but that interval is configurable

* Prometheus can scale horizontally (by sharding different services
  to different servers) and vertically (by aggregating different
  servers to a central one with a different sampling frequency)
  natively - `munin-update` and `munin-graph` can only run on a
  single (and the same) server

* Prometheus can act as a high-availability alerting system thanks
  to its `alertmanager`, which can run multiple copies in parallel
  without sending duplicate alerts - `munin-limits` can only run on
  a single server
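
To see what the server fetches on every scrape, you can query an
exporter by hand. A sketch, assuming a node exporter is running
locally on the default port:

    # dump the raw metrics the server would scrape, keeping only the
    # load averages as a readable sample
    curl -s http://localhost:9100/metrics | grep '^node_load'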

# Reference

## Installation

### Puppet implementation

Every node is configured as a `node-exporter` through the
`roles::monitored` role that is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes the
`profile::prometheus::client` profile, which configures each client
correctly with the right firewall rules.

The firewall rules are exported from the server, defined in
`profile::prometheus::server`. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
`roles::monitoring`.

The [Prometheus Puppet module][] was patched to [allow scrape job
collection][] and [use of Debian packages for installation][]. Much of
the initial Prometheus configuration was also documented in
[ticket 29681][] and especially [ticket 29388][], which investigates
storage requirements and possible alternatives for data retention
policies.

[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
[Prometheus Puppet module]: https://github.com/voxpupuli/puppet-prometheus/

### Manual node configuration

External services can be monitored by Prometheus, as long as they
comply with the [OpenMetrics][] protocol, which simply means exposing
metrics like this over HTTP:

    metric{label=label_val} value

A real-life (simplified) example:

    node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node alberti has the device `/dev/sda1`
mounted on `/`, formatted as an `ext4` filesystem which has
16160059392 bytes (~16GB) free.

[OpenMetrics]: https://openmetrics.io/
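
To check that an endpoint complies with the format, one approach is to
lint its output with `promtool`, which ships with the `prometheus`
package. A sketch, assuming an exporter listening on port 9100:

    # fetch the exposed metrics and lint them against the exposition format
    curl -s http://localhost:9100/metrics | promtool check metrics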

System-level metrics can easily be monitored by the secondary
Prometheus server. This is usually done by installing the "node
exporter", with the following steps:

* On Debian buster and later:

      apt install prometheus-node-exporter

* On Debian stretch:

      apt install -t stretch-backports prometheus-node-exporter

  ... assuming that backports is already configured. If it isn't, a
  line like this in
  `/etc/apt/sources.list.d/backports.debian.org.list` should suffice:

      deb https://deb.debian.org/debian/ stretch-backports main contrib non-free

  ... followed by an `apt update`, naturally.

The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:

* the hostname for the exporter
* the port of the exporter (varies according to the exporter, 9100
  for the node exporter)
* how often to scrape the target, if non-default (default: 15s)

Then TPA needs to hook those up as a new node `job` in the
`scrape_configs` section of `prometheus.yml`, managed from Puppet in
`profile::prometheus::server`.
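
For illustration, here is roughly what such a job stanza looks like in
`prometheus.yml`. This is a hand-written sketch with a made-up job
name and target; the real file is templated by Puppet and should not
be edited by hand:

    scrape_configs:
      - job_name: 'node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - 'example.torproject.org:9100'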

## SLA

Prometheus is currently not doing alerting, so it doesn't have any
sort of guaranteed availability. It should, hopefully, not lose too
many metrics over time, so we can do proper long-term resource
planning.

## Design

Here is, from the [Prometheus overview documentation][], the
basic architecture of a Prometheus site:

[Prometheus overview documentation]: https://prometheus.io/docs/introduction/overview/

<img src="https://prometheus.io/assets/architecture.png" alt="A
drawing of Prometheus' architecture, showing the push gateway and
exporters adding metrics, service discovery through file_sd and
Kubernetes, alerts pushed to the Alertmanager and the various UIs
pulling from Prometheus" />

As you can see, Prometheus is somewhat tailored towards
[Kubernetes][], but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager, which could be used to (eventually) replace
our Nagios deployment.
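
Concretely, `file_sd` simply means the server reads its scrape targets
from files on disk, which Puppet generates from the exported
resources. A hypothetical sketch of such a targets file (the path and
hostnames are made up for illustration):

    # a hypothetical /etc/prometheus/file_sd/node.yml (Puppet writes the real ones)
    - targets:
        - 'alberti.torproject.org:9100'
        - 'example.torproject.org:9100'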

[Kubernetes]: https://kubernetes.io/

The diagram does not show that Prometheus can federate to multiple
instances and that the Alertmanager can be configured for high
availability.

## Issues

There is no issue tracker specifically for this project. [File][new-ticket] or
[search][] for issues in the [generic internal services][search] component.

[new-ticket]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team

## Monitoring and testing

Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.

The server is monitored for basic system-level metrics by Nagios.
Prometheus also monitors itself, both for system-level and
application-specific metrics.

# Discussion

## Overview

The Prometheus and [howto/grafana](howto/grafana) services were set up
after anarcat realized that there was no "trending" service running
inside TPA after Munin had died ([ticket 29681][]). The "node
exporter" was deployed on all TPA hosts in mid-March 2019
([ticket 29683][]) and remaining traces of Munin were removed in early
April 2019 ([ticket 29682][]).

[ticket 29683]: https://bugs.torproject.org/29683
[ticket 29682]: https://bugs.torproject.org/29682

Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope that this
could eventually be expanded further with a downsampling server in the
future.

[ticket 31244]: https://bugs.torproject.org/31244
[ticket 29388]: https://bugs.torproject.org/29388

Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.

[ticket 31159]: https://bugs.torproject.org/31159

It was originally thought Prometheus could completely replace
[howto/nagios](howto/nagios) as well ([ticket 29864][]), but this
turned out to be more difficult than planned. The main difficulty is
that Nagios checks come with built-in thresholds of acceptable
performance, while Prometheus metrics are just that: metrics, without
thresholds. This makes it harder to replace Nagios, because a ton of
alerts need to be written to replace the existing checks. A lot of
functionality built into Nagios, like availability reports and
acknowledgements, would need to be reimplemented as well.

## Goals

This section didn't exist when the project was launched, so this is
really just second-guessing...

### Must have

* Munin replacement: long-term trending metrics to predict resource
  allocation, with graphing
* free software, self-hosted
* Puppet automation

### Nice to have

* possibility of eventual Nagios phase-out ([ticket 29864][])

[ticket 29864]: https://bugs.torproject.org/29864

### Non-Goals

* more than one year of data retention

## Approvals required

The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket 29389][]). The
secondary Prometheus server was approved in
[meeting/2019-04-08](meeting/2019-04-08). Storage expansion was
approved in [meeting/2019-11-25](meeting/2019-11-25).

[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389

## Proposed Solution

Prometheus was chosen, see also [howto/grafana](howto/grafana).

## Cost

N/A.

## Alternatives considered

No alternatives research was performed, as far as we know.