- Tutorial
- Looking at pretty graphs
- How-to
- Adding metrics for users
- Adding targets on the external server
- Adding targets on the internal server
- Automatic targets on the internal server
- Web dashboard access
- Alerting
- Adding alerts in Puppet
- Adding alerts through Git, on the external server
- Adding alert recipients
- Testing alerts
- Managing alerts with amtool
- Migrating from Munin
- Push metrics to the Pushgateway
- Pager playbook
- Troubleshooting missing metrics
- Pushgateway errors
- Disaster recovery
- Reference
- Installation
- Puppet implementation
- Pushgateway
- Alertmanager
- Manual node configuration
- Monitored services
- Internal server (prometheus1)
- External server (prometheus2)
- Other possible services to monitor
- SLA
- Design
- Pushgateway
- Alertmanager
- Issues
- Maintainer, users, and upstream
- Monitoring and testing
- Logs and metrics
- Long term metrics storage
- Backups
- Other documentation
- Discussion
- Overview
- Goals
- Must have
- Nice to have
- Non-Goals
- Approvals required
- Proposed Solution
- Cost
- Alternatives considered
Prometheus is a monitoring system that is designed to process a large number of metrics, centralize them on one (or multiple) servers and serve them with a well-defined API. That API is queried through a domain-specific language (DSL) called "PromQL" or "Prometheus Query Language". Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see howto/Grafana).
Tutorial
Looking at pretty graphs
The Prometheus web interface is available at:
https://prometheus.torproject.org
A simple query you can try is to pick any metric in the list and click "Execute". For example, this link will show the 5-minute load over the last two weeks for the known servers.
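If you prefer the command line, the same kind of query can be run against the standard Prometheus HTTP API. A minimal sketch, assuming you can authenticate with the public credentials mentioned under "Web dashboard access" below (node_load5 is the node exporter metric behind that load graph):

    curl -sG https://prometheus.torproject.org/api/v1/query \
        --data-urlencode 'query=node_load5'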
The Prometheus web interface is crude: it's better to use howto/grafana dashboards for most purposes other than debugging.
How-to
Adding metrics for users
If you want your service to be monitored by Prometheus, you need to reuse an existing exporter or write your own. Writing an exporter is more involved than reusing one, but still fairly easy, and might be necessary if you maintain an application that is not already instrumented for Prometheus.
The actual documentation is fairly good, but basically: a Prometheus exporter is a simple HTTP server that responds at a specific URL (/metrics, by convention, but it can be anything) with a key/value list of entries, one on each line. Each "key" is a simple string with an arbitrary list of "labels" enclosed in curly braces. For example, here's how the "node exporter" exports CPU usage:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74
You don't necessarily have to write all that logic yourself, however: there are client libraries (see the Golang guide, Python demo or C documentation for examples) that do most of the job for you.
In any case, you should be careful about the names and labels of the metrics. See the metric and label naming best practices.
Once you have an exporter endpoint (say at http://example.com:9090/metrics), make sure it works:
curl http://example.com:9090/metrics
This should return a number of metrics that change (or not) at each call.
From there on, provide that endpoint to the sysadmins (or someone with access to the external monitoring server), who will follow the procedure below to add the metrics to Prometheus.
Once the exporter is hooked into Prometheus, you can browse the metrics directly at: https://prometheus.torproject.org. Graphs should be available at https://grafana.torproject.org, although those need to be created and committed into git by sysadmins to persist, see the anarcat dashboard directory for more information.
Adding targets on the external server
Alerts and scrape targets on the external server are managed through a Git repository called prometheus-alerts. To add a scrape target:
- clone the repository:

      git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
      cd prometheus-alerts

- assuming you're adding a node exporter, add the target:

      cat > targets.d/node_myproject.yaml <<EOF
      # scrape the external node exporters for project Foo
      ---
      - targets:
        - targetone.example.com
        - targettwo.example.com
      EOF

- add, commit, and push:

      git checkout -b myproject
      git add targets.d
      git commit -m"add node exporter targets for my project"
      git push origin -u myproject
The last push command should show you the URL where you can submit your merge request.
After being merged, the changes should propagate within 4 to 6 hours.
See also the targets.d documentation in the git repository.
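Before pushing, it can be worth making sure the new target file at least parses as YAML. A quick local sanity check (a sketch, assuming python3 with the yaml module is available; the file name matches the hypothetical example above):

    python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1]))' targets.d/node_myproject.yaml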
Adding targets on the internal server
Normally, services configured in Puppet SHOULD automatically be scraped by Prometheus (see below). If, however, you need to manually configure a service, you may define extra jobs in the $scrape_configs array, in the profile::prometheus::server::internal Puppet class.
For example, because the GitLab Prometheus setup is not managed by Puppet (tpo/tpa/gitlab#20), we cannot use this automatic setup, so manual scrape targets are defined like this:
$scrape_configs =
[
{
'job_name' => 'gitaly',
'static_configs' => [
{
'targets' => [
'gitlab-02.torproject.org:9236',
],
'labels' => {
'alias' => 'Gitaly-Exporter',
},
},
],
},
[...]
]
But ideally those would be configured with automatic targets, below.
Automatic targets on the internal server
Metrics for the internal server are scraped automatically as long as the exporter is configured through the puppet-prometheus module; the only manual step on our side is opening a firewall port in our configuration.
Take the apache_exporter as an example: in profile::prometheus::apache_exporter, we include the prometheus::apache_exporter class from the upstream Puppet module, then open the port to the Prometheus server on the exporter, with:
Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>
Those rules are declared on the server, in profile::prometheus::server::internal.
Web dashboard access
The main web dashboard for the internal Prometheus server should be accessible at https://prometheus.torproject.org using the well-known, public username.
The dashboard for the external Prometheus server, however, is not publicly available. To work around that restriction, use the following command line to forward the relevant ports over SSH:
ssh -L 9090:localhost:9090 -L 9091:localhost:9091 -L 9093:localhost:9093 prometheus2.torproject.org
The above will also forward the management interfaces of the Alertmanager (port 9093) and Pushgateway (9091).
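Once the tunnel is up, a quick way to confirm you are reaching the right daemons is to hit their health endpoints, for example (a sketch; /-/healthy is part of the stock Prometheus and Alertmanager HTTP APIs):

    curl -s http://localhost:9090/-/healthy   # Prometheus
    curl -s http://localhost:9093/-/healthy   # Alertmanager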
Alerting
We currently do not do alerting for TPA services with Prometheus. We do, however, have the Alertmanager set up to do alerting for other teams on the secondary Prometheus server (prometheus2). This documentation details how that works, but could also eventually cover the main server if it replaces Nagios for alerting (ticket 29864).
In general, the upstream documentation for alerting starts from the Alerting Overview but I have found it to be lacking at times. I have instead been following this tutorial which was quite helpful.
Adding alerts in Puppet
The Alertmanager can be managed through Puppet (although it currently isn't, on the external server), in profile::prometheus::server::external.
An alerting rule, in Puppet, is defined like:
{
'name' => 'bridgestrap',
'rules' => [
'alert' => 'Bridges down',
'expr' => 'bridgestrap_fraction_functional < 0.50',
'for' => '5m',
'labels' =>
{
'severity' => 'critical',
'team' => 'anti-censorship',
},
'annotations' =>
{
'title' => 'Bridges down',
'description' => 'Too many bridges down',
# use humanizePercentage when upgrading to prom > 2.11
'summary' => 'Number of functional bridges is `{{$value}}%`',
'host' => '{{$labels.instance}}',
},
],
},
Note that we might want to move those to Hiera so that we could use YAML code directly, which would better match the syntax of the actual alerting rules.
Adding alerts through Git, on the external server
The external server regularly pulls a git repository for alerting rules and scrape targets. Alerts can be added through that repository by adding a file in the rules.d directory; see the rules.d documentation for more information on that.
Note that alerts (probably?) do not take effect until a sysadmin reloads Prometheus.
TODO: confirm how rules are deployed.
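As an illustration, here is a hypothetical rule file along with a local syntax check using promtool (a sketch: the file and group names are made up, and the exact conventions expected in the repository should be confirmed against the rules.d documentation; promtool ships with the prometheus Debian package):

    # hypothetical example file, rules.d/example_team.yaml
    cat > rules.d/example_team.yaml <<'EOF'
    groups:
      - name: example_team
        rules:
          - alert: ServiceDown
            expr: up{job="example_job"} == 0
            for: 5m
            labels:
              severity: warning
              team: example-team
            annotations:
              summary: '{{ $labels.instance }} has been unreachable for 5 minutes'
    EOF
    # validate the rule syntax before committing
    promtool check rules rules.d/example_team.yaml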
Adding alert recipients
To add a new recipient for alerts, look for the receivers setting and add something like this:
receivers => [
{
'name' => 'anti-censorship team',
'email_configs' => [
'to' => 'anti-censorship-alerts@lists.torproject.org',
# see above
'require_tls' => false,
],
},
# [...]
Then alerts can be routed to that receiver by adding a "route" in the routes setting. For example, this will route alerts with the team: anti-censorship label:
routes => [
{
'receiver' => 'anti-censorship team',
'match' => {
'team' => 'anti-censorship',
},
},
],
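Recent versions of amtool can show how a given label set would be routed, which is handy to check a change like the above before deploying it. A sketch, assuming the Debian default configuration path on the Alertmanager host:

    amtool config routes test --config.file=/etc/prometheus/alertmanager.yml team=anti-censorship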
Testing alerts
Normally, alerts should fire on the Prometheus server and be sent out to the Alertmanager server, if the latter is correctly configured (i.e. if it's configured in prometheus.yml, in the alerting section, see Installation below).
If you're not sure alerts are working, head to the web dashboard (see the access instructions) and look at the /alerts and /rules pages. For example, if you're using port forwarding:
- http://localhost:9090/alerts - should show the configured alerts, and whether they are firing
- http://localhost:9090/rules - should show the configured rules, and whether they match
Typically, the http://localhost:9093 URL should also be useful to manage the Alertmanager, but in practice the Debian package does not ship the web interface, so it is of limited use in that regard. See the amtool section below for more information.
Note that the /targets URL is also useful to diagnose problems with exporters in general; see also the troubleshooting section below.
If you can't access the dashboard at all or if the above seems too complicated, Grafana can be jury-rigged as a debugging tool for metrics as well. In the "Explore" panels, you can input Prometheus metrics, with auto-completion, and inspect the output directly.
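Another way to check the notification pipeline end to end, without waiting for a real alert to fire, is to inject a synthetic alert straight into the Alertmanager API. A sketch, assuming the port 9093 forward described above (the alert name and team label here are made up):

    curl -XPOST http://localhost:9093/api/v2/alerts \
        -H 'Content-Type: application/json' \
        -d '[{"labels": {"alertname": "ManualTestAlert", "team": "example-team", "severity": "warning"},
              "annotations": {"summary": "manual test, please ignore"}}]'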
Managing alerts with amtool
Since the Alertmanager web UI is not available in Debian, you need to use the amtool command. A few useful commands:
- amtool alert: show firing alerts
- amtool silence add --duration=1h --author=anarcat --comment="working on it" ALERTNAME: silence alert ALERTNAME for an hour, with some comments
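A few other amtool invocations that come in handy (sketches; the silence ID comes from the output of the query command, and the configuration path is assumed to be the Debian default):

    amtool silence query                                  # list active silences
    amtool silence expire SILENCE_ID                      # lift a silence early
    amtool check-config /etc/prometheus/alertmanager.yml  # validate the configuration file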
Migrating from Munin
Here's a quick cheat sheet from people used to Munin and switching to Prometheus:
What | Munin | Prometheus |
---|---|---|
Scraper | munin-update | prometheus |
Agent | munin-node | prometheus node-exporter and others |
Graphing | munin-graph | prometheus or grafana |
Alerting | munin-limits | prometheus alertmanager |
Network port | 4949 | 9100 and others |
Protocol | TCP, text-based | HTTP, text-based |
Storage format | RRD | custom TSDB |
Downsampling | yes | no |
Default interval | 5 minutes | 15 seconds |
Authentication | no | no |
Federation | no | yes (can fetch from other servers) |
High availability | no | yes (alert-manager gossip protocol) |
Basically, Prometheus is similar to Munin in many ways:
- it "pulls" metrics from the nodes, although it does so over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin
- the agent running on the nodes is called prometheus-node-exporter instead of munin-node. It scrapes only a set of built-in parameters like CPU, disk space and so on; different exporters are necessary for different applications (like prometheus-apache-exporter), and any application can easily implement an exporter by exposing a Prometheus-compatible /metrics endpoint
- like Munin, the node exporter doesn't have any form of authentication built in; we rely on IP-level firewalls to avoid leakage
- the central server is simply called prometheus and runs as a daemon that wakes up on its own, instead of munin-update, which is called from munin-cron and, before that, cron
- graphs are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by munin-graph
- samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad-hoc) RRD standard
- unlike RRD, Prometheus performs no downsampling and relies on compression to spare disk space, but it still uses more disk than Munin
- Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable
- Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers to a central one with a different sampling frequency) natively; munin-update and munin-graph can only run on a single (and same) server
- Prometheus can act as a high availability alerting system thanks to its alertmanager, which can run multiple copies in parallel without sending duplicate alerts; munin-limits can only run on a single server
Push metrics to the Pushgateway
The Pushgateway is set up on the secondary Prometheus server (prometheus2). Note that you might not need to use the Pushgateway; see the article about pushing metrics before going down this route.
The Pushgateway is fairly particular: it listens on port 9091 and gets data through a simple, curl-friendly commandline API. We have found that, once installed, this command just "does the right thing", more or less:
echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest
To confirm the data was ingested by the Pushgateway:
curl localhost:9091/metrics | head
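If you pushed test data and want to get rid of it, the same URL scheme accepts a DELETE to drop a whole job/instance group (reusing the hypothetical jobtest/instancetest names from above):

    curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest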
The Pushgateway is scraped every minute, like other Prometheus jobs, with metrics kept for a year at the time of writing. This is configured, inside Puppet, in profile::prometheus::server::external.
Note that it's not possible to push timestamps into the Pushgateway, so it's not useful to ingest past historical data.
Pager playbook
TBD.
Troubleshooting missing metrics
If metrics do not correctly show up in Grafana, it might be worth checking in the Prometheus dashboard itself for the same metrics. Typically, if they do not show up in Grafana, they won't show up in Prometheus either, but it's worth a try, even if only to see the raw data.
Then, if data truly isn't present in Prometheus, you can track down the "target" (the exporter) responsible for it in the /targets listing. If the target is "unhealthy", it will be marked in red and an error message will show up.
If the target is marked healthy, the next step is to scrape the metrics manually. This, for example, will scrape the Apache exporter on the host gayi:
curl -s http://gayi.torproject.org:9117/metrics | grep apache
In the case of this bug, the metrics were not showing up at all:
root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0
Notice, however, the apache_exporter_scrape_failures_total counter, which was incrementing. From there, we reproduced the work the exporter was doing manually and fixed the issue, which involved passing the correct argument to the exporter.
Pushgateway errors
The Pushgateway web interface provides some basic information about the metrics it collects, and allows you to view the pending metrics before they get scraped by Prometheus, which may be useful to troubleshoot issues with the gateway.
To pull metrics by hand, you can pull directly from the pushgateway:
curl localhost:9091/metrics
If you get this error while pulling metrics from the exporter:
An error has occurred while serving metrics:
collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values
It's because similar metrics were sent twice into the gateway, which corrupts the state of the Pushgateway, a known problem in earlier versions that was fixed in 0.10 (Debian bullseye and later). A workaround is simply to restart the Pushgateway (and clear the storage, if persistence is enabled; see the --persistence.file flag).
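A minimal sketch of that workaround on the server itself, assuming the Debian package's default systemd unit name:

    # assumption: the unit is named prometheus-pushgateway, as in the Debian package
    systemctl restart prometheus-pushgateway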
Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely rebuildable from Puppet. Non-configuration data should be restored from backups, with /var/lib/prometheus/ being sufficient to reconstruct history. If even the backups are destroyed, history will be lost, but the server should still recover and start tracking new metrics.
Reference
Installation
Puppet implementation
Every TPA server is configured as a node-exporter through the roles::monitored role that is included everywhere. The role might eventually be expanded to cover alerting and other monitoring resources as well. This role, in turn, includes the profile::prometheus::client class, which configures each client correctly with the right firewall rules.
The firewall rules are exported from the server, defined in profile::prometheus::server. We hacked around limitations of the upstream Puppet module to install Prometheus using backported Debian packages. The monitoring server itself is defined in roles::monitoring.
The Prometheus Puppet module was heavily patched to allow scrape job collection and use of Debian packages for installation, among many other patches sent by anarcat.
Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.
Pushgateway
The Pushgateway was configured on the external Prometheus server to allow the metrics team to push their data into Prometheus without having to write a Prometheus exporter inside Collector.
This was done directly inside the profile::prometheus::server::external class, but could be moved to a separate profile if it needs to be deployed internally. It is assumed that the gateway script will run directly on prometheus2 to avoid setting up authentication and/or firewall rules, but this could be changed.
Alertmanager
The Alertmanager is configured on the external Prometheus server for the metrics and anti-censorship teams to monitor the health of the network. It may eventually also be used to replace or enhance Nagios (ticket 29864).
It is installed through Puppet, in profile::prometheus::server::external, but could be moved to its own profile if it is deployed on more than one server.
Note that the Alertmanager only dispatches alerts, which are actually generated on the Prometheus server side of things. Make sure the following block exists in the prometheus.yml file:
alerting:
alert_relabel_configs: []
alertmanagers:
- static_configs:
- targets:
- localhost:9093
Manual node configuration
External services can be monitored by Prometheus, as long as they comply with the OpenMetrics protocol, which simply means exposing metrics like this over HTTP:
metric{label=label_val} value
A real-life (simplified) example:
node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392
The above says that the node alberti has the device /dev/sda1 mounted on /, formatted as an ext4 filesystem with 16160059392 bytes (~16GB) free.
System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:
- On Debian Buster and later:

      apt install prometheus-node-exporter

- On Debian stretch:

      apt install -t stretch-backports prometheus-node-exporter

  ... assuming that backports is already configured. If it isn't, such a line in /etc/apt/sources.list.d/backports.debian.org.list should suffice:

      deb https://deb.debian.org/debian/ stretch-backports main contrib non-free

  ... followed by an apt update, naturally.
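Once the package is installed, a quick local check confirms the exporter responds (9100 is the node exporter's default port):

    curl -s http://localhost:9100/metrics | head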
The firewall on the machine needs to allow traffic on the exporter port from the server prometheus2.torproject.org. Then open a ticket for TPA to configure the target. Make sure to mention:
- the hostname for the exporter
- the port of the exporter (varies according to the exporter, 9100 for the node exporter)
- how often to scrape the target, if non-default (default: 15s)
Then TPA needs to hook those up as part of a new node job in the scrape_configs, in prometheus.yml, from Puppet, in profile::prometheus::server.
See also Adding metrics for users, above.
Monitored services
Those are the actual services monitored by Prometheus.
Internal server (prometheus1)
The "internal" server scrapes all hosts managed by Puppet for
TPA. Puppet installs a node_exporter
on all servers, which
takes care of metrics like CPU, memory, disk usage, time accuracy, and
so on. Then other exporters might be enabled on specific services,
like email or web servers.
Access to the internal server is fairly public: the metrics there are not considered to be security sensitive and are protected by authentication only to keep bots away.
External server (prometheus2)
The "external" server, on the other hand, is more restrictive and does not allow public access. This is out of concern that specific metrics might lead to timing attacks against the network and/or leak sensitive information. The external server also explicitly does not scrape TPA servers automatically: it only scrapes certain services that are manually configured by TPA.
Those are the services currently monitored by the external server:
- bridgestrap
- rdsys
- OnionPerf external nodes' node_exporters
- connectivity tests on (some?) bridges (using the blackbox_exporter)
Note that this list might become out of sync with the actual implementation; look into Puppet in profile::prometheus::server::external for the actual deployment.
Other possible services to monitor
Many more exporters could be configured. A non-exhaustive list was built in ticket tpo/tpa/team#30028 around launch time. Here we can document more such exporters we find along the way:
- Prometheus Onion Service Exporter - "Export the status and latency of an onion service"
- hsprober - similar, but also with histogram buckets, multiple attempts, warm-up and error counts
- haproxy_exporter
There's also a list of third-party exporters in the Prometheus documentation.
SLA
Prometheus is currently not doing alerting so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time so we can do proper long-term resource planning.
Design
Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:
As you can see, Prometheus is somewhat tailored towards Kubernetes, but it can be used without it. We're deploying it with the file_sd discovery mechanism, where Puppet collects all exporters into the central server, which then scrapes those exporters every scrape_interval (by default 15 seconds). The architecture graph also shows the Alertmanager, which could be used to (eventually) replace our Nagios deployment.
It does not show that Prometheus can federate to multiple instances and the Alertmanager can be configured with High availability.
Pushgateway
The Pushgateway is a separate server from the main Prometheus server that is designed to "hold" onto metrics for ephemeral jobs that would not otherwise be around long enough for Prometheus to scrape their metrics. We use it as a workaround to bridge Metrics team data with Prometheus/Grafana.
Alertmanager
The Alertmanager is a separate program that receives alerts generated by Prometheus servers through an API, then groups and deduplicates them before sending notifications by email or other mechanisms.
Here's what the internal design of the Alertmanager looks like:
The first deployments of the Alertmanager at TPO do not feature a "cluster", or high availability (HA) setup.
Alerts are typically sent over email, but the Alertmanager also has builtin support for other receivers such as PagerDuty, Pushover, Slack, OpsGenie, VictorOps, and WeChat.
There's also a generic webhook receiver which is typically used to send notifications. Many other endpoints are implemented through that webhook, for example:
- Cachet
- Dingtalk
- Discord
- Google Chat
- IRC
- Matrix (JS, or this one in Python, or this one)
- Mattermost
- Microsoft teams
- Phabricator
- Sachet supports many messaging systems (Twilio, Pushbullet, Telegram, Sipgate, etc)
- Sentry
- Signal (or Signald)
- Splunk
- SNMP
- Telegram (or this one)
- Twilio
- Zabbix (or this one)
And that is only what was available at the time of writing, the alertmanager-webhook and alertmanager tags on GitHub might have more.
The Alertmanager has its own web interface to see and silence alerts, but there are also alternatives like Karma (previously Cloudflare's unsee). The web interface is not shipped with the Debian package, because it depends on the Elm compiler, which is not in Debian. It can be built by hand using the debian/generate-ui.sh script, but only in newer, post-buster versions. Another alternative to consider is Crochet.
In general, when working on alerting, it is worth keeping in mind the "My Philosophy on Alerting" paper from a Google engineer (now the Monitoring distributed systems chapter of the Site Reliability Engineering O'Reilly book).
Another issue with alerting in Prometheus is that you can only silence warnings for a certain amount of time, then you get a notification again. The kthxbye bot works around that issue.
Issues
There is no issue tracker specifically for this project; file or search for issues in the team issue tracker component.
Maintainer, users, and upstream
The Prometheus services have been setup and are managed by anarcat inside TPA. The internal Prometheus server is mostly used by TPA staff to diagnose issues. The external Prometheus server is used by various TPO teams for their own monitoring needs.
The upstream Prometheus projects are diverse and generally active as of early 2021. Since Prometheus is used as an ad-hoc standard in the new "cloud native" communities like Kubernetes, it has seen an upsurge of development and interest from various developers and companies. The future of Prometheus should therefore be fairly bright.
The individual exporters, however, can be hit and miss. Some exporters are "code dumps" from companies and not very well maintained. For example, Digital Ocean dumped the bind_exporter on GitHub, but it was salvaged by the Prometheus community.
Another important layer is the large amount of Puppet code that is used to deploy Prometheus and its components. This is all part of a big Puppet module, puppet-prometheus, managed by the voxpupuli collective. Our integration with the module is not yet complete: we have a lot of glue code on top of it to correctly make it work with Debian packages. Much of that work has been completed by anarcat, but some still remains, see upstream issue 32 for details.
Monitoring and testing
Prometheus doesn't have specific tests, but there is a test suite in the upstream prometheus Puppet module.
The server is monitored by Nagios for basic system-level metrics. It also monitors itself, both for system-level and application-specific metrics.
Logs and metrics
Prometheus servers typically do not generate many logs, except when errors and warnings occur. They should hold very little PII. The web frontends collect logs in accordance with our regular policy.
Actual metrics may contain PII, although it's quite unlikely: typically, data is anonymized and aggregated at collection time. It would still be possible to deduce some activity patterns from the metrics generated by Prometheus and use them to leverage side-channel attacks, which is why access to the external Prometheus server is restricted.
Long term metrics storage
Metrics are held for about a year or less, depending on the server, see ticket 29388 for storage requirements and possible alternatives for data retention policies.
Note that extra long-term data retention might be possible using the remote read functionality, which enables the primary server to read metrics from a secondary, longer-term server transparently, keeping graphs working without having to change data source, for example.
That way you could have a short-term server which keeps lots of metrics and polls every minute or even every 15 seconds, but keeps (say) only 30 days of data, and a long-term server which would poll the short-term server every (say) 5 minutes but keep (say) 5 years of metrics. But how much data would that be?
The last time we made an estimate, in May 2020, we had the following calculation for 1 minute polling interval over a year:
> 365d×1.3byte/(1min)×2000×78 to Gibyte
99,271238 gibibytes
At the time of writing (August 2021), that is still the configured interval, and the disk usage roughly matches that (98GB used). This implies that we could store about 5 years of metrics with a 5 minute polling interval, using the same disk usage, obviously:
> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
99,271238 gibibytes
... or 15 years with 15 minutes, etc... As a rule of thumb, as long as we multiply the scrape interval, we can multiply the retention period as well.
On the other hand, we might be able to increase granularity quite a bit by lowering the retention to (say) 30 days with a 5-second polling interval, which would give us:
> 30d*1.3byte/(5 second)*2000*78 to Gibyte
97,911358 gibibytes
That might be a bit aggressive though: the default Prometheus scrape_interval is 15 seconds, not 5 seconds... With the defaults (15 seconds scrape interval, 30 days retention), we'd be at about 30GiB disk usage, which makes for a quite reasonable and easy to replicate primary server.
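For the record, that figure comes from the same back-of-the-envelope formula as above, reusing the same 2000×78 series factor, with the default 15-second interval and 30-day retention:

> 30d×1.3byte/(15 second)×2000×78 to Gibyte
32,637119 gibibytes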
Backups
Prometheus servers should be fully configured through Puppet and require little in the way of backups. The metrics themselves are kept in /var/lib/prometheus2 and should be backed up along with our regular backup procedures.
Other documentation
Discussion
Overview
The prometheus and howto/grafana services were set up after anarcat realized that there was no "trending" service set up inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-March 2019 (ticket 29683) and remaining traces of Munin were removed in early April 2019 (ticket 29682).
Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244) with the hope this could eventually be expanded further with a downsampling server in the future.
Eventually, a second Prometheus/Grafana server was setup to monitor external resources (ticket 31159) because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns on the metrics team about exposing those metrics publicly.
It was originally thought Prometheus could completely replace howto/nagios as well (ticket 29864), but this turned out to be more difficult than planned. The main difficulty is that Nagios checks come with builtin thresholds of acceptable performance, while Prometheus metrics are just that: metrics, without thresholds... This makes it harder to replace Nagios, because a ton of alerts need to be written to replace the existing checks. A lot of functionality built into Nagios, like availability reports and acknowledgements, would need to be reimplemented as well.
Goals
This section didn't exist when the project was launched, so this is really just second-guessing...
Must have
- Munin replacement: long-term trending metrics to predict resource allocation, with graphing
- free software, self-hosted
- Puppet automation
Nice to have
- possibility of eventual Nagios phase-out (ticket 29864)
Non-Goals
- data retention beyond one year
Approvals required
Primary Prometheus server was decided in the Brussels 2019 devmeeting, before anarcat joined the team (ticket 29389). Secondary Prometheus server was approved in meeting/2019-04-08. Storage expansion was approved in meeting/2019-11-25.
Proposed Solution
Prometheus was chosen, see also howto/grafana.
Cost
N/A.
Alternatives considered
No alternatives research was performed, as far as we know.