
Prometheus is a monitoring system that is designed to process a large number of metrics, centralize them on one (or multiple) servers and serve them with a well-defined API. That API is queried through a domain-specific language (DSL) called "PromQL" or "Prometheus Query Language". Prometheus also supports basic graphing capabilities although those are limited enough that we use a separate graphing layer on top (see howto/Grafana).

Tutorial

Looking at pretty graphs

The Prometheus web interface is available at:

https://prometheus.torproject.org

A simple query you can try is to pick any metric in the list and click Execute. For example, the node_load5 metric shows the 5-minute load average of the known servers; the Graph tab can plot it over the last two weeks.
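
The same queries can be run against the HTTP API, which is handy for scripting. A minimal sketch, assuming you have network access to the web interface (the credentials are the well-known public ones mentioned under web dashboard access below):

# ask the API for the current 5-minute load average of all scraped nodes;
# replace user:password with the well-known public credentials
curl -sG -u user:password \
     https://prometheus.torproject.org/api/v1/query \
     --data-urlencode 'query=node_load5'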

The Prometheus web interface is crude: it's better to use howto/grafana dashboards for most purposes other than debugging.

How-to

Adding metrics for users

If you want your service to be monitored by Prometheus, you need to reuse an existing exporter or write your own. Writing an exporter is more involved, but still fairly easy, and might be necessary if you are the maintainer of an application not already instrumented for Prometheus.

The actual documentation is fairly good, but in short: a Prometheus exporter is a simple HTTP server which listens on a specific URL (/metrics, by convention, but it can be anything) and responds with a key/value list of entries, one per line. Each "key" is a simple string with an optional list of "labels" enclosed in curly braces. For example, here's how the "node exporter" exports CPU usage:

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 948736.11
node_cpu_seconds_total{cpu="0",mode="iowait"} 1659.94
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 516.23
node_cpu_seconds_total{cpu="0",mode="softirq"} 16491.47
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 35893.84
node_cpu_seconds_total{cpu="0",mode="user"} 67711.74

You don't necessarily have to write all that logic yourself, however: there are client libraries (see the Golang guide, Python demo or C documentation for examples) that do most of the job for you.

In any case, you should be careful about the names and labels of the metrics. See the metric and label naming best practices.

Once you have an exporter endpoint (say at http://example.com:9090/metrics), make sure it works:

curl http://example.com:9090/metrics

This should return a number of metrics that change (or not) at each call.

From there, provide that endpoint to the sysadmins (or someone with access to the external monitoring server), who will follow the procedure below to add the metric to Prometheus.

Once the exporter is hooked into Prometheus, you can browse the metrics directly at: https://prometheus.torproject.org. Graphs should be available at https://grafana.torproject.org, although those need to be created and committed into git by sysadmins to persist; see the anarcat dashboard directory for more information.

Adding targets on the external server

Alerts and scrape targets on the external server are managed through a Git repository called prometheus-alerts. To add a scrape target:

  1. clone the repository

    git clone https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/
    cd prometheus-alerts
  2. assuming you're adding a node exporter, create the target file:

    cat > targets.d/node_myproject.yaml <<EOF
    # scrape the external node exporters for project Foo
    ---
    - targets:
      - targetone.example.com
      - targettwo.example.com
    EOF
  3. add, commit, and push:

    git checkout -b myproject
    git add targets.d
    git commit -m"add node exporter targets for my project"
    git push origin -u myproject

The last push command should show you the URL where you can submit your merge request.

After being merged, the changes should propagate within 4 to 6 hours.

See also the targets.d documentation in the git repository.

Adding targets on the internal server

Normally, services configured in Puppet SHOULD automatically be scraped by Prometheus (see below). If, however, you need to manually configure a service, you may define extra jobs in the $scrape_configs array, in the profile::prometheus::server::internal Puppet class.

For example, because the GitLab Prometheus setup is not managed by Puppet (tpo/tpa/gitlab#20), we cannot use this automatic setup, so manual scrape targets are defined like this:

  $scrape_configs =
  [
    {
      'job_name'       => 'gitaly',
      'static_configs' => [
        {
          'targets' => [
            'gitlab-02.torproject.org:9236',
          ],
          'labels'  => {
            'alias' => 'Gitaly-Exporter',
          },
        },
      ],
    },
    [...]
  ]

But ideally those would be configured with automatic targets, below.

Automatic targets on the internal server

Metrics for the internal server are scraped automatically if the exporter is configured by the puppet-prometheus module. This is done almost automatically, apart from the need to open a firewall port in our configuration.

Take the apache_exporter as an example: in profile::prometheus::apache_exporter, we include the prometheus::apache_exporter class from the upstream Puppet module, then open the exporter's port to the Prometheus server with:

Ferm::Rule <<| tag == 'profile::prometheus::server-apache-exporter' |>>

Those rules are declared on the server, in profile::prometheus::server::internal.

Web dashboard access

The main web dashboard for the internal Prometheus server should be accessible at https://prometheus.torproject.org using the well-known, public username.

The dashboard for the external Prometheus server, however, is not publicly available. To access it, use the following command line to forward the relevant ports over SSH:

ssh -L 9090:localhost:9090 -L 9091:localhost:9091 -L 9093:localhost:9093 prometheus2.torproject.org

The above will also forward the management interfaces of the Alertmanager (port 9093) and Pushgateway (9091).
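
As a quick sanity check that the forwards work, the standard health endpoints can be queried (a minimal sketch; these endpoints are provided by recent versions of each daemon):

curl -s http://localhost:9090/-/healthy   # Prometheus
curl -s http://localhost:9093/-/healthy   # Alertmanager
curl -s http://localhost:9091/-/healthy   # Pushgateway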

Alerting

We currently do not do alerting for TPA services with Prometheus. We do, however, have the Alertmanager set up to do alerting for other teams on the secondary Prometheus server (prometheus2). This documentation details how that works, but could also eventually cover the main server, if it replaces Nagios for alerting (ticket 29864).

In general, the upstream documentation for alerting starts from the Alerting Overview but I have found it to be lacking at times. I have instead been following this tutorial which was quite helpful.

Adding alerts in Puppet

The Alertmanager can be (but currently isn't, on the external server) managed through Puppet, in profile::prometheus::server::external.

An alerting rule, in Puppet, is defined like:

    {
      'name' => 'bridgestrap',
      'rules' => [
        'alert' => 'Bridges down',
        'expr'  => 'bridgestrap_fraction_functional < 0.50',
        'for'   => '5m',
        'labels'       =>
        {
          'severity' => 'critical',
          'team'     => 'anti-censorship',
        },
        'annotations'  =>
        {
          'title' => 'Bridges down',
          'description' => 'Too many bridges down',
          # use humanizePercentage when upgrading to prom > 2.11
          'summary' => 'Number of functional bridges is `{{$value}}%`',
          'host' => '{{$labels.instance}}',
        },
      ],
    },

Note that we might want to move those to Hiera so that we could use YAML code directly, which would better match the syntax of the actual alerting rules.

Adding alerts through Git, on the external server

The external server regularly pulls a git repository for alerting rules and targets. Alerts can be added through that repository by adding a file in the rules.d directory; see the rules.d documentation in the git repository for more details.
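
A minimal sketch of such a rule file, assuming the repository expects standard Prometheus rule files (the group, alert name and expression here are made up for illustration); promtool can be used to validate it before pushing:

cat > rules.d/example.yaml <<EOF
groups:
  - name: example
    rules:
      - alert: ExampleServiceDown
        expr: up{job="example"} == 0
        for: 5m
        labels:
          severity: warning
          team: example
        annotations:
          summary: "example exporter has been unreachable for 5 minutes"
EOF
promtool check rules rules.d/example.yaml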

Note that alerts (probably?) do not take effect until a sysadmin reloads Prometheus.

TODO: confirm how rules are deployed.

Adding alert recipients

To add a new recipient for alerts, look for the receivers setting and add something like this:

receivers      => [
  {
    'name'          => 'anti-censorship team',
    'email_configs' => [
      'to'          => 'anti-censorship-alerts@lists.torproject.org',
      # see above
      'require_tls' => false,
    ],
  },
  # [...]

Then alerts can be routed to that receiver by adding a "route" in the routes setting. For example, this will route alerts with the team: anti-censorship label:

  routes            => [
    {
      'receiver' => 'anti-censorship team',
      'match'    => {
        'team' => 'anti-censorship',
      },
    },
  ],

Testing alerts

Normally, alerts should fire on the Prometheus server and be sent out to the Alertmanager server, if the latter is correctly configured (i.e. in the alerting section of prometheus.yml, see Installation below).

If you're not sure alerts are working, head to the web dashboard (see the access instructions) and look at the /alerts and /rules pages. For example, if you're using the port forwarding described above, those are at http://localhost:9090/alerts and http://localhost:9090/rules.

The http://localhost:9093 URL should, in theory, also be useful to manage the Alertmanager, but in practice the Debian package does not ship the web interface, so it is of limited use there. See the amtool section below for more information.

Note that the /targets URL is also useful to diagnose problems with exporters, in general, see also the troubleshooting section below.

If you can't access the dashboard at all or if the above seems too complicated, Grafana can be jury-rigged as a debugging tool for metrics as well. In the "Explore" panels, you can input Prometheus metrics, with auto-completion, and inspect the output directly.

Managing alerts with amtool

Since the Alertmanager web UI is not available in Debian, you need to use the amtool command. A few useful commands:

  • amtool alert: show firing alerts
  • amtool silence add --duration=1h --author=anarcat --comment="working on it" ALERTNAME: silence alert ALERTNAME for an hour, with some comments
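
If amtool is run away from the Alertmanager host (for example over the SSH port forward above), point it at the right URL explicitly. A few more commands, sketched with an assumed local forward:

# list firing alerts on the forwarded Alertmanager
amtool --alertmanager.url=http://localhost:9093 alert query
# list current silences, then expire one by its ID
amtool --alertmanager.url=http://localhost:9093 silence query
amtool --alertmanager.url=http://localhost:9093 silence expire SILENCE_ID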

Migrating from Munin

Here's a quick cheat sheet for people used to Munin who are switching to Prometheus:

What               Munin              Prometheus
Scraper            munin-update       prometheus
Agent              munin-node         prometheus node-exporter and others
Graphing           munin-graph        prometheus or grafana
Alerting           munin-limits       prometheus alertmanager
Network port       4949               9100 and others
Protocol           TCP, text-based    HTTP, text-based
Storage format     RRD                custom TSDB
Downsampling       yes                no
Default interval   5 minutes          15 seconds
Authentication     no                 no
Federation         no                 yes (can fetch from other servers)
High availability  no                 yes (alert-manager gossip protocol)

Basically, Prometheus is similar to Munin in many ways:

  • it "pulls" metrics from the nodes, although it does it over HTTP (to http://host:9100/metrics) instead of a custom TCP protocol like Munin

  • the agent running on the nodes is called prometheus-node-exporter instead of munin-node. It scrapes only a set of built-in parameters like CPU, disk space and so on; different exporters are necessary for different applications (like prometheus-apache-exporter), and any application can easily implement an exporter by exposing a Prometheus-compatible /metrics endpoint

  • like Munin, the node exporter doesn't have any form of authentication built in. We rely on IP-level firewalls to avoid leakage

  • the central server is simply called prometheus and runs as a daemon that wakes up on its own, instead of munin-update, which is called from munin-cron, itself called from cron

  • graphs are generated on the fly through the crude Prometheus web interface or by frontends like Grafana, instead of being constantly regenerated by munin-graph

  • samples are stored in a custom "time series database" (TSDB) in Prometheus instead of the (ad-hoc) RRD standard

  • Prometheus performs no downsampling, unlike RRD; it relies instead on efficient compression to save disk space, but still uses more of it than Munin

  • Prometheus scrapes samples much more aggressively than Munin by default, but that interval is configurable

  • Prometheus can scale horizontally (by sharding different services to different servers) and vertically (by aggregating different servers to a central one with a different sampling frequency) natively - munin-update and munin-graph can only run on a single (and same) server

  • Prometheus can act as a high-availability alerting system thanks to its Alertmanager, which can run multiple copies in parallel without sending duplicate alerts - munin-limits can only run on a single server

Push metrics to the Pushgateway

The Pushgateway is set up on the secondary Prometheus server (prometheus2). Note that you might not need to use the Pushgateway at all; see the article about pushing metrics before going down this route.

The Pushgateway is somewhat particular: it listens on port 9091 and gets data through a simple, curl-friendly command-line API. We have found that, once installed, this command just "does the right thing", more or less:

echo 'some_metrics{foo="bar"} 3.14' | curl --data-binary @- http://localhost:9091/metrics/job/jobtest/instance/instancetest

To confirm the data was accepted by the Pushgateway:

curl localhost:9091/metrics | head
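
To clean up after such a test, the same URL path can be used to delete the whole pushed group (this is part of the standard Pushgateway HTTP API):

curl -X DELETE http://localhost:9091/metrics/job/jobtest/instance/instancetest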

The Pushgateway is scraped, like other Prometheus jobs, every minute, with metrics kept for a year, at the time of writing. This is configured, inside Puppet, in profile::prometheus::server::external.

Note that it's not possible to push timestamps into the Pushgateway, so it's not useful to ingest past historical data.

Pager playbook

TBD.

Troubleshooting missing metrics

If metrics do not correctly show up in Grafana, it might be worth checking in the Prometheus dashboard itself for the same metrics. Typically, if they do not show up in Grafana, they won't show up in Prometheus either, but it's worth a try, even if only to see the raw data.

Then, if data truly isn't present in Prometheus, you can track down the "target" (the exporter) responsible for it in the /targets listing. If the target is "unhealthy", it will be marked in red and an error message will show up.

If the target is marked healthy, the next step is to scrape the metrics manually. This, for example, will scrape the Apache exporter from the host gayi:

curl -s http://gayi.torproject.org:9117/metrics | grep apache

In the case of this bug, the metrics were not showing up at all:

root@hetzner-nbg1-01:~# curl -s http://gayi.torproject.org:9117/metrics | grep apache
# HELP apache_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which apache_exporter was built.
# TYPE apache_exporter_build_info gauge
apache_exporter_build_info{branch="",goversion="go1.7.4",revision="",version=""} 1
# HELP apache_exporter_scrape_failures_total Number of errors while scraping apache.
# TYPE apache_exporter_scrape_failures_total counter
apache_exporter_scrape_failures_total 18371
# HELP apache_up Could the apache server be reached
# TYPE apache_up gauge
apache_up 0

Notice, however, the apache_exporter_scrape_failures_total, which was incrementing. From there, we reproduced the work the exporter was doing manually and fixed the issue, which involved passing the correct argument to the exporter.

Pushgateway errors

The Pushgateway web interface provides some basic information about the metrics it collects, and allows you to view the pending metrics before they get scraped by Prometheus, which may be useful to troubleshoot issues with the gateway.

To check the metrics by hand, you can pull them directly from the Pushgateway:

curl localhost:9091/metrics

If you get this error while pulling metrics from the exporter:

An error has occurred while serving metrics:

collected metric "some_metric" { label:<name:"instance" value:"" > label:<name:"job" value:"some_job" > label:<name:"tag" value:"val1" > counter:<value:1 > } was collected before with the same name and label values

It's because similar metrics were sent twice into the gateway, which corrupts its state, a known problem in earlier versions that was fixed in 0.10 (Debian bullseye and later). A workaround is simply to restart the Pushgateway (and clear the storage, if persistence is enabled, see the --persistence.file flag).
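
A minimal sketch of that workaround, assuming the Debian packaging (the service name comes from the prometheus-pushgateway package):

# restart the Pushgateway to clear its in-memory state
systemctl restart prometheus-pushgateway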

Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely rebuildable from Puppet. Non-configuration data should be restored from backup, with /var/lib/prometheus/ being sufficient to reconstruct history. If even backups are destroyed, history will be lost, but the server should still recover and start tracking new metrics.

Reference

Installation

Puppet implementation

Every TPA server is configured with a node exporter through the roles::monitored class that is included everywhere. The role might eventually be expanded to cover alerting and other monitoring resources as well. This role, in turn, includes profile::prometheus::client which configures each client correctly with the right firewall rules.

The firewall rules are exported from the server, defined in profile::prometheus::server. We hacked around limitations of the upstream Puppet module to install Prometheus using backported Debian packages. The monitoring server itself is defined in roles::monitoring.

The Prometheus Puppet module was heavily patched to allow scrape job collection and use of Debian packages for installation, among many other patches sent by anarcat.

Much of the initial Prometheus configuration was also documented in ticket 29681 and especially ticket 29388 which investigates storage requirements and possible alternatives for data retention policies.

Pushgateway

The Pushgateway was configured on the external Prometheus server to allow the metrics people to push their data into Prometheus without having to write a Prometheus exporter inside Collector.

This was done directly inside the profile::prometheus::server::external class, but could be moved to a separate profile if it needs to be deployed internally. It is assumed that the gateway script will run directly on prometheus2 to avoid setting up authentication and/or firewall rules, but this could be changed.

Alertmanager

The Alertmanager is configured on the external Prometheus server for the metrics and anti-censorship teams to monitor the health of the network. It may eventually also be used to replace or enhance Nagios (ticket 29864).

It is installed through Puppet, in profile::prometheus::server::external, but could be moved to its own profile if it is deployed on more than one server.

Note that Alertmanager only dispatches alerts, which are actually generated on the Prometheus server side of things. Make sure the following block exists in the prometheus.yml file:

alerting:
  alert_relabel_configs: []
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093
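
After changing prometheus.yml, the configuration can be sanity-checked with promtool, which ships with Prometheus (the path below assumes the Debian default):

promtool check config /etc/prometheus/prometheus.yml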

Manual node configuration

External services can be monitored by Prometheus, as long as they comply with the OpenMetrics protocol, which simply means exposing metrics like the following over HTTP:

metric{label="label_val"} value

A real-life (simplified) example:

node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node alberti has the device /dev/sda1 mounted on /, formatted as an ext4 filesystem with 16160059392 bytes (~16GB) free.

System-level metrics can easily be monitored by the secondary Prometheus server. This is usually done by installing the "node exporter", with the following steps:

  • On Debian Buster and later:

     apt install prometheus-node-exporter
  • On Debian stretch:

     apt install -t stretch-backports prometheus-node-exporter

    ... assuming that backports is already configured. If it isn't, a line like this in /etc/apt/sources.list.d/backports.debian.org.list should suffice:

     deb	https://deb.debian.org/debian/	stretch-backports	main contrib non-free

    ... followed by an apt update, naturally.
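
Once installed, a quick local check confirms the exporter responds before the firewall is opened (9100 is the node exporter's default port):

curl -s http://localhost:9100/metrics | head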

The firewall on the machine needs to allow traffic on the exporter port from the server prometheus2.torproject.org. Then open a ticket for TPA to configure the target. Make sure to mention:

  • the hostname for the exporter
  • the port of the exporter (varies according to the exporter, 9100 for the node exporter)
  • how often to scrape the target, if non-default (default: 15s)

Then TPA needs to hook those up as part of a new node job in the scrape_configs, in prometheus.yml, from Puppet, in profile::prometheus::server.

See also Adding metrics for users, above.

Monitored services

Those are the actual services monitored by Prometheus.

Internal server (prometheus1)

The "internal" server scrapes all hosts managed by Puppet for TPA. Puppet installs a node_exporter on all servers, which takes care of metrics like CPU, memory, disk usage, time accuracy, and so on. Then other exporters might be enabled on specific services, like email or web servers.

Access to the internal server is fairly public: the metrics there are not considered security sensitive and are protected by authentication only to keep bots away.

External server (prometheus2)

The "external" server, on the other hand, is more restrictive and does not allow public access. This is out of concern that specific metrics might lead to timing attacks against the network and/or leak sensitive information. The external server also explicitly does not scrape TPA servers automatically: it only scrapes certain services that are manually configured by TPA.

Those are the services currently monitored by the external server:

Note that this list might become out of sync with the actual implementation; look into Puppet in profile::prometheus::server::external for the actual deployment.

Other possible services to monitor

Many more exporters could be configured. A non-exhaustive list was built in ticket tpo/tpa/team#30028 around launch time. Here we can document more such exporters as we find them along the way:

There's also a list of third-party exporters in the Prometheus documentation.

SLA

Prometheus is currently not used for alerting on TPA services, so it doesn't have any sort of guaranteed availability. It should, hopefully, not lose too many metrics over time, so we can do proper long-term resource planning.

Design

Here is, from the Prometheus overview documentation, the basic architecture of a Prometheus site:

[Figure: a drawing of Prometheus' architecture, showing the Pushgateway and exporters adding metrics, service discovery through file_sd and Kubernetes, alerts pushed to the Alertmanager, and the various UIs pulling from Prometheus]

As you can see, Prometheus is somewhat tailored towards Kubernetes but it can be used without it. We're deploying it with the file_sd discovery mechanism, where Puppet collects all exporters into the central server, which then scrapes those exporters every scrape_interval (by default 15 seconds). The architecture graph also shows the Alertmanager which could be used to (eventually) replace our Nagios deployment.
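
For illustration, a file_sd-based scrape job looks roughly like the sketch below, written to a scratch file here (the paths are hypothetical; in our deployment the real configuration is generated by Puppet):

cat > /tmp/file_sd-example.yml <<EOF
scrape_configs:
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets.d/*.yaml
EOF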

It does not show that Prometheus can federate to multiple instances and the Alertmanager can be configured with High availability.

Pushgateway

The Pushgateway is a separate server from the main Prometheus server that is designed to "hold" onto metrics for ephemeral jobs that would not otherwise be around long enough for Prometheus to scrape them. We use it as a workaround to bridge Metrics data into Prometheus/Grafana.

Alertmanager

The Alertmanager is a separate program that receives alerts generated by Prometheus servers through an API, then groups and deduplicates them before sending notifications by email or other mechanisms.

Here's what the internal design of the Alertmanager looks like:

[Figure: internal architecture of the Alertmanager, showing how it gets alerts from Prometheus through an API and internally pushes them through various storage queues and deduplicating notification pipelines, along with a clustered gossip protocol]

The first deployments of the Alertmanager at TPO do not feature a "cluster", or high availability (HA) setup.

Alerts are typically sent over email, but Alertmanager also has builtin support for a number of other receivers.

There's also a generic webhook receiver which is typically used to send notifications; many other endpoints are implemented through that webhook.

And that is only what was available at the time of writing; the alertmanager-webhook and alertmanager tags on GitHub might have more.

The Alertmanager has its own web interface to see and silence alerts, but there are also alternatives like Karma (previously Cloudflare's unsee). The web interface is not shipped with the Debian package, because it depends on the Elm compiler which is not in Debian. It can be built by hand using the debian/generate-ui.sh script, but only in newer, post buster versions. Another alternative to consider is Crochet.

In general, when working on alerting, it's worth keeping in mind the "My Philosophy on Alerting" paper from a Google engineer (now the Monitoring distributed systems chapter of the Site Reliability Engineering O'Reilly book).

Another issue with alerting in Prometheus is that you can only silence warnings for a certain amount of time, then you get a notification again. The kthxbye bot works around that issue.

Issues

There is no issue tracker specifically for this project; file or search for issues in the team issue tracker component.

Maintainer, users, and upstream

The Prometheus services have been set up and are managed by anarcat inside TPA. The internal Prometheus server is mostly used by TPA staff to diagnose issues. The external Prometheus server is used by various TPO teams for their own monitoring needs.

The upstream Prometheus projects are diverse and generally active as of early 2021. Since Prometheus has become a de facto standard in the new "cloud native" communities like Kubernetes, it has seen an upsurge of development and interest from various developers and companies. The future of Prometheus should therefore be fairly bright.

The individual exporters, however, can be hit and miss. Some exporters are "code dumps" from companies and not very well maintained. For example, Digital Ocean dumped the bind_exporter on GitHub, but it was salvaged by the Prometheus community.

Another important layer is the large amount of Puppet code used to deploy Prometheus and its components. This is all part of a big Puppet module, puppet-prometheus, managed by the voxpupuli collective. Our integration with the module is not yet complete: we have a lot of glue code on top of it to make it work correctly with Debian packages. anarcat has done a lot of upstream work to close that gap, but some remains; see upstream issue 32 for details.

Monitoring and testing

Prometheus doesn't have specific tests, but there is a test suite in the upstream prometheus Puppet module.

The server is monitored for basic system-level metrics by Nagios. It also monitors itself, both for system-level and application-specific metrics.

Logs and metrics

Prometheus servers typically do not generate many logs, except when errors and warnings occur. They should hold very little PII. The web frontends collect logs in accordance with our regular policy.

Actual metrics may contain PII, although it's quite unlikely: typically, data is anonymized and aggregated at collection time. It would still be possible to deduce some activity patterns from the metrics generated by Prometheus and use them to mount side-channel attacks, which is why access to the external Prometheus server is restricted.

Long term metrics storage

Metrics are held for about a year or less, depending on the server, see ticket 29388 for storage requirements and possible alternatives for data retention policies.

Note that extra long-term data retention might be possible using the remote read functionality, which enables the primary server to read metrics from a secondary, longer-term server transparently, keeping graphs working without having to change data source, for example.
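
A rough sketch of what that could look like on the short-term server, written to a scratch file for illustration (the server name and options are hypothetical; in practice this would be managed through Puppet):

cat > /tmp/remote_read-example.yml <<EOF
remote_read:
  # pull older samples transparently from a hypothetical long-term server
  - url: "http://longterm.example.org:9090/api/v1/read"
    # only go remote for queries the local retention cannot answer
    read_recent: false
EOF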

That way you could have a short-term server which keeps lots of metrics and polls every minute or even 15 seconds, but keeps (say) only 30 days of data, and a long-term server which would poll the short-term server every (say) 5 minutes but keep (say) 5 years of metrics. But how much data would that be?

The last time we made an estimate, in May 2020, we had the following calculation for 1 minute polling interval over a year:

> 365d×1.3byte/(1min)×2000×78 to Gibyte
99,271238 gibibytes

At the time of writing (August 2021), that is still the configured interval, and the disk usage roughly matches that (98GB used). This implies that we could store about 5 years of metrics with a 5 minute polling interval, using the same disk usage, obviously:

> 5*365d×1.3byte/(5min)×2000×78 to Gibyte
99,271238 gibibytes

... or 15 years with 15 minutes, and so on. As a rule of thumb, if we multiply the scrape interval, we can multiply the retention period as well.

On the other hand, we might be able to increase granularity quite a bit by lowering the retention to (say) 30 days and the polling interval to 5 seconds, which would give us:

> 30d*1.3byte/(5 second)*2000*78 to Gibyte
97,911358 gibibytes

That might be a bit aggressive though: the default Prometheus scrape_interval is 15 seconds, not 5 seconds... With the defaults (15 seconds scrape interval, 30 days retention), we'd be at about 30GiB disk usage, which makes for a quite reasonable and easy to replicate primary server.
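
For reference, a minimal sketch of that last estimate, done with plain shell arithmetic and reusing the same 1.3 bytes/sample and 2000×78 series factor as the calculations above:

# 30 days retention, 15 second interval, ~1.3 bytes per sample,
# 2000 x 78 series; result is in gibibytes
echo '30*86400/15*1.3*2000*78/2^30' | bc -l
# => ~32.6, in line with the ~30GiB figure above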

Backups

Prometheus servers should be fully configured through Puppet and require little in the way of backups. The metrics themselves are kept in /var/lib/prometheus2 and should be backed up along with our regular backup procedures.

Other documentation

Discussion

Overview

The prometheus and howto/grafana services were set up after anarcat realized that there was no "trending" service setup inside TPA after Munin had died (ticket 29681). The "node exporter" was deployed on all TPA hosts in mid-March 2019 (ticket 29683) and remaining traces of Munin were removed in early April 2019 (ticket 29682).

Resource requirements were researched in ticket 29388 and it was originally planned to retain 15 days of metrics. This was expanded to one year in November 2019 (ticket 31244) with the hope this could eventually be expanded further with a downsampling server in the future.

Eventually, a second Prometheus/Grafana server was setup to monitor external resources (ticket 31159) because there were concerns about mixing internal and external monitoring on TPA's side. There were also concerns on the metrics team about exposing those metrics publicly.

It was originally thought Prometheus could completely replace howto/nagios as well (ticket 29864), but this turned out to be more difficult than planned. The main difficulty is that Nagios checks come with built-in thresholds of acceptable performance, while Prometheus metrics are just that: metrics, without thresholds. This makes it harder to replace Nagios, because a ton of alerting rules need to be written to replace the existing checks. A lot of functionality built into Nagios, like availability reports and acknowledgements, would need to be reimplemented as well.

Goals

This section didn't exist when the project was launched, so this is really just second-guessing...

Must have

  • Munin replacement: long-term trending metrics to predict resource allocation, with graphing
  • free software, self-hosted
  • Puppet automation

Nice to have

Non-Goals

  • data retention beyond one year

Approvals required

The primary Prometheus server was decided on at the Brussels 2019 devmeeting, before anarcat joined the team (ticket 29389). The secondary Prometheus server was approved in meeting/2019-04-08. The storage expansion was approved in meeting/2019-11-25.

Proposed Solution

Prometheus was chosen, see also howto/grafana.

Cost

N/A.

Alternatives considered

No alternatives research was performed, as far as we know.