[Prometheus][] is a monitoring system designed to process a large
number of metrics, centralize them on one (or multiple) servers and
serve them with a well-defined API. That API is queried through a
domain-specific language (DSL) called "PromQL" or "Prometheus Query
Language". Prometheus also supports basic graphing capabilities,
although those are limited enough that we use a separate graphing
layer on top (see [howto/Grafana](howto/Grafana)).

[Prometheus]: https://prometheus.io/

[[_TOC_]]

# Tutorial

## Looking at pretty graphs

The Prometheus web interface is available at:

<https://prometheus.torproject.org>

A simple query you can try is to pick any metric in the list and click
`Execute`. For example, [this link][] will show the 5-minute load
over the last two weeks for the known servers.

[this link]: https://prometheus1.torproject.org/graph?g0.range_input=2w&g0.expr=node_load5&g0.tab=0

The Prometheus web interface is crude: it's better to use
[howto/grafana](howto/grafana) dashboards for most purposes other
than debugging.
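
The same data can also be fetched programmatically through the
standard Prometheus HTTP API. A minimal sketch, assuming the server is
reachable from where you run it (it may be firewalled or require
credentials):

    # evaluate a PromQL expression through the HTTP API; this is an
    # "instant" query, returning one current sample per machine as JSON
    curl -sG 'https://prometheus1.torproject.org/api/v1/query' \
      --data-urlencode 'query=node_load5'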

# How-to

## Pager playbook

TBD.

## Disaster recovery

If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even backups are destroyed, history will be
lost, but the server should still recover and start tracking new
metrics.
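
A minimal sketch of such a restore, assuming the backup system has
already placed a copy of the data directory under `/var/tmp/restore`
(that staging path is hypothetical, adjust to taste):

    # stop the scraper so the TSDB is not written to during the restore
    systemctl stop prometheus
    # put the historical data back in place and fix ownership
    rsync -a /var/tmp/restore/var/lib/prometheus/ /var/lib/prometheus/
    chown -R prometheus:prometheus /var/lib/prometheus
    systemctl start prometheus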

## Migrating from Munin

Here's a quick cheat sheet for people used to Munin and switching to
Prometheus:

| What | Munin | Prometheus |
| --- | ----- | ---------- |
| Scraper | munin-update | prometheus |
| Agent | munin-node | prometheus node-exporter and others |
| Graphing | munin-graph | prometheus or grafana |
| Alerting | munin-limits | prometheus alertmanager |
| Network port | 4949 | 9100 and others |
| Protocol | TCP, text-based | HTTP, [text-based][] |
| Storage format | RRD | custom TSDB |
| Downsampling | yes | no |
| Default interval | 5 minutes | 15 seconds |
| Authentication | no | no |
| Federation | no | yes (can fetch from other servers) |
| High availability | no | yes (alertmanager gossip protocol) |

[text-based]: https://prometheus.io/docs/instrumenting/exposition_formats/

Basically, Prometheus is similar to Munin in many ways:

* it "pulls" metrics from the nodes, although it does so over HTTP
  (to <http://host:9100/metrics>) instead of a custom TCP protocol
  like Munin (see the example after this list)

* the agent running on the nodes is called `prometheus-node-exporter`
  instead of `munin-node`. It scrapes only a set of built-in
  parameters like CPU, disk space and so on; different exporters are
  necessary for different applications (like
  `prometheus-apache-exporter`) and any application can easily
  implement an exporter by exposing a Prometheus-compatible
  `/metrics` endpoint

* like Munin, the node exporter doesn't have any form of
  authentication built in. We rely on IP-level firewalls to avoid
  leakage

* the central server is simply called `prometheus` and runs as a
  daemon that wakes up on its own, instead of `munin-update`, which
  is called from `munin-cron` and, before that, `cron`

* graphs are generated on the fly through the crude Prometheus web
  interface or by frontends like Grafana, instead of being constantly
  regenerated by `munin-graph`

* samples are stored in a custom "time series database" (TSDB) in
  Prometheus instead of the (ad hoc) RRD standard

* Prometheus performs *no* downsampling, unlike RRD: it relies on
  smart compression to spare disk space, but still uses more than
  Munin

* Prometheus scrapes samples much more aggressively than Munin by
  default, but that interval is configurable

* Prometheus can scale horizontally (by sharding different services
  to different servers) and vertically (by aggregating different
  servers to a central one with a different sampling frequency)
  natively - `munin-update` and `munin-graph` can only run on a
  single (and the same) server

* Prometheus can act as a high-availability alerting system thanks
  to its `alertmanager`, which can run multiple copies in parallel
  without sending duplicate alerts - `munin-limits` can only run on
  a single server
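
To see what the server fetches on every scrape, you can query an
exporter by hand. A sketch, assuming a node exporter is running
locally on the default port:

    # dump the raw metrics the server would scrape, keeping only the
    # load averages as a readable sample
    curl -s http://localhost:9100/metrics | grep '^node_load'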

# Reference

## Installation

### Puppet implementation

Every node is configured as a `node-exporter` through the
`roles::monitored` role that is included everywhere. The role might
eventually be expanded to cover alerting and other monitoring
resources as well. This role, in turn, includes the
`profile::prometheus::client` profile, which configures each client
correctly with the right firewall rules.

The firewall rules are exported from the server, defined in
`profile::prometheus::server`. We hacked around limitations of the
upstream Puppet module to install Prometheus using backported Debian
packages. The monitoring server itself is defined in
`roles::monitoring`.

The [Prometheus Puppet module][] was patched to [allow scrape job
collection][] and [use of Debian packages for installation][]. Much of
the initial Prometheus configuration was also documented in
[ticket 29681][] and especially [ticket 29388][], which investigates
storage requirements and possible alternatives for data retention
policies.

[ticket 29388]: https://bugs.torproject.org/29388
[ticket 29681]: https://bugs.torproject.org/29681
[use of Debian packages for installation]: https://github.com/voxpupuli/puppet-prometheus/pull/303
[allow scrape job collection]: https://github.com/voxpupuli/puppet-prometheus/pull/304
[Prometheus Puppet module]: https://github.com/voxpupuli/puppet-prometheus/

### Manual node configuration

External services can be monitored by Prometheus, as long as they
comply with the [OpenMetrics][] protocol, which simply means exposing
metrics like this over HTTP:

    metric{label=label_val} value

A real-life (simplified) example:

    node_filesystem_avail_bytes{alias="alberti.torproject.org",device="/dev/sda1",fstype="ext4",mountpoint="/"} 16160059392

The above says that the node alberti has the device `/dev/sda1`
mounted on `/`, formatted as an `ext4` filesystem which has
16160059392 bytes (~16GB) free.

[OpenMetrics]: https://openmetrics.io/
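
To check that an endpoint complies with the format, one approach is to
lint its output with `promtool`, which ships with the `prometheus`
package. A sketch, assuming an exporter listening on port 9100:

    # fetch the exposed metrics and lint them against the exposition format
    curl -s http://localhost:9100/metrics | promtool check metrics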

System-level metrics can easily be monitored by the secondary
Prometheus server. This is usually done by installing the "node
exporter", with the following steps:

* On Debian buster and later:

      apt install prometheus-node-exporter

* On Debian stretch:

      apt install -t stretch-backports prometheus-node-exporter

  ... assuming that backports is already configured. If it isn't, a
  line like this in
  `/etc/apt/sources.list.d/backports.debian.org.list` should suffice:

      deb https://deb.debian.org/debian/ stretch-backports main contrib non-free

  ... followed by an `apt update`, naturally.

The firewall on the machine needs to allow traffic on the exporter
port from the server `prometheus2.torproject.org`. Then [open a
ticket][new-ticket] for TPA to configure the target. Make sure to
mention:

* the hostname for the exporter
* the port of the exporter (varies according to the exporter, 9100
  for the node exporter)
* how often to scrape the target, if non-default (default: 15s)

Then TPA needs to hook those up as a new node `job` in the
`scrape_configs` section of `prometheus.yml`, managed from Puppet in
`profile::prometheus::server`.
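
For illustration, here is roughly what such a job stanza looks like in
`prometheus.yml`. This is a hand-written sketch with a made-up job
name and target; the real file is templated by Puppet and should not
be edited by hand:

    scrape_configs:
      - job_name: 'node'
        scrape_interval: 15s
        static_configs:
          - targets:
              - 'example.torproject.org:9100'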

## SLA

Prometheus is currently not doing alerting, so it doesn't have any
sort of guaranteed availability. It should, hopefully, not lose too
many metrics over time, so we can do proper long-term resource
planning.

## Design

Here is, from the [Prometheus overview documentation][], the
basic architecture of a Prometheus site:

[Prometheus overview documentation]: https://prometheus.io/docs/introduction/overview/

<img src="https://prometheus.io/assets/architecture.png" alt="A
drawing of Prometheus' architecture, showing the push gateway and
exporters adding metrics, service discovery through file_sd and
Kubernetes, alerts pushed to the Alertmanager and the various UIs
pulling from Prometheus" />

As you can see, Prometheus is somewhat tailored towards
[Kubernetes][], but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager, which could be used to (eventually) replace
our Nagios deployment.
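
Concretely, `file_sd` simply means the server reads its scrape targets
from files on disk, which Puppet generates from the exported
resources. A hypothetical sketch of such a targets file (the path and
hostnames are made up for illustration):

    # a hypothetical /etc/prometheus/file_sd/node.yml (Puppet writes the real ones)
    - targets:
        - 'alberti.torproject.org:9100'
        - 'example.torproject.org:9100'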

[Kubernetes]: https://kubernetes.io/

The diagram does not show that Prometheus can federate to multiple
instances and that the Alertmanager can be configured for high
availability.

## Issues

There is no issue tracker specifically for this project. [File][new-ticket] or
[search][] for issues in the [generic internal services][search] component.

[new-ticket]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team

## Monitoring and testing

Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream Prometheus Puppet module.

The server is monitored for basic system-level metrics by Nagios.
Prometheus also monitors itself, both for system-level and
application-specific metrics.

# Discussion

## Overview

The Prometheus and [howto/grafana](howto/grafana) services were set up
after anarcat realized that there was no "trending" service running
inside TPA after Munin had died ([ticket 29681][]). The "node
exporter" was deployed on all TPA hosts in mid-March 2019
([ticket 29683][]) and remaining traces of Munin were removed in early
April 2019 ([ticket 29682][]).

[ticket 29683]: https://bugs.torproject.org/29683
[ticket 29682]: https://bugs.torproject.org/29682

Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) with the hope that this
could eventually be expanded further with a downsampling server in the
future.

[ticket 31244]: https://bugs.torproject.org/31244
[ticket 29388]: https://bugs.torproject.org/29388

Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.

[ticket 31159]: https://bugs.torproject.org/31159

It was originally thought Prometheus could completely replace
[howto/nagios](howto/nagios) as well ([ticket 29864][]), but this
turned out to be more difficult than planned. The main difficulty is
that Nagios checks come with built-in thresholds of acceptable
performance, while Prometheus metrics are just that: metrics, without
thresholds. This makes it harder to replace Nagios, because a ton of
alerts need to be written to replace the existing checks. A lot of
functionality built into Nagios, like availability reports and
acknowledgements, would need to be reimplemented as well.

## Goals

This section didn't exist when the project was launched, so this is
really just second-guessing...

### Must have

* Munin replacement: long-term trending metrics to predict resource
  allocation, with graphing
* free software, self-hosted
* Puppet automation

### Nice to have

* possibility of eventual Nagios phase-out ([ticket 29864][])

[ticket 29864]: https://bugs.torproject.org/29864

### Non-Goals

* more than one year of data retention

## Approvals required

The primary Prometheus server was decided on [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket 29389][]). The
secondary Prometheus server was approved in
[meeting/2019-04-08](meeting/2019-04-08). Storage expansion was
approved in [meeting/2019-11-25](meeting/2019-11-25).

[in the Brussels 2019 devmeeting]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://bugs.torproject.org/29389

## Proposed Solution

Prometheus was chosen, see also [howto/grafana](howto/grafana).

## Cost

N/A.

## Alternatives considered

No alternatives research was performed, as far as we know.