finish expanding prometheus template

dc97df00 · anarcat · f4a188d5 · dc97df00
Verified Commit dc97df00 authored 4 years ago by anarcat
--- a/tsa/howto/prometheus.mdwn
+++ b/tsa/howto/prometheus.mdwn
-Prometheus
-==========
-
 [Prometheus][] is a monitoring system that is designed to process a
 large number of metrics, centralize them on one (or multiple) servers
 and serve them with a well-defined API. That API is queried through a
@@ -13,8 +10,7 @@ layer on top (see [[Grafana]]).

 [[!toc levels=3]]

-Tutorial
-========
+# Tutorial

 The Prometheus web interface is available at:

@@ -29,8 +25,18 @@ over the last two weeks for the known servers.
 # How-to

 ## Pager playbook
+
+TBD.
+
 ## Disaster recovery

+If a Prometheus/Grafana server is destroyed, it should be completely
+rebuildable from Puppet. Non-configuration data should be restored
+from backup, with `/var/lib/prometheus/` being sufficient to
+reconstruct history. If even backups are destroyed, history will be
+lost, but the server should still recover and start tracking new
+metrics.
+
 ## Migrating from Munin

 Here's a quick cheat sheet from people used to Munin and switching to
@@ -134,6 +140,10 @@ policies.

 ## SLA

+Prometheus is currently not doing alerting, so it doesn't have any sort
+of guaranteed availability. It should, hopefully, not lose too many
+metrics over time so we can do proper long-term resource planning.
+
 ## Design

 Here is, from the [Prometheus overview documentation][], the
@@ -170,30 +180,66 @@ There is no issue tracker specifically for this project, [File][] or

 ## Monitoring and testing

+Prometheus doesn't have specific tests, but there *is* a test suite in
+the upstream prometheus Puppet module.
+
+The server is monitored for basic system-level metrics by Nagios. It
+also monitors itself, collecting both system-level and
+application-specific metrics.

 # Discussion

 ## Overview

-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
+The Prometheus and [[grafana]] services were set up after anarcat
+realized that there was no "trending" service set up inside TPA after
+Munin had died ([ticket 29681][]).
+
+ [ticket 29681]: https://trac.torproject.org/projects/tor/ticket/29681
+
+Eventually, a second Prometheus/Grafana server was set up to monitor
+external resources ([ticket 31159][]) because there were concerns
+about mixing internal and external monitoring on TPA's side. There
+were also concerns on the metrics team about exposing those metrics
+publicly.
+
+ [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159

 ## Goals
-<!-- include bugs to be fixed -->
+
+This section didn't exist when the project was launched, so this is
+really just second-guessing...

 ### Must have

+ * Munin replacement: long-term trending metrics to predict resource
+   allocation, with graphing
+ * free software, self-hosted
+ * Puppet automation
+
 ### Nice to have

+ * possibility of eventual Nagios phase-out
+
 ### Non-Goals

+ * data retention beyond one year
+
 ## Approvals required
-<!-- for example, legal, "vegas", accounting, current maintainer -->
+
+The primary Prometheus server was decided on some time before anarcat
+joined the team ([ticket 29389][]). The secondary Prometheus server was approved in [[meeting/2019-04-08]].
+
+ [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389

 ## Proposed Solution

+Prometheus was chosen; see also [[grafana]].
+
 ## Cost

+N/A.
+
 ## Alternatives considered

-<!-- include benchmarks and procedure if relevant -->
+No research into alternatives was performed, as far as we know.
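
The "Monitoring and testing" section above notes that Prometheus monitors itself. As a rough illustration only, the sketch below shows one way to read the result of an `up` instant query from the Prometheus HTTP API (`/api/v1/query`) and list targets that are currently down. The base URL `http://localhost:9090`, the host names in the sample payload, and the helper names `build_query_url` and `down_targets` are assumptions made for this example; they do not come from the wiki page or the Puppet configuration.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def down_targets(payload: dict) -> list:
    """Return the sorted instances whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for sample in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _timestamp, value = sample["value"]
        if float(value) == 0:
            down.append(sample["metric"].get("instance", "<unknown>"))
    return sorted(down)


# Hypothetical response, shaped like the JSON returned for the query `up`.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "up", "instance": "web.example.com:9100", "job": "node"},
       "value": [1560000000.0, "1"]},
      {"metric": {"__name__": "up", "instance": "db.example.com:9100", "job": "node"},
       "value": [1560000000.0, "0"]}
    ]
  }
}
""")

if __name__ == "__main__":
    # Assumed local address; replace with the real server URL.
    print(build_query_url("http://localhost:9090", "up"))
    print(down_targets(SAMPLE))
```

The parsing is kept separate from any network call so it can be exercised against a saved response; fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose.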