Verified commit dc97df00, authored by anarcat

finish expanding prometheus template

parent f4a188d5
Prometheus
==========
[Prometheus][] is a monitoring system that is designed to process a
large number of metrics, centralize them on one (or multiple) servers
and serve them with a well-defined API. That API is queried through a
......@@ -13,8 +10,7 @@ layer on top (see [[Grafana]]).
[[!toc levels=3]]
# Tutorial
The Prometheus web interface is available at:
......@@ -29,8 +25,18 @@ over the last two weeks for the known servers.
# How-to
## Pager playbook
TBD.
## Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even the backups are destroyed, history will
be lost, but the server should still recover and start tracking new
metrics.
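
After restoring `/var/lib/prometheus/` from backup, it can be useful to
check how much history actually came back. The following is a minimal
sketch, not a tested procedure: it assumes the standard Prometheus 2.x
TSDB layout, where each block directory holds a `meta.json` file with
`minTime` and `maxTime` fields in milliseconds. The function name
`restored_history_range` is hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def _ms_to_utc(ms):
    """Convert a millisecond Unix timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def restored_history_range(data_dir):
    """Return (oldest, newest) UTC datetimes covered by TSDB blocks.

    Returns None if no block metadata is found under data_dir.
    Assumption: every meta.json below data_dir is TSDB block metadata.
    """
    ranges = []
    for meta_path in Path(data_dir).rglob("meta.json"):
        meta = json.loads(meta_path.read_text())
        ranges.append((meta["minTime"], meta["maxTime"]))
    if not ranges:
        return None
    oldest = min(start for start, _ in ranges)
    newest = max(end for _, end in ranges)
    return _ms_to_utc(oldest), _ms_to_utc(newest)
```

Calling `restored_history_range("/var/lib/prometheus/")` as a user that
can read the data directory returns `None` when no blocks were
restored, which would suggest the backup did not include any history.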
## Migrating from Munin
Here's a quick cheat sheet for people used to Munin and switching to
......@@ -134,6 +140,10 @@ policies.
## SLA
Prometheus currently does not do alerting, so it doesn't have any sort
of guaranteed availability. It should, hopefully, not lose too many
metrics over time, so that we can do proper long-term resource planning.
## Design
Here is, from the [Prometheus overview documentation][], the
......@@ -170,30 +180,66 @@ There is no issue tracker specifically for this project, [File][] or
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, for both system-level and application-specific
metrics.
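
Since Prometheus does no alerting here, one way to look at what it
knows about its own targets is to ask its HTTP API for the built-in
`up` metric. The sketch below is illustrative only: the base URL
`http://localhost:9090` and the helper names are assumptions, not part
of this setup. It relies on the instant query endpoint
(`/api/v1/query`) and its documented JSON response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def down_targets(payload):
    """Return the sorted instance labels whose `up` sample is 0."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return sorted(
        sample["metric"].get("instance", "<unknown>")
        for sample in payload["data"]["result"]
        # Each sample value is a [timestamp, "string value"] pair.
        if float(sample["value"][1]) == 0
    )


def fetch_down_targets(base_url="http://localhost:9090"):
    """Query a live server (hypothetical URL) and list targets that are down."""
    with urlopen(instant_query_url(base_url, "up"), timeout=10) as resp:
        return down_targets(json.load(resp))
```

`fetch_down_targets()` needs network access to a running server, while
`down_targets()` can be exercised on a saved API response.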
# Discussion
## Overview
<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
The Prometheus and [[grafana]] services were set up after anarcat
realized that there was no "trending" service set up inside TPA after
Munin had died ([ticket 29681][]).
[ticket 29681]: https://trac.torproject.org/projects/tor/ticket/29681
Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.
[ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
## Goals
<!-- include bugs to be fixed -->
This section didn't exist when the project was launched, so this is
really just second-guessing...
### Must have
* Munin replacement: long-term trending metrics to predict resource
allocation, with graphing
* free software, self-hosted
* Puppet automation
### Nice to have
* possibility of eventual Nagios phase-out
### Non-Goals
* > 1 year data retention
## Approvals required
<!-- for example, legal, "vegas", accounting, current maintainer -->
The primary Prometheus server was decided on some time before anarcat
joined the team ([ticket 29389][]). The secondary Prometheus server
was approved in [[meeting/2019-04-08]].
[ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
## Proposed Solution
Prometheus was chosen; see also [[grafana]].
## Cost
N/A.
## Alternatives considered
<!-- include benchmarks and procedure if relevant -->
No research into alternatives was performed, as far as we know.