Commit dc97df00 (Verified), authored 4 years ago by anarcat
finish expanding prometheus template
parent f4a188d5
tsa/howto/prometheus.mdwn: 56 additions, 10 deletions
Prometheus
==========
[Prometheus][] is a monitoring system that is designed to process a
large number of metrics, centralize them on one (or multiple) servers
and serve them with a well-defined API. That API is queried through a
...
...
@@ -13,8 +10,7 @@ layer on top (see [[Grafana]]).
[[!toc levels=3]]
Tutorial
========
# Tutorial
The Prometheus web interface is available at:
...
...
@@ -29,8 +25,18 @@ over the last two weeks for the known servers.
# How-to
## Pager playbook
TBD.
## Disaster recovery
If a Prometheus/Grafana server is destroyed, it should be completely
rebuildable from Puppet. Non-configuration data should be restored
from backup, with `/var/lib/prometheus/` being sufficient to
reconstruct history. If even the backups are destroyed, history will
be lost, but the server should still recover and start tracking new
metrics.
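
After a restore, one way to check how much history actually survived
is to ask the Prometheus HTTP API for the oldest sample of a series
such as `up`. The following is only a sketch: the helper names are
hypothetical, the server URL in the usage note is a placeholder, and
it assumes the standard `/api/v1/query_range` endpoint is reachable.

```python
import json
import urllib.parse
import urllib.request


def range_query_url(base_url, query, start, end, step):
    """Build a /api/v1/query_range URL (standard Prometheus HTTP API)."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def oldest_sample(response):
    """Return the oldest sample timestamp in a query_range response.

    Returns None when the response holds no samples at all.
    """
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error')}")
    timestamps = [
        float(ts)
        for series in response["data"]["result"]
        for ts, _value in series["values"]
    ]
    return min(timestamps) if timestamps else None


def fetch(url):
    """Fetch and decode a JSON API response (needs network access)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

For example, `oldest_sample(fetch(range_query_url("http://localhost:9090", "up", start, end, "1h")))`
returns a Unix timestamp that can be compared with the date of the
backup, where `start` and `end` are Unix timestamps bounding the
period of interest.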
## Migrating from Munin
Here's a quick cheat sheet for people used to Munin and switching to
...
...
@@ -134,6 +140,10 @@ policies.
## SLA
Prometheus is currently not doing alerting so it doesn't have any sort
of guaranteed availability. It should, hopefully, not lose too many
metrics over time so we can do proper long-term resource planning.
## Design
Here is, from the [Prometheus overview documentation][], the
...
...
@@ -170,30 +180,66 @@ There is no issue tracker specifically for this project, [File][] or
## Monitoring and testing
Prometheus doesn't have specific tests, but there *is* a test suite in
the upstream prometheus Puppet module.
The server is monitored for basic system-level metrics by Nagios. It
also monitors itself, both for system-level metrics and for
application-specific metrics.
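
Prometheus exposes its own application metrics in the text exposition
format on its `/metrics` endpoint, which is what it scrapes when
monitoring itself. As a rough illustration of what those metrics look
like once parsed, here is a minimal sketch; `parse_exposition` is a
hypothetical helper, not part of any Prometheus client library, and it
ignores details such as escaped characters in label values.

```python
def parse_exposition(text):
    """Parse Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # label values may contain spaces, so split after the
            # closing brace rather than on whitespace
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        # the first field is the value; an optional timestamp is ignored
        samples[series] = float(rest[0])
    return samples
```

Feeding it the body of a `/metrics` response yields a dictionary keyed
by series name (including labels), which is enough for quick sanity
checks from a script.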
# Discussion
## Overview
<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
The Prometheus and [[grafana]] services were set up after anarcat
realized that there was no "trending" service set up inside TPA after
Munin had died ([ticket 29681][]).
[ticket 29681]: https://trac.torproject.org/projects/tor/ticket/29681
Eventually, a second Prometheus/Grafana server was set up to monitor
external resources ([ticket 31159][]) because there were concerns
about mixing internal and external monitoring on TPA's side. There
were also concerns on the metrics team about exposing those metrics
publicly.
[ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
## Goals
<!-- include bugs to be fixed -->
This section didn't exist when the project was launched, so this is
really just second-guessing...
### Must have
* Munin replacement: long-term trending metrics to predict resource
allocation, with graphing
* free software, self-hosted
* Puppet automation
### Nice to have
* possibility of eventual Nagios phase-out
### Non-Goals
* data retention longer than 1 year
## Approvals required
<!-- for example, legal, "vegas", accounting, current maintainer -->
The primary Prometheus server was decided on some time before anarcat
joined the team ([ticket 29389][]). The secondary Prometheus server
was approved in [[meeting/2019-04-08]].
[ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
## Proposed Solution
Prometheus was chosen; see also [[grafana]].
## Cost
N/A.
## Alternatives considered
<!-- include benchmarks and procedure if relevant -->
No alternatives research was performed, as far as we know.