Commit daf03f50 authored 4 years ago by anarcat
more historical details
parent 641cba41
tsa/howto/prometheus.mdwn: 33 additions, 9 deletions
...
...
@@ -248,11 +248,19 @@ application-specific metrics.
 The prometheus and [[grafana]] services were setup after anarcat
 realized that there was no "trending" service setup inside TPA after
-Munin had died ([ticket 29681][]). In particular, resource
-requirements were researched in [ticket 29388][] and it was originally
-planned to retain 15 days of metrics. This was expanded to one year in
-November 2019 ([ticket 31244][]) with the hope this could eventually
-be expanded further with a downsampling server in the future.
+Munin had died ([ticket 29681][]). The "node exporter" was deployed on
+all TPA hosts in mid-march 2019 ([ticket 29683][]) and remaining
+traces of Munin were removed in early April 2019 ([ticket 29682][]).
+[ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
+[ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
+Resource requirements were researched in [ticket 29388][] and it was
+originally planned to retain 15 days of metrics. This was expanded to
+one year in November 2019 ([ticket 31244][]) with the hope this could
+eventually be expanded further with a downsampling server in the
+future.
 [ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
 [ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
...
...
@@ -265,6 +273,17 @@ publicly.
 [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
+It was originally thought Prometheus could completely replace
+[[nagios]] as well [ticket 29864][], but this turned out to be more
+difficult than planned. The main difficulty is that Nagios checks come
+with builtin threshold of acceptable performance. But Prometheus
+metrics are just that: metrics, without thresholds... This makes it
+more difficult to replace Nagios because a ton of alerts need to be
+rewritten to replace the existing ones. A lot of reports and
+functionality built-in to Nagios, like availability reports,
+acknowledgements and other reports, would need to be reimplemented as
+well.
 ## Goals
 This section didn't exist when the projec was launched, so this is
...
...
@@ -279,7 +298,9 @@ really just second-guessing...
 ### Nice to have
-* possibility of eventual Nagios phase-out
+* possibility of eventual Nagios phase-out ([ticket 29864][])
+[ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864
 ### Non-Goals
...
...
@@ -287,10 +308,13 @@ really just second-guessing...
 ## Approvals required
-Primary Prometheus server was decided some time before anarcat joined
-the team ([ticket 29389][]). Secondary Prometheus server was approved
-in [[meeting/2019-04-08]]. Storage expansion was approved in [[meeting/2019-11-25]].
+Primary Prometheus server was decided [in the Brussels 2019
+devmeeting][], before anarcat joined the team ([ticket
+29389][]). Secondary Prometheus server was approved in
+[[meeting/2019-04-08]]. Storage expansion was approved in
+[[meeting/2019-11-25]].
+[in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
 [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
 ## Proposed Solution
...
...