Verified commit daf03f50, authored by anarcat

more historical details

parent 641cba41
@@ -248,11 +248,19 @@ application-specific metrics.
The Prometheus and [[grafana]] services were set up after anarcat
realized that there was no "trending" service set up inside TPA after
Munin had died ([ticket 29681][]). In particular, resource
requirements were researched in [ticket 29388][] and it was originally
planned to retain 15 days of metrics. This was expanded to one year in
November 2019 ([ticket 31244][]) with the hope this could eventually
be expanded further with a downsampling server in the future.
Munin had died ([ticket 29681][]). The "node exporter" was deployed on
all TPA hosts in mid-March 2019 ([ticket 29683][]) and remaining
traces of Munin were removed in early April 2019 ([ticket 29682][]).
[ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
[ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
Resource requirements were researched in [ticket 29388][] and it was
originally planned to retain 15 days of metrics. This was expanded to
one year in November 2019 ([ticket 31244][]) in the hope that
retention could later be extended further with a downsampling server.
[ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
[ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
@@ -265,6 +273,17 @@ publicly.
[ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
It was originally thought Prometheus could completely replace
[[nagios]] as well ([ticket 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, whereas Prometheus
metrics are just that: metrics, without thresholds. This makes it
more difficult to replace Nagios because many alerts would need to be
rewritten to replace the existing ones. Reports and other
functionality built into Nagios, like availability reports and
acknowledgements, would need to be reimplemented as well.
## Goals
This section didn't exist when the project was launched, so this is
@@ -279,7 +298,9 @@ really just second-guessing...
### Nice to have
* possibility of eventual Nagios phase-out
* possibility of eventual Nagios phase-out ([ticket 29864][])
[ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864
### Non-Goals
@@ -287,10 +308,13 @@ really just second-guessing...
## Approvals required
Primary Prometheus server was decided some time before anarcat joined
the team ([ticket 29389][]). Secondary Prometheus server was approved
in [[meeting/2019-04-08]]. Storage expansion was approved in [[meeting/2019-11-25]].
The primary Prometheus server was agreed upon [in the Brussels 2019
devmeeting][], before anarcat joined the team ([ticket 29389][]). The
secondary Prometheus server was approved in
[[meeting/2019-04-08]]. Storage expansion was approved in
[[meeting/2019-11-25]].
[in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
[ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
## Proposed Solution