Verified commit daf03f50, authored by anarcat

more historical details

parent 641cba41
+33 −9
@@ -248,11 +248,19 @@ application-specific metrics.
 
 The prometheus and [[grafana]] services were setup after anarcat
 realized that there was no "trending" service setup inside TPA after
-Munin had died ([ticket 29681][]). In particular, resource
-requirements were researched in [ticket 29388][] and it was originally
-planned to retain 15 days of metrics. This was expanded to one year in
-November 2019 ([ticket 31244][]) with the hope this could eventually
-be expanded further with a downsampling server in the future.
+Munin had died ([ticket 29681][]). The "node exporter" was deployed on
+all TPA hosts in mid-march 2019 ([ticket 29683][]) and remaining
+traces of Munin were removed in early April 2019 ([ticket 29682][]).
+
+ [ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
+ [ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
+
+
+Resource requirements were researched in [ticket 29388][] and it was
+originally planned to retain 15 days of metrics. This was expanded to
+one year in November 2019 ([ticket 31244][]) with the hope this could
+eventually be expanded further with a downsampling server in the
+future.
 
  [ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
  [ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
@@ -265,6 +273,17 @@ publicly.
 
  [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
 
+It was originally thought Prometheus could completely replace
+[[nagios]] as well [ticket 29864][], but this turned out to be more
+difficult than planned. The main difficulty is that Nagios checks come
+with builtin threshold of acceptable performance. But Prometheus
+metrics are just that: metrics, without thresholds... This makes it
+more difficult to replace Nagios because a ton of alerts need to be
+rewritten to replace the existing ones. A lot of reports and
+functionality built-in to Nagios, like availability reports,
+acknowledgements and other reports, would need to be reimplemented as
+well.
+
 ## Goals
 
 This section didn't exist when the projec was launched, so this is
@@ -279,7 +298,9 @@ really just second-guessing...
 
 ### Nice to have
 
- * possibility of eventual Nagios phase-out
+ * possibility of eventual Nagios phase-out ([ticket 29864][])
+
+ [ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864
 
 ### Non-Goals
 
@@ -287,10 +308,13 @@ really just second-guessing...
 
 ## Approvals required
 
-Primary Prometheus server was decided some time before anarcat joined
-the team ([ticket 29389][]). Secondary Prometheus server was approved
-in [[meeting/2019-04-08]]. Storage expansion was approved in [[meeting/2019-11-25]].
+Primary Prometheus server was decided [in the Brussels 2019
+devmeeting][], before anarcat joined the team ([ticket
+29389][]). Secondary Prometheus server was approved in
+[[meeting/2019-04-08]]. Storage expansion was approved in
+[[meeting/2019-11-25]].
 
+ [in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
  [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
 
 ## Proposed Solution