more historical details

Verified Commit daf03f50 authored 4 years ago by anarcat
--- a/tsa/howto/prometheus.mdwn
+++ b/tsa/howto/prometheus.mdwn
@@ -248,11 +248,19 @@ application-specific metrics.

 The prometheus and [[grafana]] services were set up after anarcat
 realized that there was no "trending" service set up inside TPA after
-Munin had died ([ticket 29681][]). In particular, resource
-requirements were researched in [ticket 29388][] and it was originally
-planned to retain 15 days of metrics. This was expanded to one year in
-November 2019 ([ticket 31244][]) with the hope this could eventually
-be expanded further with a downsampling server in the future.
+Munin had died ([ticket 29681][]). The "node exporter" was deployed on
+all TPA hosts in mid-March 2019 ([ticket 29683][]) and remaining
+traces of Munin were removed in early April 2019 ([ticket 29682][]).
+
+ [ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
+ [ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
+
+
+Resource requirements were researched in [ticket 29388][] and it was
+originally planned to retain 15 days of metrics. This was expanded to
+one year in November 2019 ([ticket 31244][]) with the hope this could
+eventually be expanded further with a downsampling server in the
+future.

 [ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
 [ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
@@ -265,6 +273,17 @@ publicly.

 [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159

+It was originally thought Prometheus could completely replace
+[[nagios]] as well ([ticket 29864][]), but this turned out to be more
+difficult than planned. The main difficulty is that Nagios checks come
+with built-in thresholds of acceptable performance. But Prometheus
+metrics are just that: metrics, without thresholds. This makes it
+more difficult to replace Nagios because a ton of alerts need to be
+rewritten to replace the existing ones. A lot of reports and
+functionality built into Nagios, like availability reports,
+acknowledgements and other reports, would need to be reimplemented as
+well.
+
 ## Goals

 This section didn't exist when the project was launched, so this is
@@ -279,7 +298,9 @@ really just second-guessing...

 ### Nice to have

- * possibility of eventual Nagios phase-out
+ * possibility of eventual Nagios phase-out ([ticket 29864][])
+
+ [ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864

 ### Non-Goals

@@ -287,10 +308,13 @@ really just second-guessing...

 ## Approvals required

-Primary Prometheus server was decided some time before anarcat joined
-the team ([ticket 29389][]). Secondary Prometheus server was approved
-in [[meeting/2019-04-08]]. Storage expansion was approved in [[meeting/2019-11-25]].
+Primary Prometheus server was decided [in the Brussels 2019
+devmeeting][], before anarcat joined the team ([ticket
+29389][]). Secondary Prometheus server was approved in
+[[meeting/2019-04-08]]. Storage expansion was approved in
+[[meeting/2019-11-25]].

+ [in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
 [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389

 ## Proposed Solution