diff --git a/tsa/howto/prometheus.mdwn b/tsa/howto/prometheus.mdwn
index 587f7ff2e47445db7031ba86654c2b9c97a3159d..902d3785ee3b94eaaaf3bc884cad7255f1e81bec 100644
--- a/tsa/howto/prometheus.mdwn
+++ b/tsa/howto/prometheus.mdwn
@@ -248,11 +248,19 @@ application-specific metrics.
 
 The prometheus and [[grafana]] services were setup after anarcat
 realized that there was no "trending" service setup inside TPA after
-Munin had died ([ticket 29681][]). In particular, resource
-requirements were researched in [ticket 29388][] and it was originally
-planned to retain 15 days of metrics. This was expanded to one year in
-November 2019 ([ticket 31244][]) with the hope this could eventually
-be expanded further with a downsampling server in the future.
+Munin had died ([ticket 29681][]). The "node exporter" was deployed on
+all TPA hosts in mid-march 2019 ([ticket 29683][]) and remaining
+traces of Munin were removed in early April 2019 ([ticket 29682][]).
+
+ [ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
+ [ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
+
+
+Resource requirements were researched in [ticket 29388][] and it was
+originally planned to retain 15 days of metrics. This was expanded to
+one year in November 2019 ([ticket 31244][]) with the hope this could
+eventually be expanded further with a downsampling server in the
+future.
 
  [ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
  [ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
@@ -265,6 +273,17 @@ publicly.
 
  [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
 
+It was originally thought Prometheus could completely replace
+[[nagios]] as well [ticket 29864][], but this turned out to be more
+difficult than planned. The main difficulty is that Nagios checks come
+with builtin threshold of acceptable performance. But Prometheus
+metrics are just that: metrics, without thresholds... This makes it
+more difficult to replace Nagios because a ton of alerts need to be
+rewritten to replace the existing ones. A lot of reports and
+functionality built-in to Nagios, like availability reports,
+acknowledgements and other reports, would need to be reimplemented as
+well.
+
 ## Goals
 
 This section didn't exist when the projec was launched, so this is
@@ -279,7 +298,9 @@ really just second-guessing...
 
 ### Nice to have
 
- * possibility of eventual Nagios phase-out
+ * possibility of eventual Nagios phase-out ([ticket 29864][])
+
+ [ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864
 
 ### Non-Goals
 
@@ -287,10 +308,13 @@ really just second-guessing...
 
 ## Approvals required
 
-Primary Prometheus server was decided some time before anarcat joined
-the team ([ticket 29389][]). Secondary Prometheus server was approved
-in [[meeting/2019-04-08]]. Storage expansion was approved in [[meeting/2019-11-25]].
+Primary Prometheus server was decided [in the Brussels 2019
+devmeeting][], before anarcat joined the team ([ticket
+29389][]). Secondary Prometheus server was approved in
+[[meeting/2019-04-08]]. Storage expansion was approved in
+[[meeting/2019-11-25]].
 
+ [in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
  [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389
 
 ## Proposed Solution