tsa/howto/prometheus.mdwn +33 −9

@@ -248,11 +248,19 @@ application-specific metrics.
 The prometheus and [[grafana]] services were setup after anarcat
 realized that there was no "trending" service setup inside TPA after
-Munin had died ([ticket 29681][]). In particular, resource
-requirements were researched in [ticket 29388][] and it was originally
-planned to retain 15 days of metrics. This was expanded to one year in
-November 2019 ([ticket 31244][]) with the hope this could eventually be
-expanded further with a downsampling server in the future.
+Munin had died ([ticket 29681][]). The "node exporter" was deployed on
+all TPA hosts in mid-march 2019 ([ticket 29683][]) and remaining
+traces of Munin were removed in early April 2019 ([ticket 29682][]).
+
+[ticket 29683]: https://trac.torproject.org/projects/tor/ticket/29683
+[ticket 29682]: https://trac.torproject.org/projects/tor/ticket/29682
+
+Resource requirements were researched in [ticket 29388][] and it was
+originally planned to retain 15 days of metrics. This was expanded to
+one year in November 2019 ([ticket 31244][]) with the hope this could
+eventually be expanded further with a downsampling server in the
+future.

 [ticket 31244]: https://trac.torproject.org/projects/tor/ticket/31244
 [ticket 29388]: https://trac.torproject.org/projects/tor/ticket/29388
@@ -265,6 +273,17 @@ publicly.
 [ticket 31159]: https://trac.torproject.org/projects/tor/ticket/31159
+
+It was originally thought Prometheus could completely replace
+[[nagios]] as well [ticket 29864][], but this turned out to be more
+difficult than planned. The main difficulty is that Nagios checks come
+with builtin threshold of acceptable performance. But Prometheus
+metrics are just that: metrics, without thresholds... This makes it
+more difficult to replace Nagios because a ton of alerts need to be
+rewritten to replace the existing ones. A lot of reports and
+functionality built-in to Nagios, like availability reports,
+acknowledgements and other reports, would need to be reimplemented as
+well.

 ## Goals

 This section didn't exist when the projec was launched, so this is
@@ -279,7 +298,9 @@ really just second-guessing...
 ### Nice to have

- * possibility of eventual Nagios phase-out
+ * possibility of eventual Nagios phase-out ([ticket 29864][])
+
+[ticket 29864]: https://trac.torproject.org/projects/tor/ticket/29864

 ### Non-Goals
@@ -287,10 +308,13 @@ really just second-guessing...
 ## Approvals required

-Primary Prometheus server was decided some time before anarcat joined
-the team ([ticket 29389][]). Secondary Prometheus server was approved
-in [[meeting/2019-04-08]]. Storage expansion was approved in
+Primary Prometheus server was decided [in the Brussels 2019
+devmeeting][], before anarcat joined the team ([ticket
+29389][]). Secondary Prometheus server was approved in
+[[meeting/2019-04-08]]. Storage expansion was approved in
 [[meeting/2019-11-25]].

+[in the Brussels 2019 devmeeting]: https://trac.torproject.org/projects/tor/wiki/org/meetings/2019BrusselsAdminTeamMinutes#Trendingmonitoring
 [ticket 29389]: https://trac.torproject.org/projects/tor/ticket/29389

 ## Proposed Solution
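The hunk at `@@ -265,6 +273,17 @@` says that Nagios checks come with built-in thresholds while Prometheus metrics carry none, so every alert has to be rewritten. A minimal sketch of that difference follows. It is an illustration only and is not taken from the TPA configuration: the metric name `disk_used_percent`, the 80/90 thresholds, and the helpers `nagios_style_disk_check`, `exported_metrics`, `AlertRule` and `evaluate` are all hypothetical, and real Prometheus alerts are written as alerting rules in its own rule files, not in Python.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Nagios-style (hypothetical): the check measures AND judges.
# The thresholds ship with the check, so a state comes out directly.
def nagios_style_disk_check(used_percent: float,
                            warn: float = 80.0,
                            crit: float = 90.0) -> str:
    if used_percent >= crit:
        return "CRITICAL"
    if used_percent >= warn:
        return "WARNING"
    return "OK"


# Prometheus-style (hypothetical): the exporter only reports a number.
# Nothing here says whether the value is acceptable.
def exported_metrics(used_percent: float) -> Dict[str, float]:
    return {"disk_used_percent": used_percent}


# The judgement has to be written separately, one rule per alert.
@dataclass
class AlertRule:
    name: str
    metric: str
    fires: Callable[[float], bool]


def evaluate(rules: List[AlertRule], metrics: Dict[str, float]) -> List[str]:
    """Return the names of the rules whose condition holds for the metrics."""
    return [rule.name for rule in rules
            if rule.metric in metrics and rule.fires(metrics[rule.metric])]


if __name__ == "__main__":
    metrics = exported_metrics(95.0)
    print(nagios_style_disk_check(95.0))   # state decided by the check itself
    print(evaluate([], metrics))           # no rules written yet: nothing fires
    rules = [AlertRule("DiskAlmostFull", "disk_used_percent",
                       lambda value: value >= 90.0)]
    print(evaluate(rules, metrics))        # fires only once a rule exists
```

With no rules defined, the 95% reading raises nothing, which is the gap the hunk describes: each existing Nagios check needs an equivalent rule written by hand before Prometheus can take over alerting.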