service/prometheus.md +13 −10
@@ -2526,15 +2526,13 @@ No major issue resolved so far is worth mentioning here.
 ## Maintainers
 
 The Prometheus services have been setup and are managed by anarcat
-inside TPA. The internal Prometheus server is mostly used by TPA staff
-to diagnose issues. The external Prometheus server is used by various
-TPO teams for their own monitoring needs.
+inside TPA.
 
 ## Users
 
-<!-- TODO who the main users are, how they use the service. possibly reuse -->
-<!-- the Personas section in the RFC, if available. -->
-<!-- see overlap with above -->
+The internal Prometheus server is mostly used by TPA staff to diagnose
+issues. The external Prometheus server is used by various TPO teams
+for their own monitoring needs.
 
 ## Upstream
 
@@ -2589,10 +2587,14 @@ policies.
 
 ## Tests
 
-Prometheus doesn't have specific tests, but there *is* a test suite in
-the upstream Prometheus Puppet module.
+The `prometheus-alerts.git` repository has tests that run in GitLab CI,
+see the [Testing alerts section](#testing-alerts) on how to write those.
+
+When doing major upgrades, the [Karma dashboard][] should be visited
+to make sure it works correctly.
 
-TODO: merge with alertmanager test stuff
+There is a test suite in the upstream Prometheus Puppet module as
+well, but it's not part of our CI.
 
 ## Logs
 
@@ -2810,7 +2812,8 @@ Near the end of 2024, Icinga was replaced by Prometheus and
 Alertmanager, as part of [TPA-RFC-33][].
 
 TODO: document a little bit how the actual migration went, along with
-the three stages and milestones
+the three stages and milestones. see overlap with Proposed solutions
+above.
 
 Before Icinga was retired, we performed an audit of the notifications
 sent from Icinga about our services ([#41791][]) to see if we're