remove references to nagios in our docs (#41816), authored by anarcat
We stop short of rewriting all playbooks for Prometheus; instead,
wherever we found references to Nagios, we add a pointer to the task
of adding playbooks for everything (prometheus-alerts#16).
@@ -1230,10 +1230,6 @@ made public.
This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still running, but it
is planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([issue 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][], but it can be lacking at times. [This tutorial][]
can be quite helpful in better understanding how things work.
@@ -2201,10 +2197,8 @@ changed.
### Alertmanager
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([issue 29864][]).
The [Alertmanager][] is configured on the Prometheus servers and is
used to send alerts over IRC and email.
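For illustration, a minimal Alertmanager configuration along these lines could look like the sketch below. The receiver names, addresses and URL are placeholders, not the actual Puppet-managed values; note also that Alertmanager has no native IRC receiver, so IRC delivery typically goes through a webhook to a relay (such as alertmanager-irc-relay).

```yaml
# Hypothetical sketch of an Alertmanager route and receivers;
# real values are managed by Puppet and differ from these.
route:
  receiver: email-admins          # default receiver for all alerts
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - receiver: irc-relay         # route a subset of alerts to IRC
      matchers:
        - severity = "critical"
receivers:
  - name: email-admins
    email_configs:
      - to: 'admins@example.org'  # placeholder address
  - name: irc-relay
    webhook_configs:
      - url: 'http://localhost:8000/alerts'  # placeholder IRC relay endpoint
```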
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
@@ -2306,9 +2300,7 @@ As you can see, Prometheus is somewhat tailored towards
[Kubernetes][] but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager which could be used to (eventually) replace our
Nagios deployment.
`scrape_interval` (by default 15 seconds).
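As a sketch of how that discovery mechanism is wired up, a `file_sd_configs` scrape job might look like the following (the job name and target path are illustrative, not the actual Puppet-managed layout):

```yaml
# Hypothetical scrape job using file-based service discovery:
# Puppet writes target files into the watched directory, and
# Prometheus picks up changes without a restart.
scrape_configs:
  - job_name: node
    scrape_interval: 15s          # matches the default mentioned above
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # placeholder path for Puppet-collected targets
```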
[Kubernetes]: https://kubernetes.io/
@@ -2990,14 +2982,15 @@ publicly.
It was originally thought Prometheus could completely replace
[Nagios][] as well ([issue 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, but Prometheus
metrics are just that: metrics, without thresholds. This makes it
more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and
functionality built into Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
difficult than planned.
The main difficulty is that Nagios checks come with built-in
thresholds of acceptable performance, but Prometheus metrics are just
that: metrics, without thresholds. This made it more difficult to
replace Nagios because a ton of alerts had to be rewritten to replace
the existing ones.
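To illustrate the difference, a Nagios disk check carries its own warning/critical thresholds, while in Prometheus the threshold must be written explicitly into an alerting rule. A hypothetical rewritten rule (the rule name, expression and labels are examples, not actual deployed rules) might look like:

```yaml
# Hypothetical Prometheus alerting rule: the threshold that a Nagios
# check would carry implicitly is expressed here in the PromQL query.
groups:
  - name: example
    rules:
      - alert: DiskWillFillSoon
        # predict whether the filesystem runs out of space within 4 hours
        expr: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
```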
This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
## Security and risk assessment