Changes

We stop short of rewriting all playbooks for Prometheus. Instead, wherever we found references to Nagios, we add a reference to the task of adding playbooks for everything (prometheus-alerts#16).
anarcat · dfb33cb5
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -1230,10 +1230,6 @@ made public.
 This section details how the alerting setup mentioned above works.
-Note that the [Icinga][] service is still in service, but it
-is planned to eventually be shut down and replaced by the Prometheus +
-Alertmanager setup ([issue 29864][]).
 In general, the upstream documentation for alerting starts from [the
 Alerting Overview][] but it can be lacking at times. [This tutorial][]
 can be quite helpful in better understanding how things are working.
@@ -2201,10 +2197,8 @@ changed.
 ### Alertmanager
-The [Alertmanager][] is configured on the external Prometheus server
-for the metrics and anti-censorship teams to monitor the health of the
-network. It may eventually also be used to replace or enhance
-[Nagios][] ([issue 29864][]).
+The [Alertmanager][] is configured on the Prometheus servers and is
+used to send alerts over IRC and email.
 It is installed through Puppet, in
 `profile::prometheus::server::external`, but could be moved to its own
@@ -2306,9 +2300,7 @@ As you can see, Prometheus is somewhat tailored towards
 [Kubernetes][] but it can be used without it. We're deploying it with
 the `file_sd` discovery mechanism, where Puppet collects all exporters
 into the central server, which then scrapes those exporters every
-`scrape_interval` (by default 15 seconds). The architecture graph also
-shows the Alertmanager which could be used to (eventually) replace our
-Nagios deployment.
+`scrape_interval` (by default 15 seconds).
 [Kubernetes]: https://kubernetes.io/
@@ -2990,14 +2982,15 @@ publicly.
 It was originally thought Prometheus could completely replace
 [Nagios][] as well [issue 29864][], but this turned out to be more
-difficult than planned. The main difficulty is that Nagios checks come
-with builtin threshold of acceptable performance. But Prometheus
-metrics are just that: metrics, without thresholds... This makes it
-more difficult to replace Nagios because a ton of alerts need to be
-rewritten to replace the existing ones. A lot of reports and
-functionality built-in to Nagios, like availability reports,
-acknowledgments and other reports, would need to be re-implemented as
-well.
+difficult than planned. 
+The main difficulty is that Nagios checks come with builtin threshold
+of acceptable performance. But Prometheus metrics are just that:
+metrics, without thresholds... This made it more difficult to replace
+Nagios because a ton of alerts had to be rewritten to replace the
+existing ones.
+This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
 ## Security and risk assessment