remove references to nagios in our docs (#41816), authored by anarcat
We stop short of rewriting all playbooks for Prometheus; instead,
wherever we found references to Nagios, we add a pointer to the task
of adding playbooks for everything (prometheus-alerts#16).
@@ -1230,10 +1230,6 @@ made public.
This section details how the alerting setup mentioned above works.
Note that the [Icinga][] service is still running, but it
is planned to eventually be shut down and replaced by the Prometheus +
Alertmanager setup ([issue 29864][]).
In general, the upstream documentation for alerting starts from [the
Alerting Overview][], but it can be lacking at times. [This tutorial][]
can be quite helpful in better understanding how things work.
@@ -2201,10 +2197,8 @@ changed.
### Alertmanager
The [Alertmanager][] is configured on the external Prometheus server
for the metrics and anti-censorship teams to monitor the health of the
network. It may eventually also be used to replace or enhance
[Nagios][] ([issue 29864][]).
The [Alertmanager][] is configured on the Prometheus servers and is
used to send alerts over IRC and email.
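For illustration, a minimal Alertmanager configuration along these lines could look like the sketch below. The receiver names, addresses and URL are placeholders, not the actual Puppet-managed values; note also that Alertmanager has no native IRC receiver, so IRC delivery typically goes through a webhook to a relay (such as alertmanager-irc-relay).

```yaml
# Hypothetical sketch of an Alertmanager route and receivers;
# real values are managed by Puppet and differ from these.
route:
  receiver: email-admins          # default receiver for all alerts
  group_by: ['alertname', 'instance']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - receiver: irc-relay         # route a subset of alerts to IRC
      matchers:
        - severity = "critical"
receivers:
  - name: email-admins
    email_configs:
      - to: 'admins@example.org'  # placeholder address
  - name: irc-relay
    webhook_configs:
      - url: 'http://localhost:8000/alerts'  # placeholder IRC relay endpoint
```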
It is installed through Puppet, in
`profile::prometheus::server::external`, but could be moved to its own
@@ -2306,9 +2300,7 @@ As you can see, Prometheus is somewhat tailored towards
[Kubernetes][] but it can be used without it. We're deploying it with
the `file_sd` discovery mechanism, where Puppet collects all exporters
into the central server, which then scrapes those exporters every
`scrape_interval` (by default 15 seconds). The architecture graph also
shows the Alertmanager which could be used to (eventually) replace our
Nagios deployment.
`scrape_interval` (by default 15 seconds).
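As a sketch of how that discovery mechanism is wired up, a `file_sd_configs` scrape job might look like the following (the job name and target path are illustrative, not the actual Puppet-managed layout):

```yaml
# Hypothetical scrape job using file-based service discovery:
# Puppet writes target files into the watched directory, and
# Prometheus picks up changes without a restart.
scrape_configs:
  - job_name: node
    scrape_interval: 15s          # matches the default mentioned above
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # placeholder path for Puppet-collected targets
```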
[Kubernetes]: https://kubernetes.io/
@@ -2990,14 +2982,15 @@ publicly.
It was originally thought Prometheus could completely replace
[Nagios][] as well ([issue 29864][]), but this turned out to be more
difficult than planned. The main difficulty is that Nagios checks come
with built-in thresholds of acceptable performance, but Prometheus
metrics are just that: metrics, without thresholds. This makes it
more difficult to replace Nagios because a ton of alerts need to be
rewritten to replace the existing ones. A lot of reports and
functionality built into Nagios, like availability reports,
acknowledgments and other reports, would need to be re-implemented as
well.
difficult than planned.
The main difficulty is that Nagios checks come with built-in
thresholds of acceptable performance, but Prometheus metrics are just
that: metrics, without thresholds. This made it more difficult to
replace Nagios because a ton of alerts had to be rewritten to replace
the existing ones.
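To illustrate the difference, a Nagios disk check carries its own warning/critical thresholds, while in Prometheus the threshold must be written explicitly into an alerting rule. A hypothetical rewritten rule (the rule name, expression and labels are examples, not actual deployed rules) might look like:

```yaml
# Hypothetical Prometheus alerting rule: the threshold that a Nagios
# check would carry implicitly is expressed here in the PromQL query.
groups:
  - name: example
    rules:
      - alert: DiskWillFillSoon
        # predict whether the filesystem runs out of space within 4 hours
        expr: predict_linear(node_filesystem_avail_bytes[6h], 4 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4 hours"
```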
This was performed in [TPA-RFC-33][], over the course of 2024 and 2025.
## Security and risk assessment