TPA-RFC-33: consider replacing nagios with prometheus
As a follow-up to the Prometheus/Grafana setup started in #29681 (closed), I am wondering if we should also consider replacing the Nagios/Icinga server with Prometheus. I have done a little research on the subject and figured it might be good to at least document the current state of affairs.
This would remove a complex piece of architecture we have at TPO that was designed before Puppet was properly deployed. Prometheus has an interesting federated design that allows it to scale to multiple machines easily, along with a high availability component for the alertmanager that allows it to be more reliable than a traditional Nagios configuration. It would also simplify our architecture: the Nagios server automation is a complex mix of Debian packages and git hooks that serves us well but is hard to comprehend and debug for new administrators. (I managed to wipe the entire Nagios config myself in my first week on the job by messing up a configuration file.) Having the monitoring server fully deployed by Puppet would be a huge improvement, even if it were done with Nagios instead of Prometheus, of course.
Right now the Nagios server is actually running Icinga 1.13, a Nagios fork, on a Hetzner machine (hetzner-hel1-01). It's doing its job generally well, although it feels a little noisy, but that's to be expected from Nagios servers. Reducing the number of alerts seems to be an objective, explicitly documented in #29410 (closed), for example.
Both Grafana and Prometheus can do alerting, with various mechanisms and plugins. I haven't investigated those deeply, but in general that's not the hard part of alerting: you fire some script or API call and the rest happens. I suspect we could port the current Nagios alerting scripts to Prometheus fairly easily, although I haven't investigated our scripts in detail.
The problem is reproducing the check scripts and their associated alert thresholds. In the Nagios world, when a check is installed, it comes with its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has developed a wide variety of such checks. According to the current Nagios dashboard, it monitors 4612 services on 88 hosts (which is interesting considering LDAP thinks there are 78). That looks terrifying, but it's actually a set of 9 commands running on the Nagios server, including the complex check_nrpe system, which is basically a client-side Nagios that has its own set of checks. And that's where the "cardinality explosion" happens: on a typical host, there are 315 such checks implemented.
That's the hard part: converting those 324 checks into Prometheus alerts, one at a time. Unfortunately, there are no "built-in" or even third-party Prometheus alert sets that I could find in my original research, although that might have changed in the last year.
Each check in Prometheus is basically an alerting rule: a YAML snippet describing a Prometheus query that fires an alert when it evaluates to true (e.g. disk usage above 90%). It's not impossible to do that conversion, it's just a lot of work.
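To give an idea of what that looks like, here is a minimal sketch of such a rule, assuming the node exporter metrics from the #29681 setup; the group name, threshold and timings are invented for the example:

```yaml
groups:
  - name: disk
    rules:
      - alert: DiskSpaceLow
        # fire when a filesystem has been more than 90% full for 15 minutes,
        # based on the node exporter's filesystem metrics
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
                     / node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}) > 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} is over 90% full"
```

Such a file gets listed under rule_files in prometheus.yml, and firing alerts are handed off to the Alertmanager, which takes care of deduplication, grouping and notifications.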
To do this progressively, while allowing us to create new alerts in Prometheus instead of Nagios, I suggest we proceed the same way Cloudflare did: establish a "Nagios to Prometheus" bridge, by which Nagios doesn't send alerts on its own and instead forwards them to the Prometheus Alertmanager, through a plugin they called Promsaint.
With the bridge in place, Nagios checks can be migrated into Prometheus alerts progressively without disruption. Note that Cloudflare documented their experience with Prometheus in this 2017 PromCon talk. Cloudflare also made an alert dashboard called unsee (see also the fork called karma) and an Elasticsearch integration, which might be good to investigate further.
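To make the bridge idea a little more concrete, here is a rough sketch of what the Alertmanager side could look like, assuming the bridge tags forwarded alerts with a source="nagios" label; the label name, addresses and SMTP settings are placeholders, not something Promsaint actually mandates:

```yaml
# alertmanager.yml (sketch)
global:
  smtp_smarthost: "localhost:25"         # assuming a local MTA
  smtp_from: "alertmanager@example.org"  # placeholder address
  smtp_require_tls: false

route:
  receiver: admins
  routes:
    # alerts forwarded from Nagios by the bridge, assuming it adds a
    # source="nagios" label; for now they go to the same receiver
    - match:
        source: nagios
      receiver: admins

receivers:
  - name: admins
    email_configs:
      - to: "admins@example.org"         # placeholder address
```

The interesting property is that notification routing, deduplication and silences would then live in one place for both the bridged Nagios alerts and the native Prometheus ones.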
Another useful piece is this NRPE to Prometheus exporter, which allows Prometheus to directly scrape NRPE targets. It doesn't include Prometheus alerts and instead relies on a Grafana dashboard to show possible problems, so I don't think it's that useful as a full alternative on its own. There's a similar approach using check_mk instead.
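For reference, scraping NRPE through that exporter follows the same exporter-as-proxy pattern as the blackbox exporter: Prometheus asks the exporter to run a given NRPE command against a target host. A sketch, assuming the exporter's documented /export endpoint and default port, with placeholder command and host names:

```yaml
scrape_configs:
  - job_name: nrpe_check_load
    metrics_path: /export
    params:
      command: [check_load]              # NRPE command to run on the target
    static_configs:
      - targets:
          - "some-host.torproject.org:5666"  # NRPE daemon on the monitored host
    relabel_configs:
      # standard proxy relabeling: the real target becomes a URL parameter,
      # and the scrape itself goes to the nrpe_exporter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "127.0.0.1:9275"    # assumed nrpe_exporter address and port
```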
Another possible approach is to send alerts from Nagios based on Prometheus checks, using the Prometheus Nagios plugins. This might allow us to get rid of NRPE everywhere, but it would probably only be useful if we want to keep Nagios in the long term, replacing NRPE with the existing Prometheus exporters.
So, the battle plan is basically this:
- apt install prometheus-alertmanager
- reimplement the Nagios alerting commands
- send Nagios alerts through the alertmanager
- rewrite (non-NRPE) commands (9) as Prometheus alerts (see the sketch after this list)
- optionally, scrape the NRPE metrics from Prometheus
- optionally, create a dashboard and/or alerts for the NRPE metrics
- rewrite NRPE commands (300+) as Prometheus alerts
- turn off the Nagios server
- remove all traces of NRPE on all nodes
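As an illustration of the "rewrite commands as Prometheus alerts" steps, a Nagios-style host-down check could translate into something like this, a sketch assuming every host is already scraped by a node exporter job from the #29681 setup (the job name, duration and severity are placeholders):

```yaml
groups:
  - name: availability
    rules:
      - alert: HostDown
        # roughly the equivalent of a Nagios host check: the node exporter
        # on this instance has not been scrapeable for 5 minutes
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is unreachable"
```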
Update: this will obviously require more discussion than just implementing the above battle plan, as there isn't consensus in the team on Prometheus as a replacement for Icinga. I have assigned TPA-RFC-33 to this and started drafting requirements and personas in #40755 (closed).