TPA-RFC-33: monitoring system upgrade or replacement
in #29864 (closed), we've gone pretty deep into comparing prometheus and icinga, and into how the former could replace the latter.
but now we're stuck at "i like this one better than the other" because we don't have a clear set of requirements.
the task here is to write a set of requirements for the new alerting system and, ultimately, make a proposal for the replacement of the deprecated Icinga 1 deployment we have now.
- establish requirements
- approve requirements
- if replacing icinga:
  - review #29864 (closed) for ideas and tasks
  - decide whether we keep the prometheus1/2 distinction
  - deploy alert manager on prometheus1
  - reimplement the Nagios alerting commands (optional?)
  - send Nagios alerts through the alertmanager (optional?)
  - rewrite (non-NRPE) commands (9) as Prometheus alerts
  - scrape the NRPE metrics from Prometheus (optional)
  - create a dashboard and/or alerts for the NRPE metrics (optional)
  - review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
  - turn off the Icinga server
  - remove all traces of NRPE on all nodes
- if keeping icinga:
  - review work from @weasel done on DSA's Puppet/Icinga integration
  - deploy that module or another icinga module inside puppet
  - rewrite all the checks from the nagios-master.cfg file into puppet (300+)
  - rebuild a new Icinga 2 server
  - retire the old Icinga 1 server
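for the optional "send Nagios alerts through the alertmanager" step, here is a rough sketch of what a Nagios notification command could do, assuming it posts to Alertmanager's v2 HTTP API (`/api/v2/alerts`). the label names, the severity mapping, the function names and the URL are all made up for illustration; none of this is decided.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical mapping from Nagios service states to a "severity" label.
SEVERITY = {"WARNING": "warning", "CRITICAL": "critical", "UNKNOWN": "unknown"}


def nagios_to_alertmanager(host, service, state, output, now=None):
    """Build the JSON body for one alert from Nagios notification fields.

    The arguments correspond to the Nagios macros $HOSTNAME$, $SERVICEDESC$,
    $SERVICESTATE$ and $SERVICEOUTPUT$.
    """
    now = now or datetime.now(timezone.utc)
    alert = {
        "labels": {
            "alertname": service,  # assumption: one alert name per service check
            "instance": host,
            "severity": SEVERITY.get(state, "none"),
            "source": "nagios",  # hypothetical label to tell these alerts apart
        },
        "annotations": {"summary": output},
        "startsAt": now.isoformat(),
    }
    if state == "OK":
        # An endsAt that is not in the future tells Alertmanager the alert is over.
        alert["endsAt"] = now.isoformat()
    # The API expects a JSON array of alerts, even for a single one.
    return json.dumps([alert])


def send(body, url="http://localhost:9093/api/v2/alerts"):
    """POST the body to Alertmanager; the URL is a placeholder."""
    req = urllib.request.Request(
        url,
        data=body.encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

a notification command would then call something like `send(nagios_to_alertmanager(host, service, state, output))` with the values Nagios passes in. keeping the payload builder separate from the HTTP call makes it possible to check the mapping without a running Alertmanager.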
Edited by anarcat