TPA-RFC-33: monitoring system upgrade or replacement

in #29864 (closed), we went pretty deep into comparing prometheus and icinga, and into how the former could replace the latter.

but now we're stuck at "i like this one better than the other" because we don't have a clear set of requirements.

the task here is to write a set of requirements for the new alerting system and, ultimately, make a proposal for the replacement of the deprecated Icinga 1 deployment we have now.

- establish requirements
- approve requirements
- if replacing icinga:
  - review #29864 (closed) for ideas and tasks
  - decide whether we keep the prometheus1/2 distinction
  - draft specification of all components, personas, etc, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring
- if keeping icinga:
  - ~~review work from @weasel done on DSA's Puppet/Icinga integration~~
  - ~~deploy that module or another icinga module inside puppet~~
  - ~~rewrite all the checks from the nagios-master.cfg file into puppet (300+)~~
  - ~~rebuild a new Icinga 2 server~~
  - ~~retire the old Icinga 1 server~~

current status: awaiting adoption on June 12th.

update: now tracked in the milestone %"TPA-RFC-33-A: emergency Icinga retirement" and the ones after it.

Edited Sep 19, 2024 by anarcat
