Closed
Milestone
Jun 12, 2024–Sep 30, 2024
TPA-RFC-33-A: emergency Icinga retirement
Quote from TPA-RFC-33:
In this phase we prioritize emergency work to replace core components of the Icinga server, so the machine can be retired.
Those are the tasks required here:
- deploy Alertmanager and email notifications on `prometheus1` (team#41630 (closed))
- deploy alertmanager-irc-relay on `prometheus1` (team#41631 (closed))
- deploy blackbox exporter on `prometheus1` (team#41632 (closed))
- priority A metrics and alerts deployment (team#41633 (closed))
- Icinga server retirement (team#41634 (closed))
- deploy Karma on `prometheus1` (team#41640 (closed))
We're hoping to start this work in June and finish by August or September 2024.
Followed by %TPA-RFC-33-B: Prometheus server merge, more exporters.
Unstarted Issues (open and unassigned): 0
Ongoing Issues (open and assigned): 0
Completed Issues (closed): 23
- TPA team · needrestart prometheus check missed bacula-director-01 server
- TPA team · audit last year's nagios notifications for proper coverage in Prometheus
- TPA team · prometheus query link broken in karma
- TPA team · hide duplicate alert groups in Karma
- TPA team · monitor legacy postgresql backup system in prometheus
- TPA team · replace dsa-update-apt-status with another cron job
- TPA team · train TPA team on new monitoring system
- TPA team · apache exporter failing on donate-01, relay-01, and weather-01
- TPA team · TPA-RFC-67: retire mininag
- TPA team · Prometheus alert JobDown gets routed through the fallback -- this needs to be fixed
- TPA team · re-audit and verify prometheus configuration and roadmap matches icinga
- TPA team · mtail job on rdsys-test-01 falling through default route
- TPA team · deploy karma monitoring dashboard
- TPA team · deploy alertmanager-irc-relay on `prometheus1`
- TPA team · expose alertmanager
- TPA team · priority A metrics and alerts deployment
- TPA team · deploy Alertmanager and email notifications on prometheus1
- TPA team · prometheus node exporter conflicts with dsa-update-apt-status
- TPA team · retire hetzner-hel1-01 (nagios/icinga)
- TPA team · add #tor-alerts to the Matrix bridge and space
- TPA team · Icinga server retirement
- TPA team · deploy blackbox exporter on `prometheus1`
- prometheus-alerts · DjangoExceptions alerts mysteriously failed to send notifications