TPA-RFC-33-B: Prometheus server merge, more exporters
Quote from TPA-RFC-33:
In this phase, we integrate more exporters and services in the infrastructure, which includes merging the second Prometheus server for the service admins.
We may retire the existing servers and build two new servers instead, but the more likely outcome is to progressively integrate the targets and alerting rules from prometheus2 into prometheus1 and then eventually retire prometheus2, rebuilding a copy of prometheus1 in its place.
Here are the required tasks:
- LDAP web password addition (userdir-ldap-cgi#1)
- new authentication deployment on prometheus1 (team#41636)
- cleanup prometheus-alerts: add CI check for team label and regroup alerts/targets by team (prometheus-alerts#15)
- prometheus2 merged into prometheus1 (team#41637)
- priority B metrics and alerts deployment (team#41639 (closed))
- self-monitoring: Prometheus scraping Alertmanager, dead man's switch in Karma (team#41641)
- inhibitions (team#41642 (closed))
- once prometheus1 has all the data from prometheus2, retire the latter (team#41638)
- autonomous delivery (team#41644)
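As an illustration of the CI check for the team label mentioned in the task list, here is a minimal sketch. It is not the check tracked in prometheus-alerts#15. It assumes rule files follow the standard Prometheus layout (`groups`, each with `rules`, where alerting rules carry an `alert` name and optional `labels`) and that they have already been parsed into Python dictionaries, for example with a YAML library. The function name, alert names and team values are hypothetical.

```python
def rules_missing_team_label(rule_file):
    """Return (group name, alert name) pairs for alerting rules
    that have no non-empty `team` label.

    `rule_file` is a Prometheus rule file already parsed into a dict.
    """
    missing = []
    for group in rule_file.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                # Recording rules have a `record` key instead; skip them.
                continue
            labels = rule.get("labels") or {}
            if not labels.get("team"):
                missing.append((group.get("name"), rule["alert"]))
    return missing


# Hypothetical sample data, for illustration only.
example_rules = {
    "groups": [
        {
            "name": "example",
            "rules": [
                {"alert": "ExampleDiskFull", "expr": "up == 0",
                 "labels": {"team": "TPA", "severity": "warning"}},
                {"alert": "ExampleNoTeam", "expr": "up == 0",
                 "labels": {"severity": "warning"}},
                {"record": "example:up:sum", "expr": "sum(up)"},
            ],
        }
    ]
}

if __name__ == "__main__":
    problems = rules_missing_team_label(example_rules)
    for group_name, alert_name in problems:
        print(f"{group_name}: alert {alert_name} has no team label")
    # A CI job would fail the pipeline when any rule is reported.
    raise SystemExit(1 if problems else 0)
```

In a CI job, a script like this would exit non-zero when any alerting rule lacks the label, which blocks the merge until the rule is assigned to a team.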
We hope to continue with this work promptly following phase A, in October 2024.
Follows %TPA-RFC-33-A: emergency Icinga retirement and followed by %TPA-RFC-33-C: Prometheus high availability, long term metrics, other exporters.
See also the kanban board.
Status as of January 2025: we are about 50% through the milestone. Much of the remaining Icinga legacy has been replaced, with a few key exceptions (mainly DNS, #41794). The remaining work is the merge and self-monitoring. We are holding off on this work until we have more stability and a plan for moving forward on the authentication merge (team#41839). The due date was moved from 2024-Q1 to 2025-Q3.