grafana1, prometheus1 and karma1 are down
First diagnostic
Grafana, prometheus and karma are currently unresponsive on prometheus1, returning a 500 error.
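As a rough illustration (not the exact commands used during the incident), the outage can be confirmed from the outside with something like the following; the URLs and health endpoints are placeholders and would need to be adjusted to the real vhosts:

```sh
# Probe each service's health endpoint and print the HTTP status code.
# URLs are placeholders; a 500 here means the web frontend answers but the
# backend service behind it is down.
for url in \
    https://grafana.example.org/api/health \
    https://prometheus.example.org/-/healthy \
    https://karma.example.org/health; do
    printf '%s -> ' "$url"
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
done
```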
Monitoring didn't pick it up
Logs and dumps, click to expand
Current status
Roles
Next steps
- make a volume to restore prom1 (@anarcat)
- make a prom3 in some ganeti cluster
- do a basic bootstrap without any config in puppet
- rsync prometheus and grafana data from prom1 to prom3 (see the sketch after this list)
- configure it in puppet correctly so it starts scraping
- add monitoring role to prom3
- also add prom3's IPs to firewall permissions for monitoring (this seems to be automatically managed)
- make sure it works alright
- stop prometheus on prom3
- rsync again
- start prometheus on prom3
- make a full backup of prometheus1.tpo and then shut down the host
- provided that everything is still working properly, retire prom1 after at least two weeks, see #42413 (closed)
- delete the volume, see #42413 (closed)
- analyze metrics to see if we can reduce the churn rate (see the churn query sketch after this list)
- move ahead with the prometheus server merge and HA scenario (AKA refactor tpa-rfc-33 phases B and C)
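A minimal sketch of the rsync steps above, assuming the Debian packages' default data directories (/var/lib/prometheus/metrics2 for Prometheus, /var/lib/grafana for Grafana) and the usual service names; paths, hostnames and options are assumptions to be checked against the real setup:

```sh
# First pass, run from prom3 while prom1 is still up: copy the bulk of
# the data.  Paths and the source hostname are assumptions.
rsync -aHAX prometheus1.torproject.org:/var/lib/prometheus/metrics2/ \
    /var/lib/prometheus/metrics2/
rsync -aHAX prometheus1.torproject.org:/var/lib/grafana/ /var/lib/grafana/

# Second pass: with the services stopped on both ends the TSDB is
# quiescent, so only the delta is transferred, then prom3 takes over.
systemctl stop prometheus grafana-server
rsync -aHAX --delete prometheus1.torproject.org:/var/lib/prometheus/metrics2/ \
    /var/lib/prometheus/metrics2/
rsync -aHAX --delete prometheus1.torproject.org:/var/lib/grafana/ /var/lib/grafana/
chown -R prometheus: /var/lib/prometheus/metrics2
chown -R grafana: /var/lib/grafana
systemctl start prometheus grafana-server
```

The two-pass approach keeps the downtime window limited to the second, much smaller rsync.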
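For the churn analysis item, one possible starting point is the TSDB's own self-monitoring metrics; this is only a sketch, assuming the Prometheus API is reachable locally and the standard prometheus_tsdb_head_* metrics are scraped:

```sh
# New series created per second in the head block, averaged over a day:
# a rough proxy for churn.
curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=rate(prometheus_tsdb_head_series_created_total[1d])'

# Top 10 metric names by series count, to find the biggest contributors.
curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'
```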
Dashboards
Grafana dashboards are of course of limited use for this incident, since the out-of-disk error started over the weekend and wasn't noticed for a while.
But see the disk usage of the monitoring servers over the last 30 days:
And the last 90 days:
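Those graphs presumably boil down to a node exporter filesystem query along these lines; the instance regex and mountpoint are guesses and need adapting to the real target labels:

```sh
# Percent of disk used on a monitoring host's root filesystem, from node
# exporter metrics; instance regex and mountpoint are assumptions.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
    'query=100 * (1 - node_filesystem_avail_bytes{instance=~"prometheus.*",mountpoint="/"} / node_filesystem_size_bytes{instance=~"prometheus.*",mountpoint="/"})'
```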
Post-mortem
Detailed post-mortem to fill in later, click to expand
- Affected users:
- Duration:
- Status page link:
- Report Status: not started
Timeline
Root cause analysis
The disk on prometheus1 filled up, causing trouble for all running services and preventing the storage of any new metrics.
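As a generic illustration of how such a failure is usually confirmed (not the exact commands run during the incident), assuming the Debian packages' default data directories:

```sh
# Which filesystem is full?
df -h /var/lib/prometheus /var/lib/grafana /var/log

# How much of it is the Prometheus TSDB versus everything else?
# Paths assume the Debian packages' default layout.
du -sh /var/lib/prometheus/metrics2 /var/lib/grafana 2>/dev/null | sort -h
```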
What went well?
What could have gone better?
Recommendations and related issues
