grafana1, prometheus1 and karma1 are down
First diagnostic
Grafana, prometheus and karma are currently unresponsive on prometheus1, returning a 500 error.
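As a rough illustration (not the exact commands used during the incident), the outage can be confirmed from the outside with something like the following; the URLs and health endpoints are placeholders and would need to be adjusted to the real vhosts:

```sh
# Probe each service's health endpoint and print the HTTP status code.
# URLs are placeholders; a 500 here means the web frontend answers but the
# backend service behind it is down.
for url in \
    https://grafana.example.org/api/health \
    https://prometheus.example.org/-/healthy \
    https://karma.example.org/health; do
    printf '%s -> ' "$url"
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
done
```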
Monitoring didn't pick it up
Logs and dumps, click to expand
Current status
Roles
Next steps
- make a volume to restore prom1 (@anarcat)
- make a prom3 in some ganeti cluster
- do a basic bootstrap without any config in puppet
- rsync prometheus and grafana data from prom1 to prom3 (see the sketch after this list)
- configure it in puppet correctly so it starts scraping
- add monitoring role to prom3
- also add prom3's IPs to firewall permissions for monitoring (this seems to be automatically managed)
- make sure it works alright
- stop prometheus on prom3
- rsync again
- start prometheus on prom3
- make a full backup of prometheus1.tpo and then shut down the host
- provided that everything is still working properly, retire prom1 after at least two weeks, see #42413 (closed)
- delete the volume, see #42413 (closed)
- analyze metrics to see if we can reduce the churn rate (see the churn query sketch after this list)
- move ahead with the prometheus server merge and HA scenario (AKA refactor tpa-rfc-33 phases B and C)
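A minimal sketch of the rsync steps above, assuming the Debian packages' default data directories (/var/lib/prometheus/metrics2 for Prometheus, /var/lib/grafana for Grafana) and the usual service names; paths, hostnames and options are assumptions to be checked against the real setup:

```sh
# First pass, run from prom3 while prom1 is still up: copy the bulk of
# the data.  Paths and the source hostname are assumptions.
rsync -aHAX prometheus1.torproject.org:/var/lib/prometheus/metrics2/ \
    /var/lib/prometheus/metrics2/
rsync -aHAX prometheus1.torproject.org:/var/lib/grafana/ /var/lib/grafana/

# Second pass: with the services stopped on both ends the TSDB is
# quiescent, so only the delta is transferred, then prom3 takes over.
systemctl stop prometheus grafana-server
rsync -aHAX --delete prometheus1.torproject.org:/var/lib/prometheus/metrics2/ \
    /var/lib/prometheus/metrics2/
rsync -aHAX --delete prometheus1.torproject.org:/var/lib/grafana/ /var/lib/grafana/
chown -R prometheus: /var/lib/prometheus/metrics2
chown -R grafana: /var/lib/grafana
systemctl start prometheus grafana-server
```

The two-pass approach keeps the downtime window limited to the second, much smaller rsync.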
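For the churn analysis item, one possible starting point is the TSDB's own self-monitoring metrics; this is only a sketch, assuming the Prometheus API is reachable locally and the standard prometheus_tsdb_head_* metrics are scraped:

```sh
# New series created per second in the head block, averaged over a day:
# a rough proxy for churn.
curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=rate(prometheus_tsdb_head_series_created_total[1d])'

# Top 10 metric names by series count, to find the biggest contributors.
curl -s 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=topk(10, count by (__name__) ({__name__=~".+"}))'
```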
Dashboards
Grafana dashboards are of course of limited use for this incident, since the out-of-disk error started over the weekend and wasn't noticed for a while.
But see the disk usage of the monitoring servers over the last 30 days:
And the last 90 days:
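Those graphs presumably boil down to a node exporter filesystem query along these lines; the instance regex and mountpoint are guesses and need adapting to the real target labels:

```sh
# Percent of disk used on a monitoring host's root filesystem, from node
# exporter metrics; instance regex and mountpoint are assumptions.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode \
    'query=100 * (1 - node_filesystem_avail_bytes{instance=~"prometheus.*",mountpoint="/"} / node_filesystem_size_bytes{instance=~"prometheus.*",mountpoint="/"})'
```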
Post-mortem
Detailed post-mortem to fill in later, click to expand
- Affected users:
- Duration:
- Status page link:
- Report Status: not started
Timeline
Root cause analysis
The disk on prometheus1 filled up, causing trouble for all running services and preventing the storage of any new metrics.
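As a generic illustration of how such a failure is usually confirmed (not the exact commands run during the incident), assuming the Debian packages' default data directories:

```sh
# Which filesystem is full?
df -h /var/lib/prometheus /var/lib/grafana /var/log

# How much of it is the Prometheus TSDB versus everything else?
# Paths assume the Debian packages' default layout.
du -sh /var/lib/prometheus/metrics2 /var/lib/grafana 2>/dev/null | sort -h
```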
What went well?
What could have gone better?
Recommendations and related issues
