grafana1, prometheus1 and karma1 are down

First diagnostic

Grafana, Prometheus and Karma are currently unresponsive on prometheus1, returning a 500 error.

Monitoring didn't pick it up
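
This is the kind of failure a disk-fill prediction alert should catch before it happens. As a rough illustration only (not our actual alerting config: the server URL, lookback window and horizon below are all assumptions), the check amounts to asking Prometheus whether `predict_linear()` over node_exporter's free-space metric goes negative within a few days:

```python
#!/usr/bin/env python3
"""Sketch of a predictive disk-full check against the Prometheus HTTP API.
The server URL, 6h lookback and 4-day horizon are assumptions."""
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # assumed server URL
# extrapolate the last 6h of free-space samples 4 days into the future
QUERY = 'predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}[6h], 4 * 86400) < 0'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=30)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    print(f"{labels.get('instance')} {labels.get('mountpoint')} "
          "is predicted to run out of space within 4 days")
```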

Logs and dumps, click to expand

Current status

Roles

  • Command: @anarcat
  • Operations: @lelutin
  • Communications:
  • Planning:

Next steps

  1. make a volume to restore prom1 (@anarcat)
  2. make a prom3 in some ganeti cluster
  3. do a basic bootstrap without any config in puppet
  4. rsync prometheus and grafana data from prom1 to prom3 (see the sync sketch after this list)
  5. configure it in puppet correctly so it starts scraping
    1. add monitoring role to prom3
    2. also add prom3's IPs to the firewall rules for monitoring (this seems to be managed automatically)
  6. make sure it works alright
  7. stop prometheus on prom3
  8. rsync again
  9. start prometheus on prom3
  10. make a full backup of prometheus1.tpo and then shut down the host
  11. provided everything is still working properly after at least two weeks, retire prom1, see #42413 (closed)
  12. delete the volume, see #42413 (closed)
  13. analyze metrics to see if we can reduce the churn rate (see the query sketch after this list)
  14. move ahead with the prometheus server merge and HA scenario (AKA refactor tpa-rfc-33 phase B and C)
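
For steps 4 and 7–9, the data move is essentially one rsync pass while both sides are running, then a short stop / resync / start to catch the delta. A minimal sketch of that sequence, run on prom3, is below; the hostname, data directories and rsync options are assumptions, not a tested procedure:

```python
#!/usr/bin/env python3
"""Sketch of the copy / stop / final-sync / start sequence (steps 4 and 7-9).
Source hostname and data paths are assumptions; the real migration would be
driven by hand, this only shows the shape of it."""
import subprocess

SRC = "prometheus1.torproject.org"
# assumed data directories for prometheus and grafana
PATHS = ["/var/lib/prometheus/", "/var/lib/grafana/"]

def pull(path: str) -> None:
    # -aHAX preserves hardlinks, ACLs and xattrs; --delete keeps the copy exact
    subprocess.run(["rsync", "-aHAX", "--delete", f"{SRC}:{path}", path], check=True)

# step 4: first pass while everything is still running
for path in PATHS:
    pull(path)

# steps 7-9: stop prometheus locally, resync the delta, start it again
subprocess.run(["systemctl", "stop", "prometheus"], check=True)
for path in PATHS:
    pull(path)
subprocess.run(["systemctl", "start", "prometheus"], check=True)
```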

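For step 13, a first look at churn is usually the TSDB status endpoint, which lists the metric names holding the most series. A quick sketch, with the server URL as an assumption:

```python
#!/usr/bin/env python3
"""Sketch for step 13: list the metric names with the most head series,
which is usually where churn and disk growth come from."""
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # assumed URL for the rebuilt server

resp = requests.get(f"{PROMETHEUS}/api/v1/status/tsdb", timeout=30)
resp.raise_for_status()
stats = resp.json()["data"]

head = stats.get("headStats", {})
if "numSeries" in head:
    print("total head series:", head["numSeries"])
for entry in stats["seriesCountByMetricName"]:
    print(f"{entry['value']:>10}  {entry['name']}")
```
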
Dashboards

Grafana dashboards are of limited use for this incident, since the out-of-disk error started over the weekend and wasn't noticed for a while.

but see the disk usage of the monitoring servers over the last 30 days

and the last 90 days:

(image: disk usage of the monitoring servers over the last 90 days)

Post-mortem

Detailed post-mortem to fill in later, click to expand
  • Affected users:
  • Duration:
  • Status page link:
  • Report Status: not started

Timeline

Root cause analysis

The disk on prometheus1 filled up, causing trouble for all running services and preventing the storage of any new metrics.
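
To confirm where the space actually went before writing the final report, a quick du-style pass over the usual suspects is enough; the candidate paths below are assumptions about our layout:

```python
#!/usr/bin/env python3
"""Rough sketch for confirming what ate the disk on a monitoring host.
The candidate paths are assumptions about where the bulky data lives."""
import os

CANDIDATES = ["/var/lib/prometheus", "/var/lib/grafana", "/var/log"]

def tree_size(root: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # files can disappear while we walk
    return total

for path in CANDIDATES:
    if os.path.isdir(path):
        print(f"{tree_size(path) / 1e9:8.1f} GB  {path}")
```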

What went well?

What could have gone better?

Recommendations and related issues
