We're migrating everything into Ganeti, so maybe there's some extra monitoring we could think about: Ganeti knows a lot more about its own internals than libvirt did. Or at least that's the feeling I get.
- we could have a Nagios plugin that checks for N+1 redundancy (that is, whether the cluster can still run every instance if any single node fails). Riseup has something like this
- we could have a Grafana dashboard that shows us the state of the cluster. We already have the main dashboard, which we can set to show only the Ganeti cluster
The current memory view looks roughly like this:
I'm not sure how we could improve this, but it seems to me having global (and/or per node?) memory, CPU, network and disk usage would be a great improvement as well.
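
To make the N+1 idea above a bit more concrete, here is a rough sketch of the core logic such a Nagios plugin might have. It is not Riseup's plugin, and it simplifies the check: for each node, it verifies that the free memory is enough to start all the instances that would fail over to it if any one other node died. The function names and the dictionary layout are made up for illustration. The input data is assumed to come from something like `gnt-node list -o name,mfree` and `gnt-instance list -o name,pnode,snodes,be/memory`, which is left out here.

```python
from collections import defaultdict

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def n_plus_one_failures(free_mem, instances):
    """Return a list of (secondary, failed_primary, needed, free) tuples.

    free_mem:  dict mapping node name -> free memory in MiB (assumed layout)
    instances: list of dicts with "primary", "secondaries" and "memory" keys
               (assumed layout, memory in MiB)
    """
    # Memory each secondary node must absorb if a given primary node fails.
    needed = defaultdict(int)
    for inst in instances:
        for sec in inst["secondaries"]:
            needed[(sec, inst["primary"])] += inst["memory"]

    failures = []
    for (sec, pri), mem in sorted(needed.items()):
        free = free_mem.get(sec, 0)
        if mem > free:
            failures.append((sec, pri, mem, free))
    return failures


def check(free_mem, instances):
    """Return a (exit_code, message) pair in the usual Nagios style."""
    failures = n_plus_one_failures(free_mem, instances)
    if failures:
        details = ", ".join(
            f"{sec} needs {mem} MiB if {pri} fails but has {free} MiB free"
            for sec, pri, mem, free in failures
        )
        return CRITICAL, "CRITICAL: N+1 not satisfied: " + details
    return OK, "OK: every node can absorb the failure of any single peer"


# Hypothetical two-node cluster, for illustration only.
example_free = {"node1": 4096, "node2": 1024}
example_instances = [
    {"name": "a", "primary": "node1", "secondaries": ["node2"], "memory": 2048},
    {"name": "b", "primary": "node2", "secondaries": ["node1"], "memory": 2048},
]

if __name__ == "__main__":
    code, message = check(example_free, example_instances)
    print(message)
    raise SystemExit(code)
```

A real plugin would need to parse the `gnt-*` output (or use the RAPI) and probably handle offline or drained nodes. It might also be simpler to just wrap `gnt-cluster verify`, which as far as I know already reports N+1 memory problems itself.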