review efficiency of the Ganeti cluster, particularly gnt-fsn
Let's review the provisioning ratio on the Ganeti cluster: how much do we over or under provision, and how much does this thing cost per VM anyway?
The idea is that we started down the project of hosting everything with Ganeti mostly as an experiment, with the belief it would give us better reliability. That probably is the case: we can reboot nodes without causing outages on instances (when only the nodes need an upgrade, which is often not the case, that said). We could probably survive a total server loss as well, for example. But we haven't thought about how many resources go into making sure that availability is around.
So here's the task list:
- evaluate the cost of hosting a single VM in the gnt-fsn cluster (or maybe per disk/memory/cpu unit? not sure how to evaluate this?)
- evaluate how much waste we have, for example:
  - how many CPUs are actually fully in use (say over a given 24h period or week?)
  - how much memory is fully in use (same)
  - how much disk is in use (probably just current snapshot)
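
As a starting point for the cost and waste questions above, here's a rough sketch of how the numbers could be put together. Everything in it is an assumption for illustration: the `Node` and `Instance` records, the `summarize` helper and the `disk_replicas` parameter are made up, and the inventory would have to come from the real cluster and billing data. It assumes DRBD-backed instances keep two copies of each disk, so allocated disk is counted twice by default.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One physical machine (hypothetical record, filled in by hand or by a script)."""
    name: str
    monthly_cost: float  # whatever currency the invoice uses
    cpus: int
    memory_gib: float
    disk_gib: float


@dataclass
class Instance:
    """One VM and the resources allocated to it."""
    name: str
    vcpus: int
    memory_gib: float
    disk_gib: float


def summarize(nodes, instances, disk_replicas=2):
    """Return cost per VM and allocated/physical ratios for the cluster.

    A ratio above 1.0 means over-provisioning, below 1.0 means spare capacity.
    disk_replicas=2 is an assumption: each disk stored on two nodes.
    """
    total_cost = sum(n.monthly_cost for n in nodes)
    phys_cpu = sum(n.cpus for n in nodes)
    phys_mem = sum(n.memory_gib for n in nodes)
    phys_disk = sum(n.disk_gib for n in nodes)

    alloc_cpu = sum(i.vcpus for i in instances)
    alloc_mem = sum(i.memory_gib for i in instances)
    alloc_disk = sum(i.disk_gib for i in instances) * disk_replicas

    return {
        "cost_per_vm": total_cost / len(instances),
        "cpu_ratio": alloc_cpu / phys_cpu,
        "memory_ratio": alloc_mem / phys_mem,
        "disk_ratio": alloc_disk / phys_disk,
    }
```

This only covers allocation, not actual usage: the "fully in use over 24h" numbers would need to come from monitoring data and could be compared against the same physical totals.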
I wonder if we should also evaluate performance overhead:
- how much is Qemu/KVM costing us in terms of raw processing power? maybe run a CPU-intensive benchmark in and out of a VM on an otherwise idle node
- same with disk: do we pay a big price for virtualized I/O?
- DRBD overhead: benchmark plain disk vs DRBD
- network overhead: benchmark local disk vs network disk (might be tough without also testing DRBD? we're mostly concerned about vswitch vs local switch performance, could we compare performance with gnt-chi here for example?)
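
For any of the benchmarks above, the comparison itself could be as simple as the sketch below. It assumes each benchmark is run several times in both setups and reports a throughput-style score where higher is better; `overhead_percent` is a hypothetical helper, not part of any existing tool, and it compares medians so that a single noisy run doesn't skew the result.

```python
from statistics import median


def overhead_percent(baseline_runs, virtualized_runs):
    """Percentage of throughput lost relative to the baseline.

    Both arguments are lists of scores from repeated runs (higher is better),
    e.g. bare metal vs. inside a VM, or plain disk vs. DRBD.
    A negative result means the second setup was actually faster.
    """
    base = median(baseline_runs)
    virt = median(virtualized_runs)
    return (base - virt) / base * 100.0
```

For benchmarks that report a duration instead (lower is better), the two arguments would need to be swapped or the formula inverted.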
The point here is that we're paying Hetzner a lot of money for a lot of rented metal (8 machines), instead of hosting everything in their cloud. That imposes a significant management cost, while at the same time giving us certain guarantees in terms of privacy and control. We need to seriously consider whether it's still worth hosting our own metal, and efficiency is certainly a big part of this.
Also related to #40163 (evaluate power usage).