onionoo-backend is killing the ganeti cluster
today i noticed that, since last friday (UTC) morning, there have been pretty big spikes on the internal network between the ganeti nodes, every hour. it looks like this, in grafana:
We can clearly see a correlation between the two nodes' traffic, in reverse. This was confirmed using tcpdump on the nodes during a surge.
It seems this is due to onionoo-backend-01 blasting the disk and CPU for some reason. These are the disk I/O graphs for that host, which correlate pretty cleanly with the above graphs:
This was confirmed by an inspection of DRBD, the mechanism that synchronizes the disks across the network. It seems there's a huge surge of "writes" on the network every hour which lasts anywhere between 20 and 30 minutes. This was (somewhat) confirmed by running:
watch -n 0.1 -d cat /proc/drbd
on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in DRBD. 13 and 17 are the web nodes, so that's expected - probably log writes? But device ID 4 is onionoo-backend, which is what led me to the big traffic graph.
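For whoever picks this up: eyeballing `watch` output is a bit fuzzy, so here is a rough sketch of how the per-device change could be quantified instead. It assumes the DRBD 8.x `/proc/drbd` layout (a `N: cs:...` line per device followed by a `ns:... nr:... dw:... dr:...` counter line, with counters in KiB). The `drbd_delta` function name and the snapshot file paths are made up for illustration, not something that exists on the nodes.

```shell
# Sketch only: compare two saved copies of /proc/drbd and print, for each
# device minor, how much the "ns" (network send) and "dw" (disk write)
# counters grew between the two snapshots.
# Output columns: minor  ns_delta  dw_delta
# Assumes the DRBD 8.x /proc/drbd text format; drbd_delta is a hypothetical name.
drbd_delta() {
    awk '
        # Device header line, e.g. " 4: cs:Connected ro:Primary/Secondary ..."
        /^ *[0-9]+: cs:/ { minor = $1; sub(/:$/, "", minor); next }
        # Counter line, e.g. "    ns:1000 nr:0 dw:5000 dr:672 ..."
        /^ *ns:/ {
            for (i = 1; i <= NF; i++) { split($i, kv, ":"); v[kv[1]] = kv[2] }
            if (FNR == NR) {
                # First file: remember the baseline counters.
                ns[minor] = v["ns"]; dw[minor] = v["dw"]
            } else {
                # Second file: print the growth since the baseline.
                print minor, v["ns"] - ns[minor], v["dw"] - dw[minor]
            }
        }
    ' "$1" "$2"
}
```

Something like `cat /proc/drbd > /tmp/drbd.before; sleep 60; cat /proc/drbd > /tmp/drbd.after; drbd_delta /tmp/drbd.before /tmp/drbd.after | sort -k2 -n` during a surge should then list the busiest devices last, which would make it easier to check whether device 4 really dominates.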
could someone from metrics investigate?
can i just turn off this machine altogether, considering it's basically trying to murder the cluster every hour? :)