load warnings on gnt-fsn: migrate some VMs to gnt-chi?
in the last week, we've had a few warnings from nagios about load being too high in the gnt-fsn cluster, particularly on fsn-node-0[12]:
2020-11-12 13:57:05 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 27.93, 28.07, 22.95
2020-11-12 14:57:03 <nsa> tor-nagios: [fsn-node-01] load is OK: OK - load average: 23.70, 24.29, 25.25
2020-11-12 16:22:08 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 42.81, 38.54, 35.18
2020-11-17 12:31:05 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 23.08, 38.64, 37.58
2020-11-17 13:46:05 <nsa> tor-nagios: [fsn-node-01] load is OK: OK - load average: 26.70, 27.10, 25.82
2020-11-17 14:11:05 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 25.37, 25.99, 27.78
2020-11-18 03:50:04 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 30.22, 34.05, 30.19
2020-11-18 04:49:59 <nsa> tor-nagios: [fsn-node-01] load is OK: OK - load average: 26.40, 22.63, 23.75
2020-11-18 05:15:04 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 23.39, 28.03, 28.51
2020-11-18 08:00:09 <nsa> tor-nagios: [fsn-node-01] load is OK: OK - load average: 2.99, 9.16, 18.38
2020-11-19 04:06:12 <nsa> tor-nagios: [fsn-node-02] load is WARNING: WARNING - load average: 38.44, 35.68, 30.18
2020-11-19 04:21:12 <nsa> tor-nagios: [fsn-node-02] load is OK: OK - load average: 11.93, 15.21, 20.84
It might be worth trying to figure out what, exactly, in there is causing those load spikes (see grafana or related nagios warnings) and move some of that stuff to the other ganeti cluster.
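As a starting point for that investigation, here is a minimal sketch that tallies the nagios lines quoted above per node. It assumes the lines keep exactly the IRC relay format shown in this issue; the `summarize` function and `LINE_RE` pattern are hypothetical names made up for illustration, not part of any existing tooling. It only counts WARNING events and records the highest 1-minute load, so it says which node is worst, not which VM is responsible.

```python
import re
from collections import defaultdict

# Matches the relayed nagios lines as pasted in this issue, e.g.
# "... [fsn-node-01] load is WARNING: WARNING - load average: 27.93, 28.07, 22.95"
# (assumption: the format does not vary between checks).
LINE_RE = re.compile(
    r"\[(?P<host>[^\]]+)\] load is (?P<state>\w+): .*?"
    r"load average: (?P<l1>[\d.]+), (?P<l5>[\d.]+), (?P<l15>[\d.]+)"
)


def summarize(lines):
    """Return {host: {"warnings": int, "peak_1min": float}} for matching lines."""
    stats = defaultdict(lambda: {"warnings": 0, "peak_1min": 0.0})
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # skip anything that is not a load check line
        entry = stats[m["host"]]
        if m["state"] == "WARNING":
            entry["warnings"] += 1
        entry["peak_1min"] = max(entry["peak_1min"], float(m["l1"]))
    return dict(stats)


# Three lines copied from the log excerpt in this issue.
sample = [
    "2020-11-12 16:22:08 <nsa> tor-nagios: [fsn-node-01] load is WARNING: WARNING - load average: 42.81, 38.54, 35.18",
    "2020-11-18 08:00:09 <nsa> tor-nagios: [fsn-node-01] load is OK: OK - load average: 2.99, 9.16, 18.38",
    "2020-11-19 04:06:12 <nsa> tor-nagios: [fsn-node-02] load is WARNING: WARNING - load average: 38.44, 35.68, 30.18",
]

for host, s in sorted(summarize(sample).items()):
    print(host, s["warnings"], s["peak_1min"])
```

Feeding it a longer IRC log would show whether the warnings cluster on one node or at particular times of day, which can then be cross-checked against grafana.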
machines to move:
- onionoo-backend-02.torproject.org (maybe get the new metrics service admins to rebuild one of those from scratch?)
- onionoo-frontend-02.torproject.org (rebuild from scratch?)
- build-x86-12.torproject.org (we already have build-x86-11.torproject.org - maybe rebuild from scratch too?) - moved to #40135 (closed)
those instances will require extra storage, so blocked on #40131 (closed) (update: iSCSI cluster working well enough for those to start):
- tb-build-02 - redundant with tb-build-01 (rebuild from scratch?) #40198 (closed)
- web-fsn-02 - same with web-fsn-01 (although maybe just retire and rebuild as web-chi-03?) moved to #40193 (closed)
tb-build-02 would be particularly nice to migrate, as i suspect it's causing load warnings on fsn-node-03 right now.
Edited by anarcat