live migrations fail in gnt-chi cluster
i'm still having difficulty doing live migration on the ganeti cluster. I just did this:
root@chi-node-01:~# gnt-instance migrate -f onionbalance-02
Tue Nov 30 15:18:38 2021 Migrating instance onionbalance-02.torproject.org
Tue Nov 30 15:18:38 2021 * checking disk consistency between source and target
Tue Nov 30 15:18:40 2021 * closing instance disks on node chi-node-09.torproject.org
Tue Nov 30 15:18:40 2021 * changing into standalone mode
Tue Nov 30 15:18:41 2021 * changing disks into dual-master mode
Tue Nov 30 15:18:43 2021 * wait until resync is done
Tue Nov 30 15:18:43 2021 * opening instance disks on node chi-node-08.torproject.org in shared mode
Tue Nov 30 15:18:44 2021 * opening instance disks on node chi-node-09.torproject.org in shared mode
Tue Nov 30 15:18:44 2021 * preparing chi-node-09.torproject.org to accept the instance
Tue Nov 30 15:18:44 2021 * migrating instance to chi-node-09.torproject.org
Tue Nov 30 15:18:44 2021 * starting memory transfer
Tue Nov 30 15:18:55 2021 * memory transfer progress: 58.84 %
Tue Nov 30 15:18:58 2021 Migration failed, aborting
Tue Nov 30 15:18:58 2021 * closing instance disks on node chi-node-09.torproject.org
Tue Nov 30 15:18:58 2021 * changing into standalone mode
Tue Nov 30 15:18:59 2021 * changing disks into single-master mode
Tue Nov 30 15:19:00 2021 * wait until resync is done
Failure: command execution error:
Could not migrate instance onionbalance-02.torproject.org: Failed to get migration status: Can't connect to qmp socket
rerunning the migration did work:
root@chi-node-01:~# gnt-instance migrate -f onionbalance-02
Tue Nov 30 15:22:55 2021 Migrating instance onionbalance-02.torproject.org
Tue Nov 30 15:22:55 2021 * checking disk consistency between source and target
Tue Nov 30 15:22:56 2021 * closing instance disks on node chi-node-09.torproject.org
Tue Nov 30 15:22:57 2021 * changing into standalone mode
Tue Nov 30 15:22:57 2021 * changing disks into dual-master mode
Tue Nov 30 15:22:59 2021 * wait until resync is done
Tue Nov 30 15:23:00 2021 * opening instance disks on node chi-node-08.torproject.org in shared mode
Tue Nov 30 15:23:00 2021 * opening instance disks on node chi-node-09.torproject.org in shared mode
Tue Nov 30 15:23:01 2021 * preparing chi-node-09.torproject.org to accept the instance
Tue Nov 30 15:23:01 2021 * migrating instance to chi-node-09.torproject.org
Tue Nov 30 15:23:01 2021 * starting memory transfer
Tue Nov 30 15:23:08 2021 * memory transfer complete
Tue Nov 30 15:23:08 2021 * closing instance disks on node chi-node-08.torproject.org
Tue Nov 30 15:23:09 2021 * wait until resync is done
Tue Nov 30 15:23:09 2021 * changing into standalone mode
Tue Nov 30 15:23:10 2021 * changing disks into single-master mode
Tue Nov 30 15:23:11 2021 * wait until resync is done
Tue Nov 30 15:23:12 2021 * done
... but then the instance doesn't respond to ping, even from the node itself. trying to connect to the console yields this mysterious message:
root@chi-node-01:~# gnt-instance console onionbalance-02
Instance onionbalance-02.torproject.org is paused, unpausing
... but no output. and bizarrely, disconnecting from the console and rerunning this still says "unpausing", so the "unpausing" doesn't actually seem to work. connecting directly over the socket on the node does nothing either:
root@chi-node-09:~# nc -U /var/run/ganeti/kvm-hypervisor/ctrl/onionbalance-02.torproject.org.serial
(no output.)
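since the console keeps claiming the instance is paused, one way to cross-check is to ask QEMU for the VM state over QMP instead of going through the serial socket. the following is a minimal sketch, not something Ganeti ships. it assumes the QMP socket sits next to the `.serial` socket with a `.qmp` suffix (to be verified on the node), and the helper names `qmp_socket_path` and `qmp_execute` are made up for illustration. the handshake (greeting, then `qmp_capabilities`) and the `query-status` command are standard QMP.

```python
import json
import socket

# Assumption: same directory as the .serial socket used above, with a .qmp
# suffix. Check the actual path on the node before relying on it.
CTRL_DIR = "/var/run/ganeti/kvm-hypervisor/ctrl"


def qmp_socket_path(instance):
    """Build the (assumed) QMP socket path for a Ganeti instance name."""
    return f"{CTRL_DIR}/{instance}.qmp"


def qmp_execute(sock_path, command, timeout=5.0):
    """Run a single QMP command and return the content of its "return" key."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(sock_path)
        with sock.makefile("r", encoding="utf-8") as reader:

            def read_message():
                # QMP sends one JSON object per line; asynchronous events
                # can show up between replies, so skip those.
                while True:
                    line = reader.readline()
                    if not line:
                        raise ConnectionError("QMP socket closed unexpectedly")
                    message = json.loads(line)
                    if "event" not in message:
                        return message

            def send(name):
                payload = json.dumps({"execute": name}) + "\n"
                sock.sendall(payload.encode("utf-8"))
                reply = read_message()
                if "error" in reply:
                    raise RuntimeError(f"QMP command {name} failed: {reply['error']}")
                return reply["return"]

            # The server speaks first, with a greeting containing a "QMP" key.
            greeting = read_message()
            if "QMP" not in greeting:
                raise RuntimeError(f"unexpected QMP greeting: {greeting}")
            # Commands are only accepted after capabilities negotiation.
            send("qmp_capabilities")
            return send(command)
```

on chi-node-09 this could be called as `qmp_execute(qmp_socket_path("onionbalance-02.torproject.org"), "query-status")`. the reply is a dictionary with a `status` field (for example `running` or `paused`) and a boolean `running` field. if the connection itself fails, that would match the "Can't connect to qmp socket" error from the first migration attempt. note that only one client can usually be attached to a QMP socket at a time, so this may interfere with Ganeti if run during a migration.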
i first thought this was because we were running different versions of QEMU on the different nodes, but now the cluster has been upgraded and all nodes run the same version, yet I still have problems.
this is also not the first VM to have this problem: i had migrated web-chi-03 to that node (chi-node-09) and it had the same problem, so it could be limited to that node.