it looks like the only affected instance is gitlab-dev-01 so that's not very critical, unless @micah was doing work on there...
i can take a look tomorrow. i think we need OOB access, and that requires the ipsec + java applet setup from hell, which i don't believe we ever got working on your machines...
can we bring up gitlab-dev-01 on its secondary node, or will there be DRBD sync issues?
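one way to answer the sync question before failing over is to check the DRBD disk states on the surviving node, e.g. in `/proc/drbd`. here's a minimal parsing sketch -- the sample output below is illustrative, not captured from chi-node-11:

```python
import re

def drbd_sync_problems(proc_drbd_text):
    """Return the DRBD minors whose disk state is not
    UpToDate/UpToDate, i.e. devices that could be stale if we
    promote the secondary now."""
    suspect = []
    for line in proc_drbd_text.splitlines():
        m = re.match(r"\s*(\d+): cs:(\S+) ro:(\S+) ds:(\S+)", line)
        if not m:
            continue
        minor, cs, ro, ds = m.groups()
        if ds != "UpToDate/UpToDate":
            suspect.append((int(minor), cs, ds))
    return suspect

# illustrative /proc/drbd output (DRBD 8.4 format), NOT real cluster state
sample = """\
version: 8.4.10 (api:1/proto:86-101)
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
"""
print(drbd_sync_problems(sample))
```

anything not reporting `ds:UpToDate/UpToDate` would risk promoting a stale copy, so those disks would need a resync (or acceptance of data loss) first.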
yeah, i never got strongswan working on this machine, and i'm working from a pretty unstable connection right now
ssh: connect to host 172.30.140.11 port 22: Connection refused
ports 80 and 443 are open, but the former redirects to the latter (which is good) and the latter never manages to negotiate a TLS connection (which is not).
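for the record, those probes can be scripted with the python stdlib if we want to re-check the box periodically. the host and ports are from the messages above; everything else here is just a sketch:

```python
import socket
import ssl

def tcp_open(host, port, timeout=3):
    """True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def tls_ok(host, port=443, timeout=3):
    """True if a TLS handshake completes (cert validity ignored,
    we only care whether negotiation finishes at all)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

# usage, matching the symptoms reported above:
#   tcp_open("172.30.140.11", 22)   # ssh refused
#   tcp_open("172.30.140.11", 443)  # port open
#   tls_ok("172.30.140.11")         # handshake never completes
```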
so i'll declare this box dead. there's only one instance running there, so i'll just fail over and mark the node as offline.
because @kez said the box was stuck at the LUKS password prompt and failed to boot, i consider it safe to fail over, as it won't be running VMs in the background. this is not a networking issue.
root@chi-node-01:~# gnt-node modify --offline=yes chi-node-11.torproject.org
Fri Feb 10 02:31:52 2023 - WARNING: Communication failure to node 9d2280f4-fd97-4c9e-8ab3-6d721a5bc155: Error 7: Failed to connect to 38.229.82.114 port 1811: No route to host
Fri Feb 10 02:31:55 2023 Failed to stop KVM daemon on node 'chi-node-11.torproject.org': Node is marked offline
Modified node chi-node-11.torproject.org
 - master_candidate -> False
 - offline -> True
root@chi-node-01:~#
root@chi-node-01:~# gnt-node failover chi-node-11.torproject.org
Fail over instance(s) 'gitlab-dev-01.torproject.org'?
y/[n]/?: y
Submitted jobs 483062
Waiting for job 483062 for gitlab-dev-01.torproject.org ...
Fri Feb 10 02:34:02 2023 - INFO: Selected nodes for instance gitlab-dev-01.torproject.org via iallocator hail: chi-node-10.torproject.org
Fri Feb 10 02:34:02 2023 - INFO: Not checking memory on the secondary node as instance will not be started
Fri Feb 10 02:34:02 2023 Failover instance gitlab-dev-01.torproject.org
Fri Feb 10 02:34:02 2023 * not checking disk consistency as instance is not running
Fri Feb 10 02:34:02 2023 * shutting down instance on source node
Fri Feb 10 02:34:02 2023 - WARNING: Could not shutdown instance gitlab-dev-01.torproject.org on node chi-node-11.torproject.org, proceeding anyway; please make sure node chi-node-11.torproject.org is down; error details: Node is marked offline
Fri Feb 10 02:34:02 2023 * closing instance disks on node chi-node-11.torproject.org
Fri Feb 10 02:34:02 2023 - WARNING: Could not close instance disks on node chi-node-11.torproject.org, proceeding anyway
Fri Feb 10 02:34:02 2023 * deactivating the instance's disks on source node
Fri Feb 10 02:34:02 2023 - WARNING: Could not shutdown block device disk/0 on node chi-node-11.torproject.org: Node is marked offline
All 1 instance(s) failed over successfully.
root@chi-node-01:~#
we're going to retire this cluster soon, hopefully, so this ticket can probably just be closed. in any case, there's plenty of capacity left in the other nodes to pick up the slack here.
I had a look this morning and chi-node-11 is actually also used as a secondary for survey-01 and metrics-psqlts-01, so it's not sufficient to just fail over the primary instance and call it a day...
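to catch cases like this, the full placement can be pulled with `gnt-instance list -o name,pnode,snodes --no-headers` and filtered for the dead node. a quick parsing sketch -- the listing below is mocked up for illustration, not real cluster state:

```python
def instances_using_node(listing, node):
    """Parse `gnt-instance list -o name,pnode,snodes --no-headers`
    style output and return (as_primary, as_secondary) lists for
    the given node."""
    primary, secondary = [], []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, pnode = fields[0], fields[1]
        # snodes is comma-separated and may be empty for plain disks
        snodes = fields[2].split(",") if len(fields) > 2 else []
        if pnode == node:
            primary.append(name)
        if node in snodes:
            secondary.append(name)
    return primary, secondary

# mocked listing; real placement should come from the cluster itself
listing = """\
gitlab-dev-01.torproject.org     chi-node-10.torproject.org chi-node-11.torproject.org
survey-01.torproject.org         chi-node-10.torproject.org chi-node-11.torproject.org
metrics-psqlts-01.torproject.org chi-node-09.torproject.org chi-node-11.torproject.org
"""
print(instances_using_node(listing, "chi-node-11.torproject.org"))
```

the instances still referencing the node as secondary could then be moved off with something like `gnt-instance replace-disks -n <new-node> <instance>` before retiring the node for good.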
At this point we should either decide to have Cymru attempt to reset the machine, or go ahead with retirement. It seems like all signs are pointing to the latter, but I'd like @anarcat to sign off on that if possible.