it looks like the only affected instance is gitlab-dev-01 so that's not very critical, unless @micah was doing work on there...
i can take a look tomorrow. i think we need OOB access, and that requires the ipsec + java applet setup from hell, which i don't believe we ever got working on your machines...
can we bring up gitlab-dev-01 on its secondary node, or will there be DRBD sync issues?
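one way to answer the sync question before failing over is to check the DRBD disk states on the surviving node, e.g. in `/proc/drbd`. here's a minimal parsing sketch -- the sample output below is illustrative, not captured from chi-node-11:

```python
import re

def drbd_sync_problems(proc_drbd_text):
    """Return the DRBD minors whose disk state is not
    UpToDate/UpToDate, i.e. devices that could be stale if we
    promote the secondary now."""
    suspect = []
    for line in proc_drbd_text.splitlines():
        m = re.match(r"\s*(\d+): cs:(\S+) ro:(\S+) ds:(\S+)", line)
        if not m:
            continue
        minor, cs, ro, ds = m.groups()
        if ds != "UpToDate/UpToDate":
            suspect.append((int(minor), cs, ds))
    return suspect

# illustrative /proc/drbd output (DRBD 8.4 format), NOT real cluster state
sample = """\
version: 8.4.10 (api:1/proto:86-101)
 0: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
 1: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/DUnknown C r-----
"""
print(drbd_sync_problems(sample))
```

anything not reporting `ds:UpToDate/UpToDate` would risk promoting a stale copy, so those disks would need a resync (or acceptance of data loss) first.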
yeah, i never got strongswan working on this machine, and i'm working from a pretty unstable connection right now
ssh: connect to host 172.30.140.11 port 22: Connection refused
ports 80 and 443 are open, but the former redirects to the latter (which is good) and the latter never manages to negotiate a TLS connection (which is not).
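for the record, those probes can be scripted with the python stdlib if we want to re-check the box periodically. the host and ports are from the messages above; everything else here is just a sketch:

```python
import socket
import ssl

def tcp_open(host, port, timeout=3):
    """True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def tls_ok(host, port=443, timeout=3):
    """True if a TLS handshake completes (cert validity ignored,
    we only care whether negotiation finishes at all)."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except (OSError, ssl.SSLError):
        return False

# usage, matching the symptoms reported above:
#   tcp_open("172.30.140.11", 22)   # ssh refused
#   tcp_open("172.30.140.11", 443)  # port open
#   tls_ok("172.30.140.11")         # handshake never completes
```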
so i'll declare this box dead. there's only one instance running there, so i'll just fail over and mark the node as offline.
because @kez said the box was stuck at the LUKS password prompt and failed to boot, i consider it safe to fail over, as it won't be running VMs in the background. this is not a networking issue.
root@chi-node-01:~# gnt-node modify --offline=yes chi-node-11.torproject.org
Fri Feb 10 02:31:52 2023 - WARNING: Communication failure to node 9d2280f4-fd97-4c9e-8ab3-6d721a5bc155: Error 7: Failed to connect to 38.229.82.114 port 1811: No route to host
Fri Feb 10 02:31:55 2023 Failed to stop KVM daemon on node 'chi-node-11.torproject.org': Node is marked offline
Modified node chi-node-11.torproject.org
 - master_candidate -> False
 - offline -> True
root@chi-node-01:~#
root@chi-node-01:~# gnt-node failover chi-node-11.torproject.org
Fail over instance(s) 'gitlab-dev-01.torproject.org'?
y/[n]/?: y
Submitted jobs 483062
Waiting for job 483062 for gitlab-dev-01.torproject.org ...
Fri Feb 10 02:34:02 2023 - INFO: Selected nodes for instance gitlab-dev-01.torproject.org via iallocator hail: chi-node-10.torproject.org
Fri Feb 10 02:34:02 2023 - INFO: Not checking memory on the secondary node as instance will not be started
Fri Feb 10 02:34:02 2023 Failover instance gitlab-dev-01.torproject.org
Fri Feb 10 02:34:02 2023 * not checking disk consistency as instance is not running
Fri Feb 10 02:34:02 2023 * shutting down instance on source node
Fri Feb 10 02:34:02 2023 - WARNING: Could not shutdown instance gitlab-dev-01.torproject.org on node chi-node-11.torproject.org, proceeding anyway; please make sure node chi-node-11.torproject.org is down; error details: Node is marked offline
Fri Feb 10 02:34:02 2023 * closing instance disks on node chi-node-11.torproject.org
Fri Feb 10 02:34:02 2023 - WARNING: Could not close instance disks on node chi-node-11.torproject.org, proceeding anyway
Fri Feb 10 02:34:02 2023 * deactivating the instance's disks on source node
Fri Feb 10 02:34:02 2023 - WARNING: Could not shutdown block device disk/0 on node chi-node-11.torproject.org: Node is marked offline
All 1 instance(s) failed over successfully.
root@chi-node-01:~#
we're going to retire this cluster soon, hopefully, so this ticket can probably just be closed. in any case, there's plenty of capacity left in the other nodes to pick up the slack here.
I had a look this morning and chi-node-11 is actually also used as a secondary for survey-01 and metrics-psqlts-01, so it's not sufficient to just fail over the primary instance and call it a day...
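to catch cases like this, the full placement can be pulled with `gnt-instance list -o name,pnode,snodes --no-headers` and filtered for the dead node. a quick parsing sketch -- the listing below is mocked up for illustration, not real cluster state:

```python
def instances_using_node(listing, node):
    """Parse `gnt-instance list -o name,pnode,snodes --no-headers`
    style output and return (as_primary, as_secondary) lists for
    the given node."""
    primary, secondary = [], []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, pnode = fields[0], fields[1]
        # snodes is comma-separated and may be empty for plain disks
        snodes = fields[2].split(",") if len(fields) > 2 else []
        if pnode == node:
            primary.append(name)
        if node in snodes:
            secondary.append(name)
    return primary, secondary

# mocked listing; real placement should come from the cluster itself
listing = """\
gitlab-dev-01.torproject.org     chi-node-10.torproject.org chi-node-11.torproject.org
survey-01.torproject.org         chi-node-10.torproject.org chi-node-11.torproject.org
metrics-psqlts-01.torproject.org chi-node-09.torproject.org chi-node-11.torproject.org
"""
print(instances_using_node(listing, "chi-node-11.torproject.org"))
```

the instances still referencing the node as secondary could then be moved off with something like `gnt-instance replace-disks -n <new-node> <instance>` before retiring the node for good.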
At this point we should either decide to have Cymru attempt to reset the machine, or go ahead with retirement. It seems like all signs are pointing to the latter, but I'd like @anarcat to sign off on that if possible.