major outage: kvm4 down, affected: eugeni (mail, lists), alberti (ldap), pauli (puppet), rouyi (jenkins), etc

During a security reboot today, kvm4.torproject.org did not return. All virtual machines on this host are down and unavailable.

According to the Nextcloud spreadsheet (consulted because LDAP is down), that includes:

| host | service | impact | mitigation |
|------|---------|--------|------------|
| alberti | LDAP, db.tpo | critical, no passwd change | read-only copies everywhere |
| build-x86-09 | buildbox | redundant | N/A |
| eugeni | incoming mail, lists | critical, total outage | peek at tor-puppet/modules/postfix/files/virtual and email people directly |
| meronense | metrics.tpo | critical, total outage | ? |
| neriniflorum | DNS | redundant, higher TTFB? | possible to remove from rotation |
| oo-hetzner-03 | onionoo | redundant | ? |
| pauli | puppet | major, no config management | use cumin, local git copies |
| rouyi | jenkins | critical, total outage | ? |
| web-hetzner-01 | web mirror | redundant, no effect? | removed from rotation automatically |
| weissi | build box | no windows builds | N/A |
| woronowii | build box | no windows builds | N/A |

I'll note that both Windows build boxes seem to be on the same machine, so even if Jenkins were able to dispatch builds, we wouldn't be able to run the Windows ones...
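
To help with triage, here is a minimal Python sketch that picks out the hosts listed above as critical with no known mitigation yet. The rows are copied from the host list in this ticket; the `AFFECTED` structure and the `unmitigated_critical` helper are hypothetical names made up for illustration, not part of any existing tooling.

```python
# Hypothetical triage helper: rows are copied by hand from the host list
# in this ticket, as (host, service, impact, mitigation).
AFFECTED = [
    ("alberti", "LDAP, db.tpo", "critical, no passwd change", "read-only copies everywhere"),
    ("build-x86-09", "buildbox", "redundant", "N/A"),
    ("eugeni", "incoming mail, lists", "critical, total outage",
     "peek at tor-puppet/modules/postfix/files/virtual and email people directly"),
    ("meronense", "metrics.tpo", "critical, total outage", "?"),
    ("neriniflorum", "DNS", "redundant, higher TTFB?", "possible to remove from rotation"),
    ("oo-hetzner-03", "onionoo", "redundant", "?"),
    ("pauli", "puppet", "major, no config management", "use cumin, local git copies"),
    ("rouyi", "jenkins", "critical, total outage", "?"),
    ("web-hetzner-01", "web mirror", "redundant, no effect?", "removed from rotation automatically"),
    ("weissi", "build box", "no windows builds", "N/A"),
    ("woronowii", "build box", "no windows builds", "N/A"),
]


def unmitigated_critical(rows):
    """Return hosts whose impact starts with 'critical' and whose
    mitigation is still unknown (recorded as '?')."""
    return [
        host
        for host, _service, impact, mitigation in rows
        if impact.startswith("critical") and mitigation == "?"
    ]


if __name__ == "__main__":
    for host in unmitigated_critical(AFFECTED):
        print(host)
```

Run as-is, this prints the hosts that still need a mitigation worked out; the `?` entries in the mitigation column are what drive the result, so it changes as those get filled in.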

A ticket was filed with Hetzner to try and rescue the server.

Our disaster recovery plan so far is to wait for that rescue to succeed, which might take up to 24h but hopefully less.

If that fails, I would suggest the following plan:

  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere (we need those three to build new machines)
  2. build a new ganeti cluster (because we can't recover all of this on gnt-fsn)
  3. restore remaining machines on the new cluster
  4. decommission kvm4 officially

This could take a few days of work. :(