kvm4 is getting fairly old. it was set up in 2015 and is showing signs of old age. for example, today it freaked us all out by not returning after a reboot right before the holidays (#32801 (moved)). considering how critical that server is (email, puppet, ldap, jenkins, dns, web mirror, all the windows buildboxes!), we should start considering a decommissioning process.
at the very least, we need to get eugeni the heck out of there.
we have budget to provision another ganeti cluster, so let's use it to replace this, and hopefully more. the existing cluster has already taken more than its share by taking machines from both kvm1/textile and moly, so it's time we provision more hardware for this.
this requires a new ganeti node (fsn-node-06, #33907 (moved)).
here's the disaster recovery plan i made up on the fly in #32801 (moved), which is relevant to the discussion here:
According to the Nextcloud spreadsheet (since LDAP is down), the machines running on kvm4 include:
|| host || service || impact || mitigation ||
|| alberti || LDAP, db.tpo || critical, no passwd change || read-only copies everywhere ||
|| build-x86-09 || buildbox || redundant || N/A ||
|| eugeni || incoming mail, lists || critical, total outage || peek at tor-puppet/modules/postfix/files/virtual and email people directly ||
|| meronense || metrics.tpo || critical, total outage || ? ||
|| neriniflorum || DNS || redundant, higher TTFB? || possible to remove from rotation ||
|| oo-hetzner-03 || onionoo || redundant || ? ||
|| pauli || puppet || major, no config management || use cumin, local git copies ||
|| rouyi || jenkins || critical, total outage || ? ||
|| web-hetzner-01 || web mirror || redundant, no effect? || removed from rotation automatically ||
|| weissi || build box || no windows builds || N/A ||
|| woronowii || build box || no windows builds || N/A ||
I'll note that it seems both windows build boxes are on the same machine so even if jenkins would be able to dispatch builds, we wouldn't be able to do those...
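To make the table easier to reason about, here is a minimal Python sketch that groups the kvm4 guests by impact. The host names and impacts are transcribed from the table above; the `KVM4_GUESTS` list and the `hosts_by_impact` helper are hypothetical names made up for this illustration, and the impact column is reduced to its first keyword.

```python
from collections import defaultdict

# Inventory transcribed from the table above: (host, service, impact).
# The impact is shortened to the first keyword of the "impact" column.
KVM4_GUESTS = [
    ("alberti", "LDAP, db.tpo", "critical"),
    ("build-x86-09", "buildbox", "redundant"),
    ("eugeni", "incoming mail, lists", "critical"),
    ("meronense", "metrics.tpo", "critical"),
    ("neriniflorum", "DNS", "redundant"),
    ("oo-hetzner-03", "onionoo", "redundant"),
    ("pauli", "puppet", "major"),
    ("rouyi", "jenkins", "critical"),
    ("web-hetzner-01", "web mirror", "redundant"),
    ("weissi", "build box", "no windows builds"),
    ("woronowii", "build box", "no windows builds"),
]


def hosts_by_impact(guests):
    """Group host names by impact level, keeping the table order."""
    groups = defaultdict(list)
    for host, _service, impact in guests:
        groups[impact].append(host)
    return dict(groups)


if __name__ == "__main__":
    for impact, hosts in hosts_by_impact(KVM4_GUESTS).items():
        print(f"{impact}: {', '.join(hosts)}")
```

Grouping this way shows at a glance that both hosts in the "no windows builds" group sit on the same physical machine, which is the single point of failure noted above.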
Our disaster recovery plan so far is to wait for that rescue to succeed, which might take up to 24h but hopefully less.
If that fails, I would suggest the following plan:
1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere (we need those three to build new machines)
2. build a new ganeti cluster (because we can't recover all of this on gnt-fsn)
3. restore remaining machines on the new cluster
4. decommission kvm4 officially
This could take a few days of work. :(
Out of that, I would outline the following plan:
in the short term: migrate eugeni, pauli and alberti to a HA cluster, probably gnt-fsn (yes, that means it will be over-allocated even more)
in parallel or after (january): add a node or two to the ganeti cluster
migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new cluster
This would leave the following boxes on kvm4, with the following rationale:
build-x86-09 - highly redundant, not urgent
web-hetzner-01 - one web node already present in the gnt-fsn cluster, moving this will not bring us more redundancy
weissi - hard to migrate
woronowii - hard to migrate
At that point we'd have the choice to migrate the two windows VMs (ugh) and the build box to the ganeti cluster, and we'd probably decommission web-hetzner-01 or move it to kvm5 or some other host, then decommission kvm4.
How does that sound for a plan?
Tickets would need to be created for each one of those tasks.
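The split described in the plan above can be checked with a small Python sketch: subtract the two migration phases from the full guest list and see what is left on kvm4. The host names come from the plan; `ALL_GUESTS`, `SHORT_TERM`, `NEW_CLUSTER` and `remaining_on_kvm4` are hypothetical names used only for this illustration.

```python
# All guests listed in the kvm4 inventory table.
ALL_GUESTS = {
    "alberti", "build-x86-09", "eugeni", "meronense", "neriniflorum",
    "oo-hetzner-03", "pauli", "rouyi", "web-hetzner-01", "weissi",
    "woronowii",
}

# Phase 1: short-term move to an HA cluster (probably gnt-fsn).
SHORT_TERM = {"eugeni", "pauli", "alberti"}

# Phase 2: move to the cluster once nodes have been added.
NEW_CLUSTER = {"meronense", "neriniflorum", "oo-hetzner-03", "rouyi"}


def remaining_on_kvm4(all_guests, *phases):
    """Return the sorted guests not covered by any migration phase."""
    moved = set().union(*phases)
    unknown = moved - all_guests
    if unknown:
        # Catch typos in host names before trusting the result.
        raise ValueError(f"unknown hosts in plan: {sorted(unknown)}")
    return sorted(all_guests - moved)


if __name__ == "__main__":
    print(remaining_on_kvm4(ALL_GUESTS, SHORT_TERM, NEW_CLUSTER))
```

The result matches the four boxes the plan leaves behind: the build box, the web mirror and the two windows VMs.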
i will also note that meronense has been seeing disk errors for a while now, in #32692 (moved). might be another good indication something is wrong with this box (although mdadm thinks everything is fine).
we don't have docs on how to move instances just yet, but i added a section in our ganeti manual that should be filled in when we do. for now it has references to external manuals that could be used:
add details of the machines to migrate and link to new gnt-fsn node ticket
Trac: Owner: tpa to anarcat Summary: decomission kvm4 to retire kvm4, 12 VMs to migrate
created a ticket for every VM i think should be migrated, which means we would retire 4 VMs here:
two windows build boxes
a static mirror
a build box
does this make sense?
build-x86-09.torproject.org (build server) - RETIRE? there's a build box on kvm4, kvm5 and two on moly, all of which are scheduled for retirement, so we have to keep some of those resources
re the build box, weasel says we can retire it, but we will eventually need to create build boxes in the gnt-fsn cluster at some point.
web-hetzner-01 can be retired.
build-x86-09.torproject.org (build server) - RETIRE there's a build box on kvm4, kvm5 and two on moly, all of which are scheduled for retirement, so we have to keep some of those resources
We have at least one build box on gnt-fsn. We can just shut down build-x86-09 when we retire kvm4. I'm leaving it running for now, because it doesn't hurt and it helps a little bit at times.
i just retired oo-hetzner-03, the next step here is to retire kvm4 itself, it seems.
6. removed from source code (puppet, auto-dns, domains, wiki)
7. removed from tor-passwords
8. N/A (dnswl)
9. removed from spreadsheet
last steps: disk wipes and cancelation with hetzner.
since the machine has been removed from puppet/ldap, its public key is not available from the servers anymore. if you need to connect, you can use the following known_hosts:
I scheduled deletion with hetzner for now + 2 days:
Please note that this server will be cancelled on 28/05/2020 and all data will be deleted.
Confirmation* I have read and understood the above message. I confirm that I want to cancel my server. The cancellation will take effect on 28/05/2020.
i'll keep this open until the server is canceled, but this is all but done.
hum. the first wipe didn't automatically exit, so it probably got hung there for a few hours. i hit "enter" and it started the second round. eta 45 minutes to apocalypse.
hum, okay... it failed with an error, and when i tried to open a new script to test, everything collapsed in a flaming heap. now the server doesn't respond to pings, so hopefully it's really dead now.
tomorrow hetzner should retire it completely and this will be done.