TPA-RFC-72: move donate-01 VM to gnt-dal cluster
in tpo/web/donate-neo#134 (closed), we've identified severe latency issues with the donation site. @lavamind suspected the tunnel crossing the Atlantic might be causing those issues, and analysis (tpo/web/donate-neo#134 (comment 3083019)) shows there's indeed a 200-400ms latency over that link, which is causing severe disruptions.
at first, @lavamind thought we should move the crm-int-01 machine next to the donate-01 machine in the gnt-dal cluster, but it's actually the other way around. the crm-* machines were moved to gnt-dal over a year ago, in #41109 (closed).
so what we need to do is to move the donate-01 VM instead.
next steps:
- make a migration plan (for now, below is an inter-cluster migration plan inspired by #41109 (comment 2900087), but should we just rebuild a donate-02?)
- review the migration plan (@lavamind)
inter-cluster migration plan
Before:
- Schedule an outage with stakeholders
- Look for any hard-coded IPs in donate and Puppet code
- Review cross-cluster transfer procedure, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#cross-cluster-migrations
- Announce outage on status.tpo (status-site!66 (merged) waiting for merge)
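
For the "hard-coded IPs" item, here is a rough sketch of how a checkout could be scanned for IPv4 literals. This is only an illustration: the function name `find_hardcoded_ipv4` is made up here, it ignores IPv6, and four-component version strings will show up as false positives, so the output still needs a human review.

```python
import ipaddress
import re
from pathlib import Path

# Loose candidate pattern; the ipaddress module does the real validation.
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def find_hardcoded_ipv4(root):
    """Yield (path, line_number, address) for each IPv4 literal under root."""
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything inside a .git directory.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in IPV4_CANDIDATE.finditer(line):
                try:
                    ipaddress.IPv4Address(match.group())
                except ValueError:
                    continue  # looks like an address but is not valid
                yield str(path), lineno, match.group()
```

Running it over local clones of the donate and Puppet repositories (for example `for hit in find_hardcoded_ipv4("path/to/checkout"): print(*hit)`) gives a list of places to check before the renumbering.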
During:
- Toggle maintenance mode on frontend (we don't have one? see tpo/web/donate-neo#107)
- Suspend Puppet on origin and destination clusters
- Deploy required firewall rules on origin and destination nodes
- Transfer donate-01 (see procedure in https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#actual-vm-migration)
- Renumber IP addresses
- Fix backend IPSec tunnel IPs
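
As a sanity check for the renumbering and IPSec steps, something like the following could confirm that every address configured after the move falls inside the destination networks. This is a sketch only: `addresses_outside` is a hypothetical helper, and the networks in the usage note are documentation ranges (RFC 5737 / RFC 3849), not the real gnt-dal allocations.

```python
import ipaddress


def addresses_outside(networks, addresses):
    """Return the addresses that are not inside any of the given networks."""
    nets = [ipaddress.ip_network(n) for n in networks]
    outside = []
    for raw in addresses:
        addr = ipaddress.ip_address(raw)
        # Compare only against networks of the same IP version.
        if not any(addr.version == net.version and addr in net for net in nets):
            outside.append(raw)
    return outside
```

For example, `addresses_outside(["192.0.2.0/24", "2001:db8::/32"], ["192.0.2.15", "198.51.100.7"])` flags `198.51.100.7` as a leftover from the old network.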
After:
- Clear temporary firewall rules
- Reenable Puppet
- Disable frontend maintenance mode?
- Validate donate site works, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/donate#testing-the-donation-site
- Mark status.tpo entry as resolved
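
Before going through the full testing procedure linked above, a quick smoke check along these lines could confirm the site answers at all and give a rough idea of response time after the move. It is a sketch, not a replacement for the documented procedure: `smoke_check` is a hypothetical helper and the URL to test is left to the operator.

```python
import time
import urllib.request


def smoke_check(url, timeout=10.0):
    """Fetch url once and return (HTTP status, elapsed milliseconds).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError for connection failures, so an exception
    here already means the check failed.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # include body transfer in the timing
        status = response.status
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return status, elapsed_ms
```

Calling it a few times against the donation site before and after the migration gives a crude comparison of end-to-end response time, though it only times a single page fetch and says nothing about the backend calls behind a real donation.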
Edited by Jérôme Charaoui