TPA-RFC-72: move donate-01 VM to gnt-dal cluster

in tpo/web/donate-neo#134 (closed), we've identified severe latency issues with the donation site. @lavamind suspected the tunnel crossing the Atlantic might be causing those issues, and analysis (tpo/web/donate-neo#134 (comment 3083019)) shows there's indeed a 200-400ms latency over that link, which is causing severe disruptions.
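for anyone wanting to re-check the latency before and after the move, here is a minimal sketch that times TCP connections to an endpoint. this is not the method used in the linked analysis; the function names are made up for illustration, and the host and port to probe are left to the caller since the issue does not name them. it measures connection setup time only, which approximates one round trip, not application-level response time.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=5.0):
    """Time one TCP connection setup to (host, port), in milliseconds."""
    start = time.perf_counter()
    # create_connection returns once the TCP handshake has completed.
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def sample_latency(host, port, count=5):
    """Take several samples and summarize them as min/median/max."""
    samples = [tcp_connect_ms(host, port) for _ in range(count)]
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

running `sample_latency(host, port)` from donate-01 against the backend, once before and once after the migration, should show whether the round trip has dropped out of the 200-400ms range.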

at first, @lavamind thought we should move the crm-int-01 machine next to the donate-01 machine in the gnt-dal cluster, but it's actually the other way around. the crm-* machines were moved to gnt-dal over a year ago, in #41109 (closed).

so what we need to do is to move the donate-01 VM instead.

next steps:

  • make a migration plan (for now, below is an inter-cluster migration plan inspired by #41109 (comment 2900087), but should we just rebuild a donate-02?)
  • review the migration plan (@lavamind)

inter-cluster migration plan

Before:

  • Schedule an outage with stakeholders
  • Look for any hard-coded IPs in donate and puppet code
  • Review cross-cluster transfer procedure, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#cross-cluster-migrations
  • Announce outage on status.tpo (status-site!66 (merged) waiting for merge)
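The hard-coded IP check above could be done with a small scan like the following. This is only a sketch: the function names are hypothetical, it covers IPv4 literals only (IPv6 addresses would need a separate pattern), and it does not know which addresses actually belong to donate-01, so the output still needs a manual review.

```python
import ipaddress
import re
from pathlib import Path

# Loose candidate pattern: four dot-separated groups of 1-3 digits that are
# not part of a longer dotted number. ipaddress does the real validation.
IPV4_CANDIDATE = re.compile(r"(?<![\w.])(?:\d{1,3}\.){3}\d{1,3}(?!\w|\.\d)")


def find_ipv4_literals(text):
    """Return the valid IPv4 literals found in text, in order of appearance."""
    found = []
    for match in IPV4_CANDIDATE.finditer(text):
        try:
            ipaddress.IPv4Address(match.group())
        except ValueError:
            continue  # out-of-range octets such as 999.1.1.1
        found.append(match.group())
    return found


def scan_tree(root):
    """Yield (path, line_number, address) for each IPv4 literal under root."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(lines, start=1):
            for address in find_ipv4_literals(line):
                yield path, number, address
```

Running `scan_tree()` over checkouts of the donate and Puppet repositories and printing each `(path, line_number, address)` tuple gives a list of candidates to compare against the current donate-01 addresses.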

During:

  • Toggle maintenance mode on frontend (we don't have one? see tpo/web/donate-neo#107)
  • Suspend Puppet on origin and destination clusters
  • Deploy required firewall rules on origin and destination nodes
  • Transfer donate-01 (see procedure in https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ganeti/#actual-vm-migration)
  • Renumber IP addresses
  • Fix backend IPSec tunnel IPs

After:

  • Clear temporary firewall rules
  • Re-enable Puppet
  • Disable frontend maintenance mode?
  • Validate donate site works, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/service/donate#testing-the-donation-site
  • Mark status.tpo entry as resolved
Edited Oct 02, 2024 by Jérôme Charaoui