we're going to host more and more gitlab stuff in object storage (e.g. #41425 (closed)) and already have runners there. it makes sense to move gitlab-02 to the new gnt-dal cluster, which has faster disks and more powerful CPUs.
this should help us deal with the current overload in the gnt-fsn cluster as well (incident #41429 (closed)).
plan:

- communicate date of outage to all of tor
- add planned maintenance item on status.tpo
- lower TTL for subdomains that need it to 5mins -- gitlab-02 (aliases which don't need to change: gitlab, containers, gitaly), *pages
- prepare dal cluster for accepting instances from fsn, if needed (e.g. RAPI cert, RAPI passwords, firewall)
that would mean an 18-hour transfer, during which time we'd probably need gitlab to be completely offline, unless we start doing fancy things with snapshots and so on.
Kez changed title from migrate gitlab-02 to new gnt-fsn cluster to migrate gitlab-02 to new gnt-dal cluster
I'm happy either way. @lelutin if you want to take the lead on this, I'll be happy to help flesh out the plan and be there for assistance on the day of.
Just noting, however, that this will need coordination across teams considering the downtime it will require, which I estimate at several hours at minimum.
It might even be a case for a special Friday workday...
hell, we could make that a work party... i am not sure i'd do it on a friday though, because if it breaks then we need to work weekends... if anything, i'd rather set that on a sunday if we really want to avoid the downtime. or a bank holiday or something.
I'm up for taking the steering wheel. yeah, I'm inclined to think this transfer should be started at the end of day on a thursday and finished up on the friday.
@lavamind I'd be happy to hop on a video call soon (maybe today?) to start carving up a plan for the migration.
@anarcat do we have a script on hand that can do the inter-cluster migration for us? I seem to remember seeing that, maybe in fabric scripts?
it's basically built-in to ganeti, but there's some glue work that needs to happen around it. for now it's a series of copy-pastes from the ganeti wiki page, see if you can find it on your own! (and if not, tell me where you look so we can hotlink it)
heh yeah I was actually thinking of maybe staging a practice run since ganeti "works but has quirks"... but idle-fsn-01 being in the dal cluster would be confusing :P
I've added a quick summary of the documented move procedure to the description so that we have a flight checklist for this particular operation. I've also added communication to the rest of tor, moved the grub-pc reconfiguration above the DNS change, and added a reboot test after it. I'll update our documentation now to reflect that: it's better to consider the grub-pc detail as part of our tests, and since we're messing with the boot procedure, we're better off testing a reboot right after.
one concern that @lavamind raised on IRC is the sheer size of this VM and the time it would take to transfer between the two clusters.
i've run some tests to see what we're dealing with. a simple bandwidth test between fsn-node-01 and dal-node-01 seems to say we have about 160Mbit/s between the two clusters, which matches my experience with past tests.
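for reference, a test along these lines is enough to get that ballpark number (a sketch: iperf3 needs to be installed on both nodes, the port allowed through the firewall, and the hostnames here are only illustrative):

```
# on the receiving end, in the fsn cluster:
fsn-node-01$ iperf3 -s -p 5201

# from the dal cluster, run a 30 second test against it:
dal-node-01$ iperf3 -c fsn-node-01.torproject.org -p 5201 -t 30
```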
in #40917 (comment 2840145), i did some tests to use "zerofree" on the disks to improve compression. one catch is that the filesystem layout on gitlab-02 is a tad more complicated: we have LVM in there, so zerofree won't cut it on its own, we need... "something else" to zero out the unallocated LVM space. it looks like this could be as simple as creating a new LV taking up all the empty space and zeroing it, though.
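something along these lines should do it; this is an untested sketch, where the VG name is derived from the backup volume's device path above and the temporary LV name is made up:

```
# create a throwaway LV covering all remaining free extents in the VG
lvcreate -l 100%FREE -n zerofill vg_gitlab-02_hdd
# fill it with zeros; dd exits with "No space left on device" once the LV
# is full, which is expected here
dd if=/dev/zero of=/dev/vg_gitlab-02_hdd/zerofill bs=1M status=progress || true
sync
# drop the temporary LV so the space shows up as unallocated (but zeroed) again
lvremove -y vg_gitlab-02_hdd/zerofill
```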
i'm testing zerofree on the 300G gitlab-backup volume now. it's simple, low-hanging fruit because it can just be unmounted without stopping anything, and it should give us a good idea of how long zerofree will take.
update: it did 2% in about 2 minutes, so it's going to take 100 minutes for 300GB, or what, 400 minutes for 1.25TB? that's about 7 hours.
transferring 1.25TB at 164Mbit/s, on the other hand, will take about 17 hours (1.25 terabyte / (164 Mbit/s) in qalculate)... which is more, but not that much more... would we gain that much from the zerofree? not sure it's worth it.
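for the record, the back-of-the-envelope math behind that number:

```
# 1.25 TB is 10 terabits; 10e12 bits / 164e6 bits/s ≈ 61,000 s ≈ 17 hours
qalc '1.25 terabyte / (164 Mbit/s) to hours'
```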
zeroing out the unallocated volumes, however, seems like a smart move.
to benefit from this, however, we'll need to use the --compress flag to move-instance, which we haven't tested before. it seems the valid values for that are `IEC_ALL = ["gzip", "gzip-fast", "gzip-slow", "lzop", "none"]`, with none being the likely default.
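for reference, the rough shape of the final command, as it would be run from a node in the destination cluster. this is a sketch from memory of the wiki page: the tool path, credential file names, username and cluster names are placeholders to double-check against the documented procedure.

```
# run from a node on the destination (dal) cluster; RAPI credentials for the
# source cluster live under /root on dal-node-01 (file names here are guesses)
/usr/lib/ganeti/tools/move-instance \
  --src-ca-file=/root/fsn-rapi.pem \
  --src-username=rapi-mover \
  --src-password-file=/root/fsn-rapi-password \
  --dest-primary-node=dal-node-02 \
  --dest-secondary-node=dal-node-03 \
  --compress=lzop \
  gnt-fsn.torproject.org gnt-dal.torproject.org gitlab-02.torproject.org
```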
also note that zerofree can operate on readonly filesystems, which could be used to zero out the backup volume more intelligently than what i'm doing now... i ended up interrupting the process because it would have taken too long.
started this to actually, correctly zero out the backup volume:
```
date ; mount -o remount,ro /srv/gitlab-backup && time zerofree -v /dev/mapper/vg_gitlab--02_hdd-gitlab--backup ; mount -o remount,rw /srv/gitlab-backup; date
```
started at Thu Nov 14 21:11:59 UTC 2024, currently running in a screen(1).
update: that completed:
```
root@gitlab-02:~# date ; mount -o remount,ro /srv/gitlab-backup && time zerofree -v /dev/mapper/vg_gitlab--02_hdd-gitlab--backup ; mount -o remount,rw /srv/gitlab-backup; date
Thu Nov 14 21:11:59 UTC 2024
61968855/66249791/78642176

real    94m2.569s
user    1m35.577s
sys     5m18.201s
Thu Nov 14 22:46:01 UTC 2024
```
that's ... only 94 minutes! maybe it's worth it after all... i wonder how well gitlab behaves with a readonly filesystem?
I looked into lowering the TTL. all of the hostnames other than gitlab-02 and pages are CNAMEs to gitlab-02, so they don't need to change.
I've changed the TTL on *pages.tpo to 10 minutes.
For gitlab-02, however, the IP comes from LDAP, so I've added a dnsTTL field to the gitlab-02 host entry in LDAP, and after a little while the change became visible on external DNS servers.
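a quick way to confirm the lowered TTL is actually being served:

```
# the TTL is the second column of the answer; querying a public resolver shows
# what the outside world gets (cached values count down towards the new TTL)
dig +noall +answer gitlab-02.torproject.org A @9.9.9.9
```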
lelutin marked the checklist item lower TTL for subdomains that need it to 5mins -- gitlab-02, gitlab, containers, gitaly, pages as completed
If this corresponds to what I've described above, I'll change the documentation, because its current formulation seems to imply that we should have the same password on both clusters.
- the RAPI cert for the fsn cluster was already present on dal-node-01 (under /root)
- the RAPI passwords for both clusters were already present in files on dal-node-01
- skipped over points 5 and 6. we'll have to do those just before we start the transfer -- adding them to the list in the description
lelutin marked the checklist item prepare dal cluster for accepting instances from fsn, if needed (e.g. RAPI cert, RAPI passwords, firewall) as completed
i added items to the checklist that can be done before the maintenance window. i've zerofree'd the backup partition, but i don't think we can zerofree the other parts without downtime.
lelutin marked the checklist item wipe free space on volume group (create lv that covers all the remaining free space and wipe it out with dd) (@lelutin) as completed
we ran two transfer tests in preparation, to make sure that we have all the bits and pieces ready for the final transfer.
the first transfer was right after the creation of the new dummy VM. it transferred at around 20MiB/s fairly consistently throughout the procedure and took about 16 minutes for the 20GB disk.
for the second attempt, we first ran activate-disks and did a zerofree on the disk before launching the transfer. that second attempt took only 6 minutes.
edit: in both cases we were running the move-instance script with the option --compress=lzop set
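for posterity, the zerofree step on the dummy instance went roughly like this (a sketch: the instance name and device path are illustrative, activate-disks prints the real path, and this assumes the filesystem sits directly on the disk without a partition table):

```
# stop the instance so the filesystem is quiescent
gnt-instance stop dummy-01.torproject.org
# activate the disks without starting the instance; this prints lines like
# "fsn-node-01.torproject.org:disk/0:/dev/drbd12"
gnt-instance activate-disks dummy-01.torproject.org
# on the instance's primary node, zero the free blocks of the filesystem
zerofree -v /dev/drbd12
# release the disks again before running move-instance with --compress=lzop
gnt-instance deactivate-disks dummy-01.torproject.org
```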
blah there's no "maintenance mode" available to use... it's a premium feature.
So we'll just proceed with the plan I laid out at first and rely on delaying the DNS change so that nothing will get written to the new location until we decide that the migration is over.
alerts that fired during the maintenance:

- HTTPSUnreachable: for gitlab.tpo, pages.tpo, containers.tpo and gitaly-02.tpo
- PlaintextHTTPUnreachable: for the same four subdomains
- jobdown for gitlab-02 with variants: job=gitlab (three alerts on three different ports), job=gitlab-workhorse, job=postgres, job=mtail, job=node, job=gitaly, job=nginx
- SMTPUnreachable for gitlab-02.tpo
- SSHUnreachable for gitlab-02.tpo
- DRBDDegraded for dal-node-02 and dal-node-03: they're the two destination nodes
- systemd failed units on all fsn-node-* and dal-node-* ganeti nodes: we had disabled puppet on all of them to avoid surprises with the firewall, in case puppet wanted to apply a change and reload the firewall, which could have blocked the transfer
- systemd failed unit, puppet-run.service, where puppet tries to pull from a repository on gitlab.tpo: onionoo-backend-03, hetzner-nbg1-01, hetzner-nbg1-02.torproject.org, check-01.torproject.org, tb-build-03.torproject.org, collector-02.torproject.org, survey-01.torproject.org, colchicifolium.torproject.org, meronense.torproject.org, metricsdb-01.torproject.org, tb-build-02.torproject.org
I set silences for:

- everything that has alias=gitlab-02.torproject.org or alias=gitaly-02.torproject.org
- alias of dal-node-02.tpo or -03 and alertname=DRBDDegraded
- alertname=SystemdFailedUnits and all of the aliases identified above
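for next time, those silences can also be set from the command line; a sketch with amtool, where the alertmanager URL and durations are illustrative:

```
# silence everything about the instance being migrated
amtool --alertmanager.url=http://localhost:9093 silence add \
  --comment='gitlab-02 migration to gnt-dal' --duration=12h \
  alias=gitlab-02.torproject.org
# silence the expected DRBD noise on the two destination nodes
amtool --alertmanager.url=http://localhost:9093 silence add \
  --comment='gitlab-02 migration to gnt-dal' --duration=12h \
  alertname=DRBDDegraded 'alias=~"dal-node-0(2|3).torproject.org"'
```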
edit: in addition to the above-mentioned alerts, there were also those that fired during the night:

- PuppetAgentErrors and PuppetCatalogStale for all hosts mentioned in the systemd failed units alert. we should set up an inhibition for PuppetAgentErrors when PuppetCatalogStale also fires.
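a minimal sketch of what that inhibition could look like, assuming the stock alertmanager inhibit_rules mechanism and the labels seen above; the snippet would need to be merged into the managed alertmanager configuration, and the config path is an assumption:

```
# the rule, kept here as a note; alertmanager >= 0.22 matcher syntax
cat <<'EOF'
inhibit_rules:
  - source_matchers: [ 'alertname = PuppetCatalogStale' ]
    target_matchers: [ 'alertname = PuppetAgentErrors' ]
    equal: [ 'alias' ]
EOF
# after merging it into the config, validate the result:
amtool check-config /etc/prometheus/alertmanager.yml
```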
> - systemd failed units on all fsn-node-* and dal-node-* ganeti nodes: we had disabled puppet on all of them to avoid surprises with the firewall, in case puppet wanted to apply a change and reload the firewall, which could have blocked the transfer
> - systemd failed unit, puppet-run.service, where puppet tries to pull from a repository on gitlab.tpo: onionoo-backend-03, hetzner-nbg1-01, hetzner-nbg1-02.torproject.org, check-01.torproject.org, tb-build-03.torproject.org, collector-02.torproject.org, survey-01.torproject.org, colchicifolium.torproject.org, meronense.torproject.org, metricsdb-01.torproject.org, tb-build-02.torproject.org
those were particularly noisy, and kept repeating, rendering the notification channel pretty unusable. i think it's worth investigating this one separately, do you want me to file an issue, or how do you plan on following up on all of those? :)
I was thinking of adding a list of what needs to be silenced to our documentation. I don't know if that's enough to make things calmer for our future selves during other major maintenance, but it's a start.
This time around, I did not try to anticipate what alerts would be triggering and I let everything come out to IRC so that I could note everything down.
lelutin marked the checklist item stop the instance and start the transfer as completed
lelutin marked the checklist item change IP in instance after the move as completed
lelutin marked the checklist item test that gitlab and all other websites are replying properly, reconfigure grub-pc, test reboot of instance as completed
lelutin marked the checklist item switch DNS entries to point to new IP as completed
lelutin marked the checklist item disable maintenance mode as completed
Unrelated to the migration, I think, but we might want to take a quick look at this. in the output of journalctl I saw this go by:
```
Nov 29 15:03:38 gitlab-02 prometheus-node-exporter[707]: ts=2024-11-29T15:03:38.131Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa-gitlab-ci-wait.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/tpa-gitlab-ci-wait.prom\": text format parsing error in line 3: expected float as value, got \"\""
```
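the textfile collector wants one metric per line ending in a float, so an empty value on line 3 of that file is what trips it up. a quick way to look at it (the metric name in the comment is only an illustration):

```
# show the first few lines with line ends made visible, to spot an empty value
cat -A /var/lib/prometheus/node-exporter/tpa-gitlab-ci-wait.prom | head -n 5
# a valid sample line would look something like:
#   tpa_gitlab_ci_wait_pending_jobs 3
```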
@lavamind the password and certificate files were already present from previous instance migrations, so I'm wondering if we want to keep those around or if we should rather clean things up and re-create them for the next migrations... do you have any thoughts on this?
The TTL is back to its normal value on gitlab-02 (now visible on our NS servers). I've also removed the subdomain dip.torproject.org, which was a CNAME to gitlab-02, since both anarcat and lavamind confirmed that it was not used anymore.
lelutin marked the checklist item bring the TTL back to the default value of 1h for gitlab-02 and *pages as completed
I just saw an issue with graphs: we don't have any data on blackbox https probes since the migration, so we also won't have alerts if the sites go down.
I've tried restarting the prometheus-blackbox-exporter service but the metrics did not start coming in. restarting prometheus also did not bring the metrics back, so something else is at play. also, the metrics for blackbox probes stopped on the 27th.
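one way to narrow this down is to poke the exporter by hand on the host that runs it and see whether the probe itself still works; a sketch, where the module name is a guess to be checked against the blackbox exporter config:

```
# ask the blackbox exporter directly for a probe of the gitlab site and look
# at the probe_success / status code metrics in its output
curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://gitlab.torproject.org/' \
  | grep -E '^probe_(success|http_status_code)'
```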
I just pushed a couple of changes to the wiki section about cross-cluster migrations to bring our documentation up to date. I'll take a look at what needs to be updated in the disaster recovery section.