we're going to host more and more gitlab stuff in object storage (e.g. #41425 (closed)) and already have runners there. it makes sense to move gitlab-02 to the new gnt-dal cluster, which has faster disks and more powerful CPUs.
this should help us deal with the current overload in the gnt-fsn cluster as well (incident #41429 (closed)).
plan:

- communicate date of outage to all of tor
- add planned maintenance item on status.tpo
- lower TTL for subdomains that need it to 5mins -- gitlab-02 (aliases which don't need to change: gitlab, containers, gitaly), *pages
- prepare dal cluster for accepting instances from fsn, if needed (e.g. RAPI cert, RAPI passwords, firewall)
that would mean an 18-hour transfer, during which time we'd probably need gitlab to be completely offline, unless we start doing fancy things with snapshots and so on.
Kez changed title from migrate gitlab-02 to new gnt-fsn cluster to migrate gitlab-02 to new gnt-dal cluster
I'm happy either way. @lelutin if you want to take the lead on this, I'll be happy to help flesh out the plan and be there for assistance on the day of.
Just noting, however, that this will need coordination across teams considering the downtime it will require, which I estimate at several hours at minimum.
It might even be a case for a special Friday workday...
hell, we could make that a work party... i am not sure i'd do it on a friday though, because if it breaks then we need to work weekends... if anything, i'd rather set that on a sunday if we really want to avoid the downtime. or a bank holiday or something.
I'm up for taking the steering wheel. yeah, I'm inclined to think this transfer should be started at the end of day on a thursday and finished up on the friday.
@lavamind I'd be happy to hop on a video call soon (maybe today?) to start carving up a plan for the migration.
@anarcat do we have a script on hand that can do the inter-cluster migration for us? I seem to remember seeing that, maybe in fabric scripts?
it's basically built-in to ganeti, but there's some glue work that needs to happen around it. for now it's a series of copy-pastes from the ganeti wiki page, see if you can find it on your own! (and if not, tell me where you look so we can hotlink it)
heh yeah I was actually thinking of maybe staging a practice run since ganeti "works but has quirks"... but idle-fsn-01 being in the dal cluster would be confusing :P
I've added a quick summary of the documented move procedure to the description so that we have a flight checklist for this particular operation. I've also added communication to the rest of tor, moved the grub-pc reconfiguration above the DNS change, and added a reboot test after it. I'll update our documentation now to reflect that: it's better to consider the grub-pc detail as part of our tests, and since we're messing with the boot procedure, we're better off testing a reboot right after.
one concern that @lavamind raised on IRC is the sheer size of this VM and the time it would take to transfer between the two clusters.
i've run some tests to see what we're dealing with. a simple bandwidth test between fsn-node-01 and dal-node-01 seems to say we have about 160Mbit/s between the two clusters, which matches my experience with past tests.
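for reference, a test along these lines is enough to get that ballpark number (a sketch: iperf3 needs to be installed on both nodes, the port allowed through the firewall, and the hostnames here are only illustrative):

```
# on the receiving end, in the fsn cluster:
fsn-node-01$ iperf3 -s -p 5201

# from the dal cluster, run a 30 second test against it:
dal-node-01$ iperf3 -c fsn-node-01.torproject.org -p 5201 -t 30
```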
in #40917 (comment 2840145), i did some tests to use "zerofree" on the disks to improve compression. one catch is that the filesystem layout on gitlab-02 is a tad more complicated: we have LVM in there, so zerofree won't cut it on its own, we need... "something else" to zero out the unallocated LVM space. it looks like this could be as simple as creating a new LV taking up all the empty space and zeroing it, though.
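something along these lines should do it; this is an untested sketch, where the VG name is derived from the backup volume's device path above and the temporary LV name is made up:

```
# create a throwaway LV covering all remaining free extents in the VG
lvcreate -l 100%FREE -n zerofill vg_gitlab-02_hdd
# fill it with zeros; dd exits with "No space left on device" once the LV
# is full, which is expected here
dd if=/dev/zero of=/dev/vg_gitlab-02_hdd/zerofill bs=1M status=progress || true
sync
# drop the temporary LV so the space shows up as unallocated (but zeroed) again
lvremove -y vg_gitlab-02_hdd/zerofill
```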
i'm testing zerofree on the 300G gitlab-backup volume now. it's simple, low-hanging fruit because it can just be unmounted without stopping anything, and it should give us a good idea of how long zerofree will take.
update: it did 2% in about 2 minutes, so it's going to take 100 minutes for 300GB, or what, 400 minutes for 1.25TB? that's about 7 hours.
transferring 1.25TB at 164Mbit/s, on the other hand, will take about 17 hours (1.25 terabyte / (164 Mbit/s) in qalculate)... which is more, but not that much more... would we gain that much from the zerofree? not sure it's worth it.
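for the record, the back-of-the-envelope math behind that number:

```
# 1.25 TB is 10 terabits; 10e12 bits / 164e6 bits/s ≈ 61,000 s ≈ 17 hours
qalc '1.25 terabyte / (164 Mbit/s) to hours'
```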
zeroing out the unallocated volumes, however, seems like a smart move.
to benefit from this, however, we'll need to use the --compress flag to move-instance, which we haven't tested before. it seems the valid values for that are `IEC_ALL = ["gzip", "gzip-fast", "gzip-slow", "lzop", "none"]`, with none being the likely default.
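for reference, the rough shape of the final command, as it would be run from a node in the destination cluster. this is a sketch from memory of the wiki page: the tool path, credential file names, username and cluster names are placeholders to double-check against the documented procedure.

```
# run from a node on the destination (dal) cluster; RAPI credentials for the
# source cluster live under /root on dal-node-01 (file names here are guesses)
/usr/lib/ganeti/tools/move-instance \
  --src-ca-file=/root/fsn-rapi.pem \
  --src-username=rapi-mover \
  --src-password-file=/root/fsn-rapi-password \
  --dest-primary-node=dal-node-02 \
  --dest-secondary-node=dal-node-03 \
  --compress=lzop \
  gnt-fsn.torproject.org gnt-dal.torproject.org gitlab-02.torproject.org
```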
also note that zerofree can operate on readonly filesystems, which could be used to zero out the backup volume more intelligently than what i'm doing now... i ended up interrupting the process because it would have taken too long.
started this to actually, correctly zero out the backup volume:
```
date ; mount -o remount,ro /srv/gitlab-backup && time zerofree -v /dev/mapper/vg_gitlab--02_hdd-gitlab--backup ; mount -o remount,rw /srv/gitlab-backup; date
```
started at Thu Nov 14 21:11:59 UTC 2024, currently running in a screen(1).
update: that completed:
```
root@gitlab-02:~# date ; mount -o remount,ro /srv/gitlab-backup && time zerofree -v /dev/mapper/vg_gitlab--02_hdd-gitlab--backup ; mount -o remount,rw /srv/gitlab-backup; date
Thu Nov 14 21:11:59 UTC 2024
61968855/66249791/78642176

real    94m2.569s
user    1m35.577s
sys     5m18.201s
Thu Nov 14 22:46:01 UTC 2024
```
that's ... only 94 minutes! maybe it's worth it after all... i wonder how well gitlab behaves with a readonly filesystem?
I looked into lowering the TTL. all of the hostnames other than gitlab-02 and pages are CNAMEs to gitlab-02, so they don't need to change.
I've changed the TTL on *pages.tpo to 10 minutes.
For gitlab-02, however, the IP comes from LDAP, so I've added a dnsTTL field to the gitlab-02 host entry in LDAP, and after a little while the change became visible on external DNS servers.
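a quick way to confirm the lowered TTL is actually being served:

```
# the TTL is the second column of the answer; querying a public resolver shows
# what the outside world gets (cached values count down towards the new TTL)
dig +noall +answer gitlab-02.torproject.org A @9.9.9.9
```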
lelutin marked the checklist item lower TTL for subdomains that need it to 5mins -- gitlab-02, gitlab, containers, gitaly, pages as completed
If this corresponds to what I've described above, I'll change the documentation, because its current formulation seems to imply that we should have the same password on both clusters.
- the RAPI cert for the fsn cluster was already present on dal-node-01 (under /root)
- the RAPI passwords for both clusters were already present in files on dal-node-01
- skipped over points 5 and 6. we'll have to do those just before we start the transfer -- adding them to the list in the description
lelutin marked the checklist item prepare dal cluster for accepting instances from fsn, if needed (e.g. RAPI cert, RAPI passwords, firewall) as completed
i added items to the checklist that can be done before the maintenance window. i've zerofree'd the backup partition, but i don't think we can zerofree the other parts without downtime.
lelutin marked the checklist item wipe free space on volume group (create lv that covers all the remaining free space and wipe it out with dd) (@lelutin) as completed
we ran two transfer tests in preparation, to make sure that we have all the bits and pieces ready for the final transfer.
the first transfer was right after the creation of the new dummy VM. it transferred at around 20MiB/s fairly consistently throughout the procedure and took about 16 minutes for the 20GB disk.
for the second attempt, we first ran activate-disks and did a zerofree on the disk before launching the transfer. that second attempt took only 6 minutes.
edit: in both cases we were running the move-instance script with the option --compress=lzop set
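for posterity, the zerofree step on the dummy instance went roughly like this (a sketch: the instance name and device path are illustrative, activate-disks prints the real path, and this assumes the filesystem sits directly on the disk without a partition table):

```
# stop the instance so the filesystem is quiescent
gnt-instance stop dummy-01.torproject.org
# activate the disks without starting the instance; this prints lines like
# "fsn-node-01.torproject.org:disk/0:/dev/drbd12"
gnt-instance activate-disks dummy-01.torproject.org
# on the instance's primary node, zero the free blocks of the filesystem
zerofree -v /dev/drbd12
# release the disks again before running move-instance with --compress=lzop
gnt-instance deactivate-disks dummy-01.torproject.org
```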
blah there's no "maintenance mode" available to use... it's a premium feature.
So we'll just proceed with the plan I laid out at first and rely on delaying the DNS change so that nothing will get written to the new location until we decide that the migration is over.
alerts that fired during the maintenance:

- HTTPSUnreachable: for gitlab.tpo, pages.tpo, containers.tpo and gitaly-02.tpo
- PlaintextHTTPUnreachable: for the same four subdomains
- jobdown for gitlab-02 with variants: job=gitlab (three alerts on three different ports), job=gitlab-workhorse, job=postgres, job=mtail, job=node, job=gitaly, job=nginx
- SMTPUnreachable for gitlab-02.tpo
- SSHUnreachable for gitlab-02.tpo
- DRBDDegraded for dal-node-02 and dal-node-03: they're the two destination nodes
- systemd failed units on all fsn-node-* and dal-node-* ganeti nodes: we had disabled puppet on all of them to avoid surprises with the firewall, in case puppet wanted to apply a change and reload the firewall, which could have blocked the transfer
- systemd failed unit, puppet-run.service, where puppet tries to pull from a repository on gitlab.tpo: onionoo-backend-03, hetzner-nbg1-01, hetzner-nbg1-02.torproject.org, check-01.torproject.org, tb-build-03.torproject.org, collector-02.torproject.org, survey-01.torproject.org, colchicifolium.torproject.org, meronense.torproject.org, metricsdb-01.torproject.org, tb-build-02.torproject.org
I set silences for:

- everything that has alias=gitlab-02.torproject.org or alias=gitaly-02.torproject.org
- alias of dal-node-02.tpo or -03 and alertname=DRBDDegraded
- alertname=SystemdFailedUnits and all of the aliases identified above
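for next time, those silences can also be set from the command line; a sketch with amtool, where the alertmanager URL and durations are illustrative:

```
# silence everything about the instance being migrated
amtool --alertmanager.url=http://localhost:9093 silence add \
  --comment='gitlab-02 migration to gnt-dal' --duration=12h \
  alias=gitlab-02.torproject.org
# silence the expected DRBD noise on the two destination nodes
amtool --alertmanager.url=http://localhost:9093 silence add \
  --comment='gitlab-02 migration to gnt-dal' --duration=12h \
  alertname=DRBDDegraded 'alias=~"dal-node-0(2|3).torproject.org"'
```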
edit: in addition to the above-mentioned alerts, there were also those that fired during the night:

- PuppetAgentErrors and PuppetCatalogStale for all hosts mentioned in the systemd failed units alert. we should set up an inhibition for PuppetAgentErrors when PuppetCatalogStale also fires.
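a minimal sketch of what that inhibition could look like, assuming the stock alertmanager inhibit_rules mechanism and the labels seen above; the snippet would need to be merged into the managed alertmanager configuration, and the config path is an assumption:

```
# the rule, kept here as a note; alertmanager >= 0.22 matcher syntax
cat <<'EOF'
inhibit_rules:
  - source_matchers: [ 'alertname = PuppetCatalogStale' ]
    target_matchers: [ 'alertname = PuppetAgentErrors' ]
    equal: [ 'alias' ]
EOF
# after merging it into the config, validate the result:
amtool check-config /etc/prometheus/alertmanager.yml
```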
> - systemd failed units on all fsn-node-* and dal-node-* ganeti nodes: we had disabled puppet on all of them to avoid surprises with the firewall, in case puppet wanted to apply a change and reload the firewall, which could have blocked the transfer
> - systemd failed unit, puppet-run.service, where puppet tries to pull from a repository on gitlab.tpo: onionoo-backend-03, hetzner-nbg1-01, hetzner-nbg1-02.torproject.org, check-01.torproject.org, tb-build-03.torproject.org, collector-02.torproject.org, survey-01.torproject.org, colchicifolium.torproject.org, meronense.torproject.org, metricsdb-01.torproject.org, tb-build-02.torproject.org
those were particularly noisy, and kept repeating, rendering the notification channel pretty unusable. i think it's worth investigating this one separately, do you want me to file an issue, or how do you plan on following up on all of those? :)
I was thinking of adding a list of what needs to be silenced to our documentation. I don't know if that's enough to make things calmer for our future selves during other major maintenance, but it's a start.
This time around, I did not try to anticipate what alerts would be triggering and I let everything come out to IRC so that I could note everything down.
lelutin marked the checklist item stop the instance and start the transfer as completed
lelutin marked the checklist item change IP in instance after the move as completed
lelutin marked the checklist item test that gitlab and all other websites are replying properly, reconfigure grub-pc, test reboot of instance as completed
lelutin marked the checklist item switch DNS entries to point to new IP as completed
lelutin marked the checklist item disable maintenance mode as completed
Unrelated to the migration, I think, but we might want to take a quick look at this. in the output of journalctl I saw this go by:
```
Nov 29 15:03:38 gitlab-02 prometheus-node-exporter[707]: ts=2024-11-29T15:03:38.131Z caller=textfile.go:227 level=error collector=textfile msg="failed to collect textfile data" file=tpa-gitlab-ci-wait.prom err="failed to parse textfile data from \"/var/lib/prometheus/node-exporter/tpa-gitlab-ci-wait.prom\": text format parsing error in line 3: expected float as value, got \"\""
```
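the textfile collector wants one metric per line ending in a float, so an empty value on line 3 of that file is what trips it up. a quick way to look at it (the metric name in the comment is only an illustration):

```
# show the first few lines with line ends made visible, to spot an empty value
cat -A /var/lib/prometheus/node-exporter/tpa-gitlab-ci-wait.prom | head -n 5
# a valid sample line would look something like:
#   tpa_gitlab_ci_wait_pending_jobs 3
```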
@lavamind the password and certificate files were already present from previous instance migrations, so I'm wondering if we want to keep those around or if we should rather clean things up and re-create them for the next migrations... do you have any thoughts on this?
The TTL is back to its normal value on gitlab-02 (now visible on our NS servers). I've also removed the subdomain dip.torproject.org, which was a CNAME to gitlab-02, since both anarcat and lavamind confirmed that it was not used anymore.
lelutin marked the checklist item bring the TTL back to the default value of 1h for gitlab-02 and *pages as completed
I just saw an issue with graphs: we don't have any data on blackbox https probes since the migration, so we also won't have alerts if the sites go down.
I've tried restarting the prometheus-blackbox-exporter service but the metrics did not start coming in. restarting prometheus also did not bring the metrics back, so something else is at play. also, the metrics for blackbox probes stopped on the 27th.
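one way to narrow this down is to poke the exporter by hand on the host that runs it and see whether the probe itself still works; a sketch, where the module name is a guess to be checked against the blackbox exporter config:

```
# ask the blackbox exporter directly for a probe of the gitlab site and look
# at the probe_success / status code metrics in its output
curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://gitlab.torproject.org/' \
  | grep -E '^probe_(success|http_status_code)'
```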
I just pushed a couple of changes to the wiki section about cross-cluster migrations to bring our documentation up to date. I'll take a look at what needs to be updated in the disaster recovery section.