trusted high performance cluster (gnt-dal migration)
As part of the OKR 2022 Q1/Q2 plan, we have set the following key results in this objective:
- establish a new PoP on the US west coast with trusted partners and hardware ($$)
- retire moly and move the DNS server to the new cluster (team#29974 (closed))
- reduce VM deployment time to one hour or less (currently 2 hours) (team#31239)
This was originally due, like the other OKRs, for July 2022, but was postponed to February 2023.
The actual budget for this migration was established in TPA-RFC-40 (tpo/tpa/team#40897), and the migration plan in TPA-RFC-43 (tpo/tpa/team#40929).
Migration plan
Here's a working copy of the migration plan so we can attach issues and modify it without having to redo TPA-RFC-43:
New colocation facility access
In this step, we pick the colocation provider and establish contact.
- get credentials for OOB management
- get address to ship servers
- get emergency/support contact information
This step needs to happen before the following steps are completed (at least the "servers shipping" step).
Followup in team#40967 (closed).
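Once the OOB credentials come in, a quick sanity check of remote access is worth doing before anything ships. A minimal sketch, assuming an IPMI-capable BMC; the hostname and user below are placeholders, not the real ones:

```
# confirm the BMC answers and reports a sane chassis state (placeholder host/user, -a prompts for the password)
ipmitool -I lanplus -H dal-node-01-oob.example.org -U tpa -a chassis status
# also confirm serial-over-LAN console access works
ipmitool -I lanplus -H dal-node-01-oob.example.org -U tpa -a sol activate
```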
chi-node-14 transfer
This is essentially the work to transfer chi-node-14 to the new colocation facility.
- maintenance window announced to shadow people
- server shutdown in preparation for shipping
- server is shipped
- server is racked and connected
- server is renumbered and brought back online
- end of the maintenance window
This can happen in parallel with the following tasks, but after the colo setup above.
Followup in team#40968 (closed).
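Before the shutdown step above, it's worth confirming nothing still runs on the node. A rough sketch, with hostnames assumed rather than taken from the actual plan:

```
# check that chi-node-14 no longer hosts any instance, as primary or secondary
gnt-instance list -o name,pnode,snodes | grep chi-node-14
# if anything is left, migrate it away first, for example:
#   gnt-node migrate -f chi-node-14.torproject.org
# then mark the node offline in the cluster and power it down for shipping
gnt-node modify --offline=yes chi-node-14.torproject.org
ssh chi-node-14.torproject.org poweroff
```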
New hardware deployment
- budget approval (TPA-RFC-40 is standard)
- server selection is confirmed
- servers are ordered (team#40966 (closed))
- servers are shipped (team#40966 (closed))
- servers are racked and connected (team#40969 (closed))
- burn-in (team#40969 (closed))
At the end of this step, the three servers are built, shipped, connected, and remotely available for install, but not installed just yet.
This step can happen in parallel with the chi-node-14 transfer and the software migration preparation.
Followup in team#40969 (closed).
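The burn-in step could look something like this on each new machine once it is reachable; the exact tools are just one option here, nothing in the plan prescribes them:

```
# exercise CPU and memory for a day
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 24h --metrics-brief
# kick off long SMART self-tests on the SATA drives
for disk in /dev/sd?; do smartctl -t long "$disk"; done
# once the self-tests finish, check the overall health verdicts
for disk in /dev/sd? /dev/nvme?; do smartctl -H "$disk"; done
```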
Software migration preparation
This can happen in parallel with the previous tasks.
This step was rewritten in team#40970 (closed).
Cluster configuration
This needs all the previous steps (except the chi-node-14 transfer) to be done before it can go ahead.
This is split between the Ganeti base install (team#40971 (closed)) and mass migration (team#40972 (closed)).
The third node can be installed in parallel with step 4 (team#40970 (closed)) and the later steps.
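For reference, the Ganeti base install boils down to something like the following; the hostnames, network device and volume group name are illustrative guesses, not the actual configuration:

```
# on the first node: create the new cluster
gnt-cluster init \
  --enabled-hypervisors=kvm \
  --master-netdev=eth0 \
  --vg-name=vg_ganeti \
  gnt-dal.torproject.org
# add the second node; the third can join later, as noted above
gnt-node add dal-node-02.torproject.org
# sanity-check the resulting cluster
gnt-cluster verify
```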
Single VM migration example
A single VM migration may look something like this:
1. instance stopped on source node
2. instance exported on source node
3. instance imported on target node
4. instance started
5. instance renumbered
6. instance rebooted
7. old instance destroyed after 7 days

If the mass-migration process works, steps 1-4 can happen in parallel, and operators basically only have to renumber the instances and test.
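Done by hand with the stock Ganeti tools, and with placeholder instance and node names (the mass-migration tooling may well do this differently), that could look roughly like:

```
# on the source cluster (gnt-chi): stop and export the instance
gnt-instance shutdown test-01.torproject.org
gnt-backup export -n chi-node-01.torproject.org test-01.torproject.org
# copy the export (it lands under /var/lib/ganeti/export/ by default) to a gnt-dal
# node, then, on the target cluster, import and start it:
gnt-backup import -t drbd \
  -n dal-node-01.torproject.org:dal-node-02.torproject.org \
  --src-node=dal-node-01.torproject.org \
  test-01.torproject.org
gnt-instance startup test-01.torproject.org
# renumber inside the guest, reboot, test; then, a week later, on gnt-chi:
gnt-instance remove test-01.torproject.org
```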
Planned Timeline
- November 2022
- W48: adopt TPA-RFC-43 proposal (tpo/tpa/team#40929)
- W48: order servers (team#40966 (closed))
- W48: confirm colo contract and access (team#40967 (closed))
- December 2022
- waiting for servers to ship
- W52: end of hardware support from Cymru
- W52: holidays
- January 2023
- W1: holidays
- W1: ideal: servers shipped (5 weeks, team#40966 (closed))
- W2: gnt-dal cluster physical setup and burn-in (team#40969 (closed))
- W3: gnt-dal cluster software setup (team#40970 (closed))
- W3-W4: gnt-dal cluster ganeti setup (team#40971 (closed)) and mass VM migration from gnt-chi to gnt-dal (team#40972 (closed))
- W4: chi-node-14 transfer (team#40968 (closed))
- February 2023:
- W5: gnt-chi cluster retirement, ideal date (team#40973 (closed))
- W7: worst case: servers shipped (10 weeks, second week of February)
- March 2023:
- W12: worst case: full build
- W13: worst case: gnt-chi cluster retirement (end of March)
This timeline is similar to the TPA-RFC-43 timeline but was bumped forward by a week.
There was also a calculation error in the original timeline, which estimated that the 5-week lead time would give us servers in the second week of January (W2) when ordered in the second week of November (W47). The correct calculation, of course, is that the servers would have shipped in W52, the last week of December.
Actual timeline
The above timeline was the plan, kept for posterity. What follows is the revised plan, updated as delays accumulate.
- November 2022
- W48: adopted TPA-RFC-43 proposal (tpo/tpa/team#40929)
- W48: ordered servers (team#40966 (closed))
- December 2022
- W52: end of hardware support from Cymru
- W52: holidays
- W52: servers shipped (1 week in advance, team#40966 (closed))
- W48: confirm colo contract and access (team#40967 (closed))
- January 2023
- W1: holidays
- W4: confirmed colo contract and access (team#40967 (closed))
- February 2023:
- W7: gnt-dal cluster physical setup and burn-in (team#40969 (closed))
- March 2023
- W9: gnt-dal cluster software setup (team#40970 (closed))
- W10-W11: gnt-dal cluster ganeti setup (team#40971 (closed)) and mass VM migration from gnt-chi to gnt-dal (team#40972 (closed))
- W11: chi-node-14 transfer (team#40968 (closed))
- W12: gnt-chi cluster retirement, revised (worst case) date (team#40973 (closed))
- W12: worst case: full build
- W13: worst case: gnt-chi cluster retirement (end of March)
As of 2023-01-30, we are still inside the worst-case scenario build, but have passed the ideal retirement date already (W5, now planned 4 weeks late, mostly due to delays in the datacenter setup).
Update: as of 2023-03-02, we're still inside the worst-case scenario, but barely. We hope to complete the work by the end of March.
Update: the cluster migration was completed with chi-node-14 being migrated on April 5th, 2023. The gnt-chi cluster retirement was technically complete by the end of March (the last node was retired on March 29th), but dragged on for a while longer as Cymru decommissioned the actual hardware. We can still consider ourselves within the worst-case scenario, as we managed to bring the new cluster online in a reasonable timeframe.
Results
The new cluster (gnt-dal) is now online and the old cluster (gnt-chi) has been retired.
We have seen a slight decrease in global memory capacity: in the memory graphs, the peak was at 5.15TB and it settled at 4.90TB; the big dip in the middle was the chi-node-14 move.
The new cluster brought in an extra 1.6TB of memory, but we removed more than we added, so we now have 250GB less memory than before. That said, it might be possible to upgrade the memory in the servers: each one currently has 512GB of RAM and could be bumped up to 1TB. We could also add a new node.
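Spelled out, and assuming all three gnt-dal nodes eventually get the maximum memory:

$$
\begin{aligned}
5.15\,\mathrm{TB} - 4.90\,\mathrm{TB} &= 0.25\,\mathrm{TB} \approx 250\,\mathrm{GB} \text{ lost overall} \\
3 \times (1\,\mathrm{TB} - 512\,\mathrm{GB}) &= 1{,}536\,\mathrm{GB} \approx 1.5\,\mathrm{TB} \text{ of upgrade headroom}
\end{aligned}
$$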
Because most of the old storage (72TiB after RAID-1) was hidden behind the SAN, the loss does not show in the graphs, but it is also a non-negligible impact of the migration: we have "just" 21TiB of storage in the new cluster. That said, all of that storage is "fast" (SSD or NVMe), and it's still possible to add "slow" storage (HDDs) to the servers, which have empty drive trays. In theory, we could add another ~60TiB of storage with 18TiB HDD pairs in each server, returning to the original capacity.