trusted high performance cluster (gnt-dal migration)
As part of the OKR 2022 Q1/Q2 plan, we have set the following key results in this objective:
- establish a new PoP on the US west coast with trusted partners and hardware ($$)
- retire moly and move the DNS server to the new cluster (team#29974 (closed))
- reduce VM deployment time to one hour or less (currently 2 hours) (team#31239)
This was originally due, like the other OKRs, for July 2022, but was postponed to February 2023.
The actual budget for this migration was established in TPA-RFC-40 (tpo/tpa/team#40897), and the migration plan in TPA-RFC-43 (tpo/tpa/team#40929).
Migration plan
Here's a working copy of the migration plan so we can attach issues and modify it without having to redo TPA-RFC-43:
New colocation facility access
In this step, we pick the colocation provider and establish contact.
- get credentials for OOB management
- get address to ship servers
- get emergency/support contact information
This step needs to happen before the following steps are completed (at least the "servers shipping" step).
Followup in team#40967 (closed).
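Once the OOB credentials come in, a quick sanity check of remote access is worth doing before anything ships. A minimal sketch, assuming an IPMI-capable BMC; the hostname and user below are placeholders, not the real ones:

```
# confirm the BMC answers and reports a sane chassis state (placeholder host/user, -a prompts for the password)
ipmitool -I lanplus -H dal-node-01-oob.example.org -U tpa -a chassis status
# also confirm serial-over-LAN console access works
ipmitool -I lanplus -H dal-node-01-oob.example.org -U tpa -a sol activate
```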
chi-node-14 transfer
This is essentially the work to transfer chi-node-14 to the new colocation facility.
- maintenance window announced to shadow people
- server shutdown in preparation for shipping
- server is shipped
- server is racked and connected
- server is renumbered and brought back online
- end of the maintenance window
This can happen in parallel with the following tasks, but after the colo setup above.
Followup in team#40968 (closed).
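Before the shutdown step above, it's worth confirming nothing still runs on the node. A rough sketch, with hostnames assumed rather than taken from the actual plan:

```
# check that chi-node-14 no longer hosts any instance, as primary or secondary
gnt-instance list -o name,pnode,snodes | grep chi-node-14
# if anything is left, migrate it away first, for example:
#   gnt-node migrate -f chi-node-14.torproject.org
# then mark the node offline in the cluster and power it down for shipping
gnt-node modify --offline=yes chi-node-14.torproject.org
ssh chi-node-14.torproject.org poweroff
```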
New hardware deployment
- budget approval (TPA-RFC-40 is standard)
- server selection is confirmed
- servers are ordered (team#40966 (closed))
- servers are shipped (team#40966 (closed))
- servers are racked and connected (team#40969 (closed))
- burn-in (team#40969 (closed))
At the end of this step, the three servers are built, shipped, connected, and remotely available for install, but not installed just yet.
This step can happen in parallel with the chi-node-14 transfer and the software migration preparation.
Followup in team#40969 (closed).
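The burn-in step could look something like this on each new machine once it is reachable; the exact tools are just one option here, nothing in the plan prescribes them:

```
# exercise CPU and memory for a day
stress-ng --cpu 0 --vm 4 --vm-bytes 80% --timeout 24h --metrics-brief
# kick off long SMART self-tests on the SATA drives
for disk in /dev/sd?; do smartctl -t long "$disk"; done
# once the self-tests finish, check the overall health verdicts
for disk in /dev/sd? /dev/nvme?; do smartctl -H "$disk"; done
```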
Software migration preparation
This can happen in parallel with the previous tasks.
This step was rewritten in team#40970 (closed).
Cluster configuration
This needs all the previous steps (except the chi-node-14 transfer) to be done before it can go ahead.
This is split between the Ganeti base install (team#40971 (closed)) and mass migration (team#40972 (closed)).
The third node can be installed in parallel with step 4 (team#40970 (closed)) and the later steps.
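For reference, the Ganeti base install boils down to something like the following; the hostnames, network device and volume group name are illustrative guesses, not the actual configuration:

```
# on the first node: create the new cluster
gnt-cluster init \
  --enabled-hypervisors=kvm \
  --master-netdev=eth0 \
  --vg-name=vg_ganeti \
  gnt-dal.torproject.org
# add the second node; the third can join later, as noted above
gnt-node add dal-node-02.torproject.org
# sanity-check the resulting cluster
gnt-cluster verify
```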
Single VM migration example
A single VM migration may look something like this:
1. instance stopped on source node
2. instance exported on source node
3. instance imported on target node
4. instance started
5. instance renumbered
6. instance rebooted
7. old instance destroyed after 7 days

If the mass-migration process works, steps 1-4 can happen in parallel, and operators basically only have to renumber the instances and test.
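Done by hand with the stock Ganeti tools, and with placeholder instance and node names (the mass-migration tooling may well do this differently), that could look roughly like:

```
# on the source cluster (gnt-chi): stop and export the instance
gnt-instance shutdown test-01.torproject.org
gnt-backup export -n chi-node-01.torproject.org test-01.torproject.org
# copy the export (it lands under /var/lib/ganeti/export/ by default) to a gnt-dal
# node, then, on the target cluster, import and start it:
gnt-backup import -t drbd \
  -n dal-node-01.torproject.org:dal-node-02.torproject.org \
  --src-node=dal-node-01.torproject.org \
  test-01.torproject.org
gnt-instance startup test-01.torproject.org
# renumber inside the guest, reboot, test; then, a week later, on gnt-chi:
gnt-instance remove test-01.torproject.org
```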
Planned Timeline
- November 2022
- W48: adopt TPA-RFC-43 proposal (tpo/tpa/team#40929)
- W48: order servers (team#40966 (closed))
- W48: confirm colo contract and access (team#40967 (closed))
- December 2022
- waiting for servers to ship
- W52: end of hardware support from Cymru
- W52: holidays
- January 2023
- W1: holidays
- W1: ideal: servers shipped (5 weeks, team#40966 (closed))
- W2: gnt-dal cluster physical setup and burn-in (team#40969 (closed))
- W3: gnt-dal cluster software setup (team#40970 (closed))
- W3-W4: gnt-dal cluster ganeti setup (team#40971 (closed)) and mass VM migration from gnt-chi to gnt-dal (team#40972 (closed))
- W4: chi-node-14 transfer (team#40968 (closed))
- February 2023:
- W5: gnt-chi cluster retirement, ideal date (team#40973 (closed))
- W7: worst case: servers shipped (10 weeks, second week of February)
- March 2023:
- W12: worst case: full build
- W13: worst case: gnt-chi cluster retirement (end of March)
This timeline is similar to the TPA-RFC-43 timeline but was bumped forward by a week.
There was also a calculation error in the original timeline, which estimated that the 5-week lead time would give us servers in the second week of January (W2) when ordered in the second week of November (W47). The correct calculation, of course, is that the servers would have shipped in W52, the last week of December.
Actual timeline
The above timeline was the plan, kept for posterity. What follows is the revised plan, updated as delays accumulate.
- November 2022
- W48: adopted TPA-RFC-43 proposal (tpo/tpa/team#40929)
- W48: ordered servers (team#40966 (closed))
- December 2022
- W52: end of hardware support from Cymru
- W52: holidays
- W52: servers shipped (1 week in advance, team#40966 (closed))
- W48: confirm colo contract and access (team#40967 (closed))
- January 2023
- W1: holidays
- W4: confirmed colo contract and access (team#40967 (closed))
- February 2023:
- W7: gnt-dal cluster physical setup and burn-in (team#40969 (closed))
- March 2023
- W9: gnt-dal cluster software setup (team#40970 (closed))
- W10-W11: gnt-dal cluster ganeti setup (team#40971 (closed)) and mass VM migration from gnt-chi to gnt-dal (team#40972 (closed))
- W11: chi-node-14 transfer (team#40968 (closed))
- W12: gnt-chi cluster retirement, revised (worst case) date (team#40973 (closed))
- W12: worst case: full build
- W13: worst case: gnt-chi cluster retirement (end of March)
As of 2023-01-30, we are still inside the worst-case scenario build, but have passed the ideal retirement date already (W5, now planned 4 weeks late, mostly due to delays in the datacenter setup).
Update: as of 2023-03-02, we're still inside the worst-case scenario, but barely. We hope to complete the work by the end of March.
Update: the cluster migration was completed with chi-node-14 being migrated on April 5th, 2023. The gnt-chi cluster retirement was technically complete by the end of March (the last node was retired on March 29th), but dragged on for a while longer as Cymru decommissioned the actual hardware. We can still consider ourselves within the worst-case scenario, as we managed to bring the new cluster online in a reasonable timeframe.
Results
The new cluster (gnt-dal) is now online and the old cluster (gnt-chi) has been retired.
We have seen a slight decrease in global memory capacity: in the memory graphs, the peak was at 5.15TB and it settled at 4.90TB; the big dip in the middle was the chi-node-14 move.
The new cluster brought in an extra 1.6TB of memory, but we removed more than we added, so we now have 250GB less memory than before. That said, it might be possible to upgrade the memory in the servers: each one currently has 512GB of RAM and could be bumped up to 1TB. We could also add a new node.
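Spelled out, and assuming all three gnt-dal nodes eventually get the maximum memory:

$$
\begin{aligned}
5.15\,\mathrm{TB} - 4.90\,\mathrm{TB} &= 0.25\,\mathrm{TB} \approx 250\,\mathrm{GB} \text{ lost overall} \\
3 \times (1\,\mathrm{TB} - 512\,\mathrm{GB}) &= 1{,}536\,\mathrm{GB} \approx 1.5\,\mathrm{TB} \text{ of upgrade headroom}
\end{aligned}
$$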
Because most of the old storage (72TiB after RAID-1) was hidden behind the SAN, the loss does not show in the graphs, but it is also a non-negligible impact of the migration: we have "just" 21TiB of storage in the new cluster. That said, all of that storage is "fast" (SSD or NVMe), and it's still possible to add "slow" storage (HDDs) to the servers, which have empty drive trays. In theory, we could add another ~60TiB of storage with 18TiB HDD pairs in each server, returning to the original capacity.