Verified commit 573651d9, authored by anarcat

cymru migration proposal

parent ffe2e171
@@ -22,7 +22,7 @@ and add it to the above list.
* [TPA-RFC-37: Lektor replacement](policy/tpa-rfc-37-lektor-replacement)
* [TPA-RFC-38: Setting Up a Wiki Service](policy/tpa-rfc-38-new-wiki-service)
* [TPA-RFC-39: Nextcloud account policy](policy/tpa-rfc-39-nextcloud-account-policy)
-* TPA-RFC-40: reserved, see tpo/tpa/team#40897
+* [TPA-RFC-40: Cymru migration](policy/tpa-rfc-40-cymru-migration)
## Proposed
......
---
title: TPA-RFC-40: Cymru migration
---
[[_TOC_]]
Summary: TODO
# Background
We have decided to move all services away from Team Cymru
infrastructure.
This proposal discusses various alternatives, which fall into
three broad classes:
* self-hosting: we own the hardware (bought or donated) and have someone
set it up in a colo facility
* dedicated hosting: we rent hardware, someone else manages it to our
spec
* cloud hosting: we don't bother with hardware at all and move
everything into virtual machine hosting managed by someone else
Some services (web mirrors) were already moved (to OVH cloud) and
might require a second move (into the new location, once one is
chosen). That's considered out of scope for now, but we do take those
resources into account in the planning.
## Inventory
In the Ganeti (`gnt-chi`) cluster, we have 12 machines hosting about
17 virtual machines, of which 14 must absolutely be migrated.
Those machines account for:
* memory: 262GB used out of 474GB allocated to VMs, including 300GB for a single runner
* CPUs: 78 vcores allocated
* Disk: 800GB disk allocated on SAS
* SAN: basically 1TB used, mostly for the two mirrors
* a /24 of IP addresses
* unlimited gigabit
* 2 private VLANs for management and data
This does not include:
* shadow simulator: 40 cores + 1.5TB RAM (`chi-node-14`)
* moly: another server considered negligible in terms of hardware (3
small VMs, one to rebuild)
## Colocation specifications
These are the specifications we are looking for in a colocation
provider:
- 4 to 6U rack space, with enough power to feed the machines above
- 1gbit or, ideally, 10gbit unmetered uplink
- IPv4: /24, or at least a /27 in the short term
- IPv6: we currently only have a /64
- out of band access (IPMI or serial)
- rescue systems (e.g. PXE booting)
- remote hands SLA ("how long to replace a broken hard drive?")
- private VLANs
# Proposal
## Goals
<!-- include bugs to be fixed -->
### Must have
### Nice to have
### Non-Goals
## Scope
## Affected users
# Personas
N/A?
# Alternatives considered
# Costs
## Self-hosting: ~12k$/year, 5-7 weeks
### Hardware: ~10k/year
* 3 big fat servers, each with:
  * at least two NICs (one public, one internal), 10gbit
  * hyper-convergent (e.g. we keep the current DRBD setup)
  * 25k$ AMD ryzen 64 cores, 512GB RAM, chassis, 20 bays (16 SATA, 4 NVMe)
  * 2k$ 2xNVMe 1TB, 2 free slots
  * 6k$ 6xSSD 2TB, 12 free slots
  * total storage per node, post-RAID: 7TB (1TB NVMe, 6TB SSD)
* ~33k$CAD per server or 25k$USD +- 5k$? times 3 = 75k$ +- 15k$
* total CPUs 192 cores (384 HT), 1.5TB RAM, 21TB storage, half of those for redundancy
* amortize over 7-8 years, so around 10k$/year for hardware
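
The amortization arithmetic above can be checked with a short sketch. The per-server price and margin are the figures from the list; the 7.5-year horizon is an assumption made here (the midpoint of the "7-8 years" range), and the function name is illustrative only.

```python
# Figures copied from the hardware list above.
SERVERS = 3
COST_PER_SERVER_USD = 25_000
COST_MARGIN_USD = 5_000
# Assumption of this sketch: midpoint of the "7-8 years" range.
AMORTIZATION_YEARS = 7.5


def yearly_hardware_cost(per_server, servers, years):
    """Spread the one-time purchase cost evenly over the amortization period."""
    return per_server * servers / years


total = COST_PER_SERVER_USD * SERVERS
margin = COST_MARGIN_USD * SERVERS
yearly = yearly_hardware_cost(COST_PER_SERVER_USD, SERVERS, AMORTIZATION_YEARS)
print(f"purchase: {total}$ +- {margin}$, yearly: {yearly:.0f}$")
```

Amortizing over 7 or 8 years instead moves the yearly figure to roughly 10.7k$ or 9.4k$, which is why the heading rounds to ~10k/year.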
### Colocation: 150$/mth or free
Still to be determined. 150$/mth is from [this source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891).
See also [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302) for other colo resources.
### Initial setup: one week
Ganeti cluster setup costs:
| Task | Estimate | Uncertainty | Total | Notes |
|---------------|----------|-------------|-------|---------------------|
| Node setup | 3 days | low | 3.3d | 1 d / machine |
| VLANs | 1 day | medium | 1.5d | could involve IPsec |
| Cluster setup | 0.5 day | low | 0.6d | |
| Total | 4.5 days | | 5.4d | |
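
The "Total" column appears to be the estimate multiplied by a factor that depends on the uncertainty level. The factors in this sketch are inferred from the tables in this proposal (for example "3 days, low" becomes 3.3d, and "1 day, extreme" becomes 5d); they are not stated explicitly anywhere, so treat them as an assumption.

```python
# Multipliers inferred from the estimate tables in this document;
# the proposal itself never spells them out.
UNCERTAINTY = {"low": 1.1, "medium": 1.5, "high": 2.0, "extreme": 5.0}


def padded(estimate_days, uncertainty):
    """Return the estimate padded by its uncertainty multiplier."""
    return estimate_days * UNCERTAINTY[uncertainty]


# Rows of the Ganeti cluster setup table: (task, estimate in days, uncertainty).
tasks = [
    ("Node setup", 3.0, "low"),
    ("VLANs", 1.0, "medium"),
    ("Cluster setup", 0.5, "low"),
]

raw_total = sum(days for _, days, _ in tasks)
padded_total = sum(padded(days, level) for _, days, level in tasks)
print(f"{raw_total} days raw, {padded_total:.2f} days padded")
```

The padded sum comes out at about 5.35 days. The table shows 5.4d because the cluster setup row is rounded up to 0.6d before summing.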
### Batch migration: 1-2 weeks, worst case full rebuild (4-6w)
We assume each VM will take 30 minutes of work to migrate which, if
all goes well, means that we can basically migrate all the machines in
one day of work.
It might take more time to do the actual transfers, but the assumption
is the work can be parallelized and therefore transfer rates are
non-blocking. So that "day" of work would actually be spread over a
week of time.
| Task | Estimate | Uncertainty | Total | Notes |
|-------------------------|----------|-------------|---------|----------------------------------|
| research and testing | 1 day | extreme | 5d | half a day of this already spent |
| total VM migration time | 1 day | extreme | 5d | |
| Total | 2 days | extreme | 10 days | |
There is a lot of variability in this estimate. It's possible that we
could batch migrate everything in one fell swoop and just have to do
manual tweaks in LDAP and inside the VM to reset IP addresses.
### Worst case: full rebuild, 3.5-4.5 weeks
The worst case here is a fall back to the full rebuild case that we
computed for the cloud, below.
To this, we need to add a "VM bootstrap" cost. I'd say 1 hour per VM,
with medium uncertainty in Ganeti, so 1.5h per VM or ~22h in Ganeti (~3
days).
## Dedicated hosting: 2-6k$/mth, 7+ weeks
### OVH cloud: 2.6k$/mth
https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/
- AMD Epyc 7763 - 64c/128t - 2.45GHz/3.5GHz
- 2x SSD SATA 480GB
- 512GB RAM
- 2× 1.92TB SSD NVMe + 2× 6TB HDD SATA Soft RAID
- 1Gbit/s unmetered and guaranteed
- 6Gbit/s local
- **back order in americas**
- 1 192,36$CAD/mth (871USD) with a 12mth commit
- for 3 servers: 3577CAD or 2615USD/mth
### DataPacket: 6k$/mth
https://www.datapacket.com/pricing
* AMD EPYC 7702P 64 Cores, 128 Threads, 2 GHz
* 2x2TB NVME
* 512GB RAM
* 1gbps unmetered
* 2020$USD / mth
* for 3 servers: ~6000USD/mth
* ashburn virginia
### Scaleway: 3k$/mth
- 2x AMD EPYC 7532 32C/64T - 2.4 GHz
- 1024 GB RAM
- 2 x 1.92 TB NVMe
- Up to 1 Gbps
- €1,039.99/month
- for 3 servers: ~3000USD/mth
- **only europe**
### Migration costs: 7+ weeks
We haven't estimated the migration costs here, but we assume those
will be similar to the self-hosting scenario.
## Cloud hosting: 3-22k$/mth, 5-11 weeks
### Hardware costs: 3k-22k$/mth
Let's assume we need at minimum 80 vcores and 300GB of memory, with
1TB of storage. This is likely an underestimation.
#### Amazon: 1-22+k$/mth
* 20x a1.xlarge (4 cores, 8GB memory) 998.78 USD/mth
* large runners are ridiculous: 1x r6g.12xlarge (48 CPUs, 384GB) 22,311.86USD (!!)
#### OVH cloud: 1.2k$/mth, small shadow
* 20x "comfort" (4 cores, 8GB, 28CAD/mth) = 80 cores, 160GB RAM, 400USD/mth
* 2x r2-240 (16 cores, 240GB, 1.1399$CAD/h) = 32 cores, 480GB RAM, 820USD/mth
* **cannot fully replace large runners, missing CPU cores**
#### Gandi VPS: 600$/mth, no shadow
* 20xV-R8 (4 cores, 8GB, 30EUR/mth) = 80 cores, 160GB RAM, ~600USD/mth
* **cannot replace large runners at all**
#### Scaleway: 3500$/mth
* 20x GP1-XS, 4 vCPUs, 16 GB, NVMe Local Storage or Block Storage on demand, 500 Mbit/s, From €0.08/hour, 1110USD/mth
* 1x ENT1-2XL: 96 cores, 384 GB RAM, Block Storage backend, Up to 20 Gbit/s BW, From €3.36/hour, 2333$USD/mth
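
As a rough cross-check of the offers above, the following sketch totals the vcores and memory of each bundle and tests whether any single instance could hold the largest runner. Instance counts and sizes are copied from the lists above; the runner size (30 vcores, 300GB) comes from `ci-runner-x86-05` in the appendix inventory. The `Flavor` class and helper names are illustrative, not part of any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flavor:
    name: str
    count: int
    vcores: int
    memory_gb: int


# Instance counts and sizes copied from the provider lists above.
OFFERS = {
    "Amazon": [Flavor("a1.xlarge", 20, 4, 8), Flavor("r6g.12xlarge", 1, 48, 384)],
    "OVH cloud": [Flavor("comfort", 20, 4, 8), Flavor("r2-240", 2, 16, 240)],
    "Gandi VPS": [Flavor("V-R8", 20, 4, 8)],
    "Scaleway": [Flavor("GP1-XS", 20, 4, 16), Flavor("ENT1-2XL", 1, 96, 384)],
}

# Minimum totals assumed in this section, and the size of the largest
# runner (ci-runner-x86-05) taken from the appendix inventory.
MIN_VCORES, MIN_MEMORY_GB = 80, 300
RUNNER_VCORES, RUNNER_MEMORY_GB = 30, 300


def totals(flavors):
    """Return (vcores, memory in GB) summed over all instances of a bundle."""
    vcores = sum(f.count * f.vcores for f in flavors)
    memory = sum(f.count * f.memory_gb for f in flavors)
    return vcores, memory


def fits_runner(flavors):
    """True if at least one single instance is as large as the big runner."""
    return any(
        f.vcores >= RUNNER_VCORES and f.memory_gb >= RUNNER_MEMORY_GB
        for f in flavors
    )


for provider, flavors in OFFERS.items():
    vcores, memory = totals(flavors)
    meets_minimum = vcores >= MIN_VCORES and memory >= MIN_MEMORY_GB
    print(f"{provider}: {vcores} vcores, {memory}GB, "
          f"minimum met: {meets_minimum}, large runner fits: {fits_runner(flavors)}")
```

Under these assumptions, the OVH cloud and Gandi bundles have no single instance large enough for the big runner, which matches the caveats noted in bold above.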
### Base setup 1-5 weeks
There are 15 machines to move to the cloud. How do we set them up?
Terraform? Click-click-click? This is a complete unknown.
Let's say 2 hours per machine, so 30 hours, which is a bit more than 4
days of 7 hours of work. With extreme uncertainty, we multiply that by
five, which is between 4 and 5 weeks.
### Base VM bootstrap cost 2-10 days
We estimate that setting up a machine takes a baseline of 1 hour per VM,
with extreme uncertainty, which means 1-5 hours per VM, so 15-75 hours
for the 15 VMs in the cloud.
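
The range in the heading follows from that estimate, as this small sketch shows. It assumes the 7-hour working day used elsewhere in this proposal and a factor of five for "extreme" uncertainty, which is inferred from the estimate tables rather than stated.

```python
VMS = 15
HOURS_PER_VM = 1
HOURS_PER_DAY = 7        # working day length used elsewhere in this proposal
EXTREME_MULTIPLIER = 5   # inferred from the estimate tables, not stated

low_hours = VMS * HOURS_PER_VM
high_hours = low_hours * EXTREME_MULTIPLIER
low_days = low_hours / HOURS_PER_DAY
high_days = high_hours / HOURS_PER_DAY
print(f"{low_hours}-{high_hours} hours, {low_days:.1f}-{high_days:.1f} days")
```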
### Full rebuild: 3-4 weeks
This is calculated based on "rebuild the whole VM from scratch".
| machine | estimate | uncertainty | total | notes |
|--------------------|----------|-------------|-------|---------------|
| btcpayserver-02 | 1 day | low | 1.1 | |
| ci-runner-01 | 0.5 day | low | 0.55 | |
| ci-runner-x86-05 | 0.5 day | low | 0.55 | |
| dangerzone-01 | 0.5 day | low | 0.55 | |
| gitlab-dev-01 | 1 day | low | 1.1 | optional |
| metrics-psqlts-01 | 1 day | high | 2 | |
| moria-haven-01 | N/A | | | to be retired |
| onionbalance-02 | 0.5 day | low | 0.55 | |
| probetelemetry-01 | 1 day | low | 1.1 | |
| rdsys-frontend-01 | 1 day | low | 1.1 | |
| static-gitlab-shim | 0.5 day | low | 0.55 | |
| survey-01 | 0.5 day | low | 0.55 | |
| tb-pkgstage-01 | 1 day | high | 2 | (unknown) |
| tb-tester-01 | 1 day | high | 2 | (unknown) |
| telegram-bot-01 | 1 day | low | 1.1 | |
| web-chi-03 | N/A | | | to be retired |
| web-chi-04 | N/A | | | to be retired |
| fallax | 3 days | medium | 4.5 | |
| build-x86-05 | N/A | | | to be retired |
| build-x86-06 | N/A | | | to be retired |
| Total | | | 19.3 | |
That's 15 VMs to migrate, 5 to be destroyed (total 20).
This is almost four weeks of full time work, generally low
uncertainty. This could possibly be reduced to 14 days (about three
weeks) if jobs are parallelized and if uncertainty around tb* machines
is reduced.
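
Both figures quoted above (19.3 padded days and 14 raw days) can be reproduced from the table. This sketch assumes the uncertainty multipliers inferred from the tables in this proposal and a 5-day working week; neither is stated explicitly in the text.

```python
# Multipliers inferred from the estimate tables; not stated in the proposal.
UNCERTAINTY = {"low": 1.1, "medium": 1.5, "high": 2.0}
WORK_DAYS_PER_WEEK = 5  # assumption of this sketch

# Rows of the full rebuild table, skipping machines marked "to be retired":
# (machine, estimate in days, uncertainty).
ESTIMATES = [
    ("btcpayserver-02", 1.0, "low"),
    ("ci-runner-01", 0.5, "low"),
    ("ci-runner-x86-05", 0.5, "low"),
    ("dangerzone-01", 0.5, "low"),
    ("gitlab-dev-01", 1.0, "low"),
    ("metrics-psqlts-01", 1.0, "high"),
    ("onionbalance-02", 0.5, "low"),
    ("probetelemetry-01", 1.0, "low"),
    ("rdsys-frontend-01", 1.0, "low"),
    ("static-gitlab-shim", 0.5, "low"),
    ("survey-01", 0.5, "low"),
    ("tb-pkgstage-01", 1.0, "high"),
    ("tb-tester-01", 1.0, "high"),
    ("telegram-bot-01", 1.0, "low"),
    ("fallax", 3.0, "medium"),
]

raw_days = sum(days for _, days, _ in ESTIMATES)
padded_days = sum(days * UNCERTAINTY[level] for _, days, level in ESTIMATES)
print(f"{len(ESTIMATES)} VMs: "
      f"{raw_days:g} days raw ({raw_days / WORK_DAYS_PER_WEEK:.1f} weeks), "
      f"{padded_days:.1f} days padded ({padded_days / WORK_DAYS_PER_WEEK:.1f} weeks)")
```

The 14-day figure is simply the sum of the unpadded estimates, so reaching it depends on the uncertainty not materializing.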
# Approval
ED.
# Deadline
ASAP.
# Status
This proposal is currently in the `draft` state.
# References
See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897).
# Appendix
## gnt-chi detailed inventory
root@chi-node-01:~# gnt-instance list --no-headers | wc -l
17
root@chi-node-01:~# gnt-instance list --no-headers -o name | sed 's/.torproject.org//'
btcpayserver-02
ci-runner-01
ci-runner-x86-05
dangerzone-01
gitlab-dev-01
metrics-psqlts-01
moria-haven-01
onionbalance-02
probetelemetry-01
rdsys-frontend-01
static-gitlab-shim
survey-01
tb-pkgstage-01
tb-tester-01
telegram-bot-01
web-chi-03
web-chi-04
root@chi-node-01:~# gnt-instance list -o name,be/vcpus,be/memory,disk_usage,disk_template
Instance ConfigVCPUs ConfigMaxMem DiskUsage Disk_template
btcpayserver-02.torproject.org 2 8.0G 82.4G drbd
ci-runner-01.torproject.org 8 64.0G 212.4G drbd
ci-runner-x86-05.torproject.org 30 300.0G 152.4G drbd
dangerzone-01.torproject.org 2 8.0G 12.2G drbd
gitlab-dev-01.torproject.org 2 8.0G 0M blockdev
metrics-psqlts-01.torproject.org 2 8.0G 32.4G drbd
moria-haven-01.torproject.org 2 8.0G 0M blockdev
onionbalance-02.torproject.org 2 2.0G 12.2G drbd
probetelemetry-01.torproject.org 8 4.0G 62.4G drbd
rdsys-frontend-01.torproject.org 2 8.0G 32.4G drbd
static-gitlab-shim.torproject.org 2 8.0G 32.4G drbd
survey-01.torproject.org 2 8.0G 32.4G drbd
tb-pkgstage-01.torproject.org 2 8.0G 112.4G drbd
tb-tester-01.torproject.org 2 8.0G 62.4G drbd
telegram-bot-01.torproject.org 2 8.0G 0M blockdev
web-chi-03.torproject.org 4 8.0G 0M blockdev
web-chi-04.torproject.org 4 8.0G 0M blockdev
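
The totals quoted in the Inventory section can be cross-checked against the listing above. This is a minimal sketch: the rows are copied from the `gnt-instance list` output with the `.torproject.org` suffix and the disk template column dropped, and the unit handling assumes Ganeti's `M`/`G`/`T` suffixes are binary multiples.

```python
# Rows copied from the listing above: name, vCPUs, max memory, disk usage.
LISTING = """\
btcpayserver-02 2 8.0G 82.4G
ci-runner-01 8 64.0G 212.4G
ci-runner-x86-05 30 300.0G 152.4G
dangerzone-01 2 8.0G 12.2G
gitlab-dev-01 2 8.0G 0M
metrics-psqlts-01 2 8.0G 32.4G
moria-haven-01 2 8.0G 0M
onionbalance-02 2 2.0G 12.2G
probetelemetry-01 8 4.0G 62.4G
rdsys-frontend-01 2 8.0G 32.4G
static-gitlab-shim 2 8.0G 32.4G
survey-01 2 8.0G 32.4G
tb-pkgstage-01 2 8.0G 112.4G
tb-tester-01 2 8.0G 62.4G
telegram-bot-01 2 8.0G 0M
web-chi-03 4 8.0G 0M
web-chi-04 4 8.0G 0M
"""

# Assumption: size suffixes are binary multiples, normalized here to "G".
UNIT_TO_GB = {"M": 1 / 1024, "G": 1, "T": 1024}


def to_gb(size):
    """Convert a size such as '8.0G' or '0M' to gigabytes."""
    return float(size[:-1]) * UNIT_TO_GB[size[-1]]


instances = 0
vcpus = 0
memory_gb = 0.0
disk_gb = 0.0
for row in LISTING.splitlines():
    _name, cpu, memory, disk = row.split()
    instances += 1
    vcpus += int(cpu)
    memory_gb += to_gb(memory)
    disk_gb += to_gb(disk)

print(f"{instances} instances, {vcpus} vcores, "
      f"{memory_gb:.0f}GB memory, {disk_gb:.1f}GB disk")
```

The vcore and memory totals match the 78 vcores and 474GB quoted in the inventory. The disk total is in the same range as the 800GB figure given there, and the `blockdev` instances count as zero because their storage lives on the SAN.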