... | ... | @@ -2,11 +2,19 @@ |
|
|
title: TPA-RFC-40: Cymru migration
|
|
|
---
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
Summary: buy a few large servers to move the Cymru machines out in a
|
|
|
Summary: buy a few large servers to move the Cymru machines to a
|
|
|
trusted colocation facility.
|
|
|
|
|
|
Note: this is a huge document. The executive summary is above; to see
|
|
|
more details of the proposals, jump to the "Proposal" section
|
|
|
below. A copy of this document is available in the TPA wiki:
|
|
|
|
|
|
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-40-cymru-migration
|
|
|
|
|
|
Here's a table of contents as well:
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
# Background
|
|
|
|
|
|
We have [decided][] to move all services away from Team Cymru
|
... | ... | @@ -31,6 +39,8 @@ account those resources in the planning. |
|
|
|
|
|
## Inventory
|
|
|
|
|
|
### gnt-chi
|
|
|
|
|
|
In the Ganeti (`gnt-chi`) cluster, we have 12 machines hosting about
|
|
|
17 virtual machines, of which 14 must absolutely be migrated.
|
|
|
|
... | ... | @@ -50,6 +60,25 @@ This does not include: |
|
|
* moly: another server considered negligible in terms of hardware (3
|
|
|
small VMs, one to rebuild)
|
|
|
|
|
|
### gnt-fsn
|
|
|
|
|
|
While we are not looking at replacing the existing gnt-fsn cluster,
|
|
|
it's still worthwhile to look at the capacity and usage there, in case
|
|
|
we need to replace that cluster as well, or grow the gnt-chi cluster
|
|
|
to similar usage.
|
|
|
|
|
|
* gnt-fsn has 4x10TB + 1x5TB HDD and 8x1TB NVMe (after RAID),
|
|
|
according to `gnt-node list-storage`, for a total of 45TB HDD, 8TB
|
|
|
NVMe after RAID
|
|
|
|
|
|
* out of that, around 17TB is in use (basically: `ssh fsn-node-02
|
|
|
gnt-node list-storage --no-header | awk '{print $5}' | sed 's/T/G *
|
|
|
1000/;s/G/Gbyte/;s/$/ + /' | qalc`), 13TB of which on HDD
|
|
|
|
|
|
* memory: ~500GB (8*62GB = 496GB), out of this 224GB is allocated
|
|
|
|
|
|
* cores: 48 (8*12 = 96 threads), out of this 107 vCPUs are allocated
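The shell pipeline above can be cross-checked with a short script. This is only a sketch: it reuses the `gnt-node list-storage` rows from the "gnt-fsn inventory" appendix, converts `T` to 1000 `G` as the pipeline does, and assumes that `vg_ganeti` is the NVMe volume group and `vg_ganeti_hdd` the HDD one.

```python
# Rows copied from the `gnt-node list-storage` output in the appendix.
LISTING = """\
fsn-node-01.torproject.org lvm-vg vg_ganeti 893.1G 469.6G 423.5G Y
fsn-node-01.torproject.org lvm-vg vg_ganeti_hdd 9.1T 1.9T 7.2T Y
fsn-node-02.torproject.org lvm-vg vg_ganeti 893.1G 495.2G 397.9G Y
fsn-node-02.torproject.org lvm-vg vg_ganeti_hdd 9.1T 4.4T 4.7T Y
fsn-node-03.torproject.org lvm-vg vg_ganeti 893.6G 333.8G 559.8G Y
fsn-node-03.torproject.org lvm-vg vg_ganeti_hdd 9.1T 2.5T 6.6T Y
fsn-node-04.torproject.org lvm-vg vg_ganeti 893.6G 586.3G 307.3G Y
fsn-node-04.torproject.org lvm-vg vg_ganeti_hdd 9.1T 3.0T 6.1T Y
fsn-node-05.torproject.org lvm-vg vg_ganeti 893.6G 431.5G 462.1G Y
fsn-node-06.torproject.org lvm-vg vg_ganeti 893.6G 446.1G 447.5G Y
fsn-node-07.torproject.org lvm-vg vg_ganeti 893.6G 775.7G 117.9G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti 893.6G 432.2G 461.4G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti_hdd 5.5T 1.3T 4.1T Y
"""

# Same convention as the sed expression: 1T is counted as 1000G.
UNITS_GB = {"G": 1.0, "T": 1000.0}


def to_gb(value):
    """Convert a size such as '469.6G' or '1.9T' to gigabytes."""
    return float(value[:-1]) * UNITS_GB[value[-1]]


def used_by_volume_group(listing):
    """Sum the 'Used' column (fifth field) per volume group name."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed rows
        vg_name, used = fields[2], fields[4]
        totals[vg_name] = totals.get(vg_name, 0.0) + to_gb(used)
    return totals


usage = used_by_volume_group(LISTING)
nvme_tb = usage["vg_ganeti"] / 1000      # assumed to be the NVMe tier
hdd_tb = usage["vg_ganeti_hdd"] / 1000   # assumed to be the HDD tier
total_tb = nvme_tb + hdd_tb
print(f"NVMe: {nvme_tb:.2f}TB, HDD: {hdd_tb:.2f}TB, total: {total_tb:.2f}TB")
```

With these rows the totals land close to the "around 17TB, 13TB of which on HDD" figure quoted above.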
|
|
|
|
|
|
## Colocation specifications
|
|
|
|
|
|
These are the specifications we are looking for in a colocation
|
... | ... | @@ -81,15 +110,69 @@ as little as two weeks of full time work. Lead time for server |
|
|
delivery and data transfers will prolong this significantly, with
|
|
|
total migration times from 4 to 8 weeks.
|
|
|
|
|
|
The actual proposal here is, formally, to approve the acquisition of
|
|
|
three physical servers, and the monthly cost of hosting them at a
|
|
|
colocation facility.
|
|
|
|
|
|
The price breakdown is as follows:
|
|
|
|
|
|
* hardware: 42k$ ±5k$, 8k$/year over 5 years, 6k$/year over 7 years,
|
|
|
or about 500-700$/mth, most likely 600$/mth (about 6 years
|
|
|
amortization)
|
|
|
* colo: 450-2000$/mth, most likely 600$/mth (4U at 150$/mth)
|
|
|
* total: 1000-2700$/mth, most likely 1200$/mth
|
|
|
* labor: 5-7 weeks full time
|
|
|
|
|
|
TODO: double-check that this makes sense with the costs below.
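As a rough aid for that check, the price breakdown can be reproduced with a few lines of arithmetic. This is a sketch, not an authoritative budget: it assumes straight-line amortization with no interest or resale value, and takes the 42k$ hardware figure and the 4U at 150$/U/mth colocation figure from the list above.

```python
HARDWARE_USD = 42_000    # mid-point of the 42k$ +/- 5k$ estimate
COLO_USD_PER_U = 150     # lowest quoted colocation price per rack unit
RACK_UNITS = 4           # "most likely" case in the breakdown


def monthly_amortization(total_usd, years):
    """Straight-line amortization: spread the purchase evenly over the period."""
    return total_usd / (years * 12)


for years in (5, 6, 7):
    per_month = monthly_amortization(HARDWARE_USD, years)
    print(f"{years} years: {per_month:.0f}$/mth, {per_month * 12:.0f}$/year")

colo_monthly = COLO_USD_PER_U * RACK_UNITS
likely_total = monthly_amortization(HARDWARE_USD, 6) + colo_monthly
print(f"colo: {colo_monthly}$/mth, likely total: {likely_total:.0f}$/mth")
```

Under these assumptions the hardware comes out at 500 to 700$/mth and the likely total at a little under 1200$/mth, which is in line with the rounded figures in the breakdown.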
|
|
|
|
|
|
## Goals
|
|
|
|
|
|
No must/nice/non-goals were actually set in this proposal, because it
|
|
|
was established in a rush.
|
|
|
|
|
|
## Risks
|
|
|
|
|
|
### Costs
|
|
|
|
|
|
This is the least expensive option, but possibly more risky in terms
|
|
|
of costs in the long term, as there are risks that a complete hardware
|
|
|
failure brings the service down and requires a costly replacement.
|
|
|
|
|
|
There's also a risk of extra labor required in migrating the services
|
|
|
around. We believe the risk of migrating to the cloud or another
|
|
|
hosted service is actually *higher*, however, because we wouldn't
|
|
|
control the mechanics of the hosting as well as with the proposed
|
|
|
colo providers.
|
|
|
|
|
|
In effect, we are betting that the cloud will not provide us with the
|
|
|
cost savings it promises, because we have massive CPU/memory (shadow),
|
|
|
and storage (GitLab, metrics, mirrors) requirements.
|
|
|
|
|
|
There is the possibility we are miscalculating because we are
|
|
|
calculating on the worst case scenario of full time shadow simulation
|
|
|
and CPU/memory usage, but on the other hand, we haven't explicitly
|
|
|
accounted for storage usage in the cloud solution, so we might be
|
|
|
underestimating costs there as well.
|
|
|
|
|
|
### Censorship and surveillance
|
|
|
|
|
|
There is a risk we might get censored more easily at a specialized
|
|
|
provider than at a general hosting provider like Hetzner, Amazon, or
|
|
|
OVH.
|
|
|
|
|
|
We balance that risk with the risk of increased surveillance and lack
|
|
|
of trust in commercial providers.
|
|
|
|
|
|
If push comes to shove, we can still spin up mirrors or services in
|
|
|
the cloud. And indeed, the anti-censorship and metrics teams are
|
|
|
already doing so.
|
|
|
|
|
|
# Costs
|
|
|
|
|
|
This section evaluates the cost of the three options, in broad
|
|
|
terms. More specific estimates will be established as we go along.
|
|
|
terms. More specific estimates will be established as we go along. For
|
|
|
now, the budget in the proposal is the actual proposal, and the costs
|
|
|
below should be considered details of the above proposal.
|
|
|
|
|
|
## Self-hosting: ~12k$/year, 5-7 weeks
|
|
|
|
... | ... | @@ -124,13 +207,65 @@ We would amortize this expense over 7-8 years, so around 10k$/year for |
|
|
hardware, assuming we would buy something similar (but obviously
|
|
|
probably better by then) every 7 to 8 years.
|
|
|
|
|
|
### Colocation: 900$-2100$/mth
|
|
|
#### Updated server spec: 42k$USD, ~8k$/yr over 5 years, 6k$/yr for 7yrs
|
|
|
|
|
|
Here's a more precise quote established on 2022-10-06 by lavamind:
|
|
|
|
|
|
Based on the server builder on <http://interpromicro.com> which is a
|
|
|
supplier Riseup has used in the past. Here's what I was able to find
|
|
|
out. We're able to cram our base requirements into a SuperMicro 1U
|
|
|
package with the following specs:
|
|
|
|
|
|
* [SuperMicro 1114CS-THR 1U][]
|
|
|
* AMD Milan (EPYC) 7713P 64C/128T @ 2.00GHz 256M cache
|
|
|
* 512G DDR4 RAM (8x64G)
|
|
|
* 6x Intel S4510 1.92T SATA3 SSD
|
|
|
* 2x Intel DC P4610 1.60T NVMe SSD
|
|
|
* AOC NIC 2x10GbE SFP+
|
|
|
* Quote: **13,645.25$USD**
|
|
|
|
|
|
For three such servers, we have:
|
|
|
|
|
|
* 192 cores, 384 threads
|
|
|
* 1536GB RAM (1.5TB)
|
|
|
* 34.56TB SSD storage (17TB after RAID-1)
|
|
|
* 9.6TB NVMe storage (4.8TB after RAID-1)
|
|
|
* Total: **40,936$USD**
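The three-server totals follow directly from the per-server quote. The sketch below recomputes them; the `ServerSpec` class and `cluster_totals` function are hypothetical names used only for illustration, and RAID-1 is assumed to halve the raw capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    cores: int
    threads: int
    ram_gb: int
    ssd_count: int
    ssd_tb: float
    nvme_count: int
    nvme_tb: float
    price_usd: float


# Figures from the quoted 1U build above.
BASE = ServerSpec(cores=64, threads=128, ram_gb=512,
                  ssd_count=6, ssd_tb=1.92,
                  nvme_count=2, nvme_tb=1.60,
                  price_usd=13_645.25)


def cluster_totals(spec, count):
    """Aggregate capacity and price for `count` identical servers."""
    ssd_raw = count * spec.ssd_count * spec.ssd_tb
    nvme_raw = count * spec.nvme_count * spec.nvme_tb
    return {
        "cores": count * spec.cores,
        "threads": count * spec.threads,
        "ram_gb": count * spec.ram_gb,
        "ssd_raw_tb": round(ssd_raw, 2),
        "ssd_raid1_tb": round(ssd_raw / 2, 2),    # RAID-1 mirrors every drive
        "nvme_raw_tb": round(nvme_raw, 2),
        "nvme_raid1_tb": round(nvme_raw / 2, 2),
        "price_usd": round(count * spec.price_usd, 2),
    }


totals = cluster_totals(BASE, 3)
for key, value in totals.items():
    print(f"{key}: {value}")
```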
|
|
|
|
|
|
At this price range we could likely afford to throw in a few extras:
|
|
|
|
|
|
* Double amount of RAM (1T total) +2,877
|
|
|
* Double SATA3 SSD capacity with 3.84T drives +2,040
|
|
|
* Double NVMe SSD capacity with 3.20T drives +814
|
|
|
* Switch to faster AMD Milan (EPYC) 75F3 32C/64T @ 2.95GHz +186
|
|
|
|
|
|
There are also comparable 2U chassis with 3.5" drive bays, but since
|
|
|
we use only 2.5" drives it doesn't make much sense unless we really
|
|
|
want a system with 2 CPU sockets. Such a system would cost an
|
|
|
additional ~6,000$USD depending on the model of CPU we end up
|
|
|
choosing, bringing us closer to the initial ballpark number, above.
|
|
|
|
|
|
Exact prices are still to be determined. 150$/U/mth (900$/mth for 6U)
|
|
|
figure is from [this source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891) (confidential). There's [another
|
|
|
quote](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2840427) at 350$/U/mth (2100$/mth).
|
|
|
Considering that the base build would have enough capacity to host
|
|
|
*both* gnt-chi (800GB) and gnt-fsn (17TB, including 13TB on HDD and
|
|
|
4TB on NVMe), it seems like a sufficient build.
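A quick way to sanity-check that claim is to compare usable capacity against current usage per storage tier. This is a sketch under stated assumptions: data currently on HDD would move to the SATA SSD tier, NVMe data stays on NVMe, the 800GB from gnt-chi is placed on the SSD tier, and RAID-1 halves raw capacity. It ignores growth and DRBD replication overhead.

```python
# Usable capacity of the three-server base build, after RAID-1 (TB).
USABLE_TB = {"ssd": 34.56 / 2, "nvme": 9.6 / 2}

# Current usage (TB): gnt-fsn HDD data plus gnt-chi on SSD (assumed mapping),
# gnt-fsn NVMe data on NVMe.
DEMAND_TB = {"ssd": 13.0 + 0.8, "nvme": 4.0}

headroom = {tier: USABLE_TB[tier] - DEMAND_TB[tier] for tier in USABLE_TB}
fits = all(free >= 0 for free in headroom.values())

for tier, free in headroom.items():
    print(f"{tier}: {free:.2f}TB free of {USABLE_TB[tier]:.2f}TB usable")
print("fits" if fits else "does not fit")
```

Under these assumptions both tiers fit, with the NVMe tier being the tighter of the two.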
|
|
|
|
|
|
See also [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302) for other colo resources.
|
|
|
TODO: check the math above.
|
|
|
|
|
|
[SuperMicro 1114CS-THR 1U]: https://www.supermicro.com/en/Aplus/system/1U/1114/AS-1114CS-TNR.cfm
|
|
|
|
|
|
### Colocation: 450$-2100$/mth
|
|
|
|
|
|
Exact prices are still to be determined. 150$/U/mth (900$/mth for 6U,
|
|
|
600$/mth for 4U) figure is from [this source][]
|
|
|
(confidential). There's [another quote][] at 350$/U/mth (2100$/mth).
|
|
|
|
|
|
TODO: This needs to take into account chi-node-14, how many Us?
|
|
|
|
|
|
See also [this comment][] for other colo resources.
|
|
|
|
|
|
[this source]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891
|
|
|
[another quote]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2840427
|
|
|
[this comment]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302
|
|
|
|
|
|
### Initial setup: one week
|
|
|
|
... | ... | @@ -195,7 +330,7 @@ this setup is proving costly and complex. |
|
|
|
|
|
### OVH cloud: 2.6k$/mth
|
|
|
|
|
|
The [Scale 7](https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/) server seems like it could fit well for both
|
|
|
The [Scale 7][] server seems like it could fit well for both
|
|
|
simulations and general-purpose hosting:
|
|
|
|
|
|
- AMD Epyc 7763 - 64c/128t - 2.45GHz/3.5GHz
|
... | ... | @@ -208,9 +343,11 @@ simulations and general-purpose hosting: |
|
|
- 1 192,36$CAD/mth (871USD) with a 12mth commit
|
|
|
- **total**, for 3 servers: 3577CAD or 2615USD/mth
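The total line can be cross-checked from the per-server price. This sketch assumes the total is simply three times the monthly price with a 12-month commitment, with no setup fee or tax, and uses the CAD and USD figures quoted above.

```python
PRICE_CAD = 1192.36   # per server, per month, 12-month commitment
PRICE_USD = 871       # USD equivalent given above
SERVERS = 3

total_cad = round(PRICE_CAD * SERVERS, 2)
total_usd = PRICE_USD * SERVERS
implied_rate = PRICE_USD / PRICE_CAD   # USD per CAD implied by the two prices

print(f"total: {total_cad:.2f} CAD, {total_usd} USD/mth")
print(f"implied exchange rate: {implied_rate:.4f} USD per CAD")
```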
|
|
|
|
|
|
[Scale 7]: https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/
|
|
|
|
|
|
### Data packet: 6k$/mth
|
|
|
|
|
|
Data Packet also has AMD EPYC machines, see their [pricing page](https://www.datapacket.com/pricing):
|
|
|
Data Packet also has AMD EPYC machines, see their [pricing page][]:
|
|
|
|
|
|
* AMD EPYC 7702P 64 Cores, 128 Threads, 2 GHz
|
|
|
* 2x2TB NVMe
|
... | ... | @@ -220,6 +357,8 @@ Data Packet also has AMD EPYC machines, see their [pricing page](https://www.dat |
|
|
* Ashburn, Virginia
|
|
|
* **total**, for 3 servers: 6000USD/mth
|
|
|
|
|
|
[pricing page]: https://www.datapacket.com/pricing
|
|
|
|
|
|
### Scaleway: 3k$/mth
|
|
|
|
|
|
Scaleway also has EPYC machines, but only in Europe:
|
... | ... | @@ -347,21 +486,27 @@ is reduced. |
|
|
|
|
|
# Approval
|
|
|
|
|
|
This will need to be approved by TPA and the TPI executive director.
|
|
|
This will need to be approved by these entities, in sequence:
|
|
|
|
|
|
1. [ ] TPA
|
|
|
2. [ ] accounting
|
|
|
3. [ ] executive director
|
|
|
|
|
|
# Deadline
|
|
|
|
|
|
ASAP.
|
|
|
* TPA: ASAP, end of day as of this message
|
|
|
* accounting, ED: at their leisure, but preferably by end of week or
|
|
|
month
|
|
|
|
|
|
# Status
|
|
|
|
|
|
This proposal is currently in the `draft` state.
|
|
|
This proposal is currently in the `proposed` state.
|
|
|
|
|
|
# References
|
|
|
|
|
|
See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897) for the discussion.
|
|
|
See [tpo/tpa/team#40897][] for the discussion ticket.
|
|
|
|
|
|
# Appendix
|
|
|
[tpo/tpa/team#40897]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897
|
|
|
|
|
|
## gnt-chi detailed inventory
|
|
|
|
... | ... | @@ -405,6 +550,21 @@ See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/408 |
|
|
web-chi-03.torproject.org 4 8.0G 0M blockdev
|
|
|
web-chi-04.torproject.org 4 8.0G 0M blockdev
|
|
|
|
|
|
root@chi-node-01:~# gnt-node list-storage | sort
|
|
|
Node Type Name Size Used Free Allocatable
|
|
|
chi-node-01.torproject.org lvm-vg vg_ganeti 464.7G 447.1G 17.6G Y
|
|
|
chi-node-02.torproject.org lvm-vg vg_ganeti 464.7G 387.1G 77.6G Y
|
|
|
chi-node-03.torproject.org lvm-vg vg_ganeti 464.7G 457.1G 7.6G Y
|
|
|
chi-node-04.torproject.org lvm-vg vg_ganeti 464.7G 104.6G 360.1G Y
|
|
|
chi-node-06.torproject.org lvm-vg vg_ganeti 464.7G 269.1G 195.6G Y
|
|
|
chi-node-07.torproject.org lvm-vg vg_ganeti 1.4T 239.1G 1.1T Y
|
|
|
chi-node-08.torproject.org lvm-vg vg_ganeti 464.7G 147.0G 317.7G Y
|
|
|
chi-node-09.torproject.org lvm-vg vg_ganeti 278.3G 275.8G 2.5G Y
|
|
|
chi-node-10.torproject.org lvm-vg vg_ganeti 278.3G 251.3G 27.0G Y
|
|
|
chi-node-11.torproject.org lvm-vg vg_ganeti 464.7G 283.6G 181.1G Y
|
|
|
|
|
|
TODO: lavamind: inventory of storage on the SAN here please. :)
|
|
|
|
|
|
## moly inventory
|
|
|
|
|
|
| instance | memory | vCPU | disk |
|
... | ... | @@ -412,3 +572,64 @@ See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/408 |
|
|
| fallax | 512MiB | 1 | 4GB |
|
|
|
| build-x86-05 | 14GB | 6 | 90GB |
|
|
|
| build-x86-06 | 14GB | 6 | 90GB |
|
|
|
|
|
|
## gnt-fsn inventory
|
|
|
|
|
|
root@fsn-node-02:~# gnt-instance list -o name,be/vcpus,be/memory,disk_usage,disk_template
|
|
|
Instance ConfigVCPUs ConfigMaxMem DiskUsage Disk_template
|
|
|
alberti.torproject.org 2 4.0G 22.2G drbd
|
|
|
bacula-director-01.torproject.org 2 8.0G 262.4G drbd
|
|
|
carinatum.torproject.org 2 2.0G 12.2G drbd
|
|
|
check-01.torproject.org 4 4.0G 32.4G drbd
|
|
|
chives.torproject.org 1 1.0G 12.2G drbd
|
|
|
colchicifolium.torproject.org 4 16.0G 734.5G drbd
|
|
|
crm-ext-01.torproject.org 2 2.0G 24.2G drbd
|
|
|
crm-int-01.torproject.org 4 8.0G 164.4G drbd
|
|
|
cupani.torproject.org 2 2.0G 144.4G drbd
|
|
|
eugeni.torproject.org 2 4.0G 99.4G drbd
|
|
|
gayi.torproject.org 2 2.0G 74.4G drbd
|
|
|
gettor-01.torproject.org 2 1.0G 12.2G drbd
|
|
|
gitlab-02.torproject.org 8 16.0G 1.2T drbd
|
|
|
henryi.torproject.org 2 1.0G 32.4G drbd
|
|
|
loghost01.torproject.org 2 2.0G 61.4G drbd
|
|
|
majus.torproject.org 2 1.0G 32.4G drbd
|
|
|
materculae.torproject.org 2 8.0G 174.5G drbd
|
|
|
media-01.torproject.org 2 2.0G 312.4G drbd
|
|
|
meronense.torproject.org 4 16.0G 524.4G drbd
|
|
|
metrics-store-01.torproject.org 2 2.0G 312.4G drbd
|
|
|
neriniflorum.torproject.org 2 1.0G 12.2G drbd
|
|
|
nevii.torproject.org 2 1.0G 24.2G drbd
|
|
|
onionoo-backend-01.torproject.org 2 16.0G 72.4G drbd
|
|
|
onionoo-backend-02.torproject.org 2 16.0G 72.4G drbd
|
|
|
onionoo-frontend-01.torproject.org 4 4.0G 12.2G drbd
|
|
|
onionoo-frontend-02.torproject.org 4 4.0G 12.2G drbd
|
|
|
palmeri.torproject.org 2 1.0G 34.4G drbd
|
|
|
pauli.torproject.org 2 4.0G 22.2G drbd
|
|
|
perdulce.torproject.org 2 1.0G 524.4G drbd
|
|
|
polyanthum.torproject.org 2 4.0G 84.4G drbd
|
|
|
relay-01.torproject.org 2 8.0G 12.2G drbd
|
|
|
rude.torproject.org 2 2.0G 64.4G drbd
|
|
|
static-master-fsn.torproject.org 2 16.0G 832.5G drbd
|
|
|
staticiforme.torproject.org 4 6.0G 322.5G drbd
|
|
|
submit-01.torproject.org 2 4.0G 32.4G drbd
|
|
|
tb-build-01.torproject.org 8 16.0G 612.4G drbd
|
|
|
tbb-nightlies-master.torproject.org 2 2.0G 142.4G drbd
|
|
|
vineale.torproject.org 4 8.0G 124.4G drbd
|
|
|
web-fsn-01.torproject.org 2 4.0G 522.5G drbd
|
|
|
web-fsn-02.torproject.org 2 4.0G 522.5G drbd
|
|
|
|
|
|
root@fsn-node-02:~# gnt-node list-storage | sort
|
|
|
Node Type Name Size Used Free Allocatable
|
|
|
fsn-node-01.torproject.org lvm-vg vg_ganeti 893.1G 469.6G 423.5G Y
|
|
|
fsn-node-01.torproject.org lvm-vg vg_ganeti_hdd 9.1T 1.9T 7.2T Y
|
|
|
fsn-node-02.torproject.org lvm-vg vg_ganeti 893.1G 495.2G 397.9G Y
|
|
|
fsn-node-02.torproject.org lvm-vg vg_ganeti_hdd 9.1T 4.4T 4.7T Y
|
|
|
fsn-node-03.torproject.org lvm-vg vg_ganeti 893.6G 333.8G 559.8G Y
|
|
|
fsn-node-03.torproject.org lvm-vg vg_ganeti_hdd 9.1T 2.5T 6.6T Y
|
|
|
fsn-node-04.torproject.org lvm-vg vg_ganeti 893.6G 586.3G 307.3G Y
|
|
|
fsn-node-04.torproject.org lvm-vg vg_ganeti_hdd 9.1T 3.0T 6.1T Y
|
|
|
fsn-node-05.torproject.org lvm-vg vg_ganeti 893.6G 431.5G 462.1G Y
|
|
|
fsn-node-06.torproject.org lvm-vg vg_ganeti 893.6G 446.1G 447.5G Y
|
|
|
fsn-node-07.torproject.org lvm-vg vg_ganeti 893.6G 775.7G 117.9G Y
|
|
|
fsn-node-08.torproject.org lvm-vg vg_ganeti 893.6G 432.2G 461.4G Y
|
|
|
fsn-node-08.torproject.org lvm-vg vg_ganeti_hdd 5.5T 1.3T 4.1T Y |