Verified commit 583426c8 authored by anarcat

finish TPA-RFC-40 proposal, sent to TPA

parent 166659e1
......@@ -21,7 +21,6 @@ and add it to the above list.
* [TPA-RFC-36: Gitolite, GitWeb retirement](policy/tpa-rfc-36-gitolite-gitweb-retirement)
* [TPA-RFC-37: Lektor replacement](policy/tpa-rfc-37-lektor-replacement)
* [TPA-RFC-38: Setting Up a Wiki Service](policy/tpa-rfc-38-new-wiki-service)
* [TPA-RFC-40: Cymru migration](policy/tpa-rfc-40-cymru-migration)
* [TPA-RFC-41: Schleuder retirement](policy/tpa-rfc-41-schleuder-retirement)
## Proposed
......@@ -29,6 +28,7 @@ and add it to the above list.
<!-- No policy is currently `proposed`. -->
* [TPA-RFC-39: Nextcloud account policy](policy/tpa-rfc-39-nextcloud-account-policy)
* [TPA-RFC-40: Cymru migration](policy/tpa-rfc-40-cymru-migration)
## Standard
......
......@@ -2,11 +2,19 @@
title: TPA-RFC-40: Cymru migration
---
[[_TOC_]]
Summary: buy a few large servers to move the Cymru machines out in a
Summary: buy a few large servers to move the Cymru machines to a
trusted colocation facility.
Note: this is a huge document. The executive summary is above; for
more details on the proposal, jump to the "Proposal" section
below. A copy of this document is available in the TPA wiki:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-40-cymru-migration
Here's a table of contents as well:
[[_TOC_]]
# Background
We have [decided][] to move all services away from Team Cymru
......@@ -31,6 +39,8 @@ account those resources in the planning.
## Inventory
### gnt-chi
In the Ganeti (`gnt-chi`) cluster, we have 12 machines hosting about
17 virtual machines, of which 14 must absolutely be migrated.
......@@ -50,6 +60,25 @@ This does not include:
* moly: another server considered negligible in terms of hardware (3
small VMs, one to rebuild)
### gnt-fsn
While we are not looking at replacing the existing gnt-fsn cluster,
it's still worthwhile to look at the capacity and usage there, in case
we need to replace that cluster as well, or grow the gnt-chi cluster
to similar usage.
* gnt-fsn has 4x10TB + 1x5TB HDD and 8x1TB NVMe, according to
`gnt-node list-storage`, for a total of 45TB HDD and 8TB NVMe after
RAID
* out of that, around 17TB is in use (basically: `ssh fsn-node-02
gnt-node list-storage --no-header | awk '{print $5}' | sed 's/T/G *
1000/;s/G/Gbyte/;s/$/ + /' | qalc`), 13TB of which on HDD
* memory: ~500GB (8*62GB = 496GB), out of this 224GB is allocated
* cores: 48 (8*12 = 96 threads), out of this 107 vCPUs are allocated
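
The usage figure above comes from a shell pipeline feeding `qalc`. As a cross-check, here is a minimal Python sketch of the same sum, run against the `gnt-node list-storage` output reproduced in the gnt-fsn inventory in the appendix. The function names are hypothetical, and like the pipeline it assumes decimal units (1T = 1000G).

```python
# Hypothetical helper names; sizes use decimal units (1T = 1000G), the same
# assumption the sed/qalc pipeline makes.
SAMPLE = """\
Node Type Name Size Used Free Allocatable
fsn-node-01.torproject.org lvm-vg vg_ganeti 893.1G 469.6G 423.5G Y
fsn-node-01.torproject.org lvm-vg vg_ganeti_hdd 9.1T 1.9T 7.2T Y
fsn-node-02.torproject.org lvm-vg vg_ganeti 893.1G 495.2G 397.9G Y
fsn-node-02.torproject.org lvm-vg vg_ganeti_hdd 9.1T 4.4T 4.7T Y
fsn-node-03.torproject.org lvm-vg vg_ganeti 893.6G 333.8G 559.8G Y
fsn-node-03.torproject.org lvm-vg vg_ganeti_hdd 9.1T 2.5T 6.6T Y
fsn-node-04.torproject.org lvm-vg vg_ganeti 893.6G 586.3G 307.3G Y
fsn-node-04.torproject.org lvm-vg vg_ganeti_hdd 9.1T 3.0T 6.1T Y
fsn-node-05.torproject.org lvm-vg vg_ganeti 893.6G 431.5G 462.1G Y
fsn-node-06.torproject.org lvm-vg vg_ganeti 893.6G 446.1G 447.5G Y
fsn-node-07.torproject.org lvm-vg vg_ganeti 893.6G 775.7G 117.9G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti 893.6G 432.2G 461.4G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti_hdd 5.5T 1.3T 4.1T Y
"""

UNITS_IN_GB = {"M": 0.001, "G": 1.0, "T": 1000.0}


def parse_size(text):
    """Convert a size such as '469.6G' or '1.9T' to gigabytes."""
    return float(text[:-1]) * UNITS_IN_GB[text[-1]]


def used_by_volume_group(listing):
    """Sum the Used column (fifth field) per volume group name."""
    totals = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 7 or fields[0] == "Node":
            continue  # skip the header and anything that is not a data row
        name, used = fields[2], fields[4]
        totals[name] = totals.get(name, 0.0) + parse_size(used)
    return totals


if __name__ == "__main__":
    totals = used_by_volume_group(SAMPLE)
    for name, gigabytes in sorted(totals.items()):
        print(f"{name}: {gigabytes / 1000:.1f} TB used")
    print(f"total: {sum(totals.values()) / 1000:.1f} TB used")
```

On that sample, this gives roughly 13TB used on HDD (`vg_ganeti_hdd`) and 4TB on NVMe (`vg_ganeti`), about 17TB in total, consistent with the figures above.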
## Colocation specifications
These are the specifications we are looking for in a colocation
......@@ -81,15 +110,69 @@ as little as two weeks of full time work. Lead time for server
delivery and data transfers will prolong this significantly, with
total migration times from 4 to 8 weeks.
The actual proposal here is, formally, to approve the acquisition of
three physical servers, and the monthly cost of hosting them at a
colocation facility.
The price breakdown is as follows:
* hardware: 42k$ ±5k$, 8k$/year over 5 years, 6k$/year over 7 years,
or about 500-700$/mth, most likely 600$/mth (about 6 years
amortization)
* colo: 450-2000$/mth, most likely 600$/mth (4U at 150$/mth)
* total: 1000-2700$/mth, most likely 1200$/mth
* labor: 5-7 weeks full time
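
As a rough sanity check on the breakdown above, the following Python sketch recomputes the monthly figures from the 42k$ hardware estimate and the colocation range. The function name is hypothetical, and straight-line amortization with no resale value is an assumption, not something the proposal states.

```python
def monthly_hardware_cost(purchase_price, years):
    """Straight-line amortization: spread the purchase price over `years`."""
    return purchase_price / (years * 12)


HARDWARE = 42_000  # $, hardware estimate from the price breakdown
COLO_LOW, COLO_LIKELY, COLO_HIGH = 450, 600, 2000  # $/mth, colo range

# Best case: longest amortization, cheapest colo; worst case: the opposite.
low = monthly_hardware_cost(HARDWARE, 7) + COLO_LOW
likely = monthly_hardware_cost(HARDWARE, 6) + COLO_LIKELY
high = monthly_hardware_cost(HARDWARE, 5) + COLO_HIGH

if __name__ == "__main__":
    print(f"low: {low:.0f}$/mth, likely: {likely:.0f}$/mth, high: {high:.0f}$/mth")
```

This yields about 950$, 1183$ and 2700$ per month, in line with the rounded 1000-2700$/mth range and the 1200$/mth most-likely figure.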
TODO: double-check that makes sense with the costs below.
## Goals
No must/nice/non-goals were actually set in this proposal, because it
was established in a rush.
## Risks
### Costs
This is the least expensive option, but possibly riskier in terms of
long-term costs: a complete hardware failure could bring the service
down and require a costly replacement.
There's also a risk of extra labor required in migrating the services
around. We believe the risk of migrating to the cloud or another
hosted service is actually *higher*, however, because we wouldn't
control the mechanics of the hosting as well as with the proposed
colo providers.
In effect, we are betting that the cloud will not provide us with the
cost savings it promises, because we have massive CPU/memory (shadow),
and storage (GitLab, metrics, mirrors) requirements.
It is possible we are miscalculating, because our estimates are based
on the worst-case scenario of full-time shadow simulation and
CPU/memory usage. On the other hand, we haven't explicitly accounted
for storage usage in the cloud solution, so we might be
underestimating costs there as well.
### Censorship and surveillance
There is a risk we might get censored more easily at a specialized
provider than at a general hosting provider like Hetzner, Amazon, or
OVH.
We balance that risk with the risk of increased surveillance and lack
of trust in commercial providers.
If push comes to shove, we can still spin up mirrors or services in
the cloud. And indeed, the anti-censorship and metrics teams are
already doing so.
# Costs
This section evaluates the cost of the three options, in broad
terms. More specific estimates will be established as we go along.
terms. More specific estimates will be established as we go along. For
now, the budget in the proposal is the actual proposal, and the costs
below should be considered details of the above proposal.
## Self-hosting: ~12k$/year, 5-7 weeks
......@@ -124,13 +207,65 @@ We would amortize this expense over 7-8 years, so around 10k$/year for
hardware, assuming we would buy something similar (but obviously
probably better by then) every 7 to 8 years.
### Colocation: 900$-2100$/mth
#### Updated server spec: 42k$USD, ~8k$/yr over 5 years, 6k$/yr for 7yrs
Here's a more precise quote established on 2022-10-06 by lavamind:
Based on the server builder on <http://interpromicro.com> which is a
supplier Riseup has used in the past. Here's what I was able to find
out. We're able to cram our base requirements into a SuperMicro 1U
package with the following specs:
* [SuperMicro 1114CS-THR 1U][]
* AMD Milan (EPYC) 7713P 64C/128T @ 2.00GHz 256M cache
* 512G DDR4 RAM (8x64G)
* 6x Intel S4510 1.92T SATA3 SSD
* 2x Intel DC P4610 1.60T NVMe SSD
* AOC NIC 2x10GbE SFP+
* Quote: **13,645.25$USD**
For three such servers, we have:
* 192 cores, 384 threads
* 1536GB RAM (1.5TB)
* 34.56TB SSD storage (17TB after RAID-1)
* 9.6TB NVMe storage (4.8TB after RAID-1)
* Total: **40,936$USD**
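
The three-server totals can be recomputed from the per-server spec with a few lines of Python. This is only a sketch for checking the arithmetic; the dictionary keys and function name are hypothetical, and halving raw capacity for RAID-1 assumes simple two-way mirroring.

```python
SERVERS = 3
QUOTE_USD = 13_645.25  # per-server quote from the spec above

# Per-server resources, taken from the spec list above.
PER_SERVER = {
    "cores": 64,
    "threads": 128,
    "ram_gb": 8 * 64,      # 8x64G DDR4
    "ssd_tb": 6 * 1.92,    # 6x 1.92T SATA3 SSD
    "nvme_tb": 2 * 1.60,   # 2x 1.60T NVMe SSD
}


def cluster_totals(per_server, count):
    """Multiply every per-server resource by the number of servers."""
    return {name: amount * count for name, amount in per_server.items()}


if __name__ == "__main__":
    totals = cluster_totals(PER_SERVER, SERVERS)
    print(f"{totals['cores']} cores, {totals['threads']} threads")
    print(f"{totals['ram_gb']}GB RAM")
    # RAID-1 mirrors every drive, so usable capacity is half the raw capacity.
    print(f"{totals['ssd_tb']:.2f}TB SSD ({totals['ssd_tb'] / 2:.2f}TB after RAID-1)")
    print(f"{totals['nvme_tb']:.2f}TB NVMe ({totals['nvme_tb'] / 2:.2f}TB after RAID-1)")
    print(f"total: {SERVERS * QUOTE_USD:,.2f}$USD")
```

The results match the list above: 17.28TB of SSD and 4.8TB of NVMe after RAID-1, and a total of 40,935.75$USD, rounded to 40,936$USD.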
At this price range we could likely afford to throw in a few extras:
* Double amount of RAM (1T total) +2,877
* Double SATA3 SSD capacity with 3.84T drives +2,040
* Double NVMe SSD capacity with 3.20T drives +814
* Switch to faster AMD Milan (EPYC) 75F3 32C/64T @ 2.95GHz +186
There are also comparable 2U chassis with 3.5" drive bays, but since
we use only 2.5" drives it doesn't make much sense unless we really
want a system with 2 CPU sockets. Such a system would cost an
additional ~6,000$USD depending on the model of CPU we end up
choosing, bringing us closer to the initial ballpark number above.
Exact prices are still to be determined. 150$/U/mth (900$/mth for 6U)
figure is from [this source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891) (confidential). There's [another
quote](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2840427) at 350$/U/mth (2100$/mth).
Considering that the base build would have enough capacity to host
*both* gnt-chi (800GB) and gnt-fsn (17TB, including 13TB on HDD and
4TB on NVMe), it seems like a sufficient build.
See also [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302) for other colo resources.
TODO: check the math above.
[SuperMicro 1114CS-THR 1U]: https://www.supermicro.com/en/Aplus/system/1U/1114/AS-1114CS-TNR.cfm
### Colocation: 450$-2100$/mth
Exact prices are still to be determined. The 150$/U/mth figure
(900$/mth for 6U, 600$/mth for 4U) is from [this source][]
(confidential). There's [another quote][] at 350$/U/mth (2100$/mth).
TODO: This needs to take into account chi-node-14, how many Us?
See also [this comment][] for other colo resources.
[this source]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891
[another quote]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2840427
[this comment]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302
### Initial setup: one week
......@@ -195,7 +330,7 @@ this setup is proving costly and complex.
### OVH cloud: 2.6k$/mth
The [Scale 7](https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/) server seems like it could fit well for both
The [Scale 7][] server seems like it could fit well for both
simulations and general-purpose hosting:
- AMD Epyc 7763 - 64c/128t - 2.45GHz/3.5GHz
......@@ -208,9 +343,11 @@ simulations and general-purpose hosting:
- 1 192,36$CAD/mth (871USD) with a 12mth commit
- **total**, for 3 servers: 3677CAD or 2615USD/mth
[Scale 7]: https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/
### Data packet: 6k$/mth
Data Packet also has AMD EPYC machines, see their [pricing page](https://www.datapacket.com/pricing):
Data Packet also has AMD EPYC machines, see their [pricing page][]:
* AMD EPYC 7702P 64 Cores, 128 Threads, 2 GHz
* 2x2TB NVME
......@@ -220,6 +357,8 @@ Data Packet also has AMD EPYC machines, see their [pricing page](https://www.dat
* ashburn virginia
* **total**, for 3 servers: 6000USD/mth
[pricing page]: https://www.datapacket.com/pricing
### Scaleway: 3k$/mth
Scaleway also has EPYC machines, but only in Europe:
......@@ -347,21 +486,27 @@ is reduced.
# Approval
This will need to be approved by TPA and the TPI executive director.
This will need to be approved by the following entities, in sequence:
1. [ ] TPA
2. [ ] accounting
3. [ ] executive director
# Deadline
ASAP.
* TPA: ASAP, end of day as of this message
* accounting, ED: at their leisure, but preferably by end of week or
month
# Status
This proposal is currently in the `draft` state.
This proposal is currently in the `proposed` state.
# References
See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897) for the discussion.
See [tpo/tpa/team#40897][] for the discussion ticket.
# Appendix
[tpo/tpa/team#40897]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897
## gnt-chi detailed inventory
......@@ -405,6 +550,21 @@ See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/408
web-chi-03.torproject.org 4 8.0G 0M blockdev
web-chi-04.torproject.org 4 8.0G 0M blockdev
root@chi-node-01:~# gnt-node list-storage | sort
Node Type Name Size Used Free Allocatable
chi-node-01.torproject.org lvm-vg vg_ganeti 464.7G 447.1G 17.6G Y
chi-node-02.torproject.org lvm-vg vg_ganeti 464.7G 387.1G 77.6G Y
chi-node-03.torproject.org lvm-vg vg_ganeti 464.7G 457.1G 7.6G Y
chi-node-04.torproject.org lvm-vg vg_ganeti 464.7G 104.6G 360.1G Y
chi-node-06.torproject.org lvm-vg vg_ganeti 464.7G 269.1G 195.6G Y
chi-node-07.torproject.org lvm-vg vg_ganeti 1.4T 239.1G 1.1T Y
chi-node-08.torproject.org lvm-vg vg_ganeti 464.7G 147.0G 317.7G Y
chi-node-09.torproject.org lvm-vg vg_ganeti 278.3G 275.8G 2.5G Y
chi-node-10.torproject.org lvm-vg vg_ganeti 278.3G 251.3G 27.0G Y
chi-node-11.torproject.org lvm-vg vg_ganeti 464.7G 283.6G 181.1G Y
TODO: lavamind: inventory of storage on the SAN here please. :)
## moly inventory
| instance | memory | vCPU | disk |
......@@ -412,3 +572,64 @@ See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/408
| fallax | 512MiB | 1 | 4GB |
| build-x86-05 | 14GB | 6 | 90GB |
| build-x86-06 | 14GB | 6 | 90GB |
## gnt-fsn inventory
root@fsn-node-02:~# gnt-instance list -o name,be/vcpus,be/memory,disk_usage,disk_template
Instance ConfigVCPUs ConfigMaxMem DiskUsage Disk_template
alberti.torproject.org 2 4.0G 22.2G drbd
bacula-director-01.torproject.org 2 8.0G 262.4G drbd
carinatum.torproject.org 2 2.0G 12.2G drbd
check-01.torproject.org 4 4.0G 32.4G drbd
chives.torproject.org 1 1.0G 12.2G drbd
colchicifolium.torproject.org 4 16.0G 734.5G drbd
crm-ext-01.torproject.org 2 2.0G 24.2G drbd
crm-int-01.torproject.org 4 8.0G 164.4G drbd
cupani.torproject.org 2 2.0G 144.4G drbd
eugeni.torproject.org 2 4.0G 99.4G drbd
gayi.torproject.org 2 2.0G 74.4G drbd
gettor-01.torproject.org 2 1.0G 12.2G drbd
gitlab-02.torproject.org 8 16.0G 1.2T drbd
henryi.torproject.org 2 1.0G 32.4G drbd
loghost01.torproject.org 2 2.0G 61.4G drbd
majus.torproject.org 2 1.0G 32.4G drbd
materculae.torproject.org 2 8.0G 174.5G drbd
media-01.torproject.org 2 2.0G 312.4G drbd
meronense.torproject.org 4 16.0G 524.4G drbd
metrics-store-01.torproject.org 2 2.0G 312.4G drbd
neriniflorum.torproject.org 2 1.0G 12.2G drbd
nevii.torproject.org 2 1.0G 24.2G drbd
onionoo-backend-01.torproject.org 2 16.0G 72.4G drbd
onionoo-backend-02.torproject.org 2 16.0G 72.4G drbd
onionoo-frontend-01.torproject.org 4 4.0G 12.2G drbd
onionoo-frontend-02.torproject.org 4 4.0G 12.2G drbd
palmeri.torproject.org 2 1.0G 34.4G drbd
pauli.torproject.org 2 4.0G 22.2G drbd
perdulce.torproject.org 2 1.0G 524.4G drbd
polyanthum.torproject.org 2 4.0G 84.4G drbd
relay-01.torproject.org 2 8.0G 12.2G drbd
rude.torproject.org 2 2.0G 64.4G drbd
static-master-fsn.torproject.org 2 16.0G 832.5G drbd
staticiforme.torproject.org 4 6.0G 322.5G drbd
submit-01.torproject.org 2 4.0G 32.4G drbd
tb-build-01.torproject.org 8 16.0G 612.4G drbd
tbb-nightlies-master.torproject.org 2 2.0G 142.4G drbd
vineale.torproject.org 4 8.0G 124.4G drbd
web-fsn-01.torproject.org 2 4.0G 522.5G drbd
web-fsn-02.torproject.org 2 4.0G 522.5G drbd
root@fsn-node-02:~# gnt-node list-storage | sort
Node Type Name Size Used Free Allocatable
fsn-node-01.torproject.org lvm-vg vg_ganeti 893.1G 469.6G 423.5G Y
fsn-node-01.torproject.org lvm-vg vg_ganeti_hdd 9.1T 1.9T 7.2T Y
fsn-node-02.torproject.org lvm-vg vg_ganeti 893.1G 495.2G 397.9G Y
fsn-node-02.torproject.org lvm-vg vg_ganeti_hdd 9.1T 4.4T 4.7T Y
fsn-node-03.torproject.org lvm-vg vg_ganeti 893.6G 333.8G 559.8G Y
fsn-node-03.torproject.org lvm-vg vg_ganeti_hdd 9.1T 2.5T 6.6T Y
fsn-node-04.torproject.org lvm-vg vg_ganeti 893.6G 586.3G 307.3G Y
fsn-node-04.torproject.org lvm-vg vg_ganeti_hdd 9.1T 3.0T 6.1T Y
fsn-node-05.torproject.org lvm-vg vg_ganeti 893.6G 431.5G 462.1G Y
fsn-node-06.torproject.org lvm-vg vg_ganeti 893.6G 446.1G 447.5G Y
fsn-node-07.torproject.org lvm-vg vg_ganeti 893.6G 775.7G 117.9G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti 893.6G 432.2G 461.4G Y
fsn-node-08.torproject.org lvm-vg vg_ganeti_hdd 5.5T 1.3T 4.1T Y