title: TPA-RFC-40: Cymru migration

[[_TOC_]]

Summary: buy a few large servers to move the Cymru machines out into a
trusted colocation facility.

# Background

We have [decided][] to move all services away from Team Cymru
infrastructure.

This proposal discusses various alternatives which can be regrouped in
|
might require a second move (back into an eventual new
location). That's considered out of scope for now, but we do take into
account those resources in the planning.

[decided]: https://blog.torproject.org/role-tor-project-board-conflicts-interest/

## Inventory

In the Ganeti (`gnt-chi`) cluster, we have 12 machines hosting about
provider:

- rescue systems (e.g. PXE booting)
- remote hands SLA ("how long to replace a broken hard drive?")
- private VLANs
- ideally not in Europe (where we already have lots of resources)

# Proposal
|
|
|
|
|
|
After evaluating the costs, it is the belief of TPA that
infrastructure hosted at Cymru should be rebuilt in a new Ganeti
cluster hosted in a trusted colocation facility which still needs to
be determined.

This will require a significant capital expenditure (~75,000$, still
to be clarified) that could be subsidized. Amortized over 7 to 8
years, it is actually cheaper, per month, than moving to the cloud.

Migration labor costs are also smaller; we could be up and running in
as little as two weeks of full time work. Lead time for server
delivery and data transfers will prolong this significantly, with
total migration times from 4 to 8 weeks.

## Goals

No must/nice/non-goals were actually set in this proposal, because it
was established in a rush.

# Costs

This section evaluates the cost of the three options, in broad
terms. More specific estimates will be established as we go along.

## Self-hosting: ~12k$/year, 5-7 weeks

With this option, TPI buys hardware and has it shipped to a colocation
facility (or has the colo buy and deploy the hardware).

A new Ganeti cluster is built from those machines, and the current
virtual machines are mass-migrated to the new cluster.

The risk of this procedure is that the mass-migration fails and that
virtual machines need to be rebuilt from scratch, in which case the
labor costs are expanded.

### Hardware: ~10k$/year
|
|
|
|
|
|
We would buy 3 big servers, each with:
|
|
|
|
|
|
* at least two NICs (one public, one internal), 10gbit
* hyper-convergent (e.g. we keep the current DRBD setup)
* 25k$ AMD Ryzen 64 cores, 512GB RAM, chassis, 20 bays 16 SATA 4 NVMe
* 2k$ 2xNVMe 1TB, 2 free slots
* 6k$ 6xSSD 2TB, 12 free slots
* total storage per node, post-RAID, 7TB: 1TB NVMe, 6TB SSD
* total per server: ~33k$CAD or 25k$USD +- 5k$
* total for 3 servers: 75k$USD +- 15k$
* total capacity:
  * CPUs: 192 cores (384 threads)
  * 1.5TB RAM
  * 21TB storage, half of those for redundancy
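The aggregate figures in that list can be double-checked with simple
arithmetic (a sketch of the math only, using the per-node numbers
quoted above):

```python
# Aggregate capacity for 3 identical nodes (per-node figures taken
# from the hardware list above).
servers = 3
total_cores = servers * 64            # 64 cores per node
total_threads = total_cores * 2       # with SMT enabled
total_ram_tb = servers * 512 / 1024   # 512GB RAM per node
total_storage_tb = servers * 7        # 7TB post-RAID per node

print(total_cores, total_threads, total_ram_tb, total_storage_tb)
```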
|
|
|
|
|
|
We would amortize this expense over 7-8 years, so around 10k$/year for
hardware, assuming we would buy something similar (but obviously
probably better by then) every 7 to 8 years.
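As a rough check of that amortization (a sketch; the 75k$ figure is
this proposal's own estimate):

```python
# Amortize the ~75k$USD capital expenditure over 7 and 8 years.
capital_usd = 75_000
for years in (7, 8):
    per_year = capital_usd / years
    per_month = per_year / 12
    print(f"{years} years: ~{per_year:,.0f}$/year, ~{per_month:,.0f}$/month")
```

Either way this lands near the ~10k$/year figure used above, and the
monthly equivalent is well under the dedicated or cloud options costed
below.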
|
|
|
|
|
|
### Colocation: 150$/mth or free

Exact prices are still to be determined. The 150$/mth figure is from
[this source](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2839891)
(confidential). See also [this comment](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897#note_2838302)
for other colo resources.
|
|
|
|
|
|
### Initial setup: one week
|
|
|
|
|
Ganeti cluster setup costs:

| Cluster setup | 0.5 day  | low | 0.6d |  |
| Total         | 4.5 days |     | 5.4d |  |
|
|
|
|
|
|
|
|
This gets us a basic cluster setup, into which virtual machines can be
imported (or created).

### Batch migration: 1-2 weeks, worst case full rebuild (4-6w)
|
|
|
|
|
|
We assume each VM will take 30 minutes of work to migrate which, if
all goes well, means that we can basically migrate all the machines in
one day of work.
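That estimate can be sanity-checked with a bit of arithmetic (a
sketch; the 15-VM count is an assumption, borrowed from the cloud
scenario below):

```python
# ~15 VMs at 30 minutes of hands-on work each.
vms = 15            # assumption: same count as the cloud scenario
hours_per_vm = 0.5
total_hours = vms * hours_per_vm
print(total_hours)  # 7.5 hours, i.e. about one working day
```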
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Task                    | Estimate | Uncertainty | Total   | Notes                            |
|-------------------------|----------|-------------|---------|----------------------------------|
| research and testing    | 1 day    | extreme     | 5d      | half a day of this already spent |
| total VM migration time | 1 day    | extreme     | 5d      |                                  |
| Total                   | 2 days   | extreme     | 10 days |                                  |
|
|
|
|
|
|
It might take more time to do the actual transfers, but the assumption
is the work can be done in parallel and therefore transfer rates are
non-blocking. So that "day" of work would actually be spread over a
week of time.

There is a lot of uncertainty in this estimate. It's possible the
migration procedure doesn't work at all, and in fact it has proven to
be [problematic][18] in our first tests. [Further testing][] showed it
was possible to migrate a virtual machine, so it is believed we will
be able to streamline this process.

[18]: https://github.com/ganeti/instance-debootstrap/issues/18

It's therefore possible that we could batch migrate everything in one
fell swoop. We would then just have to do manual changes in LDAP and
inside the VM to reset IP addresses.

[Further testing]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40917
|
|
|
|
|
|
### Worst case: full rebuild, 3.5-4.5 weeks
|
|
|
|
|
The worst case here is a fall back to the full rebuild case that we
computed for the cloud, below.

To this, we need to add a "VM bootstrap" cost. I'd say 1 hour per VM,
medium uncertainty in Ganeti, so 1.5h per VM or ~22h (~3 days).
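The same back-of-the-envelope math, spelled out (a sketch; the 15-VM
count is an assumption carried over from the cloud scenario):

```python
# 1 hour base per VM, times the "medium" 1.5x uncertainty multiplier,
# for ~15 VMs (assumed count), at 7 working hours per day.
vms = 15
hours = vms * 1.0 * 1.5
days = hours / 7
print(hours, round(days, 1))  # 22.5 hours, ~3.2 days
```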
|
|
|
|
|
|
|
|
## Dedicated hosting: 2-6k$/mth, 7+ weeks
|
|
|
|
|
|
|
|
In this scenario, we rent machines from a provider (probably a
commercial provider). It's unclear whether we will be able to
reproduce the Ganeti setup the way we need to, as we do not always get
the private VLAN we need to set up the storage backend. At Hetzner,
for example, this setup has proven costly and complex.
|
|
|
|
|
|
### OVH cloud: 2.6k$/mth

The [Scale 7](https://www.ovhcloud.com/fr-ca/bare-metal/scale/scale-7/) server seems like it could fit well for both
simulations and general-purpose hosting:
|
|
|
|
|
- AMD Epyc 7763 - 64c/128t - 2.45GHz/3.5GHz
- 2x SSD SATA 480GB
- 6Gbit/s local
- **back order in americas**
- 1 192,36$CAD/mth (871USD) with a 12mth commit
- **total**, for 3 servers: 3677CAD or 2615USD/mth
|
|
|
|
|
|
### Data packet: 6k$/mth

Data Packet also has AMD EPYC machines, see their [pricing
page](https://www.datapacket.com/pricing):
|
|
|
|
|
|
* AMD EPYC 7702P 64 Cores, 128 Threads, 2 GHz
* 2x2TB NVME
* 512GB RAM
* 1gbps unmetered
* 2020$USD / mth
* ashburn virginia
* **total**, for 3 servers: 6000USD/mth
|
|
|
|
|
|
### Scaleway: 3k$/mth
|
|
|
|
|
|
|
|
Scaleway also has EPYC machines, but only in Europe:
|
|
|
|
|
|
- 2x AMD EPYC 7532 32C/64T - 2.4 GHz
- 1024 GB RAM
- 2 x 1.92 TB NVMe
- Up to 1 Gbps
- €1,039.99/month
- **only europe**
- **total**, for 3 servers: ~3000USD/mth
|
|
|
|
|
|
### Migration costs: 7+ weeks

We haven't estimated the migration costs specifically for this
scenario, but we assume they will be similar to the self-hosting
scenario, on the upper uncertainty margin.
|
|
|
|
|
|
## Cloud hosting: 3-22k$/mth, 5-11 weeks

In this scenario, each virtual machine is moved to the cloud. It's
unclear how that would happen exactly, which is the main reason behind
the far-ranging time estimates.
|
|
|
|
|
|
|
|
In general, large simulations seem costly in this environment as well,
|
|
|
|
at least if we run them full time.
|
|
|
|
|
|
### Hardware costs: 3k-22k$/mth
|
|
|
|
|
|
Let's assume we need at minimum 80 vcores and 300GB of memory, with
1TB of storage. This is likely an underestimation, as we don't have
proper per-VM disk storage details. This would require a lot more
estimation effort, which is not seen as necessary.
|
|
|
|
|
|
|
|
Note that most providers do not offer virtual machines large enough
for the Shadow simulations, or if they do, they are too costly
(e.g. Amazon), with Scaleway being an exception.
|
|
|
|
|
|
#### Amazon: 1-22+k$/mth
|
|
|
|
|
|
|
|
|
|
### Base setup: 1-5 weeks
|
|
|
|
|
|
This involves creating 15 virtual machines in the cloud, so learning a
new platform and bootstrapping new tools. It could involve things like
Terraform or click-click-click in a new dashboard? Full unknown.

Let's say 2 hours per machine, 28 hours, which is 4 days of 7 hours of
work, with extreme uncertainty, so five times that, which is about 5
weeks.
|
|
|
|
|
|
|
|
This might be an over-estimation.
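The arithmetic behind that range, made explicit (a sketch using the
figures quoted above and the 5x "extreme" uncertainty multiplier used
elsewhere in this proposal):

```python
# 28 hours of base work, in 7-hour working days, times the 5x
# "extreme" uncertainty multiplier.
base_days = 28 / 7
worst_days = base_days * 5
print(base_days, worst_days)  # 4.0 days base, 20.0 working days worst case
```

20 working days is 4 calendar weeks of full-time work; with overhead
and scheduling that is roughly the "about 5 weeks" above.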
|
|
|
|
|
|
### Base VM bootstrap cost: 2-10 days
|
|
|
|
|
|
We estimate setting up a machine takes a ground time of 1 hour per VM,
with extreme uncertainty, which means 1-5 hours, so 15-75 hours, or 2
to 10 days.
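Spelled out (a sketch; 15 VMs is the count used above, and working
days are assumed to be 7 hours):

```python
# 15 VMs at 1 to 5 hours each, converted to 7-hour working days.
vms = 15
low_hours, high_hours = vms * 1, vms * 5
low_days, high_days = low_hours / 7, high_hours / 7
print(low_hours, high_hours)                    # 15 to 75 hours
print(round(low_days, 1), round(high_days, 1))  # ~2.1 to ~10.7 days
```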
|
|
|
|
|
|
### Full rebuild: 3-4 weeks
|
|
|
|
|
|
In this scenario, we need to reinstall the virtual machines from
scratch, as we cannot use the export/import procedures Ganeti provides
us. It's *possible* we could use a more standard export mechanism in
Ganeti and have that adapted to the cloud, but this would also take
some research and development time.
|
|
|
|
|
|
| machine            | estimate | uncertainty | total | notes         |
|--------------------|----------|-------------|-------|---------------|
|
|
|
|
|
|
# Approval

This will need to be approved by TPA and the TPI executive director.
|
|
|
|
|
|
# Deadline
|
|
|
|
|
This proposal is currently in the `draft` state.
|
|
|
|
|
# References

See [tpo/tpa/team#40897](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40897) for the discussion.
|
|
|
|
|
|
# Appendix
|
|
|
|
|
    web-chi-03.torproject.org 4 8.0G 0M blockdev
    web-chi-04.torproject.org 4 8.0G 0M blockdev
|
|
|
|
|
|
|
|
## moly inventory

| instance     | memory | vCPU | disk |
|--------------|--------|------|------|
| fallax       | 512MiB | 1    | 4GB  |
| build-x86-05 | 14GB   | 6    | 90GB |
| build-x86-06 | 14GB   | 6    | 90GB |