From 51f79854c9cf0544b9393b895172a0d9806d69df Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Tue, 6 Jun 2023 17:25:27 -0400
Subject: [PATCH] link, formatting tweaks (tpo/tpa/team#40478)

---
 policy/tpa-rfc-56-large-file-storage.md | 137 ++++++++++++++----------
 1 file changed, 81 insertions(+), 56 deletions(-)

diff --git a/policy/tpa-rfc-56-large-file-storage.md b/policy/tpa-rfc-56-large-file-storage.md
index c7e11482..cb7f3da9 100644
--- a/policy/tpa-rfc-56-large-file-storage.md
+++ b/policy/tpa-rfc-56-large-file-storage.md
@@ -15,39 +15,35 @@ giving a proposal of a solution that should cover most of them.
 Those are the issues that were raised in the past with servers running
 out of disk space:
 
-  * [#40475 (closed)](/tpo/tpa/team/-/issues/40475), [#40615
-    (closed)](/tpo/tpa/team/-/issues/40615): "gitlab-02 running out of
-    disk space"). CI artifacts, and non-linear growth events
+  * [#40475 (closed)][], [#40615 (closed)][]: "gitlab-02 running out
+    of disk space". CI artifacts, and non-linear growth events
 
-  * [#40431 (closed)](/tpo/tpa/team/-/issues/40431): "`ci-runner-01`
-    invalid ubuntu package signatures"); [gitlab#95
-    (closed)](/tpo/tpa/gitlab/-/issues/95): "Occasionally clean-up
+  * [#40431 (closed)][]: "`ci-runner-01` invalid ubuntu package
+    signatures"; [gitlab#95 (closed)][]: "Occasionally clean-up
     Gitlab CI storage". non-linear, possibly explosive and
     unpredictable growth. cache sharing issues between
     runners. somewhat under control now that we have more runners.
 
-  * [#40477 (closed)](/tpo/tpa/team/-/issues/40477) ("backup failure: disk full on
+  * [#40477 (closed)][] ("backup failure: disk full on
     bungei"). backups, non-linear, mostly archive-01 but also
-    gitlab. workaround [good for ~8
-    months](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477#note_2756638
-    "backup failure: disk full on bungei") (from October 2021, so
+    gitlab. workaround [good for ~8 months][] (from October 2021, so
     until June 2022) hopefully.
 
-  * [#40442 (closed)](/tpo/tpa/team/-/issues/40442) ("meronense running out of disk
+  * [#40442 (closed)][] ("meronense running out of disk
     space"). metrics storage, linear growth. transitioning between
     storage systems (see [tpo/network-health/metrics/collector#40012
-    (closed)](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40012
-    "Come up with a plan to make past descriptors etc. easier
-    available and queryable \(giant database\)")). workaround good for
-    years.
-  * [#40535 (closed)](/tpo/tpa/team/-/issues/40535): "colchicifolium disk full". storage is
+    (closed)][]). workaround good for years.
+
+  * [#40535 (closed)][]: "colchicifolium disk full". storage is
     steadily increasing, adding about 30GB per 90 days according to
-    [@hiro](/hiro "Hiro"), with `/srv` regularly reaching 90% full and capacity
+    hiro, with `/srv` regularly reaching 90% full and capacity
     being added
 
+TODO: update numbers above
+
 TODO: to add, https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2808917
 
-> archive-01 ([#40779 (closed)](/tpo/tpa/team/-/issues/40779)) and vineale ([#40778 (closed)](/tpo/tpa/team/-/issues/40778))
+> archive-01 ([#40779 (closed)][]) and vineale ([#40778 (closed)][])
 > just ran out of disk space too. the strategy for the former is to
 > just bump up disk space and eventually migrate to gitlab. for the
 > former, it's unclear. it seems like we're eating 2TB a year on that
@@ -59,7 +55,7 @@ TODO: to add, https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_280
 > for the discussion.
 
 Note that GitLab needs to be scaled up specifically as well, which
-primarily involves splitting it in multiple machines, see [#40479](/tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users")
+primarily involves splitting it in multiple machines, see [#40479][]
 for that discussion. It's partly in scope of this discussion in the
 sense that a solution chosen here must be somewhat useful to scale
 GitLab out.
@@ -73,26 +69,41 @@ Design and performance issues:
    change in the migration algorithm, but it could also be fixed by
    changing storage away from DRBD to something else.
 
- * [tpo/network-health/metrics/collector#40012 (closed)](/tpo/network-health/metrics/collector/-/issues/40012): "Come up
+ * [tpo/network-health/metrics/collector#40012 (closed)][]: "Come up
    with a plan to make past descriptors etc. easier available and
-   queryable \(giant database\)" (in onionoo/collector storage). lots
+   queryable (giant database)" (in onionoo/collector storage). lots
    of small files, might require FS snapshots or transition to
    database, see new design in that ticket, or object storage (see
-   also [tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-health/metrics/collector/-/issues/40023),
-   "Move collector storage from file based to object storage")
+   next item)
 
- * [#40650 (closed)](/tpo/tpa/team/-/issues/40650): "colchicifolium backups are barely
+ * [#40650 (closed)][]: "colchicifolium backups are barely
    functional". backups take _days_ to complete, possible solution is
    to "Move collector storage from file based to object storage"
-   ([tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-health/metrics/collector/-/issues/40023))
+   ([tpo/network-health/metrics/collector#40023 (closed)][])
 
- * [#40482 (closed)](/tpo/tpa/team/-/issues/40482): "meronense performance problems (out of
+ * [#40482 (closed)][]: "meronense performance problems (out of
    memory?)". nightly memory spikes usage every night, not directly
    TPA's responsability, but related to the above
 
 Much of the above and this RFC come from the brainstorm established in
 issue [tpo/tpa/team#40478][].
 
+[#40475 (closed)]: /tpo/tpa/team/-/issues/40475
+[#40615 (closed)]: /tpo/tpa/team/-/issues/40615
+[#40431 (closed)]: /tpo/tpa/team/-/issues/40431
+[gitlab#95 (closed)]: /tpo/tpa/gitlab/-/issues/95
+[#40477 (closed)]: /tpo/tpa/team/-/issues/40477
+[good for ~8 months]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477#note_2756638 "backup failure: disk full on bungei"
+[#40442 (closed)]: /tpo/tpa/team/-/issues/40442
+[tpo/network-health/metrics/collector#40012 (closed)]: https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40012
+[#40535 (closed)]: /tpo/tpa/team/-/issues/40535
+[#40779 (closed)]: /tpo/tpa/team/-/issues/40779
+[#40778 (closed)]: /tpo/tpa/team/-/issues/40778
+[#40479]: /tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users"
+[tpo/network-health/metrics/collector#40023 (closed)]: /tpo/network-health/metrics/collector/-/issues/40023
+[#40650 (closed)]: /tpo/tpa/team/-/issues/40650
+[#40482 (closed)]: /tpo/tpa/team/-/issues/40482
+
 ## Storage usage analysis
 
 redo the graphs in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208
@@ -138,10 +149,10 @@ in kubernetes, assuming we might want to go there:
 
 object storage options:
 
-  * [minio](https://min.io/): suggested/shipped by gitlab omnibus now?
+  * [minio][]: suggested/shipped by gitlab omnibus now?
   * ceph has support for s3
-  * [openio](https://www.openio.io/) mentioned in one of the GitLab threads, not evaluated
-  * [garage](https://garagehq.deuxfleurs.fr/) is another alternative
+  * [openio][] mentioned in one of the GitLab threads, not evaluated
+  * [garage][] is another alternative
 
 in general: i think Ceph is a great option that ticks a lot of the boxes here:
 
@@ -174,12 +185,16 @@ trash its own backups). we could build this with ZFS/BTRFS replication, again.
 > restoring websites in #40501 (closed) today, really positive
 > feeling.
 
+[minio]: https://min.io/
+[openio]: https://www.openio.io/
+[garage]: https://garagehq.deuxfleurs.fr/
+
 ## TODO: triage Ceph war stories from GitLab and SO
 
 more war stories, this time from gitlab:
 
   * when they were saying they would move to bare metal and ceph: <https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/>
-  * when they subsequently tried and failed and switched back to the cloud and not ceph, see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> <https://gitlab.com/gitlab-com/operations/-/issues/1> quote from [this deployment issue](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631): 
+  * when they subsequently tried and failed and switched back to the cloud and not ceph, see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> <https://gitlab.com/gitlab-com/operations/-/issues/1> quote from [this deployment issue][]: 
 
 > While it's true that we lean towards PostgreSQL, our usage of CephFS was not
 > for the database server, but for the git repositories. In the end we
@@ -208,28 +223,34 @@ for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated
 in GitLab, so shared data storage is expected to go through S3-like
 "object storage" APIs from here on.
 
+[this deployment issue]: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631
+
 ## TODO: triage CERN experience
 
 oh, and also i should drop this here... CERN started with a 3PB Ceph
-deployment [around 2015](https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a- year-in-the-life-of-a-petabyte-scale-block-storage-service). It seems it's still in use:
+deployment [around 2015][]. It seems it's still in use:
 
-  * [2017](https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf), 65PB
-  * [2018](https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf), 300PB?
-  * [2019](https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/), 1PB/day, 115PB/year?
-  * [2021](https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf), 65PB?
+  * [2017][], 65PB
+  * [2018][], 300PB?
+  * [2019][], 1PB/day, 115PB/year?
+  * [2021][], 65PB?
 
 ... although, as you can see, it's not exactly clear to me how much data is
 managed by ceph. they seem to have a good experience with Ceph in any case,
 with three active committers, and they say it's a "great community", which is
 certainly a plus...
 
+[around 2015]: https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service
+[2017]: https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
+[2018]: https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf
+[2019]: https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/
+[2021]: https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf
+
 ## TODO triage meeting brainstorm
 
 https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2788264
 
-we ended up [brainstorming this in a
-meeting](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/meeting/2022-02-14#storage-
-brainstorm), where we said:
+we ended up [brainstorming this in a meeting][], where we said:
 
 > We considered the following technologies for the broader problem:
 >
@@ -241,20 +262,21 @@ brainstorm), where we said:
 > the service with the CI runners image/cache storage backends, which can
 > easily be rebuilt/migrated if we want to drop that test.
 >
-> This would disregard the block storage problem, but we could pretend this
-> would be solved at the service level eventually (e.g. redesign the metrics
-> storage, split up the gitlab server). Anyways, migrating away from DRBD to
-> Ceph is a major undertaking that would require a lot of work. It would also
-> be part of the largest "[trusted high performance
-> cluster](https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/2)" work
-> that we recently de-prioritized.
+> This would disregard the block storage problem, but we could pretend
+> this would be solved at the service level eventually (e.g. redesign
+> the metrics storage, split up the gitlab server). Anyways, migrating
+> away from DRBD to Ceph is a major undertaking that would require a
+> lot of work. It would also be part of the largest "[trusted high
+> performance cluster][]" work that we recently de-prioritized.
 
 so it looks like the next step might be to setup minio here as a prototype.
-[@hiro](/hiro "Hiro") is also considering object storage for collector
-([tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-
-health/metrics/collector/-/issues/40023 "Move collector storage from file
-based to object storage")) which could solve a lot of the problems we're
-having here.
+
+hiro is also considering object storage for collector
+([tpo/network-health/metrics/collector#40023 (closed)][]) which could
+solve a lot of the problems we're having here.
+
+[trusted high performance cluster]: https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/2
+[brainstorming this in a meeting]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/meeting/2022-02-14#storage-brainstorm
 
 ## upstream provider
 
@@ -265,17 +287,20 @@ https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2843500
 
 ## minio licensing dispute
 
-re minio, they are involved in a [licensing
-dispute](https://blocksandfiles.com/2023/03/26/we-object-minio-says-no-more-
-open-license-for-you-weka/) with commercial storage providers
-([Weka](https://www.weka.io/) and [Nutanix](https://www.nutanix.com/)) because
-the latter used Minio in their products without giving attribution. see also
-[this hacker news discussion](https://news.ycombinator.com/item?id=32148007).
+re minio, they are involved in a [licensing dispute][] with commercial
+storage providers ([Weka][] and [Nutanix][]) because the latter used
+Minio in their products without giving attribution. see also [this
+hacker news discussion][].
 
 it should also be noted that they switched to the AGPL relatively recently. i
 don't think this should keep us from using it, but just a note to say there's
 some storm brewing there.
 
+[Weka]: https://www.weka.io/
+[Nutanix]: https://www.nutanix.com/
+[this hacker news discussion]: https://news.ycombinator.com/item?id=32148007
+[licensing dispute]: https://blocksandfiles.com/2023/03/26/we-object-minio-says-no-more-open-license-for-you-weka/
+
 # Costs
 
 # Approval
-- 
GitLab