review numbers (#40478)

Verified Commit 95d28c99 authored 1 year ago by anarcat
--- a/policy/tpa-rfc-56-large-file-storage.md
+++ b/policy/tpa-rfc-56-large-file-storage.md
@@ -15,98 +15,107 @@ giving a proposal of a solution that should cover most of them.
 Those are the issues that were raised in the past with servers running
 out of disk space:

-  * [#40475 (closed)][], [#40615 (closed)][]: "gitlab-02 running out
-    of disk space"). CI artifacts, and non-linear growth events
-
-  * [#40431 (closed)][]: "`ci-runner-01` invalid ubuntu package
-    signatures"); [gitlab#95 (closed)][]: "Occasionally clean-up
-    Gitlab CI storage". non-linear, possibly explosive and
-    unpredictable growth. cache sharing issues between
-    runners. somewhat under control now that we have more runners.
-
-  * [#40477 (closed)][] ("backup failure: disk full on
-    bungei"). backups, non-linear, mostly archive-01 but also
-    gitlab. workaround [good for ~8 months][] (from October 2021, so
-    until June 2022) hopefully.
-
-  * [#40442 (closed)][] ("meronense running out of disk
-    space"). metrics storage, linear growth. transitioning between
-    storage systems (see [tpo/network-health/metrics/collector#40012
-    (closed)][]). workaround good for years.
-
-  * [#40535 (closed)][]: "colchicifolium disk full". storage is
-    steadily increasing, adding about 30GB per 90 days according to
-    hiro, with `/srv` regularly reaching 90% full and capacity
-    being added
-
-TODO: update numbers above
-
-TODO: to add, https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2808917
-
-> archive-01 ([#40779 (closed)][]) and vineale ([#40778 (closed)][])
-> just ran out of disk space too. the strategy for the former is to
-> just bump up disk space and eventually migrate to gitlab. for the
-> former, it's unclear. it seems like we're eating 2TB a year on that
-> thing, or more...
->
-> also, we were asked where to put large VM images (3x8GB), and we
-> answered "git(lab) LFS" with the intention of moving to object
-> storage if we run out of space on the main VM, see #40767 (closed)
-> for the discussion.
-
-Note that GitLab needs to be scaled up specifically as well, which
-primarily involves splitting it in multiple machines, see [#40479][]
-for that discussion. It's partly in scope of this discussion in the
-sense that a solution chosen here must be somewhat useful to scale
-GitLab out.
-
-Design and performance issues:
-
- * Ganeti's DRBD backend - a full reboot of all nodes in the cluster
-   takes hours, because all machines need to be migrated between the
-   nodes (which is fine) and do not migrate back to their original
-   pattern (which is not). this might or might not be fixed by a
-   change in the migration algorithm, but it could also be fixed by
-   changing storage away from DRBD to something else.
-
- * [tpo/network-health/metrics/collector#40012 (closed)][]: "Come up
-   with a plan to make past descriptors etc. easier available and
-   queryable (giant database)" (in onionoo/collector storage). lots
-   of small files, might require FS snapshots or transition to
-   database, see new design in that ticket, or object storage (see
-   next item)
-
- * [#40650 (closed)][]: "colchicifolium backups are barely
-   functional". backups take _days_ to complete, possible solution is
-   to "Move collector storage from file based to object storage"
-   ([tpo/network-health/metrics/collector#40023 (closed)][])
-
- * [#40482 (closed)][]: "meronense performance problems (out of
-   memory?)". nightly memory spikes usage every night, not directly
-   TPA's responsability, but related to the above
+  * **GitLab**. [#40475 (closed)][], [#40615 (closed)][], [#41139][]:
+    "`gitlab-02` running out of disk space". CI artifacts, and
+    non-linear growth events.
+
+  * **GitLab CI**. [#40431 (closed)][]: "`ci-runner-01` invalid ubuntu
+    package signatures"; [gitlab#95 (closed)][]: "Occasionally
+    clean-up Gitlab CI storage". Non-linear, possibly explosive and
+    unpredictable growth. Cache sharing issues between
+    runners. Somewhat under control now that we have more runners, but
+    current aggressive cache purging degrades performance.
+
+  * **Backups**. [#40477 (closed)][]: "backup failure: disk full on
+    `bungei`". Was non-linear, mostly due to `archive-01` but also
+    GitLab. A workaround [good for ~8 months][] (from October 2021, so
+    until June 2022) was deployed and usage seems stable since
+    September 2022.
+
+  * **Metrics**. [#40442 (closed)][]: "`meronense` running out of disk
+    space". Linear growth. Current allocation (512GB) seems sufficient
+    for a few more years; conversion to a new storage backend is
+    planned (see below).
+
+  * **Collector**. [#40535 (closed)][]: "`colchicifolium` disk
+    full". Linear growth, about 200GB used per year, 1TB allocated in
+    June 2023, therefore possibly good for 5 years.
+
+ * **Archives**. [#40779 (closed)][]: "`archive-01` running out of
+   disk space". Added 2TB in May 2022; seems to be using about 500GB
+   per year, good for 2-3 more years.
+
+ * **Legacy Git**. [#40778 (closed)][]: "`vineale` out of disk space",
+   May 2022. Negligible (64GB), scheduled for retirement (see
+   [TPA-RFC-36][]).
+
+There are also design and performance issues that are relevant in this
+discussion:
+
+ * **Ganeti virtual machine storage**. A full reboot of all nodes in
+   the cluster takes hours, because all machines need to be migrated
+   between the nodes (which is fine) and do not migrate back to their
+   original pattern (which is not). Improvements have been made to the
+   migration algorithm, but it could also be fixed by changing storage
+   away from DRBD to another storage backend like Ceph.
+
+ * **Large file storage**. We were asked where to put large VM images
+   (3x8GB), and we answered "git(lab) LFS" with the intention of
+   moving to object storage if we run out of space on the main VM, see
+   [#40767 (closed)][] for the discussion. We were also asked to
+   host a container registry in [tpo/tpa/gitlab#89][].
+
+ * **Metrics database**. [tpo/network-health/metrics/collector#40012
+   (closed)][]: "Come up with a plan to make past descriptors
+   etc. easier available and queryable (giant database)" (in
+   onionoo/collector storage). This is currently being rebuilt as a
+   [Victoria Metrics][] server ([tpo/tpa/team#41130][]).
+
+ * **Collector storage**. [#40650 (closed)][]: "colchicifolium backups
+   are barely functional". Backups take _days_ to complete; a possible
+   solution is to "Move collector storage from file based to object
+   storage" ([tpo/network-health/metrics/collector#40023 (closed)][],
+   currently on hold).
+
+ * **GitLab scalability**. GitLab needs to be scaled up for
+   performance reasons as well, which primarily involves splitting it
+   across multiple machines; see [#40479][] for that discussion. It's
+   partly in scope of this discussion in the sense that a solution
+   chosen here should be compatible with GitLab's design.

 Much of the above and this RFC come from the brainstorm established in
 issue [tpo/tpa/team#40478][].

-[#40475 (closed)]: /tpo/tpa/team/-/issues/40475
-[#40615 (closed)]: /tpo/tpa/team/-/issues/40615
-[#40431 (closed)]: /tpo/tpa/team/-/issues/40431
+[#40475 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40475
+[#40615 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40615
+[#40431 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40431
 [gitlab#95 (closed)]: /tpo/tpa/gitlab/-/issues/95
-[#40477 (closed)]: /tpo/tpa/team/-/issues/40477
+[#40477 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477
 [good for ~8 months]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477#note_2756638 "backup failure: disk full on bungei"
-[#40442 (closed)]: /tpo/tpa/team/-/issues/40442
+[#40442 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40442
 [tpo/network-health/metrics/collector#40012 (closed)]: https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40012
-[#40535 (closed)]: /tpo/tpa/team/-/issues/40535
-[#40779 (closed)]: /tpo/tpa/team/-/issues/40779
-[#40778 (closed)]: /tpo/tpa/team/-/issues/40778
-[#40479]: /tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users"
-[tpo/network-health/metrics/collector#40023 (closed)]: /tpo/network-health/metrics/collector/-/issues/40023
-[#40650 (closed)]: /tpo/tpa/team/-/issues/40650
-[#40482 (closed)]: /tpo/tpa/team/-/issues/40482
+[#40535 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40535
+[#40779 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40779
+[#40778 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40778
+[#40479]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users"
+[tpo/network-health/metrics/collector#40023 (closed)]: https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023
+[#40650 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40650
+[#40482 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40482
+[#41139]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41139
+[#40767 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40767
+[tpo/tpa/gitlab#89]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/89
+[tpo/tpa/team#41130]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41130
+[Victoria Metrics]: https://victoriametrics.github.io/
+[TPA-RFC-36]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-36-gitolite-gitweb-retirement

 ## Storage usage analysis

-redo the graphs in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208
+[According to Grafana][], TPA manages around 111TB of available
+storage, with 71TB in use.
+
+TODO: redo the graphs in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208
+
+[According to Grafana]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview?orgId=1&viewPanel=30&from=now-1y&to=now

 # Proposal