---
title: TPA-RFC-56: large file storage
---
[[_TOC_]]
Summary: TODO
# Background
We've had multiple incidents with servers running out of disk space in
the past. This RFC aims to summarize those issues and propose a
solution that should cover most of them.
These are the issues raised in the past about servers running out of
disk space:
* [#40475 (closed)](/tpo/tpa/team/-/issues/40475), [#40615
(closed)](/tpo/tpa/team/-/issues/40615): "gitlab-02 running out of
disk space". CI artifacts and non-linear growth events.
* [#40431 (closed)](/tpo/tpa/team/-/issues/40431): "`ci-runner-01`
invalid ubuntu package signatures"; [gitlab#95
(closed)](/tpo/tpa/gitlab/-/issues/95): "Occasionally clean-up
Gitlab CI storage". non-linear, possibly explosive and
unpredictable growth. cache sharing issues between
runners. somewhat under control now that we have more runners.
* [#40477 (closed)](/tpo/tpa/team/-/issues/40477) ("backup failure: disk full on
bungei"). backups, non-linear, mostly archive-01 but also
gitlab. workaround [good for ~8
months](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477#note_2756638
"backup failure: disk full on bungei") (from October 2021, so
until June 2022) hopefully.
* [#40442 (closed)](/tpo/tpa/team/-/issues/40442) ("meronense running out of disk
space"). metrics storage, linear growth. transitioning between
storage systems (see [tpo/network-health/metrics/collector#40012
(closed)](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40012
"Come up with a plan to make past descriptors etc. easier
available and queryable \(giant database\)")). workaround good for
years.
* [#40535 (closed)](/tpo/tpa/team/-/issues/40535): "colchicifolium disk full". storage is
steadily increasing, adding about 30GB per 90 days (roughly
120GB/year) according to [@hiro](/hiro "Hiro"), with `/srv` regularly
reaching 90% full and capacity being added.
TODO: fold in <https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2808917>:
> archive-01 ([#40779 (closed)](/tpo/tpa/team/-/issues/40779)) and vineale ([#40778 (closed)](/tpo/tpa/team/-/issues/40778))
> just ran out of disk space too. the strategy for the former is to
> just bump up disk space and eventually migrate to gitlab. for the
> latter, it's unclear. it seems like we're eating 2TB a year on that
> thing, or more...
>
> also, we were asked where to put large VM images (3x8GB), and we
> answered "git(lab) LFS" with the intention of moving to object
> storage if we run out of space on the main VM, see #40767 (closed)
> for the discussion.
Note that GitLab needs to be scaled up specifically as well, which
primarily involves splitting it in multiple machines, see [#40479](/tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users")
for that discussion. It's partly in scope of this discussion in the
sense that a solution chosen here must be somewhat useful to scale
GitLab out.
Design and performance issues:
* Ganeti's DRBD backend: a full reboot of all nodes in the cluster
takes hours, because all machines need to be migrated between the
nodes (which is fine) but do not migrate back to their original
layout afterwards (which is not). this might or might not be fixed by
a change in the migration algorithm, but it could also be fixed by
moving storage away from DRBD to something else.
* [tpo/network-health/metrics/collector#40012 (closed)](/tpo/network-health/metrics/collector/-/issues/40012): "Come up
with a plan to make past descriptors etc. easier available and
queryable \(giant database\)" (in onionoo/collector storage). lots
of small files, might require FS snapshots or transition to
database, see new design in that ticket, or object storage (see
also [tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-health/metrics/collector/-/issues/40023),
"Move collector storage from file based to object storage")
* [#40650 (closed)](/tpo/tpa/team/-/issues/40650): "colchicifolium backups are barely
functional". backups take _days_ to complete, possible solution is
to "Move collector storage from file based to object storage"
([tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-health/metrics/collector/-/issues/40023))
* [#40482 (closed)](/tpo/tpa/team/-/issues/40482): "meronense performance problems (out of
memory?)". memory usage spikes every night; not directly TPA's
responsibility, but related to the above.
Much of the above, and this RFC, comes from the brainstorm in issue
[tpo/tpa/team#40478][].
## Storage usage analysis
TODO: redo the graphs in <https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208>.
# Proposal
## Goals
<!-- include bugs to be fixed -->
### Must have
### Nice to have
### Non-Goals
## Scope
## Affected users
# Examples or Personas
Examples:
* ...
Counter examples:
* ...
# Alternatives considered
## Throw hardware at it
## TODO: brainstorm ideas to triage
just throwing ideas out there.
in kubernetes, assuming we might want to go there:
* <https://longhorn.io/> - k8s volumes, native-only, no legacy support?
* <https://rook.io/> - ceph operator
object storage options (see the access sketch after this list):
* [minio](https://min.io/): suggested/shipped by gitlab omnibus now?
* ceph has support for s3
* [openio](https://www.openio.io/) mentioned in one of the GitLab threads, not evaluated
* [garage](https://garagehq.deuxfleurs.fr/) is another alternative
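all of these speak the same S3 API, so from the application side (GitLab,
runner caches, collector) the choice is mostly an operational one. here is a
minimal access sketch with boto3; the endpoint, credentials and bucket name
are made-up placeholders, not an existing TPA service:

```python
# minimal sketch: talk to any S3-compatible store (minio, Ceph RadosGW,
# garage, ...) the same way. endpoint, credentials and bucket name are
# hypothetical placeholders, not an existing TPA service.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.torproject.org:9000",  # hypothetical
    aws_access_key_id="REPLACE_ME",
    aws_secret_access_key="REPLACE_ME",
)

s3.create_bucket(Bucket="ci-artifacts-test")
# upload a local file (say, a CI artifact) and list what ended up in the bucket
s3.upload_file("artifact.zip", "ci-artifacts-test", "artifact.zip")
for obj in s3.list_objects_v2(Bucket="ci-artifacts-test").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

swapping one backend for another should then only be a matter of changing the
endpoint and credentials.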
in general: i think Ceph is a great option that ticks a lot of the boxes here:
* redundancy (a la DRBD)
* but also load-balancing (ie. read/write to multiple servers, i think)
* S3 backend, which checks the gitlab box.
* native ganeti integration
the only concern might be its performance and reliability. gitlab evaluated it
as an NFS replacement but decided against it. other war stories:
* <https://blog.acolyer.org/2019/11/06/ceph-evolution/>
* <https://michael-prokop.at/blog/2021/04/09/a-ceph-war-story/>
* <https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service>
* <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> - gitlab evaluated Ceph and a move to bare metal in 2016 and decided to stick with the cloud
alternatively, we could go with a SAN, home-grown or commercial, but i would
rather avoid proprietary stuff, which means we'd have to build our own, and
i'm not sure how we would do _that_. ZFS replication maybe? and that would
only solve the Ganeti storage problems. we'd still need an S3 store, but we
could use something like minio for that specifically.
oh, and we could fix the backup problems by ditching bacula and switching to
something like borg. we'd need an offsite server to "pull" the backups,
however (because borg is push, which means a compromised server being backed
up can trash its own backups on the backup server). we could build this with
ZFS/BTRFS replication, again.
> another caveat with borg is that restores are kind of slow. bacula
> seems to be really fast at restores, at least that's my experience
> restoring websites in #40501 (closed) today; really positive
> feeling.
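to make the pull idea above a bit more concrete, here's a rough sketch
(hypothetical hostnames, pools and datasets; nothing like this exists yet) of
an offsite host pulling a ZFS snapshot of the borg repositories over SSH, so
that neither the clients nor the backup server hold credentials that could
delete the offsite copy:

```python
#!/usr/bin/env python3
# rough sketch of a pull-based offsite copy: run on the *offsite* host, which
# snapshots the (hypothetical) borg repository dataset on the backup server
# and pulls it with zfs send/recv over SSH. the backup server has no
# credentials on this host, so a compromise there cannot trash this copy.
import subprocess
from datetime import datetime, timezone

BACKUP_HOST = "backup.example.torproject.org"  # hypothetical
DATASET = "tank/borg-repos"                    # hypothetical remote dataset
LOCAL_POOL = "offsite"                         # hypothetical local pool

def run(cmd, **kwargs):
    print("+", " ".join(cmd))
    return subprocess.run(cmd, check=True, **kwargs)

# 1. take a snapshot on the backup server
snap = datetime.now(timezone.utc).strftime("offsite-%Y%m%dT%H%M%SZ")
run(["ssh", BACKUP_HOST, "zfs", "snapshot", f"{DATASET}@{snap}"])

# 2. pull it: zfs send runs remotely, zfs recv locally. this is a full send;
#    a real job would keep the previous snapshot around and use `zfs send -i`
#    for incrementals.
send = subprocess.Popen(
    ["ssh", BACKUP_HOST, "zfs", "send", f"{DATASET}@{snap}"],
    stdout=subprocess.PIPE,
)
run(["zfs", "recv", "-F", f"{LOCAL_POOL}/{DATASET}"], stdin=send.stdout)
send.stdout.close()
if send.wait() != 0:
    raise RuntimeError("zfs send failed")
```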
## TODO: triage Ceph war stories from GitLab and SO
more war stories, this time from gitlab:
* when they were saying they would move to bare metal and ceph: <https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/>
* when they subsequently tried, failed, and switched back to the cloud (without ceph), see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> and <https://gitlab.com/gitlab-com/operations/-/issues/1>; quote from [this deployment issue](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631):
> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
> for the database server, but for the git repositories. In the end we
> abandoned our usage of CephFS for shared storage and reverted back to a
> sharded NFS design.
and StackOverflow's (presumably) Jeff Atwood:
* "We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se." <https://news.ycombinator.com/item?id=12940042> in response to the first article from GitLab.com above (which ended up being correct: the went back to the cloud)
about this, one key thing to keep in mind is that GitLab were looking
for an NFS replacement.
we don't use NFS anywhere right now (thank god) so that is not a
requirement.
the above "horror stories" might not be the same with other storage
mechanisms. indeed, there's a big difference between using Ceph as a
filesystem (ie. CephFS) and an object storage (RadosGW) or block
storage (RBD), which might be better targets for us.
In particular, we're likely to use Ceph as a block device -- for
Ganeti instance disks, which Ganeti has good support for -- or object
storage -- for GitLab's "things", which it is now also designed
for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated
in GitLab, so shared data storage is expected to go through S3-like
"object storage" APIs from here on.
## TODO: triage CERN experience
oh, and also i should drop this here... CERN started with a 3PB Ceph
deployment [around 2015](https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service). It seems it's still in use:
* [2017](https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf), 65PB
* [2018](https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf), 300PB?
* [2019](https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/), 1PB/day, 115PB/year?
* [2021](https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf), 65PB?
... although, as you can see, it's not exactly clear to me how much data is
managed by ceph. they seem to have a good experience with Ceph in any case,
with three active committers, and they say it's a "great community", which is
certainly a plus...
## TODO: triage meeting brainstorm
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2788264
we ended up [brainstorming this in a
meeting](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/meeting/2022-02-14#storage-brainstorm),
where we said:
> We considered the following technologies for the broader problem:
>
> * S3 object storage for gitlab
> * ceph block storage for ganeti
> * filesystem snapshots for gitlab / metrics servers backups
>
> We'll look at setting up a VM with minio for testing. We could first test
> the service with the CI runners image/cache storage backends, which can
> easily be rebuilt/migrated if we want to drop that test.
>
> This would disregard the block storage problem, but we could pretend this
> would be solved at the service level eventually (e.g. redesign the metrics
> storage, split up the gitlab server). Anyways, migrating away from DRBD to
> Ceph is a major undertaking that would require a lot of work. It would also
> be part of the largest "[trusted high performance
> cluster](https://gitlab.torproject.org/groups/tpo/tpa/-/milestones/2)" work
> that we recently de-prioritized.
so it looks like the next step might be to set up minio here as a prototype.
[@hiro](/hiro "Hiro") is also considering object storage for collector
([tpo/network-health/metrics/collector#40023 (closed)](/tpo/network-health/metrics/collector/-/issues/40023 "Move collector storage from file based to object storage"))
which could solve a lot of the problems we're having here.
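if we do stand up that minio prototype, a first smoke test could be as simple
as the sketch below, using the upstream `minio` Python client; the endpoint,
credentials and bucket name are placeholders for illustration only:

```python
# sketch of a smoke test against a hypothetical minio prototype: create the
# bucket a CI runner cache could use and round-trip a small object.
# endpoint, credentials and bucket name are placeholders.
from minio import Minio

client = Minio(
    "minio-01.torproject.org:9000",  # hypothetical prototype VM
    access_key="REPLACE_ME",
    secret_key="REPLACE_ME",
    secure=True,
)

bucket = "runner-cache-test"
if not client.bucket_exists(bucket):
    client.make_bucket(bucket)

client.fput_object(bucket, "smoke-test.txt", "/etc/hostname")
for obj in client.list_objects(bucket):
    print(obj.object_name, obj.size)
```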
## upstream provider
"they have terabytes of storage where we could run a VM to have a
secondary storage server for bacula."
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2843500
## minio licensing dispute
re minio, they are involved in a [licensing
dispute](https://blocksandfiles.com/2023/03/26/we-object-minio-says-no-more-open-license-for-you-weka/)
with commercial storage providers
([Weka](https://www.weka.io/) and [Nutanix](https://www.nutanix.com/)) because
those vendors used Minio in their products without giving attribution. see also
[this hacker news discussion](https://news.ycombinator.com/item?id=32148007).
it should also be noted that they switched to the AGPL relatively recently. i
don't think this should keep us from using it, but just a note to say there's
some storm brewing there.
# Costs
# Approval
# Deadline
# Status
This proposal is currently in the `draft` state.
# References
* discussion issue: [tpo/tpa/team#40478][].
[tpo/tpa/team#40478]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478