look at ceph in depth, split out other topics (tpo/tpa/team#40478) (fef69bf1) · Commits · The Tor Project / TPA / Wiki Replica

policy/tpa-rfc-56-large-file-storage.md

+139 −90

		@@ -199,101 +199,12 @@ just throwing ideas out there.

		object storage options:

		* ceph has support for s3
		* [openio][] mentioned in one of the GitLab threads, not evaluated,
		python, main website down: https://www.openio.io/
		(`SSL_ERROR_NO_CYPHER_OVERLAP`)

		in general: i think Ceph is a great option that ticks a lot of the boxes here:

		* redundancy (a la DRBD)
		* but also load-balancing (ie. read/write to multiple servers, i think)
		* S3 backend, which checks the gitlab box.
		* native ganeti integration

		the only concern might be its performance and reliability. gitlab evaluated it
		as a NFS replacement but decided against it. other war stories:

		* <https://blog.acolyer.org/2019/11/06/ceph-evolution/>
		* <https://michael-prokop.at/blog/2021/04/09/a-ceph-war-story/>
		* <https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service>
		* <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> - gitlab evaluated Ceph and moving to metal in 2016 and decided to stick with the cloud

		alternatively, we could go with a SAN, home-grown or commercial, but i would
		rather avoid proprietary stuff, which means we'd have to build our own, and
		i'm not sure how we would do _that_. ZFS replication maybe? and that would
		only solve the Ganeti storage problems. we'd still need an S3 storage, but we
		could use something like MinIO for that specifically.

		oh, and we could fix the backup problems by ditching bacula and switching to
		something like borg. we'd need an offsite server to "pull" the backups,
		however (because borg is push, which means a compromised backup server can
		trash its own backups). we could build this with ZFS/BTRFS replication, again.

		> another caveat with borg is that restores are kind of slow. bacula
		> seems to be really fast at restores, at least it's my experience
		> restoring websites in #40501 (closed) today, really positive
		> feeling.

		[openio]: https://www.openio.io/

		## TODO: triage Ceph war stories from GitLab and SO

		more war stories, this time from gitlab:

		* when they were saying they would move to bare metal and ceph: <https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/>
		* when they subsequently tried and failed and switched back to the cloud and not ceph, see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> <https://gitlab.com/gitlab-com/operations/-/issues/1> quote from [this deployment issue][]:

		> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
		> for the database server, but for the git repositories. In the end we
		> abandoned our usage of CephFS for shared storage and reverted back to a
		> sharded NFS design.

		and StackOverflow's (presumably) Jeff Atwood:

		* "We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se." <https://news.ycombinator.com/item?id=12940042> in response to the first article from GitLab.com above (which ended up being correct: they went back to the cloud)

		about this, one key thing to keep in mind is that GitLab were looking
		for an NFS replacement.

		we don't use NFS anywhere right now (thank god) so that is not a
		requirement.

		the above "horror stories" might not be the same with other storage
		mechanisms. indeed, there's a big difference between using Ceph as a
		filesystem (ie. CephFS) and an object storage (RadosGW) or block
		storage (RBD), which might be better targets for us.

		In particular, we're likely to use Ceph as a block device -- for
		Ganeti instance disks, which Ganeti has good support for -- or object
		storage -- for GitLab's "things", which it is now also designed
		for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated
		in GitLab, so shared data storage is expected to go through S3-like
		"object storage" APIs from here on.

		[this deployment issue]: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631

		## TODO: triage CERN experience

		oh, and also i should drop this here... CERN started with a 3PB Ceph
		deployment [around 2015][]. It seems it's still in use:

		* [2017][], 65PB
		* [2018][], 300PB?
		* [2019][], 1PB/day, 115PB/year?
		* [2021][], 65PB?

		... although, as you can see, it's not exactly clear to me how much data is
		managed by ceph. they seem to have a good experience with Ceph in any case,
		with three active committers, and they say it's a "great community", which is
		certainly a plus...

		[around 2015]: https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service
		[2017]: https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
		[2018]: https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf
		[2019]: https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/
		[2021]: https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf

		## TODO triage meeting brainstorm

		https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2788264
		@@ -482,7 +393,123 @@ News discussion][30256753] and [this other one][33853539].

		## SeaweedFS

		https://github.com/seaweedfs/seaweedfs
		TODO: review https://github.com/seaweedfs/seaweedfs

		## Ceph

		[Ceph](https://ceph.io/en/) is ([according to Wikipedia](https://en.wikipedia.org/wiki/Ceph_(software))) a "software-defined storage
		platform that provides object storage, block storage, and file storage
		built on a common distributed cluster foundation. Ceph provides
		completely distributed operation without a single point of failure and
		scalability to the exabyte level, and is freely available."

		[ceph-debian]: https://tracker.debian.org/pkg/ceph

		It's kind of a beast. It's written in C++ and Python and is [packaged
		in Debian][ceph-debian]. It provides a lot of features we are looking for
		here:

		* redundancy ("a la" DRBD)
		* load-balancing (read/write to multiple servers)
		* [far-ranging](https://docs.ceph.com/en/latest/radosgw/s3/) object storage compatibility
		* native Ganeti integration with an iSCSI backend
		* [Puppet module](https://github.com/openstack/puppet-ceph)
		* [Grafana](https://packages.debian.org/unstable/ceph-grafana-dashboards) and [Prometheus dashboards](https://packages.debian.org/unstable/ceph-prometheus-alerts), both packaged in Debian

		More features:

		* block device snapshots and mirroring
		* erasure coding
		* self-healing
		* used at CERN, OVH, and DigitalOcean
		* [yearly release cycle with two-year support lifetime](https://docs.ceph.com/en/latest/releases/general/)
		* cache tiering (e.g. use SSDs as caches)
		* also provides a networked filesystem (CephFS) with an optional NFS
		frontend
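
To make the trade-off between plain redundancy and erasure coding more concrete, here is a minimal sketch of the raw-versus-usable capacity arithmetic. The 120 TB figure and the `k=4, m=2` profile are hypothetical values picked for illustration, not a sizing proposal and not necessarily what Ceph would use by default. Both layouts shown survive the loss of two copies or chunks.

```python
def overhead_replicated(copies: int) -> float:
    """Raw bytes stored per usable byte with N-way replication."""
    return float(copies)


def overhead_erasure(k: int, m: int) -> float:
    """Raw bytes stored per usable byte with k data chunks and m coding chunks."""
    return (k + m) / k


def usable_capacity(raw_tb: float, overhead: float) -> float:
    """Usable capacity (TB) left once redundancy overhead is paid."""
    return raw_tb / overhead


if __name__ == "__main__":
    raw_tb = 120.0  # hypothetical raw cluster capacity, in TB
    layouts = [
        ("3x replication", overhead_replicated(3)),
        ("erasure coding k=4, m=2", overhead_erasure(4, 2)),
    ]
    for label, overhead in layouts:
        usable = usable_capacity(raw_tb, overhead)
        print(f"{label}: {overhead:.2f}x overhead, {usable:.0f} TB usable")
```

Under those assumptions, erasure coding leaves twice the usable space of triple replication, which would need to be weighed against its extra CPU and recovery cost.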

		Downsides:

		* complexity: at least 3-4 daemons to manage a cluster, although
		this might be easier to live with thanks to the Debian
		packages
		* high hardware requirements (quad-core, 64-128GB RAM, 10Gbps),
		although their [minimum requirements](https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations) are actually quite
		attainable

		### Scalability promises

		CERN started with a 3PB Ceph deployment [around 2015][]. It seems it's
		still in use:

		* [2017][], 65PB
		* [2018][], 300PB?
		* [2019][], 1PB/day, 115PB/year?
		* [2021][], 65PB?

		... although, as you can see, it's not exactly clear to me how much data is
		managed by Ceph. They seem to have a good experience with Ceph in any case,
		with three active committers, and they say it's a "great community", which is
		certainly a plus.

		On the other hand, managing lots of data is part of their core
		mission, in a sense, so they can probably afford putting more people
		on the problem than we can.
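
Part of the confusion around the 2019 figures can be checked with simple arithmetic. The sketch below only converts between the two units quoted above. It assumes a 365-day year and sustained rates, which may well not be what the source article meant.

```python
DAYS_PER_YEAR = 365  # assumption: plain calendar year, sustained rate


def yearly_from_daily(pb_per_day: float) -> float:
    """PB per year implied by a sustained daily rate."""
    return pb_per_day * DAYS_PER_YEAR


def daily_from_yearly(pb_per_year: float) -> float:
    """Average PB per day implied by a yearly total."""
    return pb_per_year / DAYS_PER_YEAR


if __name__ == "__main__":
    print(f"1 PB/day sustained is {yearly_from_daily(1.0):.0f} PB/year")
    print(f"115 PB/year averages {daily_from_yearly(115.0):.3f} PB/day")
```

So "1PB/day" and "115PB/year" cannot both be sustained averages of the same quantity, which is consistent with the question marks in the list above.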

		[around 2015]: https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service
		[2017]: https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
		[2018]: https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf
		[2019]: https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/
		[2021]: https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf

		### Complexity and other concerns

		GitLab tried to [move from the cloud to bare metal](https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/). [Issue 727](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727)
		and [issue 1](https://gitlab.com/gitlab-com/operations/-/issues/1) track their attempt to migrate to Ceph, which
		failed: they moved back to the cloud. A choice quote from [this
		deployment issue][]:

		> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
		> for the database server, but for the git repositories. In the end we
		> abandoned our usage of CephFS for shared storage and reverted back to a
		> sharded NFS design.

		Jeff Atwood also described his experience, presumably from
		StackOverflow's attempts:

		> We had disastrous experiences with Ceph and Gluster on bare metal. I
		> think this says more about the immaturity (and difficulty) of
		> distributed file systems than the cloud per se.

		This was a [Hacker News comment](https://news.ycombinator.com/item?id=12940042) in response to the first article
		from GitLab.com above. The comment ended up being correct, as GitLab
		went back to the cloud.

		One key thing to keep in mind is that GitLab were looking for an NFS
		replacement, but we don't use NFS anywhere right now (thank god) so
		that is not a requirement for us. The above "horror stories" might
		therefore not apply to other storage mechanisms. Indeed, there's a
		big difference between using Ceph as a filesystem (i.e. CephFS) and
		as object storage (RadosGW) or block storage (RBD), which might be
		better targets for us.

		In particular, we could use Ceph as a block device -- for Ganeti
		instance disks, which Ganeti has good support for -- or object storage
		-- for GitLab's "things", which it is now also designed for. And
		indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated in
		GitLab, so shared data storage is expected to go through S3-like
		"object storage" APIs from here on.

		[this deployment issue]: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631

		Some more Ceph war stories:

		* [A Ceph war story](https://michael-prokop.at/blog/2021/04/09/a-ceph-war-story/) - a major outage and recovery due to XFS and
		firmware problems
		* [File systems unfit as distributed storage backends: lessons from
		ten years of Ceph evolution](https://blog.acolyer.org/2019/11/06/ceph-evolution/) - how Ceph migrated from normal
		filesystem backends to their own native block device store
		("BlueStore"), an approach also used by recent MinIO versions

		## Kubernetes

		@@ -500,6 +527,28 @@ beast and seems overkill to fix the immediate problem at hand,
		although it could be interesting to manage our growing fleet of
		containers eventually.

		## Storage Area Network (SAN)

		We could go with a SAN, home-grown or commercial, but I would rather
		avoid proprietary stuff, which means we'd have to build our own, and
		I'm not sure how we would do _that_. ZFS replication maybe? And that
		would only solve the Ganeti storage problems. We'd still need S3
		storage, but we could use something like MinIO for that specifically.

		## Backup-specific solutions

		TODO: formatting.

		we could fix the backup problems by ditching bacula and switching to
		something like borg. we'd need an offsite server to "pull" the backups,
		however (because borg is push, which means a compromised backup server can
		trash its own backups). we could build this with ZFS/BTRFS replication, again.

		> another caveat with borg is that restores are kind of slow. bacula
		> seems to be really fast at restores, at least it's my experience
		> restoring websites in #40501 (closed) today, really positive
		> feeling.

		# Costs

		# Approval