Verified Commit fef69bf1 authored by anarcat

look at ceph in depth, split out other topics (team#40478)

parent ce3687b5
+139 −90
@@ -199,101 +199,12 @@ just throwing ideas out there.

object storage options:

 * ceph has support for s3
 * [openio][] mentioned in one of the GitLab threads, not evaluated,
   python, main website down: https://www.openio.io/
   (`SSL_ERROR_NO_CYPHER_OVERLAP`)

in general: i think Ceph is a great option that ticks a lot of the boxes here:

  * redundancy (a la DRBD)
  * but also load-balancing (ie. read/write to multiple servers, i think)
  * S3 backend, which checks the gitlab box.
  * native ganeti integration

the only concern might be its performance and reliability. gitlab evaluated it
as a NFS replacement but decided against it. other war stories:

  * <https://blog.acolyer.org/2019/11/06/ceph-evolution/>
  * <https://michael-prokop.at/blog/2021/04/09/a-ceph-war-story/>
  * <https://www.openstack.org/summit/vancouver-2015/summit-videos/presentation/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service>
  * <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> \- gitlab evaluated Ceph and moving to metal in 2016 and decided to stick with the cloud

alternatively, we could go with a SAN, home-grown or commercial, but i would
rather avoid proprietary stuff, which means we'd have to build our own, and
i'm not sure how we would do _that_. ZFS replication maybe? and that would
only solve the Ganeti storage problems. we'd still need an S3 storage, but we
could use something like MinIO for that specifically.

oh, and we could fix the backup problems by ditching bacula and switching to
something like borg. we'd need an offsite server to "pull" the backups,
however (because borg is push, which means a compromised backup server can
trash its own backups). we could build this with ZFS/BTRFS replication, again.

> another caveat with borg is that restores are kind of slow. bacula
> seems to be really fast at restores, at least it's my experience
> restoring websites in #40501 (closed) today, really positive
> feeling.

[openio]: https://www.openio.io/

## TODO: triage Ceph war stories from GitLab and SO

more war stories, this time from gitlab:

  * when they were saying they would move to bare metal and ceph: <https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/>
  * when they subsequently tried and failed and switched back to the cloud and not ceph, see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> <https://gitlab.com/gitlab-com/operations/-/issues/1> quote from [this deployment issue][]: 

> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
> for the database server, but for the git repositories. In the end we
> abandoned our usage of CephFS for shared storage and reverted back to a
> sharded NFS design.

and StackOverflow's (presumably) Jeff Atwood:

  * "We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se." <https://news.ycombinator.com/item?id=12940042> in response to the first article from GitLab.com above (which ended up being correct: they went back to the cloud)

about this, one key thing to keep in mind is that GitLab were looking
for an NFS replacement.

we don't use NFS anywhere right now (thank god) so that is not a
requirement.

the above "horror stories" might not be the same with other storage
mechanisms. indeed, there's a big difference between using Ceph as a
filesystem (ie. CephFS) and an object storage (RadosGW) or block
storage (RBD), which might be better targets for us.

In particular, we're likely to use Ceph as a block device -- for
Ganeti instance disks, which Ganeti has good support for -- or object
storage -- for GitLab's "things", which it is now also designed
for. And indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated
in GitLab, so shared data storage is expected to go through S3-like
"object storage" APIs from here on.

[this deployment issue]: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631

## TODO: triage CERN experience

oh, and also i should drop this here... CERN started with a 3PB Ceph
deployment [around 2015][]. It seems it's still in use:

  * [2017][], 65PB
  * [2018][], 300PB?
  * [2019][], 1PB/day, 115PB/year?
  * [2021][], 65PB?

... although, as you can see, it's not exactly clear to me how much data is
managed by ceph. they seem to have a good experience with Ceph in any case,
with three active committers, and they say it's a "great community", which is
certainly a plus...

[around 2015]: https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service
[2017]: https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
[2018]: https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf
[2019]: https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/
[2021]: https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf

## TODO triage meeting brainstorm

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2788264
@@ -482,7 +393,123 @@ News discussion][30256753] and [this other one][33853539].

## SeaweedFS

https://github.com/seaweedfs/seaweedfs
TODO: review https://github.com/seaweedfs/seaweedfs

## Ceph

[Ceph](https://ceph.io/en/) is ([according to Wikipedia](https://en.wikipedia.org/wiki/Ceph_(software))) a "software-defined storage
platform that provides object storage, block storage, and file storage
built on a common distributed cluster foundation. Ceph provides
completely distributed operation without a single point of failure and
scalability to the exabyte level, and is freely available."

[ceph-debian]: https://tracker.debian.org/pkg/ceph

It's kind of a beast. It's written in C++ and Python and is [packaged
in Debian][ceph-debian]. It provides a *lot* of features we are looking for
here:

 * redundancy ("a la" DRBD)
 * load-balancing (read/write to multiple servers)
 * [far-ranging](https://docs.ceph.com/en/latest/radosgw/s3/) object storage compatibility
 * native Ganeti integration with an iSCSI backend
 * [Puppet module](https://github.com/openstack/puppet-ceph)
 * [Grafana](https://packages.debian.org/unstable/ceph-grafana-dashboards) and [Prometheus dashboards](https://packages.debian.org/unstable/ceph-prometheus-alerts), both packaged in Debian

More features:

 * block device snapshots and mirroring
 * erasure coding
 * self-healing
 * used at CERN, OVH, and DigitalOcean
 * [yearly release cycle with two-year support lifetime](https://docs.ceph.com/en/latest/releases/general/)
 * cache tiering (e.g. use SSDs as caches)
 * also provides a networked filesystem (CephFS) with an optional NFS
   frontend
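
To give a rough feel for what erasure coding buys over plain
replication, here is a minimal sketch of the usable-capacity
arithmetic. The raw capacity, the replica count, and the `k`/`m`
values are illustrative assumptions, not figures from our
infrastructure or defaults taken from the Ceph documentation; the
function names are hypothetical too.

```python
def usable_replicated(raw_tb: float, copies: int = 3) -> float:
    """Usable capacity when every object is stored `copies` times."""
    return raw_tb / copies


def usable_erasure_coded(raw_tb: float, k: int = 4, m: int = 2) -> float:
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_tb * k / (k + m)


# hypothetical raw capacity in TB, picked only to make the numbers round
raw = 120.0

print(f"3x replication: {usable_replicated(raw):.0f} TB usable")
print(f"EC 4+2:         {usable_erasure_coded(raw):.0f} TB usable")
```

Both layouts in this sketch tolerate the loss of two copies or
chunks, but the erasure-coded pool leaves twice as much usable space,
at the cost of more CPU and network work on writes and recovery.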

Downsides:

 * complexity: at least 3-4 daemons to manage a cluster, although
   this might be easier to live with thanks to the Debian
   packages
 * high hardware requirements (quad-core, 64-128GB RAM, 10 Gbps),
   although their [minimum requirements](https://docs.ceph.com/en/latest/start/hardware-recommendations/#minimum-hardware-recommendations) are actually quite
   attainable

### Scalability promises

CERN started with a 3PB Ceph deployment [around 2015][]. It seems it's
still in use:

  * [2017][], 65PB
  * [2018][], 300PB?
  * [2019][], 1PB/day, 115PB/year?
  * [2021][], 65PB?

... although, as you can see, it's not exactly clear to me how much data is
managed by ceph. they seem to have a good experience with Ceph in any case,
with three active committers, and they say it's a "great community", which is
certainly a plus.
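
Part of the confusion is that the 2019 figures mix a daily rate and a
yearly total. A quick sanity check, as a sketch: this only converts
between the two units and assumes a 365-day year, it says nothing
about what the linked article actually measured.

```python
DAYS_PER_YEAR = 365


def pb_per_year(pb_per_day: float) -> float:
    """Yearly volume if the daily rate were sustained all year."""
    return pb_per_day * DAYS_PER_YEAR


def pb_per_day(pb_per_year_total: float) -> float:
    """Average daily rate implied by a yearly total."""
    return pb_per_year_total / DAYS_PER_YEAR


print(pb_per_year(1.0))            # sustained 1PB/day, in PB/year
print(round(pb_per_day(115.0), 2)) # average PB/day behind 115PB/year
```

A sustained 1PB/day would add up to 365PB/year, while 115PB/year
averages to roughly 0.32PB/day, so the two numbers can only both hold
if the daily figure is a peak rather than an average.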

On the other hand, managing lots of data is part of their core
mission, in a sense, so they can probably afford putting more people
on the problem than we can. 

[around 2015]: https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service
[2017]: https://indico.cern.ch/event/649159/contributions/2761965/attachments/1544385/2423339/hroussea-storage-at-CERN.pdf
[2018]: https://indico.mathrice.fr/event/143/contribution/1/material/slides/0.pdf
[2019]: https://www.hpcwire.com/2019/09/30/how-ceph-is-helping-to-unlock-the-secrets-of-the-universe/
[2021]: https://www.concat.de/wp-content/uploads/2021/05/WP-Storage-Wars-Part-3-CEPH-for-HPC-Environments.pdf

### Complexity and other concerns

GitLab tried to [move from the cloud to bare metal](https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/). [Issue 727](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727)
and [issue 1](https://gitlab.com/gitlab-com/operations/-/issues/1) track their attempt to migrate to Ceph, which
failed. They moved back to the cloud. A choice quote from [this
deployment issue][]:

> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
> for the database server, but for the git repositories. In the end we
> abandoned our usage of CephFS for shared storage and reverted back to a
> sharded NFS design.

Jeff Atwood also described his experience, presumably from
StackOverflow's attempts:

> We had disastrous experiences with Ceph and Gluster on bare metal. I
> think this says more about the immaturity (and difficulty) of
> distributed file systems than the cloud per se.

This was a [Hacker News comment](https://news.ycombinator.com/item?id=12940042) in response to the first article
from GitLab.com above, which ended up being correct as GitLab went
back to the cloud.

One key thing to keep in mind is that GitLab was looking for an NFS
replacement, but we don't use NFS anywhere right now (thank god) so
that is not a requirement for us. Those issues might therefore be less
of a problem here, as the "horror stories" above may not apply to
other storage mechanisms. Indeed, there's a big difference between
using Ceph as a filesystem (i.e. CephFS) and as object storage
(RadosGW) or block storage (RBD), which might be better targets for
us.

In particular, we could use Ceph as a block device -- for Ganeti
instance disks, which Ganeti has good support for -- or object storage
-- for GitLab's "things", which it is now also designed for. And
indeed, "NFS" (ie. real filesystem) is now (14.x?) deprecated in
GitLab, so shared data storage is expected to go through S3-like
"object storage" APIs from here on.

[this deployment issue]: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631

Some more Ceph war stories:

 * [A Ceph war story](https://michael-prokop.at/blog/2021/04/09/a-ceph-war-story/) - a major outage and recovery due to XFS and
   firmware problems
 * [File systems unfit as distributed storage backends: lessons from
   ten years of Ceph evolution](https://blog.acolyer.org/2019/11/06/ceph-evolution/) - how Ceph migrated from normal
   filesystem backends to their own native block device store
   ("BlueStore"), an approach also used by recent MinIO versions

## Kubernetes

@@ -500,6 +527,28 @@ beast and seems overkill to fix the immediate problem at hand,
although it could be interesting to manage our growing fleet of
containers eventually.

## Storage Area Network (SAN)

We could go with a SAN, home-grown or commercial, but I would rather
avoid proprietary stuff, which means we'd have to build our own, and
I'm not sure how we would do _that_. ZFS replication maybe? And that
would only solve the Ganeti storage problems. We'd still need S3
storage, but we could use something like MinIO for that specifically.

## Backup-specific solutions

TODO: formatting.

we could fix the backup problems by ditching bacula and switching to
something like borg. we'd need an offsite server to "pull" the backups,
however (because borg is push, which means a compromised backup server can
trash its own backups). we could build this with ZFS/BTRFS replication, again.

> another caveat with borg is that restores are kind of slow. bacula
> seems to be really fast at restores, at least it's my experience
> restoring websites in #40501 (closed) today, really positive
> feeling.

# Costs

# Approval