* <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> - GitLab evaluated Ceph and a move to bare metal in 2016, and decided to stick with the cloud
alternatively, we could go with a SAN, home-grown or commercial, but i would
rather avoid proprietary stuff, which means we'd have to build our own, and
i'm not sure how we would do _that_. ZFS replication maybe? and that would
only solve the Ganeti storage problems. we'd still need S3-compatible object
storage, but we could use something like minio for that specifically.
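
to make the minio option concrete, here is a minimal sketch (Python with
boto3; the endpoint and credentials are made-up placeholders) of talking to a
minio server through the plain S3 API, which is the same API GitLab's object
storage support expects:

```python
import boto3

# minimal sketch: talk to a (hypothetical) minio instance through the
# standard S3 API -- endpoint and credentials below are placeholders
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="gitlab-artifacts")
s3.put_object(Bucket="gitlab-artifacts", Key="test.txt", Body=b"hello")
print(s3.get_object(Bucket="gitlab-artifacts", Key="test.txt")["Body"].read())
```

GitLab (or any other S3 client) could then be pointed at such an endpoint
instead of AWS.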
oh, and we could fix the backup problems by ditching bacula and switching to
something like borg. we'd need an offsite server to "pull" the backups,
however: borg is push, which means a compromised host can trash its own
backups on the backup server. we could build that pull with ZFS/BTRFS
replication, again (see the sketch after the quote below).
> another caveat with borg is that restores are kind of slow. bacula
> seems to be really fast at restores, at least it's my experience
> restoring websites in #40501 (closed) today, really positive
> feeling.
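
to illustrate the "pull" idea above: the backup server initiates the
transfer, so the machine being backed up never has write access to the backup
pool. a rough sketch (Python; host, snapshot and dataset names are made up),
assuming plain ZFS send/receive over SSH:

```python
import subprocess

# hypothetical names for illustration; the backup server runs this and
# "pulls" a snapshot from the production host over SSH
SOURCE_HOST = "root@prod.example.org"
SNAPSHOT = "tank/gitlab@daily-2021-01-01"
DEST_DATASET = "backup/prod/gitlab"

# zfs send on the remote side, piped into a local zfs recv
send = subprocess.Popen(
    ["ssh", SOURCE_HOST, "zfs", "send", SNAPSHOT],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["zfs", "recv", "-F", DEST_DATASET],
    stdin=send.stdout,
    check=True,
)
send.stdout.close()
if send.wait() != 0:
    raise RuntimeError("zfs send failed on %s" % SOURCE_HOST)
```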
## TODO: triage Ceph war stories from GitLab and SO
more war stories, this time from gitlab:
* when they were saying they would move to bare metal and ceph: <https://about.gitlab.com/blog/2016/11/10/why-choose-bare-metal/>
* when they subsequently tried it, failed, and switched back to the cloud (and not Ceph), see <https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/727> and <https://gitlab.com/gitlab-com/operations/-/issues/1>; a quote from [this deployment issue](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/241#note_39509631):
> While it's true that we lean towards PostgreSQL, our usage of CephFS was not
> for the database server, but for the git repositories. In the end we
> abandoned our usage of CephFS for shared storage and reverted back to a
> sharded NFS design.
and StackOverflow's (presumably) Jeff Atwood:
* "We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se." <https://news.ycombinator.com/item?id=12940042> in response to the first article from GitLab.com above (which ended up being correct: the went back to the cloud)
about this, one key thing to keep in mind is that GitLab were looking
for an NFS replacement.
we don't use NFS anywhere right now (thank god) so that is not a
requirement.
the above "horror stories" might not be the same with other storage
mechanisms. indeed, there's a big difference between using Ceph as a
filesystem (ie. CephFS) and an object storage (RadosGW) or block
storage (RBD), which might be better targets for us.
In particular, we're likely to use Ceph as a block device -- for
Ganeti instance disks, which Ganeti has good support for -- or as object
storage -- for GitLab's blobs (artifacts, uploads, LFS objects and the
like), which GitLab is now also designed for. And indeed, "NFS"
(i.e. a real filesystem) is now (14.x?) deprecated in GitLab, so shared
data storage is expected to go through S3-like "object storage" APIs
from here on.
## TODO: triage CERN experience
oh, and also i should drop this here... CERN started with a 3PB Ceph
deployment [around 2015](https://www.openstack.org/videos/summits/vancouver-2015/ceph-at-cern-a-year-in-the-life-of-a-petabyte-scale-block-storage-service). It seems it's still in use: