title: TPA-RFC-63: buy a new backup storage server (5k USD + 100 USD/mth)
costs: 5000 USD one time, plus 100 USD/mth hosting; 170 USD/mth amortized over 6 years.
approval: TPA, accounting, ED
deadline: March 2024
status: standard
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41364
Summary: a 5000 USD budget amortized over 6 years, plus 100 USD/mth hosting, for a total of 170 USD/mth, buying a new 80TB backup server (4 drives, expandable to 8) in the secondary location, for disaster recovery and the new metrics storage service. This is comparable to the current Hetzner backup storage server (190 USD/mth for 100TB).
Background
Our backup system relies on a beefy storage server with 90TB of raw disk capacity (72.6TiB). That server is leased from Hetzner and currently costs us 175 EUR (190 USD) per month. It is now running out of disk space: we've had issues with it since as early as 2021, but have so far always been able to work around them.
Lately, however, this work has been getting harder, wasting more and more engineering time as we try to fit more things onto this aging server. The last incident, in October 2023, used up all the remaining spare capacity on the server, and we now risk deploying new machines without backups, or breaking other machines' backups, when we run out of disk space.
This is particularly a concern for the new metrics services, which are pivoting towards a new storage solution. That solution will centralize storage on one huge database server (5TiB, growing by 0.5TiB per year), which the current backup architecture cannot handle at all, especially at the software level.
There was also a scary incident in December 2023 where parts of the main Ganeti cluster went down, taking the GitLab server and many other services down for an hour-long outage. The recovery prospects were dim: an estimate for a GitLab migration put the time at 18 hours just to copy the data between the two data centers.
So having a secondary storage server, outside of Hetzner, responsible for backing up the Hetzner machines, seems like a crucial step in handling such disaster recovery scenarios.
Proposal
The proposal is to buy a new bare metal storage server from InterPRO, the provider from which we recently bought the Tor Browser build machines and the Ganeti cluster.
We had an estimate of about 5000 USD for an 80TB server (four 20TB drives, expandable to eight). Amortized over 6 years, this works out to roughly a 70 USD/mth expense (5000 / 72 months ≈ 69.44).
Our colocation provider in the US has kindly offered us a 100 USD/mth hosting deal for this machine, bringing the total to about 170 USD/mth.
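As a quick sanity check of the arithmetic, assuming the 72-month amortization period above (rounded up to 170 USD/mth elsewhere in this document):

```shell
$ echo 'scale=2; 5000 / (6 * 12) + 100' | bc
169.44
```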
The server would be built with the same software stack as the current storage server, with the exception of the PostgreSQL database backups, for which we'd experiment with pgbarman.
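As a rough sketch of what that experiment could look like, here is a minimal Barman configuration for streaming backups of a PostgreSQL server. The `metricsdb` host name, user names, and retention policy below are illustrative assumptions, not decisions:

```ini
; /etc/barman.d/metricsdb.conf -- hypothetical example, not a committed design
[metricsdb]
description = "metrics PostgreSQL server (example)"
; regular connection, used for metadata and server checks
conninfo = host=metricsdb.example.org user=barman dbname=postgres
; replication connection, used for base backups and WAL streaming
streaming_conninfo = host=metricsdb.example.org user=streaming_barman
; take base backups over the replication protocol (pg_basebackup)
backup_method = postgres
; stream WAL continuously with pg_receivewal, using a replication slot
streaming_archiver = on
slot_name = barman
; keep enough backups to restore to any point in the last two weeks
retention_policy = RECOVERY WINDOW OF 2 WEEKS
```

With a configuration like this, `barman check metricsdb` verifies connectivity and `barman backup metricsdb` takes a base backup; whether this can replace the current database backup scheme is exactly what the experiment would determine.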
Alternatives considered
Here are the other options that were evaluated before settling on this proposal. We have not evaluated other hardware providers, as we are satisfied with the current one.
Replacement from Hetzner
An alternative to the above would be to completely replace the storage server at Hetzner with the newer generation they offer, the SX134 (the current server being an SX132). That server offers 160TiB of disk space for 208 EUR/mth (227 USD/mth).
That would solve the storage issue, but would raise costs by 37 USD/mth. It would also not address the vulnerability in the disaster recovery plan, namely that the backup server sits in the same location as the main cluster.
Resizing partitions
One problem with the current server is that storage is split into two separate partitions: one for normal backups, and another for database backups.
The normal backups partition is actually pretty healthy at the moment, at 63% disk usage. It did fill up during the October 2021 incident, after which we allocated it the last space then available on the disks, but for normal backups the situation is now stable.
For databases, it's a different story: the new metrics servers take up a lot of space, and we're struggling to keep up. It would be possible to resize partitions and move things around to allocate more space for database backups, but this is a time-consuming and risky operation, as disk shrinks are more dangerous than growth operations.
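To illustrate that asymmetry, here is a rough sketch of the two directions, assuming LVM logical volumes with ext4 on top; the `vg0` volume group, volume names, sizes, and mount point are made up for the example:

```shell
# Growing is comparatively safe: extend the volume, then grow the
# filesystem, all while it stays mounted and in use.
lvextend --size +2T vg0/db-backups
resize2fs /dev/vg0/db-backups        # grows ext4 to fill the volume

# Shrinking is the dangerous direction: the filesystem must come
# offline and be checked, and it must be shrunk before (or together
# with) the volume; a size mistake here destroys data.
umount /srv/backups
e2fsck -f /dev/vg0/backups
lvreduce --resizefs --size 40T vg0/backups   # shrinks ext4, then the LV
mount /srv/backups                           # assumes an fstab entry
```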
Resizing partitions would also not solve the disaster recovery vulnerability.
Usage diet
We could also just tell people to use less disk space and be frugal in their use of technology. In our experience, this doesn't work well: it is patronizing and, broadly, just ineffective at effecting real change.
It also doesn't solve the disaster recovery vulnerability, obviously.