Commit 99283b3a authored by anarcat

propose budget for a new backup server (team#41364)

parent 7a93952c
......@@ -25,7 +25,9 @@ and add it to the above list.
## Proposed
No policy is currently `proposed`.
<!-- No policy is currently `proposed`. -->
* [TPA-RFC-63: Storage server budget](policy/tpa-rfc-63-storage-server-budget)
## Standard
......
---
title: TPA-RFC-63: buy a new backup storage server (5k$ + 100$/mth)
---
[[_TOC_]]
Summary: a 5000$USD budget amortized over 6 years (about 70$USD/mth),
plus 100$/mth hosting, for a total of about 170$USD/mth, for a new
80TB (4 drives, expandable to 8) backup server in the secondary
location, serving disaster recovery and the new metrics storage
service. This is comparable to the current Hetzner backup storage
server (190USD/mth for 100TB).
# Background
Our backup system relies on a beefy storage server with a 90TB raw
disk capacity (72.6TiB). That bare metal server currently costs us
175EUR (190USD) per month at Hetzner, and it is running out of disk
space. We've been having issues with it since as [early as 2021][],
but have so far been able to work around them.
[early as 2021]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477
Lately, however, these workarounds have been getting more difficult,
wasting more and more engineering time as we try to fit more things
on this aging server. The latest incident, in [October 2023][], used
up all the remaining spare capacity on the server, and we're now
blocked from expanding other machines.
[October 2023]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41361
This is a particular concern for the new metrics services, which are
pivoting towards a new storage solution that will centralize storage
on one huge database server (5TiB with 0.5TiB growth per year). The
current architecture cannot handle this at all, especially at the
software level.
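
As a rough illustration of that growth figure, the sketch below projects the database size from the numbers above, assuming strictly linear growth. The function name and the 6-year horizon (borrowed from the amortization period used in this proposal) are assumptions made for illustration only, not part of the metrics team's plan.

```python
def projected_size_tib(initial_tib: float, growth_tib_per_year: float, years: int) -> float:
    """Projected database size after `years`, assuming linear growth."""
    return initial_tib + growth_tib_per_year * years


# 5TiB initial size and 0.5TiB/year growth come from the text above;
# the 6-year horizon is an assumption.
for year in range(0, 7):
    print(f"year {year}: {projected_size_tib(5, 0.5, year):.1f} TiB")
```
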
There was also a [scary incident in December 2023][] where parts of
the main Ganeti cluster went down, taking down the GitLab server and
many other services for an [hour-long outage][hour long outage]. Recovery
prospects were dim: an [estimate for a GitLab
migration][] says it would have taken 18 hours just to copy the data
over between the two data centers.
[scary incident in December 2023]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41429
[hour long outage]: https://status.torproject.org/issues/2023-12-06-gitlab-collector-outage/
[estimate for a GitLab migration]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41431
A secondary storage server, located outside of Hetzner and
responsible for backing up what we host at Hetzner, therefore seems
like a crucial step in handling such disaster recovery scenarios.
# Proposal
The proposal is to buy a new bare metal storage server from InterPRO,
the provider from which we recently bought the Tor Browser build
machines and Ganeti cluster.
We received an estimate of about 5000$USD for an 80TB server (four
20TB drives, expandable to eight). Amortized over 6 years (72
months), this works out to an expense of about 70$USD/mth.
Our colocation provider in the US has kindly offered us a 100$/mth
deal to host this server, which brings the total to about 170$/mth.
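
The arithmetic behind these figures can be sketched as follows. This is only a minimal illustration using the numbers quoted above; the function name is hypothetical, and the proposal rounds the results up to 70$ and 170$.

```python
def monthly_cost(purchase_usd: float, years: int, hosting_usd_per_month: float) -> float:
    """Amortized monthly cost: purchase price spread over the period, plus hosting."""
    months = years * 12
    return purchase_usd / months + hosting_usd_per_month


# Figures from the proposal: 5000 USD purchase, 6 years, 100 USD/mth hosting.
hardware_only = monthly_cost(5000, 6, 0)
total = monthly_cost(5000, 6, 100)
print(f"hardware: {hardware_only:.2f} USD/mth")
print(f"total:    {total:.2f} USD/mth")
```

The exact values are slightly under the rounded figures, since 5000 USD over 72 months is a little less than 70 USD per month.
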
The server would be built with the same software stack as the current
storage server, with the exception of the PostgreSQL database backups,
for which we'd experiment with [pgbarman][].
[pgbarman]: https://pgbarman.org/
# Alternatives considered
## Replacement
An alternative to the above would be to completely replace the storage
server at Hetzner with the newer generation they offer, the
[SX134][] (the current server being an SX132). That server offers
160TiB of disk space for 208EUR/mth or 227USD/mth.
[SX134]: https://www.hetzner.com/dedicated-rootserver/sx134/configurator/#/
That would solve the storage issue, but would raise costs by
37USD/mth (from 190USD to 227USD). It would also not address the
vulnerability in the disaster recovery plan, where the backup server
is in the same location as the main cluster.
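
To put the options side by side, here is a small sketch computing the monthly cost per unit of capacity from the figures quoted in this document. It takes the capacities at face value (the 90TB raw figure from the Background section for the current server, and the SX134 figure as listed) without converting between TB and TiB, so the ratios are indicative only. The names used are illustrative.

```python
# (monthly cost in USD, nominal capacity as quoted in this document)
options = {
    "current Hetzner server": (190, 90),
    "proposed server": (170, 80),
    "Hetzner SX134": (227, 160),
}


def usd_per_unit(monthly_usd: float, capacity: float) -> float:
    """Monthly cost divided by nominal capacity."""
    return monthly_usd / capacity


for name, (usd, capacity) in options.items():
    print(f"{name}: {usd_per_unit(usd, capacity):.2f} USD/mth per unit of capacity")

# Monthly increase when replacing the current server with the SX134.
increase = options["Hetzner SX134"][0] - options["current Hetzner server"][0]
print(f"SX134 increase over current server: {increase} USD/mth")
```

Cost per unit of capacity says nothing about location, which is the point the second half of the paragraph above makes: only the proposed server sits outside the main cluster's data center.
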
# Costs
5000USD one time plus 100$/mth for hosting, or about 170$/mth with the
purchase amortized over 6 years.
# Approval
Isabela, Sue.
# Deadline
Ideally, this would be approved in March.
# Status
This proposal is currently in the `proposed` state.
# References
* [quote from provider](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41536)
* [discussion issue](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41364)