TPA-RFC-84: design and implement backup strategy for MinIO buckets or the entire server

We're considering using MinIO for more and more things, mainly GitLab (artifacts storage in #41403 and gitaly backups in #40518) but possibly others as well (e.g. metrics storage in tpo/network-health/metrics/collector#40023 (closed)).

Right now, we don't have any backups of that server, which is probably fine: we only store container images there, which can be regenerated in case of a catastrophe. But once we start storing gitaly backups and GitLab artifacts there, that data needs to be preserved reliably.

Research how backups can be performed, develop a policy and implement it.

Next steps:

  • research articles anarcat found on the topic (see wallabag)
  • discuss the idea in the network
  • decide if we want this per bucket or per site
  • write up a proposal (in progress, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling), covering in particular:
    • backup/restore recovery
    • impact on other teams
    • timeline
    • estimates
    • review this issue
  • implement proposal
    • minio-fsn-02 setup (4TiB), consider splitting in chunks? (#42136 (closed)) see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#warm-hard-disk-storage
    • implement quotas (#42155 (closed)) (should resolve #42077 (closed)), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#quotas
      • add monitoring for bucket quota usage
    • sync up minio-fsn-02 and minio-01 with hot/cold storage (#42156 (closed))
      • setup tiered storage
      • join minio-fsn-02 to cluster
      • test assigning a bucket to a specific tier. tie the network-health bucket to the "warm" tier
    • add more storage capacity to the warm tier cluster (#42237 (closed))
    • implement storage backups for both clusters (minio-01 and minio-fsn-02), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#minio-native-backups-with-possible-exceptions
    • consider setting up a new minio-dal-03 server
  • document and test backup/restore procedures
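
The "add monitoring for bucket quota usage" item above could be sketched roughly as follows. This is only an illustration of the check itself, not the monitoring that was actually deployed: the `BucketUsage` type, the `over_threshold` helper, the sample bucket names and sizes, and the 80% threshold are all hypothetical, and in practice the usage and quota figures would come from MinIO's own metrics or admin tooling rather than being hardcoded.

```python
from dataclasses import dataclass

TIB = 1024 ** 4  # bytes in one tebibyte


@dataclass(frozen=True)
class BucketUsage:
    """Usage and quota for one bucket (hypothetical structure)."""
    name: str
    used_bytes: int
    quota_bytes: int  # 0 means "no quota configured"

    @property
    def ratio(self) -> float:
        """Fraction of the quota currently in use."""
        return self.used_bytes / self.quota_bytes


def over_threshold(buckets, threshold=0.8):
    """Return the names of buckets at or above `threshold` of their quota.

    Buckets without a quota (quota_bytes == 0) are skipped.
    """
    return [
        b.name
        for b in buckets
        if b.quota_bytes > 0 and b.ratio >= threshold
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        BucketUsage("example-artifacts", 9 * TIB // 10, 1 * TIB),
        BucketUsage("example-metrics", 1 * TIB // 10, 1 * TIB),
        BucketUsage("example-unlimited", 5 * TIB, 0),
    ]
    for name in over_threshold(sample):
        print(f"WARNING: bucket {name} is close to its quota")
```

A check like this would normally live in an alerting rule rather than a script; the point is only that the alert compares usage against the configured quota per bucket and ignores buckets that have none.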
Edited Jul 15, 2025 by lelutin