TPA-RFC-84: design and implement backup strategy for MinIO buckets or the entire server
We're considering using MinIO for more and more things, mainly GitLab (artifacts storage in #41403 and gitaly backups in #40518) but possibly others (e.g. metrics storage in tpo/network-health/metrics/collector#40023).

Right now, we don't have any backups of that server, which is probably fine: we only store container images there, which can be regenerated in case of a catastrophe. But if we start storing gitaly backups and GitLab artifacts, the data stored there needs to be permanent.

Research how backups can be performed, develop a policy and implement it.

Next steps:

* [x] research articles anarcat found on the topic (see wallabag)
* [x] discuss the idea in the network
* [x] decide if we want this per bucket or per site
* [x] write up a proposal (in progress, see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling), in particular:
  * [ ] backup/restore recovery
  * [x] impact on other teams
  * [x] timeline
  * [x] estimates
* [x] review this issue
* [ ] implement proposal
  * [x] minio-fsn-02 setup (4TiB), consider splitting in chunks? (tpo/tpa/team#42136) see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#warm-hard-disk-storage
  * [x] implement quotas (tpo/tpa/team#42155) (should resolve #42077), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#quotas
    * [x] add monitoring for bucket quota usage
  * [x] sync up minio-fsn-02 and minio-01 with hot/cold storage (tpo/tpa/team#42156)
    * [x] setup tiered storage
    * [x] join minio-fsn-02 to cluster
    * [x] test assigning a bucket to a specific tier. tie the `network-health` bucket to the "warm" tier
  * [ ] add more storage capacity to the `warm` tier cluster (tpo/tpa/team#42237)
  * [ ] implement storage backups for both clusters (minio-01 and minio-fsn-02), see https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-84-minio-backups-and-scaling#minio-native-backups-with-possible-exceptions
  * [ ] consider setting up a new minio-dal-03 server
  * [ ] document and test backup/restore procedures