TPA-RFC-56: large-scale storage problems brainstorm
This ticket is meant to address storage problems we frequently stumble upon in various places in the infrastructure. A few key examples:
"Out of disk" incidents:
- #40475 (closed), #40615 (closed) - gitlab-02. CI artifacts, and non-linear growth events
- #40431 (closed) - ci-runner-01 (AKA "ci-runner-01 invalid ubuntu package signatures" and gitlab#95 (closed), "occasionnally cleanup CI storage"). non-linear, possibly explosive and unpredictable growth. cache sharing issues between runners. somewhat under control now that we have more runners.
- #40477 (closed) - bungei. backups, non-linear, mostly archive-01 but also gitlab. workaround good for ~8 months (from october 2021, so until june 2022) hopefully.
- #40442 (closed) - meronense. metrics storage, linear growth. transitioning between storage systems (see tpo/network-health/metrics/collector#40012 (closed)). workaround good for years.
- #40535 (closed) - colchicifolium. storage is steadily increasing, adding about 30GB per 90 days according to @hiro, with /srv regularly reaching 90% full and capacity being added
Note that GitLab needs to be scaled up specifically as well, which primarily involves splitting it across multiple machines, see #40479.
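For the linear-growth cases above (meronense, colchicifolium), a rough way to judge how long a workaround will last is to divide the remaining free space by the growth rate. The sketch below is only an illustration: the 30GB per 90 days rate is the colchicifolium figure quoted above, but the 1000GB volume size is a made-up number, and the function name `days_until_full` is hypothetical. It does not apply to the non-linear cases (gitlab-02, ci-runner-01, bungei).

```python
def days_until_full(capacity_gb, used_gb, growth_gb, period_days):
    """Estimate days until a volume fills up, assuming strictly linear growth.

    growth_gb is the amount of data added every period_days.
    """
    if growth_gb <= 0 or period_days <= 0:
        raise ValueError("growth_gb and period_days must be positive")
    # Clamp at zero so an already-full volume reports 0 days, not a negative
    free_gb = max(capacity_gb - used_gb, 0)
    return free_gb * period_days / growth_gb


# Hypothetical 1000GB volume at 90% full, growing 30GB per 90 days
# (only the growth rate comes from the ticket; the size is an assumption)
remaining = days_until_full(capacity_gb=1000, used_gb=900,
                            growth_gb=30, period_days=90)
print(f"about {remaining:.0f} days of headroom left")
```

With those assumed numbers, each capacity bump of 100GB buys a bit less than a year, which is one way to compare "throw hardware at the problem" against a redesign.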
Design and performance issues:
- DRBD - a full reboot of all nodes in the cluster takes hours, because all machines need to be migrated between the nodes (which is fine) and do not migrate back to their original pattern (which is not). this might or might not be fixed by a change in the migration algorithm, but it could also be fixed by changing storage away from DRBD to something else.
- tpo/network-health/metrics/collector#40012 (closed) - onionoo/collector storage. lots of small files, might require FS snapshots or transition to database, see new design in that ticket, or object storage (tpo/network-health/metrics/collector#40023 (closed))
- #40650 (closed) - colchicifolium backups take days to complete, possible solution in object storage (tpo/network-health/metrics/collector#40023 (closed))
- #40482 (closed) - meronense spikes memory usage every night, not directly TPA's responsibility, but related to the above
So those are the issues I'm aware of. This ticket is to solicit ideas on how to fix those. It could be a new storage system to fix all of those, or tweaks specific to individual issues, or just "throw hardware at the problem", anything goes.
As a reminder:
Brainstorming is a group creativity technique by which efforts are made to find a conclusion for a specific problem by gathering a list of ideas spontaneously contributed by its members.
In other words, brainstorming is a situation where a group of people meet to generate new ideas and solutions around a specific domain of interest by removing inhibitions. People are able to think more freely and they suggest as many spontaneous new ideas as possible. All the ideas are noted down without criticism and after the brainstorming session the ideas are evaluated. The term was popularized by Alex Faickney Osborn in the 1953 book Applied Imagination.
Note the emphasis. I'll give this a week or so, and then we can collect those ideas in a wiki page and start evaluating them.
Outcome, see: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-56-large-file-storage