
TPA-RFC-56: large-scale storage problems brainstorm

This ticket is meant to address storage problems I find we frequently stumble upon in the infrastructure, in various places. A few key examples:

"Out of disk" incidents:

  • #40475 (closed), #40615 (closed) - gitlab-02. CI artifacts, and non-linear growth events
  • #40431 (closed) - ci-runner-01 (AKA "ci-runner-01 invalid ubuntu package signatures" and gitlab#95 (closed), "occasionnally cleanup CI storage"). non-linear, possibly explosive and unpredictable growth. cache sharing issues between runners. somewhat under control now that we have more runners.
  • #40477 (closed) - bungei. backups, non-linear, mostly archive-01 but also gitlab. workaround hopefully good for ~8 months (from October 2021, so until June 2022).
  • #40442 (closed) - meronense. metrics storage, linear growth. transitioning between storage systems (see tpo/network-health/metrics/collector#40012 (closed)). workaround good for years.
  • #40535 (closed) - colchicifolium storage is steadily increasing, adding about 30GB per 90 days according to @hiro, with /srv regularly reaching 90% full and capacity being added
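For the linear-growth cases above, the time left before a disk fills up can be roughly estimated from the growth rate. This is only a sketch: the 30GB per 90 days figure is the colchicifolium estimate quoted above, but the 1000GB volume size and the `days_until_full` helper are made up for illustration, and none of this applies to the non-linear cases (CI artifacts, backups).

```python
def days_until_full(capacity_gb, used_gb, growth_gb, period_days):
    """Estimate days until a volume is full, assuming constant linear growth.

    growth_gb is the amount of data added over period_days.
    """
    if growth_gb <= 0 or period_days <= 0:
        raise ValueError("growth and period must be positive")
    free_gb = max(capacity_gb - used_gb, 0)
    # free space divided by the daily growth rate (growth_gb / period_days)
    return free_gb * period_days / growth_gb


# Hypothetical 1000GB /srv at 90% full, growing 30GB per 90 days.
print(days_until_full(capacity_gb=1000, used_gb=900, growth_gb=30, period_days=90))
```

With those assumed numbers, the remaining 100GB would last about 300 days, which gives an idea of how often capacity needs to be added.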

Note that GitLab also needs to be scaled up specifically, which primarily involves splitting it across multiple machines, see #40479.

Design and performance issues:

So those are the issues I'm aware of. This ticket is to solicit ideas on how to fix those. It could be a new storage system to fix all of those, or tweaks specific to individual issues, or just "throw hardware at the problem", anything goes.

As a reminder:

Brainstorming is a group creativity technique by which efforts are made to find a conclusion for a specific problem by gathering a list of ideas spontaneously contributed by its members.

In other words, brainstorming is a situation where a group of people meet to generate new ideas and solutions around a specific domain of interest by removing inhibitions. People are able to think more freely and they suggest as many spontaneous new ideas as possible. All the ideas are noted down without criticism and after the brainstorming session the ideas are evaluated. The term was popularized by Alex Faickney Osborn in the 1953 book Applied Imagination.

Note the emphasis. I'll give this a week or so, and then we can collect those ideas in a wiki page and start evaluating them.

Outcome, see: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-56-large-file-storage

Edited by anarcat