Commit 95d28c99 (verified), authored 1 year ago by anarcat
review numbers (#40478)
parent 51f79854
policy/tpa-rfc-56-large-file-storage.md: 91 additions, 82 deletions
@@ -15,98 +15,107 @@ giving a proposal of a solution that should cover most of them.
Those are the issues that were raised in the past with servers running
out of disk space:

* [#40475 (closed)][], [#40615 (closed)][]: "gitlab-02 running out
  of disk space". CI artifacts, and non-linear growth events
* [#40431 (closed)][]: "`ci-runner-01` invalid ubuntu package
  signatures"; [gitlab#95 (closed)][]: "Occasionally clean-up
  Gitlab CI storage". non-linear, possibly explosive and
  unpredictable growth. cache sharing issues between
  runners. somewhat under control now that we have more runners.
* [#40477 (closed)][] ("backup failure: disk full on
  bungei"). backups, non-linear, mostly archive-01 but also
  gitlab. workaround [good for ~8 months][] (from October 2021, so
  until June 2022) hopefully.
* [#40442 (closed)][] ("meronense running out of disk
  space"). metrics storage, linear growth. transitioning between
  storage systems (see
  [tpo/network-health/metrics/collector#40012 (closed)][]).
  workaround good for years.
* [#40535 (closed)][]: "colchicifolium disk full". storage is
  steadily increasing, adding about 30GB per 90 days according to
  hiro, with `/srv` regularly reaching 90% full and capacity
  being added

TODO: update numbers above
TODO: to add, https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2808917
> archive-01 ([#40779 (closed)][]) and vineale ([#40778 (closed)][])
> just ran out of disk space too. the strategy for the former is to
> just bump up disk space and eventually migrate to gitlab. for the
> former, it's unclear. it seems like we're eating 2TB a year on that
> thing, or more...
>
> also, we were asked where to put large VM images (3x8GB), and we
> answered "git(lab) LFS" with the intention of moving to object
> storage if we run out of space on the main VM, see #40767 (closed)
> for the discussion.

Note that GitLab needs to be scaled up specifically as well, which
primarily involves splitting it into multiple machines, see
[#40479][] for that discussion. It's partly in scope of this
discussion in the sense that a solution chosen here must be somewhat
useful to scale GitLab out.

Design and performance issues:

* Ganeti's DRBD backend - a full reboot of all nodes in the cluster
  takes hours, because all machines need to be migrated between the
  nodes (which is fine) and do not migrate back to their original
  pattern (which is not). this might or might not be fixed by a
  change in the migration algorithm, but it could also be fixed by
  changing storage away from DRBD to something else.
* [tpo/network-health/metrics/collector#40012 (closed)][]: "Come up
  with a plan to make past descriptors etc. easier available and
  queryable (giant database)" (in onionoo/collector storage). lots
  of small files, might require FS snapshots or transition to
  database, see new design in that ticket, or object storage (see
  next item)
* [#40650 (closed)][]: "colchicifolium backups are barely
  functional". backups take _days_ to complete, possible solution is
  to "Move collector storage from file based to object storage"
  ([tpo/network-health/metrics/collector#40023 (closed)][])
* [#40482 (closed)][]: "meronense performance problems (out of
  memory?)". memory usage spikes every night, not directly
  TPA's responsibility, but related to the above

* **GitLab**. [#40475 (closed)][], [#40615 (closed)][], [#41139][]:
  "`gitlab-02` running out of disk space". CI artifacts, and
  non-linear growth events.
* **GitLab CI**. [#40431 (closed)][]: "`ci-runner-01` invalid ubuntu
  package signatures"; [gitlab#95 (closed)][]: "Occasionally
  clean-up Gitlab CI storage". Non-linear, possibly explosive and
  unpredictable growth. Cache sharing issues between
  runners. Somewhat under control now that we have more runners, but
  current aggressive cache purging degrades performance.
* **Backups**. [#40477 (closed)][]: "backup failure: disk full on
  `bungei`". Was non-linear, mostly due to `archive-01` but also
  GitLab. A workaround [good for ~8 months][] (from October 2021, so
  until June 2022) was deployed and usage seems stable since
  September 2022.
* **Metrics**. [#40442 (closed)][]: "`meronense` running out of disk
  space". Linear growth. Current allocation (512GB) seems sufficient
  for a few more years, conversion to a new storage backend planned
  (see below).
* **Collector**. [#40535 (closed)][]: "`colchicifolium` disk
  full". Linear growth, about 200GB used per year, 1TB allocated in
  June 2023, therefore possibly good for 5 years.
* **Archives**. [#40779 (closed)][]: "`archive-01` running out of
  disk space". Added 2TB in May 2022, seems to be using about 500GB
  per year, good for 2-3 more years.
* **Legacy Git**. [#40778 (closed)][]: "`vineale` out of disk space",
  May 2022. Negligible (64GB), scheduled for retirement (see
  [TPA-RFC-36][]).

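
The "good for N years" estimates in the list above follow from dividing free headroom by a linear yearly growth rate. A minimal sketch of that arithmetic, assuming strictly linear growth, 1TB = 1000GB, and that the whole newly allocated space counts as headroom (the function name `runway_years` is hypothetical, not something from the RFC):

```python
def runway_years(headroom_gb: float, growth_gb_per_year: float) -> float:
    """Years until the headroom is used up, assuming strictly linear growth."""
    if growth_gb_per_year <= 0:
        raise ValueError("growth must be positive for a finite runway")
    return headroom_gb / growth_gb_per_year


# Collector: 1TB allocated in June 2023, about 200GB used per year.
collector = runway_years(1000, 200)

# Archives: 2TB added in May 2022, about 500GB used per year.
# This counts from the date the space was added, not from today.
archives = runway_years(2000, 500)

print(f"collector: {collector:.1f} years, archives: {archives:.1f} years")
```

The archives figure is counted from May 2022, so the time elapsed since then has to be subtracted to get the years remaining. None of this applies to the non-linear cases (GitLab, GitLab CI), where a single growth rate is not a meaningful predictor.
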
There are also design and performance issues that are relevant in this
discussion:

* **Ganeti virtual machines storage**. A full reboot of all nodes in
  the cluster takes hours, because all machines need to be migrated
  between the nodes (which is fine) and do not migrate back to their
  original pattern (which is not). Improvements have been made to the
  migration algorithm, but it could also be fixed by changing storage
  away from DRBD to another storage backend like Ceph.
* **Large file storage**. We were asked where to put large VM images
  (3x8GB), and we answered "git(lab) LFS" with the intention of
  moving to object storage if we run out of space on the main VM, see
  [#40767 (closed)][] for the discussion. We were also asked to
  host a container registry in [tpo/tpa/gitlab#89][].
* **Metrics database**.
  [tpo/network-health/metrics/collector#40012 (closed)][]: "Come up
  with a plan to make past descriptors etc. easier available and
  queryable (giant database)" (in onionoo/collector storage). This is
  currently being rebuilt as a [Victoria Metrics][] server
  ([tpo/tpa/team#41130][]).
* **Collector storage**. [#40650 (closed)][]: "colchicifolium backups
  are barely functional". Backups take _days_ to complete; a possible
  solution is to "Move collector storage from file based to object
  storage" ([tpo/network-health/metrics/collector#40023 (closed)][],
  currently on hold).
* **GitLab scalability**. GitLab needs to be scaled up for
  performance reasons as well, which primarily involves splitting it
  into multiple machines, see [#40479][] for that discussion. It's
  partly in scope of this discussion in the sense that a solution
  chosen here should be compatible with GitLab's design.

Much of the above and this RFC come from the brainstorm established in
issue [tpo/tpa/team#40478][].

[#40475 (closed)]: /tpo/tpa/team/-/issues/40475
[#40615 (closed)]: /tpo/tpa/team/-/issues/40615
[#40431 (closed)]: /tpo/tpa/team/-/issues/40431
[#40475 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40475
[#40615 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40615
[#40431 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40431
[gitlab#95 (closed)]: /tpo/tpa/gitlab/-/issues/95
[#40477 (closed)]: /tpo/tpa/team/-/issues/40477
[#40477 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477
[good for ~8 months]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40477#note_2756638 "backup failure: disk full on bungei"
[#40442 (closed)]: /tpo/tpa/team/-/issues/40442
[#40442 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40442
[tpo/network-health/metrics/collector#40012 (closed)]: https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40012
[#40535 (closed)]: /tpo/tpa/team/-/issues/40535
[#40779 (closed)]: /tpo/tpa/team/-/issues/40779
[#40778 (closed)]: /tpo/tpa/team/-/issues/40778
[#40479]: /tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users"
[tpo/network-health/metrics/collector#40023 (closed)]: /tpo/network-health/metrics/collector/-/issues/40023
[#40650 (closed)]: /tpo/tpa/team/-/issues/40650
[#40482 (closed)]: /tpo/tpa/team/-/issues/40482
[#40535 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40535
[#40779 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40779
[#40778 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40778
[#40479]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40479 "scale out GitLab to 2k users"
[tpo/network-health/metrics/collector#40023 (closed)]: https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023
[#40650 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40650
[#40482 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40482
[#41139]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41139
[#40767 (closed)]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40767
[tpo/tpa/gitlab#89]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/89
[tpo/tpa/team#41130]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41130
[Victoria Metrics]: https://victoriametrics.github.io/
[TPA-RFC-36]: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-36-gitolite-gitweb-retirement

## Storage usage analysis

redo the graphs in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208

[According to Grafana][], TPA manages around 111TB of available
storage, with 71TB in use.

TODO: redo the graphs in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40478#note_2760208

[According to Grafana]: https://grafana.torproject.org/d/wUmZB05Zk/tpo-overview?orgId=1&viewPanel=30&from=now-1y&to=now

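
For a sense of scale, the Grafana totals quoted above can be turned into a utilization ratio. This is only a sketch: it assumes "available storage" means total capacity (so 71TB in use out of 111TB), and the helper name `utilization` is made up for illustration.

```python
def utilization(used_tb: float, capacity_tb: float) -> float:
    """Fraction of capacity in use (0.0 to 1.0)."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    return used_tb / capacity_tb


used_tb, capacity_tb = 71, 111  # figures quoted from the Grafana dashboard
ratio = utilization(used_tb, capacity_tb)
print(f"{ratio:.0%} used, {capacity_tb - used_tb}TB free")
```

A fleet-wide ratio hides per-host pressure: the servers listed in the issues above ran out of space individually even though the aggregate has headroom.
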
# Proposal
...
...