in team#40615 (closed), we had problems with disk space usage on gitlab-02, which were mostly resolved (as an incident), but @lavamind has been working on a script to improve things in the long term.
the task here is to clean up old artifacts and jobs, possibly periodically, but at least to have a tool that allows us to do those cleanups by hand at first.
I pushed the gitlab-pipeline-vacuum script yesterday to help deal with cases where GitLab's own artifact cleanup mechanisms are insufficient.
I commented there that we need better documentation on that as well. Inbound links (i think in service/ci.md?) should also be corrected to mention this script.
I'm proposing to use this script to run several daily jobs for specific projects:
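The command examples quoted here did not survive; as a purely illustrative sketch, the first daily job may have looked something like this (the --min-age switch is attested later in this thread, but the argument syntax and project path are guesses):

```shell
# hypothetical daily cron job vacuuming the localization staging
# pipelines discussed below; project path and threshold are guesses
gitlab-pipeline-vacuum --min-age 1d tpo/translation
```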
Could you clarify here? How would this be implemented and deployed?
The above will manage the removal of localization staging pipelines, which can run at most every 30 minutes when translations are contributed on Transifex. Those pipelines have been observed to produce several gigabytes of artifacts per day and, because of the "Keep artifacts from most recent successful jobs" option, they are not systematically cleaned up on an hourly basis despite what the expire_in: CI parameter suggests.
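For context, the CI pattern being described looks roughly like this (an illustrative snippet, not the actual translation job; expire_in and the artifacts keywords are standard GitLab CI syntax):

```yaml
# illustrative staging job: artifacts nominally expire after an hour,
# but the "Keep artifacts from most recent successful jobs" setting
# pins the latest successful artifact on each ref indefinitely
build-staging:
  script:
    - ./build.sh   # hypothetical build step
  artifacts:
    paths:
      - public/
    expire_in: 1 hour
```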
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
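The second example command was likewise lost; presumably another daily invocation, this time against the Debian packaging repository (again a sketch, with a hypothetical project path):

```shell
# hypothetical daily cron job for the Debian packaging pipelines
gitlab-pipeline-vacuum --min-age 7d tpo/tpa/debian
```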
The above will help ensure that CI artifacts for Debian packaging do not grow out of control. On this repository, it has been observed that job pages may suggest that the job's artifacts have been removed (e.g. "The artifacts were removed 1 day ago") but attempting to download them by querying the job's artifact download URL manually demonstrates that the job's artifacts are still available. In addition, deleting pipelines regularly here will ensure that the indefinite job log storage (around 50MB per pipeline) doesn't become a problem.
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
And finally, we should consider a weekly job which would impose a hard limit on CI retention:
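The weekly example was also dropped; given the --min-age and --keep-tags switches discussed in this thread, it plausibly resembled:

```shell
# hypothetical weekly cron job imposing a hard retention limit on
# pipelines older than 90 days (the cutoff value is an assumption)
gitlab-pipeline-vacuum --min-age 90d --keep-tags
```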
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
If we ran this today it would free up a significant amount of storage space:
```
Would delete a total of 25.74 GiB (artifacts 22.14 GiB / logs 3.59 GiB / others 16.85 MiB) in 790 projects and 3892 pipelines
```
Wow, that's impressive!
In addition to freeing up storage for artifacts and job logs, it would also likely contribute to controlling the GitLab database growth.
I wonder how much that is a problem right now, have you checked?
Thanks for all this hard work!
...
On 2022-02-23 14:34:54, Jérôme Charaoui (@lavamind) wrote:
--
Antoine Beaupré
torproject.org system administration
Could you clarify here? How would this be implemented and deployed?
To be determined, if we decide it's a good idea!
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
Honestly I've been looking at this for days now and it's not entirely clear even to me. The main thing is that expired artifacts are not always removed from disk even if GitLab suggests that they are, while deleting via the API works. I don't know how else I can put it.
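For the record, "deleting via the API" refers to the standard pipeline-deletion endpoint, along these lines (the token variable and both IDs are placeholders):

```shell
# deleting a pipeline over the REST API also removes its jobs,
# artifacts and logs, and reliably frees the disk space
curl --request DELETE --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.torproject.org/api/v4/projects/<project-id>/pipelines/<pipeline-id>"
```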
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
Sure but even if the issue gets closed upstream, I'd still keep those scripts running. GitLab is such a complex beast that I wouldn't be surprised at all if this breaks again down the line. In addition, job logs being kept forever is not a bug.
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
I don't know... It makes sense to me, but I'm kind of biased! I think the idea of "minimum age" is that we're defining criteria for the pipelines we want to remove, not the ones we want to keep. Criteria for what to keep have keep in the switch name, like --keep-tags. Plus I didn't actually invent this, I picked it up from gitlab-artifact-cleanup, so I know at least one other person agrees with me.
I wonder how much that is a problem right now, have you checked?
I have no idea, we don't seem to have any metrics for PostgreSQL in Prometheus.
Could you clarify here? How would this be implemented and deployed?
To be determined, if we decide it's a good idea!
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
Honestly I've been looking at this for days now and it's not entirely clear even to me. The main thing is that expired artifacts are not always removed from disk even if GitLab suggests that they are, while deleting via the API works. I don't know how else I can put it.
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
Sure but even if the issue gets closed upstream, I'd still keep those scripts running. GitLab is such a complex beast that I wouldn't be surprised at all if this breaks again down the line. In addition, job logs being kept forever is not a bug.
Okay, I think i'd need a clearer table of the various use cases and retention periods to understand this better... something like:

| Scenario | Retention with "keep latest" | Retention without | Ok? |
| --- | --- | --- | --- |
| Latest artifact on main | infinite | expire_in | yes |
| Latest artifact on another ref | infinite | expire_in | no? |
| Older artifact | expire_in | expire_in | no, we want a hard 90 days limit? |
| Job logs | infinite | infinite | no, we want to expire logs after X days? |
| Pipeline entries in database | infinite | infinite | no, we want a hard limit too? |

does this make sense?
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
I don't know... It makes sense to me, but I'm kind of biased! I think the idea of "minimum age" is that we're defining criteria for the pipelines we want to remove, not the ones we want to keep. Criteria for what to keep have keep in the switch name, like --keep-tags. Plus I didn't actually invent this, I picked it up from gitlab-artifact-cleanup, so I know at least one other person agrees with me.
Let's try --age then, please.
I wonder how much that is a problem right now, have you checked?
I have no idea, we don't seem to have any metrics for PostgreSQL in Prometheus.
i think we do! but they do not include disk space usage, unfortunately:
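(The link that presumably followed here was lost.) In the meantime, the database size can at least be checked by hand with standard PostgreSQL catalog functions; gitlabhq_production is the default Omnibus database name and an assumption here:

```shell
# one-off size check; gitlab-psql is the Omnibus wrapper around psql,
# use a plain psql as the postgres user on a non-Omnibus install
sudo gitlab-psql -c "SELECT pg_size_pretty(pg_database_size('gitlabhq_production'));"
```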
i don't see why that would need to keep any artifacts whatsoever, except maybe logs... in fact, looking at the pipeline, it doesn't define artifacts at all... what are you referring to here?
I read some parts of https://gitlab.com/gitlab-org/gitlab/-/issues/353128 today and the TL;DR seems to be that they changed how the artifact expiry worker does its thing, but because it was kinda experimental and could delete more stuff than intended (artifacts older than 2020 or 2021, it's unclear which, even ones marked as kept), they just disabled the new worker and the entire "we delete expired artifacts" thing. So as I understand it, since GitLab 14.6, there's simply no artifact deletion at all.
If you don’t need any artifacts created before 2020-06-23, an Administrator can enable the worker for removing expired CI/CD artifacts:
```ruby
Feature.enable(:ci_destroy_all_expired_service)
```
I think this describes us. I know @jnewsome is keeping some artifacts around with Keep but afaik they're all newer than this date (which would make sense since we started doing shadow stuff on CI in 2021).
So if you agree @anarcat I think we should just enable this feature flag and hope this takes care of the problem for us. GitLab's own plan is to toggle that feature to enabled by default in the next release.
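On the server, that would amount to something like this (a sketch assuming an Omnibus install, where the gitlab-rails wrapper is available):

```shell
# run the upstream one-liner in the Rails environment to enable
# the expired-artifact removal worker
sudo gitlab-rails runner 'Feature.enable(:ci_destroy_all_expired_service)'
```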
nice find, and yeah, sounds like a good plan, since we're going to be hit by this in the future anyways... does that mean we won't need the manual removal stuff you've been working on?
...
On 2022-03-14 20:53:33, Jérôme Charaoui (@lavamind) wrote:
I read some parts of https://gitlab.com/gitlab-org/gitlab/-/issues/353128#note_852169018 today and the TL;DR seems to be that they changed how the artifact expiry worker does its thing, but because it was kinda experimental and could delete more stuff than intended (artifacts older than 2020 or 2021, it's unclear which), they just disabled the new worker and the entire "we delete expired artifacts" thing. So as I understand it, since GitLab 14.6, there's simply no artifact deletion at all.
If you don’t need any artifacts created before 2020-06-23, an Administrator can enable the worker for removing expired CI/CD artifacts:
```ruby
Feature.enable(:ci_destroy_all_expired_service)
```
I think this describes us. I know @jnewsome is keeping some artifacts around with Keep but afaik they're all newer than this date (which would make sense since we started doing shadow stuff on CI in 2021).
So if you agree @anarcat I think we should just enable this feature flag and hope this takes care of the problem for us. GitLab's own plan is to toggle that feature to enabled by default in the next release.
--
Antoine Beaupré
torproject.org system administration
Yes, we did not start marking sim runs as keep until roughly November/December 2021. Also, because results are exported to the sim-results repo, worst case we can rebuild any graphs we need.
So whatever is easier for you is fine with me, here.
So looking at the graph for the last few hours, it does appear that toggling the feature flag is indeed causing GitLab to clean up artifacts as expected. Closing.
wait! :) what about job logs? what about your script? do we have documentation on this in the pager playbook?
...
On 2022-03-16 12:47:31, Jérôme Charaoui (@lavamind) wrote:
So looking at the graph for the last few hours, it does appear that toggling the feature flag is indeed causing GitLab to clean up artifacts as expected. Closing.
--
Antoine Beaupré
torproject.org system administration
I suspect that as long as GitLab cleans up the CI artifacts itself, keeping job logs forever won't be such an issue and we can just ignore it. In that case, running the script in an automated fashion might not be needed, unless we still want to impose a maximum lifetime for all CI artifacts, in which case we don't even need those gitlab-tools scripts: we can just run a Rails command via cron on the GitLab server itself.
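A rough sketch of what that Rails-command-in-cron could look like (untested; Ci::Pipeline is a real GitLab model, but the exact query, the cascade behaviour, and the 90-day cutoff are assumptions based on the table above):

```shell
# hypothetical /etc/crontab entry imposing a hard CI retention limit;
# destroying a pipeline is expected to cascade to its jobs, artifacts
# and logs, as deleting via the API does
0 3 * * 0  root  gitlab-rails runner 'Ci::Pipeline.where("created_at < ?", 90.days.ago).find_each(&:destroy)'
```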