in team#40615 (closed), we had problems with disk space usage on gitlab-02, which were mostly resolved (as an incident), but @lavamind has been working on a script to improve things in the long term.
the task here is to clean up old artifacts and jobs, possibly periodically, but at least to have a tool that allows us to do those cleanups by hand at first.
I pushed the gitlab-pipeline-vacuum script yesterday to help deal with cases where GitLab's own artifact cleanup mechanisms are insufficient.
I commented there that we need better documentation on that as well. Inbound links (i think in service/ci.md?) should also be corrected to mention this script.
I'm proposing to use this script to run several daily jobs for specific projects:
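The command examples quoted here did not survive; as a purely illustrative sketch, the first daily job may have looked something like this (the --min-age switch is attested later in this thread, but the argument syntax and project path are guesses):

```shell
# hypothetical daily cron job vacuuming the localization staging
# pipelines discussed below; project path and threshold are guesses
gitlab-pipeline-vacuum --min-age 1d tpo/translation
```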
Could you clarify here? How would this be implemented and deployed?
The above will manage the removal of localization staging pipelines, which can run at most every 30 minutes when translations are contributed on Transifex. Those pipelines have been observed to produce several gigabytes of artifacts per day and, because of the "Keep artifacts from most recent successful jobs" option, they are not systematically cleaned up on an hourly basis despite what the expire_in: CI parameter suggests.
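For context, the CI pattern being described looks roughly like this (an illustrative snippet, not the actual translation job; expire_in and the artifacts keywords are standard GitLab CI syntax):

```yaml
# illustrative staging job: artifacts nominally expire after an hour,
# but the "Keep artifacts from most recent successful jobs" setting
# pins the latest successful artifact on each ref indefinitely
build-staging:
  script:
    - ./build.sh   # hypothetical build step
  artifacts:
    paths:
      - public/
    expire_in: 1 hour
```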
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
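The second example command was likewise lost; presumably another daily invocation, this time against the Debian packaging repository (again a sketch, with a hypothetical project path):

```shell
# hypothetical daily cron job for the Debian packaging pipelines
gitlab-pipeline-vacuum --min-age 7d tpo/tpa/debian
```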
The above will help ensure that CI artifacts for Debian packaging do not grow out of control. On this repository, it has been observed that job pages may suggest that the job's artifacts have been removed (e.g. "The artifacts were removed 1 day ago") but attempting to download them by querying the job's artifact download URL manually demonstrates that the job's artifacts are still available. In addition, deleting pipelines regularly here will ensure that the indefinite job log storage (around 50MB per pipeline) doesn't become a problem.
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
And finally, we should consider a weekly job which would impose a hard limit on CI retention:
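The weekly example was also dropped; given the --min-age and --keep-tags switches discussed in this thread, it plausibly resembled:

```shell
# hypothetical weekly cron job imposing a hard retention limit on
# pipelines older than 90 days (the cutoff value is an assumption)
gitlab-pipeline-vacuum --min-age 90d --keep-tags
```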
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
If we ran this today it would free up a significant amount of storage space:
```
Would delete a total of 25.74 GiB (artifacts 22.14 GiB / logs 3.59 GiB / others 16.85 MiB) in 790 projects and 3892 pipelines
```
Wow, that's impressive!
In addition to freeing up storage for artifacts and job logs, it would also likely contribute to controlling the GitLab database growth.
I wonder how much that is a problem right now, have you checked?
Thanks for all this hard work!
...
On 2022-02-23 14:34:54, Jérôme Charaoui (@lavamind) wrote:
--
Antoine Beaupré
torproject.org system administration
Could you clarify here? How would this be implemented and deployed?
To be determined, if we decide it's a good idea!
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
Honestly I've been looking at this for days now and it's not entirely clear even to me. The main thing is that expired artifacts are not always removed from disk even if GitLab suggests that they are, while deleting via the API works. I don't know how else I can put it.
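For the record, "deleting via the API" refers to the standard pipeline-deletion endpoint, along these lines (the token variable and both IDs are placeholders):

```shell
# deleting a pipeline over the REST API also removes its jobs,
# artifacts and logs, and reliably frees the disk space
curl --request DELETE --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "https://gitlab.torproject.org/api/v4/projects/<project-id>/pipelines/<pipeline-id>"
```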
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
Sure but even if the issue gets closed upstream, I'd still keep those scripts running. GitLab is such a complex beast that I wouldn't be surprised at all if this breaks again down the line. In addition, job logs being kept forever is not a bug.
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
I don't know... It makes sense to me, but I'm kind of biased! I think the idea of "minimum age" is that we're defining criteria for the pipelines we want to remove, not the ones we want to keep. Criteria for what to keep have keep in the switch name, like --keep-tags. Plus I didn't actually invent this, I picked it up from gitlab-artifact-cleanup, so I know at least one other person agrees with me.
I wonder how much that is a problem right now, have you checked?
I have no idea, we don't seem to have any metrics for PostgreSQL in Prometheus.
Could you clarify here? How would this be implemented and deployed?
To be determined, if we decide it's a good idea!
Could you clarify this? They do expire, it's just that one artifact is kept per branch, no?
Honestly I've been looking at this for days now and it's not entirely clear even to me. The main thing is that expired artifacts are not always removed from disk even if GitLab suggests that they are, while deleting via the API works. I don't know how else I can put it.
Ouch. It might be worth linking to the related upstream gitlab issue when we deploy this, so that we know when we can remove this if upstream does fix that issue...
Sure but even if the issue gets closed upstream, I'd still keep those scripts running. GitLab is such a complex beast that I wouldn't be surprised at all if this breaks again down the line. In addition, job logs being kept forever is not a bug.
Okay, I think i'd need a clearer table of the various use cases and retention periods to understand this better... something like:

| Scenario | Retention with "keep latest" | Retention without | Ok? |
| --- | --- | --- | --- |
| Latest artifact on main | infinite | expire_in | yes |
| Latest artifact on another ref | infinite | expire_in | no? |
| Older artifact | expire_in | expire_in | no, we want a hard 90 days limit? |
| Job logs | infinite | infinite | no, we want to expire logs after X days? |
| Pipeline entries in database | infinite | infinite | no, we want a hard limit too? |

does this make sense?
Now that I look at this that way, --min-age feels backwards. It seems to say that something should be at least that old to be kept... --max-age would make more sense to me: "don't keep artifacts older than X". Or maybe just call this --age?
I don't know... It makes sense to me, but I'm kind of biased! I think the idea of "minimum age" is that we're defining criteria for the pipelines we want to remove, not the ones we want to keep. Criteria for what to keep have keep in the switch name, like --keep-tags. Plus I didn't actually invent this, I picked it up from gitlab-artifact-cleanup, so I know at least one other person agrees with me.
Let's try --age then, please.
I wonder how much that is a problem right now, have you checked?
I have no idea, we don't seem to have any metrics for PostgreSQL in Prometheus.
i think we do! but they do not include disk space usage, unfortunately:
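(The link that presumably followed here was lost.) In the meantime, the database size can at least be checked by hand with standard PostgreSQL catalog functions; gitlabhq_production is the default Omnibus database name and an assumption here:

```shell
# one-off size check; gitlab-psql is the Omnibus wrapper around psql,
# use a plain psql as the postgres user on a non-Omnibus install
sudo gitlab-psql -c "SELECT pg_size_pretty(pg_database_size('gitlabhq_production'));"
```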
i don't see why that would need to keep any artifacts whatsoever, except maybe logs... in fact, looking at the pipeline, it doesn't define artifacts at all... what are you referring to here?
I read some parts of https://gitlab.com/gitlab-org/gitlab/-/issues/353128 today and the TL;DR seems to be that they changed how the artifact expiry worker does its thing, but because it was kinda experimental and could delete more stuff than intended (artifacts older than 2020 or 2021, it's unclear which, even ones marked as kept), they just disabled the new worker and the entire "we delete expired artifacts" thing. So as I understand it, since GitLab 14.6, there's simply no artifact deletion at all.
If you don’t need any artifacts created before 2020-06-23, an Administrator can enable the worker for removing expired CI/CD artifacts:
```ruby
Feature.enable(:ci_destroy_all_expired_service)
```
I think this describes us. I know @jnewsome is keeping some artifacts around with Keep but afaik they're all newer than this date (which would make sense since we started doing shadow stuff on CI in 2021).
So if you agree @anarcat I think we should just enable this feature flag and hope this takes care of the problem for us. GitLab's own plan is to toggle that feature to enabled by default in the next release.
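On the server, that would amount to something like this (a sketch assuming an Omnibus install, where the gitlab-rails wrapper is available):

```shell
# run the upstream one-liner in the Rails environment to enable
# the expired-artifact removal worker
sudo gitlab-rails runner 'Feature.enable(:ci_destroy_all_expired_service)'
```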
nice find, and yeah, sounds like a good plan, since we're going to be hit by this in the future anyways... does that mean we won't need the manual removal stuff you've been working on?
...
On 2022-03-14 20:53:33, Jérôme Charaoui (@lavamind) wrote:
I read some parts of https://gitlab.com/gitlab-org/gitlab/-/issues/353128#note_852169018 today and the TL;DR seems to be that they changed how the artifact expiry worker does its thing, but because it was kinda experimental and could delete more stuff than intended (artifacts older than 2020 or 2021, it's unclear which), they just disabled the new worker and the entire "we delete expired artifacts" thing. So as I understand it, since GitLab 14.6, there's simply no artifact deletion at all.
If you don’t need any artifacts created before 2020-06-23, an Administrator can enable the worker for removing expired CI/CD artifacts:
```ruby
Feature.enable(:ci_destroy_all_expired_service)
```
I think this describes us. I know @jnewsome is keeping some artifacts around with Keep but afaik they're all newer than this date (which would make sense since we started doing shadow stuff on CI in 2021).
So if you agree @anarcat I think we should just enable this feature flag and hope this takes care of the problem for us. GitLab's own plan is to toggle that feature to enabled by default in the next release.
--
Antoine Beaupré
torproject.org system administration
Yes, we did not start marking sim runs as keep until roughly November/December 2021. Also, because results are exported to the sim-results repo, worst case we can rebuild any graphs we need.
So whatever is easier for you is fine with me, here.
So looking at the graph for the last few hours, it does appear that toggling the feature flag is indeed causing GitLab to clean up artifacts as expected. Closing.
wait! :) what about job logs? what about your script? do we have documentation on this in the pager playbook?
...
On 2022-03-16 12:47:31, Jérôme Charaoui (@lavamind) wrote:
So looking at the graph for the last few hours, it does appear that toggling the feature flag is indeed causing GitLab to clean up artifacts as expected. Closing.
--
Antoine Beaupré
torproject.org system administration
I suspect that as long as GitLab cleans up the CI artifacts itself, keeping job logs forever won't be such an issue and we can just ignore it. In that case, running the script in an automated fashion might not be needed, unless we still want to impose a maximum lifetime for all CI artifacts, in which case we don't even need those gitlab-tools scripts: we can just run a Rails command via cron on the GitLab server itself.
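A rough sketch of what that Rails-command-in-cron could look like (untested; Ci::Pipeline is a real GitLab model, but the exact query, the cascade behaviour, and the 90-day cutoff are assumptions based on the table above):

```shell
# hypothetical /etc/crontab entry imposing a hard CI retention limit;
# destroying a pipeline is expected to cascade to its jobs, artifacts
# and logs, as deleting via the API does
0 3 * * 0  root  gitlab-rails runner 'Ci::Pipeline.where("created_at < ?", 90.days.ago).find_each(&:destroy)'
```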