gitlab-runner-x86-03 ran out of disk space because of rust builds
Once again, we're in a situation where our runners are full.
Steps to reproduce
disk usage graphs show that the runner ate about 200GB in less than 10 hours.
What is the current bug behavior?
rust builds not only take a lot of disk space (20-30GiB), but sometimes those files stick around after the build finishes.
What is the expected correct behavior?
we should not have jobs that single-handedly fill up the drive. we have a hefty 300GiB partition on this runner, and about 150GiB of that is arti builds.
When did this start?
today, this filled up in a couple of hours. unclear if it's a regression related to today's changes.
Relevant logs and/or screenshots
here's partial output of tpa-du-gl-volumes on ci-runner-x86-03 before cleanup (a rough equivalent without the TPA tooling is sketched after the listing):
project-869 dgoulet/arti 53.17 GiB
=> /cache 99.88 MiB
=> /builds 53.07 GiB
project-2997 opara/arti 33.57 GiB
=> /cache 10.50 MiB
=> /builds 33.56 GiB
project-681 nickm/arti 19.66 GiB
=> /cache 31.73 MiB
=> /builds 19.63 GiB
project-2112 tpo/anti-censorship/lox 18.21 GiB
=> /cache 893.79 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 5.30 GiB
=> /builds 2.76 GiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 9.28 GiB
project-426 tpo/core/tor 11.26 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 4.49 GiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 6.78 GiB
project-1285 Diziet/arti 9.52 GiB
=> /cache 116.12 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 7.16 MiB
=> /builds 5.31 GiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 4.10 GiB
project-788 jnewsome/arti 9.28 GiB
=> /cache 114.68 MiB
=> /builds 9.17 GiB
project-3023 wesleyac/arti 6.83 GiB
=> /cache 213.12 MiB
=> /builds 6.62 GiB
project-1706 gabi-250/arti 6.29 GiB
=> /cache 312.74 MiB
=> /builds 5.99 GiB
project-3268 hjrgrn/arti 5.50 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 130.66 MiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 5.37 GiB
project-647 tpo/core/arti 5.49 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 440.29 MiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 5.06 GiB
project-1772 cve/arti 3.67 GiB
=> /cache 31.73 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 0 B
=> /builds 2.33 GiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.31 GiB
project-1326 tpo/applications/vpn 2.72 GiB
=> /cache 20.95 KiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 10.28 KiB
=> /builds 1.41 GiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.31 GiB
project-950 tpo/core/doc 2.39 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 0 B
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 2.39 GiB
project-543 tpo/core/torspec 2.29 GiB
=> /cache 26.72 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 530.46 MiB
=> /builds 114.54 MiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.63 GiB
project-1992 tpo/tpa/renovate-cron 1.99 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 1010.01 KiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.99 GiB
project-2011 tpo/web/l10n/tpo 1.27 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 63.82 MiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.21 GiB
project-1923 tpo/web/donate-neo 1.17 GiB
=> /cache 438.89 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 387.77 MiB
=> /builds 199.03 MiB
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 171.37 MiB
project-1178 tpo/core/tor-ci-win32 1.09 GiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 0 B
=> <md5>c33bcaa1fd2c77edfc3893b41966cea8-protected 1.09 GiB
project-3393 dan/vpn 1.06 GiB
=> /cache 10.41 KiB
=> /builds 1.06 GiB
project-43 tpo/anti-censorship/pluggable-transports/snowflake 1015.66 MiB
=> /cache 14.43 MiB
=> <md5>3c3f060a0374fc8bc39395164f415a70-protected 22 B
=> /builds 937.94 MiB
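for reference, tpa-du-gl-volumes is TPA-internal tooling; something roughly equivalent can be had on any docker-executor runner with standard commands (a sketch, assuming the default /var/lib/docker data root):

```sh
# per-volume disk usage as docker accounts it (includes the per-project
# cache/build volumes listed above)
docker system df -v

# or measure the volume directories on disk directly, largest first
sudo du -sh /var/lib/docker/volumes/* 2>/dev/null | sort -rh | head -30
```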
Possible fixes
not sure. previously (#40931 (closed)) this was solved by making the rust folks clean up after themselves (tpo/core/arti!786 (merged), tpo/core/arti!2159 (merged)), but it seems this doesn't work as reliably as it should. perhaps some job failures skip that cleanup and leave those directories behind?
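for context, that cleanup approach boils down to removing the cargo target directory at the end of the job; a minimal sketch, not the actual arti CI config (the job name and paths are illustrative):

```yaml
# illustrative .gitlab-ci.yml job, not the real tpo/core/arti config
rust-build:
  image: rust:latest
  script:
    - cargo build --locked
    - cargo test --locked
  after_script:
    # drop the bulky target/ directory so it doesn't linger in the build
    # volume; after_script runs when the script above fails, but not in
    # every failure mode (runner crashes, some timeouts), which may explain
    # the leftovers
    - rm -rf target
```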
in #40931 (comment 2850251), @jnewsome proposed disabling the cache on the runners, which, counter-intuitively, enables the object-storage cache and stops keeping a copy of the cache in local container volumes like the ones listed above.
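if we go that route, the runner-side change would look roughly like this in /etc/gitlab-runner/config.toml (a sketch; the endpoint, bucket and credentials are placeholders, not our actual object storage):

```toml
[[runners]]
  # ... existing executor settings ...
  [runners.docker]
    # stop creating the local per-project cache volumes listed above
    disable_cache = true
  [runners.cache]
    Type = "s3"
    Shared = true
    [runners.cache.s3]
      ServerAddress = "s3.example.org"      # placeholder endpoint
      BucketName = "gitlab-runner-cache"    # placeholder bucket
      AccessKey = "REDACTED"
      SecretKey = "REDACTED"
```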
another alternative would be to build a base arti image that builds could start from, but (a) we're not sure a 30GiB image is a good idea and (b) it would not solve the issue of leftover data.
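for completeness, that base-image idea would look something like this (a sketch only; the paths and the dependency pre-fetch step are assumptions, and arti being a cargo workspace makes the caching less straightforward than shown):

```dockerfile
# hypothetical base image with arti dependencies pre-fetched; whether a
# ~30GiB image like this is worth distributing is exactly the open question
FROM rust:1-bookworm
WORKDIR /src
# copy only the manifests so the dependency layer stays cacheable
COPY Cargo.toml Cargo.lock ./
RUN cargo fetch --locked
```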