ci-runner-01 out of space

added incident label

Previous instances and (short term) solutions: gitlab#95 (closed)

Open issue for long term solutions: #40478 (closed)

thanks for the ticket, @jnewsome ... just to be sure, this is not directly related with #40476 (closed) because this is not the cache volume you intend to use for sims?

i am starting to think we just need to grow the disk on that runner...

i'll take a look at this.

added Doing label

assigned to @anarcat

current disk usage:

root@ci-runner-01:~# docker system df
TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          34        2         20.05GB   19.98GB (99%)
Containers      28        7         2.44GB    1.547kB (0%)
Local Volumes   106       14        50.65GB   48.96GB (96%)
Build Cache     0         0         0B        0B
root@ci-runner-01:~# df -h / /srv
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       9.8G  3.0G  6.4G  32% /
/dev/sdc         99G   72G   22G  77% /srv

the "local volumes" part is all made of caches, i think. the biggest offenders there are:

                                                            
    6.4 GiB [######### ] /runner-qlbl8xrr-project-647-concurrent-2-cache-
    3.8 GiB [#####     ] /runner-qlbl8xrr-project-647-concurrent-0-cache-
    3.8 GiB [#####     ] /runner-qlbl8xrr-project-973-concurrent-1-cache-
    3.8 GiB [#####     ] /runner-qlbl8xrr-project-973-concurrent-0-cache-
    3.6 GiB [#####     ] /runner-qlbl8xrr-project-982-concurrent-0-cache-
    1.3 GiB [#         ] /runner-qlbl8xrr-project-950-concurrent-0-cache-
  933.7 MiB [#         ] /runner-qlbl8xrr-project-645-concurrent-1-cache-
  888.6 MiB [#         ] /runner-qlbl8xrr-project-647-concurrent-3-cache-
  878.1 MiB [#         ] /runner-qlbl8xrr-project-973-concurrent-3-cache-

Those projects are:

arti (647): 12GB+
0x00A5/arti (973): 8GB+
tpo/core/tor-ci-coverity (982): 4GB+
tpo/core/doc (950): 1.3GB+
mikeperry/tor (945): 933MB

... and that's just the top 10. I haven't yet made a thing that digests all of those caches and spits out the per project disk usage (but it's on my roadmap :p).

The TL;DR: is a problem that @lavamind correctly identified before: the @tpo/core people are taking a lot of disk space on runners. :) we need to figure out ways to better deal with the caching here.

could one way of solving this be to enable the gitlab registry (gitlab#89 (closed)), so that instead of relying on the cache, the core tor people could rely on docker images for their base stuff?

it might seem like just shifting the problem elsewhere, but docker images are easier to reuse and cleanup than caches...

otherwise we could just throw hardware at the problem, but we actually don't have that much free disk space on that cluster, unless we start sucking things out of the SAN, but that is kind of painful to configure, so i'm procrastinating a bit on that.

would welcome feedback on the above.

The core/arti project is Rust and built in non-release mode takes roughly 4.5GB just locally here on my side of the world.

But looking at https://gitlab.torproject.org/tpo/core/arti/-/blob/main/.gitlab-ci.yml, it seems that only OSX cache is kept for some reason but nothing else should hit the caches. @nickm, @eta, does this needs to be cached?

The arti artifacts is the target/ dir which is likely huge but it expires after 1 hour so that shouldn't be the problem.

Another thing, is it legit to allow non Developer role to run their own CI? (looking at https://gitlab.torproject.org/0x00A5/arti/) (poking @ahf).

could one way of solving this be to enable the gitlab registry (gitlab#89 (closed)), so that instead of relying on the cache, the core tor people could rely on docker images for their base stuff?

I'm not familiar with how exactly these projects are using cache, but my 2c from the sponsor 67 (shadow sims) project: I have a lot of custom dependencies (shadow, tgen, a patched tor, oniontrace, ...) that need to be built before running the simulation. Initially I was baking those into a custom Docker image, pushing those to dockerhub, and then having the CI pull that image. That turned out to be a big headache though - having to locally build and push a big image any time one of the deps is changed/tweaked is annoying, and easy to get wrong or forget to do. Putting each of those as a job in the CI with a cached result is much easier to manage.

The idea behind this is not only to enable the image registry, but also allow users to build and push their own images from within CI, something which is currently not possible because it requires a privileged Docker instance.

in other words, if your deps are being built automatically as part of a scheduled CI run, would that help?

...

On 2021-11-01 16:56:24, Jim Newsome (@jnewsome) wrote:

Jim Newsome commented:

could one way of solving this be to enable the gitlab registry (gitlab#89 (closed)), so that instead of relying on the cache, the core tor people could rely on docker images for their base stuff?

I'm not familiar with how exactly these projects are using cache, but my 2c from the sponsor 67 (shadow sims) project: I have a lot of custom dependencies (shadow, tgen, a patched tor, oniontrace, ...) that need to be built before running the simulation. Initially I was baking those into the Docker image, pushing those to dockerhub, and then having the CI pull that image. That turned out to be a big headache though - having to locally build and push a big image any time one of the deps is changed/tweaked is annoying, and easy to get wrong or forget to do. Putting each of those as a job in the CI with a cached result is much easier to manage.

-- Antoine Beaupré torproject.org system administration

in other words, if your deps are being built automatically as part of a scheduled CI run, would that help?

Maybe, but it's definitely simpler to have the single CI workflow responsible for the end-to-end process.

Maybe this project is a bit unique though in that I have to pretty frequently bump versions of those deps, switch to custom branches of them, etc.

Maybe, but it's definitely simpler to have the single CI workflow responsible for the end-to-end process.

it could still be the same CI workflow, just split in multiple jobs... no?

it could still be the same CI workflow, just split in multiple jobs... no?

Ah, yeah if I guess if you support building Docker images from within the CI that's true.

Wouldn't we need to support Docker-in-Docker to be able to build Docker images from inside gitlab jobs, though? I briefly looked at that and got the impression that required the "outer" docker to run in privileged mode, but maybe there's a way to do it without it.

It's also a bit less granular than caches. e.g. right now if I change just the Shadow version or just the Tor version, I'll only have a cache miss for that build, but still have a hit for the other. Not necessarily a deal-breaker, but a downside.

I'm also unclear why it's easier to manage Docker images than caches, but you would know better than me :)

Wouldn't we need to support Docker-in-Docker to be able to build Docker images from inside gitlab jobs, though? I briefly looked at that and got the impression that required the "outer" docker to run in privileged mode, but maybe there's a way to do it without it.

we do need some sort of DIND to build images, yes. i managed to build an image from scratch inside our CI with plain docker import (which doesn't require anything magic like namespaces), but that's not the typical way you build images. details in gitlab#90 (closed).

It's also a bit less granular than caches. e.g. right now if I change just the Shadow version or just the Tor version, I'll only have a cache miss for that build, but still have a hit for the other. Not necessarily a deal-breaker, but a downside.

not sure what you mean there, but surely we could replicate that system with a multi-layered Docker image, e.g. the Shadow image would be build FROM tor?

I'm also unclear why it's easier to manage Docker images than caches, but you would know better than me :)

well at least one way images are easier is that they are centralized in the registry, as opposed to caches which are spread around the runners. so i only have to manage one beefy disk instead of multiple ones, which is one of the things that makes me hesitant in growing the runner right now (if i grow that runner, people now expect runners to be big).

the other is that, quite frankly, it's harder for you people to fill up the container registry than the caches, because it's more of a pain in the back to build container images than just write to cache. :p granted, that's kind of a BOFH move, but it's still a reality.

(i was originally worried about opening the registry exactly for the reverse reason, but since now people have bled over the cache and i grew the disk on the gitlab server, i'm worried about the opposite.) :)

not sure what you mean there, but surely we could replicate that system with a multi-layered Docker image, e.g. the Shadow image would be build FROM tor?

I was thinking everything after the first "miss"/change would need to be rebuilt, but I think that can be avoided using a multi-image build, at the cost of more docker complexity....

as opposed to caches which are spread around the runners

Distributed cache? :-D https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching

the other is that, quite frankly, it's harder for you people to fill up the container registry than the caches, because it's more of a pain in the back to build container images than just write to cache.

I mean, yeah, this doesn't seem like a very compelling reason. Assuming the caches are actually useful, this means either more engineering/maintenance effort to switch things over to Docker images and end up using roughly the same amount of storage again, or lose the benefit of caching.

Jim Newsome commented on a discussion: #40493 (comment 2758113)

as opposed to caches which are spread around the runners

Distributed cache? :-D https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching

Yeah, that requires me setting up an S3 cluster. :)

the other is that, quite frankly, it's harder for you people to fill up the container registry than the caches, because it's more of a pain in the back to build container images than just write to cache.

I mean, yeah, this doesn't seem like a very compelling reason. Assuming the caches are actually useful, this means either more engineering/maintenance effort to switch things over to Docker images and end up using roughly the same amount of storage again, or lose the benefit of caching.

The thing with docker images is that they somewhat force you in a better workflow. And the actually provide a useful feature on their own. e.g. i think we should definitely have an official, solid, and constantly update image for at least tor, but also arti, etc...

In other words, having those docker images would benefit more than just the sysadmins complaining about caches, IMHO. :)

...

On 2021-11-01 19:22:23, Jim Newsome (@jnewsome) wrote:

-- Antoine Beaupré torproject.org system administration

... and that's just the top 10. I haven't yet made a thing that digests all of those caches and spits out the per project disk usage (but it's on my roadmap :p).

and that's done:

root@ci-runner-01:~# tpa-du-gl-volumes 
INFO: calling `du -sb /var/lib/docker/volumes/*`
project-973     0x00A5/arti     14.72 GiB
project-647     tpo/core/arti   11.09 GiB
project-645     mikeperry/tor   5.72 GiB
project-982     tpo/core/tor-ci-coverity        3.50 GiB
project-1178    tpo/core/tor-ci-win32   3.17 GiB
project-1183            2.65 GiB
project-638     neel/tor        2.18 GiB
project-77      emmapeel/manual 1.98 GiB
project-21      tpo/web/community       1.79 GiB
project-950     tpo/core/doc    1.24 GiB
project-969     jnewsome/sponsor-61-sims        931.12 MiB
project-1076    HackerNCoder/community  773.74 MiB
project-24      tpo/web/support 549.83 MiB
project-466     tpo/web/blog    441.16 MiB
project-425     tpo/community/l10n      216.69 MiB
project-1177    dgoulet/tor-ci-release  96.69 MiB
project-695     tpo/tpa/wiki-replica    9.79 MiB
project-1186    tpo/core/tor-ci-release 8.28 MiB
project-811     ahf/triage-ops  489.54 KiB
WARNING: cannot find some projects, maybe private? (1183)

again, that kind of puts the network team on the spot, but particularly arti. we should definitely work on that caching.

changed the severity to Medium - S3

changed the severity to Low - S4

mentioned in commit anarcat/tor@cdb40075

mentioned in merge request tpo/core/tor!476 (closed)

project-645 mikeperry/tor 5.72 GiB

i've issued tpo/core/tor!476 (closed) for those projects. but this only scratches the surface of the problem.

mentioned in issue #40498 (closed)

i looked at just doubling the disk using the local SAS drives on that VM, and that can't be done: the node is full.

root@chi-node-01:~# gnt-instance grow-disk ci-runner-01.torproject.org 2 100G
Failure: prerequisites not met for this operation:
error type: insufficient_resources, error details:
Not enough disk space on target node chi-node-03.torproject.org vg vg_ganeti: required 102400 MiB, available 50656 MiB

now i could start juggling VMs around to free up some space, but there's a nice little nugget i could kill before that, shadow-01: #40498 (closed). this is going to take a little while because our retirement process enforces some delays, but it should give us some breathing room soon-ish.

this is the disk usage on ci-runner-01 in the last year:

we resized the disk only once, bringing it to 100GB. seems reasonable to bump that again.

shadow-01 killed, resizing the disk:

root@chi-node-01:~# gnt-instance grow-disk ci-runner-01.torproject.org 2 100G
Mon Nov  1 21:02:17 2021 Growing disk 2 of instance 'ci-runner-01.torproject.org' by 100.0G to 200.0G
Mon Nov  1 21:02:20 2021  - INFO: Waiting for instance ci-runner-01.torproject.org to sync disks
Mon Nov  1 21:02:20 2021  - INFO: - device disk/2:  0.10% done, 2h 20m 55s remaining (estimated)
Mon Nov  1 21:03:20 2021  - INFO: - device disk/2:  2.10% done, 56m 19s remaining (estimated)

i still need to resize the underlying filesystem, and i'm heading out so that might not actually be finished before tomorrow, but it should give us some breathing room.

As several of the lingering docker volumes appear to contain stuff from /builds, we should probably take note of the following recommendation from GitLab:

GitLab Runner does not stop you from storing things inside of the Builds Directory. For example, you can store tools inside of /builds/tools that can be used during CI execution. We HIGHLY discourage this, you should never store anything inside of the Builds Directory. GitLab Runner should have total control over it and does not provide stability in such cases. If you have dependencies that are required for your CI, we recommend installing them in some other place.

Seen at https://docs.gitlab.com/runner/best_practice/#build-directory

how do we act on this though? in other words, is it that jobs explicitly store stuff in there, or do we need to explicitly opt out?

...

On 2021-11-01 21:11:21, Jérôme Charaoui (@lavamind) wrote:

Jérôme Charaoui commented:

As several of the lingering docker volumes appear to contain stuff from /builds, we should probably take note of the following recommendation from GitLab:

GitLab Runner does not stop you from storing things inside of the Builds Directory. For example, you can store tools inside of /builds/tools that can be used during CI execution. We HIGHLY discourage this, you should never store anything inside of the Builds Directory. GitLab Runner should have total control over it and does not provide stability in such cases. If you have dependencies that are required for your CI, we recommend installing them in some other place.

Seen at https://docs.gitlab.com/runner/best_practice/#build-directory

-- Antoine Beaupré torproject.org system administration

ci-runner-01 out of space

Child items ...

Activity