Intermittently, the following failure happens with CI runner jobs:
ERROR: Job failed (system failure): prepare environment: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:570:120s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
My understanding of this failure is that the runner is attempting to use the docker socket but is not able to actually access the socket inside the executor. The solution for this is to map the docker socket into the runner, as in the following (limited) snippet from a gitlab-runner config.toml:
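A minimal sketch of what that mapping looks like, assuming a docker executor; the runner name, URL and image below are illustrative, the relevant part is the volumes line:

    [[runners]]
      name = "example-docker-runner"       # illustrative
      url = "https://gitlab.example.org/"  # illustrative
      executor = "docker"
      [runners.docker]
        image = "debian:bullseye"
        # bind-mount the host docker socket into job containers,
        # so docker CLI invocations inside jobs reach the host daemon
        volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]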
So the first possibility is the volume configuration, where you specify how the docker socket is mapped via the volumes array.
The second possibility is using DinD (docker-in-docker) with a privileged container, which does not require mapping the docker socket, but does require the option privileged = true. I believe that TPA has decided not to allow DinD in the past.
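For reference only (not suggesting we enable it), that option lives in the same [runners.docker] section; a rough sketch:

    [runners.docker]
      # DinD: jobs start their own dockerd (usually via the docker:dind service),
      # so no socket mapping is needed, but the container must be privileged
      privileged = true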
A couple of other possibilities that should also be on the table:
make sure that the gitlab-runner package is up to date; older versions have issues. Use the gitlab 3rd party debian packages for this
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
make sure the OS is running at least bullseye
restart the gitlab-runner and the docker daemon - this shouldn't be necessary, but I've seen this get wedged in irritating ways before (but it stopped once I started using the upstream docker versions and had the configuration settings right)
Checklist:
try podman as a gitlab runner to see if it has the same problems (it doesn't!)
run podman for a while to shake out problems (we found a few, but all should be fixed now)
Here is an example pipeline; restarting the job will sometimes "fix" it.
[...]
My understanding of this failure is that the runner is attempting to use the docker socket but is not able to actually access the socket inside the executor.
That's odd. If that's the case, why would this failure be intermittent? It seems to me that if the Docker socket needs to be accessible, and isn't, this should always fail, and retries shouldn't fix it.
The second possibility is using DinD (docker-in-docker) with a privileged container, which does not require mapping the docker socket, but does require the option privileged = true. I believe that TPA has decided not to allow DinD in the past.
That is correct.
A couple of other possibilities that should also be on the table:
make sure that the gitlab-runner package is up to date; older versions have issues. Use the gitlab 3rd party debian packages for this
We are indeed using the upstream packages here:
    file { '/etc/apt/sources.list.d/runner_gitlab-runner.list':
      content => @("EOF"),
        # using upstream packages because Debian is lagging behind:
        # - bullseye won't ship it
        # - it has at least one security issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985377
        deb [signed-by=/usr/share/keyrings/gitlab-archive-keyring.gpg] https://packages.gitlab.com/runner/gitlab-runner/debian/ ${distro_codename} main
        | EOF
      notify  => Class['apt::update'],
      require => File['/usr/share/keyrings/gitlab-archive-keyring.gpg'],
    }
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hmm... that might require a second look. we have this pin right now:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
make sure the OS is running at least bullseye
We're at bullseye everywhere, with a couple of exceptions; all of GitLab, including the runners, is specifically already upgraded to bullseye.
restart the gitlab-runner and the docker daemon - this shouldn't be necessary, but I've seen this get wedged in irritating ways before (but it stopped once I started using the upstream docker versions and had the configuration settings right)
Interesting!
What i find most bizarre about this bug is that it's intermittent. I should also add that we've had similar issues reported in #40407 (closed) and gitlab#120 (closed), but those were eventually resolved. But I am also tracking this upstream issue:
It's one of those "missed deliverables" bugs that stays open forever because it only affects self-managed instances, which GitLab seems to care slightly less about than gitlab.com itself...
So maybe the next step is to kick those boxes down to bookworm?
That's odd. If that's the case, why would this failure be intermittent? It seems to me that if the Docker socket needs to be accessible, and isn't, this should always fail, and retries shouldn't fix it.
I don't know. I had a similar problem in the past, and I thought the same way as you, but I added the socket mapping and it stopped happening. I can't say for sure it was that, and not some bug in one of the other layers, or another solution I also possibly implemented around the same time. :o
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hmm... that might require a second look. we have this pin right now:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners... however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners...
... maybe because you don't run the packages from debian.org?
however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
The pin is for the package shipped by debian.org (docker.io), not the upstream package (docker-ce).
it doesn't look like any package shipped by debian.org matches the 24 release number... not sure what's up with that:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners...
... maybe because you don't run the packages from debian.org?
yes, that is right.
however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
The pin is for the package shipped by debian.org (docker.io), not the upstream package (docker-ce).
That is why I said in the original issue:
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hey folks, i've set up a podman runner. it doesn't run untagged jobs right now as a safety measure, in case it starts breaking people's CI.
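roughly speaking, this kind of setup points the docker executor at a podman socket instead of dockerd, along these lines (a sketch only, not necessarily the exact deployed config; the socket path differs for rootless podman):

    [[runners]]
      name = "example-podman-runner"   # illustrative
      executor = "docker"
      [runners.docker]
        # talk to podman's docker-compatible API socket instead of the docker daemon
        host = "unix:///run/podman/podman.sock"
        image = "debian:bullseye"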
i'd love it if people could give it a go. see https://docs.gitlab.com/ee/ci/yaml/#tags for how to add tags to your configuration; it basically requires a configuration change.
note that in our ci-test gitlab-ci.yaml file we added a TPA_TAG_VALUE variable to be able to pass arbitrary tags down into the jobs without having to constantly change the .yaml file...
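concretely, a one-off test would look something like this in a project's .gitlab-ci.yml (job names, scripts and the tag value are placeholders for whatever the podman runner actually advertises):

    # route a single job to the tagged runner
    test-job:
      tags:
        - podman            # placeholder; use the new runner's actual tag
      script:
        - ./run-tests.sh    # placeholder script

    # or, with the TPA_TAG_VALUE trick, expand a variable into the tag
    # (recent GitLab versions expand CI/CD variables in tags)
    flexible-job:
      tags:
        - $TPA_TAG_VALUE
      script:
        - ./run-tests.sh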
Is this a thing we could sensibly do to some or all of our jobs in our project, generally? I think the effect would be that every run would try to use this podman runner. So it would have to have enough capacity. Since the failures are intermittent, we'd have to run with this configuration for a while to see if it helps.
(And: is there some way to do a more SELECT-like search of jobs? To tell if things have improved we'd need to first estimate the failure probability based on historical data, which would tell us how long we'd need to run with the alternative config.)
Is this a thing we could sensibly do to some or all of our jobs in our project, generally? I think the effect would be that every run would try to use this podman runner. So it would have to have enough capacity. Since the failures are intermittent, we'd have to run with this configuration for a while to see if it helps.
Effectively, this is more or less what will happen in two weeks, once #41296 (closed) is adopted / deployed. That is: the runner will pick up untagged jobs and will be as likely to pick up jobs as the other runners.
If we want only the podman runner to run jobs, we do things in reverse and tell the docker runner to stop accepting untagged jobs, but I see that as a subsequent step, possibly with a complete retirement of the other runner.
(And: is there some way to do a more SELECT-like search of jobs? To tell if things have improved we'd need to first estimate the failure probability based on historical data, which would tell us how long we'd need to run with the alternative config.)
Not really. In theory, GitLab should provide us with Prometheus metrics about the number of jobs queued and their status, but in practice i've had a hard time coercing the stats out of there... Right now we have this dashboard, but it's only queued vs running:
so i'd say observability on this is rather poor right now. if you're more familiar with gitlab, i'd welcome input on how to diagnose this. i guess that, in theory, Someone could sit down and look at the PostgreSQL database to extract some numbers, but that's a bit beyond how far I'm willing to dig into GitLab right now: i'd much rather have those numbers exported in our normal observability platform (prometheus) than create some bespoke thing here...
but yes, it should be possible to look at the job status history, as a one-off thing, using PostgreSQL. but we should be careful as there are always lots of failing jobs, regardless of the runner's status, because, well, people and machines fail for other reasons. :)
It does have some info about the number of concurrent jobs, durations of jobs, and limits, but it's a bit unclear if gitlab_runner_errors_total would expose actual job failures, or if it's just runner-specific failures.
What you can see with these metrics is:
Jobs started on runners:
View an overview of the total jobs executed on your runner fleet for a selected time interval.
View trends in usage. You should analyze this dashboard weekly at a minimum.
You can correlate this data with other metrics, like job duration, to determine if you need configuration changes or capacity upgrades to continue to service your internal SLO’s for CI/CD job performance.
Job duration:
Analyze the performance and scaling of your runner fleet.
Runner capacity:
View the number of jobs being executed divided by the value of limit or concurrent.
Determine if there is still capacity to execute additional jobs.
(from the page linked above).
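For context on where these runner metrics come from: they're exposed by gitlab-runner's built-in Prometheus endpoint, which is enabled with a global listen_address setting in config.toml (the port below is just the conventional default):

    # config.toml, global section: expose runner metrics for Prometheus to scrape
    listen_address = ":9252"

Prometheus then just needs a scrape job pointed at that port.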
There was an upstream MR that provided job success/failure metrics, but it seems like it got closed due to inactivity. There was another request for this which revealed an external exporter that provides this kind of data, with a wild dashboard that is worth checking out just because it's cool looking.
I should read later emails before commenting; it looks like you figured out the runner monitoring goods and my comment is out of date now (except do look at that external exporter page, because the dashboard is wild)
i am kind of terrified by the cardinality explosion of that exporter; i would advise against plugging it into our infrastructure without serious consideration. it also looks like it must monitor specific projects as opposed to the entire instance.
but yeah, the dashboard looks kind of cool.
i should also update @Diziet on the status of extracting more info from gitlab, as someone answered my comment here:
so it should be possible to make a node exporter textfile metric out of that, e.g. to show the current wait time or something. it's overdue, so i'll Just Do It now.
so i'd say observability on this is rather poor right now. if you're more familiar with gitlab, i'd welcome input on how to diagnose this
If you want to give me some kind of readonly access to the postgresql I'd be happy to guddle around. I don't think it should be too hard to come up with a suitable query for (eg) arti's main branch. There are jobs there that (almost) never fail on main other than due to docker trouble.
If you want to give me some kind of readonly access to the postgresql I'd be happy to guddle around
I'm afraid the security implications of this are too intricate for this to be possible. I meant more in the sense that if you already know of something in GitLab itself that exists and would allow us to do this, I would welcome it. Or, alternatively, if you had some patch to add that to the already existing metrics, that would also be great.
I, also, can dig inside the psql database. :) I really did mean that I don't think it's a good use of our time, as I suspect the database structure of that thing to be kind of horrible (hello rails!).
well well, would you look at that... after digging around the gitlab.com issue queues, I found this epic called Runner Fleet: Queue visibility and observability which aims to "Provide an integrated view in GitLab that provides visibility into jobs in queue for runners". So that's already a promising objective but, if it's like other GitLab issues I've found, nothing might happen for a long time there...
for now it looks a little broken because the exporters haven't been gathering numbers while large jobs were running; we might get cuter things later on.
but for now, i'll call this a success. we should be able to notice job failures. unfortunately, GitLab doesn't distinguish between "i screwed up this patch" and "docker can't start" types of failures, so a lot of this will have to be played by ear.
still, this should allow us to get a better idea if the failure rates are higher on the podman runner.
also note that we don't have per project stats in there, only per runner, and even that is kind of minimal.