Intermittent docker failure inside of runners
Periodically, the following failure happens with CI runner jobs:
ERROR: Job failed (system failure): prepare environment: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:570:120s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
Here is an example pipeline; restarting the job will sometimes "fix" it.
My understanding of this failure is that the runner is attempting to use the docker socket but cannot actually access it inside the executor. One solution is to map the docker socket into the runner, as in the following (limited) snippet from a gitlab-runner config.toml:
[runners.docker]
...
volumes = ["/certs/client", "/cache", "/run/docker.sock:/run/docker.sock"]
...
So the first possibility is the volume configuration, where you add the docker socket mapping to the volumes array.
The second possibility is using DinD (docker-in-docker) with a privileged container, which does not require you to map the docker socket, but does require the option privileged = true. I believe that TPA decided in the past not to allow DinD.
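For comparison with the socket-mapping snippet above, here is a minimal sketch of what the DinD variant might look like in config.toml. The runner name and image are placeholders, not values taken from our runners, and this is not a proposal to enable it given the past decision against DinD.

```toml
# Hypothetical DinD runner entry; name and image are assumed for illustration.
[[runners]]
  name = "example-dind-runner"
  executor = "docker"
  [runners.docker]
    image = "docker:latest"
    # Required for docker-in-docker; job containers run privileged.
    privileged = true
    # Note: no docker socket mapping in the volumes array.
    volumes = ["/certs/client", "/cache"]
```

With a setup like this, jobs would typically talk to a docker:dind service declared in their .gitlab-ci.yml rather than to the host's docker daemon.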
A couple of other possibilities that should also be on the table:
- make sure that the gitlab-runner package is up to date, as older versions have issues; use the third-party Debian packages provided by GitLab for this
- make sure the docker packages are up to date; use the ones provided by upstream in their third-party repo, as the ones in Debian have these issues
- make sure the OS is running at least bullseye
- restart the gitlab-runner and the docker-runner. This shouldn't be necessary, but I've seen this get wedged in irritating ways before (it stopped once I started using the upstream docker versions and had the configuration settings right)
Checklist:
- try podman as a gitlab runner to see if it has the same problems (it doesn't!)
- run podman for a while to shake out problems (we found a few, but all should be fixed now)
- set up a large runner on ci-runner-x86-02
- pause then retire ci-runner-x86-01
- set up podman runners on chi-node-14
- purge docker from chi-node-14