Intermittently, the following failure happens with CI runner jobs:
ERROR: Job failed (system failure): prepare environment: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:570:120s). Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
My understanding of this failure is that the runner is attempting to use the docker socket but is not able to actually access the socket inside the executor. The solution for this is to map the docker socket into the runner, as in the following (limited) snippet from a gitlab-runner config.toml:
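A minimal sketch of what that mapping looks like, assuming a docker executor; the runner name, URL and image below are illustrative, the relevant part is the volumes line:

    [[runners]]
      name = "example-docker-runner"       # illustrative
      url = "https://gitlab.example.org/"  # illustrative
      executor = "docker"
      [runners.docker]
        image = "debian:bullseye"
        # bind-mount the host docker socket into job containers,
        # so docker CLI invocations inside jobs reach the host daemon
        volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]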
So the first possibility is the volume configuration, where you specify how the docker socket is mapped via the volumes array.
The second possibility is using DinD (docker-in-docker) with a privileged container, which does not require mapping the docker socket, but does require the option privileged = true. I believe that TPA has decided not to allow DinD in the past.
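For reference only (not suggesting we enable it), that option lives in the same [runners.docker] section; a rough sketch:

    [runners.docker]
      # DinD: jobs start their own dockerd (usually via the docker:dind service),
      # so no socket mapping is needed, but the container must be privileged
      privileged = true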
A couple of other possibilities that should also be on the table:
make sure that the gitlab-runner package is up to date; older versions have issues. Use the gitlab 3rd party debian packages for this
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
make sure the OS is running at least bullseye
restart the gitlab-runner and the docker daemon - this shouldn't be necessary, but I've seen this get wedged in irritating ways before (but it stopped once I started using the upstream docker versions and had the configuration settings right)
Checklist:
try podman as a gitlab runner to see if it has the same problems (it doesn't!)
run podman for a while to shake out problems (we found a few, but all should be fixed now)
Here is an example pipeline; restarting the job will sometimes "fix" it.
[...]
My understanding of this failure is that the runner is attempting to use the docker socket but is not able to actually access the socket inside the executor.
That's odd. If that's the case, why would this failure be intermittent? It seems to me that if the Docker socket needs to be accessible, and isn't, this should always fail, and retries shouldn't fix it.
The second possibility is using DinD (docker-in-docker) with a privileged container, which does not require mapping the docker socket, but does require the option privileged = true. I believe that TPA has decided not to allow DinD in the past.
That is correct.
A couple of other possibilities that should also be on the table:
make sure that the gitlab-runner package is up to date; older versions have issues. Use the gitlab 3rd party debian packages for this
We are indeed using the upstream packages here:
    file { '/etc/apt/sources.list.d/runner_gitlab-runner.list':
      content => @("EOF"),
        # using upstream packages because Debian is lagging behind:
        # - bullseye won't ship it
        # - it has at least one security issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=985377
        deb [signed-by=/usr/share/keyrings/gitlab-archive-keyring.gpg] https://packages.gitlab.com/runner/gitlab-runner/debian/ ${distro_codename} main
        | EOF
      notify  => Class['apt::update'],
      require => File['/usr/share/keyrings/gitlab-archive-keyring.gpg'],
    }
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hmm... that might require a second look. we have this pin right now:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
make sure the OS is running at least bullseye
We're at bullseye everywhere, with a couple of exceptions; all of GitLab, including the runners, is specifically already upgraded to bullseye.
restart the gitlab-runner and the docker daemon - this shouldn't be necessary, but I've seen this get wedged in irritating ways before (but it stopped once I started using the upstream docker versions and had the configuration settings right)
Interesting!
What i find most bizarre about this bug is that it's intermittent. I should also add that we've had similar issues reported in #40407 (closed) and gitlab#120 (closed), but those were eventually resolved. But I am also tracking this upstream issue:
It's one of those "missed deliverables" bugs that stays open forever because it only affects self-managed instances, which GitLab seems to care slightly less about than gitlab.com itself...
So maybe the next step is to kick those boxes down to bookworm?
That's odd. If that's the case, why would this failure be intermittent? It seems to me that if the Docker socket needs to be accessible, and isn't, this should always fail, and retries shouldn't fix it.
I don't know. I had a similar problem in the past, and I thought the same way as you, but I added the socket mapping and it stopped happening. I can't say for sure it was that, and not some bug in one of the other layers, or another solution I also possibly implemented around the same time. :o
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hmm... that might require a second look. we have this pin right now:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners... however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners...
... maybe because you don't run the packages from debian.org?
however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
The pin is for the package shipped by debian.org (docker.io), not the upstream package (docker-ce).
it doesn't look like any package shipped by debian.org matches the 24 release number... not sure what's up with that:
So maybe a quick fix for this would be to upgrade the runners to bookworm?
I don't have this problem on bullseye runners...
... maybe because you don't run the packages from debian.org?
yes, that is right.
however, i don't have docker.io installed; what I have installed from upstream docker is docker-ce, and it pulled in a few other things that may or may not be relevant:
    ii  docker-buildx-plugin       0.11.2-1~debian.11~bullseye    amd64  Docker Buildx cli plugin.
    ii  docker-ce                  5:24.0.5-1~debian.11~bullseye  amd64  Docker: the open-source application container engine
    ii  docker-ce-cli              5:24.0.5-1~debian.11~bullseye  amd64  Docker CLI: the open-source application container engine
    ii  docker-ce-rootless-extras  5:24.0.5-1~debian.11~bullseye  amd64  Rootless support for Docker.
    ii  docker-compose-plugin      2.20.2-1~debian.11~bullseye    amd64  Docker Compose (V2) plugin for the Docker CLI.
So maybe the next step is to kick those boxes down to bookworm?
Well, because bullseye does work, I don't think that is the problem, but it is weird that you have a pin for a different package
The pin is for the package shipped by debian.org (docker.io), not the upstream package (docker-ce).
That is why I said in the original issue:
make sure the docker packages are up to date; use the ones provided by upstream in their 3rd party repo, as the ones in Debian have these issues
hey folks, i've set up a podman runner. it doesn't run untagged jobs right now as a safety measure, in case it starts breaking people's CI.
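roughly speaking, this kind of setup points the docker executor at a podman socket instead of dockerd, along these lines (a sketch only, not necessarily the exact deployed config; the socket path differs for rootless podman):

    [[runners]]
      name = "example-podman-runner"   # illustrative
      executor = "docker"
      [runners.docker]
        # talk to podman's docker-compatible API socket instead of the docker daemon
        host = "unix:///run/podman/podman.sock"
        image = "debian:bullseye"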
i'd love it if people could give it a go. see https://docs.gitlab.com/ee/ci/yaml/#tags for how to add tags to your configuration; it basically requires a configuration change.
note that in our ci-test gitlab-ci.yaml file we added a TPA_TAG_VALUE variable to be able to pass arbitrary tags down into the jobs without having to constantly change the .yaml file...
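concretely, a one-off test would look something like this in a project's .gitlab-ci.yml (job names, scripts and the tag value are placeholders for whatever the podman runner actually advertises):

    # route a single job to the tagged runner
    test-job:
      tags:
        - podman            # placeholder; use the new runner's actual tag
      script:
        - ./run-tests.sh    # placeholder script

    # or, with the TPA_TAG_VALUE trick, expand a variable into the tag
    # (recent GitLab versions expand CI/CD variables in tags)
    flexible-job:
      tags:
        - $TPA_TAG_VALUE
      script:
        - ./run-tests.sh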
Is this a thing we could sensibly do to some or all of our jobs in our project, generally? I think the effect would be that every run would try to use this podman runner. So it would have to have enough capacity. Since the failures are intermittent, we'd have to run with this configuration for a while to see if it helps.
(And: is there some way to do a more SELECT-like search of jobs? To tell if things have improved we'd need to first estimate the failure probability based on historical data, which would tell us how long we'd need to run with the alternative config.)
Is this a thing we could sensibly do to some or all of our jobs in our project, generally? I think the effect would be that every run would try to use this podman runner. So it would have to have enough capacity. Since the failures are intermittent, we'd have to run with this configuration for a while to see if it helps.
Effectively, this is more or less what will happen in two weeks, once #41296 (closed) is adopted / deployed. That is: the runner will pick up untagged jobs and will be as likely to pick up jobs as the other runners.
If we want only the podman runner to run jobs, we do things in reverse and tell the docker runner to stop accepting untagged jobs, but I see that as a subsequent step, possibly with a complete retirement of the other runner.
(And: is there some way to do a more SELECT-like search of jobs? To tell if things have improved we'd need to first estimate the failure probability based on historical data, which would tell us how long we'd need to run with the alternative config.)
Not really. In theory, GitLab should provide us with Prometheus metrics about the number of jobs queued and their status, but in practice i've had a hard time coercing the stats out of there... Right now we have this dashboard, but it's only queued vs running:
so i'd say observability on this is rather poor right now. if you're more familiar with gitlab, i'd welcome input on how to diagnose this. i guess that, in theory, Someone could sit down and look at the PostgreSQL database to extract some numbers, but that's a bit beyond how far I'm willing to dig into GitLab right now: i'd much rather have those numbers exported in our normal observability platform (prometheus) than create some bespoke thing here...
but yes, it should be possible to look at the job status history, as a one-off thing, using PostgreSQL. but we should be careful as there are always lots of failing jobs, regardless of the runner's status, because, well, people and machines fail for other reasons. :)
It does have some info about the number of concurrent jobs, durations of jobs, and limits, but it's a bit unclear if gitlab_runner_errors_total would expose actual job failures, or if it's just runner-specific failures.
What you can see with these metrics is:
Jobs started on runners:
View an overview of the total jobs executed on your runner fleet for a selected time interval.
View trends in usage. You should analyze this dashboard weekly at a minimum.
You can correlate this data with other metrics, like job duration, to determine if you need configuration changes or capacity upgrades to continue to service your internal SLO’s for CI/CD job performance.
Job duration:
Analyze the performance and scaling of your runner fleet.
Runner capacity:
View the number of jobs being executed divided by the value of limit or concurrent.
Determine if there is still capacity to execute additional jobs.
(from the page linked above).
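For context on where these runner metrics come from: they're exposed by gitlab-runner's built-in Prometheus endpoint, which is enabled with a global listen_address setting in config.toml (the port below is just the conventional default):

    # config.toml, global section: expose runner metrics for Prometheus to scrape
    listen_address = ":9252"

Prometheus then just needs a scrape job pointed at that port.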
There was an upstream MR that provided job success/failure metrics, but it seems like it got closed due to inactivity. There was another request for this which revealed an external exporter that provides this kind of data, with a wild dashboard that is worth checking out just because it's cool looking.
I should read later emails before commenting; it looks like you figured out the runner monitoring goods and my comment is out of date now (except do look at that external exporter page, because the dashboard is wild)
i am kind of terrified by the cardinality explosion of that exporter; i would advise against plugging it into our infrastructure without serious consideration. it also looks like it must monitor specific projects as opposed to the entire instance.
but yeah, the dashboard looks kind of cool.
i should also update @Diziet on the status of extracting more info from gitlab, as someone answered my comment here:
so it should be possible to make a node exporter textfile metric out of that, e.g. to show the current wait time or something. it's overdue, so i'll Just Do It now.
so i'd say observability on this is rather poor right now. if you're more familiar with gitlab, i'd welcome input on how to diagnose this
If you want to give me some kind of readonly access to the postgresql I'd be happy to guddle around. I don't think it should be too hard to come up with a suitable query for (eg) arti's main branch. There are jobs there that (almost) never fail on main other than due to docker trouble.
If you want to give me some kind of readonly access to the postgresql I'd be happy to guddle around
I'm afraid the security implications of this are too intricate for this to be possible. I meant more in the sense that if you already know of something in GitLab itself that exists and would allow us to do this, I would welcome it. Or, alternatively, if you had some patch to add that to the already existing metrics, that would also be great.
I, also, can dig inside the psql database. :) I really did mean that I don't think it's a good use of our time, as I suspect the database structure of that thing to be kind of horrible (hello rails!).
well well, would you look at that... after digging around the gitlab.com issue queues, I found this epic called Runner Fleet: Queue visibility and observability which aims to "Provide an integrated view in GitLab that provides visibility into jobs in queue for runners". So that's already a promising objective but, if it's like other GitLab issues I've found, nothing might happen for a long time there...
for now it looks a little broken because the exporters haven't been gathering numbers while large jobs were running; we might get cuter things later on.
but for now, i'll call this a success. we should be able to notice job failures. unfortunately, GitLab doesn't distinguish between "i screwed up this patch" and "docker can't start" types of failures, so a lot of this will have to be played by ear.
still, this should allow us to get a better idea if the failure rates are higher on the podman runner.
also note that we don't have per project stats in there, only per runner, and even that is kind of minimal.