GitLab runner fails to resolve gitlab.torproject.org
Right after GeKo told me about this on IRC, I noticed it had happened in https://gitlab.torproject.org/tpo/core/tor/-/jobs/10528 too.
As mentioned on IRC, this might be the issue we are seeing here: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/6644

Rebooting seems to have fixed the issue.
Looking at this comment in the issue ahf found, this might be a bridge mapping issue. The magic command:

```
docker inspect --format='{{.NetworkSettings.Networks}}' $CONTAINER_ID
```

... doesn't give us anything interesting:

```
map[bridge:0xc0005ec000]
```
A fuller output looks like:
```
root@ci-runner-01:~# docker inspect runner-9avwsm6s-project-321-concurrent-3-5dd950797d8d7760-predefined-2
[
    {
        "Id": "e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6",
        "Created": "2021-02-04T15:17:22.659419093Z",
        "Path": "/usr/bin/dumb-init",
        "Args": [ "/entrypoint", "gitlab-runner-build" ],
        "State": { "Status": "exited", "Running": false, "Paused": false, "Restarting": false,
            "OOMKilled": false, "Dead": false, "Pid": 0, "ExitCode": 0, "Error": "",
            "StartedAt": "2021-02-04T15:17:23.583299634Z", "FinishedAt": "2021-02-04T15:17:23.767182637Z" },
        "Image": "sha256:c398d3217fca9b237cc9946289c2790831d109248aea38723aeb5ee6da0f13a5",
        "ResolvConfPath": "/var/lib/docker/containers/e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6/hostname",
        "HostsPath": "/var/lib/docker/containers/e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6/hosts",
        "LogPath": "/var/lib/docker/containers/e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6/e8c1d0406421c8623f71e310d30b096fdfe71f5bb09a7157891a857ef6e47ab6-json.log",
        "Name": "/runner-9avwsm6s-project-321-concurrent-3-5dd950797d8d7760-predefined-2",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [ "runner-9avwsm6s-project-321-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70:/cache",
                "runner-9avwsm6s-project-321-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8:/builds" ],
            "ContainerIDFile": "", "LogConfig": { "Type": "json-file", "Config": {} },
            "NetworkMode": "default", "PortBindings": null,
            "RestartPolicy": { "Name": "no", "MaximumRetryCount": 0 },
            "AutoRemove": false, "VolumeDriver": "", "VolumesFrom": null,
            "CapAdd": null, "CapDrop": null, "CgroupnsMode": "host",
            "Dns": null, "DnsOptions": null, "DnsSearch": null, "ExtraHosts": null,
            "GroupAdd": null, "IpcMode": "shareable", "Cgroup": "", "Links": null,
            "OomScoreAdj": 0, "PidMode": "", "Privileged": false, "PublishAllPorts": false,
            "ReadonlyRootfs": false, "SecurityOpt": null, "UTSMode": "", "UsernsMode": "",
            "ShmSize": 67108864, "Runtime": "runc", "ConsoleSize": [ 0, 0 ], "Isolation": "",
            "CpuShares": 0, "Memory": 0, "NanoCpus": 0, "CgroupParent": "", "BlkioWeight": 0,
            "BlkioWeightDevice": null, "BlkioDeviceReadBps": null, "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null, "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0, "CpuQuota": 0, "CpuRealtimePeriod": 0, "CpuRealtimeRuntime": 0,
            "CpusetCpus": "", "CpusetMems": "", "Devices": null, "DeviceCgroupRules": null,
            "DeviceRequests": null, "KernelMemory": 0, "KernelMemoryTCP": 0,
            "MemoryReservation": 0, "MemorySwap": 0, "MemorySwappiness": null,
            "OomKillDisable": false, "PidsLimit": null, "Ulimits": null,
            "CpuCount": 0, "CpuPercent": 0, "IOMaximumIOps": 0, "IOMaximumBandwidth": 0,
            "MaskedPaths": [ "/proc/asound", "/proc/acpi", "/proc/kcore", "/proc/keys",
                "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats",
                "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ],
            "ReadonlyPaths": [ "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/93054eff44c85c770103885eed501dc04248359bbfa842dfc84eebd6f4094ce1-init/diff:/var/lib/docker/overlay2/ef1b70c57e4f97533927651274889a59f6043f1a104800bc5045118411a41db8/diff",
                "MergedDir": "/var/lib/docker/overlay2/93054eff44c85c770103885eed501dc04248359bbfa842dfc84eebd6f4094ce1/merged",
                "UpperDir": "/var/lib/docker/overlay2/93054eff44c85c770103885eed501dc04248359bbfa842dfc84eebd6f4094ce1/diff",
                "WorkDir": "/var/lib/docker/overlay2/93054eff44c85c770103885eed501dc04248359bbfa842dfc84eebd6f4094ce1/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            { "Type": "volume", "Name": "runner-9avwsm6s-project-321-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70",
              "Source": "/var/lib/docker/volumes/runner-9avwsm6s-project-321-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70/_data",
              "Destination": "/cache", "Driver": "local", "Mode": "z", "RW": true, "Propagation": "" },
            { "Type": "volume", "Name": "runner-9avwsm6s-project-321-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8",
              "Source": "/var/lib/docker/volumes/runner-9avwsm6s-project-321-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8/_data",
              "Destination": "/builds", "Driver": "local", "Mode": "z", "RW": true, "Propagation": "" }
        ],
        "Config": {
            "Hostname": "runner-9avwsm6s-project-321-concurrent-3",
            "Domainname": "", "User": "",
            "AttachStdin": true, "AttachStdout": true, "AttachStderr": true,
            "Tty": false, "OpenStdin": true, "StdinOnce": true,
            "Env": [ ENV LIST WITH SECRETS REDACTED ],
            "Cmd": [ "gitlab-runner-build" ],
            "Image": "sha256:c398d3217fca9b237cc9946289c2790831d109248aea38723aeb5ee6da0f13a5",
            "Volumes": null, "WorkingDir": "",
            "Entrypoint": [ "/usr/bin/dumb-init", "/entrypoint" ],
            "OnBuild": null,
            "Labels": {
                "com.gitlab.gitlab-runner.job.before_sha": "7430b4ef9f4b0371502560126b5342dc4f117371",
                "com.gitlab.gitlab-runner.job.id": "10797",
                "com.gitlab.gitlab-runner.job.ref": "maint-1.1",
                "com.gitlab.gitlab-runner.job.sha": "beaf6de889bc75d53a6b0b90d12ab85aa0db56a0",
                "com.gitlab.gitlab-runner.pipeline.id": "2507",
                "com.gitlab.gitlab-runner.project.id": "321",
                "com.gitlab.gitlab-runner.runner.id": "9avWSM6S",
                "com.gitlab.gitlab-runner.runner.local_id": "0",
                "com.gitlab.gitlab-runner.type": "predefined"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "[LONG HEX HASH REDACTED]",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "", "LinkLocalIPv6PrefixLen": 0,
            "Ports": {},
            "SandboxKey": "/var/run/docker/netns/[SHORT HEX HASH REDACTED]",
            "SecondaryIPAddresses": null, "SecondaryIPv6Addresses": null,
            "EndpointID": "", "Gateway": "",
            "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0,
            "IPAddress": "", "IPPrefixLen": 0, "IPv6Gateway": "", "MacAddress": "",
            "Networks": {
                "bridge": {
                    "IPAMConfig": null, "Links": null, "Aliases": null,
                    "NetworkID": "[ANOTHER LONG HEX HASH REDACTED]",
                    "EndpointID": "", "Gateway": "", "IPAddress": "", "IPPrefixLen": 0,
                    "IPv6Gateway": "", "GlobalIPv6Address": "", "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "", "DriverOpts": null
                }
            }
        }
    }
]
```
So I don't think there's an easy fix for us there, unfortunately. But we'll see: next time this happens, maybe this output will be different enough to figure out what is wrong.

"How hard can networking be", right? :)

In the meantime, I'll close this. Please do reopen this ticket (or file a new one) when/if it happens again. Sorry for the inconvenience, and thanks for flying TPA! :)
- anarcat closed
- anarcat reopened
There was no upgrade or reboot recently; the latest docker upgrade was:
```
root@ci-runner-01:~# grep docker /var/log/dpkg.log*
/var/log/dpkg.log:2021-02-03 21:55:24 upgrade docker.io:amd64 18.09.1+dfsg1-7.1+deb10u2 20.10.2+dfsg1-2
root@ci-runner-01:~# uptime
 15:36:47 up 5 days, 26 min,  1 user,  load average: 0.06, 0.25, 0.25
```
Those dates match when this problem manifested itself the last time.
I can confirm that networking is completely down inside the container. I verified this with a Python image doing a simple `socket` connect:

```
>>> import socket
>>> sock = socket.socket()
>>> sock.connect(('206.248.172.91', 80))
[hangs]
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyboardInterrupt
```
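For what it's worth, roughly the same probe can be run non-interactively with a timeout, so it fails fast instead of hanging. A sketch using bash's `/dev/tcp` feature, with the same IP as above:

```
# Try a TCP connect to port 80 from inside a fresh container; give up
# after 5 seconds instead of hanging like the Python test above.
docker run --rm debian:stable \
    timeout 5 bash -c 'exec 3<>/dev/tcp/206.248.172.91/80' \
    && echo "network OK" || echo "network broken"
```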
I restarted docker, which fixed the problem, but it's kind of annoying that this keeps coming up like this.

Docker had run out of disk space today (#95 (closed)); maybe that was the cause?
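Next time, something like this could quickly rule disk space in or out (assuming the standard /var/lib/docker layout):

```
# Check free space where docker stores images/containers, and docker's
# own accounting of image, container and volume usage.
df -h /var/lib/docker
docker system df
```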
Closing until this happens again...
- anarcat closed
- anarcat mentioned in commit wiki-replica@78906ab9
This happened again. Restarting docker fixed it, but I have no idea what triggered it this time: we didn't run out of disk space, and there have been no upgrades in over a week:
```
root@ci-runner-01:~# grep upgrade /var/log/dpkg.log | tail -10
2021-02-19 06:40:08 upgrade libdns1104:amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 1:9.11.5.P4+dfsg-5.1+deb10u3
2021-02-19 06:40:08 upgrade libisc1100:amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 1:9.11.5.P4+dfsg-5.1+deb10u3
2021-02-19 06:40:08 upgrade liblwres161:amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 1:9.11.5.P4+dfsg-5.1+deb10u3
2021-02-19 06:40:08 upgrade libisc-export1100:amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 1:9.11.5.P4+dfsg-5.1+deb10u3
2021-02-19 06:40:09 upgrade libdns-export1104:amd64 1:9.11.5.P4+dfsg-5.1+deb10u2 1:9.11.5.P4+dfsg-5.1+deb10u3
2021-02-21 06:45:04 upgrade libzstd1:amd64 1.3.8+dfsg-3+deb10u1 1.3.8+dfsg-3+deb10u2
2021-02-21 06:45:04 upgrade ldap-utils:amd64 2.4.47+dfsg-3+deb10u5 2.4.47+dfsg-3+deb10u6
2021-02-21 06:45:04 upgrade libldap-common:all 2.4.47+dfsg-3+deb10u5 2.4.47+dfsg-3+deb10u6
2021-02-21 06:45:04 upgrade libldap-2.4-2:amd64 2.4.47+dfsg-3+deb10u5 2.4.47+dfsg-3+deb10u6
2021-02-22 06:40:17 upgrade screen:amd64 4.6.2-3 4.6.2-3+deb10u1
```
@ahf reported this today, and I doubt he would have tolerated the problem for a full week, so this is not an upgrade problem.
- anarcat reopened
This happened again at some point since the setup was poked at last night (Danish time, CET).

Based on my inbox, I see the following emails:
```
N GitLab Failed pipeline for main | Triage Ops | 775dc2b9 2021/03/02 01:03
N GitLab Fixed pipeline for main  | Triage Ops | 775dc2b  2021/03/02 02:03
N GitLab Failed pipeline for main | Triage Ops | 775dc2b9 2021/03/02 06:03
```
It seems like it happened somewhere between 02:03 (UTC), when anarcat fixed the runner, and 06:03 (UTC), when the hourly triage ops project failed again.
And you were saying on IRC that you probably had a working pipeline go through at 05:03 as well, because that pipeline runs hourly, right?

So this interesting thing happened during that period: the firewall rules were reloaded by Puppet...
```
Mar 2 05:32:38 ci-runner-01/ci-runner-01 puppet-agent[29768]: (/Stage[main]/Nagios::Client/Ferm::Rule[roles-nagiosmaster-ssh-hetzner-hel1-01.torproject.org]/File[/etc/ferm/tor.d/00_roles-nagiosmaster-ssh-hetzner-hel1-01.torproject.org]/content) content changed '{md5}2b9f880a29c99666e1c9cd9e9198b93c' to '{md5}241239967a530bc4a65e333f3bb4a78d'
Mar 2 05:32:38 ci-runner-01/ci-runner-01 puppet-agent[29768]: (/Stage[main]/Nagios::Client/Ferm::Rule[roles-nagiosmaster-nrpe-hetzner-hel1-01.torproject.org]/File[/etc/ferm/tor.d/00_roles-nagiosmaster-nrpe-hetzner-hel1-01.torproject.org]/content) content changed '{md5}754de66b83324832b8169788843adea4' to '{md5}47052b5099b122f3f9cb0dbd8bffe22d'
Mar 2 05:32:38 ci-runner-01/ci-runner-01 systemd[1]: Reloading ferm firewall configuration.
Mar 2 05:32:38 ci-runner-01/ci-runner-01 ferm[29958]: Reloading Firewall configuration....
Mar 2 05:32:38 ci-runner-01/ci-runner-01 systemd[1]: Reloaded ferm firewall configuration.
Mar 2 05:32:38 ci-runner-01/ci-runner-01 puppet-agent[29768]: (/Stage[main]/Ferm/Exec[ferm reload]) Triggered 'refresh' from 2 events
```
Maybe that's the cause? And indeed, if I fix the problem (by restarting docker) and then reload the firewall rules, the problem comes back!
```
root@ci-runner-01:~# service docker restart
root@ci-runner-01:~# docker run -it --rm debian:stable ping -c 3 torproject.org
PING torproject.org (116.202.120.165) 56(84) bytes of data.
64 bytes from web-fsn-01.torproject.org (116.202.120.165): icmp_seq=1 ttl=55 time=124 ms
64 bytes from web-fsn-01.torproject.org (116.202.120.165): icmp_seq=2 ttl=55 time=124 ms
64 bytes from web-fsn-01.torproject.org (116.202.120.165): icmp_seq=3 ttl=55 time=126 ms

--- torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 123.559/124.411/126.079/1.213 ms
root@ci-runner-01:~# service ferm reload
root@ci-runner-01:~# docker run -it --rm debian:stable ping -c 3 torproject.org
ping: torproject.org: Temporary failure in name resolution
root@ci-runner-01:~#
```
Isn't that interesting? :) I have absolutely no idea what is going on here, but at least it does seem like a reproducible failure!
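For the record, here is the reproducer distilled from the transcript above (same commands, one ping each):

```
# Reproduce the failure: after a docker restart, container networking
# works; a ferm reload breaks it until docker is restarted again.
service docker restart
docker run --rm debian:stable ping -c 1 torproject.org   # works
service ferm reload
docker run --rm debian:stable ping -c 1 torproject.org   # name resolution fails
```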
It seems this is a known problem in Docker too! I'll analyze that thread and see what I can come up with.
There are many workarounds suggested in that discussion:
- restart docker after reloading the firewall rules
- rewrite the firewall rules to avoid flushing the docker rules; probably not practical without severe hacking in iptables/ferm (see the sketch after this list)
- another similar hack, called docker-fw
- have upstream add a `docker network reload-firewall` command, to avoid restarting the entire daemon (comment; not implemented, of course)
- run Docker inside its own network namespace with systemd-named-netns (comment)
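On the second option: ferm does have a `@preserve` keyword in newer releases (I have not checked whether the version we run supports it) that re-injects a chain's current contents after a flush. A hypothetical, untested sketch of what that could look like in a snippet under the /etc/ferm/tor.d/ directory already in use here:

```
# untested sketch: keep Docker's chains across ferm reloads (@preserve);
# note that the jump rules Docker adds to FORWARD and to nat POSTROUTING
# would still need to be handled separately.
domain ip {
    table filter {
        chain DOCKER @preserve;
        chain DOCKER-ISOLATION-STAGE-1 @preserve;
        chain DOCKER-ISOLATION-STAGE-2 @preserve;
        chain DOCKER-USER @preserve;
    }
    table nat {
        chain DOCKER @preserve;
    }
}
```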
I can confirm that reloading ferm flushes critical firewall rules from Docker:
```
--- before 2021-03-02 14:29:54.856964851 +0000
+++ after 2021-03-02 14:30:00.737016665 +0000
@@ -101,12 +101,6 @@
 Chain FORWARD (policy ACCEPT)
 target prot opt source destination
-DOCKER-USER all -- 0.0.0.0/0 0.0.0.0/0
-DOCKER-ISOLATION-STAGE-1 all -- 0.0.0.0/0 0.0.0.0/0
-ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 ctstate RELATED,ESTABLISHED
-DOCKER all -- 0.0.0.0/0 0.0.0.0/0
-ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
-ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
 
 Chain OUTPUT (policy ACCEPT)
 target prot opt source destination
@@ -149,20 +143,3 @@
 ACCEPT tcp -- 193.10.5.2 0.0.0.0/0
 ACCEPT tcp -- 206.248.172.91 0.0.0.0/0
 ACCEPT tcp -- 216.137.119.51 0.0.0.0/0
-
-Chain DOCKER (1 references)
-target prot opt source destination
-
-Chain DOCKER-ISOLATION-STAGE-1 (1 references)
-target prot opt source destination
-DOCKER-ISOLATION-STAGE-2 all -- 0.0.0.0/0 0.0.0.0/0
-RETURN all -- 0.0.0.0/0 0.0.0.0/0
-
-Chain DOCKER-ISOLATION-STAGE-2 (1 references)
-target prot opt source destination
-DROP all -- 0.0.0.0/0 0.0.0.0/0
-RETURN all -- 0.0.0.0/0 0.0.0.0/0
-
-Chain DOCKER-USER (1 references)
-target prot opt source destination
-RETURN all -- 0.0.0.0/0 0.0.0.0/0
```
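(For reference, a diff like this can be captured with something along these lines; the exact invocation used above wasn't recorded, so consider this an assumption:)

```
# Snapshot the filter table before and after a ferm reload and compare.
iptables -L -n > /tmp/before
service ferm reload
iptables -L -n > /tmp/after
diff -u /tmp/before /tmp/after
```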
Maybe it's just a matter of adding those rules ourselves? I'm hesitating between that and adding a systemd override to make sure docker is restarted when ferm is reloaded... Rules don't change that often, and the worst effect would be a failed pipeline because of the interruption...
I'll also point out that a podman-based runner might not be affected by this bug.
It seems like the simplest workaround is to re-add this firewall rule from Docker:

```
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 \! -o docker0 -j MASQUERADE
```

It's rather strange that the other rules are apparently not needed, but at least that works. We could add it to our ferm configs to fix this issue completely.
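A hypothetical sketch of that rule in ferm syntax (untested; 172.17.0.0/16 and docker0 are the defaults visible in the rule above), which could go in a snippet under /etc/ferm/tor.d/:

```
# untested sketch: re-add Docker's masquerade rule from our own ferm
# config, so it survives ferm reloads even when Docker's chains get flushed
domain ip {
    table nat {
        chain POSTROUTING {
            saddr 172.17.0.0/16 outerface ! docker0 MASQUERADE;
        }
    }
}
```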
Unfortunately, hooking up docker.service into ferm.service is not possible, because systemd doesn't allow service dependencies on service reloads, only restarts. We'd have to add something like `service docker restart` to the `ExecReload` command of ferm.service, and that's really yucky.

That's actually not accurate: a reload can trigger a reload (https://www.freedesktop.org/software/systemd/man/systemd.unit.html#PropagatesReloadTo=) and a restart a restart (https://www.freedesktop.org/software/systemd/man/systemd.unit.html#BindsTo=). It still might not work for our case, because a docker reload is not sufficient to fix the bug.

Thanks for working on this.
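Regarding the `PropagatesReloadTo=` correction above, a drop-in would look something like this (untested sketch; and as noted, a docker *reload* alone does not fix the bug, so this probably wouldn't help as-is):

```
# /etc/systemd/system/ferm.service.d/propagate.conf (hypothetical)
[Unit]
# make "systemctl reload ferm" also trigger "systemctl reload docker"
PropagatesReloadTo=docker.service
```

A `systemctl daemon-reload` would be needed to pick that up.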
It seems to have started happening again: https://gitlab.torproject.org/juga/sbws/-/jobs/13965
To make sure this survives the next firewall reload, I've done this gross override:
```
root@ci-runner-01:~# systemctl cat ferm
# /lib/systemd/system/ferm.service
[Unit]
Description=ferm firewall configuration
RequiresMountsFor=/var/cache/
Wants=network-pre.target
Before=network-pre.target shutdown.target
Conflicts=shutdown.target
DefaultDependencies=no

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/init.d/ferm start
ExecReload=/etc/init.d/ferm reload
ExecStop=/etc/init.d/ferm stop

[Install]
WantedBy=sysinit.target

# /etc/systemd/system/ferm.service.d/override.conf
[Service]
ExecReload=/etc/init.d/ferm reload
ExecReload=service docker restart

root@ci-runner-01:~# systemctl show ferm | grep ExecReload
ExecReload={ path=/etc/init.d/ferm ; argv[]=/etc/init.d/ferm reload ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
ExecReload={ path=/etc/init.d/ferm ; argv[]=/etc/init.d/ferm reload ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
ExecReload={ path=/usr/sbin/service ; argv[]=/usr/sbin/service docker restart ; ignore_errors=no ; start_time=[n/a] ; stop_time=[n/a] ; pid=0 ; code=(null) ; status=0/0 }
root@ci-runner-01:~# docker run -it --rm debian:stable ping -c 3 torproject.org
PING torproject.org (95.216.163.36) 56(84) bytes of data.
64 bytes from hetzner-hel1-03.torproject.org (95.216.163.36): icmp_seq=1 ttl=50 time=121 ms
64 bytes from hetzner-hel1-03.torproject.org (95.216.163.36): icmp_seq=2 ttl=50 time=121 ms
64 bytes from hetzner-hel1-03.torproject.org (95.216.163.36): icmp_seq=3 ttl=50 time=121 ms

--- torproject.org ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 5ms
rtt min/avg/max/mdev = 120.979/121.098/121.224/0.301 ms
root@ci-runner-01:~# systemctl reload ferm
root@ci-runner-01:~# docker run -it --rm debian:stable ping -c 3 torproject.org
PING torproject.org (116.202.120.166) 56(84) bytes of data.
64 bytes from web-fsn-02.torproject.org (116.202.120.166): icmp_seq=1 ttl=55 time=123 ms
64 bytes from web-fsn-02.torproject.org (116.202.120.166): icmp_seq=2 ttl=55 time=123 ms
^C
--- torproject.org ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 3ms
rtt min/avg/max/mdev = 123.053/123.070/123.087/0.017 ms
root@ci-runner-01:~# systemctl status ferm
● ferm.service - ferm firewall configuration
   Loaded: loaded (/lib/systemd/system/ferm.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/ferm.service.d
           └─override.conf
   Active: active (exited) since Thu 2021-02-04 15:10:19 UTC; 1 months 2 days ago
  Process: 31253 ExecReload=/etc/init.d/ferm reload (code=exited, status=0/SUCCESS)
  Process: 31276 ExecReload=/etc/init.d/ferm reload (code=exited, status=0/SUCCESS)
  Process: 31298 ExecReload=/usr/sbin/service docker restart (code=exited, status=0/SUCCESS)
 Main PID: 238 (code=exited, status=0/SUCCESS)

Mar 02 14:52:23 ci-runner-01 systemd[1]: Reloading ferm firewall configuration.
Mar 02 14:52:23 ci-runner-01 ferm[6739]: Reloading Firewall configuration....
Mar 02 14:52:23 ci-runner-01 systemd[1]: Reloaded ferm firewall configuration.
Mar 03 10:29:29 ci-runner-01 systemd[1]: Reloading ferm firewall configuration.
Mar 03 10:29:29 ci-runner-01 ferm[30223]: Reloading Firewall configuration....
Mar 03 10:29:29 ci-runner-01 systemd[1]: Reloaded ferm firewall configuration.
Mar 09 20:52:05 ci-runner-01 systemd[1]: Reloading ferm firewall configuration.
Mar 09 20:52:05 ci-runner-01 ferm[31253]: Reloading Firewall configuration....
Mar 09 20:52:06 ci-runner-01 ferm[31276]: Reloading Firewall configuration....
Mar 09 20:52:19 ci-runner-01 systemd[1]: Reloaded ferm firewall configuration.
```
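Side note: the status output above shows the ferm reload running twice, because `ExecReload=` lines in a drop-in are appended to the unit's existing list. A slightly cleaner variant (untested sketch) would reset the list first with an empty assignment:

```
# /etc/systemd/system/ferm.service.d/override.conf, with ExecReload=
# cleared before re-adding commands, so the ferm reload only runs once
[Service]
ExecReload=
ExecReload=/etc/init.d/ferm reload
ExecReload=/usr/sbin/service docker restart
```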
So yeah, it's kind of gross, because it's a (hidden) service dependency... I'd much rather have the firewall rules properly reloaded on `service docker reload`, so that we could have the "reloads" depend on each other, but alas, this would require a patch to docker and to the service files, so it's not going to happen anytime soon.

I'm not super comfortable with adding just the firewall rule either, because I'm not sure it's sufficient to ensure proper operation. Ping might work, but other firewall issues could fail in more subtle ways, so I don't really want to play around with this.
The downside of this approach is that it will probably kill any running container when ferm reloads, but that beats having all containers fail from then on.
The only remaining task is to add this hack to Puppet now.
- anarcat closed
Apparently this is something specific to the Docker image used to run the runner. There's a workaround described here:

https://gitlab.com/gitlab-org/gitlab-runner/-/issues/6644#note_593121647

... and this might even be fixed in the next upstream release, so I'm reopening this to track it.
- anarcat reopened
- anarcat closed
- anarcat added Anti-Censorship label
- anarcat removed Anti-Censorship label
Due to a bug in the Puppet configuration (#109 (closed)), this was only deployed just now.
So this happened again today: https://gitlab.torproject.org/tpo/core/arti/-/jobs/34618

Reopening.
- anarcat reopened
- anarcat closed
- anarcat mentioned in issue team#40368 (closed)
Restarted docker on shadow in team#40368 (closed) and on the arm builder as well, just in case.
- Jérôme Charaoui marked this issue as related to team#40541 (closed)
- Jérôme Charaoui mentioned in issue team#40541 (closed)
For completeness's sake: there was an update to the upstream ticket recently which says it's just a matter of adding a `dns =` entry to the runner config. I doubt it would solve our problem, because it's not just DNS that fails but the entire network stack. It's possible there are multiple tangled-up issues here as well.
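For reference, the upstream suggestion amounts to something like this in the runner's `config.toml` (excerpt; the resolver address here is a placeholder of my choosing, not a recommendation):

```
# /etc/gitlab-runner/config.toml (excerpt)
[[runners]]
  [runners.docker]
    # hand containers an explicit resolver instead of the docker-generated one
    dns = ["1.1.1.1"]
```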