something crashes btcpayserver networking once in a while
In #41566, @susan reported issues with the BTCPay server, typically a 504 gateway timeout when trying to browse transactions (see https://gitlab.torproject.org/tpo/tpa/team/-/issues/incident/41566#note_3014974 for a reproducer). I could confirm there was a network issue inside at least one of the containers:
root@btcpayserver-02:~# docker exec -it generated_btcpayserver_1 bash
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
root@44abcc1f87ad:/app# apt update
0% [Connecting to deb.debian.org (151.101.162.132)]^C
root@44abcc1f87ad:/app#
exit
That apt update command never returns. But after restarting docker, everything is back to normal:
root@btcpayserver-02:~# service docker restart
root@btcpayserver-02:~# docker exec -it generated_btcpayserver_1 bash
bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
root@17c922f5b664:/app# apt update
Get:1 http://deb.debian.org/debian bookworm InRelease [151 kB]
Get:2 http://deb.debian.org/debian bookworm-updates InRelease [55.4 kB]
Get:3 http://deb.debian.org/debian-security bookworm-security InRelease [48.0 kB]
Get:4 http://deb.debian.org/debian bookworm/main amd64 Packages [8786 kB]
Get:5 http://deb.debian.org/debian bookworm-updates/main amd64 Packages [12.7 kB]
Get:6 http://deb.debian.org/debian-security bookworm-security/main amd64 Packages [150 kB]
Fetched 9203 kB in 1s (7850 kB/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
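The manual check above hangs until interrupted, so a scripted version needs a timeout. Here is a rough sketch of one; the function name and the DOCKER override are hypothetical, and it assumes apt-get exists in the container (as in the session above) and that a failed or timed-out update really means the container lost network:

```shell
#!/bin/sh
# Sketch only: container_has_network and DOCKER are made-up names.

container_has_network() {
    # $1: container name, $2: timeout in seconds (default 15)
    # Run the same probe as the interactive session, but give up after
    # the timeout instead of hanging forever.
    timeout "${2:-15}" "${DOCKER:-docker}" exec "$1" apt-get -qq update >/dev/null 2>&1
}

# Example (not run here):
#   container_has_network generated_btcpayserver_1 || echo "network looks broken"
```

Note that the timeout only kills the docker exec client on the host; the apt-get process may linger inside the container for a while.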
This feels related to #40541 (closed): we used to have systemd hooks that would kick docker whenever the firewall was updated, otherwise we'd lose network like this. I suspect the fix in #40541 (comment 2766535) ("using a ferm feature allowing us to persist rules across ferm reload/restarts") doesn't actually work. So I'm pondering restoring that workaround, not on every docker host, but disabled by default and enabled only on hosts (like this one) where a restart is less of an issue.
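To make the "disabled by default" idea concrete, here is a minimal sketch of what an opt-in hook could look like. It is not the old hook from #40541: the RESTART_DOCKER_ON_FERM_RELOAD flag, the SERVICE_CMD override and the function name are all hypothetical, and how it gets triggered after a ferm reload is left open.

```shell
#!/bin/sh
# Sketch only: the flag and function names below are made up for illustration.

kick_docker_after_ferm_reload() {
    # Opt-in: do nothing unless this host explicitly enabled the workaround.
    if [ "${RESTART_DOCKER_ON_FERM_RELOAD:-no}" != "yes" ]; then
        echo "skipped"
        return 0
    fi
    # Same command as the manual fix; SERVICE_CMD can be overridden to
    # exercise the logic without root.
    "${SERVICE_CMD:-service}" docker restart && echo "restarted"
}

# Example (not run here), on a host that opted in:
#   RESTART_DOCKER_ON_FERM_RELOAD=yes kick_docker_after_ferm_reload
```

Keeping the default at "no" means hosts where a docker restart is disruptive are unaffected unless someone flips the flag.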
I suspect this is the puppet run that broke it:
Mar 27 18:28:48 btcpayserver-02/btcpayserver-02 systemd[1]: Started puppet-run.service - Run the Puppet agent on this machine.
Mar 27 18:29:01 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: (/Stage[main]/Ferm/File[/etc/ferm/tor.d/10_ssh-allow-jumphost-majus.torproject.org]/ensure) removed
Mar 27 18:29:01 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/defs.conf]/content) content changed '{sha256}76c8dda0809a4503c1cdf7e44137b51b15a1ab2770937798c5c861b6887afbc1' to '{sha256}2481da82548adbd0bb2038065c6d7c87c45635e8bde711812cd145dd204691de'
Mar 27 18:29:01 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: (/Stage[main]/Nagios::Client/File[/etc/nagios/nrpe.d/nrpe_tor.cfg]/content) content changed '{sha256}c6681f6ff6b39b8e3d3c331ae04b611eac8e92a50f63d918408f13dc3f4e082a' to '{sha256}b9f9de72729eade591b975f9887aac21488ea2737e772185bed03def016b0665'
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[624]: Caught SIGTERM - shutting down...
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 systemd[1]: Stopping nagios-nrpe-server.service - Nagios Remote Plugin Executor...
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[624]: Daemon shutdown
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 systemd[1]: nagios-nrpe-server.service: Deactivated successfully.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 systemd[1]: Stopped nagios-nrpe-server.service - Nagios Remote Plugin Executor.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 systemd[1]: nagios-nrpe-server.service: Consumed 8min 7.092s CPU time.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 systemd[1]: Started nagios-nrpe-server.service - Nagios Remote Plugin Executor.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: (/Stage[main]/Nagios::Client/Service[nagios-nrpe-server]) Triggered 'refresh' from 1 event
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[519657]: Starting up daemon
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[519657]: Server listening on 0.0.0.0 port 5666.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[519657]: Server listening on :: port 5666.
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[519657]: Listening for connections on port 5666
Mar 27 18:29:03 btcpayserver-02/btcpayserver-02 nrpe[519657]: Allowing connections from: 95.216.141.241,2a01:4f9:c010:5f1::1,49.12.57.130,2a01:4f8:fff0:4f:266:37ff:fee9:5df8,
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 systemd[1]: Reloading ferm.service - ferm firewall configuration...
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519688]: Reloading Firewall configuration...
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519694]: Warning in /etc/ferm/tor.d/00_roles-nagiosmaster-nrpe-hetzner-hel1-01.torproject.org line 10: Chain 5666 already exists
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519694]: Warning in /etc/ferm/tor.d/00_roles-nagiosmaster-nrpe-hetzner-hel1-01.torproject.org line 10: Chain 5666 already exists
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519694]: Warning in /etc/ferm/tor.d/00_tor-ssh line 10: Chain ssh already exists
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519694]: Warning in /etc/ferm/tor.d/00_tor-ssh line 10: Chain ssh already exists
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 ferm[519688]: .
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 systemd[1]: Reloaded ferm.service - ferm firewall configuration.
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: (/Stage[main]/Ferm/Exec[ferm reload]) Triggered 'refresh' from 2 events
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 puppet-agent[519488]: Applied catalog in 3.69 seconds
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 systemd[1]: puppet-run.service: Deactivated successfully.
Mar 27 18:29:04 btcpayserver-02/btcpayserver-02 systemd[1]: puppet-run.service: Consumed 8.027s CPU time.
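To check whether other outages line up with ferm reloads, the reload timestamps can be pulled out of logs like the one above. A small sketch, assuming syslog-style lines where the first three fields are the timestamp (as in this log); the function name is hypothetical:

```shell
#!/bin/sh
# Sketch only: ferm_reload_times is a made-up helper name.

ferm_reload_times() {
    # Read log lines on stdin and print the timestamp of each completed
    # ferm reload ("Reloaded ferm.service" is the systemd message seen above).
    grep -F 'Reloaded ferm.service' | awk '{print $1, $2, $3}'
}

# Example (not run here; the log path is an assumption):
#   ferm_reload_times < /var/log/syslog
```

Those timestamps can then be compared against when the 504s started.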
Obviously, the best solution would be to switch to nftables already and ditch that old ferm (#40554), but that's a serious undertaking that we don't have the free cycles for at this point...
@lavamind, what do you think?