AptUpdateLagging: The package list on several nodes are not being updated

added Backlog label

assigned to @anarcat

added Doing label and removed Backlog label

okay, so we do have an actual problem here: it seems the apt-daily.service job, which was supposed to run apt-get update everywhere, at least daily, is not actually doing its job. take idle-fsn-01 for example:

root@idle-fsn-01:~# /usr/share/prometheus-node-exporter-collectors/apt_info.py
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 15
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1728443683.6660926
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge

that timestamp is more than a day old:

> (1728574537-1728443683.6660926)s

  (1728574537 − 1728443683.6660926) seconds ≈
  1 d + 12 h + 20 min + 53.33 s
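
the same arithmetic can be reproduced without a calculator. this is just a minimal sketch: the two timestamps are the ones from the output above, and the `breakdown` helper is a name made up for illustration.

```python
def breakdown(seconds):
    """Split a duration in seconds into (days, hours, minutes, seconds)."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return int(days), int(hours), int(minutes), secs


# "now" and the apt_package_cache_timestamp_seconds value from idle-fsn-01
now = 1728574537
cache_timestamp = 1728443683.6660926

age = now - cache_timestamp
days, hours, minutes, secs = breakdown(age)
# roughly 1 d 12 h 20 min 53 s, same as the calculator result above
print(f"{days} d + {hours} h + {minutes} min + {secs:.2f} s")
```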

the timer is scheduled to run again in 12h, and the service actually ran about five hours ago:

root@idle-fsn-01:~# systemctl status apt-daily.timer
● apt-daily.timer - Daily apt download activities
     Loaded: loaded (/lib/systemd/system/apt-daily.timer; enabled; preset: enabled)
     Active: active (waiting) since Wed 2024-10-09 15:23:52 UTC; 24h ago
    Trigger: Fri 2024-10-11 04:36:12 UTC; 12h left
   Triggers: ● apt-daily.service

Oct 09 15:23:52 idle-fsn-01 systemd[1]: Started apt-daily.timer - Daily apt download activities.
root@idle-fsn-01:~# systemctl status apt-daily.service
○ apt-daily.service - Daily apt download activities
     Loaded: loaded (/lib/systemd/system/apt-daily.service; static)
     Active: inactive (dead) since Thu 2024-10-10 10:32:38 UTC; 5h 4min ago
TriggeredBy: ● apt-daily.timer
       Docs: man:apt(8)
   Main PID: 20791 (code=exited, status=0/SUCCESS)
        CPU: 38ms

Oct 10 10:32:38 idle-fsn-01 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Oct 10 10:32:38 idle-fsn-01 systemd[1]: apt-daily.service: Deactivated successfully.
Oct 10 10:32:38 idle-fsn-01 systemd[1]: Finished apt-daily.service - Daily apt download activities.

strangely though, if i do start that service by hand, the timestamp does update:

root@idle-fsn-01:~# systemctl start apt-daily.service
root@idle-fsn-01:~# systemctl status apt-daily.service
○ apt-daily.service - Daily apt download activities
     Loaded: loaded (/lib/systemd/system/apt-daily.service; static)
     Active: inactive (dead) since Thu 2024-10-10 15:40:29 UTC; 4s ago
TriggeredBy: ● apt-daily.timer
       Docs: man:apt(8)
    Process: 26094 ExecStartPre=/usr/lib/apt/apt-helper wait-online (code=exited, status=0/SUCCESS)
    Process: 26098 ExecStart=/usr/lib/apt/apt.systemd.daily update (code=exited, status=0/SUCCESS)
   Main PID: 26098 (code=exited, status=0/SUCCESS)
        CPU: 3.052s

Oct 10 15:40:24 idle-fsn-01 systemd[1]: Starting apt-daily.service - Daily apt download activities...
Oct 10 15:40:29 idle-fsn-01 systemd[1]: apt-daily.service: Deactivated successfully.
Oct 10 15:40:29 idle-fsn-01 systemd[1]: Finished apt-daily.service - Daily apt download activities.
Oct 10 15:40:29 idle-fsn-01 systemd[1]: apt-daily.service: Consumed 3.052s CPU time.
root@idle-fsn-01:~# /usr/share/prometheus-node-exporter-collectors/apt_info.py
# HELP apt_upgrades_pending Apt packages pending updates by origin.
# TYPE apt_upgrades_pending gauge
apt_upgrades_pending{origin="",arch=""} 0
# HELP apt_upgrades_held Apt packages pending updates but held back.
# TYPE apt_upgrades_held gauge
apt_upgrades_held{origin="",arch=""} 0
# HELP apt_autoremove_pending Apt packages pending autoremoval.
# TYPE apt_autoremove_pending gauge
apt_autoremove_pending 15
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1728574827.4657624
# HELP node_reboot_required Node reboot is required for software updates.
# TYPE node_reboot_required gauge
node_reboot_required 0

not sure what's actually going on here.

marked this issue as related to team#41770 (closed)

mentioned in issue team#41770 (closed)

@anarcat I haven't looked at the file at all.. I just saw something recently about timer unit files needing to have OnBootSec in order for them to be activated at boot time. is this maybe missing in the unit file?

so yeah, i thought i had a fix for this in puppet by setting the update parameter to 'always'. i wrote a detailed explanation in the commit log:

commit a8c447cef0101444378ea787f50e4ac4d92db945
Author:     Antoine Beaupré <anarcat@debian.org>
Date: Thu Oct 10 12:00:39 2024 -0400

run apt-get update more frequently

This setting changes the APT::Periodic::Update-Package-Lists parameter
in the APT configuration, which in turn affects how the
`/usr/lib/apt/apt.systemd.daily` script works. By default, it's set to
zero ("0") but the unattended-upgrades module somewhat reasonably
turns that on by setting it to one ("1").

Except one, here, doesn't just mean "true", it means "one day", or
"daily". What that's going to do is make the script check the update
timestamp for when it ran last and make sure we wait more than a
day. But then the script only gets called once or twice a day, which
means that it can actually take more than a day to run apt-get update.

Reduce that complexity and just run apt-get update every time we call
this script.

Upgrades, however, are still configured to run "more than daily"
here. The rationale is that we might not want *those* to upgrade twice
a day. That could be fixed later, but so far our alerts for this are
much more liberal, and I think the current settings for that are fine.

Closes: tpo/tpa/prometheus-alerts#22

1 file changed, 4 insertions(+)
modules/profile/manifests/unattended_upgrades.pp | 4 ++++

modified   modules/profile/manifests/unattended_upgrades.pp
@@ -39,5 +39,9 @@ class profile::unattended_upgrades {
       # takes effect only after unattended-upgrades 2.5, shipped in BULLSEYE
       remove_new_unused_deps => true,
       remove_unused_kernel   => true,
+      # always run apt-get update in the daily job, instead of having
+      # *two* separate timers. the systemd timer takes care of making
+      # sure this only runs daily.
+      update                 => 'always',
   }
 }

but that broke because our unattended-upgrades module is out of date. and we can't update it because newer versions drop support for puppet 5, which would break upgrades there.

so i'm just going to bump the alert latency threshold for now and punt this until we upgrade to puppet 7 everywhere.
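
to illustrate the problem described in the commit message, here's a simplified model of the "has it been a day since the last update?" check. this is only a sketch of the behavior as described there, not the actual logic of `/usr/lib/apt/apt.systemd.daily`; the `simulate` function and the call times are made up for illustration.

```python
from datetime import datetime, timedelta


def simulate(call_times, interval):
    """Return the call times at which a package list refresh would happen.

    interval=None models "always": every call refreshes.
    Otherwise a call is skipped unless at least `interval` has passed
    since the previous refresh (simplified model, see the lead-in).
    """
    refreshed = []
    last = None
    for now in call_times:
        if interval is None or last is None or now - last >= interval:
            refreshed.append(now)
            last = now
    return refreshed


# Hypothetical times at which the timer fires the script (not real data).
CALLS = [
    datetime(2024, 10, 8, 6, 0),
    datetime(2024, 10, 9, 3, 0),   # 21h after the first call
    datetime(2024, 10, 9, 20, 0),  # 38h after the first call
]

daily = simulate(CALLS, timedelta(days=1))
always = simulate(CALLS, None)

# With a one-day interval the second call is skipped because only 21h
# have passed, so the package lists end up 38h old before being refreshed.
print("daily: ", [t.isoformat() for t in daily])
print("always:", [t.isoformat() for t in always])
```

in this model, the one-day interval stretches the gap between two refreshes to 38 hours, while the "always" setting refreshes on every call and leaves the pacing to the timer.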

changed milestone to %Debian 12 bookworm upgrade

added Icebox label

removed Doing label

mentioned in commit 91bef26b

so i tweaked the alert in 91bef26b, the next step here is to upgrade all servers to puppet 7 (#41819) and then merge the u-u-upgrade branch in Puppet, and then revert 91bef26b here.

phew.

unassigned @anarcat

mentioned in commit wiki-replica@89f3cecd

okay, this is weird: on lists-01 right now, there's an AptUpdateLagging alert firing, and it seems legit:

root@lists-01:/etc/mailman3# systemctl status apt-daily | grep Active
     Active: inactive (dead) since Tue 2024-10-29 20:07:55 UTC; 17h ago
root@lists-01:/etc/mailman3# /usr/share/prometheus-node-exporter-collectors/apt_info.py | grep timestamp
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1730096470.0152664
root@lists-01:/etc/mailman3# date +%s
1730294846

the service ran recently enough:

root@lists-01:/etc/mailman3# systemctl status apt-daily | grep Active
     Active: inactive (dead) since Tue 2024-10-29 20:07:55 UTC; 17h ago

... buuuut it didn't work because our "update flag" is wrong:

root@lists-01:/etc/mailman3# apt-config dump | grep APT::Periodic::Update-Package-Lists
APT::Periodic::Update-Package-Lists "1";

so it looks like our thresholds are still too sensitive here, i'll bump those again.
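
for reference, the manual check above (grep the collector output, compare with `date +%s`) can be sketched like this. the sample text and timestamps are the ones from lists-01; the `parse_metric` helper and the two-day threshold are assumptions for illustration only, not the actual alert rule.

```python
# Collector output and "date +%s" value as seen on lists-01 above.
SAMPLE = """\
# HELP apt_package_cache_timestamp_seconds Apt update last run time.
# TYPE apt_package_cache_timestamp_seconds gauge
apt_package_cache_timestamp_seconds 1730096470.0152664
"""
NOW = 1730294846

# Hypothetical threshold, NOT the value used in the real alert.
THRESHOLD_DAYS = 2


def parse_metric(text, name):
    """Return the value of an unlabeled metric from Prometheus text output."""
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        metric, _, value = line.rpartition(" ")
        if metric == name:
            return float(value)
    raise KeyError(name)


timestamp = parse_metric(SAMPLE, "apt_package_cache_timestamp_seconds")
age_days = (NOW - timestamp) / 86400
lagging = age_days > THRESHOLD_DAYS
print(f"package lists are {age_days:.2f} days old, lagging={lagging}")
```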

added Doing label and removed Icebox label

assigned to @anarcat

so i tweaked the alert in 91bef26b, the next step here is to upgrade all servers to puppet 7 (#41819) and then merge the u-u-upgrade branch in Puppet, and then revert 91bef26b here.

i've done that merge, now that we've upgraded to Puppet 7 everywhere. i don't think this will have any negative impact, but i've run puppet by hand on the idle-* nodes to check, and the catalog compiles fine at least.

this should knock out yet another false positive, whoohoo!

closed
