
Major upgrades

Major upgrades are done by hand, with a "cheat sheet" created for each major release. Here are the currently documented ones:

Team-specific upgrade policies

Before we perform a major upgrade, it may be advisable to consult the team working on the box to see whether the upgrade will interfere with their work. Some teams might block the upgrade if they believe it will break their service. They are not allowed to block the upgrade indefinitely, however.

Team policies:

  • anti-censorship: TBD
  • metrics: one or two work-day advance notice (source)
  • funding: schedule a maintenance window
  • git: TBD
  • gitlab: TBD
  • translation: TBD

Some teams might be missing from the list.

All time version graph

graph showing the number of hosts per Debian release over time
The above graph shows the number of hosts running a particular version of Debian over time since data collection started in 2019, covering 5 different Debian releases:
  • jessie
  • stretch
  • buster (upgraded in 28 months)
  • bullseye (upgraded in 24 months and counting, running for 48 months and counting)
  • bookworm (running 1 month and counting)

Minor upgrades

Unattended upgrades

Most package upgrades are handled by the unattended-upgrades package, which is configured via Puppet.

Unattended-upgrades writes logs to /var/log/unattended-upgrades/ but also /var/log/dpkg.log.

The default configuration file for unattended-upgrades is at /etc/apt/apt.conf.d/50unattended-upgrades.

Pending upgrades are still noticed by Nagios which warns loudly about them in its usual channels.

Note that unattended-upgrades is configured to upgrade packages regardless of their origin (Unattended-Upgrade::Origins-Pattern { "origin=*" }). If a new sources.list entry is added, it will be picked up and applied by unattended-upgrades unless it has a special policy (like Debian's backports). It is strongly recommended that new sources.list entries be paired with a "pin" (see apt_preferences(5)). See also tpo/tpa/team#40771 for a discussion and rationale of that change.
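For example, a new third-party sources.list entry could be paired with a pin like the one below. The file name, origin string, and priority are hypothetical; see apt_preferences(5) for the exact field semantics.

```
# /etc/apt/preferences.d/example-repo.pref -- hypothetical pin
# Keep packages from this origin from being installed or upgraded
# automatically, unless explicitly requested.
Package: *
Pin: origin deb.example.com
Pin-Priority: 100
```

A priority below 500 means the repository's packages never win over the versions already installed from Debian, so unattended-upgrades will not pull from it on its own.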

Manual upgrades with Cumin

It's also possible to do a manual mass-upgrade run with Cumin:

cumin -b 10  '*' 'apt update ; unattended-upgrade ; TERM=doit dsa-update-apt-status'

The TERM override skips the jitter the script introduces when it detects an automated run.

The above will respect the unattended-upgrade policy, which may block certain upgrades. If you want to bypass that, use regular apt:

cumin -b 10  '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'

Special cases and manual restarts

The above covers all upgrades that are automatically applied, but some are blocked from automation and require manual intervention.

Others do upgrade automatically, but require a manual restart. Normally, needrestart runs after upgrades and takes care of restarting services, but it can't actually deal with everything.

There is a Nagios check that might trigger and tell you that some services are running with outdated libraries. You may see a warning like:

[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none

The detailed status information will show which service needrestart failed to restart:

WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
Services:
- cron.service

If you cannot figure out why the warning happens, you might want to run the check by hand:

needrestart -v

Packages are blocked from upgrades when they cause significant breakage during an upgrade run, enough to cause an outage and/or require significant recovery work. This is done through Puppet, in the profile::unattended_upgrades class, in the blacklist setting.

Packages can be unblocked if and only if:

  • the bug is confirmed as fixed in Debian
  • the fix is deployed on all servers and confirmed as working
  • we have good confidence that future upgrades will not break the system again

This section documents how to do some of those upgrades and restarts by hand.

GitLab runner upgrades

Every month or so, GitLab publishes an update to the gitlab-runner apt package. The package is excluded from unattended-upgrades to avoid any risk of interrupting long-running CI jobs (e.g. large shadow sims).

The recommended procedure is to go through each CI machine one at a time, pause all the runners on that single machine, ensure no long-running shadow sims are being executed, and launch apt upgrade. If any regular CI jobs are running, systemd will wait up to one hour for them to end, then proceed with the package upgrade.
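The per-machine loop above can be sketched in shell. This is a dry-run sketch: the host names and the upgrade_runner helper are invented for illustration, the actual upgrade command is only echoed, and pausing the runners still has to be done beforehand through the GitLab admin area (or API).

```shell
# Dry-run sketch of the per-machine gitlab-runner upgrade loop.
# Host names are hypothetical; the real command is only echoed here.
upgrade_runner() {
    host="$1"
    # 1. pause all runners registered on $host (GitLab admin area)
    # 2. check that no long-running shadow sims are executing
    # 3. upgrade; systemd will wait up to an hour for running jobs:
    echo ssh "$host" "apt update && apt install -y gitlab-runner"
    # 4. unpause the runners once the package is upgraded
}

for host in ci-runner-01.torproject.org ci-runner-02.torproject.org; do
    upgrade_runner "$host"
done
```

Going one machine at a time keeps the other CI machines available to pick up new jobs while the paused one drains.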

cron.service

These are typically services that should run under systemd --user but are instead started with a @reboot cron job.

For this kind of service, reboot the server or ask the service admin to restart their services themselves. Ideally, this service should be converted to a systemd unit, see this documentation.
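Such a conversion could look like the hypothetical user unit below; the unit name and ExecStart path are made up for illustration.

```
# ~/.config/systemd/user/example.service -- hypothetical unit replacing
# a "@reboot /home/user/bin/example-daemon" cron job
[Unit]
Description=example service, formerly started from a @reboot cron job

[Service]
ExecStart=%h/bin/example-daemon
Restart=on-failure

[Install]
WantedBy=default.target
```

The user would then enable it with systemctl --user enable --now example.service, and loginctl enable-linger is needed for the unit to start at boot without a login session.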

ud-replicate special case

Sometimes, userdir-ldap's ud-replicate leaves a multiplexing SSH process lying around and those show up as part of cron.service.

Logging into the LDAP server (currently alberti) and killing all the sshdist processes will clear those:

pkill -u sshdist ssh

systemd user manager services

The needrestart tool lacks the ability to restart user-based systemd daemons and services. For example, when running needrestart -rl:

User sessions running outdated binaries:
 onionoo @ user manager service: systemd[853]
 onionoo-unpriv @ user manager service: systemd[854]

To restart these services, run:

systemctl restart user@$(id -u onionoo) user@$(id -u onionoo-unpriv)

Sometimes an error message similar to this is shown:

Job for user@1547.service failed because the control process exited with error code.

The solution here is to run the systemctl restart command again; the error should then no longer appear.

Ganeti

The ganeti.service warning is typically caused by an OpenSSL upgrade that affects qemu; restarting Ganeti (thankfully) doesn't restart VMs. To fix this, migrate all VMs to their secondaries and back; see the Ganeti reboot procedures, possibly the instance-only restart procedure.

Open vSwitch

This is generally the openvswitch-switch and openvswitch-common services, which are blocked from upgrades because of bug 34185.

To upgrade manually, empty the server, restart, upgrade OVS, then migrate the machines back. It's actually easier to just treat this as a "reboot the nodes only" procedure, see the Ganeti reboot procedures instead.

Note that this might be fixed in Debian bullseye: bug 961746 is marked as fixed in Debian, but it still needs to be tested on our side first. Update: it hasn't been fixed.

Grub

grub-pc (bug 40042) has been known to have issues as well, so it is blocked. To upgrade, make sure the install device is defined by running dpkg-reconfigure grub-pc. This issue might actually have been fixed in the package, see issue 40185.

Update: this issue has been resolved and grub upgrades are now automated. This section is kept for historical reference, or in case the upgrade path is broken again.

user@ services

Services setup with the new systemd-based startup system documented in doc/services may not automatically restart. They may be (manually) restarted with:

systemctl restart user@1504.service

There's a feature request (bug #843778) to implement support for those services directly in needrestart.

Reboots

Sometimes it is necessary to reboot hosts, for example when the kernel is updated. Nagios will warn about this with something like:

WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]

TODO: the above is the old way, the needrestart check has a different output. document it above.

Rebooting a single host

If only a single virtual machine is affected, it can be rebooted directly. This can be done with the tsa-misc script called reboot:

./reboot -H test-01.torproject.org,test-02.torproject.org

By default, the script will wait 2 minutes between hosts; that should be raised to 30 minutes if the hosts are part of a mirror network, to give the monitoring system (mini-nag) time to rotate the hosts in and out of DNS:

./reboot -H mirror-01.torproject.org,mirror-02.torproject.org --delay-nodes 1800

If the host has an encrypted filesystem and is hooked up with Mandos, it will come back automatically. Otherwise a password might need to be entered at boot time, either through the initramfs (if the host has the profile::fde class in Puppet) or manually, after boot. That is the case for the mandos-01 server itself, for example, which naturally can't currently unlock itself.

Batch rebooting multiple hosts

IMPORTANT: before following this procedure, make sure that only a subset of the hosts needs a restart. If all hosts need a reboot, it's likely going to be faster and easier to reboot the entire cluster at once; see the Ganeti reboot procedures instead.

LDAP hosts carry information about how they can be rebooted in the rebootPolicy field. Here is what the various values mean:

  • justdoit - can be rebooted any time, with a 10 minute delay, possibly in parallel
  • rotation - part of a cluster where each machine needs to be rebooted one at a time, with a 30 minute delay for DNS to update
  • manual - needs to be done by hand or with a special tool (fabric in case of ganeti, reboot-host in the case of KVM, nothing for windows boxes)

Therefore, it's possible to selectively reboot some of those hosts in batches. Again, this is pretty rare: typically, you would either reboot only a single host or all hosts, in which case a cluster-wide reboot (with Ganeti, below) would be more appropriate.

This routine should be able to reboot all hosts with a rebootPolicy defined to justdoit or rotation:

echo "rebooting 'justdoit' hosts with a 10-minute delay, every 2 minutes...."
./reboot -H $(ssh db.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=justdoit)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R') --delay-shutdown=10 --delay-hosts=120

echo "rebooting 'rotation' hosts with a 10-minute delay, every 30 minutes...."
./reboot -H $(ssh db.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=rotation)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R') --delay-shutdown=10 --delay-hosts=1800

Rebooting KVM hosts

What remains is the "manual" policy, which covers one last KVM host:

./reboot-host moly.torproject.org

... and Ganeti nodes, below.

Rebooting Ganeti nodes

See the Ganeti reboot procedures for this procedure.

Remaining nodes

The Nagios unhandled problems view will show any remaining hosts that might have been missed by the above procedure.