- Major upgrades
- Team-specific upgrade policies
- All time version graph
- Minor upgrades
- Unattended upgrades
- Manual upgrades with Cumin
- Special cases and manual restarts
- GitLab runner upgrades
- cron.service
- ud-replicate special case
- systemd user manager services
- Ganeti
- Open vSwitch
- Grub
- user@ services
- Reboots
- Rebooting a single host
- Batch rebooting multiple hosts
- Rebooting KVM hosts
- Rebooting Ganeti nodes
- Remaining nodes
Major upgrades
Major upgrades are done by hand, with a "cheat sheet" created for each major release. Here are the currently documented ones:
Team-specific upgrade policies
Before we perform a major upgrade, it might be advisable to consult with the team working on the box to see if it will interfere with their work. Some teams might block the upgrade if they believe it will break their service. They are not allowed to block the upgrade indefinitely, however.
Team policies:
- anti-censorship: TBD
- metrics: one or two work-day advance notice (source)
- funding: schedule a maintenance window
- git: TBD
- gitlab: TBD
- translation: TBD
Some teams might be missing from the list.
All time version graph

- jessie
- stretch
- buster (upgraded in 28 months)
- bullseye (upgraded in 24 months and counting, running for 48 months and counting)
- bookworm (running 1 month and counting)
Minor upgrades
Unattended upgrades
Most package upgrades are handled by the unattended-upgrades package, which is configured via Puppet.
Unattended-upgrades writes logs to /var/log/unattended-upgrades/ but also /var/log/dpkg.log.
The default configuration file for unattended-upgrades is at /etc/apt/apt.conf.d/50unattended-upgrades.
Pending upgrades are still noticed by Nagios which warns loudly about them in its usual channels.
Note that unattended-upgrades is configured to upgrade packages
regardless of their origin (Unattended-Upgrade::Origins-Pattern { "origin=*" }).
If a new sources.list entry is added, it will be picked up and applied
by unattended-upgrades unless it has a special policy (like Debian's
backports). It is strongly recommended that new sources.list entries be
paired with a "pin" (see apt_preferences(5)). See also
tpo/tpa/team#40771 for a discussion and rationale of that change.
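As an illustration of the "pin" recommendation, here is a minimal sketch of a helper that prints an apt_preferences(5) stanza for one origin. The function name, the origin hostname and the priority value are assumptions made for this example, not part of the existing setup. Which priority is appropriate depends on how the repository should behave, so check apt_preferences(5) before using it.

```shell
# Print an apt_preferences(5) stanza that pins every package from one origin.
# $1: origin hostname (hypothetical in the example below)
# $2: pin priority (placeholder value, pick one according to apt_preferences(5))
make_origin_pin() (
    origin="$1"
    priority="$2"
    printf 'Package: *\nPin: origin "%s"\nPin-Priority: %s\n' "$origin" "$priority"
)

# Example use, writing the pin for a hypothetical third-party repository:
# make_origin_pin deb.example.org 100 > /etc/apt/preferences.d/example.pref
```

The stanza is only printed, so it can be reviewed (or shipped through Puppet) before it lands in /etc/apt/preferences.d/.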
Manual upgrades with Cumin
It's also possible to do a manual mass-upgrade run with Cumin:
cumin -b 10 '*' 'apt update ; unattended-upgrade ; TERM=doit dsa-update-apt-status'
The TERM override is to skip the jitter introduced by the script
when running automated.
The above will respect the unattended-upgrade policy, which may
block certain upgrades. If you want to bypass that, use regular apt:
cumin -b 10 '*' 'apt update ; apt upgrade -yy ; TERM=doit dsa-update-apt-status'
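The two invocations above differ only in the upgrade step, so they can be generated from one place. The sketch below builds the command line without running it. The function name and the "policy"/"force" mode names are invented for this example, and it assumes the host selector contains no single quotes.

```shell
# Build (but do not run) the Cumin command line for a mass-upgrade run.
# $1: batch size, $2: Cumin host selector, $3: "policy" or "force"
#   policy: respect the unattended-upgrade policy
#   force:  bypass it with a regular apt upgrade
mass_upgrade_command() (
    batch="$1"
    selector="$2"
    case "$3" in
        policy) upgrade='unattended-upgrade' ;;
        force)  upgrade='apt upgrade -yy' ;;
        *) echo "unknown mode: $3" >&2; exit 1 ;;
    esac
    printf "cumin -b %s '%s' 'apt update ; %s ; TERM=doit dsa-update-apt-status'\n" \
        "$batch" "$selector" "$upgrade"
)

# Example: print the policy-respecting command for all hosts, 10 at a time.
# mass_upgrade_command 10 '*' policy
```

Printing the command first makes it easy to double-check the selector before pasting it into a shell on the Cumin host.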
Special cases and manual restarts
The above covers all upgrades that are automatically applied, but some are blocked from automation and require manual intervention.
Others do upgrade automatically, but require a manual restart. Normally, needrestart runs after upgrades and takes care of restarting services, but it can't actually deal with everything.
There is a Nagios check that might trigger and tell you that some services are running with outdated libraries. You may see a warning like:
[web-chi-03] needrestart is WARNING: WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
The detailed status information will show you which service it fails to restart:
WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none
Services:
- cron.service
If you cannot figure out why the warning happens, you might want to run the check by hand:
needrestart -v
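When triaging several alerts at once, it can help to pull the number of affected services out of the status line. This is a small sketch that assumes the line has the same shape as the warning quoted above; the function name is made up for this example.

```shell
# Extract the number of services needing a restart from a needrestart
# status line such as the Nagios warning shown above.
# Prints nothing when the line reports no numeric count (e.g. "Services: none").
needrestart_service_count() {
    printf '%s\n' "$1" | sed -n 's/.*Services: \([0-9][0-9]*\).*/\1/p'
}

# Example:
# needrestart_service_count "WARN - Kernel: 5.10.0-15-amd64, Services: 1 (!), Containers: none, Sessions: none"
```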
Packages are blocked from upgrades when they cause significant
breakage during an upgrade run, enough to cause an outage and/or
require significant recovery work. This is done through Puppet, in the
profile::unattended_upgrades class, in the blacklist setting.
Packages can be unblocked if and only if:
- the bug is confirmed as fixed in Debian
- the fix is deployed on all servers and confirmed as working
- we have good confidence that future upgrades will not break the system again
This section documents how to do some of those upgrades and restarts by hand.
GitLab runner upgrades
Every month or so, GitLab publishes an update to the gitlab-runner apt
package. The package is excluded from unattended-upgrades to avoid any
risk of interrupting long-running CI jobs (e.g. large shadow sims).
The recommended procedure is to go through each CI machine one at a time,
pause all the runners on that single machine, ensure no long-running
shadow sims are being executed, and launch apt upgrade. If any regular
CI jobs are running, systemd will wait up to one hour for them to end,
then proceed with the package upgrade.
cron.service
These are typically services that should be run under systemd --user
but are instead started with a @reboot cron job.
For this kind of service, reboot the server or ask the service admin to restart their services themselves. Ideally, this service should be converted to a systemd unit, see this documentation.
ud-replicate special case
Sometimes, userdir-ldap's ud-replicate leaves a multiplexing SSH
process lying around, and those show up as part of cron.service.
Logging into the LDAP server (currently alberti) and killing all the
sshdist processes will clear those:
pkill -u sshdist ssh
systemd user manager services
The needrestart tool lacks the ability to restart user-based systemd
daemons and services. Example below, when running needrestart -rl:
User sessions running outdated binaries:
onionoo @ user manager service: systemd[853]
onionoo-unpriv @ user manager service: systemd[854]
To restart these services, this command may be executed:
systemctl restart user@$(id -u onionoo) user@$(id -u onionoo-unpriv)
Sometimes an error message similar to this is shown:
Job for user@1547.service failed because the control process exited with error code.
The solution here is to run the systemctl restart command again, and
the error should no longer appear.
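The "run it again" workaround can be wrapped in a small retry helper. This is a sketch, not an existing tool: the function names and the RETRY_DELAY variable are invented for this example, and it retries exactly once, as described above.

```shell
# Run a command; if it fails, wait briefly and run it one more time.
# RETRY_DELAY (seconds, default 2) is a knob made up for this sketch.
retry_once() {
    "$@" && return 0
    sleep "${RETRY_DELAY:-2}"
    "$@"
}

# Restart the systemd user manager of each given account, retrying once.
restart_user_managers() {
    for account in "$@"; do
        retry_once systemctl restart "user@$(id -u "$account").service" || return 1
    done
}

# Example, using the accounts from the needrestart output above:
# restart_user_managers onionoo onionoo-unpriv
```

If the second attempt also fails, the helper returns a non-zero status, so the failure is not silently hidden.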
Ganeti
The ganeti.service warning is typically an OpenSSL upgrade that
affects qemu, and restarting ganeti (thankfully) doesn't restart
VMs. To fix this, migrate all VMs to their secondaries and back, see
Ganeti reboot procedures, possibly the instance-only restart
procedure.
Open vSwitch
This is generally the openvswitch-switch and openvswitch-common
services, which are blocked from upgrades because of bug 34185.
To upgrade manually, empty the server, restart, upgrade OVS, then migrate the machines back. It's actually easier to just treat this as a "reboot the nodes only" procedure, see the Ganeti reboot procedures instead.
Note that this might be fixed in Debian bullseye: bug 961746 in Debian is marked as fixed, but it will still need to be tested on our side first. Update: it hasn't been fixed.
Grub
grub-pc (bug 40042) has been known to have issues as well, so
it is blocked. To upgrade, make sure the install device is defined, by
running dpkg-reconfigure grub-pc. This issue might actually have
been fixed in the package, see issue 40185.
Update: this issue has been resolved and grub upgrades are now automated. This section is kept for historical reference, or in case the upgrade path is broken again.
user@ services
Services set up with the new systemd-based startup system documented in doc/services may not automatically restart. They may be (manually) restarted with:
systemctl restart user@1504.service
There's a feature request (bug #843778) to implement support for those services directly in needrestart.
Reboots
Sometimes it is necessary to perform a reboot on the hosts, when the kernel is updated. Nagios will warn about this, with something like this:
WARNING: Kernel needs upgrade [linux-image-4.9.0-9-amd64 != linux-image-4.9.0-8-amd64]
TODO: the above is the old way, the needrestart check has a different output. document it above.
Rebooting a single host
If this is only a virtual machine, and the only one affected, it can
be rebooted directly. This can be done with the tsa-misc script
called reboot:
./reboot -H test-01.torproject.org,test-02.torproject.org
By default, the script will wait 2 minutes between hosts: that should
be changed to 30 minutes if the hosts are part of a mirror network
to give the monitoring systems (mini-nag) time to rotate the hosts
in and out of DNS:
./reboot -H mirror-01.torproject.org,mirror-02.torproject.org --delay-nodes 1800
If the host has an encrypted filesystem and is hooked up with Mandos, it
will return automatically. Otherwise it might need a password to be
entered at boot time, either through the initramfs (if it has the
profile::fde class in Puppet) or manually, after the boot. That is
the case for the mandos-01 server itself, for example, as it
currently can't unlock itself, naturally.
Batch rebooting multiple hosts
IMPORTANT: before following this procedure, make sure that only a subset of the hosts need a restart. If all hosts need a reboot, it's likely going to be faster and easier to reboot entire clusters at once, see the Ganeti reboot procedures instead.
LDAP hosts have information about how they can be rebooted, in the
rebootPolicy field. Here is what the various values mean:
- justdoit - can be rebooted any time, with a 10 minute delay, possibly in parallel
- rotation - part of a cluster where each machine needs to be rebooted one at a time, with a 30 minute delay for DNS to update
- manual - needs to be done by hand or with a special tool (fabric in case of ganeti, reboot-host in the case of KVM, nothing for windows boxes)
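The policies can be turned into a lookup for the pause between hosts. This sketch uses the values passed as --delay-hosts in the batch commands in this section (120 seconds for justdoit, 1800 for rotation). The function name is invented for this example.

```shell
# Map a rebootPolicy value to the pause, in seconds, between two hosts.
# Anything other than justdoit or rotation (e.g. manual) is refused.
reboot_delay_for_policy() {
    case "$1" in
        justdoit) echo 120 ;;
        rotation) echo 1800 ;;
        *) echo "policy '$1' needs manual handling" >&2; return 1 ;;
    esac
}

# Example:
# delay=$(reboot_delay_for_policy rotation) && echo "pause: ${delay}s"
```

Refusing unknown values keeps hosts with a manual policy out of any automated batch.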
Therefore, it's possible to selectively reboot some of those hosts in batches. Again, this is pretty rare: typically, you would either reboot only a single host or all hosts, in which case a cluster-wide reboot (with Ganeti, below) would be more appropriate.
This routine should be able to reboot all hosts with a rebootPolicy
defined to justdoit or rotation:
echo "rebooting 'justdoit' hosts with a 10-minute delay, every 2 minutes...."
./reboot -H $(ssh db.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=justdoit)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R') --delay-shutdown=10 --delay-hosts=120
echo "rebooting 'rotation' hosts with a 10-minute delay, every 30 minutes...."
./reboot -H $(ssh db.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=rotation)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R') --delay-shutdown=10 --delay-hosts=1800
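The hostname extraction buried in the one-liners above can be tried on its own, without touching LDAP. The sketch below reads LDIF on standard input and prints the hostnames as a comma-separated list, the form used with -H in the single-host example. The function name is an assumption, and unlike the commands above it keeps the input order rather than shuffling with sort -R.

```shell
# Read LDIF on stdin and print the values of all "hostname:" attributes
# as one comma-separated line, in input order.
ldif_hostnames() {
    awk '$1 == "hostname:" {print $2}' | paste -sd, -
}

# Example, with the ldapsearch query from above run on the LDAP host:
# ssh db.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ \
#     -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=justdoit)" hostname' \
#     | ldif_hostnames
```

Reviewing the resulting list before handing it to the reboot script is a cheap safeguard against rebooting more hosts than intended.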
Rebooting KVM hosts
What remains is the "manual" procedure, which includes one last KVM host:
./reboot-host moly.torproject.org
... and Ganeti nodes, below.
Rebooting Ganeti nodes
See the Ganeti reboot procedures for this procedure.
Remaining nodes
The Nagios unhandled problems will show remaining hosts that might have been missed by the above procedure.