Changes

anarcat · 1665fe74
--- a/howto/upgrades.md
+++ b/howto/upgrades.md
+This page documents how upgrades are performed across the fleet in the
+Tor project. Typically, we're talking about Debian package upgrades,
+both routine and major upgrades. Service-specific upgrade notes are
+in each service's own page, in the "Upgrades" section.
+Note that reboot procedures have been moved to a separate page, in the
+[reboot documentation](howto/reboots).
 [[_TOC_]]
 # Major upgrades
@@ -399,196 +407,4 @@ those services directly in needrestart.
 # Reboots
-Sometimes it is necessary to perform a reboot on the hosts, when the
+This section was moved to the [reboot documentation](howto/reboots).
-kernel is updated. Prometheus will warn about this with the
-`NeedsReboot` alert, which looks like:
-    Servers running bookworm needs to reboot
-You can see the list of pending reboots with this Fabric task:
-    fab fleet.pending-reboots
-See below for how to handle specific situations.
-## Full fleet reboot
-This is the most likely scenario, especially when we were able to upgrade all of
-the servers to the same stable release of Debian.
-In this case, the fastest way to run reboots is to reboot Ganeti nodes with all
-of their contained instances in order to clear out reboots for many servers at
-once, then reboot the hosts that are not in Ganeti.
-Note that to make the reboots run more smoothly, you can temporarily modify your
-[yubikey touch policy](howto/yubikey#touch-policy) to remove the need to always
-confirm by touching the key.
-### Rebooting Ganeti nodes
-See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this procedure.
-### Remaining nodes
-The [Karma alert
-dashboard](https://karma.torproject.org/?q=%40state%3Dactive&q=alertname%3DNeedsReboot)
-will show remaining hosts that might have been missed by the above procedure.
-But if you want to run more reboots in parallel during a fleet-wide
-reboot, you can, while the Ganeti reboots (above) are running,
-reboot the hosts that are _not_ on a Ganeti cluster by pulling the
-list of hosts from LDAP:
-    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b "ou=hosts,dc=torproject,dc=org" "(!(physicalHost=gnt-*))" hostname' | sed -n '/hostname/{s/hostname: //;p}' | grep -v ".*-node-[0-9]\+\|^#" | paste -sd ',') fleet.reboot-host
-## Rebooting a single host
-If this is only a virtual machine, and the only one affected, it can
-be rebooted directly. This can be done with the `fabric-tasks` task
-`fleet.reboot-host`:
-    fab -H test-01.torproject.org,test-02.torproject.org fleet.reboot-host
-By default, the script will wait 2 minutes between hosts: that should
-be changed to *30 minutes* if the hosts are part of a mirror network
-to give the monitoring systems (`mini-nag`) time to rotate the hosts
-in and out of DNS:
-    fab -H mirror-01.torproject.org,mirror-02.torproject.org fleet.reboot-host --delay-hosts 1800
-If the host has an encrypted filesystem and is hooked up with Mandos, it
-will return automatically. Otherwise it might need a password to be
-entered at boot time, either through the initramfs (if it has the
-`profile::fde` class in Puppet) or manually, after the boot. That is
-the case for the `mandos-01` server itself, for example, as it
-currently can't unlock itself, naturally.
-Note that you can cancel a reboot with `--kind=cancel`. This also
-cascades down Ganeti nodes.
-## Batch rebooting multiple hosts
-IMPORTANT: before following this procedure, make sure that only a
-subset of the hosts need a restart. If *all* hosts need a reboot, it's
-likely going to be faster and easier to reboot entire clusters at
-once; see the [Ganeti reboot procedures](howto/ganeti#rebooting) instead.
-NOTE: Reboots will tend to stop for user confirmation whenever packages get
-upgraded just before the reboot. To prevent the process from waiting for your
-manual input, it is suggested that upgrades are run first, using cumin. See
-[how to run upgrades in the section above](#manual-upgrades-with-cumin).
-LDAP hosts have information about how they can be rebooted, in the
-`rebootPolicy` field. Here is what the various values mean:
- * `justdoit` - can be rebooted any time, with a 10 minute delay,
-   possibly in parallel
- * `rotation` - part of a cluster where each machine needs to be
-   rebooted one at a time, with a 30 minute delay for DNS to update
- * `manual` - needs to be done by hand or with a special tool (Fabric
-   in the case of Ganeti, reboot-host in the case of KVM, nothing for
-   Windows boxes)
-Therefore, it's possible to selectively reboot some of those hosts in
-batches. Again, this is pretty rare: typically, you would either
-reboot only a single host or *all* hosts, in which case a cluster-wide
-reboot (with Ganeti, below) would be more appropriate.
-This routine should be able to reboot all hosts with a `rebootPolicy`
-set to `justdoit` or `rotation`:
-    echo "rebooting 'justdoit' hosts with a 10-minute delay, every 2 minutes...."
-    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=justdoit)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R' | paste -sd ',') fleet.reboot-host --delay-shutdown-minutes=10 --delay-hosts-seconds=120
-    echo "rebooting 'rotation' hosts with a 10-minute delay, every 30 minutes...."
-    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=rotation)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R' | paste -sd ',') fleet.reboot-host --delay-shutdown-minutes=10 --delay-hosts-seconds=1800
-Another example, this will reboot all hosts running Debian `bookworm`,
-in random order:
-    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.os.distro.codename = \"bookworm\" }'" | jq -r '.[].certname' | sort -R | paste -sd ',')
-And this will reboot all hosts with a pending kernel upgrade (the fact
-is only updated when the Puppet agent runs), again in random order:
-    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.apt_reboot_required = true }'" | jq -r '.[].certname' | sort -R | paste -sd ',')
-And this is the list of all *physical* hosts with a pending upgrade, alphabetically:
-    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.apt_reboot_required = true and facts.virtual = \"physical\" }'" | jq -r '.[].certname'  | sort | paste -sd ',')
-## Userland reboots
-systemd 254 (Debian 13 trixie and above) has a special command:
-    systemctl soft-reboot
-That will "shut down and reboot userspace". As the [manual page
-explains](https://manpages.debian.org/testing/systemd/systemd-soft-reboot.service.8.en.html):
-> systemd-soft-reboot.service is a system service that is pulled in by
-> soft-reboot.target and is responsible for performing a
-> userspace-only reboot operation. When invoked, it will send the
-> SIGTERM signal to any processes left running (but does not follow up
-> with SIGKILL, and does not wait for the processes to exit). If the
-> /run/nextroot/ directory exists (which may be a regular directory, a
-> directory mount point or a symlink to either) then it will switch
-> the file system root to it. It then reexecutes the service manager
-> off the (possibly now new) root file system, which will enqueue a
-> new boot transaction as in a normal reboot.
-This can therefore be used to fix conditions where systemd itself
-needs to be restarted, or a lot of processes need to, but not the
-kernel.
-This has not been tested, but could speed up some restart conditions.
-## Notifying users
-Users should be notified when rebooting hosts. Normally, the
-`shutdown(1)` command noisily prints warnings on terminals which will
-give a heads up to connected users, but many services do not rely on
-interactive terminals. It is therefore important to notify users over
-our chat rooms (currently [IRC](howto/irc)).
-The `reboot` script can send notifications when rebooting hosts. For
-that, credentials must be supplied, either through the `HTTP_USER` and
-`HTTP_PASSWORD` environment variables, or (preferably) through a `~/.netrc`
-file. The file should look something like this:
-    machine kgb-bot.torproject.org login TPA password REDACTED
-The password (`REDACTED` in the above line) is available on the bot
-host (currently `chives`) in
-`/etc/kgb-bot/kgb.conf.d/client-repo-TPA.conf` or in trocla, under
-`profile::kgb_bot::repo::TPA`.
-To confirm this works before running reboots, you should run this
-fabric task directly:
-    fab kgb.relay "test"
-For example:
-    anarcat@angela:fabric-tasks$ fab kgb.relay "mic check"
-    INFO: mic check
-... should result in:
-    16:16:26 <KGB-TPA> mic check
-When rebooting, the users will see this in the `#tor-admin` channel:
-```
-13:13:56 <KGB-TPA> scheduled reboot on host web-fsn-02.torproject.org in 10 minutes
-13:24:56 <KGB-TPA> host web-fsn-02.torproject.org rebooted
-```
-A heads up should be (manually) relayed in the `#tor-project` channel,
-inviting users to follow that progress in `#tor-admin`.
-Ideally, we would have a map of where each server should send
-notifications. For example, the `tb-build-*` servers should notify
-`#tor-browser-dev`. This would require a rather more convoluted
-configuration, as each KGB "account" is bound to a single channel for
-the moment...