[[_TOC_]]

# Reboots

Sometimes it is necessary to reboot hosts, such as when the kernel is
updated. Prometheus will warn about this with the `NeedsReboot` alert,
which looks like:

    Servers running bookworm needs to reboot

You can see the list of pending reboots with this Fabric task:

    fab fleet.pending-reboots

See below for how to handle specific situations.

## Full fleet reboot

This is the most likely scenario, especially when we were able to
upgrade all of the servers to the same stable release of Debian.

In this case, the fastest way to run reboots is to reboot Ganeti nodes
with all of their contained instances, in order to clear out reboots
for many servers at once, then reboot the hosts that are not in Ganeti.

Note that to make the reboots run more smoothly, you can temporarily
modify your [yubikey touch policy](howto/yubikey#touch-policy) to
remove the need to always confirm by touching the key.

### Rebooting Ganeti nodes

See the [Ganeti reboot procedures](howto/ganeti#rebooting) for this
procedure.

### Remaining nodes

The [Karma alert dashboard](https://karma.torproject.org/?q=%40state%3Dactive&q=alertname%3DNeedsReboot)
will show remaining hosts that might have been missed by the above
procedure.

If you are doing a fleet-wide reboot and want to run more of it in
parallel, you can, while the Ganeti reboots (above) are running, reboot
the hosts that are _not_ on a Ganeti cluster by pulling the list of
hosts from LDAP:

    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b "ou=hosts,dc=torproject,dc=org" "(!(physicalHost=gnt-*))" hostname' | sed -n '/hostname/{s/hostname: //;p}' | grep -v ".*-node-[0-9]\+\|^#" | paste -sd ',') fleet.reboot-host

## Rebooting a single host

If this is only a virtual machine, and the only one affected, it can be
rebooted directly.
This can be done with the `fabric-tasks` task `fleet.reboot-host`:

    fab -H test-01.torproject.org,test-02.torproject.org fleet.reboot-host

By default, the script will wait 2 minutes between hosts: that should
be changed to *30 minutes* if the hosts are part of a mirror network,
to give the monitoring systems (`mini-nag`) time to rotate the hosts in
and out of DNS:

    fab -H mirror-01.torproject.org,mirror-02.torproject.org fleet.reboot-host --delay-hosts 1800

If the host has an encrypted filesystem and is hooked up with Mandos,
it will return automatically. Otherwise it might need a password to be
entered at boot time, either through the initramfs (if it has the
`profile::fde` class in Puppet) or manually, after the boot. That is
the case for the `mandos-01` server itself, for example, as it
currently can't unlock itself, naturally.

Note that you can cancel a reboot with `--kind=cancel`. This also
cascades down Ganeti nodes.

## Batch rebooting multiple hosts

IMPORTANT: before following this procedure, make sure that only a
subset of the hosts need a restart. If *all* hosts need a reboot, it's
likely going to be faster and easier to reboot entire clusters at
once; see the [Ganeti reboot procedures](howto/ganeti#rebooting)
instead.

NOTE: Reboots will tend to stop for user confirmation whenever packages
get upgraded just before the reboot. To prevent the process from
waiting for your manual input, it is suggested that upgrades are run
first, using cumin. See [how to run upgrades in the section
above](#manual-upgrades-with-cumin).

LDAP hosts have information about how they can be rebooted, in the
`rebootPolicy` field.
Here is what the various values mean:

 * `justdoit` - can be rebooted any time, with a 10 minute delay,
   possibly in parallel
 * `rotation` - part of a cluster where each machine needs to be
   rebooted one at a time, with a 30 minute delay for DNS to update
 * `manual` - needs to be done by hand or with a special tool (fabric
   in case of ganeti, reboot-host in the case of KVM, nothing for
   windows boxes)

Therefore, it's possible to selectively reboot some of those hosts in
batches. Again, this is pretty rare: typically, you would either reboot
only a single host or *all* hosts, in which case a cluster-wide reboot
(with Ganeti, see above) would be more appropriate.

This routine should be able to reboot all hosts with a `rebootPolicy`
set to `justdoit` or `rotation`:

    echo "rebooting 'justdoit' hosts with a 10-minute delay, every 2 minutes...."
    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=justdoit)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R' | paste -sd ',') fleet.reboot-host --delay-shutdown-minutes=10 --delay-hosts-seconds=120

    echo "rebooting 'rotation' hosts with a 10-minute delay, every 30 minutes...."
    fab -H $(ssh db.torproject.org 'ldapsearch -H ldap://db.torproject.org -x -ZZ -b ou=hosts,dc=torproject,dc=org -LLL "(rebootPolicy=rotation)" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort -R' | paste -sd ',') fleet.reboot-host --delay-shutdown-minutes=10 --delay-hosts-seconds=1800

Another example: this will reboot all hosts running Debian `bookworm`,
in random order:

    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.os.distro.codename = \"bookworm\" }'" | jq -r '.[].certname' | sort -R | paste -sd ',') fleet.reboot-host

And this will reboot all hosts with a pending kernel upgrade (updates
only when puppet agent runs), again in random order:

    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.apt_reboot_required = true }'" | jq -r '.[].certname' | sort -R | paste -sd ',') fleet.reboot-host

And this will reboot all *physical* hosts with a pending upgrade, in
alphabetical order:

    fab -H $(ssh puppetdb-01.torproject.org "curl -s -G http://localhost:8080/pdb/query/v4 --data-urlencode 'query=inventory[certname] { facts.apt_reboot_required = true and facts.virtual = \"physical\" }'" | jq -r '.[].certname' | sort | paste -sd ',') fleet.reboot-host

## Userland reboots

systemd 254 (Debian 13 trixie and above) has a special command:

    systemctl soft-reboot

That will "shut down and reboot userspace". As the [manual page
explains](https://manpages.debian.org/testing/systemd/systemd-soft-reboot.service.8.en.html):

> systemd-soft-reboot.service is a system service that is pulled in by
> soft-reboot.target and is responsible for performing a
> userspace-only reboot operation. When invoked, it will send the
> SIGTERM signal to any processes left running (but does not follow up
> with SIGKILL, and does not wait for the processes to exit).
> If the /run/nextroot/ directory exists (which may be a regular
> directory, a directory mount point or a symlink to either) then it
> will switch the file system root to it. It then reexecutes the
> service manager off the (possibly now new) root file system, which
> will enqueue a new boot transaction as in a normal reboot.

This can therefore be used to fix conditions where systemd itself needs
to be restarted, or a lot of processes need to, but not the kernel.

This has not been tested, but could speed up some restart conditions.

## Notifying users

Users should be notified when rebooting hosts. Normally, the
`shutdown(1)` command noisily prints warnings on terminals, which gives
a heads up to connected users, but many services do not rely on
interactive terminals. It is therefore important to notify users over
our chat rooms (currently [IRC](howto/irc)).

The `reboot` script can send notifications when rebooting hosts. For
that, credentials must be supplied, either through the `HTTP_USER` and
`HTTP_PASSWORD` environment variables, or (preferably) through a
`~/.netrc` file. The file should look something like this:

    machine kgb-bot.torproject.org login TPA password REDACTED

The password (`REDACTED` in the above line) is available on the bot
host (currently `chives`) in
`/etc/kgb-bot/kgb.conf.d/client-repo-TPA.conf` or in trocla, with the
`profile::kgb_bot::repo::TPA`.

To confirm this works before running reboots, you should run this
fabric task directly:

    fab kgb.relay "test"

For example:

    anarcat@angela:fabric-tasks$ fab kgb.relay "mic check"
    INFO: mic check

... should result in:

    16:16:26 <KGB-TPA> mic check

When rebooting, the users will see this in the `#tor-admin` channel:

```
13:13:56 <KGB-TPA> scheduled reboot on host web-fsn-02.torproject.org in 10 minutes
13:24:56 <KGB-TPA> host web-fsn-02.torproject.org rebooted
```

A heads up should be (manually) relayed in the `#tor-project` channel,
inviting users to follow that progress in `#tor-admin`.
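
Before starting a long batch, it can be useful to check locally that
one of the two credential sources described above is actually in place.
The following is only a sketch: `kgb_credentials_source` is a
hypothetical helper that is not part of `fabric-tasks`, and checking
the environment before `~/.netrc` is an arbitrary choice here, not
necessarily the order the real script uses.

```shell
#!/bin/sh
# Hypothetical pre-flight helper, not part of fabric-tasks.
# Prints "environment", "netrc" or "missing" depending on where KGB
# credentials could be found. The lookup order is an assumption.
kgb_credentials_source() {
    # Optional first argument: path to the netrc file to inspect.
    netrc_file="${1:-$HOME/.netrc}"
    if [ -n "${HTTP_USER:-}" ] && [ -n "${HTTP_PASSWORD:-}" ]; then
        echo "environment"
    elif [ -r "$netrc_file" ] && grep -Eq \
        '^[[:space:]]*machine[[:space:]]+kgb-bot\.torproject\.org([[:space:]]|$)' \
        "$netrc_file"; then
        echo "netrc"
    else
        echo "missing"
        return 1
    fi
}
```

This only tells you that something is configured, not that the password
is correct: `fab kgb.relay "test"` remains the real end-to-end check.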
Ideally, we would have a map of where each server should send
notifications. For example, the `tb-build-*` servers should notify
`#tor-browser-dev`. This would require a rather more convoluted
configuration, as each KGB "account" is bound to a single channel for
the moment...
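
As a rough illustration of what such a map could look like, here is a
minimal sketch. The `notification_channel` function is hypothetical and
does not exist in `fabric-tasks`; only the `tb-build-*` entry comes from
the text above, and falling back to `#tor-admin` is an assumption. It
also does nothing about the one-channel-per-account limitation.

```shell
#!/bin/sh
# Hypothetical host-to-channel lookup, not an existing feature.
# Only the tb-build-* mapping is taken from this page; the default
# channel is an assumption.
notification_channel() {
    case "$1" in
        tb-build-*) echo "#tor-browser-dev" ;;
        *)          echo "#tor-admin" ;;
    esac
}
```

For example, `notification_channel tb-build-01.torproject.org` would
print `#tor-browser-dev`, while any other hostname would print
`#tor-admin`.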