policy.md +4 −4

@@ -28,13 +28,13 @@ the Git repository for this wiki, run the command:

* [TPA-RFC-45: Mail architecture](policy/tpa-rfc-45-mail-architecture)
* [TPA-RFC-47: Email account retirement](policy/tpa-rfc-47-email-account-retirement)
* [TPA-RFC-66: Migrate to Gitlab Ultimate Edition](policy/tpa-rfc-66-gitlab-ultimate-program)
* [TPA-RFC-80: Debian trixie upgrade schedule](policy/tpa-rfc-80-debian-trixie-upgrade-schedule)

## Proposed

* [TPA-RFC-77: Puppet merge](policy/tpa-rfc-77-puppet-merge)
* [TPA-RFC-78: Dangerzone retirement](policy/tpa-rfc-78-dangerzone-retirement)
* [TPA-RFC-79: General merge request workflows](policy/tpa-rfc-79-general-merge-request-workflows)
* [TPA-RFC-80: Debian trixie upgrade schedule](policy/tpa-rfc-80-debian-trixie-upgrade-schedule)

## Standard

policy/tpa-rfc-80-debian-trixie-upgrade-schedule.md +71 −37

@@ -3,11 +3,16 @@ title: TPA-RFC-80: Debian trixie upgrade schedule

costs: staff, 4+ weeks
approval: TPA, service admins
affected users: TPA, service admins
deadline: TODO
status: draft
deadline: 2 weeks, 2025-03-18
status: proposed
discussion: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41990
---

Summary: start upgrading servers during the Debian "trixie" freeze; if
it goes well, complete most of the fleet upgrade around June 2025, with
full completion by the end of 2025, leaving 2026 entirely free of major
upgrades. Improve automation.

# Background

Debian 13 "trixie", currently "testing", is going into freeze soon, which

@@ -58,14 +63,15 @@ and proposal like this one would link against the upstream
release notes. Unfortunately, at the time of writing, upstream hasn't
yet produced release notes (as we're still in testing).

TODO: well the above sounds bad. maybe we shouldn't upgrade during
freeze after all?
We're hoping the procedure will be fine-tuned by the time we're ready
to coordinate the second batch of updates, around May 2025, when we
will send reminders to affected teams.

## Upgrade schedule

The upgrade is split in multiple batches:

- installer changes: TODO

- automation and installer changes

- low complexity: mostly TPA services and less critical Tails servers

@@ -76,7 +82,7 @@ The upgrade is split in multiple batches:

- high complexity: Tails VMs running services not from the official
  Debian repositories

- cleanup: TODO

- cleanup

The free time between the first two batches will also allow us to
cover for unplanned contingencies: upgrades that could drag on and

@@ -87,6 +93,21 @@ that should be "fun" for the team. This policy has proven to be
effective in the previous upgrades and we are eager to repeat it again.

### Upgrade automation and installer changes

First, we tweak the installers to deploy trixie by default to avoid
installing further "old" systems. This includes the bare-metal
installers but also, and especially, the virtual machine installers and
container images.

We also want to work on automating the upgrade procedure further. We've
had catastrophic errors in the PostgreSQL upgrade procedure in the
past, in particular, but the whole procedure is now considered ripe for
automation, see [tpo/tpa/team#41485][] for details.

[tpo/tpa/team#41485]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41485

### Batch 1: low complexity, April-May 2025

This is actually scheduled in two weeks: TPA boxes will be upgraded in

@@ -158,7 +179,9 @@ this work, in a single week.

[first batch of bookworm machines]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41251

Feedback and coordination of this batch happens in [issue batch 1 TODO]().

Feedback and coordination of this batch happens in [issue batch 1][].
[issue batch 1]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42071

### Batch 2: moderate complexity, May-June 2025

@@ -241,7 +264,9 @@ will likely take us 60 hours (or two weeks) to complete the upgrade.

[second batch of bookworm upgrades]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41252

Feedback and coordination of this batch happens in [issue batch 2 TODO]().

Feedback and coordination of this batch happens in [issue batch 2][].

[issue batch 2]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42070

### Batch 3: high complexity, 2025 Q3-Q4

@@ -257,21 +282,21 @@ eventually be made part of the second batch.

15 TPA machines:

```
alberti.torproject.org
dal-node-01.torproject.org
dal-node-02.torproject.org
dal-node-03.torproject.org
fsn-node-01.torproject.org
fsn-node-02.torproject.org
fsn-node-03.torproject.org
fsn-node-04.torproject.org
fsn-node-05.torproject.org
fsn-node-06.torproject.org
fsn-node-07.torproject.org
fsn-node-08.torproject.org
nevii.torproject.org
pauli.torproject.org
puppetdb-01.torproject.org
- [ ] alberti.torproject.org
- [ ] dal-node-01.torproject.org
- [ ] dal-node-02.torproject.org
- [ ] dal-node-03.torproject.org
- [ ] fsn-node-01.torproject.org
- [ ] fsn-node-02.torproject.org
- [ ] fsn-node-03.torproject.org
- [ ] fsn-node-04.torproject.org
- [ ] fsn-node-05.torproject.org
- [ ] fsn-node-06.torproject.org
- [ ] fsn-node-07.torproject.org
- [ ] fsn-node-08.torproject.org
- [ ] nevii.torproject.org
- [ ] pauli.torproject.org
- [ ] puppetdb-01.torproject.org
```

It seems like the [bookworm Ganeti upgrade][] took roughly 10h of

@@ -281,17 +306,17 @@ possibly 20h.
11 Tails machines:

```
isoworker1.dragon
isoworker2.dragon
isoworker3.dragon
isoworker4.dragon
isoworker5.dragon
isoworker6.iguana
isoworker7.iguana
isoworker8.iguana
jenkins.dragon
survey.lizard
translate.lizard
- [ ] isoworker1.dragon
- [ ] isoworker2.dragon
- [ ] isoworker3.dragon
- [ ] isoworker4.dragon
- [ ] isoworker5.dragon
- [ ] isoworker6.iguana
- [ ] isoworker7.iguana
- [ ] isoworker8.iguana
- [ ] jenkins.dragon
- [ ] survey.lizard
- [ ] translate.lizard
```

[bookworm Ganeti upgrade]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/41254

@@ -299,11 +324,20 @@ translate.lizard

The challenge with Tails upgrades is the coordination with the Tails
team, in particular for the Jenkins upgrades.

Feedback and coordination of this batch happens in [issue batch 3 TODO]().

Feedback and coordination of this batch happens in [issue batch 3][].
[issue batch 3]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/42069

### Cleanup work

## Upgrade automation

Once the upgrade is completed and the entire fleet is again running a
single OS, it's time for cleanup. This involves updating configuration
files to the new versions and removing old compatibility code in
Puppet, removing old container images, and generally wrapping things
up.

TODO: document we want to start automating upgrades more

This process has been historically neglected, but we're hoping to wrap
this up, worst case in 2026.

# Alternatives considered