zen authored
refs team#41974
- Background
- Proposal
- Goals
- Must have
- Non-goals
- Phase 1: Codebase preparation
- Converge in structure
- Converge in substance
- Phase 2: Puppet server switch
- Phase 3: Codebase homogeneity
- Next steps
- Costs
- Per-task worst-case duration estimate
- Per-phase worst-case time estimate
- Timeline
- Alternatives considered
- References
title: Tails and TPA Puppet codebase merge
deadline: 2025-02-26
status: proposed
Background
TPA-RFC-73 identified Puppet as a bottleneck for the merge between TPA and Tails infrastructure, as it blocks keeping, migrating and merging several other services. Merging codebases and ditching one of the Puppet servers is a complex move, so in this document we detail how that will be done.
Proposal
Goals
Must have
- One Puppet Server to rule them all
- Adoption of TPA's solution for handling Puppet modules and ENC
- Convergence in Puppet modules versions
- Commit signing (as it's fundamental for Tails' current backup solution)
Non-goals
This proposal is not about:
- Completely refactoring and deduplicating code, as that will be done step-by-step while we handle each service individually after the Puppet Server merge
- Ditching one way to store secrets in favor of another, as that will be done separately in the future, after both teams have had the chance to experience Trocla and hiera-eyaml
- Tackling individual service merges, such as backups, dns, monitoring and firewall; these will be tackled individually once all infra is under one Puppet Server
- Applying new code standards everywhere; at most, we'll come up with general guidelines that could (maybe should) be used for new code and, in the future, for refactoring
Phase 1: Codebase preparation
This phase ensures that, once Tails code is copied to Tor's Puppet Control repo:
- Code structure will match and be coherent
- Tails code will not affect Tor's infra and Tor's code will not affect Tails infra
Note: Make sure to freeze all Puppet code refactoring on both sides before starting.
Converge in structure
Tails:
- (1.1) Switch from Git submodules to using g10k (#41974)
- (1.2) Remove ENC configuration: Tails doesn't really use it, and the Puppet server switch will implement Tor's instead
- (1.3) Move node definitions under `manifests/nodes.pp` to roles
- (1.4) Switch to the directory structure used by Tor:
  - Move custom non-profile modules (`bitcoind`, `borgbackup`, `etckeeper`, `gitolite`, `rbac`, `reprepro`, `rss2email`, `tails`, `tirewall` and `yapgp`) to `legacy/`. Note: there are no naming conflicts in this case.
  - Make sure to leave only 3rd party modules under `modules/`. There are 2 naming conflicts here (`unbound` and `network`): Tails uses these from voxpupuli and Tor uses custom ones in `legacy/`, so in these cases we deprecate the Tor ones in favor of voxpupuli's.
  - Rename `hieradata` to `data`
  - Rename `profiles` to `site`
- (1.5) Move default configuration to a new `profile::tails` class and include it in all nodes
Converge in substance
Tails:
- (1.6) Rename all profiles from `tails::profile` to `profile::tails`
- (1.7) Ensure all exported resources' tags are prefixed with tails_
- (1.8) Upgrade 3rd-party modules to match TPA versions
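The rename in (1.6) is a mechanical prefix swap on class names. As a minimal sketch of the mapping (the function name is made up for illustration, and a real rename would also have to move files and update every reference to the classes):

```python
def renamed_profile(class_name: str) -> str:
    """Map 'tails::profile::foo' to 'profile::tails::foo'.

    Names outside the old namespace are returned unchanged.
    """
    old_prefix = "tails::profile"
    new_prefix = "profile::tails"
    # Match the namespace itself or anything nested under it, but not
    # look-alikes such as 'tails::profiles'.
    if class_name == old_prefix or class_name.startswith(old_prefix + "::"):
        return new_prefix + class_name[len(old_prefix):]
    return class_name


print(renamed_profile("tails::profile::puppet::eyaml"))
print(renamed_profile("profile::tpa"))
```

The explicit `::` boundary check keeps the sketch from rewriting unrelated names that merely start with the same characters.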
Tor:
- (1.9) Install all 3rd-party modules that are used by Tails but not by Tor
- (1.10) Isolate all exported resources and collectors using tags
- (1.11) Move default configuration to a new `profile::tpa` class and include it in all nodes
- (1.12) Enforce signed commits
- (1.13) Ensure all private data is moved to Trocla and publish the repo (tpo/tpa/team#29387)
- (1.14) Import the `tails::profile::puppet::eyaml` profile into TPA's `profile::puppet::server`
- (1.15) Copy the EYAML keys from the Tails to the Tor puppet server, and adapt `hiera.yaml` to use them
- (1.16) Upgrade 3rd-party modules to match Tails versions
When we say "upgrade", we don't mean upgrading to the latest upstream version of a module, but to the higher of the two versions used by the codebases, while also satisfying dependency requirements.
In other words, we don't "upgrade everything to latest", we "upgrade to Tails" or "upgrade to TPA", depending on the module. That said, it's likely going to be "upgrade to Tails versions" everywhere, considering the Tails codebase is generally tidier.
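To make the "upgrade" rule concrete, the sketch below picks, for each module pinned in both codebases, the higher of the two versions. The module names and version numbers are invented for illustration. The sketch assumes plain dotted numeric versions and does not perform the dependency check, which would still have to be done per module.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version like '4.2.10' into (4, 2, 10) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def convergence_targets(tails: dict, tpa: dict) -> dict:
    """For modules pinned in both codebases, return the higher pinned version."""
    targets = {}
    for module in sorted(tails.keys() & tpa.keys()):
        targets[module] = max(tails[module], tpa[module], key=parse_version)
    return targets


# Hypothetical pins; real values would come from each codebase's module list.
tails_pins = {"stdlib": "9.4.1", "apt": "9.1.0", "concat": "9.0.0"}
tpa_pins = {"stdlib": "8.5.0", "apt": "9.10.0", "postfix": "2.0.0"}

print(convergence_targets(tails_pins, tpa_pins))
```

Comparing parsed tuples rather than raw strings matters here: as strings, `"9.10.0"` would sort below `"9.1.0"`.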
Phase 2: Puppet server switch
This phase moves all nodes from one Puppet server to the other:
- (2.1) Copy code (`legacy` modules and profiles) from Tails to Tor
- (2.2) Create a flag that determines whether a node is Tails or TPA and which base class it should include, and assign nodes to their corresponding base class using the flag above
- (2.3) Point Tails nodes to the Tor Puppet server
- (2.4) Retire the Tails Puppet server
Phase 3: Codebase homogeneity
This phase paves the way towards a cleaner future:
- (3.1) Remove all `tails::profile::puppet` profiles
- (3.2) Merge the 8 conflicting Tails and TPA profiles:
  - `grub`
  - `limesurvey`
  - `mta`
  - `nginx`
  - `podman`
  - `rspamd`
  - `sudo`
  - `sysctl`
- (3.3) Move the remaining 114 non-conflicting Tails profiles to `profile` (without `::tails`)
At this point, we'll have 244 profiles.
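The split between (3.2) and (3.3) amounts to a set comparison on profile names: names present on both sides conflict and need merging, while names that exist only on the Tails side can simply move. A minimal sketch, in which the helper name and any profile names not listed in (3.2) are hypothetical:

```python
def split_profiles(tails: set, tpa: set):
    """Return (conflicting, tails_only) profile names, each as a sorted list."""
    conflicting = sorted(tails & tpa)  # same name on both sides: merge by hand
    tails_only = sorted(tails - tpa)   # no clash: can move to `profile` as-is
    return conflicting, tails_only


# 'grub', 'nginx' and 'sudo' come from the list above; the rest are made up.
tails_profiles = {"grub", "nginx", "sudo", "jenkins", "weblate"}
tpa_profiles = {"grub", "nginx", "sudo", "prometheus"}

conflicting, tails_only = split_profiles(tails_profiles, tpa_profiles)
print("merge:", conflicting)
print("move:", tails_only)
print("total after merge:", len(tails_profiles | tpa_profiles))
```

Assuming each conflicting pair ends up as a single profile, the total after the merge is the size of the union of both sets.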
Next steps
From here on, there's a single code base on a single Puppet server, and nodes from both fleets (Tails and TPA) use the same environment.
The code base is not fully merged yet, however. A possible way forward to merge services might be like this:
- To "merge" a service, a class existing in one profile (say `profile::prometheus` from `profile::tpa`) is progressively added to all nodes on the other side, and eventually to the other profile (say `profile::tails`)
So while we don't have a detailed step-by-step plan to merge all services, the above should give us general guidelines to merge services on an as-needed basis and make progress on the merge roadmap.
Costs
To estimate costs of tasks in days of work, we use the same parameters as proposed in Jacob Kaplan-Moss' estimation technique.
"Complexity" estimates the size of a task in days, accounting for all other things a worker has to deal with during a normal workday:
Complexity | Time |
---|---|
small | 1 day |
medium | 3 days |
large | 1 week (5 days) |
extra-large | 2 weeks (10 days) |
"Uncertainty" is a scale factor applied to the length to get a pessimistic estimate if things go wrong:
Uncertainty Level | Multiplier |
---|---|
low | 1.1 |
moderate | 1.5 |
high | 2.0 |
extreme | 5.0 |
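As a worked example of how the two tables combine, the sketch below multiplies a task's complexity (in days) by its uncertainty multiplier to get the worst case. The function and dictionary names are illustrative and not part of any existing tooling.

```python
# Values copied from the two tables above; names are made up for this sketch.
COMPLEXITY_DAYS = {"small": 1, "medium": 3, "large": 5, "extra-large": 10}
UNCERTAINTY_MULTIPLIER = {"low": 1.1, "moderate": 1.5, "high": 2.0, "extreme": 5.0}


def worst_case_days(complexity: str, uncertainty: str) -> float:
    """Return the pessimistic estimate: expected days times the multiplier."""
    expected = COMPLEXITY_DAYS[complexity]
    # Round to one decimal to hide floating-point noise (e.g. 3 * 1.1).
    return round(expected * UNCERTAINTY_MULTIPLIER[uncertainty], 1)


print(worst_case_days("medium", "low"))     # medium task, low uncertainty
print(worst_case_days("large", "extreme"))  # large task, extreme uncertainty
```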
Per-task worst-case duration estimate
Task | Codebase | Complexity | Uncertainty | Expected (days) | Worst case (days) |
---|---|---|---|---|---|
(1.1) Switch to g10k | Tails | small | high | 2 | 4 |
(1.2) Remove ENC | Tails | small | low | 1 | 1.1 |
(1.3) Move nodes to roles | Tails | medium | low | 3 | 3.3 |
(1.4) Switch directory structure | Tails | small | moderate | 1 | 1.5 |
(1.5) Create default profile | Tails | small | moderate | 1 | 1.5 |
(1.6) Rename Tails profiles | Tails | small | low | 1 | 1.1 |
(1.7) Prefix exported resources | Tails | medium | low | 3 | 3.3 |
(1.8) Upgrade 3rd party modules | Tails | large | moderate | 5 | 7.5 |
(1.9) Install missing 3rd party modules | Tor | small | low | 1 | 1.1 |
(1.10) Prefix exported resources | Tor | medium | low | 3 | 3.3 |
(1.11) Create default profile | Tor | small | moderate | 1 | 1.5 |
(1.12) Enforce signed commits | Tor | medium | moderate | 3 | 4.5 |
(1.13) Move private data to Trocla | Tor | large | moderate | 5 | 7.5 |
(1.14) Publish repository | Tor | large | moderate | 5 | 7.5 |
(1.15) Enable EYAML | Tor | small | low | 1 | 1.1 |
(1.16) Upgrade 3rd party modules | Tor | extra-large | high | 10 | 20 |
(2.1) Copy code | Tor | small | low | 1 | 1.1 |
(2.2) Differentiate Tails and Tor nodes | Tor | small | moderate | 1 | 1.5 |
(2.3) Switch Tails' nodes to Tor's Puppet server | Tor | large | extreme | 5 | 25 |
(2.4) Retire the Tails Puppet server | Tor | small | low | 1 | 1.1 |
(3.1) Ditch the Tails' Puppet profile | Tor | small | low | 1 | 1.1 |
(3.2) Merge conflicting profiles | Tor | large | extreme | 5 | 25 |
(3.3) Ditch the `profile::tails` namespace | Tor | small | low | 1 | 1.1 |
Per-phase worst-case time estimate
Task | Worst case (days) | Worst case (weeks) |
---|---|---|
Phase 1: Codebase preparation | 69.8 | 17.45 |
Phase 2: Puppet server switch | 28.7 | 7.2 |
Phase 3: Codebase homogeneity | 27.2 | 6.8 |
Worst case duration: 125.7 days =~ 31.5 weeks
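The per-phase figures can be reproduced by summing the worst-case column of the per-task table, as sketched below. The weeks column is consistent with dividing days by 4. That divisor is inferred from the numbers, not stated in the text, so treat it as an assumption.

```python
# Worst-case days per task, copied in order from the per-task table.
WORST_CASE = {
    "Phase 1": [4, 1.1, 3.3, 1.5, 1.5, 1.1, 3.3, 7.5,
                1.1, 3.3, 1.5, 4.5, 7.5, 7.5, 1.1, 20],
    "Phase 2": [1.1, 1.5, 25, 1.1],
    "Phase 3": [1.1, 25, 1.1],
}

# Assumption: inferred from the weeks column, not stated in the document.
DAYS_PER_WEEK = 4


def phase_totals(tasks_by_phase: dict) -> dict:
    """Sum worst-case days per phase, rounded to one decimal."""
    return {phase: round(sum(days), 1) for phase, days in tasks_by_phase.items()}


totals = phase_totals(WORST_CASE)
overall = round(sum(totals.values()), 1)

for phase, days in totals.items():
    print(f"{phase}: {days} days, about {days / DAYS_PER_WEEK:.2f} weeks")
print(f"Total: {overall} days")
```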
Timeline
The following parallel activities will probably influence (i.e. delay) this plan:
- Upgrade to Debian Trixie: maybe start in March, ideally finish by the end of 2025
- North-hemisphere summer vacations
Based on the above estimates, taking into account the potential delays, and stretching it a bit for a worst-case scenario, here is a rough per-month timeline:
- March (all Tails):
  - (1.1) Switch to g10k
  - (1.2) Remove ENC
  - (1.3) Move nodes to roles
  - (1.4) Switch directory structure
- April:
  - (1.5) Create default profile
  - (1.6) Rename Tails profiles
  - (1.7) Prefix exported resources
  - (1.8) Upgrade 3rd party modules (Tails)
- May:
  - (1.8) Upgrade 3rd party modules (Tails) (continuation)
  - (1.9) Install missing 3rd party modules (Tor)
  - (1.10) Prefix exported resources (Tor)
  - (1.11) Create default profile (Tor)
- June (all Tor from now on):
  - (1.12) Enforce signed commits
  - (1.13) Move private data to Trocla
- July:
  - (1.14) Publish repository
  - (1.15) Enable EYAML
  - (1.16) Upgrade 3rd party modules
- August:
  - (1.16) Upgrade 3rd party modules (continuation)
- September:
  - (2.1) Copy code
  - (2.2) Differentiate Tails and Tor nodes
  - (2.3) Switch Tails' nodes to Tor's Puppet server
- October:
  - (2.3) Switch Tails' nodes to Tor's Puppet server (continuation)
- November:
  - (2.4) Retire the Tails Puppet server
  - (3.1) Ditch the Tails' Puppet profile
- December:
  - (3.2) Merge conflicting profiles
- January:
  - (3.2) Merge conflicting profiles (continuation)
  - (3.3) Ditch the `profile::tails` namespace
Alternatives considered
- Migrate services to TPA before moving Puppet: some of the Tails services heavily depend on others and/or on the network setup. For example, Jenkins Agents on different machines talk to a Jenkins Orchestrator and a Gitolite server hosted on different VMs, then build nightly ISOs that are copied to the web VM and published over HTTP. Migrating all of these over to TPA's infra would be much more complex than just merging Puppet.