title: TPA-RFC-82: Merge Tails and Tor support policies
deadline: 2025-04-14
status: proposed
Summary: merge Tails rotations with TPA's star of the week into a single role, merge Tails and TPA's support policies.
Background
The Tails and Tor merge process created a situation in which there are now two separate infrastructures as well as two separate support processes and policies. The full infrastructure merge is expected to take 5 years to complete, but we want to prioritize merging the teams into a single entity.
Proposal
As much as reasonably possible, every team member should be able to handle issues on both TPA and Tails infrastructure. Decreasing the level of specialization will allow for sharing support workload in a way that is more even and spaced out for all team members.
Goals
Must have
- A list of tasks that should be handled during rotations that includes triage, routine tasks and interruption handling and comprises all expectations for both the TPA "star of the week" and the Tails "sysadmin on shift"
- A process to make sure every TPA member is able to support both infrastructures
- Guidelines for directing users to the correct place or process to get support
Non-Goals
Merging the following is not a goal of this policy:
- Tools used by each team
- Mailing lists
- Technical workflows
The goal is simply to make everyone comfortable working on both sides of the infra and to merge rotation shifts.
Support tasks
TPA-RFC-2: Support defines different support levels, but in the context of this proposal we use the tasks that are the responsibility of the "star of the week" as a basis for the merge of rotation shifts:
- Triage of new issues
- Routine tasks
- Keep an eye on the monitoring system (karma and #tor-alerts on IRC)
- Organise incident response
Tails processes are merged into each of the items above, although on different timelines.
Triage of new issues
For triage of new issues, we abolish the previous processes used by Tails, and users of Tails services should now:
- Stop creating new issues in the tpo/tpa/tails-sysadmin> project, and instead start using the tpo/tpa/team> project or dedicated projects when available (eg. tpo/tpa/puppet-weblate>).
- Stop using the ~"To Do" label, and start using per-service labels, when available, or the generic Tails label when the relevant Tails service doesn't have a specific label.
Triage of Tails issues will follow the same triage process as other TPA issues and, apart from the changes listed above, the process should be the same for any user requesting support.
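The routing rule above can be sketched as a simple lookup with fallbacks. This is only an illustration: the `DEDICATED_PROJECTS` and `SERVICE_LABELS` mappings, the service key `weblate` and the label name `Weblate` are hypothetical placeholders, and only the `tpo/tpa/team` default, the `tpo/tpa/puppet-weblate` example and the generic Tails label come from this proposal.

```python
from typing import NamedTuple


class Route(NamedTuple):
    project: str
    label: str


# Defaults named in this proposal.
DEFAULT_PROJECT = "tpo/tpa/team"
GENERIC_LABEL = "Tails"

# Hypothetical mappings, for illustration only: the real set of
# dedicated projects and per-service labels lives in GitLab.
DEDICATED_PROJECTS = {"weblate": "tpo/tpa/puppet-weblate"}
SERVICE_LABELS = {"weblate": "Weblate"}


def route_issue(service: str) -> Route:
    """Pick the project and label for a new Tails support request.

    Use the dedicated project and per-service label when they exist,
    otherwise fall back to the team project and the generic label.
    """
    project = DEDICATED_PROJECTS.get(service, DEFAULT_PROJECT)
    label = SERVICE_LABELS.get(service, GENERIC_LABEL)
    return Route(project, label)
```

For a service without a dedicated project or label, `route_issue` falls back to `tpo/tpa/team` and the generic Tails label, which mirrors the "when available" wording in the list above.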
Routine tasks
The following routine tasks are expected from the Tails Sysadmin on shift:
- update ACLs upon request (eg. Gitolite, GitLab, etc)
- major upgrades of operating systems
- manual upgrades (such as Jenkins, Weblate, etc)
- reboot and restart systems for security issues or faults
- interface with providers
- update GitLab configuration (using gitlab-config)
- process abuse reports in Tails' GitLab
Most of these were already described in TPA's "routine" tasks and the ones that were not are now also explicitly included there. Note that, until the infra merge is complete, these tasks will have to be performed on both infras.
The following processes were explicitly mentioned as expectations of Tails Sysadmins (not necessarily on shift), and are either superseded by the current processes TPA has in place to organize its work or simply made obsolete:
task | action |
---|---|
avoid work duplication | superseded by TPA's triage process and check-ins |
support the sysadmin on shift | superseded by TPA's triage process and check-ins |
cover for the sysadmin on shift after 48h of MIA | obsolete |
self-evaluation of work | obsolete |
shift schedule | eventually replaced by TPA rotations ("star of the week") |
Jenkins upgrade (including plugins) | absorbed by TPA as a new task |
LimeSurvey upgrade | absorbed by TPA with the LimeSurvey merge |
Weblate upgrade | absorbed by TPA as a new task |
Monitoring system
As per TPA-RFC-73, the plan is to ditch Tails' Icinga2 in favor of Tor's Prometheus, which is blocked on a significant part of the Puppet merge.
Asking the TPA crew to get used to Tails Icinga2 in the meantime is not a good option because:
- Tor has recently ditched Icinga, and asking them to adopt something like it once again would be demotivating
- The system will eventually change anyway and using people's time to adopt it would not be a good investment of resources.
Because of the above, we choose to delay the merge of tasks that depend on the monitoring system until after Puppet is merged and the Tails infra has been migrated to Prometheus. The estimate is that we could start working on the migration of the monitoring system in November 2025, so we should probably not count on having that finished before the end of 2025.
This decision impacts some of the routine tasks (eg. examine disk usage, check for the need of server reboots) and "keeping an eye on the monitoring system" in general. In the meantime, we can merge triage, routine tasks that don't depend on the monitoring system and organization of incident response.
Incident response
Tails doesn't have a formal incident response process, so in this case the TPA process is just adopted as is.
Support merge process
The merge process is incremental:
- Phase 0: Separate shifts (this is what happens now)
- Phase 1: Triage and organization of incident response
- Phase 2: Routine tasks
- Phase 3: Merged support
Phase 0 - Separate shifts
This phase corresponds to what happens now: there are 2 different support teams essentially giving support for 2 different infras.
Phase 1 - Triage and organization of incident response
During this period, the TPA star of the week works in conjunction with the Tails Sysadmin on shift on triage of new issues and organisation of incident response, when needed.
Each week there'll be two people looking at the relevant dashboards, and they should communicate to resolve questions that may arise about triage. Similarly, if there are incidents, they'll coordinate to organize the response together.
Phase 2 - Routine tasks
Once Tails monitoring has been migrated to Prometheus, the TPA star of the week and the Tails Sysadmin on shift can start collaborating on routine tasks and, when possible, start working on issues related to "each other's infra".
In this phase we still maintain 2 different support calendars, and Tails+Tor support pairs are changed every week according to these calendars.
Note that there are many more support requests on the TPA side, and far fewer sysadmin hours on the Tails side, so this should be done proportionately. The idea is to allow for smooth onboarding of both teams on both infras, so they should support each other to make sure any questions are answered and any blockers are removed.
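As a rough illustration of how the weekly pairs fall out of two independent calendars, the sketch below cycles through each rotation and pairs the people on shift in the same week. The function name and the rotation lists are hypothetical, and the sketch does not model the proportionality mentioned above or any real calendar data.

```python
from itertools import cycle, islice


def weekly_pairs(tpa_rotation, tails_rotation, weeks):
    """Pair the TPA star of the week with the Tails sysadmin on shift.

    Each rotation is cycled independently, one person per week, so
    rotations of different lengths yield changing pairs over time.
    """
    pairs = zip(cycle(tpa_rotation), cycle(tails_rotation))
    return list(islice(pairs, weeks))
```

With a three-person TPA rotation and a two-person Tails rotation, every TPA member ends up paired with every Tails member within six weeks, which is the kind of cross-exposure this phase is meant to provide.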
Some routine tasks that are not related to monitoring may start earlier than the date set for Phase 2 in the timeline below. Upgrades to Debian Trixie are one example of an activity that will help both teams get comfortable with each other's infra: "To help with merging rotations in the two teams, TPA staff will upgrade Tails machines, with Tails folks assistance, and vice-versa."
Phase 3 - Merged support
Every TPA member is now able to conduct all routine tasks and handle triage and interrupts in both infrastructures. We abolish the "Tails Sysadmin Shifts" calendar and incorporate all TPA members in the "Star of the week" rotation calendar.
Scope
Affected users
This policy mainly affects TPA members and any user of Tails services who needs to make a support request. The most impacted users are members of the Tails Team, as they are the main users of the Tails services, and, eventually, members of the Community and Fundraising teams, as they are likely users of some Tails services such as the Tails website and Weblate.
Timeline
Phase | Timeline |
---|---|
Phase 0 - Separate shifts | now - mid-April 2025 |
Phase 1 - Triage and organization of incident response | mid-April - December 2025 |
Phase 2 - Routine tasks | January 2026 |
Phase 3 - Merged support | April 2026 |