title: TPA-RFC-2: support
Summary: to get help, open a ticket, ask on IRC for simple things, or send us an email for private things. TPA doesn't manage all services (service admin definition). Criteria for supported services and support levels.
It is important to define how users get help from the sysadmin team (AKA "TPA"), what counts as an emergency for it, and what it supports. So far, only the first has been defined, rather informally, and it has yet to be collectively agreed upon within the larger team.
This proposal aims to document the current situation and propose new support levels and a support policy that will provide clear guidelines and expectations for the various teams inside TPO.
This first emerged during an audit of the TPO infrastructure by anarcat in July 2019 (ticket 31243), itself taken from section 2 of the "ops report card", which asks: Are "the 3 empowering policies" defined and published? Those policies are defined as:
- How do users get help?
- What is an emergency?
- What is supported?
We translate these into the following policy proposals:
- Support channels
- Support levels
- Supported services, which includes the service admins definition and how services transition between the teams (if at all)
We encourage everyone to document support requests and questions and to communicate them to the team.
Quick question: chat
If you have "just a quick question" or some quick thing we can help you with, ask us on IRC: you can find us on irc.oftc.net and in other Tor channels.
We may ask you to create a ticket if we're in a pinch. IRC is also a good way to bring our attention to some emergency or ticket that was filed elsewhere.
Bug reports, feature requests and others: issue tracker
Most requests and questions should go into the issue tracker, which is currently GitLab (direct link to a new ticket form). Try to find a good label describing the service you're having a problem with, but when in doubt, just file the issue with as much detail as you can.
You can also mark an issue as confidential, in which case only members of the team (and the larger "tpo" organisation on GitLab) will be able to read it.
(Note that the issue tracker will be changed to GitLab shortly, at which point the above links will be updated.)
Private question and fallback: email
If you want to discuss a sensitive matter that requires privacy or are unsure how to reach us, you can always write to us by email, at email@example.com.
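The three channels above amount to a small decision rule. The following is only a sketch of that rule: the function name, the flags, and the choice to let privacy take precedence over "quick" are assumptions made for illustration, not part of the policy.

```python
from enum import Enum


class Channel(Enum):
    CHAT = "IRC"
    ISSUE_TRACKER = "issue tracker"
    EMAIL = "email"


def pick_channel(quick: bool = False, private: bool = False,
                 unsure: bool = False) -> Channel:
    """Pick a support channel (hypothetical helper, not a TPA tool)."""
    # Sensitive matters, or not knowing how to reach the team,
    # both fall back to email. Precedence here is an assumption.
    if private or unsure:
        return Channel.EMAIL
    # "Just a quick question" goes to chat.
    if quick:
        return Channel.CHAT
    # Bug reports, feature requests and everything else become tickets.
    return Channel.ISSUE_TRACKER
```

For example, `pick_channel(quick=True)` selects chat, while a call with no flags selects the issue tracker, which matches the "most requests" default above.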
We consider there are three "support levels" for problems that come up with services:
- code red: immediate emergency, fix ASAP
- code yellow: serious problem that doesn't require immediate attention but that could turn into a code red if nothing is done
- routine: file a bug report, we'll get to it soon!
We do not have 24/7 on-call support, so requests are processed during work times of available staff. We do try to provide continuous support as much as possible, but it's possible that some weekends or vacations are unattended for more than a day. This is the definition of a "business day".
The TPA team is currently small, and there might be specific situations where a code red requires more time than expected. As an organization, we need to make an effort to understand that.
A "code red" is a critical condition that requires immediate action. It's what we consider an "emergency". Our SLA for those is 24 hours, counted in business days as defined above. Services qualifying for a code red are:
Other services fall under "routine" or "code yellow" below, which can be upgraded in priority.
Examples of problems falling under code red include:
- website unreachable
- emails to torproject.org not reaching our server
Some problems fall under other teams and are not the responsibility of TPA, even if they can be otherwise considered a code red.
So, for example, these are not code reds for TPA:
- website has a major design problem rendering it unusable
- donation backend failing because of a problem in CiviCRM
- gmail refusing all email forwards
- encrypted mailing lists failures
- gitolite refuses connections
A "code yellow" is a situation where we are overwhelmed but there isn't exactly an immediate emergency to deal with. A good introduction is this SRECON19 presentation (slides). The basic idea is that a code yellow is a "problem [that] creeps up on you over time and suddenly the hole is so deep you can’t find the way out".
There's no clear timeline on when such a problem can be resolved. If the problem is serious enough, it may eventually be upgraded to a code red with the approval of a team lead after a week's delay, regardless of the affected service. In that case, a "hot fix" (some hack like throwing hardware at the problem) may be deployed instead of fixing the actual long term issue, in which case the problem becomes a code yellow again.
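The escalation rule above can be read as a small state transition. The sketch below is one possible reading, under the assumption that "a week's delay" is measured from when the problem was first flagged; all names are hypothetical and no such tool exists in TPA.

```python
from datetime import datetime, timedelta
from enum import Enum


class Level(Enum):
    ROUTINE = "routine"
    CODE_YELLOW = "code yellow"
    CODE_RED = "code red"


# Assumption: the "week's delay" counts from when the problem was flagged.
ESCALATION_DELAY = timedelta(weeks=1)


def may_upgrade_to_red(level: Level, flagged_at: datetime,
                       now: datetime, lead_approved: bool) -> bool:
    """True if a code yellow may be upgraded to a code red."""
    return (level is Level.CODE_YELLOW
            and lead_approved
            and now - flagged_at >= ESCALATION_DELAY)


def after_hot_fix(level: Level) -> Level:
    """A hot fix on a code red turns the problem back into a code yellow."""
    return Level.CODE_YELLOW if level is Level.CODE_RED else level
```

Note that the long term issue is still open after `after_hot_fix`: the level drops back, but the problem is not closed.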
Examples of a code yellow include:
- Trac gets overwhelmed (ticket 29672)
- Gitweb performance problems (ticket 32133)
- upgrade metrics.tpo to buster in the hope of fixing broken graphs (ticket 32998)
Routine tasks are normal requests that are not an emergency and can be processed as part of the normal workflow.
Examples of routine tasks include:
- account creation
- group access changes
- email alias changes
- static web component changes
- examine disk usage warning
- security upgrades
- server reboots
Services supported by TPA must fulfill the following criteria:
- The software needs to have an active release cycle
- It needs to provide installation instructions and debugging procedures
- It needs to maintain a bug tracker and/or some means to contact upstream
- Debian GNU/Linux is the only supported operating system, and TPA supports only the "stable" and "oldstable" distributions, until the latter becomes EOL
- At least two people from the Tor community should be willing to help maintain the service
Note that TPA does not support Debian LTS.
Also note that it is the responsibility of service admins (see below) to upgrade their services to keep up with the Debian release schedule.
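The criteria above can be expressed as a checklist. This is a minimal sketch, assuming a hypothetical `Candidate` record filled in by hand; the field names and the `oldstable_is_eol` flag are illustrative and do not correspond to any existing TPA tooling.

```python
from dataclasses import dataclass

SUPPORTED_SUITES = {"stable", "oldstable"}
MIN_MAINTAINERS = 2


@dataclass
class Candidate:
    name: str
    active_release_cycle: bool
    has_install_and_debug_docs: bool
    has_tracker_or_upstream_contact: bool
    debian_suite: str
    maintainers: int


def unmet_criteria(svc: Candidate, oldstable_is_eol: bool = False) -> list:
    """Return the list of criteria the candidate service fails."""
    unmet = []
    if not svc.active_release_cycle:
        unmet.append("active release cycle")
    if not svc.has_install_and_debug_docs:
        unmet.append("installation instructions and debugging procedures")
    if not svc.has_tracker_or_upstream_contact:
        unmet.append("bug tracker or upstream contact")
    # oldstable only counts until it becomes EOL.
    suites = SUPPORTED_SUITES - ({"oldstable"} if oldstable_is_eol else set())
    if svc.debian_suite not in suites:
        unmet.append("supported Debian suite")
    if svc.maintainers < MIN_MAINTAINERS:
        unmet.append("at least two maintainers")
    return unmet
```

An empty result means the service meets every listed criterion; anything else names what is still missing before TPA could support it.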
(Note: this section used to live in doc/admins and is the current "service admin" definition, mostly untouched.)
Within the admin team we have system admins (also known as sysadmins, TSA or TPA) and service admins. While the distinction between the two might seem blurry, the rule of thumb is that sysadmins do not maintain every service that we offer. Rather, they maintain the underlying computers -- make sure they get package updates, make sure they stay on the network, etc.
Then it's up to the service admins to deploy and maintain their services (onionoo, atlas, blog, etc) on top of those machines.
For example, "the blog is returning 503 errors" is probably the responsibility of a service admin, i.e. the blog service is experiencing a problem. Instead, "the blog doesn't ping" or "I cannot open a TCP connection" is a sysadmin thing, i.e. the machine running the blog service has an issue. More examples follow.
Sysadmin tasks:
- installing a Debian package
- deploy a firewall rule
- add a new user (or a group, or a user to a group, etc)
Service admin tasks:
- the donation site is not handling credit cards correctly
- a video on media.torproject.org is returning 403 because its permissions are wrong
- the check.tp.o web service crashed
The above distinction between sysadmins and service admins is often weak since Tor has trouble maintaining a large service admin team. Instead, core Tor people voluntarily take responsibility for a service for a while.
If a service is important for the Tor community, the sysadmin team might adopt it even when there aren't designated service admins.
In order for a service to be adopted by the sysadmin team, it needs to fulfill the criteria established for "Supported services" by TPA, above.
When a service is adopted by the sysadmin team, the sysadmins will estimate the costs and resources required to maintain the service over time. The documentation should follow the service documentation template at howto/template.
There needs to be some commitment by individual Tor Project contributors, and also by the project, that the service will receive funding to keep it working.
Policy was submitted to the team on 2020-06-03 and adopted by the team on 2020-06-10, at which point it was submitted to tor-internal for broader approval. It will be marked as "standard" on 2020-06-17 if there are no objections there.
This proposal was adopted as a standard on 2020-06-17.