---
title: TPA-RFC-2: support
---

Summary: to get help, open a ticket, ask on IRC for simple things, or
send us an email for private matters. TPA doesn't manage all services
(see the service admin definition). This policy also defines the
criteria for supported services and the support levels.

[[_TOC_]]

# Background

It is important to define how users get help from the sysadmin team
(AKA "TPA"), what constitutes an emergency for it, and what it
supports. So far, only the first has been defined, rather informally,
and it has yet to be collectively agreed upon within the larger team.

This proposal aims to document the current situation and propose new
support levels and a support policy that will provide clear guidelines
and expectations for the various teams inside TPO.

This need first emerged during an audit of the TPO infrastructure by
anarcat in July 2019 ([ticket 31243][]), itself taken from
[section 2][] of the "ops report card", which asks: *Are "the 3
empowering policies" defined and published?* Those policies are
defined as:

 1. How do users get help?
 2. What is an emergency?
 3. What is supported?

We translate those into the following policy proposals:

 * Support channels
 * Support levels
 * Supported services, which includes the service admins definition
   and how services transition between the teams (if at all)

# Proposal

<a name="how-to-get-help"></a>

## Support channels

We encourage you to document support requests and questions and to
communicate them to the team.

These instructions mostly concern internal Tor matters. If you are a
user of Tor software, you will be better served by visiting
[support.torproject.org][] or the [mailing lists][].

[support.torproject.org]: https://support.torproject.org/
[mailing lists]: https://lists.torproject.org/

### Quick question: chat

If you have "just a quick question" or some quick thing we can help
you with, ask us on IRC: you can find us in `#tpo-admin` on
`irc.oftc.net` and in other Tor channels.

We may ask you to create a ticket if we're in a pinch. IRC is also a
good way to bring our attention to an emergency or to a ticket that
was filed elsewhere.

### Bug reports, feature requests and others: issue tracker

Most requests and questions should go into the issue tracker, which is
currently [Trac][] ([direct link to a new ticket form][]). Try to find
the right component, but when in doubt, pick [Internal Services/Tor
Sysadmin Team][].

[Trac]: https://trac.torproject.org
[direct link to a new ticket form]: https://trac.torproject.org/projects/tor/newticket
[Internal Services/Tor Sysadmin Team]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team

(Note that the issue tracker will be changed to GitLab shortly, at
which point the above links will be updated.)

### Private question and fallback: email

If you want to discuss a sensitive matter that requires privacy or are
unsure how to reach us, you can always write to us by email, at
[torproject-admin@torproject.org][].

[torproject-admin@torproject.org]: mailto:torproject-admin@torproject.org

## Support levels

We consider that there are three "support levels" for problems that
come up with services:

 * code red: immediate emergency, fix ASAP
 * code yellow: serious problem that doesn't require immediate
   attention but that could turn into a code red if nothing is done
 * routine: file a bug report, we'll get to it soon!

We do not have 24/7 on-call support, so requests are processed during
the working hours of available staff. We try to provide continuous
support as much as possible, but some weekends or vacations may go
unattended for more than a day. This is what we mean by a
"business day".

The TPA team is currently small and there might be specific situations
where a code red requires more time than expected. As an organization,
we need to make an effort to understand that.

### Code red

A "code red" is a critical condition that requires immediate
action. It's what we consider an "emergency". Our SLA for those is
24 hours during business days, as defined above. Services qualifying
for a code red are:

 * incoming email and forwards
 * [main website][]
 * [donation website][]

[main website]: https://www.torproject.org/
[donation website]: https://donate.torproject.org/

Other services fall under "routine" or "code yellow" below, which can
be upgraded in priority.

Examples of problems falling under code red include:

 * website unreachable
 * emails to torproject.org not reaching our server

Some problems fall under other teams and are not the responsibility of
TPA, even if they could otherwise be considered a code red.

So, for example, those are *not* code reds for TPA:

 * website has a major design problem rendering it unusable
 * donation backend failing because of a problem in CiviCRM
 * Gmail refusing all email forwards
 * encrypted mailing list failures
 * gitolite refuses connections

### Code yellow

A "[code yellow][]" is a situation where we are overwhelmed but there
isn't exactly an immediate emergency to deal with. A good introduction
is this [SRECON19 presentation][] ([slides][]). The basic idea is
that a code yellow is a "problem [that] creeps up on you over time and
suddenly the hole is so deep you can’t find the way out".

[code yellow]: https://devops.com/code-yellow-when-operations-isnt-perfect/
[SRECON19 presentation]: https://www.usenix.org/conference/srecon19americas/presentation/kehoe
[slides]: https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf

There's no clear timeline on when such a problem can be resolved. If
the problem is serious enough, it *may* eventually be upgraded to a
code red with the approval of a team lead after a week's delay,
regardless of the affected service. In that case, a "hot fix" (some
hack like throwing hardware at the problem) may be deployed instead of
fixing the actual long-term issue, in which case the problem becomes a
code yellow again.

Examples of a code yellow include:

 * Trac gets overwhelmed ([ticket 29672][])
 * Gitweb performance problems ([ticket 32133][])
 * upgrade metrics.tpo to buster in the hope of fixing broken graphs
   ([ticket 32998][])

[ticket 29672]: https://bugs.torproject.org/29672
[ticket 32133]: https://bugs.torproject.org/32133
[ticket 32998]: https://bugs.torproject.org/32998

### Routine

Routine tasks are normal requests that are not an emergency and can be
processed as part of the normal workflow.

Examples of routine tasks include:

 * account creation
 * group access changes
 * email alias changes
 * static web component changes
 * examine disk usage warning
 * security upgrades
 * server reboots

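The three levels above can be read as a small decision procedure. The
following Python sketch is only an illustration of that reading, not a
tool TPA uses: the function name, its parameters and the short service
labels are hypothetical, and it assumes the problem has already been
established as TPA's responsibility rather than that of another team.

```python
from enum import Enum


class SupportLevel(Enum):
    CODE_RED = "code red"
    CODE_YELLOW = "code yellow"
    ROUTINE = "routine"


# Services this policy lists as qualifying for a code red.
# The short identifiers are hypothetical labels chosen for this sketch.
CODE_RED_SERVICES = frozenset({"email", "main-website", "donation-website"})

# A code yellow may be upgraded after a week's delay, per the policy text.
UPGRADE_DELAY_DAYS = 7


def triage(service, is_outage, is_overload=False, days_open=0,
           lead_approved=False):
    """Suggest a support level for a problem that is TPA's responsibility."""
    if is_outage and service in CODE_RED_SERVICES:
        return SupportLevel.CODE_RED
    if is_overload:
        # An upgrade needs both the delay and a team lead's approval,
        # regardless of the affected service.
        if lead_approved and days_open >= UPGRADE_DELAY_DAYS:
            return SupportLevel.CODE_RED
        return SupportLevel.CODE_YELLOW
    return SupportLevel.ROUTINE
```

For example, `triage("main-website", is_outage=True)` returns
`SupportLevel.CODE_RED`, while an overloaded service with no approval
from a team lead stays at `SupportLevel.CODE_YELLOW`.
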
## Supported services

Services supported by TPA must fulfill the following criteria:

 1. The software needs to have an active release cycle
 2. It needs to provide installation instructions and debugging
    procedures
 3. It needs to maintain a bug tracker and/or some means to contact
    upstream
 4. Debian GNU/Linux is the only supported operating system, and TPA
    supports only the "stable" and "oldstable" distributions, until
    the latter becomes EOL
 5. At least two people from the Tor community should be willing
    to help maintain the service

Note that TPA does *not* support Debian LTS.

Also note that it is the responsibility of [service
admins][] (see below) to upgrade their services to keep
up with the Debian release schedule.

[service admins]: #service-admins

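The five criteria listed above can be read as a checklist. The sketch
below is one hypothetical way to express it in Python; the class, field
and function names are assumptions made for this example, and it does
not model the end-of-life date of "oldstable".

```python
from dataclasses import dataclass
from typing import List

# Debian distributions this policy says TPA supports (LTS is excluded).
SUPPORTED_DEBIAN = frozenset({"stable", "oldstable"})
# Criterion 5: at least two people willing to help maintain the service.
MIN_MAINTAINERS = 2


@dataclass
class ServiceCandidate:
    """Hypothetical description of a service proposed for TPA support."""
    name: str
    active_release_cycle: bool
    has_install_and_debug_docs: bool
    has_upstream_contact: bool
    debian_release: str
    maintainers: int


def unmet_criteria(service: ServiceCandidate) -> List[str]:
    """Return a description of each criterion the service fails."""
    unmet = []
    if not service.active_release_cycle:
        unmet.append("no active release cycle")
    if not service.has_install_and_debug_docs:
        unmet.append("missing installation or debugging documentation")
    if not service.has_upstream_contact:
        unmet.append("no bug tracker or upstream contact")
    if service.debian_release not in SUPPORTED_DEBIAN:
        unmet.append("not running Debian stable or oldstable")
    if service.maintainers < MIN_MAINTAINERS:
        unmet.append("fewer than two maintainers")
    return unmet
```

An empty list means every criterion in this sketch is met; any other
result names what is still missing before the service qualifies.
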
<a name="service-admins"></a>

### Service admins

(Note: this section used to live in doc/admins and is the current
"service admin" definition, mostly untouched.)

Within the **admin team** we have **system admins** (also known as
sysadmins, TSA or TPA) and **service admins**. While the distinction
between the two might seem blurry, the rule of thumb is that
**sysadmins** do not maintain every service that we offer. Rather,
they maintain the underlying computers -- make sure they get package
updates, make sure they stay on the network, etc.

Then it's up to the **service admins** to deploy and maintain their
[services][] (onionoo, atlas, blog, etc.) on top of those machines.

[services]: https://gitlab.torproject.org/legacy/trac/-/wikis/org/operations/Infrastructure

For example, **"the blog is returning 503 errors"** is probably the
responsibility of a **service admin**, i.e. the blog service is
experiencing a problem. Instead, **"the blog doesn't ping"** or **"I
cannot open a TCP connection"** is a **sysadmin** thing, i.e. the
machine running the blog service has an issue. More examples:

Sysadmin tasks:

 * install a Debian package
 * deploy a firewall rule
 * add a new user (or a group, or a user to a group, etc.)

Service admin tasks:

 * the donation site is not handling credit cards correctly
 * a video on media.torproject.org is returning 403 because its permissions are wrong
 * the check.tp.o web service crashed

### Service adoption

The above distinction between sysadmins and service admins is often
weak since Tor has trouble maintaining a large service admin
team. There are instead core Tor people that are voluntarily
responsible for a service, for a while.

If a service is important for the Tor community, the sysadmin team
might adopt it even when there aren't designated service admins.

In order for a service to be adopted by the sysadmin team, it needs to
fulfill the criteria established for "[Supported services][]" by TPA,
above.

[Supported services]: #Supported_services

When a service is adopted by the sysadmin team, the sysadmins will
estimate the costs and resources required to maintain the service
over time. The documentation should follow the service
documentation template at [howto/template](howto/template).

There needs to be some commitment by individual Tor Project
contributors and also by the project that the service will receive
funding to keep it working.

# Deadline

Policy was submitted to the team on 2020-06-03 and adopted by the team
on 2020-06-10, at which point it was submitted to tor-internal for
broader approval. It will be marked as "standard" on 2020-06-17 if
there are no objections there.

# Status

This proposal was adopted as a `standard` on 2020-06-17.

# References

 * [ticket 31243][]
 * [section 2][] of the [Ops report card][]

[ticket 31243]: https://bugs.torproject.org/31243
[section 2]: http://opsreportcard.com/section/2
[Ops report card]: http://opsreportcard.com/