Commit e64eea6e authored by Hiro

Add tpa-rfc-2 about support policy.

This defines how users get support, what counts as an emergency, and what is supported.
It also adds some guidelines on how and when the sysadmin team should adopt a service.
parent 37198aa4
@@ -33,7 +33,7 @@ issue.
If the physical host is not responding or is empty (in which case it
*is* a physical host), you need to file a ticket with the upstream
provider. This information is available in [Nagios][]:
1. search for the server name in the search box
2. click on the server
@@ -89,101 +89,4 @@ TBD
Support policies
================
We consider there are three "support levels" for problems that come up
with services:
* code red: immediate emergency, fix ASAP
* code yellow: serious problem that doesn't require immediate
attention but that could turn into a code red if nothing is done
* routine: file a bug report, we'll get to it soon!
We do not have 24/7 oncall support, so requests are processed during
work times of available staff. We do try to provide continuous support
as much as possible, but it's possible that some weekends or vacations
are unattended for more than a day. This is the definition of a
"business day".
The TPA team is currently small and there might be specific situations
where a code red might require more time than expected, and as an
organization we need to make an effort to understand that.
TPA is responsible for the base operating system and not *all*
services running on TPO infrastructure, see the [[service admin
definition|doc/admins]] for details on that distinction.
Debian GNU/Linux is the only supported operating system, and we
support only the "stable" and "oldstable" distributions, until the
latter becomes EOL. We do *not* support Debian LTS. It is the
responsibility of service admins to upgrade their services to keep up
with the Debian release schedule.
Code red
--------
A "code red" is a critical condition that requires immediate
action. It's what we consider an "emergency". Our SLA for those is
24 hours on business days, as defined above. Services qualifying for a code
red are:
* incoming email and forwards
* [main website](https://www.torproject.org/)
* [donation website](https://donate.torproject.org/)
Other services fall under "routine" or "code yellow" below, which can
be upgraded in priority.
Examples of problems falling under code red include:
* website unreachable
* emails to torproject.org not reaching our server
Some problems fall under other teams and are not the responsibility of
TPA, even if they can be otherwise considered a code red.
So, for example, those are *not* code reds for TPA:
* website has a major design problem rendering it unusable
* donation backend failing because of a problem in CiviCRM
* gmail refusing all email forwards
* encrypted mailing lists failures
* gitolite refuses connections
Code yellow
-----------
A "[code yellow](https://devops.com/code-yellow-when-operations-isnt-perfect/)" is a situation where we are overwhelmed but there
isn't exactly an immediate emergency to deal with. A good introduction
is this [SRECON19 presentation](https://www.usenix.org/conference/srecon19americas/presentation/kehoe) ([slides](https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf)). The basic idea is
that a code yellow is a "problem [that] creeps up on you over time and
suddenly the hole is so deep you can’t find the way out".
There's no clear timeline on when such a problem can be resolved. If
the problem is serious enough, it *may* eventually be upgraded to a
code red by the approval of a team lead after a week's delay,
regardless of the affected service. In that case, a "hot fix" (some
hack like throwing hardware at the problem) may be deployed instead of
fixing the actual long term issue, in which case the problem becomes a
code yellow again.
Examples of a code yellow include:
* Trac gets overwhelmed ([ticket 29672](https://bugs.torproject.org/29672))
* gitweb performance problems ([ticket 32133](https://bugs.torproject.org/32133))
* upgrade metrics.tpo to buster in the hope of fixing broken graphs
([ticket 32998](https://bugs.torproject.org/32998))
Routine
-------
Routine tasks are normal requests that are not an emergency and can be
processed as part of the normal workflow.
Examples of routine tasks include:
* account creation
* group access changes
* email alias changes
* static web component changes
* examine disk usage warning
* security upgrades
* server reboots
Please see [/tsa//policy/tpa-rfc-2-support/](../policy/tpa-rfc-2-support/)
[[!meta title="TPA-RFC-2: support"]]
Summary: we define three different support levels for services that the sysadmins
support
# Background
We consider there are three "support levels" for problems that come up
with services:
* code red: immediate emergency, fix ASAP
* code yellow: serious problem that doesn't require immediate
attention but that could turn into a code red if nothing is done
* routine: file a bug report, we'll get to it soon!
We do not have 24/7 oncall support, so requests are processed during
work times of available staff. We do try to provide continuous support
as much as possible, but it's possible that some weekends or vacations
are unattended for more than a day. This is the definition of a
"business day".
The TPA team is currently small and there might be specific situations
where a code red might require more time than expected, and as an
organization we need to make an effort to understand that.
TPA is responsible for the base operating system and not *all*
services running on TPO infrastructure, see the [[service admin
definition|doc/admins]] for details on that distinction.
Debian GNU/Linux is the only supported operating system, and we
support only the "stable" and "oldstable" distributions, until the
latter becomes EOL. We do *not* support Debian LTS. It is the
responsibility of service admins to upgrade their services to keep up
with the Debian release schedule.
# Support levels
Code red
--------
A "code red" is a critical condition that requires immediate
action. It's what we consider an "emergency". Our SLA for those is
24 hours on business days, as defined above. Services qualifying for a code
red are:
* incoming email and forwards
* [main website](https://www.torproject.org/)
* [donation website](https://donate.torproject.org/)
Other services fall under "routine" or "code yellow" below, which can
be upgraded in priority.
Examples of problems falling under code red include:
* website unreachable
* emails to torproject.org not reaching our server
Some problems fall under other teams and are not the responsibility of
TPA, even if they can be otherwise considered a code red.
So, for example, those are *not* code reds for TPA:
* website has a major design problem rendering it unusable
* donation backend failing because of a problem in CiviCRM
* gmail refusing all email forwards
* encrypted mailing lists failures
* gitolite refuses connections
Code yellow
-----------
A "[code yellow](https://devops.com/code-yellow-when-operations-isnt-perfect/)" is a situation where we are overwhelmed but there
isn't exactly an immediate emergency to deal with. A good introduction
is this [SRECON19 presentation](https://www.usenix.org/conference/srecon19americas/presentation/kehoe) ([slides](https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf)). The basic idea is
that a code yellow is a "problem [that] creeps up on you over time and
suddenly the hole is so deep you can’t find the way out".
There's no clear timeline on when such a problem can be resolved. If
the problem is serious enough, it *may* eventually be upgraded to a
code red by the approval of a team lead after a week's delay,
regardless of the affected service. In that case, a "hot fix" (some
hack like throwing hardware at the problem) may be deployed instead of
fixing the actual long term issue, in which case the problem becomes a
code yellow again.
Examples of a code yellow include:
* Trac gets overwhelmed ([ticket 29672](https://bugs.torproject.org/29672))
* gitweb performance problems ([ticket 32133](https://bugs.torproject.org/32133))
* upgrade metrics.tpo to buster in the hope of fixing broken graphs
([ticket 32998](https://bugs.torproject.org/32998))
Routine
-------
Routine tasks are normal requests that are not an emergency and can be
processed as part of the normal workflow.
Examples of routine tasks include:
* account creation
* group access changes
* email alias changes
* static web component changes
* examine disk usage warning
* security upgrades
* server reboots
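
As a rough illustration of the three levels above, triage could be sketched as follows. This is a minimal sketch, not an existing TPA tool: the `triage` function, its parameters and the short service labels are hypothetical, and mapping any "serious" non-emergency problem to code yellow is a simplifying assumption.

```python
from enum import Enum
from typing import Optional


class SupportLevel(Enum):
    CODE_RED = "code red"
    CODE_YELLOW = "code yellow"
    ROUTINE = "routine"


# Services the policy lists as qualifying for a code red.
# The short labels are hypothetical, chosen for this sketch only.
CODE_RED_SERVICES = {"email", "main-website", "donation-website"}


def triage(service: str, outage: bool, serious: bool = False,
           tpa_owned: bool = True) -> Optional[SupportLevel]:
    """Return a support level, or None if the problem belongs to another team."""
    if not tpa_owned:
        # e.g. donations failing because of a CiviCRM problem:
        # possibly critical, but not a code red for TPA.
        return None
    if outage and service in CODE_RED_SERVICES:
        return SupportLevel.CODE_RED
    if serious:
        # Assumption: "serious" means it could become a code red if ignored.
        return SupportLevel.CODE_YELLOW
    return SupportLevel.ROUTINE
```

For example, `triage("email", outage=True)` yields `SupportLevel.CODE_RED`, while the same outage with `tpa_owned=False` yields `None` because it falls under another team.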
# How and when should the sysadmin team adopt a service
Over the years we have operated with a "soft" distinction between sysadmins and
service admins as defined in: https://help.torproject.org/tsa/doc/admins/
This distinction is often weak since Tor doesn't have a service admin team. Instead,
core Tor people voluntarily take responsibility for a service for
a while.
If a service is important for the Tor community, the sysadmin team might adopt it
even when there are no designated service admins.
In order for a service to be adopted by the sysadmin team:
- The software needs to have an active release cycle,
- The software needs to provide installation instructions and debugging procedures,
- The software needs to maintain a bug tracker and/or some means to contact upstream,
- It needs to run on the latest Debian stable,
- When a new Debian release becomes stable, it needs to support it within 3 months,
- 1 extra person from the Tor community should be willing to help to maintain the
service in addition to 1 person from the sysadmin team.
When a service is adopted by the sysadmin team, the sysadmins will estimate the
costs and resources required to maintain the service over time.
There needs to be some commitment by individual Tor Project contributors, and also
by the project, that the service will receive funding to keep it working.
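
The adoption criteria listed above can be read as a checklist. The sketch below is one hypothetical way to encode them; the `ServiceCandidate` fields and the `unmet_criteria` function are invented for illustration and are not part of any TPA tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceCandidate:
    active_release_cycle: bool
    has_install_and_debug_docs: bool
    has_upstream_contact: bool          # bug tracker and/or other contact
    runs_on_latest_debian_stable: bool
    months_to_support_new_stable: int   # delay after a new Debian stable
    community_maintainers: int          # volunteers from the Tor community
    sysadmin_maintainers: int           # people from the sysadmin team


def unmet_criteria(c: ServiceCandidate) -> List[str]:
    """Return the adoption criteria the candidate does not meet yet."""
    checks = [
        (c.active_release_cycle, "active release cycle"),
        (c.has_install_and_debug_docs, "installation and debugging docs"),
        (c.has_upstream_contact, "bug tracker or upstream contact"),
        (c.runs_on_latest_debian_stable, "runs on latest Debian stable"),
        (c.months_to_support_new_stable <= 3,
         "supports new Debian stable within 3 months"),
        (c.community_maintainers >= 1, "one community maintainer"),
        (c.sysadmin_maintainers >= 1, "one sysadmin maintainer"),
    ]
    return [label for ok, label in checks if not ok]
```

An empty list means every listed criterion is met; the cost estimate and funding commitment still have to be handled separately.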