Add tpa-rfc-2 about support policy.

This defines how users get support, what counts as an emergency, and what is supported. It also adds guidelines on how and when the sysadmin team should adopt a service.

Commit e64eea6e authored 5 years ago by Hiro
--- a/tsa/howto/incident-response.mdwn
+++ b/tsa/howto/incident-response.mdwn
@@ -33,7 +33,7 @@ issue.

 If the physical host is not responding or is empty (in which case it
 *is* a physical host), you need to file a ticket with the upstream
-provider. This information is available in [Nagios][]: 
+provider. This information is available in [Nagios][]:

 1. search for the server name in the search box
 2. click on the server
@@ -89,101 +89,4 @@ TBD
 Support policies
 ================

-We consider there are three "support levels" for problems that come up
-with services:
-
- * code red: immediate emergency, fix ASAP
- * code yellow: serious problem that doesn't require immediate
-   attention but that could turn into a code red if nothing is donw
- * routine: file a bug report, we'll get to it soon!
-
-We do not have 24/7 oncall support, so requests are processed during
-work times of available staff. We do try to provide continuous support
-as much as possible, but it's possible that some weekends or vacations
-are unattended for more than a day. This is the definition of a
-"business day".
-
-The TPA team is currently small and there might be specific situations
-where a code RED might require more time than expected and as a
-organization we need to do an effort in understanding that.
-
-TPA is responsible for the base operating system and not *all*
-services running on TPO infrastructure, see the [[service admin
-definition|doc/admins]] for details on that distinction.
-
-Debian GNU/Linux is the only supported operating system, and we
-support only the "stable" and "oldstable" distributions, until the
-latter becomes EOL. We do *not* support Debian LTS. It is the
-responsability of service admins to upgrade their services to keep up
-with the Debian release schedule.
-
-Code red
--------
-
-A "code red" is a critical condition that requires immediate
-action. It's what we consider an "emergency". Our SLA for those is
-24h business days, as defined above. Services qualifying for a code
-red are:
-
- * incoming email and forwards
- * [main website](https://www.torproject.org/)
- * [donation website](https://donate.torproject.org/)
-
-Other services fall under "routine" or "code yellow" below, which can
-be upgraded in priority.
-
-Examples of problems falling under code red include:
-
- * website unreachable
- * emails to torproject.org not reaching our server
-
-Some problems fall under other teams and are not the responsability of
-TPA, even if they can be otherwise considered a code red.
-
-So, for example, those are *not* code reds for TPA:
-
- * website has a major design problem rendering it unusable
- * donation backend failing because of a problem in CiviCRM
- * gmail refusing all email forwards
- * encrypted mailing lists failures
- * gitolite refuses connexions
-
-Code yellow
-----------
-
-A "[code yellow](https://devops.com/code-yellow-when-operations-isnt-perfect/)" is a situation where we are overwhelmed but there
-isn't exactly an immediate emergency to deal with. A good introduction
-is this [SRECON19 presentation](https://www.usenix.org/conference/srecon19americas/presentation/kehoe) ([slides](https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf)). The basic idea is
-that a code yellow is a "problem [that] creeps up on you over time and
-suddenly the hole is so deep you can’t find the way out".
-
-There's no clear timeline on when such a problem can be resolved. If
-the problem is serious enough, it *may* eventually be upgraded to a
-code red by the approval of a team lead after a week's delay,
-regardless of the affected service. In that case, a "hot fix" (some
-hack like throwing hardware at the problem) may be deployed instead of
-fixing the actual long term issue, in which case the problem becomes a
-code yellow again.
-
-Examples of a code yellow include:
-
- * Trac gets overwhelmed ([ticket 29672](https://bugs.torproject.org/29672))
- * gitweb performance problems ([ticket 32133](https://bugs.torproject.org/32133))
- * upgrade metrics.tpo to buster in the hope of fixing broken graphs
-   ([ticket 32998](https://bugs.torproject.org/32998))
-
-Routine
-------
-
-Routine tasks are normal requests that are not an emergency and can be
-processed as part of the normal workflow.
-
-Example of routine tasks include:
-
- * account creation
- * group access changes
- * email alias changes
- * static web component changes
- * examine disk usage warning
- * security upgrades
- * server reboots
+Please see [/tsa/policy/tpa-rfc-2-support/](../policy/tpa-rfc-2-support/)
--- a/tsa/policy/tpa-rfc-2-support.mdwn
+++ b/tsa/policy/tpa-rfc-2-support.mdwn
+[[!meta title="TPA-RFC-2: support"]]
+
+Summary: we define three different support levels for services that the sysadmins
+support
+
+# Background
+
+We consider there are three "support levels" for problems that come up
+with services:
+
+ * code red: immediate emergency, fix ASAP
+ * code yellow: serious problem that doesn't require immediate
+   attention but that could turn into a code red if nothing is done
+ * routine: file a bug report, we'll get to it soon!
+
+We do not have 24/7 on-call support, so requests are processed during
+the working hours of available staff. We try to provide continuous
+support as much as possible, but some weekends or vacations may go
+unattended for more than a day. This is what we mean by a
+"business day".
+
+The TPA team is currently small, and there might be specific situations
+where a code red requires more time than expected; as an
+organization, we need to make an effort to understand that.
+
+TPA is responsible for the base operating system and not *all*
+services running on TPO infrastructure; see the [[service admin
+definition|doc/admins]] for details on that distinction.
+
+Debian GNU/Linux is the only supported operating system, and we
+support only the "stable" and "oldstable" distributions, until the
+latter becomes EOL. We do *not* support Debian LTS. It is the
+responsibility of service admins to upgrade their services to keep up
+with the Debian release schedule.
+
+# Support levels
+
+Code red
+--------
+
+A "code red" is a critical condition that requires immediate
+action. It's what we consider an "emergency". Our SLA for those is
+24 hours, counted in business days as defined above. Services
+qualifying for a code red are:
+
+ * incoming email and forwards
+ * [main website](https://www.torproject.org/)
+ * [donation website](https://donate.torproject.org/)
+
+Other services fall under "routine" or "code yellow" below, which can
+be upgraded in priority.
+
+Examples of problems falling under code red include:
+
+ * website unreachable
+ * emails to torproject.org not reaching our server
+
+Some problems fall under other teams and are not the responsibility of
+TPA, even if they could otherwise be considered a code red.
+
+So, for example, these are *not* code reds for TPA:
+
+ * website has a major design problem rendering it unusable
+ * donation backend failing because of a problem in CiviCRM
+ * Gmail refusing all email forwards
+ * encrypted mailing list failures
+ * gitolite refuses connections
+
+Code yellow
+-----------
+
+A "[code yellow](https://devops.com/code-yellow-when-operations-isnt-perfect/)" is a situation where we are overwhelmed but there
+isn't exactly an immediate emergency to deal with. A good introduction
+is this [SRECON19 presentation](https://www.usenix.org/conference/srecon19americas/presentation/kehoe) ([slides](https://www.usenix.org/sites/default/files/conference/protected-files/sre19amer_slides_kehoe.pdf)). The basic idea is
+that a code yellow is a "problem [that] creeps up on you over time and
+suddenly the hole is so deep you can’t find the way out".
+
+There's no clear timeline on when such a problem can be resolved. If
+the problem is serious enough, it *may* eventually be upgraded to a
+code red by the approval of a team lead after a week's delay,
+regardless of the affected service. In that case, a "hot fix" (some
+hack like throwing hardware at the problem) may be deployed instead of
+fixing the actual long-term issue, in which case the problem becomes a
+code yellow again.
+
+Examples of a code yellow include:
+
+ * Trac gets overwhelmed ([ticket 29672](https://bugs.torproject.org/29672))
+ * gitweb performance problems ([ticket 32133](https://bugs.torproject.org/32133))
+ * upgrade metrics.tpo to buster in the hope of fixing broken graphs
+   ([ticket 32998](https://bugs.torproject.org/32998))
+
+Routine
+-------
+
+Routine tasks are normal requests that are not an emergency and can be
+processed as part of the normal workflow.
+
+Examples of routine tasks include:
+
+ * account creation
+ * group access changes
+ * email alias changes
+ * static web component changes
+ * examine disk usage warning
+ * security upgrades
+ * server reboots
+
+# How and when the sysadmin team should adopt a service
+
+Over the years we have operated with a "soft" distinction between sysadmins and
+service admins, as defined in: https://help.torproject.org/tsa/doc/admins/
+
+This distinction is often weak since Tor doesn't have a service admin team.
+Instead, core Tor people voluntarily take responsibility for a service for
+a while.
+
+If a service is important for the Tor community, the sysadmin team might adopt it
+even when there are no designated service admins.
+
+In order for a service to be adopted by the sysadmin team:
+
+- The software needs to have an active release cycle.
+- The software needs to provide installation instructions and debugging procedures.
+- The software needs to maintain a bug tracker and/or some means to contact upstream.
+- It needs to run on the latest Debian stable.
+- When a new Debian release becomes stable, it needs to support it within 3 months.
+- One additional person from the Tor community should be willing to help maintain the
+  service, in addition to one person from the sysadmin team.
+
+When a service is adopted by the sysadmin team, the sysadmins will estimate
+the costs and resources required to maintain the service over time.
+There needs to be some commitment by individual Tor Project contributors, and also
+by the project, that the service will receive funding to keep it working.