TPA-RFC-2: define how users get support, what's an emergency and what is supported

Trac:
Child Ticket(s): #33108 (moved)

added component::internal services/tor sysadmin team owner::anarcat priority::medium severity::normal status::merge-ready tpa-roadmap-may tparfc type::task labels

i formalized the current support channels in https://help.torproject.org/tsa/doc/how-to-get-help/

remove from checklist, as i want to close that ticket and it will be open forever if it depends on all the tickets generated from it.

Trac:
Owner: tpa to anarcat
Status: new to assigned
Parent: #30881 (moved) to N/A

we should set up tiers of supported services somehow. it seems the priority short list of services that should stay up is the "donation and main websites, and the incoming email forwards service". but maybe we can have three tiers, with that list being the first one.

maybe it could be:

first tier: donation site, main website, incoming email
second tier: other sysadmin services (e.g. irc bouncer)
third tier: current "service admin" services (e.g. gitlab)?

This doesn't mean service admins shouldn't prioritize their stuff, but it would give sysadmins the opportunity to prioritize work on the sysadmin services there.

of course, maybe we want to change that distinction between service admins and sysadmins as well. As things stand now, i'm keeping that distinction, but I doubt it will be feasible, when/if the git/trac/gitlab server crashes, for me to pretend it's not my responsibility, even as a sysadmin. ;)

alright, i've reviewed our documentation on this, and we actually had a draft of something we could start with. instead of "tiers" it's based on "code red/yellow". a code "red" is a "drop everything" priority. i still include the same services in that code red, i just change the name and set the boundaries a little more clearly.

i've detailed the policy here:

https://help.torproject.org/tsa/howto/incident-response/#Support_policies

the TL;DR:

code red: incoming email, donation, website
code yellow: something that might become a code red, but is not urgent yet (e.g. trac performance problem)
routine: account creation, etc - everything else
a code yellow can be upgraded to a code red after a one week delay with team lead approval
we don't have 24/7 support
requests are processed during work hours of available staff
we try to schedule holidays to avoid multiple "offline" days but those can still occur
we support only Debian stable and oldstable (not LTS)
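
for illustration only, the three levels and the upgrade rule in the TL;DR above could be sketched like this. this is not code that TPA runs; `Issue`, `classify`, `can_upgrade_to_red` and the `might_become_red` flag are all hypothetical names, and the sketch assumes any reported problem on a code red service counts as a code red.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Level(Enum):
    RED = "code red"
    YELLOW = "code yellow"
    ROUTINE = "routine"


# Services listed as code red in the summary above.
CODE_RED_SERVICES = {"incoming email", "donation", "website"}

# A code yellow can be upgraded after a one week delay.
UPGRADE_DELAY = timedelta(weeks=1)


@dataclass
class Issue:
    service: str
    opened: date
    # Hypothetical flag set by whoever triages the issue: not urgent yet,
    # but could turn into a code red (e.g. a trac performance problem).
    might_become_red: bool = False


def classify(issue: Issue) -> Level:
    """Map an issue to a support level (assumption: any problem on a
    code red service is itself a code red)."""
    if issue.service in CODE_RED_SERVICES:
        return Level.RED
    if issue.might_become_red:
        return Level.YELLOW
    return Level.ROUTINE


def can_upgrade_to_red(issue: Issue, today: date, lead_approved: bool) -> bool:
    """A code yellow may become a code red only after the one week
    delay has passed and the team lead has approved."""
    return (
        classify(issue) is Level.YELLOW
        and today - issue.opened >= UPGRADE_DELAY
        and lead_approved
    )
```

everything that is neither code red nor flagged falls through to routine, which matches the "everything else" wording above.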

asked hiro for review, thanks! :)

then will push to vegas

Trac:
Status: assigned to needs_review

Trac:
Keywords: N/A deleted, tparfc added
Summary: 2. define how users get support, what's an emergency and what is supported to TPARFC-2: define how users get support, what's an emergency and what is supported

Trac:
Summary: TPARFC-2: define how users get support, what's an emergency and what is supported to TPA-RFC-2: define how users get support, what's an emergency and what is supported

The link to "gitweb performance problems (ticket 32133)" actually goes to debian's 32133.

You probably meant trac #32133 (moved).

See: https://help.torproject.org/tsa/howto/incident-response/#Code_yellow

Trac:
Status: needs_review to needs_revision

I think the draft is actually good as a start. I would just like to add that, as the sysadmin team is currently small, there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.

Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.

E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think it would become a complex first tier service.

In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.

These are just a few observations.

The link to "gitweb performance problems (ticket 32133)" actually goes to debian's 32133.

You probably meant trac #32133 (moved).

oh good catch, fixed, thanks!

Trac:
Status: needs_revision to needs_review

I think the draft is actually good as a start. I would just like to add that, as the sysadmin team is currently small, there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.

That's what I tried to explain in the first part, with the "work times of available staff" bit. But maybe we could expand and include your sentence above to make that crystal clear. :) I've done just that now, see if it fixed it. :)

Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.

True! that would be a good procedure to have. But for now I'd like to focus on the "oncall" side of things...

For the record, we discussed this last in stockholm and those are the relevant notes, I think:

We end up with having to keep hosts and services running long after the initial people who wanted it left. We also run some things directly as torproject-admin. We should have some list of requirements for things we (and also others) run on our infra. This list would include that sw needs to have proper releases and installation instructions and procedures, a bug tracker, some means to contact upstream, and it needs to run in the latest Debian stable (and when there is a new Debian stable, it needs to run on that within a month or three.) There needs to be some commitment of maintainership, not only by individuals but by the project/corp, meaning a promise of recurring money to keep this service working. It's never just about setting up. We really really want at least two people who know and maintain each service. Also, this policy should apply not only to incoming services, but it should apply to all the things we run and we should regularly evaluate whether services meet them.

Extracted from https://trac.torproject.org/projects/tor/wiki/org/meetings/2019Stockholm/Notes/SysadminTeamRoadmapping

So maybe it's just a matter of spelling this out in bullet points and adding it to the support policy?
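
As a rough sketch of what "spelling this out" could look like, here are the Stockholm requirements written as a checklist that can be run against a service. This is only an illustration under my own assumptions: the `Service` fields and the `unmet_requirements` helper are hypothetical names, and the "within a month or three" upgrade window is not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Service:
    name: str
    has_releases: bool            # proper releases
    has_install_docs: bool        # installation instructions and procedures
    has_bug_tracker: bool
    has_upstream_contact: bool    # some means to contact upstream
    runs_on_debian_stable: bool   # runs on the latest Debian stable
    has_funding_commitment: bool  # recurring money promised by the project/corp
    maintainers: int              # people who know and maintain the service


def unmet_requirements(svc: Service) -> List[str]:
    """Return the requirements from the Stockholm notes that the
    service does not meet, in the order the notes list them."""
    checks = [
        (svc.has_releases, "proper releases"),
        (svc.has_install_docs, "installation instructions"),
        (svc.has_bug_tracker, "bug tracker"),
        (svc.has_upstream_contact, "upstream contact"),
        (svc.runs_on_debian_stable, "runs on latest Debian stable"),
        (svc.has_funding_commitment, "funding commitment"),
        (svc.maintainers >= 2, "at least two maintainers"),
    ]
    return [label for ok, label in checks if not ok]
```

Since the notes say the policy should cover existing services too, the same check could be re-run periodically over everything we host, not just at admission time.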

E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think it would become a complex first tier service.

For the record, I consider gitlab to be a "service" under the "service admins" umbrella. I have explicitly pushed back on the idea of throwing TPA under that bus for now, and we will need to have a team managing gitlab if we want this thing to work at all. :)

Of course, we have a tendency of falling back to TPA when things fail in the service admins team, but at least we should have that buffer for now, until we redefine those distinctions.

In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.

Absolutely. Before we close this ticket, let's make a service admission policy, based on your comments here and the Stockholm discussion...

Do you want to draft something? You seem to have good ideas! :) Otherwise i can try to make a summary...

this is being drafted in #33108 (moved).

next steps here are:

move the policy proposal into https://help.torproject.org/tsa/policy/
draft improvements to factor in #33108 (moved)
send the draft officially to tpa at the end of the TPA-RFC-1 delay, if approved (next friday, 2020-02-14)

Trac:
Keywords: N/A deleted, tpa-roadmap-march added

hiro has volunteered to follow up on this process.

Trac:
Owner: anarcat to hiro
Status: needs_review to assigned

Added to https://help.torproject.org/tsa/policy/tpa-rfc-2-support/

Trac:
Resolution: N/A to fixed
Status: assigned to closed

this should be submitted to a larger group before it's marked as approved, i think. following tpa-rfc-1, i think the rfc is now in the "draft" state and it should be brought up for discussion within tpa, and maybe other teams.

thanks for drafting this! :)

Trac:
Resolution: fixed to N/A
Status: closed to reopened

i'll bring this around for wider approval, approved by hiro

Trac:
Owner: hiro to anarcat
Status: reopened to assigned
Keywords: tpa-roadmap-march deleted, tpa-roadmap-april added

Trac:
Keywords: tpa-roadmap-april deleted, tpa-roadmap-may added

i did a significant review of the proposal. it seemed to me that the stuff from #33108 (moved) overlapped quite a bit with the existing support levels and policies, so I started merging those. and then I realized that the "service admins" definition belongs there too, along with "how do I get help".

before you know it i had reorganized the entire thing. so I sent an email to TPA for a final approval, and plan to bring this to wider approval (tor-internal, i guess?) next week if no one in tpa objects.

Trac:
Status: assigned to needs_review

approved by TPA during today's meeting, waiting another week for approval on tor-internal.

i made a small change during the meeting to include gitlab in the support channels.

Trac:
Status: needs_review to merge_ready

mentioned in issue #33108 (moved)
