Right now, this is unofficially "open a ticket in Trac", "ping us over IRC for small stuff", or "write us an email". This could be made more official somewhere.
== What is an emergency?
I am not sure this is formally defined.
== What is supported?
We have the distinction between systems and service admins. We did talk in Stockholm about clarifying that item, so this is worth expanding further.
we should set up tiers of supported services somehow. it seems the short list of priority services that should stay up is the "donation and main websites, and the incoming email forwards service". but maybe we can have three tiers, with that list being the first one.
maybe it could be:
first tier: donation site, main website, incoming email
second tier: other sysadmin services (e.g. irc bouncer)
third tier: current "service admin" services (e.g. gitlab)?
This doesn't mean service admins shouldn't prioritize their stuff, but it would give sysadmins the opportunity to prioritize work on the sysadmin services there.
of course, maybe we want to change that distinction between service admins and sysadmins as well. As things stand now, i'm keeping that distinction, but I doubt it will be very feasible, when/if the git/trac/gitlab server crashes, for me, even as a sysadmin, to pretend it's not my responsibility. ;)
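to make the tier idea above a bit more concrete, here is a minimal sketch of how such a list could be written down and used to order work when several things break at once. this is only an illustration: the `Tier` enum, the `SERVICE_TIERS` mapping and the `triage_order` helper are hypothetical names, not an existing TPA tool, and the service labels are just the examples from the list above.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical support tiers; a lower value means higher priority."""
    FIRST = 1   # donation site, main website, incoming email
    SECOND = 2  # other sysadmin services
    THIRD = 3   # current "service admin" services


# Illustrative mapping only, built from the examples in the proposal.
SERVICE_TIERS = {
    "donation site": Tier.FIRST,
    "main website": Tier.FIRST,
    "incoming email": Tier.FIRST,
    "irc bouncer": Tier.SECOND,
    "gitlab": Tier.THIRD,
}


def triage_order(broken):
    """Sort broken services by tier, then by name.

    Raises KeyError for a service that has no tier assigned yet.
    """
    return sorted(broken, key=lambda name: (SERVICE_TIERS[name], name))


if __name__ == "__main__":
    for service in triage_order(["gitlab", "incoming email", "irc bouncer"]):
        print(SERVICE_TIERS[service].name, service)
```

the ordering only says what sysadmins would look at first; it doesn't say anything about who is responsible for each service.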
alright, i've reviewed our documentation on this, and we actually had a draft of something we could start with. instead of "tiers" it's based on "code red/yellow". a code "red" is a "drop everything" priority. i still include the same services in that code red, i just change the name and set the boundaries a little more clearly.
Trac: Keywords: N/A deleted, tparfc added Summary: 2. define how users get support, what's an emergency and what is supported to TPARFC-2: define how users get support, what's an emergency and what is supported
Trac: Summary: TPARFC-2: define how users get support, what's an emergency and what is supported to TPA-RFC-2: define how users get support, what's an emergency and what is supported
I think the draft is actually good as a start. I would just like to add that the sysadmin team is currently small, so there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.
Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.
E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think it would become a complex first tier service.
In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.
I think the draft is actually good as a start. I would just like to add that the sysadmin team is currently small, so there might be specific situations where a code RED requires more time than expected, and as an organization we need to make an effort to understand that.
That's what I tried to explain in the first part, with the "work times of available staff" bit. But maybe we could expand and include your sentence above to make that crystal clear. :) I've done just that now, see if it fixed it. :)
Another observation I have is that we could add to this a procedure regarding when and if the sysadmin team decides to adopt a service.
True! that would be a good procedure to have. But for now I'd like to focus on the "oncall" side of things...
For the record, we discussed this last in Stockholm and those are the relevant notes, I think:
We end up with having to keep hosts and services running long after the initial people who wanted it left. We also run some things directly as torproject-admin. We should have some list of requirements for things we (and also others) run on our infra. This list would include that sw needs to have proper releases and installation instructions and procedures, a bug tracker, some means to contact upstream, and it needs to run in the latest Debian stable (and when there is a new Debian stable, it needs to run on that within a month or three.) There needs to be some commitment of maintainership, not only by individuals but by the project/corp, meaning a promise of recurring money to keep this service working. It's never just about setting up. We really really want at least two people who know and maintain each service. Also, this policy should apply not only to incoming services, but it should apply to all the things we run and we should regularly evaluate whether services meet them.
So maybe it's just a matter of spelling this out in bullet points and adding it to the support policy?
E.g. gitlab. If we shut down tor git at some point, that would be where all our code lives, and that worries me a bit because I think it would become a complex first tier service.
For the record, I consider gitlab to be a "service" under the "service admins" umbrella. I have explicitly pushed back on the idea of throwing TPA under that bus for now, and we will need to have a team managing gitlab if we want this thing to work at all. :)
Of course, we have a tendency to fall back on TPA when things fail in the service admins team, but at least we should have that buffer for now, until we redefine those distinctions.
In this procedure we might take into account that if a team requests a service, they also have to be responsible for it, i.e. dedicating time and resources to maintain the service. Sometimes, if the service is important for the organization, we should require that at least a few people from the org step up and take that service as a collective responsibility.
Absolutely. Before we close this ticket, let's make a service admission policy, based on your comments here and the Stockholm discussion...
Do you want to draft something? You seem to have good ideas! :) Otherwise i can try to make a summary...
this should be submitted to a larger group before it's marked as approved, i think. following tpa-rfc-1, i think the rfc is now in the "draft" state and it should be brought up for discussion within tpa, and maybe other teams.
thanks for drafting this! :)
Trac: Resolution: fixed to N/A Status: closed to reopened
i did a significant review of the proposal. it seemed to me that the stuff from #33108 (moved) overlapped quite a bit with the existing support levels and policies, so I started merging those. and then I realized that the "service admins" definition belongs there too, along with "how do I get help".
before you know it i had reorganized the entire thing. so I sent an email to TPA for a final approval, and plan to bring this to wider approval (tor-internal, i guess?) next week if no one in tpa objects.