# Onion Services Site Reliability Engineering

[[_TOC_]]

The SRE role aims to provide tools and technical support to deploy and monitor high-availability Onion Service sites.

## Oniongroove

The public-facing details of the software suite to manage Onion Service sites are available at the [Oniongroove repository](https://gitlab.torproject.org/rhatto/oniongroove/).

### Objectives and key results (OKRs)

**Objective:** this role is part of the Onion Support Group's mission to increase the adoption of Onion Services, for which we can select the following goals:

0. Provide easy ways to set up and maintain Onion Services. How to measure this?
1. With sane defaults. How to measure this?
2. That are configurable and extensible. How to measure this?

## Kickstarting

### Initial plan

These are the proposed kickstarting steps for this role:

0. Meet with dgoulet, hiro and anarcat to get advice on kickstarting the project: what to look for, and where, regarding specs, tools, goals, security checklists, limits etc. Check the meeting notes [here](https://gitlab.torproject.org/tpo/onion-support/-/wikis/Meetings/2022-02-08-Onion-Services-SRE-Kickstart) and [here](https://lists.torproject.org/pipermail/tor-project/2022-February/003288.html).
1. Research all relevant deployment technologies: build a first matrix.
2. Then meet with the media organizations: inventory, compliance checks etc.
3. Build the second matrix (use cases).

### Initial considerations

While brainstorming about this role, the following considerations were sketched:

0. Software suite: the Sponsor 123 project includes provisioning and monitoring onion services as deliverables, but the effort could be used to create a generic product (a "suite") which would include an Onionbalance deployer.
1. External instance(s): for the Sponsor 123 contract, a single instance of this "CDN" solution could be used to manage all sites, instead of having to manage many instances (and dashboards) in parallel. Future contracts with other third parties could either be managed using that same instance or have their own instances (isolation).
2. Internal instance: another, internal instance could be set up to manage all sites listed at https://onion.torproject.org if TPA likes and decides to adopt the solution :)
3. Existing considerations at the [Oniongroove Scope](https://gitlab.torproject.org/rhatto/oniongroove/-/blob/main/specs.md#Scope).
4. Other considerations: see [rhatto's skill-test research](https://gitlab.torproject.org/tpo/tpa/skill-test-onion-sre-candidate-sr/-/blob/main/research.md).

### Questions

General:

0. If you were the Onion Services SRE, how would you implement this project?
1. What existing solutions should we look at, and what should we avoid?

Architecture:

0. What do people think about the architecture proposed by rhatto during his skill-test (setting aside the improvised implementation he coded)? https://gitlab.torproject.org/tpo/tpa/skill-test-onion-sre-candidate-sr/-/blob/main/README.md#chosen-architecture
1. Which other limits are important to consider in the scope of this project, like the current upper bound of 8 Onionbalance backend servers?

Implementation:

0. What are the dimensions for the comparison matrix of existing DevOps solutions such as Puppet, Ansible, Terraform and Salt (and specific modules/recipes/cookbooks/roles)?
1. Is this list complete for the second matrix (initial use cases survey)? https://gitlab.torproject.org/tpo/onion-support/-/wikis/What-we-need-to-know-about-each-setup
2. How does TPA manage passphrases and secrets for existing systems and keys? Answer: check [evaluate password management options (#29677)](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29677).
3. What (if any) TPA (or other) security policies should be observed in this project? Answer: check [Tor security policy (#41)](https://gitlab.torproject.org/tpo/team/-/issues/41).
4. Which solutions are in use to manage the sites listed at https://onion.torproject.org/? Answer: custom Puppet modules (currently not public).
5. How does the Tor daemon currently scale? How many connections can it support at the same time?

Management:

0. The Sponsor 123 Project Plan timeline predicts the setup of the first .onion sites on M1 and M2, with 2-5 business days to set up a single .onion site. But coding a solution could take longer. How should we proceed then? Answer: the suggested approach is to have a detailed discovery phase while coding the initial solution in parallel. Some rework might be needed, but we can gain time overall.