# TPA issues

## [monitor technical debt and legacy](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456) (anarcat, 2024-03-27)

I often say that we have a huge technical debt in TPA, and that we keep needing to close things down and document and so on.
But we do not have hard data on this. After reading [Managing Technical Debt](https://jacobian.org/2023/dec/20/tech-debt/), I realized we should at least keep track of metrics about this. What's interesting about that article is that it argues we shouldn't necessarily set *targets*, but that keeping track of metrics would be a good start.
He specifically [suggests DORA metrics](https://jacobian.org/2022/jun/17/dora-metrics/), but I'm not sure they're the best match for us. Here's what I think we should monitor:
* tickets
* "lead time" (time between when a ticket enters backlog/next/doing and closing)
* start using the ~"Technical Debt" label and measure ticket counts
* general per-queue ticket counts (already done in monthly reports; should also be put in Prometheus, see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591)
* incidents:
* "lead time" is specially important here: how long do tickets get opened in incidents? might also be a measure of MTTR (mean time to recovery)
* "change failure rate": measure how many incidents are caused by deployment failures
* documentation: systematically measure how many services we have and how well they are documented (this is partially done, by hand, in the `service.md` wiki page, but could be somehow automated)
* untracked package counts: use anarcat's [puppet-package-check](https://gitlab.com/anarcat/scripts/-/blob/main/puppet-package-check?ref_type=heads) to generate metrics on how many packages are *not* managed by Puppet, per host, as a rough estimate of the "puppetization ratio" (see the exporter sketch below this list)
* unit test coverage: across all our software projects (or maybe per project?), what is the coverage of unit tests? (requires CI and extraction of those numbers in an exporter)
* out-of-date systems: how long does it take to update the fleet, and how long do we live on LTS? (at least partly tracked in Prometheus now, but not retained long enough to have good metrics, see also #40330)
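To make the "lead time" idea concrete, here is a minimal sketch of how it could be computed from the GitLab API with the python-gitlab library. Note the approximation: it measures from ticket creation to closing, since measuring from the moment a ticket enters backlog/next/doing would require walking label events instead.

```python
# rough lead-time estimate: closed_at - created_at for closed tickets
# (an approximation; "entered backlog/next/doing" would need label events)
from datetime import datetime, timedelta

import gitlab  # python-gitlab

gl = gitlab.Gitlab("https://gitlab.torproject.org")  # anonymous access works for public projects
project = gl.projects.get("tpo/tpa/team")

def parse(ts: str) -> datetime:
    # GitLab returns ISO 8601 timestamps with a trailing "Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

lead_times = [
    parse(issue.closed_at) - parse(issue.created_at)
    for issue in project.issues.list(state="closed", iterator=True)
    if issue.closed_at
]

if lead_times:
    average = sum(lead_times, timedelta()) / len(lead_times)
    print(f"{len(lead_times)} closed tickets, average lead time: {average}")
```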
The end result here is a small set of metrics that describe the current state of affairs, and its evolution over time. It will allow us to more easily realize when we're in trouble (e.g. https://gitlab.torproject.org/tpo/tpa/team/-/issues/41411) and evaluate how much effort we should put into this.
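As an example of how one of those numbers could actually land in Prometheus, here is a hedged sketch of a cron job exposing the untracked-package count through the node exporter's textfile collector. The metric name, the file path, and the assumption that puppet-package-check prints one unmanaged package per line are all mine, not an existing TPA convention.

```python
# sketch: publish the count of packages not managed by Puppet as a
# Prometheus gauge via the node exporter's textfile collector
import subprocess
from pathlib import Path

# path where the node exporter's textfile collector looks for .prom files
# (assumption; depends on how the exporter is configured on the host)
TEXTFILE = Path("/var/lib/prometheus/node-exporter/unmanaged_packages.prom")

# assume the script prints one unmanaged package per line; the real
# output format of puppet-package-check may differ
out = subprocess.run(
    ["puppet-package-check"], capture_output=True, text=True, check=True
)
count = sum(1 for line in out.stdout.splitlines() if line.strip())

TEXTFILE.write_text(
    "# HELP unmanaged_packages Number of packages not managed by Puppet\n"
    "# TYPE unmanaged_packages gauge\n"
    f"unmanaged_packages {count}\n"
)
```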
It might be more useful to retain those metrics beyond the "one year" mark. Ticket counts, for example, are kept forever in the minutes, and that's a good thing, so we should consider expanding the storage retention here (#40330).
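For reference, retention in Prometheus is controlled by command-line flags; a hedged example, with placeholder values rather than a recommendation:

```
# keep samples for five years instead of the default 15 days,
# bounded by total disk usage (values are placeholders)
--storage.tsdb.retention.time=5y
--storage.tsdb.retention.size=100GB
```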
One thing Kaplan-Moss advises is to set time aside to deal with technical debt; he suggests 10%. He also says we shouldn't use "sprints" to deal with technical debt, but I disagree with that: I have found that Debian upgrades work well as sprints, and I wonder what else we could extend the practice to. On the other hand, the docs hack week wasn't a clear success for us, so maybe he's at least partly right in some respects.

## [TPA-RFC-38 wiki replacement](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40909) (Kez, 2024-03-20)

This is the discussion ticket for [TPA-RFC-38: Setting Up a Wiki Service](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-38-new-wiki-service). This ticket serves as a place where people can suggest changes to the RFC, as well as suggest goals and must-have features for the new wiki service.

## [enhance incident response procedures](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421) (anarcat, 2024-02-13)

today we had an ... interesting situation with the puppet infrastructure. while we have actually recovered pretty well, all things considered, it would be important to enhance our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.
some background reading:
* [Got game? Secrets of great incident management](https://bitfieldconsulting.com/blog/got-game-secrets-of-great-incident-management)
* [PagerDuty incident response documentation](https://response.pagerduty.com/)
some ideas:
* have an issue template for incidents (so, in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways), available offline
* run simulations/games
* have post-mortem templates; here's the [PagerDuty template](https://response.pagerduty.com/after/post_mortem_template/)
* gitlab has some [incident management primitives](https://docs.gitlab.com/ee/operations/incident_management/) including the aforementioned "[incidents](https://docs.gitlab.com/ee/operations/incident_management/incidents.html)" (which are really just issues)...
* ... but also [integrations](https://docs.gitlab.com/ee/operations/incident_management/integrations.html), which is especially interesting considering they have *native* Prometheus integration; that might require switching from nagios to prometheus (#29864). see the sketch below
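As a rough illustration of that integration, here is a sketch of pushing a test alert at a project's incident management HTTP endpoint. The URL path and the authorization key are placeholders; the real endpoint and key come from the project's alert integration settings, so check GitLab's documentation before relying on this.

```python
# sketch: send a test alert to GitLab's incident management HTTP
# integration, which should create an incident in the project
import requests

# placeholder endpoint and key (assumptions): the actual values are
# shown in the project's Settings > Monitor > Alerts integration page
GITLAB_ALERT_URL = "https://gitlab.torproject.org/tpo/tpa/team/prometheus/alerts/notify.json"
AUTH_KEY = "REDACTED"

alert = {
    "title": "puppet catalog compilation failing",
    "description": "test alert to exercise incident creation",
    "severity": "critical",
}

resp = requests.post(
    GITLAB_ALERT_URL,
    json=alert,
    headers={"Authorization": f"Bearer {AUTH_KEY}"},
    timeout=10,
)
resp.raise_for_status()
```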
anyways, the core idea here is:
1. have incident roles (note-taker, driver, comms, etc)
2. incident and post-mortem templates
3. run games

## [How do home directories work?](https://gitlab.torproject.org/tpo/tpa/team/-/issues/33733) (irl, 2021-09-15)

There seems to be little consistency here, which isn't what I expect from an orchestrated process, so I'm maybe missing something.
Each service has a directory in `/srv/{service}.torproject.org/` and then sometimes there is a home directory, which is sometimes linked in some way to `/home/{service}`. When there are multiple users for a service, they can share the same `/srv` directory but then have inconsistent naming of home directories.
Is there some documentation I can read to make sense of this?
Context: I'm putting together our Ansible roles (legacy/trac#33715) that should replicate what TPA will give us when we move things to a TPA host after we're convinced it's ready for deployment and we know what the specs will be, but I'm having trouble generalising even from just the Onionoo and Exit Scanner setups.
I'd like to be able to set some variables, like what usernames exist, what groups exist, and what paths will exist and should be used for stuff, and then let this role set that up. The service-specific (e.g. Onionoo or Exit Scanner) roles will then run equally on our AWS dev instances and the production TPA instance.

## [Document our privacy-preserving webserver log setup for the world](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40960) (Roger Dingledine, 2024-03-11)

We use a novel log format for our webservers, which makes sure we don't collect the IP addresses of our visitors, and doesn't record the precise timestamp of the visits, yet still produces a format compatible with various log parsing tools.
Everybody in the world should be doing this.
We should document what we do and how and why, and tell the world so everybody else can do it too.
Apparently Debian uses the same approach we do, so we have some adoption already, but much more remains!
See http://seclists.org/nmap-announce/2004/16 for some of our original motivation, and http://lists.spi-inc.org/pipermail/spi-general/2016-December/003645.html for a summary of what we do currently.
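For illustration only, a sanitized Apache `LogFormat` along these lines might look like the following; this is a guess reconstructed from the description above, not the canonical TPA configuration:

```apache
# hypothetical privacy-preserving access log format:
# - client IP replaced by a constant
# - time of day zeroed out, keeping only the date
LogFormat "0.0.0.0 - - [%{%d/%b/%Y:00:00:00 %z}t] \"%r\" %>s %O" privacy
CustomLog /var/log/apache2/access.log privacy
```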
We should also invite/encourage people to find bugs in our set-up. It can always get better!
And lastly, a blog post like this will be really useful to point to when we start doing analysis and graphs and metrics and stuff.