TPA issues
https://gitlab.torproject.org/groups/tpo/tpa/-/issues
2024-02-08T16:19:30Z

automate major upgrades
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41485
2024-02-08T16:19:30Z
anarcat
we currently have automated upgrades for the day-to-day debian package upgrades, through unattended-upgrades (#31957). but major upgrades are not scripted, other than ad-hoc commands copy-pasted from an otherwise excellent wiki page.
we should automate this.
during the %"Debian 12 bookworm upgrade", tor weather suffered a catastrophic failure (#41388) due to a flaw in the postgresql upgrade procedure, so that should probably be our first target: automating that procedure should keep that kind of problem from occurring again (as a script can do better error checking than copy-pasted commands).
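a minimal sketch of what "better error checking" could look like: run each step in order and stop at the first non-zero exit code instead of carrying on. the `run_steps` helper, the `StepFailed` exception and the example step list are hypothetical; the PostgreSQL versions and cluster name are placeholders, not the procedure from the wiki page.

```python
import subprocess


class StepFailed(RuntimeError):
    """Raised when a step of the procedure exits with a non-zero status."""


def run_steps(steps, runner=subprocess.run):
    """Run (name, argv) steps in order and abort on the first failure.

    Returns the names of the steps that completed. `runner` is injectable
    so the logic can be exercised without touching a real system.
    """
    completed = []
    for name, argv in steps:
        result = runner(argv, capture_output=True, text=True)
        if result.returncode != 0:
            # Stop immediately: later steps must never run on a broken state.
            raise StepFailed(
                f"step {name!r} failed with exit code {result.returncode}: "
                f"{result.stderr.strip()}"
            )
        completed.append(name)
    return completed


# Hypothetical step list; versions ("13") and cluster name ("main") are
# placeholders and this is NOT the full documented procedure.
POSTGRESQL_STEPS = [
    ("list clusters", ["pg_lsclusters"]),
    ("upgrade cluster", ["pg_upgradecluster", "13", "main"]),
]
```

calling `run_steps(POSTGRESQL_STEPS)` would then either return the list of completed steps or raise `StepFailed` naming the step that broke, which is the behaviour a copy-paste session does not give us.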
but ideally, we'd automate the entire procedure. See also https://wiki.debian.org/AutomatedUpgrade
Debian 13 trixie upgrade

deploy fabric-tasks on install and keep up to date in puppet
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41484
2024-03-27T14:40:01Z
anarcat
all hosts should have a copy of fabric-tasks. there are many useful things in that repo, and we should keep expanding it to have more useful things.
it would skip a step in the install procedure, but it would also allow us to dump ad-hoc scripts that we currently leave lying around in /root or elsewhere.
this is part of the automated install task (#31239).
(next) cluster scaling
anarcat

metricsdb-01 out of swap
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41483
2024-02-17T00:06:09Z
Kez
Nagios has an alert for metricsdb-01: SWAP CRITICAL - 4% free (65MB out of 2047MB). It's almost exclusively because of a victoria-metric process: `victoria-metric 1800892 kB`.
@hiro I'm assigning this to you because you'll probably know what to do with it better than me
Hiro

Automate renewal of self-signed LDAP cert
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41482
2024-01-19T17:07:36Z
Kez
In #41479 I renewed the self-signed LDAP cert for two years (730 days). That means that next time we renew it will be right after the holidays in 2026. It's not too much of a pain since it's only every 2 years, but it would be nice to not have to renew it right after we come back from our holiday break.
We could either automate the procedure entirely, or I could renew it again in a month or so, so that the current cert will expire in February 2026. @anarcat any preferences or suggestions?
Jérôme Charaoui lavamind@torproject.org

nextcloud is returning 502 bad gateway
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41464
2024-03-14T00:48:16Z
Jim Newsome
I'm getting 502 bad gateway for https://nc.torproject.net/. Verified by thorin as well
micah micah@torproject.org

monitor technical debt and legacy
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41456
2024-03-27T15:29:42Z
anarcat
I often say that we have a huge technical debt in TPA, and that we keep needing to close things down and document and so on.
But we do not have hard data on this. After reading [Managing Technical Debt](https://jacobian.org/2023/dec/20/tech-debt/), I realized we should at least keep track of metrics about this. What's interesting about that article is it says we shouldn't necessarily set *targets*, but keeping track of metrics would be a good start.
He specifically [suggests DORA metrics](https://jacobian.org/2022/jun/17/dora-metrics/), but I'm not sure it's the best match for us. Here's what I think we should monitor:
* tickets
  * "lead time" (time between when a ticket enters backlog/next/doing and closing)
  * start using the ~"Technical Debt" label and measure ticket counts
  * general per-queue ticket counts (already done in monthly reports, put in prometheus, see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40591)
* incidents:
  * "lead time" is especially important here: how long do incident tickets stay open? might also be a measure of MTTR (mean time to recovery)
  * "change failure rate": measure how many incidents are caused by deployment failures
* documentation: systematically measure how many services we have and how well they are documented (this is partially done, by hand, in the `service.md` wiki page, but could be somehow automated)
* untracked package counts: use anarcat's [puppet-package-check](https://gitlab.com/anarcat/scripts/-/blob/main/puppet-package-check?ref_type=heads) to generate metrics on how many packages are *not* managed by puppet, per host, as a rough estimate of the "puppetization ratio"
* unit test coverage: across all our software projects (or maybe per project?), what is the coverage of unit tests? (requires CI and extraction of those numbers in an exporter)
* out of date systems: how long does it take to update the fleet, and how long do we live on LTS? (at least partly tracked in Prometheus now, but not retained long enough to have good metrics, see also #40330)
The end result here is a small set of metrics that describe the current state of affairs, and its evolution over time. It will allow us to more easily realize when we're in trouble (e.g. https://gitlab.torproject.org/tpo/tpa/team/-/issues/41411) and evaluate how much effort we should put into this.
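As a rough illustration of the ticket and incident metrics listed above, here is a minimal sketch of how "lead time" and "change failure rate" could be computed once the data is extracted. The field names (`started_at`, `closed_at`, `caused_by_deployment`) are hypothetical and do not match any existing exporter or the GitLab API; timestamps are assumed to use the same format as this feed.

```python
from datetime import datetime, timedelta
from statistics import median


def parse_ts(value):
    """Parse a timestamp such as 2024-03-27T15:29:42Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def lead_time_summary(tickets):
    """Summarize lead time over closed tickets; open tickets are skipped.

    Each ticket is a dict with hypothetical keys: "started_at" (when it
    entered backlog/next/doing) and "closed_at" (None while still open).
    """
    seconds = [
        (parse_ts(t["closed_at"]) - parse_ts(t["started_at"])).total_seconds()
        for t in tickets
        if t.get("closed_at")
    ]
    if not seconds:
        return None
    return {
        "count": len(seconds),
        "mean": timedelta(seconds=sum(seconds) / len(seconds)),
        "median": timedelta(seconds=median(seconds)),
    }


def change_failure_rate(incidents):
    """Fraction of incidents flagged as caused by a deployment failure."""
    if not incidents:
        return None
    failures = sum(1 for i in incidents if i.get("caused_by_deployment"))
    return failures / len(incidents)
```

Both functions return plain numbers, so a (still hypothetical) exporter could publish them as Prometheus gauges alongside the per-queue ticket counts.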
It might be more effective to have those metrics beyond the "one year" mark. Ticket counts, for example, are kept forever in the minutes, and that's a good thing, so we should consider expanding the storage retention here (#40330).
One thing Kaplan-Moss advises is to set time apart to deal with technical debt: he suggests 10%. He also says we shouldn't set "sprints" to deal with technical debt, but I disagree with that: I have found that Debian upgrades are working well with sprints and wonder what else we could extend the practice to. On the other hand, the docs hack week wasn't a clear success for us, so maybe he's at least partly right.
cleanup and publish the sysadmin codebase
anarcat

move ooni.torproject.org to our mirrors and/or fix CAA hardening for subdomain
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41455
2024-01-23T18:16:04Z
anarcat
In #41386, we tried to harden our CAA records, but this impacted the OONI folks who couldn't renew their certificates. A workaround was deployed on the subdomain, but we'd like to re-harden this bit by either:
1. make the ooni.torproject.org redirects part of our normal "vanity hosts" redirections on the static mirror system, or;
2. restrict the CAA record to a specific (set of?) let's encrypt accounts
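
For option 2, a minimal sketch of what the restricted records could look like, assuming the CA honors the `accounturi` CAA parameter defined in RFC 8657. The `caa_records` helper is hypothetical and the account URI in the usage note is a placeholder, not a real OONI or TPA account.

```python
def caa_records(domain, ca, account_uris, flags=0, tag="issue"):
    """Build zone-file CAA lines that pin issuance to specific ACME accounts.

    RFC 8657 allows a single accounturi per property, so a set of accounts
    is expressed as one record per account.
    """
    return [
        f'{domain}. IN CAA {flags} {tag} "{ca}; accounturi={uri}"'
        for uri in account_uris
    ]
```

For example, `caa_records("ooni.torproject.org", "letsencrypt.org", ["https://acme-v02.api.letsencrypt.org/acme/acct/1234567"])` yields a single `issue` record that only lets that one (placeholder) account request certificates for the subdomain.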
@art, which one should we do, and what timeline should we look at for this?

Migrate metrics-store-01 to object storage
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41454
2024-01-04T19:34:23Z
Hiro
We have agreed we can migrate metrics-store-01 to object storage.
Hiro

evaluate gitlab optimisations for large / monorepos
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41453
2024-01-22T19:44:52Z
anarcat
While looking at GitLab backups (#40518), I stumbled upon this page:
https://docs.gitlab.com/ee/user/project/repository/monorepos/
It has interesting recommendations for "monorepos", by which they really mean "large repositories". We should look into those directives and see what optimizations we could make.
This is mostly for the applications' team repositories, of course, so /cc @richard

Move collector.torproject.org to serve files stored in object storage
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41450
2024-01-04T19:33:19Z
Hiro
In https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416 we have discussed how we can move the tarballs from metrics-store-01, and those collector creates, to object storage.
For metrics-store-01 we can just move the files, and once we have the bucket, we can just update the links in the wiki where we list our archives.
For collector we need a way for people to browse the archives and download tarballs recursively if needed. I am thinking that we should preserve what we serve on collector.tpo, just have the links point to the buckets.
Once this is done, we can also discuss how we could generate the tarballs and move them to minio.

estimate hardware requirements to host collector and metrics store in object storage / minio
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41449
2024-03-26T15:44:07Z
anarcat
In #41416, we have agreed to start moving storage from a filesystem into object storage for collector and metrics-store-01. This involves creating a separate bucket for each service and access tokens for each (which is easy enough) but we also need to consider the impact on the object storage server, since this is kind of a big deal.
Right now, the storage usage is as follows:
| machine | used | free |
|----------------|---------|---------|
| colchicifolium | 819GiB | 1.65TiB |
| collector-02 | 55GiB | 255GiB |
| metrics-store | 742GiB | 1.54GiB |
| **total** | 1.51TiB | 3.14TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=colchicifolium.torproject.org&var-instance=collector-02.torproject.org&var-instance=metrics-store-01.torproject.org&from=now-1y&to=now&refresh=5s
Note that the total includes all disk partitions, including `/`, so it might inflate the total a bit.
We need to figure out whether we can host this in the current object storage infrastructure, including backups (#41415), and if not, how much it will cost to deploy new resources to do so.
/cc @lavamind
anarcat

datacenter evacuation / replacement options
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41448
2024-01-19T18:57:24Z
anarcat
First off, we are *not* currently planning to migrate, replace, or evacuate our presence at Hetzner or any other provider. That is a massive undertaking that we would not want to embark on without a significant cost/benefit analysis. The last time we evaluated this (#41374), we decided to stay.
That being said, it seems to me worthwhile to keep an eye out for other ... *opportunities* in hosting servers, specifically in Europe, but this could also include locations in Asia or South America. The point is to have diversity here.
Our [hardware requirements](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/doc/hardware-requirements) have been expanded to cover for hosting requirements found during the Cymru migration (#40897).
So this issue is to keep track of such ideas as they come up. Ideas should be documented as (possibly internal) comments.
(next) cluster scaling
anarcat

track SSH logins by SSH key instead of usernames
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41447
2023-12-14T18:51:06Z
anarcat
We have a handful of SSH services that all operate on the same UNIX users: `git@git.tpo` is the typical one, but I believe this also applies to `git@gitlab.tpo`. It certainly applies to root accounts as well.
Normally, when you log in to a server, PAM adds an entry to the `utmp` "log" keeping track of your terminal, IP address and username, and how long you're logged in (in `wtmp`). For those servers, this information is close to useless and makes audits cumbersome because you actually need to go through `auth.log` and reverse-map SSH keys instead.
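To make the manual audit concrete, here is a minimal sketch of that reverse-mapping. It assumes the usual OpenSSH "Accepted publickey for ..." lines in `auth.log`; the `logins_by_key` helper and the fingerprint-to-person mapping are hypothetical (such a mapping could be built by hand from the output of `ssh-keygen -lf` on the relevant `authorized_keys` file).

```python
import re
from collections import Counter

# Typical sshd line:
# "... sshd[1234]: Accepted publickey for git from 192.0.2.10 port 51234 ssh2: ED25519 SHA256:..."
ACCEPTED = re.compile(
    r"Accepted publickey for (?P<user>\S+) from (?P<ip>\S+) port \d+ ssh2: "
    r"(?P<keytype>\S+) (?P<fingerprint>SHA256:\S+)"
)


def logins_by_key(lines, known_keys):
    """Count successful logins per (unix user, key owner).

    `known_keys` maps a key fingerprint to a person; fingerprints that are
    not in the mapping are reported as "unknown".
    """
    counts = Counter()
    for line in lines:
        match = ACCEPTED.search(line)
        if match is None:
            continue  # not a successful public key login
        owner = known_keys.get(match["fingerprint"], "unknown")
        counts[(match["user"], owner)] += 1
    return counts
```

This only answers "who logged in as the shared user", not how long sessions lasted, which is part of why a proper PAM-level solution is more attractive.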
Friends wrote the [ssh-key-wtmp](https://git.autistici.org/ai3/tools/ssh-key-wtmp) PAM plugin which does this. It's not packaged in Debian, it's a bunch of golang that *might* be packageable however, even though it vendors a bit of code.
The way that thing works is it hooks into PAM and writes better logs in a separate log file. It also logs the IP address used in the connection, alongside a Maxmind GeoIP and Tor exit list lookup.

migrate gitlab-02 to new gnt-dal cluster
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41431
2023-12-07T15:22:34Z
anarcat
we're going to host more and more gitlab stuff in object storage (e.g. #41425) and already have runners there. it makes sense to move gitlab-02 to the new gnt-dal cluster, which has faster disks and more powerful CPUs.
this should help us deal with the current overload in the gnt-fsn cluster as well (incident #41429).

review TPA-RFC policy in the face of criticism
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41428
2024-02-20T16:08:20Z
anarcat
https://jacobian.org/2023/dec/1/against-rfcs/ has excellent points against RFCs, we should review it and its followup, https://jacobian.org/2023/dec/5/how-to-decide/
anarcat

design and implement backup strategy for MinIO buckets or the entire server
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41415
2024-01-19T18:57:23Z
anarcat
We're considering using MinIO for more and more things, mainly GitLab (artifacts storage in #41403 and gitaly backups in #40518) but possibly others (e.g. metrics storage in tpo/network-health/metrics/collector#40023).
Right now, we don't have any backups of that server, which is probably fine: we only store container images there, which can be regenerated in case of a catastrophe. But if we start storing gitaly backups and gitlab artifacts, it needs to be permanent now.
Research how backups can be performed, develop a policy and implement it.
Next steps:
* [x] research articles anarcat found on the topic (see wallabag)
* [x] discuss the idea in the network
* [x] decide if we want this per bucket or per site
* [ ] write up a proposal
* [ ] implement proposal
* [ ] document and test backup/restore procedures
anarcat

Internal Server Error (500) when attempting to access a ticket
https://gitlab.torproject.org/tpo/tpa/anon_ticket/-/issues/61
2024-03-21T15:12:23Z
cypherpunks
When I attempt to access a ticket, I am redirected to an empty page with the plain text:
"Error 500 Internal Server Error"
in the top left page corner
Alexander Færøy ahf@torproject.org

fail2ban ineffective on submit-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41412
2023-11-22T18:01:51Z
anarcat
We're seeing repeated failed authentication attempts in the postfix logs and they do not seem to get picked up by fail2ban, investigate.

monitor GitLab's incoming email processing
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41410
2023-11-21T21:39:07Z
anarcat
In #41409, incoming email stopped being processed by GitLab. No alarm was raised, and only because @boklm noticed did we even know we need to do something.
We should monitor the number of mails in /srv/mail/git@gitlab.torproject.org/Maildir/. If it's above zero for, say, two minutes, a flag should be raised. We should also check the age of that mailbox so that it's not older than, say, a week or so, to confirm that email is coming in as well, although this is a poor replacement for end-to-end testing...

Consider changing project location for non-tpa projects
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41405
2024-02-19T15:05:27Z
micah micah@torproject.org
We have a few projects, and are likely to get more, that are missing a good place to call home in gitlab, because they don't have a better place to go. Because of this, they end up as projects in tpo/tpa:
https://gitlab.torproject.org/tpo/tpa/triage-ops
https://gitlab.torproject.org/tpo/tpa/renovate-cron
https://gitlab.torproject.org/tpo/tpa/base-images(?)
As part of the Hackweek Collaborative editing project, @meskio made https://gitlab.torproject.org/meskio/archivist and it also needs a home outside of his personal project space. In thinking about where it could go, I started to wonder if there might be a better place than just tossing all these projects into the tpa space, and pinky promising that they aren't TPA's responsibility, even though they are there.
What if we made a different group for these projects, under `/tpo`, and made that the home for this kind of stuff instead? Possible names could be `/tpo/automation`, `/tpo/bots`, `/tpo/ai`, `/tpo/robotinvasion`, or something more clever that you come up with :grinning:
Curious to hear what tpa's thoughts are on this, or should we just push the @meskio project into `/tpa/archivist`?
anarcat