The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
2024-02-13T16:04:39Z

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40421
enhance incident response procedures
2024-02-13T16:04:39Z
anarcat

today we had an ... interesting situation with the puppet infrastructure. while we have actually recovered pretty well, all things considered, it would be important to enhance our response to such situations so that they are less stressful and, why not, even more "fun", if i can be so daring.
some background reading:
* [Got game? Secrets of great incident management](https://bitfieldconsulting.com/blog/got-game-secrets-of-great-incident-management)
* [pager duty incident response documentation](https://response.pagerduty.com/)
some ideas:
* have an issue template for incidents (so, in git, which requires a git repository here, but maybe it's finally time to merge the wiki repo here anyways), available offline
* run simulations/games
* have post-mortem templates, here's the [pager duty template](https://response.pagerduty.com/after/post_mortem_template/)
* gitlab has some [incident management primitives](https://docs.gitlab.com/ee/operations/incident_management/) including aforementioned "[incidents](https://docs.gitlab.com/ee/operations/incident_management/incidents.html)" (which are really just issues)...
* ... but also [integrations](https://docs.gitlab.com/ee/operations/incident_management/integrations.html) which is especially interesting considering they have *native* Prometheus integration, which might require switching from nagios to prometheus (#29864)
anyways, the core idea here is:
1. have incident roles (note-taker, driver, comms, etc)
2. incident and post-mortem templates
3. run games

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40416
long server names crash the backup server
2022-04-07T16:02:52Z
anarcat

in tpo/tpa/team#40364 I went a little overboard and created a server named:
static-gitlab-shim-source.torproject.org
I even thought of adding a -01 in there. That short name (`static-gitlab-shim-source`) is 25 characters long which leads to a label on the backup server that crashes Bacula:
Sep 24 17:14:45 bacula-director-01 bacula-dir[1467]: Config error: name torproject-static-gitlab-shim-source.torproject.org-full.${Year}-${Month:p/2/0/r}-${Day:p/2/0/r}_${Hour:p/2/0/r}:${Minute:p/2/0/r} length 130 too long, max is 127
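The length reported in that log line can be reproduced before a host is deployed. A minimal sketch in Python, assuming the label template is exactly the one shown in the error message (the `torproject-` prefix, the `-full.` suffix and the date variables are copied from the log); the function names are hypothetical and none of this is part of Bacula:

```python
# Label template copied from the Bacula error message; only the host name
# varies. The reported length (130) matches the string with its date
# variables left unexpanded, so that is what gets measured here.
LABEL_TEMPLATE = (
    "torproject-{fqdn}-full."
    "${{Year}}-${{Month:p/2/0/r}}-${{Day:p/2/0/r}}"
    "_${{Hour:p/2/0/r}}:${{Minute:p/2/0/r}}"
)
MAX_NAME_LENGTH = 127  # "max is 127" in the log


def label_length(fqdn: str) -> int:
    """Length of the backup label generated for a fully qualified host name."""
    return len(LABEL_TEMPLATE.format(fqdn=fqdn))


def label_fits(fqdn: str) -> bool:
    """True if the label stays within the limit reported by Bacula."""
    return label_length(fqdn) <= MAX_NAME_LENGTH


if __name__ == "__main__":
    for host in ("static-gitlab-shim-source.torproject.org",
                 "static-shim-source.torproject.org"):
        print(host, label_length(host), label_fits(host))
```

With this template, short names of up to 22 characters fit under `.torproject.org`, so a check like this could run wherever new host names are chosen.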
Now: maybe I should have used a shorter server name (and I have since retired the box). But it seems to me that a single server with a bad configuration shouldn't hang the entire backup server.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40405
consider disabling read/write work queues on SSD devices
2024-02-08T16:21:05Z
anarcat

seems like we could get a significant (twofold, [according to cloudflare](https://blog.cloudflare.com/speeding-up-linux-disk-encryption/)) performance improvement on SSD drives if we disable "work queues" in dm-crypt, by specifying `no-read-workqueue` and `no-write-workqueue` in `/etc/crypttab`. this is available with kernels starting with Linux 5.9, so maybe this needs to wait until the bullseye upgrade, however.
The [arch wiki](https://wiki.archlinux.org/) has [good documentation on how to enable this][docs].
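Since the options depend on the kernel version, a rollout could check each host first. A rough sketch in Python, assuming only the 5.9 threshold stated above; the function name is hypothetical, the parsing only looks at the leading `major.minor` of the release string, and it does not check the cryptsetup userspace tools:

```python
import platform
import re

# Minimum kernel version for the dm-crypt no-*-workqueue flags, as stated
# in the issue text.
MIN_KERNEL = (5, 9)


def supports_no_workqueue(release: str) -> bool:
    """Return True if a kernel release string is at least MIN_KERNEL."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    return version >= MIN_KERNEL


if __name__ == "__main__":
    release = platform.release()
    print(release, supports_no_workqueue(release))
```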
[docs]: https://wiki.archlinux.org/title/Dm-crypt/Specialties#Disable_workqueue_for_increased_solid_state_drive_(SSD)_performance

Milestone: Debian 12 bookworm upgrade

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40404
establish policy for email services
2022-04-06T21:00:59Z
anarcat

we've had a few examples recently where I questioned support requests about email aliases, for example #40395, #40391, and especially #40378 vs #40348.
in general, the broad question here is, for email services when should we use one of those services:
* mailman
* forwards (ie. `tor-puppet.git/modules/postfix/files/virtual`)
* schleuder
* RT
* Discourse
* CiviCRM
* GitLab
In other words, there's a *lot* of stuff that can receive and forward email. When should we use which?

Milestone: improve mail services

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40396
Move CSP style attributes into external stylesheets
2022-04-07T16:06:52Z
cypherpunks

Suggested by the Mozilla Observatory https://observatory.mozilla.org/analyze.html?host=torproject.org
> Your current CSP policy allows the use of `'unsafe-inline'` inside of `style-src`. Moving `style` attributes into external stylesheets not only makes you safer, but also makes your code easier to maintain.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40380
mandos monitoring
2023-03-30T01:37:59Z
anarcat

as part of the automate reboots project (#33406), it seems like we could automatically reboot *some* nodes provided that (a) they are correctly set in mandos and (b) that mandos actually works.
this requires monitoring of the individual nodes in mandos, which we do not seem to be currently doing.

Assignee: anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40379
Create onion service for GitLab Pages
2023-10-12T10:51:48Z
Jérôme Charaoui <lavamind@torproject.org>

This might be possible to implement via the nginx reverse proxy, eg. by modifying the `Host:` header.

Assignee: Jérôme Charaoui <lavamind@torproject.org>

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40373
refactor and publish ipsec puppet module
2022-10-29T14:58:06Z
anarcat

Our ipsec puppet module is pretty good, all things considered. It went through many iterations, and actually works pretty well.
It has grown quite complicated, however: the `ipsec::peer` and `ipsec::network` constructs are somewhat confusing, and use `concat` where they could just drop files in `/etc/ipsec.secrets.d` and `/etc/ipsec.conf.d` instead. They also do not support configuring only *one side* of the connexion, which is why `ipsec::client` was written, separately.
So the first task is to rebuild `ipsec::peer` and `ipsec::network` based on `ipsec::client`, possibly getting rid of one of the defines/class.
Then publish this on the Puppet forge.
Alternatively, consider using an existing module, if there's a really enticing option.

Milestone: cleanup and publish the sysadmin codebase

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
multi-year prometheus metrics storage
2024-02-20T16:19:11Z
anarcat

in #31244, we discussed provisioning the prometheus server to store "long-term" prometheus metrics. what that meant wasn't exactly clear: we settled on "one year" (as opposed to the default "30 days"). but now I feel it's not enough. i'd like to have a permanent record of metrics: basically a multi-year account of all metrics (or maybe just critical metrics, but that's harder to figure out).
in the original ticket, we considered the idea of having a secondary server that would scrape metrics from the first, but ended up rejecting this idea. one of the reasons is that then we need Grafana and other tools to talk to multiple datasources and it makes things complicated.
but in [this post](https://www.robustperception.io/looking-beyond-retention), one of the prometheus authors mentions the possibility of the primary prometheus server using the secondary, long-term server as a datasource itself for metrics it doesn't have available.
that way we could have a much smaller (say 30 days!) server that would fetch metrics every N seconds (like now), and a much longer term server that would fetch metrics much less frequently (say every 5 minutes or every hour).
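The two-tier idea can be sized roughly with the capacity formula from the Prometheus storage documentation (disk ≈ retention time × ingested samples per second × bytes per sample). A back-of-the-envelope sketch in Python; the series count, the intervals and the 2 bytes per sample figure are placeholder assumptions, not measurements from the actual servers:

```python
DAY = 86400  # seconds


def disk_bytes(series: int, scrape_interval_s: float, retention_s: float,
               bytes_per_sample: float = 2.0) -> float:
    """Estimate on-disk size: retention * samples per second * bytes per sample."""
    samples_per_second = series / scrape_interval_s
    return retention_s * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    series = 100_000  # hypothetical number of active time series
    # Short-term server: frequent scrapes, 30 days of retention.
    short_term = disk_bytes(series, scrape_interval_s=15, retention_s=30 * DAY)
    # Long-term server: one sample every 5 minutes, kept for 5 years.
    long_term = disk_bytes(series, scrape_interval_s=300,
                           retention_s=5 * 365 * DAY)
    print(f"short-term: {short_term / 1e9:.1f} GB")
    print(f"long-term:  {long_term / 1e9:.1f} GB")
```

Plugging in the real series count and the chosen scrape intervals would give a first estimate for each server.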
calculate what the requirements would be based on the current and projected metrics counts, retention period and scrape frequencies.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40310
setup an arm232 CI builder on OSUOSL infra
2024-02-08T19:27:24Z
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40306
[Wishlist] Improve workflow to prevent GPG subkey to expire again next time
2022-04-07T16:08:57Z
Roger Shimizu

For issues like #40115 and #40299,
we have already experienced GPG subkey expiration a few times in the past.
This caused downstream projects like torbrowser-launcher to fail to install TorBrowser, because they check the integrity of the downloaded file with GPG.
So I'm wondering whether you can make GPG subkey refresh part of your regular workflow, or add a timer notification.
It's not urgent, but it would be better to fix this before the next expiration date, Jan 04 2022.
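A timer notification could be as simple as a periodic job that reads the key listing and warns ahead of time. A minimal sketch in Python, assuming the machine-readable format printed by `gpg --with-colons --list-keys` (record type in field 1, key ID in field 5, expiration date as seconds since the epoch in field 7); the function name, the sample listing and its key ID are made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical listing: one subkey expiring on 2022-01-04 00:00 UTC.
SAMPLE_LISTING = (
    "pub:u:4096:1:AAAAAAAAAAAAAAAA:1609459200:::u:::scESC::::::23::0:\n"
    "sub:u:4096:1:0123456789ABCDEF:1609459200:1641254400:::::s::::::23:\n"
)


def expiring_subkeys(colon_listing: str, now: datetime, within_days: int = 30):
    """Return (key_id, expiry) for subkeys expired or expiring within the window."""
    deadline = now + timedelta(days=within_days)
    found = []
    for line in colon_listing.splitlines():
        fields = line.split(":")
        # Only "sub" records with a numeric expiration date are considered;
        # an empty field 7 means the subkey does not expire.
        if len(fields) < 7 or fields[0] != "sub" or not fields[6].isdigit():
            continue
        expiry = datetime.fromtimestamp(int(fields[6]), tz=timezone.utc)
        if expiry <= deadline:
            found.append((fields[4], expiry))
    return found


if __name__ == "__main__":
    now = datetime(2021, 12, 20, tzinfo=timezone.utc)
    for key_id, expiry in expiring_subkeys(SAMPLE_LISTING, now):
        print(f"subkey {key_id} expires on {expiry:%Y-%m-%d}")
```

In a real job the listing would come from the `gpg` command and the result would feed whatever notification channel is preferred.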
Thank you!

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40301
Bounce messages showing everyone on alias (security issue for job aliases?)
2023-03-22T18:11:57Z
al smith

An email on the jobs-fundraising@tpo email alias was entered incorrectly, and applicants received a bounce message that showed all members of the email alias, not just the incorrect email. Is this a security issue to investigate?

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40257
Send notification to #tor-project after bad relay related dir-auth update
2022-04-06T21:11:36Z
Georg Koppen

When someone makes changes to the `dir-auth` repo `#tor-internal` gets notified, which is great. Could we get a git hook or something that sends a notification to `#tor-project` saying `Sebastian, weasel, micah, arma1, stefani: dirauth update. Thanks!` so the directory authorities get a ping once they need to take an action regarding bad relays.
Right now I am copying and pasting that by hand which kind of works but could benefit from automation. I am not sure what git hooks can do but as we track bad relay work in Gitlab and mention tickets in the commit message we could try making use of that to know when to send the notification to IRC and when not.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40216
Add Matrix alerts to Prometheus AlertManager
2023-09-29T17:05:56Z
irl

We currently send emails from the Prometheus AlertManager which is great as long as those emails are read by the right person in a timely manner. There are some issues though:
* mail may be sent with unencrypted transport (containing sensitive log information)
* difficult to update list of recipients
* no easy place to see history of alerts
* if the mail server is down, you get no alerts
Some effort has recently been made to start using Matrix inside Tor, and this seems like an opportunity to move with that momentum and solve some or all of the above issues.
The webhook receiver used in the other project I mentioned is: https://github.com/jaywink/matrix-alertmanager.
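For context, a receiver of this kind takes the JSON payload that Alertmanager posts to a webhook and turns it into chat messages. A minimal sketch of that formatting step in Python, assuming the documented Alertmanager webhook payload shape (a top-level `status` and an `alerts` list with `labels` and `annotations`); the function name and the example payload are hypothetical, and this is not the matrix-alertmanager code:

```python
def format_alerts(payload: dict) -> list:
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # Fall back to the payload-level status if the alert has none.
        status = alert.get("status", payload.get("status", "unknown")).upper()
        line = "[{}] {} on {}".format(
            status,
            labels.get("alertname", "unknown"),
            labels.get("instance", "unknown"),
        )
        summary = annotations.get("summary", "")
        if summary:
            line += ": " + summary
        lines.append(line)
    return lines


# Hypothetical payload, trimmed to the fields used above.
EXAMPLE_PAYLOAD = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "DiskFull", "instance": "host.example.org"},
            "annotations": {"summary": "disk is 95% full"},
        }
    ],
}

if __name__ == "__main__":
    for message in format_alerts(EXAMPLE_PAYLOAD):
        print(message)
```

Each resulting line would then be posted to a Matrix room, which also gives a browsable history of alerts.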
If desirable, I could write an Ansible role to run this as a systemd user service on a TPA machine as we have done for Metrics services in the past, or you could write some Puppet to do the same.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40202
can't send email to state.gov
2024-01-22T16:34:28Z
anarcat

writing to USER@state.gov gives us this error:
```
<REDACTED@state.gov>: TLSA lookup error for christopher-ew.state.gov:25
```
it's actually from multiple endpoints, my home server and riseup also see this, so this is actually an error with state.gov, i would argue... still worth taking a look.
/cc @gaba
battle plan:
* [x] <del>confirm with state.gov folks that emails are failing because they check the eugeni TLS cert</del> state.gov is unwilling to provide more information, but we'll just go with that assertion, as it seems fair that our MX should provide publicly verifiable certificates in the standard CA infrastructure (on top of DNSSEC checks)
* [ ] if so, establish a plan to rebuild a MX with "real" TLS certificates, which is now documented in the [roadmap](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/roadmap/2021)
* [ ] bypass DNSSEC checks for state.gov so *we* can send mail there
* [ ] bring up their misconfiguration on DNSSEC forums (optional)

Milestone: improve mail services
Assignee: anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40168
track and respond to email spam complaints systematically
2022-04-06T21:00:58Z
anarcat

Right now we get complaints about spam to postmaster@tpo but do not necessarily act on those. Worse, there might be places where we just don't get notifications because we do not register to other providers' interfaces.
Some ideas:
* subscribe to <https://fbl.returnpath.net/>
* register on [Google's postmaster tools](https://gmail.com/postmaster/)
* try to figure out whatever is going on with Outlook (see https://gitlab.torproject.org/tpo/tpa/team/-/issues/33037#note_2725160)
* use some automation to measure feedback, for example [feedback-loop](https://git.autistici.org/ai3/tools/feedback-loop)
We already have improved our Prometheus metrics and Grafana dashboards as part of #33037, so there's already that, but work remains to be done to ensure we have good delivery.
This is part of the 2021 roadmap.

Milestone: improve mail services

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40163
evaluate and reduce server's power usage
2023-03-14T17:51:28Z
anarcat

While we do not directly control our physical infrastructure, we still do use actual power, which has an environmental cost and, therefore, is part of the major existential threat facing humanity at the peak of its history. Reducing power usage is not only an economic incentive, it's an existential necessity.
The first step is to do monitoring, however. I found out about a project called [scaphandre](https://github.com/hubblo-org/scaphandre) which provides Prometheus metrics and Grafana dashboards for actual power usage on physical servers. While that may not cover our machines "in the cloud", it may work on our physical hardware.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40129
user management procedures are poorly documented
2023-10-20T18:57:06Z
anarcat

as identified by @arma in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40126#note_2721379, it's not really clear how to actually create and remove accounts. we do have https://gitlab.torproject.org/tpo/tpa/team/-/issues/32519 which concerns the overall onboarding/offboarding process, but the actual nitty-gritty details of *how* to do things for sysadmins are really badly documented. in https://gitlab.torproject.org/tpo/tpa/team/-/issues/40126#note_2721468, i noted:
> This documentation seems to be a total mess. There is:
>
> * [howto/new-person](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/new-person) which you have found and seems to document how to get a new *sysadmin* on board
> * [doc/accounts](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/doc/accounts) which documents "accounts" in general, and is more targeted at users
> * [howto/create-a-new-user](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/create-a-new-user) actually documents how to create a new user
> * [howto/ldap](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap) which documents "LDAP" in general and has a rather poor user-facing documentation and is mostly targeted about running the service
> * and then of course userdir-ldap-cgi has [its own inline documentation](https://db.torproject.org/) maintained as HTML/Perl templates shipped with the debian package and managed through git.
>
> Someone(tm) needs to sit down and make sense of this. I kind of made matters worse myself by creating howto/ldap and howto/new-person of course... :( so I guess i'm probably that someone.
So the task here is to merge or split or cleanup those pages so that one doesn't get lost like @arma did. Here it's not a matter of policy, it's just about creating a cohesive documentation. I suspect the following should happen, but this is just a first brainstorm and i'm open to suggestions:
- [ ] [howto/new-person](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/new-person) - should be merged into another page, a special section in create-new-user maybe? or renamed to "new-admin"?
- [ ] [doc/accounts](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/doc/accounts) - merge with create-a-new-user?
- [ ] [howto/create-a-new-user](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/create-a-new-user) - merge with howto/ldap? but keep in mind there are things about sudo in there
- [ ] [howto/ldap](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/ldap) - should this take over the userdir-ldap-cgi documentation below and cover *everything*?
- [ ] userdir-ldap-cgi has [its own inline documentation](https://db.torproject.org/) - maybe deprecate this and point to the wiki?
TBD.
Also note that our [retirement procedures](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/retire-a-user) are *also* fairly inadequate and would need much love. this was supposed to be covered by #32519 but was somehow overlooked... :(

Assignee: anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40124
Move auth to Grafana instead of puppet
2024-01-30T15:20:51Z
Hiro

This ticket tracks moving authentication out of puppet for Grafana.
In #40102 we created a shared account for people accessing Grafana. This doesn't really scale and it would be nice if we could create accounts in Grafana directly.

Assignee: anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40116
disable TLS 1.0 and 1.1
2024-03-25T20:05:50Z
weasel (Peter Palfrader)

ssllabs now gives bad grades for servers that even offer TLS 1.0 and 1.1. Modern browsers have deprecated TLS 1.0 and 1.1.
Re support see also:
* https://en.wikipedia.org/wiki/Transport_Layer_Security#Applications_and_adoption
* https://caniuse.com/tls1-1
* https://caniuse.com/tls1-2
And it seems 1.2 has been around quite long. I propose we stop offering TLS 1.0 and 1.1 on our webservers.
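The actual change belongs in the web server configuration, which is not shown here. Purely as an illustration of the policy, this is how the same floor looks in Python's standard `ssl` module; the function name is hypothetical and the sketch says nothing about how the production servers are set up:

```python
import ssl


def make_server_context() -> ssl.SSLContext:
    """Build a server-side TLS context that refuses TLS 1.0 and 1.1."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Anything older than TLS 1.2 is rejected during the handshake.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    context = make_server_context()
    print(context.minimum_version.name)
```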