# TPA issues

https://gitlab.torproject.org/groups/tpo/tpa/-/issues (feed updated 2022-07-18T12:54:32Z)

## [retire zammad-01.torproject.net on June 27 2022](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40592)

*anarcat, updated 2022-07-18*

As part of the cdr.link evaluation (tpo/tpa/team#40578), it was agreed we would set up a "hands off TPA" machine inside our account on Hetzner cloud. the machine is in a separate "non-TPA" project in Hetzner cloud, and hosted under the zammad-01.torproject.net domain name.
It should be destroyed on June 1st 2022, but a heads-up should be sent earlier, obviously. The retirement checklist is basically:
1. [x] send a notification in advance, wait a week
2. [x] shutdown the VM, wait a week
3. [x] destroy the VM on July 7th (see the `hcloud` sketch below)
4. [x] remove it from the spreadsheet
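Since the machine lives in Hetzner cloud, steps 2 and 3 map to two CLI calls; a minimal sketch, assuming the `hcloud` CLI is configured with a token for the non-TPA project and the server is literally named `zammad-01` (both assumptions):

```
# step 2: graceful shutdown, then wait a week before going further
hcloud server shutdown zammad-01
# step 3: after the waiting period, destroy the VM for good
hcloud server delete zammad-01
```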
Marked as due mid-May to send a heads-up then.

*Jérôme Charaoui <lavamind@torproject.org> · due 2022-07-07*

## [failure to create SAN-backed VM in gnt-chi](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40775)

*anarcat, updated 2022-07-18*

me and @lavamind tried to create a VM in the gnt-chi cluster, backed by the SAN, and we couldn't figure it out. this was for tpo/tpa/team#40683.
at first, the problem was this:
```
09:36:12 <lavamind> aaargh
09:36:21 <lavamind> it created a 150G root partition
09:36:31 <lavamind> yeah thats not good
```
then anarcat tried to create GPT partitions on the device and have ganeti adopt them:
```
gnt-instance add -n chi-node-01.torproject.org -o debootstrap+bullseye -t blockdev --no-wait-for-sync --net 0:ip=pool,network=gnt-chi-01 --no-ip-check --no-name-check --disk 0:adopt=/dev/disk/by-id/dm-name-telegram-bot-01 --backend-parameters memory=8g,vcpus=2 telegram-bot-01.torproject.org
gnt-instance shutdown --timeout=0 telegram-bot-01.torproject.org
gnt-instance reinstall telegram-bot-01.torproject.org
```
and that didn't work at all: it failed with
```
device-mapper: reload ioctl on telegram-bot-01-1 failed: No such device or address
create/reload failed on telegram-bot-01-1
mke2fs: No such file or directory while trying to determine filesystem size
```
that seems to be a problem in the [patch lavamind submitted to work with the SAN](https://github.com/ganeti/instance-debootstrap/pull/17). after hot-fixing this, the script would still fail with:
```
Re-reading the partition table failed.: Invalid argument
```
it was found that the partition was being recreated by the install script, specifically in the `create` hook, because of the default `PARTITION_STYLE=msdos`.
then we tried with `PARTITION_STYLE=none` in `/etc/default/ganeti-instance-debootstrap`. @anarcat also ran `partprobe` because that was recommended by `sgdisk`, but that turned out to be a bad idea because it added a bunch of irrelevant mappings everywhere.
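For clarity, that change is a single variable in the defaults file named above; a minimal sketch of the relevant stanza:

```
# /etc/default/ganeti-instance-debootstrap
# "msdos" (the default) makes the create hook (re)partition the disk;
# "none" leaves the adopted SAN device alone
PARTITION_STYLE=none
```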
with `PARTITION_STYLE=none`, the VM does go through the install, but somehow fails silently. after the failed install, it's in the `ADMIN_down` state. it's unclear why it fails, because all hooks complete successfully. last lines of the install log:
```
I: swap configuration hook in /etc/ganeti/instance-debootstrap/hooks/swap
Only one disk found, creating a 512M /swapfile instead
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.747554 s, 718 MB/s
mkswap: /tmp/tmp.seBabqJnRb/swapfile: insecure permissions 0644, 0600 suggested.
Setting up swapspace version 1, size = 512 MiB (536866816 bytes)
no label, UUID=ab94d001-901b-49e1-badf-8fc06966e554
I: make /tmp a tmpfs
```
notice how "only one disk" is "found" above... that's not a great sign. after the install, also, the partition table on the device is completely empty, which is reasonable because ... well, that's what we asked for.
a possible fix is to do *multiple* `--disk adopt`... stanzas. This is how web-chi-03 (#40193) was set up (see https://gitlab.torproject.org/tpo/tpa/team/-/issues/40131#note_2728663) and could serve as a simple workaround for now, which probably doesn't even require hacking at the PARTITION_STYLE. the downside, of course, is significant confusion because you have a partition setup at the SAN layer and then, on the first partition, *another* MSDOS partition setup. a little odd.
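A sketch of what that multi-disk adoption could look like, based on the command above; the `-root`/`-swap` volume names are hypothetical, and the SAN volumes would have to be created first:

```
# adopt two pre-created SAN volumes instead of one big disk; ganeti
# then installs onto disk 0 and leaves disk 1 alone
gnt-instance add -n chi-node-01.torproject.org -o debootstrap+bullseye \
  -t blockdev --no-wait-for-sync \
  --net 0:ip=pool,network=gnt-chi-01 --no-ip-check --no-name-check \
  --disk 0:adopt=/dev/disk/by-id/dm-name-telegram-bot-01-root \
  --disk 1:adopt=/dev/disk/by-id/dm-name-telegram-bot-01-swap \
  --backend-parameters memory=8g,vcpus=2 \
  telegram-bot-01.torproject.org
```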
there's [an issue upstream about GPT support](https://github.com/ganeti/instance-debootstrap/issues/5) but in my opinion, it's kind of a distraction... regardless of the partition format, ganeti-instance-debootstrap should be able to handle partitioned drives correctly...
right now it looks like it just wipes whatever you give it, with nothing (if PARTITION_STYLE=none) or with an MSDOS partition (if PARTITION_STYLE=msdos). one has to wonder why it bothers with partitioning in the first place, in a sense...
update: we deliberated quite a bit on design here, and here's the checklist we came up with in the end.
* [ ] rewrite `tpo-create-san-disks` in Python
* [ ] add support for handling multipath configuration across the cluster
* [ ] update the ganeti node creation to mention copying the config from a previous node in the gnt-chi cluster
* [ ] update the instance gnt-chi creation docs to use a single disk by default (and warn that swapfile should be resized after install if we worry about memory usage)
* [ ] summarize this ticket and design decisions in a wiki page ... somewhere? possibly howto/new-machine-cymru?

*Jérôme Charaoui <lavamind@torproject.org> · due 2022-07-13*

## [recurring OOM issue on materculae](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40815)

*Hiro, updated 2023-02-27*

We have a recurring OOM issue on materculae where every couple of days I have to restart postgresql. I am not sure why this is happening lately - I have never touched this service. Is it related to the postgresql upgrade? :shrug:

*anarcat · due 2022-07-27*

## [OOM issue on meronense after upgrade](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40814)

*Hiro, updated 2024-02-02*

Noticed metrics.tpo is not getting all its updates since postgresql has been upgraded to v13.
I have started the script manually: https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/website/run-web.sh
And found out the job was being killed:
```
[308908.109696] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-4020.scope,task=java,pid=375579,uid=1512
[308908.109723] Out of memory: Killed process 375579 (java) total-vm:14411748kB, anon-rss:7917568kB, file-rss:0kB, shmem-rss:32kB, UID:1512 pgtables:23120kB oom_score_adj:0
```
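As a stopgap while the root cause is unknown, the job could be run under a memory cap so the kernel reclaims it before it starves postgresql; a sketch assuming systemd is available, with the 6G limit picked arbitrarily:

```
# run the metrics job in a transient scope capped at 6G; if it overruns,
# only the scope is OOM-killed, not postgres
systemd-run --scope -p MemoryMax=6G ./run-web.sh
```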
cc: @gk

*anarcat · due 2022-07-27*

## [TPA-RFC-26: LimeSurvey upgrade](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40808)

*Jérôme Charaoui <lavamind@torproject.org>, updated 2022-08-25*

Our current LimeSurvey instance at https://survey.torproject.org is to be migrated to LimeSurvey 5.
[TPA-RFC-26](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-26-limesurvey-upgrade) was published and outlines the steps that should be taken by TPA and survey authors in this context.
Questions and comments may be posted in this issue for discussion.

*milestone: Debian 11 bullseye upgrade · Jérôme Charaoui <lavamind@torproject.org> · due 2022-08-08*

## [check if btcpayserver needs an upgrade before mid august](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40836)

*anarcat, updated 2022-08-16*
see also #40763.

*anarcat · due 2022-08-10*

## [retire subnotabile](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40810)

*anarcat, updated 2022-08-29*

* [x] Migrate remaining surveys to `survey-01`
* [x] Shutdown apache2 and puppet-run on `subnotabile`
* [x] Change vhost domain from `survey-new.torproject.org` to `survey.torproject.org` on `survey-01`
* [x] Configure onion service for `survey.torproject.org` on `survey-01` (see the torrc sketch after this list)
* [x] Cleanup `survey-new.torproject.org` certificate
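The onion-service step boils down to a standard torrc stanza on `survey-01`; a minimal sketch, where the service directory and backend port are assumptions:

```
# /etc/tor/torrc on survey-01 (sketch; directory and port are assumptions)
HiddenServiceDir /var/lib/tor/survey_onion/
HiddenServicePort 80 127.0.0.1:80
```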
1. [x] notification
2. [x] remove from nagios
3. [x] stop VM
4. [x] retire host
5. [x] remove from LDAP
6. [x] grep
7. [x] tor-passwords
8. [ ] ~~DNSwl~~ (N/A)
9. [ ] ~~wiki/nextcloud~~ (N/A)
10. [ ] ~~remove from racks~~ (N/A)
11. [x] reverse DNS

*milestone: Debian 11 bullseye upgrade · Jérôme Charaoui <lavamind@torproject.org> · due 2022-08-31*

## [TPA-RFC-33: consider replacing nagios with prometheus](https://gitlab.torproject.org/tpo/tpa/team/-/issues/29864)

*anarcat, updated 2023-05-17*

As a followup to the Prometheus/Grafana setup started in #29681, I am wondering if we should also consider replacing the Nagios/Icinga server with Prometheus. I have done a little research on the subject and figured it might be good to at least document the current state of affairs.
This would remove a complex piece of architecture we have at TPO that was designed before Puppet was properly deployed. Prometheus has an interesting federated design that allows it to scale to multiple machines easily, along with a high-availability component for the alertmanager that allows it to be more reliable than a traditional Nagios configuration. It would also simplify our architecture, as the Nagios server automation is a complex mix of Debian packages and git hooks that serves us well but is hard to comprehend and debug for new administrators. (I managed to wipe the entire Nagios config myself in my first week on the job by messing up a configuration file.) Having the monitoring server fully deployed by Puppet would be a huge improvement, even if it were done with Nagios instead of Prometheus, of course.
Right now the Nagios server is actually running Icinga 1.13, a Nagios fork, on a Hetzner machine (`hetzner-hel1-01`). It's doing its job generally well, although it feels a *little* noisy, but that's to be expected from Nagios servers. Reducing the number of alerts seems to be an objective, explicitly documented in #29410, for example.
Both Grafana and Prometheus can do alerting, with various mechanisms and plugins. I haven't investigated those deeply, but in general that's not a problem in alerting: you fire some script or API and the rest happens. I suspect we could port the current Nagios alerting scripts to Prometheus fairly easily, although I haven't investigated our scripts in detail.
The problem is reproducing the check scripts and their associated alert thresholds. In the Nagios world, when a check is installed, it *comes* with its own health ("OK", "WARNING", "CRITICAL") thresholds, and TPO has developed a wide variety of such checks. According to the current Nagios dashboard, it monitors 4612 services on 88 hosts (which is interesting considering LDAP thinks there are 78). That looks terrifying, but it's actually a set of 9 commands running on the Nagios server, including the complex `check_nrpe` system, which is basically a client-side nagios that has its own set of checks. And that's where the "cardinality explosion" happens: on a typical host, there are 315 such checks implemented.
That's the hard part: convert those 324 checks into Prometheus alerts, one at a time. Unfortunately, there are no "built-in" or even "third-party" "prometheus alert sets" that I could find in my [original research](https://anarc.at/blog/2018-01-17-monitoring-prometheus/), although that might have changed in the last year.
Each check in Prometheus is basically a YAML file describing a Prometheus query that, when it evaluates to "true" (e.g. disk_space > 90%), sends an alert. It's not impossible to do that conversion, it's just a lot of work.
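To make that concrete, here is a minimal sketch of one such alerting rule, written against the standard node-exporter metrics (the threshold, duration, and labels are illustrative, not our actual configuration):

```
groups:
  - name: disk
    rules:
      - alert: DiskAlmostFull
        # fires when a filesystem has been more than 90% full for 15 minutes
        expr: |
          (node_filesystem_size_bytes - node_filesystem_avail_bytes)
            / node_filesystem_size_bytes > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: {{ $labels.mountpoint }} over 90% full"
```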
To do this progressively while allowing us to make new alerts in Prometheus instead of Nagios, I suggest we proceed the same way Cloudflare did, which is to establish a "Nagios to Prometheus" bridge, by which Nagios doesn't send the alerts on its own and instead forwards them to the Prometheus server, through a plugin they called [Promsaint](https://github.com/cloudflare/promsaint).
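In its simplest form, such a bridge is just a Nagios/Icinga notification command that POSTs the event to the Alertmanager API instead of mailing it; a rough sketch (this is not Promsaint itself, and the field mapping is illustrative):

```
# notification command posting a firing alert to alertmanager's v2 API;
# the $...$ macros are expanded by Nagios before the shell runs
curl -s -X POST http://alertmanager.example.org:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d "[{
    \"labels\": {
      \"alertname\": \"$SERVICEDESC$\",
      \"instance\": \"$HOSTNAME$\",
      \"severity\": \"critical\"
    },
    \"annotations\": {\"summary\": \"$SERVICEOUTPUT$\"}
  }]"
```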
With the bridge in place, Nagios checks can be migrated into Prometheus alerts progressively without disruption. Note that Cloudflare documented their experience with Prometheus in [this 2017 promcon talk](https://promcon.io/2017-munich/talks/monitoring-cloudflares-planet-scale-edge-network-with-prometheus/). Cloudflare also made an alert dashboard called [unsee](https://github.com/cloudflare/unsee) (see also the fork called [karma](https://github.com/prymitive/karma)) and [elasticsearch integration](https://github.com/cloudflare/alertmanager2es) which might be good to investigate further.
Another useful piece is this [NRPE to Prometheus exporter](https://www.robustperception.io/nagios-nrpe-prometheus-exporter), which allows Prometheus to directly scrape NRPE targets. It doesn't include Prometheus alerts and instead relies on a Grafana dashboard to show possible problems, so I don't think it's that useful an alternative. There's a [similar approach using check_mk](https://github.com/m-lab/prometheus-nagios-exporter) instead.
Another possible approach is to send alerts from Nagios based on Prometheus checks, using the [Prometheus nagios plugins](https://github.com/prometheus/nagios_plugins). This might allow us to get rid of NRPE everywhere but it would probably be useful only if we do want to keep Nagios in the long term and remove NRPE in favor of the existing Prometheus exporters.
So, the battle plan is basically this (a configuration sketch follows the list):
1. `apt install prometheus-alertmanager`
2. reimplement the Nagios alerting commands
3. send Nagios alerts through the alertmanager
4. rewrite (non-NRPE) commands (9) as Prometheus alerts
5. optionally, scrape the NRPE metrics from Prometheus
6. optionally, create a dashboard and/or alerts for the NRPE metrics
7. rewrite NRPE commands (300+) as Prometheus alerts
8. turn off the Nagios server
9. remove all traces of NRPE on all nodes
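For steps 1 and 3, the Alertmanager side can start out very small; a skeleton configuration sketch (the addresses are placeholders, not our real routing):

```
# /etc/prometheus/alertmanager.yml (sketch)
route:
  receiver: tpa-email
  group_by: [alertname, instance]
  repeat_interval: 24h
receivers:
  - name: tpa-email
    email_configs:
      - to: admin@example.org
        from: alertmanager@example.org
        smarthost: 'localhost:25'
        require_tls: false
```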
Update: this, obviously, will require more discussion than just implementing the above battle plan, as there isn't a consensus in the team towards Prometheus as a replacement for Icinga. I have assigned TPA-RFC-33 to this and started drafting requirements and personas in #40755.

*milestone: Debian 11 bullseye upgrade · anarcat · due 2022-09-01*

## [Package the latest puppet agent release in Debian](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40741)

*Jérôme Charaoui <lavamind@torproject.org>, updated 2023-06-28*

With Puppet agent 7.x series now supporting ruby 3.0, it should now be possible to update the Debian package in bookworm, the next Debian release. The current maintainer of the [puppet](https://tracker.debian.org/pkg/puppet) Debian package has stated that they do not intend to undertake this task.
I've published a draft attempt at such a package update [here](https://salsa.debian.org/lavamind/puppet-agent) in my personal namespace on Salsa but the idea would be to move it to the puppet-team namespace after review, replacing the current experimental puppet-agent package.
- [x] package puppet agent dependencies
- [x] package puppet agent 7
- [x] experimental upload
- [x] propose removal of old puppet package (anarcat)
- [x] remove old package from debian (anarcat), in progress: removal bug is [1021202](https://bugs.debian.org/1021202)
- [x] upload puppet agent 7 to unstable

*milestone: Debian 12 bookworm upgrade · anarcat · due 2022-09-05*

## [Renew Harica TLS certificate for donate.tpo onion](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40883)

*Jérôme Charaoui <lavamind@torproject.org>, updated 2022-10-13*

Our Harica TLS certificate for onion address `yoaenchicimox2qdc47p36zm3cuclq7s7qxx6kvxqaxjodigfifljqqd.onion` is expiring in 2 weeks. We should renew it.

*Jérôme Charaoui <lavamind@torproject.org> · due 2022-09-16*

## [migration of gettor into rdsys](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40789)

*meskio <meskio@torproject.org>, updated 2023-07-26*

We have reimplemented gettor as part of rdsys. The plan is to put that into production by the end of the month.
The tasks I know of that are needed for this migration:
* [x] create new rdsys-frontend-01 VM with rdsys and gettor users with sudo for all anti-censorship team.
* [x] expose the rdsys-backend (localhost:7100/resources-stream) in apache on polyanthum so that it is reachable from rdsys-frontend-01 (and only from that host); see the apache sketch after this list
* [x] on rdsys-frontend-01, set up a dovecot imap-only mailbox (like gitlab and civicrm) where `gettor@torproject.org` emails arrive (`gettor@torproject.org` emails currently arrive on gettor-01 and are piped to gettor via a postfix pipe script), which implies:
* [x] have an smtp server to send email with the `gettor@torproject.org` email address. ~~Doesn't need to be in the same machine, rdsys has support to do plain auth.~~ should just be localhost delivery, make rdsys-frontend-01 a "mailhost" in puppet
* [x] have a metrics endpoint for prometheus metrics. ~~https://rdsys-frontend.torproject.org/metrics~~ rdsys-gettor.torproject.org/metrics pointed to localhost:7700/metrics
* [x] change the `gettor@torproject.org` forward on eugeni to point to `gettor@rdsys-frontend.torproject.org`
* [x] ~~remove gettor-01 machine as it is not used anymore, needs coordination with anti-censorship team~~ see #40915
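The backend exposure item above would roughly look like this on polyanthum's Apache side; a hedged sketch, with the frontend's address as a placeholder:

```
# on polyanthum: proxy the rdsys backend stream, reachable only from
# rdsys-frontend-01 (placeholder address below)
<Location /resources-stream>
    ProxyPass http://localhost:7100/resources-stream
    Require ip 192.0.2.10
</Location>
```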
The actual move of the gettor@torproject.org email address into its own imap server and the shutting down of gettor-01 need coordination with the anti-censorship team, and can't happen before June 27th, as we'll not be ready on the rdsys side.

*anarcat · due 2022-09-26*

## [reverse DNS broken at cymru](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40908)

*anarcat, updated 2022-10-11*

i just opened a ticket with cymru named "URGENT: reverse DNS for 38.229.82.0/24 broken", it was assigned the ticket number CST-316.
i noticed this while trying to launch gettor-rdsys (#40789), mails would fail to route to eugeni with:
```
host eugeni.torproject.org[49.12.57.136] said: 450 4.7.25 Client host rejected: cannot find your hostname, [38.229.82.36] (in reply to RCPT TO command)
```
and indeed reverse DNS is broken on that IP... hell, here's a copy of the ticket i sent to cymru:
> ```
> anarcat@curie:~$ host 38.229.82.36
> Host 36.82.229.38.in-addr.arpa. not found: 3(NXDOMAIN)
> anarcat@curie:~[1]$
> ```
>
> It looks like the entire zone delegation was removed:
>
> ```
> anarcat@curie:~$ dig -x 38.229.82.36
>
> ; <<>> DiG 9.16.33-Debian <<>> -x 38.229.82.36
> ;; global options: +cmd
> ;; Got answer:
> ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 56474
> ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
>
> ;; OPT PSEUDOSECTION:
> ; EDNS: version: 0, flags:; udp: 1232
> ;; QUESTION SECTION:
> ;36.82.229.38.in-addr.arpa. IN PTR
>
> ;; AUTHORITY SECTION:
> 82.229.38.in-addr.arpa. 3355 IN SOA ns1.cymru.com. empty.empty. 39 3600 600 1209600 3600
>
> ;; Query time: 52 msec
> ;; SERVER: 1.1.1.1#53(1.1.1.1)
> ;; WHEN: Wed Sep 28 10:24:27 EDT 2022
> ;; MSG SIZE rcvd: 114
> ```
>
> 38.229.82.0/24 used to be delegated to tor's nameservers, which are:
>
> ```
> torproject.org. 86400 IN NS ns1.torproject.org.
> torproject.org. 86400 IN NS ns3.torproject.org.
> torproject.org. 86400 IN NS ns4.torproject.org.
> torproject.org. 86400 IN NS ns5.torproject.org.
> torproject.org. 86400 IN NS nsp.dnsnode.net.
> ```
>
> this is causing an outage on our end as servers in that cluster are
> having trouble delivering mail.

*anarcat · due 2022-10-11*

## [TPA-RFC-39: Nextcloud user account policy](https://gitlab.torproject.org/tpo/tpa/nextcloud/-/issues/10)

*anarcat, updated 2022-10-17*

I was surprised to find out that Nextcloud wasn't limited to tor-internal, in tpo/tpa/team#40772 (confidential: some offboarding ticket). if not only tor-internal users can have access, we should clarify who *does* get access, for how long, and who approves that.
/cc @gaba
Discussion ticket for https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-39-nextcloud-account-policy

*Jérôme Charaoui <lavamind@torproject.org> · due 2022-10-13*

## [retire web-chi-03 and web-chi-04 nodes](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40906)

*anarcat, updated 2022-10-18*

we have set up new mirrors to replace web-chi-03 and web-chi-04 in #40904. this ticket is about retiring the latter servers once we are certain there is absolutely no traffic on them.
1. [x] ~~announcement~~ (done), check that traffic is still low on the hosts: https://grafana.torproject.org/d/53QNFNtZz/traffic-per-class?orgId=1&var-class=All&var-node=web-bhs-05.torproject.org:9100&var-node=web-bhs-06.torproject.org:9100&var-node=web-chi-03.torproject.org:9100&var-node=web-chi-04.torproject.org:9100&var-node=web-fsn-01.torproject.org:9100&var-node=web-fsn-02.torproject.org:9100&from=now-30m&to=now&refresh=1m
2. [x] nagios
3. [x] retire the host in fabric (which shuts it down), in one week
4. [x] remove from LDAP with `ldapvi`
5. [x] power-grep
6. [x] remove from tor-passwords
7. [ ] ~~remove from DNSwl~~ (n/a)
8. [x] remove from docs
9. [ ] ~~remove from racks~~ (n/a)
10. [x] remove from reverse DNS

*Jérôme Charaoui <lavamind@torproject.org> · due 2022-10-19*

## [retire gettor-01](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40915)

*anarcat, updated 2022-11-10*

in #40789 we moved gettor to rdsys-frontend, which makes this VM moot. start the retirement process in one week (e.g. turn off the VM in a week).
1. [x] announcement (this ticket should be enough)
2. [x] nagios
3. [x] retire the host in fabric (in a week)
4. [x] remove from LDAP with `ldapvi`
5. [x] power-grep
6. [x] remove from tor-passwords
7. [x] remove from DNSwl
8. [x] remove from docs
9. [x] remove from racks
10. [x] remove from reverse DNS
11. [x] archive old gitolite repos (except gettor-web, see https://gitlab.torproject.org/tpo/web/team/-/issues/44)

*anarcat · due 2022-11-07*

## [TPA-RFC-41: Consider replacing or fixing Schleuder](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40564)

*Alexander Færøy <ahf@torproject.org>, updated 2023-04-01*

Hello,
The title is a bit of a joke, but the gist of the issue here is that Schleuder seems to make everybody sad and miserable.
Over the past few weeks we ran into the following:
- Transitioning the Community Council list to new members. That caused trouble and we needed help from TPA.
- @nickm wrote a very important email to the Network Team Security list which nobody received. @dgoulet got the log out which gave the error message, but @nickm was not notified about this automatically by the system.
- Issues with handling key updates when the key isn't on the currently-functional-whatever-that-may-mean OpenPGP keyserver.
- It seems like we have /some/ overlap between tor-security@ and network-team-security@, but maybe we should just consolidate these two into a single end-point for such reports? Since I'm not on the former: does the browser team get as many security issues that way as the network team does?
We don't use Schleuder much in the organization right now. Only for "sensitive" topics such as the Community Council, and the different methods to report security issues to us.
Since https://gitlab.com/gitlab-org/gitlab/-/issues/222908 is still open, Gitlab doesn't seem to be the sole solution to this issue unfortunately and wouldn't work in the CC case at all :-/
Can we try to come up with an alternative?
CC'ing @cohosh here too as CC contact.
CC'ing @arma and @sysrqb as they are on tor-security@ too.
update: https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-41-schleuder-retirement drafted
next steps:
- [ ] retire network-team-security@
- [ ] decide what to do with tor-security-encrypted@
- [ ] decide what to do with tor-security@
- [ ] make a ticket to set up a new VM for schleuder and set up the web interface (see also tpo/tpa/team#40981)

*milestone: old service retirement 2023 · anarcat · due 2022-12-01*

## [TPA-RFC-42: 2023 TPA roadmap planning](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40924)

*anarcat, updated 2024-01-18*

establish roadmap for 2023, see
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-42-roadmap-2023
* [x] brainstorm roadmap
* [x] adopt roadmap
* [ ] ~~update budget~~
* [ ] ~~prepare presentation~~
* [ ] ~~present presentation~~

*anarcat · due 2022-12-05*

## [TPA-RFC-44: Email emergency recovery](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40981)

*anarcat, updated 2022-12-15*

To respond to the bouncing email crisis (tpo/web/civicrm#74), I've drafted a new proposal to implement emergency measures but also a long-term plan to host our own email properly:
https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-44-email-emergency-recovery
This ticket provides a space to review the proposal, express dissent, encouragement, or any other comments.
next steps:
2. [x] make tickets for the work to be done in emergency
3. [x] update the status page (https://status.torproject.org/issues/2022-11-30-mail-delivery/)
4. [x] SPF (hard), DKIM and DMARC (soft) records on CiviCRM (#40986)
6. [x] DKIM signatures on eugeni and submission (#40988)
7. [x] DKIM signature on all mail hosts (#40989)
8. [x] Deploy SPF ~~(hard)~~, DKIM, and DMARC records for all of torproject.org (#40990)
5. [x] ~~Deploy a new, sender-rewriting, mail exchanger (#40987)~~ postponed to next year, followup in #41009
9. [x] update the status page (https://status.torproject.org/issues/2022-11-30-mail-delivery/)
10. [x] update the documentation in howto/submission and service/email.md
11. [x] ~~split long-term parts of TPA-RFC-44 *out* of it into a new proposal (yes, editing the standard, omg)~~ followup in #41009

*milestone: improve mail services · anarcat · due 2022-12-07*

## [TPA-RFC-31: outsource email services](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40798)

*anarcat, updated 2022-12-05*

The proposal to host the entirety of our email services in-house, TPA-RFC-15, was officially rejected (see tpo/tpa/team#40363 and wiki-replica@ea20e615 for details). Now we need to figure out which part of email we'll outsource, and to whom.
This ticket is to track the drafting and adoption of that proposal. Once that's done, new tickets should be created for those individual tasks.
quick brainstorm of a checklist:
- [x] brainstorm requirements here
- [x] adopt requirements
- [x] figure out what we'll do with the existing email services (e.g. probably retire submission?)
- [x] personas
- [x] list possible providers
- [x] generic
- [x] transactional
- [x] checkin with isa about what SLA we want
- [ ] officialize quotes, don't forget to mention SLA
- [ ] costs
- [x] staff, setup
- [x] staff, ongoing
- [ ] hosting
- [ ] timeline
- [ ] approval: same as TPA-RFC-15? (TPA, internal, ops, in that order?)
- [ ] deadline: maybe draft this within 2-3 weeks max, adoption in 4-6 weeks?
- [ ] review TPA-RFC-15 to see if we forgot any bits
any other ideas?
draft lives in https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-31-outsource-email

*milestone: improve mail services · anarcat · due 2022-12-08*

## [answer the opsreportcard, AKA the "limoncelli test", 2022 edition](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40944)

*anarcat, updated 2022-12-20*

a few months after starting work inside TPA (in July 2019), i had enough of a footing to think, "okay, I think i can find my way around here, what's next". then I made #30881, which goes like this:
> Tom Limoncelli is the renowned author of [Time management for sysadmins](https://www.tomontime.com/) and [practice of network and system administration](https://the-sysadmin-book.com/), two excellent books I recommend every sysadmin reads attentively.
>
> He made up a [32-question test](https://everythingsysadmin.com/the-test.pdf) (PDF, website version on [opsreportcard.com](http://opsreportcard.com/) or the [previous one-page HTML version](http://web.archive.org/web/20120827040816/http://everythingsysadmin.com:80/the-test.html)) that covers the basic of a well-rounded setup. I believe we will get a good score, but going through the list will make sure we don't miss anything.
I didn't establish what a "good score" was, but we certainly didn't get a "passing [grade](https://en.wikipedia.org/wiki/Grading_in_education)" (60%+??), according to the summary (https://gitlab.torproject.org/tpo/tpa/team/-/issues/30881#note_2541524), produced in October 2019:
> * Section A: **Public Facing Practices: 1.5/3 (50%)** tickets: #31242, #31243, #31244
> * Section B: **Modern Team Practices: 3.5/7 (50%)** tickets: #30880, #29387, missing: post-mortem, total puppetization, design docs, ticket prioritization of stability
> * Section C: **Operational Practices: 0.5/5 (10%)** tickets: none yet, missing: "ops docs" for each service, pager rotation schedule, dev/stage/prod environments, canary process
> * Section D: **Automation Practices: 1.5/3 (50%)** tickets: #31242, missing: reduce email noise
> * Section E: **Fleet Management Processes: 2.5/4 (63%)** tickets: #30273, #31969, #31239, #31957, #29304
> * Section F: **Disaster Preparation Practices: 4/5 (80%)** tickets: none yet, missing: disaster recovery plan
> * Section G: **Security Practices: 0.5/5 (10%)** tickets: #32519, missing: malware scanners, security policy, security audits, global root password rotation
>
> ## Final score: 14/32 (44%)
A lot of good things came out of this process, like the service templates, lots of automation, formal support policies, and so on. We should look at what was fixed in there (i see, for example, lots of the tickets above marked as closed, which is a good thing!) and how we could improve. This involves redoing the questionnaire, but also revisiting whether the process worked in the first place, and how well.
- [x] section A **Public Facing Practices**: 3/3 (2020: 1.5/3), excellent, mostly done
- [x] section B **Modern team practices**: 6/7 (2020: 3.5/7), excellent, just need to formalize post-mortem process and publishing our source code (#29387)
- [x] section C **Operational practices**: 1.5/5 (2020: 0.5/5), slight improvement, but still lots of docs missing, need monitoring for all services, figure out monitoring (#40755), import a dev/stage/prod culture
- [x] section D **Automation practices**: 1.5/3 (2020: 1.5/3), unchanged, still lots of email noise, no configuration management without Puppet access
- [x] section E **Fleet management practices**: 2/4 (2020: 2.5/4), mostly unchanged: installs still not automated (#31239), inventory chaotic (#30273)
- [x] section F **"We acknowledge that hardware breaks" practices**: 4/5 (2020: 4/5), unchanged, still missing disaster recovery plan (#40628)
- [x] section G **Security practices**: 0/5 (2020: 0.5/5), worse: no security policy (tpo/team#41), needs improvement to the password manager (#29677), need to rethink central authentication
## Final score: 18/32 (56%) (2020: 14/32, 44%)
This is an improvement, but there is still a lot of work to do. We're almost at the passing grade!
It seems like the most critical aspects we need to work on (outlined by a "star" in the [PDF version of the test](https://everythingsysadmin.com/the-test.pdf)) are:
* C: Operational practices:
* \*11. Does each service have an OpsDoc? (no plan)
* \*12. Does each service have appropriate monitoring? (improving thanks to Prometheus)
* E: Fleet management practices:
* \*19. Is there a database of all machines? (#30273, no plan)
* F. "We acknowledge that hardware breaks" practices:
* \*26. Are your disaster recovery plans tested periodically? (#40628)
* G. Security practices:
* \*28. Do desktops/laptops/servers run self-updating, silent, anti-malware software? (no plan)
* \*29. Do you have a written security policy? (no, tpo/team#41)

*anarcat · due 2022-12-12*