# TPA issues

Exported from <https://gitlab.torproject.org/groups/tpo/tpa/-/issues>, retrieved 2024-03-27.

## Figure out access rights to new dists.torproject.org

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/13134> — opened by Andrew Lewman, updated 2024-03-27

Figure out access rights to new dists.torproject.org so people can upload their precious binaries of love.

## convert existing varnish caches into nginx

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/32462> — opened by anarcat, updated 2024-02-01

in legacy/trac#32239 we set up nginx as a caching frontend for the blog. but we also use varnish elsewhere in our infrastructure, specifically on the onionoo services. those are currently being rebuilt (legacy/trac#31659), so maybe it would be better to wait for that to complete before finishing the transition.
this will require some refactoring of the "cache" role as it currently hardcodes the blog as a service. one idea i had was to have a "name => backends" hash, but maybe that would be too blunt of an instrument and we'd just need to split the roles.
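One hedged sketch of that "name => backends" hash, as Hiera data (the key names, hostnames and structure here are hypothetical, not the actual Puppet code):

```yaml
# Hypothetical Hiera data for a refactored "cache" role: one entry per
# cached service, mapping a name to its backend and public vhost.
roles::cache::backends:
  blog:
    backend: 'blog-backend.example.net'
    vhost: 'blog.torproject.org'
  onionoo:
    backend: 'onionoo-backend.example.net'
    vhost: 'onionoo.torproject.org'
```

Whether a single hash like this is enough, or the role needs splitting, would depend on how much the nginx vhost configuration diverges between services.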
in any case, there's some puppet refactoring involved.

## replace "Tor VM hosts" spreadsheet with Grafana dashboard

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29816> — opened by anarcat, updated 2023-08-28

Our KVM allocation strategy is currently managed through a Google spreadsheet. This is suboptimal for a few reasons:
1. it is hard to keep up to date - for example, moly is not listed in there even though it's in LDAP as a "KVM host"
2. it's not real-time data - for example, even if a host is allocated one vCPU, it might be totally idle most of the time, doing mostly network or disk I/O, while another might hit the CPU hard. Actual load is what matters
3. ~~it's hosted by Google - that has a few problems, the most important of which is that some TPA do not actually *want* to use Google services and might be reluctant to update it, worsening problem 1~~ that part is fixed: we have moved it to Nextcloud
I propose we shift this to a Grafana dashboard. I already have a prototype in the form of the [Node exporter server metrics Grafana Dashboard](https://grafana.com/dashboards/405), which shows basic stats for multiple hosts in parallel. I set the dashboard's default in Grafana to show the 6 KVM hosts:
<https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=now-12h&to=now&var-node=kvm4.torproject.org:9100&var-node=kvm5.torproject.org:9100&var-node=macrum.torproject.org:9100&var-node=moly.torproject.org:9100&var-node=textile.torproject.org:9100&var-node=unifolium.torproject.org:9100>
That looks like this:
![https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png](https://paste.anarc.at/snaps/snap-2019.04.17-16.48.43.png)
... but it's not ideal:
* it's showing stats that are irrelevant for this purpose, like context switches or detailed disk or memory stats
* it's missing critical information, like the number of KVM guests hosted on the machine, how many CPUs and how much disk space are allocated, and so on
This is the information we should be showing:
* disk capacity vs allocation
* disk utilization
* CPU count vs allocation
* actual CPU utilization
* load?
* memory capacity vs allocation
* actual memory usage
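For the "capacity vs allocation" panels, the utilization side is already exported by node_exporter; the allocation side would need a custom metric. As a rough sketch of the queries such a dashboard could use (`kvm_allocated_vcpus` is a hypothetical custom metric, not something node_exporter exports):

```promql
# actual CPU utilization per host (standard node_exporter metric)
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

# CPU count vs allocation; kvm_allocated_vcpus would come from a custom exporter
count by (instance) (node_cpu_seconds_total{mode="idle"})
sum by (instance) (kvm_allocated_vcpus)

# actual memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
```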
Some of that information currently lives *only* in the spreadsheet. For example, disk allocations are only available there, as the KVM guests run on QCOW (QEMU copy-on-write) disk images that only take space when actually used by the guest. This has the advantage of allowing us to over-provision, but means we must keep that metadata somewhere else.
So for now it's in the spreadsheet, but we could find a way to move it somewhere Prometheus can scrape. One trick that Prometheus has is that it can expose metrics stored as text files in `/var/lib/prometheus/node-exporter/*.prom`. This is how the smartctl and APT metrics get shipped for example: a cron job (well, a systemd timer) regularly writes that file, atomically. So one option could be to move this information to (say) LDAP or Puppet/Hiera and write that information into that file using a cronjob (LDAP) or Puppet (Hiera).
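As a sketch of that textfile-collector approach (the metric name and data source are hypothetical; the important part is the write-then-rename, which keeps the file atomic so node_exporter never scrapes a half-written file):

```python
import os
import tempfile

def write_textfile_metrics(allocations,
                           path="/var/lib/prometheus/node-exporter/kvm_allocations.prom"):
    """Atomically write per-guest allocation metrics in Prometheus text format.

    `allocations` maps guest name -> allocated vCPUs; in practice the
    cron job would read this from LDAP or Hiera.
    """
    lines = ["# TYPE kvm_allocated_vcpus gauge"]
    for guest, vcpus in sorted(allocations.items()):
        lines.append('kvm_allocated_vcpus{guest="%s"} %d' % (guest, vcpus))
    body = "\n".join(lines) + "\n"
    # write to a temp file in the same directory, then rename: rename(2)
    # is atomic, so the exporter always sees a complete file
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(body)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return body
```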
Then we'd build a custom Grafana dashboard and get rid of the other spreadsheet.
A stop-gap measure might be to simplify the spreadsheet and move it to a plain-text markdown file. We would lose the automatic calculations the spreadsheet provides, in exchange for easier updating and transparency.

Assigned: anarcat

## Monitor anti-censorship www services with prometheus

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/31159> — opened by Philipp Winter <phw@torproject.org>, updated 2023-07-31

In the anti-censorship team we currently monitor [several services](https://trac.torproject.org/projects/tor/wiki/org/teams/AntiCensorshipTeam/InfrastructureMonitoring) with sysmon. We recently discovered that sysmon doesn't seem to follow HTTP 301 redirects. This means that if a web service dies but the 301 redirect still works (e.g., BridgeDB is dead but its apache reverse proxy still works), sysmon won't notice.
Now that prometheus is running, we should fill this monitoring gap by testing the following web sites:
* https://bridges.torproject.org
* https://snowflake.torproject.org
* https://gettor.torproject.org
Our test should ensure that these sites serve the content we expect, e.g., make sure that bridges.tp.o contains the string "BridgeDB" in its HTML. Testing the HTTP status code does not suffice: if BridgeDB is down, the reverse proxy may still respond.
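With Prometheus's blackbox exporter, such a content check could look roughly like this (the module name is made up; `fail_if_body_not_matches_regexp` is the relevant knob, though option names may vary across exporter versions):

```yaml
# blackbox exporter module: the probe succeeds only if the page
# returns 200 AND the body contains the string "BridgeDB"
modules:
  http_2xx_bridgedb:
    prober: http
    timeout: 10s
    http:
      valid_status_codes: [200]
      fail_if_body_not_matches_regexp:
        - "BridgeDB"
```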
I wonder if prometheus could also help us with legacy/trac#12802 by sending an email to bridges@tp.o and making sure that it responds with at least one bridge?
Checklist:
1. [ ] monitor services in Nagios: BridgeDB, Snowflake, and GetTor
2. [ ] deploy Prometheus's "blackbox exporter" for default bridges, which are external services
3. [ ] delegate to (and train) the anti-censorship team the blackbox exporter configuration
4. [ ] experiment with Prometheus's "alertmanager", which can send notifications if a monitoring target goes offline
5. [x] grant the anti-censorship team access to Prometheus's grafana dashboard.

Assigned: Hiro

## answer the opsreportcard questionnaire, AKA the "limoncelli test"

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/30881> — opened by anarcat, updated 2022-12-20

Tom Limoncelli is the renowned author of [Time management for sysadmins](https://www.tomontime.com/) and [practice of network and system administration](https://the-sysadmin-book.com/), two excellent books I recommend every sysadmin read attentively.
He made up a [32-question test](https://everythingsysadmin.com/the-test.pdf) (PDF; website version on [opsreportcard.com](http://opsreportcard.com/), or the [previous one-page HTML version](http://web.archive.org/web/20120827040816/http://everythingsysadmin.com:80/the-test.html)) that covers the basics of a well-rounded setup. I believe we will get a good score, but going through the list will make sure we don't miss anything.

Assigned: anarcat

## TPA-RFC-2: define how users get support, what's an emergency and what is supported

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/31243> — opened by anarcat, updated 2022-12-20

Extract from parent ticket (#30881):
# 2. Are "the 3 empowering policies" defined and published?
http://opsreportcard.com/section/2
Specifically, this is three questions:
## How do users get help?
Right now, this is unofficially "open a ticket in Trac", "ping us over IRC for small stuff", or "write us an email". This could be made more official somewhere.
## What is an emergency?
I am not sure this is formally defined.
## What is supported?
We have the distinction between systems and service admins. We did [talk in Stockholm](https://trac.torproject.org/projects/tor/wiki/org/meetings/2019Stockholm/Notes/SysadminTeamRoadmapping) about clarifying that item, so this is worth expanding further.

Assigned: anarcat

## Create GPG key for network-team-security@ Schleuder list

<https://gitlab.torproject.org/tpo/tpa/schleuder/-/issues/21486> — opened by David Goulet <dgoulet@torproject.org>, updated 2022-11-30

We need a GPG key for that list.

Assigned: Hiro

## document backup/restore procedures

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/30880> — opened by anarcat, updated 2022-10-25

Backup system design and restore procedures are currently not well documented in our wiki. Try a few restores and document the heck out of this. The [ops report card](http://opsreportcard.com/section/11) recommends services be documented with a template like this:
1. Overview: Overview of the service: what is it, why do we have it, who are the primary contacts, how to report bugs, links to design docs and other relevant information.
2. Build: How to build the software that makes the service. Where to download it from, where the source code repository is, steps for building and making a package or other distribution mechanisms. If it is software that you modify in any way (open source project you contribute to or a local project) include instructions for how a new developer gets started. Ideally the end result is a package that can be copied to other machines for installation.
3. Deploy: How to deploy the software. How to build a server from scratch: RAM/disk requirements, OS version and configuration, what packages to install, and so on. If this is automated with a configuration management tool like cfengine/puppet/chef (and it should be), then say so.
4. Common Tasks: Step-by-step instructions for common things like provisioning (add/change/delete), common problems and their solutions, and so on.
5. Pager Playbook: A list of every alert your monitoring system may generate for this service and a step-by-step "what to do when..." for each of them.
6. DR: Disaster Recovery Plans and procedure. If a service machine died how would you fail-over to the hot/cold spare?
7. SLA: Service Level Agreement. The (social or real) contract you make with your customers. Typically things like Uptime Goal (how many 9s), RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
While we don't use that template anywhere yet (and it somewhat conflicts with the [documentation best practices](https://www.divio.com/blog/documentation/)), we can probably find a middle ground of some sort...

Assigned: anarcat

## [RT-admin] Check if spam filter script is running

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/34063> — opened by Gus, updated 2022-08-25

According to the RT service documentation[1], there are some maintenance actions happening, like spam training in RT. Since we're receiving a lot of spam, we should verify that the spam filter is actually running.
```
Spam training
Every mail sent to RT is also sent to the rtmailarchive account. This is required to be able to train SpamAssassin as it can only learn from unaltered email messages.
A three-step cronjob is run daily.
Step 1: Every mail in Maildir/.help* is checked against RT. For each message, we look up a matching ticket using the Message-Id header. If the ticket is in a help* queue and has status resolved, we move it to the ham training folder. If the ticket is in the spam queue and has status resolved, we move it to the spam training folder. If the file is more than 100 days old, we delete it.
Step 2: SpamAssassin is fed with the content of the ham and spam training folders. After the process, the message is moved to the corresponding learned folder.
Step 3: Messages in the learned folders are deleted.
```
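A testable core of "step 1" above might look like the following sketch (function and label names are hypothetical; the real cronjob additionally has to look tickets up in RT by Message-Id and actually move the maildir files):

```python
from datetime import datetime, timedelta

# hypothetical outcomes for the spam-training cronjob's step 1
HAM, SPAM, DELETE, KEEP = "ham", "spam", "delete", "keep"

def classify_message(queue, status, received, now=None, max_age_days=100):
    """Decide what step 1 does with one mail file.

    `queue` and `status` come from the RT ticket matched via the
    Message-Id header; `received` is when the mail file was written.
    """
    now = now or datetime.now()
    if queue.startswith("help") and status == "resolved":
        return HAM      # resolved help* ticket: train as ham
    if queue == "spam" and status == "resolved":
        return SPAM     # resolved spam ticket: train as spam
    if now - received > timedelta(days=max_age_days):
        return DELETE   # stale and unclassified: drop it
    return KEEP         # leave for a later run
```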
[1] <https://trac.torproject.org/projects/tor/wiki/org/operations/services/rt.torproject.org#Spamtraining>

Assigned: anarcat

## create logo for Tor Project | Survey

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/24846> — opened by Isabela Fernandes, updated 2022-08-25

We should have a logo following the styleguide.

Assigned: Antonela <antonela@torproject.org>

## Put pyobfsproxy/pyptlib releases in dist

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/8548> — opened by George Kadianakis, updated 2022-08-25

I should upload the pyobfs/pyptlib releases somewhere so that they can be moved to dist.

## decomission storm / bracteata on February 11, 2020

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/32390> — opened by Gaba <gaba@torproject.org>, updated 2022-06-21

Hi!
We are migrating into nc.torproject.net. We are planning to shut down storm in February. This is the ticket for us not to forget :)

Assigned: anarcat

## Can Prometheus help with multiple checks turning into one single alarm?

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29410> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-06-20

This question came up when discussing doing more checks of services over IPv6.

## replace munin with prometheus and grafana

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29681> — opened by anarcat, updated 2022-06-20

munin died in a fire and people want to try out prometheus, let's do that.
this will also involve setting up a Grafana instance, as the built-in graphs in Prometheus are too limited and/or hard to configure.

Assigned: anarcat

## investigate kreb's advice on DNS hijacking

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/33062> — opened by anarcat, updated 2022-06-03

After reviewing [this article about recent DNS hijacking incidents](https://krebsonsecurity.com/2019/02/a-deep-dive-on-the-recent-widespread-dns-hijacking-attacks/), I think it might be worth reviewing the recommendations given in the article, which are basically:
1. [x] use DNSSEC
2. [ ] Use registration features like Registry Lock that can help protect domain name records from being changed
3. [ ] Use access control lists for applications, Internet traffic and monitoring
4. [ ] Use 2-factor authentication, and require it to be used by all relevant users and subcontractors
5. [x] In cases where passwords are used, pick unique passwords and consider password managers
6. [ ] Review accounts with registrars and other providers
7. [ ] Monitor certificates by monitoring, for example, Certificate Transparency Logs (#40677)
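Item 7 could be prototyped against crt.sh, which exposes CT log data as JSON (e.g. `https://crt.sh/?q=%25.torproject.org&output=json`); the field names below match that API, but the helper itself is only a sketch of what a check could do:

```python
import json

def new_certificates(crtsh_json, known_ids):
    """Return crt.sh entries we haven't alerted on yet.

    `crtsh_json` is the raw JSON body from crt.sh; `known_ids` is the
    set of crt.sh entry ids already seen (persisted between runs).
    """
    entries = json.loads(crtsh_json)
    fresh = [e for e in entries if e["id"] not in known_ids]
    # a Nagios-style check would go WARN/CRIT when `fresh` is non-empty,
    # flagging certificates issued for our domains that we didn't expect
    return sorted(fresh, key=lambda e: e["not_before"])
```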
Some of those are impractical: for example 2FA will not work for us if we have one shared account with a provider.
Others have already been done: we have a good DNSSEC deployment and manage passwords properly.
Mainly, I'm curious about investigating Registry Lock and CT log monitoring, the latter of which could be added as a Nagios check, maybe.

## Implement and deploy script for spamming people about account, group and host expiration

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29386> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-05-03. Assigned: weasel (Peter Palfrader)

## Adapt LDAP scripts to honour expiration dates

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29385> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-05-03. Assigned: weasel (Peter Palfrader)

## Add to LDAP, for each group, an expiration date

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29384> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-05-03. Assigned: weasel (Peter Palfrader)

## Add to LDAP, for each user account, an expiration date

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29383> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-05-03. Assigned: weasel (Peter Palfrader)

## Add to LDAP, for each host, expiration date and list of "stakeholders"

<https://gitlab.torproject.org/tpo/tpa/team/-/issues/29382> — opened by Linus Nordberg <linus@torproject.org>, updated 2022-05-03