The Tor Project issues

https://gitlab.torproject.org/groups/tpo/-/issues 2024-03-07T14:23:37Z

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41526 Deploy onionperf files parser on metricsdb-01 2024-03-07T14:23:37Z Hiro

Deploy onionperf files parser on metricsdb-01

We need to deploy https://gitlab.torproject.org/tpo/network-health/metrics/tor_fusion/ on metricsdb-01. It will run once a day, download onionperf files from collector, and parse them. The run should happen around 1am UTC, because midnight is when collector fetches the archives from the various onionperf clients.

It's a small Rust app, and I was thinking of creating a dedicated group and user, as for the metrics-api. But maybe that's overkill and I should just put it in the parser space?

Doing Metrics Stale Hiro Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41515 meronense OOM 2024-02-05T19:52:19Z anarcat

meronense OOM

today, metrics.tpo went down because the OOM killer was invoked. not sure what happened. i restarted both metrics-r and metrics-web.service, pending further investigation.

this happened before, of course. we bumped the memory on that box to 20GB in #41335 and had issues after the bullseye upgrade as well (#40814); both incidents should be investigated. those are just the incidents that pop up in the gitlab "Similar issues"; further investigation in other issues is probably warranted.

possibly related to the bookworm upgrade, of course (#41252).

Doing Metrics anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41424 Finding a reasonable backup strategy for metricsdb-01 2023-12-19T20:20:19Z Hiro

Finding a reasonable backup strategy for metricsdb-01

Today @anarcat and I briefly discussed a change of retention policy for the metricsdb-01 postgresql database. If I am not mistaken, the policy changed from 30 days of retention to 7 days. I think that is correct, but we could find a policy that is even easier to maintain over time.

I think we can assume that whatever we have on metricsdb-01 could be recreated from the archives that we store on collector. The only data that escape that rule are the tags and notes that we intend to attach to relays with tagtor. For those we could set up a timer that dumps the few tables (I think 4 in total) that store that data. The dumps wouldn't be big or slow to create, so that could be a solution.

I am not sure whether there is anything else we could consider, but I am open to suggestions.

/cc @gk @micah

Backup Doing Metrics PostgreSQL anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41380 onionoo-backend-01 running filling up swap 2024-01-10T16:47:43Z anarcat
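As an illustration of the table-dump idea in #41424, here is a minimal sketch of how a timer-driven job could assemble a `pg_dump` invocation restricted to a handful of tables. The table names, database name, and output directory are hypothetical placeholders (the issue only says there are about four tables); only the `pg_dump` options `--format`, `--file`, and `--table` are real.

```python
from datetime import date

# Hypothetical table names: the issue only mentions "the few tables
# (I think 4 in total)" that hold tagtor tags and notes.
TAGTOR_TABLES = ["tags", "notes", "relay_tags", "relay_notes"]


def build_dump_command(database, tables, out_dir, day):
    """Return the argv list for a pg_dump run limited to the given tables."""
    out_file = f"{out_dir}/tagtor-{day.isoformat()}.dump"
    cmd = ["pg_dump", "--format=custom", f"--file={out_file}"]
    for table in tables:
        # One --table option per table; pg_dump accepts the option repeatedly.
        cmd.append(f"--table={table}")
    cmd.append(database)
    return cmd


if __name__ == "__main__":
    # "metrics" and "/srv/backups/tagtor" are assumptions for illustration.
    print(" ".join(build_dump_command("metrics", TAGTOR_TABLES,
                                      "/srv/backups/tagtor", date.today())))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from a job started by the timer; the sketch only builds the command so that it can be inspected without touching a database.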

onionoo-backend-01 running filling up swap

![image](/uploads/d6877b98d7d21a676d788bb27f144e68/image.png)

https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1&var-class=All&var-node=onionoo-backend-01.torproject.org&var-node=onionoo-backend-02.torproject.org&from=now-1y&to=now

something on onionoo-backend-01 is eating up all swap. it seems to have stabilized now, but it tripped the critical warnings in nagios.

@hiro any idea what's going on here?

Doing Metrics Stale lifecycle Hiro Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41343 Onionoo backends out of disk space 2023-11-20T21:57:17Z Hiro

Onionoo backends out of disk space

It seems the onionoo backends have run out of disk space on /srv. Can we increase the space? I think adding at least 10 GB to each host (ideally 20) would be enough.

Doing Metrics lifecycle anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41307 Deploy network status api on metricsdb-01 2023-09-06T20:18:23Z Hiro

Deploy network status api on metricsdb-01

We need to deploy the network status api onto metricsdb-01. This is a web-based service that reads data out of the postgresql database and Victoria Metrics (https://gitlab.torproject.org/tpo/network-health/metrics/networkstatusapi/).

I am going to put it behind apache and protect it with http auth, as this is not a public service yet.

/cc @gk

Doing Metrics Sponsor 112 Hiro Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41258 materculae out of disk space 2023-09-21T01:51:41Z Kez

materculae out of disk space

previous ticket: #40826

it's been a year, and nagios is complaining about materculae's /srv partition

```
# df -h /srv
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_materculae-srv  147G  135G  4.3G  97% /srv
```

in the previous ticket (#40826) @anarcat changed the warning threshold, which is why this warning popped up now.

according to grafana, usage has only grown by about 15G in the past year, and the growth is linear. we could add another 20G and revisit in a year, or throw 40G or 60G at it to push things further down the road.

![image](/uploads/e8ddf8b69703273f73d891586f7fc137/image.png)

Doing Metrics anarcat anarcat 2023-09-22

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41167 connect to postgresql db on new metrics DB via tls 2023-06-27T15:13:20Z Hiro
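To make the trade-off in #41258 concrete, the following sketch works out roughly how long each proposed increase would last. It assumes the figures quoted in that issue (4.3G available, about 15G of linear growth per year) and ignores the warning threshold, so the results are upper bounds on when the partition would be completely full, not on when nagios would alert again.

```python
def years_until_full(avail_gb, added_gb, growth_gb_per_year):
    """Years until the partition is full, assuming linear growth."""
    return (avail_gb + added_gb) / growth_gb_per_year


if __name__ == "__main__":
    AVAIL_GB = 4.3          # "Avail" column of the df output in the issue
    GROWTH_GB_PER_YEAR = 15  # growth estimate quoted from grafana in the issue
    for added in (20, 40, 60):
        years = years_until_full(AVAIL_GB, added, GROWTH_GB_PER_YEAR)
        print(f"+{added}G: about {years:.1f} years until full")
```

Under these assumptions, 20G buys a bit more than a year and a half, which matches the "revisit in a year" suggestion once the warning threshold is taken into account.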

connect to postgresql db on new metrics DB via tls

Would it be possible to get a read-only user to connect to the postgresql db on metrics-psqlts-01 via tls? This would be used to access it via grafana, but would also allow metrics developers to query the data.

The people who would likely access this are: @hiro @gk @mattrighetti

Doing GSoC Metrics Jérôme Charaoui lavamind@torproject.org Jérôme Charaoui lavamind@torproject.org

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41161 rebuild corsicum into collector-02.torproject.org 2023-05-23T16:24:44Z anarcat

rebuild corsicum into collector-02.torproject.org

we need to migrate out of the old Sunet cluster into the new Safespring cluster: corsicum needs to be retired and rebuilt into collector-02. see also #40684.

Doing Metrics anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41130 Deploy new metrics database stack 2023-07-07T08:00:20Z Hiro

Deploy new metrics database stack

I have been testing our victoriametrics + postgresql setup on metrics-psqlts-01 for a while, and now that we are close to having a prod deployment of this pipeline, I'd like to have things properly in puppet.

I have a branch called metrics-deploy with a tentative setup that I'd like your opinion on. This branch also supports deploying a python web app to access and query both the postgresql db and victoria metrics.

Victoria metrics runs with docker, but without compose. I am not sure whether you'd prefer a compose setup, since this is a single service. An alternative would be to run the full stack with compose. Would postgresql backups work in that case?

I am going to be out next week, so maybe we could discuss this in Costa Rica face to face?

Doing Metrics Stale lifecycle Hiro Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41114 Disk space increase on metrics-psqlts-01 2023-04-04T00:02:14Z Hiro

Disk space increase on metrics-psqlts-01

Can we add 20 more gigabytes on metrics-psqlts-01?

Doing Metrics lifecycle

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41026 data update service and timer on meronense 2023-01-10T16:41:24Z Hiro

data update service and timer on meronense

I need some help figuring out why the update service on meronense doesn't wait for the previous run to finish before starting a new one. The timer and service are in puppet, and they only start this script: https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/website/run-web.sh

/cc @gk

Doing Metrics anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40910 CRITICAL disk usage on metrics-psqlts-01 2022-10-04T14:03:07Z Kez
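Regarding the overlapping runs described in #41026: the cause is not stated in the issue, so the following is not a diagnosis. It is only a sketch of one common guard against concurrent runs, an exclusive non-blocking `flock` on a lock file, taken before the real work starts. The lock path and the idea of wrapping the update job this way are assumptions for illustration.

```python
import fcntl
import os


def try_lock(path):
    """Try to take an exclusive lock on path without blocking.

    Returns the open file descriptor on success (keep it open for as long
    as the job runs), or None if another run already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


if __name__ == "__main__":
    # Hypothetical lock path; a real deployment would pick its own location.
    lock_fd = try_lock("/tmp/metrics-update.lock")
    if lock_fd is None:
        print("previous run still in progress, skipping")
    else:
        print("lock acquired, the update job would run here")
        os.close(lock_fd)  # closing the descriptor releases the lock
```

The lock is released automatically if the process dies, which avoids the stale-lock problem of plain "pid file exists" checks.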

CRITICAL disk usage on metrics-psqlts-01

Icinga is reporting a critical issue for disk usage - all since 2022-09-29 22:57:01

```
DISK CRITICAL - free space: / 561 MB (5% inode=87%): /dev 3962 MB (100% inode=99%): /dev/shm 3978 MB (99% inode=99%): /run 795 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /tmp 3978 MB (100% inode=99%): /run/credentials 795 MB (99% inode=99%): /var/tmp 561 MB (5% inode=87%):
```

Doing Metrics anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40773 install newer obfs4proxy on polyanthum 2022-06-02T14:33:08Z Roger Dingledine

install newer obfs4proxy on polyanthum

This is a similar ticket to https://gitlab.torproject.org/tpo/tpa/team/-/issues/40758

We currently have obfs4proxy 0.0.8 installed on bridges.tpo, and we use that obfs4proxy to test obfs4 reachability of all the bridges. But because of https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/40804, we are testing with an old, buggy, and only partly compatible obfs4! The obfs4 version in Tor Browser is 0.0.12, which means Tor clients are getting the new, better handshake.

We should move bridgestrap so it tests obfs4 bridges using the same handshake that Tor Browser users will attempt. And the way we do that is by upgrading the obfs4proxy package.

https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/obfs4/-/issues/33736#note_2786764 says that as of some months ago, obfs4proxy 0.0.13 is in bullseye-backports. Does that mean we just add a line to the puppet stanza and we're there? :)

[Cc'ing @meskio so he knows about the ticket]

Doing Metrics lifecycle anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40770 postgresql DB with timescale plugin installed 2022-06-13T14:10:48Z Hiro

postgresql DB with timescale plugin installed

I'd like to start testing storing all metrics data in a DB, as described in: https://gitlab.torproject.org/tpo/network-health/team/-/wikis/metrics/collector/pipeline

For the time being I just need a postgresql instance with the timescaledb plugin installed that I can send data to over TLS. In this first step, my plan is just to start storing data in tables, and eventually to have this as a task in collector.

Doing Metrics PostgreSQL anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40764 onionoo down, serving malformed JSON 2022-07-26T15:41:40Z Jérôme Charaoui lavamind@torproject.org

onionoo down, serving malformed JSON

About an hour after the bullseye upgrade of `onionoo-backend-01`, onionoo started failing Nagios checks:

```
# /usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080
CRITICAL: Error parsing JSON format: Expecting value: line 6 column 20 (char 98)
{"version":"%s",
"build_revision":"%s",
```

Indeed, it seems to be serving malformed JSON:

```
# curl 127.0.0.1:6081/summary?limit=0
{"version":"%s",
"build_revision":"%s",
"relays_published":"%s",
"relays":[
],
"relays_truncated":%d,
"bridges_published":"2022-05-18 14:44:41",
"bridges":[
],
"bridges_truncated":%d}
```

Whether the upgrade is what caused this incident is unclear at this point, because `onionoo-backend-01` was confirmed working immediately after the upgrade, and because the problem started at approximately the same time for it and for `onionoo-backend-02`, which has *not* been upgraded.

Doing Metrics Jérôme Charaoui lavamind@torproject.org Jérôme Charaoui lavamind@torproject.org

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40750 materculae hits the OOM killer since bullseye upgrade 2022-06-27T15:00:11Z anarcat
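The parse error reported in #40764 can be reproduced from the response body quoted there: the unquoted `%d` on line 6 is the first token that is not valid JSON, while the `%s` placeholders sit inside quoted strings and parse fine. The sketch below assumes the line breaks of the payload are as reconstructed from the error position; the helper name `check_summary` is hypothetical and is not the Nagios plugin's code.

```python
import json
import re

# Response body quoted in the issue, one JSON member per line (assumed layout).
BODY = "\n".join([
    '{"version":"%s",',
    '"build_revision":"%s",',
    '"relays_published":"%s",',
    '"relays":[',
    '],',
    '"relays_truncated":%d,',
    '"bridges_published":"2022-05-18 14:44:41",',
    '"bridges":[',
    '],',
    '"bridges_truncated":%d}',
])

# printf-style placeholders that were never substituted
PLACEHOLDER = re.compile(r"%[sd]")


def check_summary(body):
    """Return (ok, error_position, leftover_placeholders) for a response body."""
    try:
        json.loads(body)
    except json.JSONDecodeError as err:
        where = f"line {err.lineno} column {err.colno} (char {err.pos})"
        return False, where, PLACEHOLDER.findall(body)
    return True, "", []


if __name__ == "__main__":
    ok, where, leftovers = check_summary(BODY)
    print(ok, where, leftovers)
```

Counting leftover placeholders alongside the parse error makes it clearer that the document is an unfilled template rather than truncated or corrupted output.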

materculae hits the OOM killer since bullseye upgrade

last night (from my perspective), PostgreSQL crashed on materculae. in systemd's logs, we see:

```
May 05 05:25:33 materculae systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
```

then a bunch of errors happened in the postgresql log:

```
2022-05-05 05:25:33 GMT LOG: server process (PID 16279) was terminated by signal 9: Killed
2022-05-05 05:25:33 GMT DETAIL: Failed process was running: select * from search_by_date_address24($1, $2) as result
2022-05-05 05:25:33 GMT LOG: terminating any other active server processes
2022-05-05 05:25:33 GMT WARNING: terminating connection because of crash of another server process
2022-05-05 05:25:33 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-05-05 05:25:33 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
```

it's unclear why this is happening, but it's clearly a regression from the upgrade. here's a memory graph from the last 3 days:

![image](/uploads/839a371de3d308807d2f3394d39977c7/image.png)

https://grafana.torproject.org/d/xfpJB9FGz/1-node-exporter-for-prometheus-dashboard-en-v20201010?orgId=1&var-origin_prometheus=&var-job=node&var-hostname=All&var-node=materculae.torproject.org:9100&var-device=All&var-interval=2m&var-maxmount=%2Fhome&var-show_hostname=materculae&var-total=93&viewPanel=156&from=now-3d&to=now&refresh=1m

i *think* the upgrade completed at about 15:26 UTC yesterday, at least according to the graph. (this comment is later, but that's probably just me reporting after the fact: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40692#note_2799945). then we can see the server restarting (the blank), and slowly reclaiming memory. then there's this unusual jump at 22:18 and things go a little out of whack for a few hours, but seem to stabilise at a somewhat reasonable pattern at 11:00 next day. that's about 1GB more memory usage than the previous normal though, so that's already a little worrisome.

but then, at 22:46 UTC, memory usage just starts to grow linearly, eventually hitting the above OOM at around 5:30 or so.

it seems we don't have prometheus instrumentation for postgresql on that host at all right now, so i guess that would be one next step.

/cc @hiro

Debian 11 bullseye upgrade Doing Metrics PostgreSQL anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40535 colchicifolium disk full 2023-06-07T15:45:23Z anarcat

colchicifolium disk full

colchicifolium's disk usage is rising steadily, this is the last year:

![image](/uploads/e781feb8a476adcb640ab6a275d25e6b/image.png)

we can see when we added 50G then 200G more. @hiro is thinking about redesigning this service, but in the meantime, let's give this poor server a break.

Doing Metrics anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41575 meronense disk full, possibly due to materculae outage 2024-04-15T20:19:35Z anarcat

meronense disk full, possibly due to materculae outage

/cc @hiro

Emergency Metrics Needs Review anarcat anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41452 estimate storage requirements for metricsdb and backups 2024-03-24T14:01:45Z anarcat

estimate storage requirements for metricsdb and backups

in #41424, we have agreed to continue with the monolithic postgresql design for the time being, more or less -- collector will move to object storage and there's a possibility of introducing other optimizations (https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416#note_2978071) -- but for now that's the plan.

We'll need to scale up storage for metricsdb. Right now, the storage usage is as follows:

| machine        | used    | size     |
|----------------|---------|----------|
| metricsdb-01   | 1.07TiB | 7.88TiB  |
| bungei pg      | 1.36TiB | 2.96TiB  |
| **total**      | 2.43TiB | 10.84TiB |

Source:

https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=bungei.torproject.org&from=now-90d&to=now&refresh=5s&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fpg

https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=metricsdb-01.torproject.org&from=now-1y&to=now&refresh=5s

The specification is that we need weekly backups of the postgresql database (*not* WAL logs) except for a subset of tables that need hourly or better backups (ideally WAL). The estimate is that the database size *at launch* will be around 5TiB, with 500GiB of growth per year.

This could involve building a new storage server to handle those backups (#41364), and we feel it would be a good idea to start working with [Barman](https://pgbarman.org/) for this system.

The output of this issue is an estimate for hardware needs, a rough architectural draft, and subsequent tickets to make necessary changes to reflect said architecture.

/cc @lavamind

Metrics Needs Review PostgreSQL lifecycle anarcat anarcat
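As a starting point for the estimate requested in #41452, here is a small sketch that projects the database size from the two figures given there (about 5TiB at launch, 500GiB of growth per year) and derives the space needed for full weekly backups. The number of weekly copies to retain is not specified in the issue, so it is a parameter here, and the sketch assumes uncompressed, full-size backups and ignores the hourly/WAL backups of the table subset.

```python
GIB_PER_TIB = 1024


def projected_db_tib(years, launch_tib=5.0, growth_gib_per_year=500.0):
    """Projected database size in TiB after the given number of years."""
    return launch_tib + (growth_gib_per_year / GIB_PER_TIB) * years


def weekly_backup_tib(years, copies_kept):
    """Space for full weekly backups, assuming each copy is database-sized.

    copies_kept is an assumption: the issue does not state a retention count.
    """
    return projected_db_tib(years) * copies_kept


if __name__ == "__main__":
    for year in range(0, 6):
        db = projected_db_tib(year)
        backups = weekly_backup_tib(year, copies_kept=2)
        print(f"year {year}: db {db:.2f} TiB, 2 weekly copies {backups:.2f} TiB")
```

Changing `copies_kept` or adding a compression ratio would be the first adjustments to make once the retention policy and backup tooling are decided.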