The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
2024-03-28T13:25:06Z

https://gitlab.torproject.org/tpo/onion-services/onionspray-log-parser/-/issues/11
Slowness on onionspray-get-logs-from-s3fs
2024-03-28T13:25:06Z
Silvio Rhatto

# Tasks
* [ ] Investigate why [onionspray-get-logs-from-s3fs][] is slow, and how that can be fixed.
* [ ] If it can't be fixed easily, recommend that users try [onionspray-get-logs-from-s3][] first.
[onionspray-get-logs-from-s3fs]: https://gitlab.torproject.org/tpo/onion-services/onionspray-log-parser/-/blob/main/onionspray-get-logs-from-s3fs
[onionspray-get-logs-from-s3]: https://gitlab.torproject.org/tpo/onion-services/onionspray-log-parser/-/blob/main/onionspray-get-logs-from-s3
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)

Silvio Rhatto

https://gitlab.torproject.org/tpo/onion-services/onionspray-log-parser/-/issues/10
Output template
2024-03-28T14:23:08Z
Silvio Rhatto

# Tasks
* [ ] Support for output with custom templating.
* [ ] Support for Markdown table output.
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)

Silvio Rhatto
2024-04-01

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41526
Deploy onionperf files parser on metricsdb-01
2024-03-07T14:23:37Z
Hiro

We need to deploy https://gitlab.torproject.org/tpo/network-health/metrics/tor_fusion/ on metricsdb-01.
Basically, it will run once a day, download the onionperf files from collector, and parse them. The run happens around 1am UTC because collector fetches the archives from the various onionperf clients at midnight.
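
The once-a-day schedule described above can be sketched as a small helper that computes the next 01:00 UTC run. This is only an illustration: `next_run` and `RUN_AT` are hypothetical names, not part of tor_fusion, and the deployed job would more likely rely on a systemd timer or cron entry.

```python
from datetime import datetime, time, timedelta, timezone

# 01:00 UTC, one hour after collector's midnight fetch (time taken from the issue).
RUN_AT = time(hour=1, tzinfo=timezone.utc)


def next_run(now: datetime) -> datetime:
    """Return the next 01:00 UTC strictly after `now` (must be timezone-aware)."""
    now_utc = now.astimezone(timezone.utc)
    # Today's 01:00 UTC; combine() keeps the tzinfo carried by RUN_AT.
    candidate = datetime.combine(now_utc.date(), RUN_AT)
    if candidate <= now_utc:
        # Already past today's slot, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate
```

Called at 00:30 UTC it returns 01:00 UTC of the same day; called at or after 01:00 UTC it returns 01:00 UTC of the following day.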
It's a small Rust app, and I was thinking of creating a group and user for it, as we did for the metrics-api. But maybe that's overkill and I should just put it in the parser space?

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41516
metricsdb-01 root filesystem is full
2024-02-05T20:09:05Z
Jérôme Charaoui <lavamind@torproject.org>

For over a week, the root filesystem on `metricsdb-01` has been filled to 100%.
The cause seems to be log lines such as these being added tens (even hundreds) of thousands of times every day:
Feb 05 04:05:37 metricsdb-01 run[3664186]: 2024-02-05 04:05:37,453 WARN o.t.m.d.p.WebStatsParser:114 ERROR: duplicate key value violates unique constraint "log_line_pkey"
Feb 05 04:05:37 metricsdb-01 run[3664186]: Detail: Key (digest)=(g4tX2M7Beig0hqfn2OaUHKGTpXTjel+p8wrfWoTzK+8) already exists.

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41515
meronense OOM
2024-02-05T19:52:19Z
anarcat

today, metrics.tpo went down because the OOM killer was invoked. not sure what happened. i restarted both metrics-r and metrics-web.service, pending further investigation.
this happened before, of course. we bumped the memory on that box to 20GB in #41335 and had issues after the bullseye upgrade as well (#40814), both incidents should be investigated. those are just the incidents that pop up in the gitlab "Similar issues", further investigation in other issues probably warranted.
possibly related with the bookworm upgrade, of course (#41252).

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41514
metricsdb-01 is out of disk space on /
2024-02-14T15:38:44Z
Kez

Roger reported metrics.tpo as being down (website returning 503). I checked nagios, and it looks like metricsdb-01 is out of disk space on the root partition. No other metrics-related issues are being reported in nagios, so I assume this is what's causing the metrics.tpo outage.

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41512
Simplify onionoo architecture
2024-03-26T15:45:11Z
Hiro

Currently onionoo is a service comprised of 4 VMs: two backends with the onionoo java apps serving and updating the data, and two frontends.
At the time the service was launched this architecture made a lot of sense, but I think now we could simplify its maintenance by reducing it to a backend with a web server (like nginx) with some aggressive caching.
I was hoping we would sooner reach the point where onionoo could be retired, but given the current pace of development of the metrics pipeline, I think it makes sense to reduce this service now so that it is easier for metrics and tpa to maintain.
What do you think?

Hiro

https://gitlab.torproject.org/tpo/onion-services/onionspray/-/issues/35
MetricsPort support
2024-02-01T05:18:15Z
Silvio Rhatto

# Tasks
* [x] Add `MetricsPort` and `MetricsPortPolicy` support.
* [x] Document how to monitor Onion Services.
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)

Onionspray 1.6.0
Silvio Rhatto
2024-01-31

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41483
metricsdb-01 out of swap
2024-02-17T00:06:09Z
Kez

Nagios has an alert for metricsdb-01: SWAP CRITICAL - 4% free (65MB out of 2047MB). It's almost exclusively because of a victoria-metric process: `victoria-metric 1800892 kB`.
@hiro I'm assigning this to you because you'll probably know what to do with it better than me

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41454
Migrate metrics-store-01 to object storage
2024-01-04T19:34:23Z
Hiro

We have agreed we can migrate metrics-store-01 to object storage.

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41452
estimate storage requirements for metricsdb and backups
2024-03-24T14:01:45Z
anarcat

in #41424, we have agreed to continue with the monolithic postgresql design for the time being, more or less -- collector will move to object storage and there's a possibility of introducing other optimizations (https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416#note_2978071) -- but for now that's the plan.
We'll need to scale up storage for metricsdb.
Right now, the storage usage is as follows:
| machine | used | size |
|----------------|---------|----------|
| metricsdb-01 | 1.07TiB | 7.88TiB |
| bungei pg | 1.36TiB | 2.96TiB |
| **total** | 2.43TiB | 10.84TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=bungei.torproject.org&from=now-90d&to=now&refresh=5s&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fpg
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=metricsdb-01.torproject.org&from=now-1y&to=now&refresh=5s
The specification is that we need weekly backups of the postgresql database (*not* WAL logs) except for a subset of tables that need hourly or better backups (ideally WAL).
The estimate is that the database size *at launch* will be around 5TiB, growing by 500GiB per year.
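
As a rough aid for the hardware estimate, the figures above can be turned into a simple projection. This is a minimal sketch that assumes strictly linear growth and ignores backup retention; the function names are illustrative, not taken from any existing tool.

```python
GIB_PER_TIB = 1024


def projected_size_gib(years: float,
                       launch_tib: float = 5.0,
                       growth_gib_per_year: float = 500.0) -> float:
    """Projected database size in GiB after `years`, assuming linear growth.

    The defaults are the estimates quoted in the issue: about 5TiB at
    launch, growing by 500GiB per year.
    """
    return launch_tib * GIB_PER_TIB + growth_gib_per_year * years


def projected_size_tib(years: float) -> float:
    """Same projection, expressed in TiB."""
    return projected_size_gib(years) / GIB_PER_TIB
```

Under these assumptions the database would reach 6620GiB (about 6.46TiB) after three years; each retained full backup needs space on that order as well, before any compression.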
This could involve building a new storage server to handle those backups (#41364) and we feel it would be a good idea to start working with [Barman](https://pgbarman.org/) for this system.
The output of this issue is an estimate for hardware needs, a rough architectural draft, and subsequent tickets to make necessary changes to reflect said architecture.
/cc @lavamind

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41450
Move collector.torproject.org to serve files stored in object storage
2024-01-04T19:33:19Z
Hiro

In https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416 we have discussed how we can move the tarballs from metrics-store-01 and those collector creates to object storage.
For metrics-store-01 we can just move the files, and once we have the bucket, we can just update the links in the wiki where we list our archives.
For collector we need a way for people to browse the archives and download tarballs recursively if needed. I am thinking that we should preserve what we serve on collector.tpo, just have the links point to the buckets.
Once this is done, we can also discuss how we could generate the tarballs and move them to minio.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41449
estimate hardware requirements to host collector and metrics store in object storage / minio
2024-03-26T15:44:07Z
anarcat

In #41416, we have agreed to start moving storage from a filesystem into object storage for collector and metrics-store-01. This involves creating a separate bucket for each service and access tokens for each (which is easy enough) but we also need to consider the impact of the object storage server, since this is kind of a big deal.
Right now, the storage usage is as follows:
| machine | used | free |
|----------------|---------|---------|
| colchicifolium | 819GiB | 1.65TiB |
| collector-02 | 55GiB | 255GiB |
| metrics-store | 742GiB | 1.54GiB |
| **total** | 1.51TiB | 3.14TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=colchicifolium.torproject.org&var-instance=collector-02.torproject.org&var-instance=metrics-store-01.torproject.org&from=now-1y&to=now&refresh=5s
Note that the total includes all disk partitions, including `/`, so it might inflate the total a bit.
We need to figure out whether we can host this in the current object storage infrastructure, including backups (#41415), and if not, how much it will cost to deploy new resources to do so.
/cc @lavamind

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41424
Finding a reasonable backup strategy for metricsdb-01
2023-12-19T20:20:19Z
Hiro

Today with @anarcat we have briefly discussed a change of retention policy for metricsdb-01 postgresql database.
If I am not mistaken the policy changed from 30 days retention to 7 days. I think that is correct but we could even find a policy that is easier to maintain over time.
I think we could assume that whatever we have on metricsdb-01 could be recreated from archives that we store on collector. The only data that escape that rule would be the tags and notes that we intend to attach to relays with tagtor.
For those we could set up a timer that would dump the few tables (I think 4 in total) that store that data. The dumps wouldn't be big or slow to create, so that could be a solution.
I am not sure whether there is anything else we could consider, but I am open to suggestions.
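
The timer-driven dump of those few tables could look roughly like the sketch below, which only builds a `pg_dump` command line. The table names, the database name, and the output directory are placeholders, since the issue does not name them; `--format`, `--file`, and `--table` are standard `pg_dump` options.

```python
from datetime import datetime, timezone

# Hypothetical table names: the issue only says "a few tables (I think 4 in total)".
TAGTOR_TABLES = ["tags", "notes", "relay_tags", "relay_notes"]


def build_pg_dump_command(dbname, tables, out_dir, now=None):
    """Build a pg_dump argv that dumps only `tables` into a timestamped file."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    outfile = f"{out_dir}/tagtor-{stamp}.dump"
    cmd = ["pg_dump", "--format=custom", f"--file={outfile}"]
    # One --table switch per table; pg_dump accepts the option repeatedly.
    for table in tables:
        cmd.append(f"--table={table}")
    cmd.append(dbname)  # the database name is the final positional argument
    return cmd
```

A systemd timer or cron job could then run the result with `subprocess.run(cmd, check=True)`, for example `build_pg_dump_command("metrics", TAGTOR_TABLES, "/srv/backups/tagtor")`, where both the database name and the path are assumptions.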
/cc @gk @micah

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416
Discuss possible issue with storage for metrics services
2024-03-24T14:05:32Z
Hiro

There have been various discussions about what is the best long term strategy to scale metrics services.
The long-term plan, at the time of writing, is to concentrate all our storage on metricsdb, which represents the pipeline v2.0 with postgresql and Victoria Metrics.
Some issues have been discussed regarding the long term growth rate of this [setup](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023#note_2968760).
I understand tpa can now offer object storage, but we are now more than 1 year into developing the new pipeline and there are many issues that we should consider on the metrics side. This ticket, though, is not meant to discuss those issues so much as to understand what tpa can support in the long run before making a development plan from the network health perspective.

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41380
onionoo-backend-01 running filling up swap
2024-01-10T16:47:43Z
anarcat

![image](/uploads/d6877b98d7d21a676d788bb27f144e68/image.png)
https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1&var-class=All&var-node=onionoo-backend-01.torproject.org&var-node=onionoo-backend-02.torproject.org&from=now-1y&to=now
something on onionoo-backend-01 is eating up all swap. it seems to have stabilized now, but it tripped the critical warnings in nagios.
@hiro any idea what's going on here?

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41372
pg backups filling up on bungei
2024-03-26T15:15:15Z
anarcat

similar to #41361 except now it's the `/srv/backups/pg` partition that's filling up...
1 year graph:
![image](/uploads/6500ce9736e25737fd16357e8d1f0d19/image.png)
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&var-class=All&var-instance=bungei.torproject.org
30 days:
![image](/uploads/8b193a1cc848d97cde37ab43b49d2c77/image.png)
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-30d&to=now&var-class=All&var-instance=bungei.torproject.org
change rate is -1TB per month according to grafana.
/cc @gk

anarcat
2024-03-21

https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/39
test bridges every hour
2024-03-19T13:14:07Z
meskio <meskio@torproject.org>

We want to use bridgestrap results to know if a bridge is running, instead of using the 'Running' flag (https://gitlab.torproject.org/tpo/network-health/team/-/issues/318). For that, bridgestrap will need to update its collector file every hour; currently bridgestrap tests bridges every 18h and publishes the collector file every day.
Is bridgestrap able to test all bridges every hour? Or do we need to consider other options (https://gitlab.torproject.org/tpo/core/arti/-/issues/717)?

meskio <meskio@torproject.org>

https://gitlab.torproject.org/tpo/core/tor/-/issues/40871
Tor incorrectly stores stats on incoming PT connections
2023-12-10T21:38:18Z
Alexander Færøy <ahf@torproject.org>

@trinity-1686a and @dcf discussed this issue on tor-dev@ in https://lists.torproject.org/pipermail/tor-dev/2023-October/014858.html
It seems like we have a bug after we updated our connection tracking code to track incoming connections earlier. We don't handle the transport name parameter of our eager call to `geoip_note_client_seen()`.
@trinity-1686a may potentially have a patch for this. I think it would be good if we could get some testing on this before we merge it.
Would you be up for running your Tor instance with a patch that potentially fixes this issue, @dcf?

Tor: 0.4.8.x-post-stable
trinity-1686a

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41343
Onionoo backends out of disk space
2023-11-20T21:57:17Z
Hiro

Seems the onionoo backends have run out of disk space on /srv. Can we increase space? I think if we could add at least 10 more GB to each host (ideally 20) it would be ok.

anarcat