The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
2024-03-26T15:44:07Z

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41449
estimate hardware requirements to host collector and metrics store in object storage / minio
2024-03-26T15:44:07Z
anarcat

In #41416, we have agreed to start moving storage from a filesystem into object storage for collector and metrics-store-01. This involves creating a separate bucket for each service and access tokens for each (which is easy enough) but we also need to consider the impact of the object storage server, since this is kind of a big deal.
Right now, the storage usage is as follows:

| machine        | used    | free    |
|----------------|---------|---------|
| colchicifolium | 819GiB  | 1.65TiB |
| collector-02   | 55GiB   | 255GiB  |
| metrics-store  | 742GiB  | 1.54GiB |
| **total**      | 1.51TiB | 3.14TiB |

Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=colchicifolium.torproject.org&var-instance=collector-02.torproject.org&var-instance=metrics-store-01.torproject.org&from=now-1y&to=now&refresh=5s
Note that the total includes all disk partitions, including `/`, so it might inflate the total a bit.
We need to figure out if we can host this in the current object storage infrastructure, including backups (#41415), and if not, how much it will cost to deploy new resources to do so.
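As a rough cross-check of the usage figures, the per-machine values can be converted to a common unit and summed. This is only a sketch: it assumes binary units (1 TiB = 1024 GiB), uses just the `used` column as listed in the table, and the helper name `to_gib` is made up for illustration.

```python
# Sum the "used" column of the storage table in a common unit (GiB).
# Assumption: binary units, 1 TiB = 1024 GiB.

UNIT_IN_GIB = {"GiB": 1.0, "TiB": 1024.0}


def to_gib(size: str) -> float:
    """Convert a string such as '819GiB' or '1.65TiB' to GiB (hypothetical helper)."""
    for unit, factor in UNIT_IN_GIB.items():
        if size.endswith(unit):
            return float(size[: -len(unit)]) * factor
    raise ValueError(f"unknown unit in {size!r}")


# Values copied from the "used" column of the table.
used = {
    "colchicifolium": "819GiB",
    "collector-02": "55GiB",
    "metrics-store": "742GiB",
}

total_used_gib = sum(to_gib(v) for v in used.values())
print(f"total used: {total_used_gib:.0f} GiB = {total_used_gib / 1024:.2f} TiB")
```

Summing the rounded per-machine values will not necessarily reproduce the dashboard total exactly, so this is a sanity check rather than a capacity figure.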
/cc @lavamind
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416
Discuss possible issue with storage for metrics services
2024-03-24T14:05:32Z
Hiro

There have been various discussions about what is the best long term strategy to scale metrics services.
The long term plan, at the time of writing, is to concentrate all our storage on metricsdb, which represents the pipeline v2.0 with postgresql and Victoria Metrics.
Some issues have been discussed regarding the long term growth rate of this [setup](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023#note_2968760).
I understand tpa now can offer object storage, but we are now more than 1 year into developing the new pipeline and there are many issues that we should consider on the metrics side. This ticket, though, is not so much to discuss those issues as to understand what tpa can support in the long run before making a development plan from the network health perspective.
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41372
pg backups filling up on bungei
2024-03-26T15:15:15Z
anarcat

similar to #41361 except now it's the `/srv/backups/pg` partition that's filling up...
1 year graph:
![image](/uploads/6500ce9736e25737fd16357e8d1f0d19/image.png)
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-1y&to=now&var-class=All&var-instance=bungei.torproject.org
30 days:
![image](/uploads/8b193a1cc848d97cde37ab43b49d2c77/image.png)
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&from=now-30d&to=now&var-class=All&var-instance=bungei.torproject.org
change rate is -1TB per month according to grafana.
/cc @gk
anarcat
2024-03-21

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40809
upgrade meronense to PostgreSQL 13
2022-07-22T17:19:13Z
anarcat

as part of the second batch of upgrades (tpo/tpa/team#40692), we need to upgrade meronense to PostgreSQL 13
opening a ticket because the standard procedure failed
Debian 11 bullseye upgrade
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41516
metricsdb-01 root filesystem is full
2024-02-05T20:09:05Z
Jérôme Charaoui
lavamind@torproject.org

For over a week, the root filesystem on `metricsdb-01` has been filled to 100%.
The cause seems to be related to log lines such as this being added tens (even hundreds) of thousands of times every day:
Feb 05 04:05:37 metricsdb-01 run[3664186]: 2024-02-05 04:05:37,453 WARN o.t.m.d.p.WebStatsParser:114 ERROR: duplicate key value violates unique constraint "log_line_pkey"
Feb 05 04:05:37 metricsdb-01 run[3664186]: Detail: Key (digest)=(g4tX2M7Beig0hqfn2OaUHKGTpXTjel+p8wrfWoTzK+8) already exists.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41483
metricsdb-01 out of swap
2024-02-17T00:06:09Z
Kez

Nagios has an alert for metricsdb-01: SWAP CRITICAL - 4% free (65MB out of 2047MB). It's almost exclusively because of a victoria-metric process: `victoria-metric 1800892 kB`.
@hiro I'm assigning this to you because you'll probably know what to do with it better than me
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41335
Increase ram on meronense to 20GB
2024-02-02T03:23:34Z
Hiro

Lately it seems like the update service for the metrics website has been generating OOM errors on meronense and getting killed by the kernel. At first I thought the cap on the RAM usage was being ignored for some reason. So I lowered the RAM cap and Java is indeed respecting that, but that value is now too low for the service to run and process the data. Some jobs end up running for too many days, which ultimately means we aren't processing the data as we should.
I propose to increase the RAM on the VM. I know this might not be ideal, but I don't have another way to fix this at this point. Hopefully with metricsdb working correctly we will be able to migrate everything there soon.
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41293
convert meronense @reboot cron jobs to systemd services
2023-11-08T14:40:39Z
Kez

meronense frequently has need-restart warnings for `cron.service` due to two `@reboot` cron jobs. because we can't just let needrestart take care of things, or restart the services manually, TPA needs to find someone on the metrics team to restart the services, or we need to reboot the whole server. that's disruptive to both the metrics team, and TPA. rather than living with those disruptions (or just ignoring the nagios warning), we should convert those cron jobs to a systemd service.
the jobs are `metrics-web-start` and `metrics-web-start-rserve` defined in `tor-puppet/modules/profile/manifests/metrics.pp`. they should be simple enough to convert, they each just `cd` into a directory and then run a script. i think the biggest concern with converting them is checking in with the metrics team and making sure they're okay with the change, and making sure that nothing breaks.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40965
Planning how to deploy victoriametrics on metrics-psqlts-01
2022-11-28T22:51:23Z
Hiro

I am thinking of starting to deploy victoriametrics to metrics-psqlts-01.
The quickstart guide suggests using either docker or snap https://docs.victoriametrics.com/Quick-Start.html.
What would be your take on this? I know there is a debian package but it is also a bit outdated.
Cheers!
cc: @gk
Sponsor 112 : Combating Malicious Relays
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40945
Restore a db on meronense from backup
2022-11-01T16:11:26Z
Hiro

I have an issue on metrics tpo where it would save me weeks of re-parsing descriptors if I could restore two tables on the userstats db from backups.
I would need ideally to have a userstats_copy db from before the 1st of October 2020 so I can copy over data for the month of september.
- [x] setup new temp VM (@anarcat, done, metrics-backup-01)
- [x] restore meronense backup (@anarcat, done)
- [x] start the database, replay WAL files up to October 1st (`service postgresql start`, @anarcat)
- [x] restore the table to meronense (@hiro)
- [x] remove the SSH key from bungei (`rm /etc/ssh/userkeys/torbackup.more`, @anarcat)
- [x] remove firewall rule (`iptables-legacy -D INPUT -s 49.12.57.139 -j ACCEPT`, @anarcat)
- [x] wait for hiro to give the go to retire the server (@anarcat)
- [x] retire the backup host (@anarcat), which is:
  - [x] remove backups
  - [x] `gnt-instance remove`
  - [x] remove from LDAP
  - [x] remove from tor-passwords
  - [x] remove from puppet

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40814
OOM issue on meronense after upgrade
2024-02-02T03:23:35Z
Hiro

Noticed metrics.tpo is not getting all its updates since postgresql has been upgraded to v13.
I have started the script manually: https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/website/run-web.sh
And found out the job was being killed:
```
[308908.109696] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-0.slice/session-4020.scope,task=java,pid=375579,uid=1512
[308908.109723] Out of memory: Killed process 375579 (java) total-vm:14411748kB, anon-rss:7917568kB, file-rss:0kB, shmem-rss:32kB, UID:1512 pgtables:23120kB oom_score_adj:0
```
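To see how much memory the Java process actually held when it was killed, the kernel line can be parsed. A minimal sketch: the function name `parse_oom_kill` is made up, and the regular expressions are only known to match the line quoted above, not every kernel version's format.

```python
import re

# Kernel "Killed process" line copied from the log above.
OOM_LINE = (
    "[308908.109723] Out of memory: Killed process 375579 (java) "
    "total-vm:14411748kB, anon-rss:7917568kB, file-rss:0kB, "
    "shmem-rss:32kB, UID:1512 pgtables:23120kB oom_score_adj:0"
)


def parse_oom_kill(line: str) -> dict:
    """Extract pid, process name and the kB-valued fields from an OOM kill line."""
    proc = re.search(r"Killed process (\d+) \(([^)]+)\)", line)
    if proc is None:
        raise ValueError("not an OOM kill line")
    # Collect every "name:<number>kB" pair, e.g. anon-rss:7917568kB.
    fields = {k: int(v) for k, v in re.findall(r"([\w-]+):(\d+)kB", line)}
    return {"pid": int(proc.group(1)), "name": proc.group(2), **fields}


info = parse_oom_kill(OOM_LINE)
rss_gib = info["anon-rss"] / 1024**2  # kB -> GiB
print(f"{info['name']} (pid {info['pid']}): anon-rss {rss_gib:.2f} GiB")
```

The resident size at kill time gives a lower bound on what the job needs, which is useful when deciding how much RAM to give the VM.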
cc: @gk
anarcat
2022-07-27

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40705
Retire metabase
2022-05-17T20:06:19Z
Hiro

We are not using metabase since the DB on meronense is too slow. On top of it, there have been a few scary bugs in the java world, so it makes sense to retire it.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41514
metricsdb-01 is out of disk space on /
2024-02-14T15:38:44Z
Kez

Roger reported metrics.tpo as being down (website returning 503). I checked nagios, and it looks like metricsdb-01 is out of disk space on the root partition. No other metrics-related issues are being reported in nagios, so I assume this is what's causing the metrics.tpo outage.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41454
Migrate metrics-store-01 to object storage
2024-01-04T19:34:23Z
Hiro

We have agreed we can migrate metrics-store-01 to object storage.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41450
Move collector.torproject.org to serve files stored in object storage
2024-01-04T19:33:19Z
Hiro

In https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416 we have discussed how we can move the tarballs from metrics-store-01 and those collector creates to object storage.
For metrics-store-01 we can just move the files, and once we have the bucket, we can just update the links in the wiki where we list our archives.
For collector we need a way for people to browse the archives and download tarballs recursively if needed. I am thinking that we should preserve what we serve on collector.tpo, just have the links point to the buckets.
Once this is done, we can also discuss how we could generate the tarballs and move them to minio.

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40826
materculae will run out of disk space in a year
2023-09-21T01:19:30Z
anarcat

so we've just had a soft warning that materculae has hit our magic 15% free disk limit. looking at this graph, it seems we've taken up about 17GB in the last year, with 20 remaining:
![image](/uploads/e4118ed09adfc0dc1d86b88b60acb63c/image.png)
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-instance=materculae.torproject.org&from=now-1y&to=now
so this will become a real problem, but not before a year (!). i'd still like to figure out what to do with this to keep nagios clean... is it normal that the disk usage keeps growing? maybe we can grow available disk space already? `/var/lib/postgresql` is at 150GB right now.
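The "not before a year" estimate follows from dividing the remaining space by the yearly growth. A small sketch of that arithmetic, assuming growth stays linear at the rate read off the graph (the function name `runway_years` is made up for illustration):

```python
def runway_years(free_gb: float, growth_gb_per_year: float) -> float:
    """Years until the disk is full, assuming constant linear growth."""
    if growth_gb_per_year <= 0:
        raise ValueError("growth must be positive")
    return free_gb / growth_gb_per_year


# Figures from the description: about 17GB used in the last year, 20 remaining.
years = runway_years(free_gb=20, growth_gb_per_year=17)
print(f"about {years:.1f} years ({years * 12:.0f} months) left")
```

If the growth rate changes, for example after a new data source is added, the estimate would need to be redone with the new slope.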
/cc @hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41222
Is the web ui disabled for our VictoriaMetrics version?
2023-06-13T12:37:36Z
Hiro

I see the web ui for VictoriaMetrics at https://metrics-db.torproject.org/vmui/ is returning a 404.
\@gk
Sponsor 112 : Combating Malicious Relays
Jérôme Charaoui
lavamind@torproject.org

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41342
Change apache log format
2023-12-11T18:23:02Z
Hiro

I was wondering if it would be a terrible idea to change apache log format to JSON? What do you all think?
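For context on what a JSON log format would buy: consumers could parse each line with a stock JSON parser instead of a regular expression. The sketch below is illustrative only; the field names are hypothetical and do not reflect any existing or proposed Apache configuration.

```python
import json

# Hypothetical JSON-formatted access log line; the field names are assumptions.
line = (
    '{"time": "2023-12-11T18:23:02Z", "method": "GET", '
    '"path": "/index.html", "status": 200, "bytes": 5120}'
)

entry = json.loads(line)

# Fields come back already typed, so numeric comparisons need no conversion.
is_error = entry["status"] >= 500
print(entry["method"], entry["path"], entry["status"], "error" if is_error else "ok")
```

The trade-off is that anything currently reading the existing log format would have to be adapted.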
/cc @gk
anarcat

https://gitlab.torproject.org/tpo/onion-services/onionspray/-/issues/35
MetricsPort support
2024-02-01T05:18:15Z
Silvio Rhatto

# Tasks
* [x] Add `MetricsPort` and `MetricsPortPolicy` support.
* [x] Document how to monitor Onion Services.
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)
Onionspray 1.6.0
Silvio Rhatto
2024-01-31

https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/39
test bridges every hour
2024-03-19T13:14:07Z
meskio
meskio@torproject.org

We want to use bridgestrap results to know if a bridge is running, instead of using the 'Running' flag (https://gitlab.torproject.org/tpo/network-health/team/-/issues/318). For that, bridgestrap will need to update its collector file every hour; currently bridgestrap tests bridges every 18h and publishes the collector file every day.
Is bridgestrap able to test all bridges every hour? Or do we need to consider other options (https://gitlab.torproject.org/tpo/core/arti/-/issues/717)?
meskio
meskio@torproject.org