The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
2024-03-07T14:23:37Z

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41526
Deploy onionperf files parser on metricsdb-01
2024-03-07T14:23:37Z, Hiro

We need to deploy https://gitlab.torproject.org/tpo/network-health/metrics/tor_fusion/ on metricsdb-01.
Basically this thing will run, download onionperf files from collector, and parse them. This will happen just once a day, around 1am UTC, because midnight is when collector fetches the archives from the various onionperf clients.
It's a little Rust app, and I was thinking of creating a group and user for it, like for the metrics-api. But maybe that's a bit overkill and I should just put it in the parser space?
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41516
metricsdb-01 root filesystem is full
2024-02-05T20:09:05Z, Jérôme Charaoui (lavamind@torproject.org)

For over a week, the root filesystem on `metricsdb-01` has been filled to 100%.
The cause seems to be related to log lines such as these being added tens (even hundreds) of thousands of times every day:
Feb 05 04:05:37 metricsdb-01 run[3664186]: 2024-02-05 04:05:37,453 WARN o.t.m.d.p.WebStatsParser:114 ERROR: duplicate key value violates unique constraint "log_line_pkey"
Feb 05 04:05:37 metricsdb-01 run[3664186]: Detail: Key (digest)=(g4tX2M7Beig0hqfn2OaUHKGTpXTjel+p8wrfWoTzK+8) already exists.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41515
meronense OOM
2024-02-05T19:52:19Z, anarcat

today, metrics.tpo went down because the OOM killer was invoked. not sure what happened. i restarted both metrics-r and metrics-web.service, pending further investigation.
this happened before, of course. we bumped the memory on that box to 20GB in #41335 and had issues after the bullseye upgrade as well (#40814), both incidents should be investigated. those are just the incidents that pop up in the gitlab "Similar issues", further investigation in other issues probably warranted.
possibly related with the bookworm upgrade, of course (#41252).
anarcat

https://gitlab.torproject.org/tpo/onion-services/onionspray/-/issues/35
MetricsPort support
2024-02-01T05:18:15Z, Silvio Rhatto

# Tasks
* [x] Add `MetricsPort` and `MetricsPortPolicy` support.
* [x] Document how to monitor Onion Services.
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)
Onionspray 1.6.0, Silvio Rhatto, 2024-01-31

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41452
estimate storage requirements for metricsdb and backups
2024-03-24T14:01:45Z, anarcat

in #41424, we have agreed to continue with the monolithic postgresql design for the time being, more or less -- collector will move to object storage and there's a possibility of introducing other optimizations (https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416#note_2978071) -- but for now that's the plan.
We'll need to scale up storage for metricsdb.
Right now, the storage usage is as follows:
| machine | used | size |
|----------------|---------|----------|
| metricsdb-01 | 1.07TiB | 7.88TiB |
| bungei pg | 1.36TiB | 2.96TiB |
| **total** | 2.43TiB | 10.84TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=bungei.torproject.org&from=now-90d&to=now&refresh=5s&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fpg
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=metricsdb-01.torproject.org&from=now-1y&to=now&refresh=5s
The specification is that we need weekly backups of the postgresql database (*not* WAL logs) except for a subset of tables that need hourly or better backups (ideally WAL).
The estimate is that the database size *at launch* will be around 5TiB, with 500GiB of growth per year.
This could involve building a new storage server to handle those backups (#41364) and we feel it would be a good idea to start working with [Barman](https://pgbarman.org/) for this system.
The output of this issue is an estimate for hardware needs, a rough architectural draft, and subsequent tickets to make necessary changes to reflect said architecture.
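As a rough aid for the hardware estimate, here is a minimal sketch that projects the numbers quoted above (about 5TiB at launch, 500GiB of growth per year) a few years out. The linear growth model, the function names, and the choice of four retained weekly copies are assumptions made for illustration only. The sketch ignores compression, WAL volume, and the hourly-backed-up subset of tables.

```python
GIB_PER_TIB = 1024


def projected_size_tib(years, launch_tib=5.0, growth_gib_per_year=500.0):
    """Linear projection of the database size, in TiB, after `years` years."""
    return launch_tib + years * growth_gib_per_year / GIB_PER_TIB


def backup_footprint_tib(db_size_tib, weekly_copies=4):
    """Space needed to keep `weekly_copies` full weekly dumps.

    `weekly_copies` is a hypothetical retention setting, not a decided policy.
    """
    return db_size_tib * weekly_copies


if __name__ == "__main__":
    for year in range(6):
        size = projected_size_tib(year)
        backups = backup_footprint_tib(size)
        print(f"year {year}: database ~{size:.2f} TiB, backups ~{backups:.2f} TiB")
```

Changing `weekly_copies` or the growth rate shows how sensitive the backup server sizing is to the retention policy.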
/cc @lavamind
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41424
Finding a reasonable backup strategy for metricsdb-01
2023-12-19T20:20:19Z, Hiro

Today, @anarcat and I briefly discussed a change of retention policy for the metricsdb-01 postgresql database.
If I am not mistaken the policy changed from 30 days retention to 7 days. I think that is correct but we could even find a policy that is easier to maintain over time.
I think we could assume that whatever we have on metricsdb-01 could be recreated from archives that we store on collector. The only data that escape that rule would be the tags and notes that we intend to attach to relays with tagtor.
For those we could set up a timer that would dump the few tables (I think 4 in total) that store that data. The dumps wouldn't be big or slow to create, so that could be a solution.
I am not sure whether there is anything else we could consider, but I am open to suggestions.
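A minimal sketch of what such a timer could run, assuming a small Python wrapper around `pg_dump`. The database name, table names, and output path are hypothetical placeholders, since the actual tagtor tables are not named here. Only standard `pg_dump` options are used (`--format`, `--file`, `--dbname`, `--table`).

```python
import shlex


def build_dump_command(dbname, tables, outfile):
    """Build a pg_dump invocation that dumps only the listed tables."""
    cmd = ["pg_dump", "--format=custom", f"--file={outfile}", f"--dbname={dbname}"]
    for table in tables:
        # One --table option per table restricts the dump to those tables.
        cmd.append(f"--table={table}")
    return cmd


if __name__ == "__main__":
    # Hypothetical names: the real database and table names may differ.
    tables = ["relay_tags", "relay_notes"]
    cmd = build_dump_command("metrics", tables, "/srv/backups/tagtor.dump")
    print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from a script started by a systemd timer or cron job.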
/cc @gk @micah
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416
Discuss possible issue with storage for metrics services
2024-03-24T14:05:32Z, Hiro

There have been various discussions about what is the best long term strategy to scale metrics services.
The long term plan, at the time of writing, is to concentrate all our storage on metricsdb, which represents the pipeline v2.0 with postgresql and Victoria Metrics.
Some issues have been discussed regarding the long term growth rate of this [setup](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023#note_2968760).
I understand tpa can now offer object storage, but we are now more than 1 year into developing the new pipeline and there are many issues that we should consider on the metrics side. This ticket, though, is not meant to discuss those issues so much as to understand what tpa can support in the long run before making a development plan from the network health perspective.
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41380
onionoo-backend-01 running filling up swap
2024-01-10T16:47:43Z, anarcat

![image](/uploads/d6877b98d7d21a676d788bb27f144e68/image.png)
https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1&var-class=All&var-node=onionoo-backend-01.torproject.org&var-node=onionoo-backend-02.torproject.org&from=now-1y&to=now
something on onionoo-backend-01 is eating up all swap. it seems to have stabilized now, but it tripped the critical warnings in nagios.
@hiro any idea what's going on here?
Hiro

https://gitlab.torproject.org/tpo/anti-censorship/bridgestrap/-/issues/39
test bridges every hour
2024-03-19T13:14:07Z, meskio (meskio@torproject.org)

We want to use bridgestrap results to know if a bridge is running, instead of using the 'Running' flag (https://gitlab.torproject.org/tpo/network-health/team/-/issues/318). For that, bridgestrap will need to update its collector file every hour; currently bridgestrap tests bridges every 18h and publishes the collector file every day.
Is bridgestrap able to test all bridges every hour? Or do we need to consider other options (https://gitlab.torproject.org/tpo/core/arti/-/issues/717)?
meskio (meskio@torproject.org)

https://gitlab.torproject.org/tpo/core/tor/-/issues/40871
Tor incorrectly stores stats on incoming PT connections
2023-12-10T21:38:18Z, Alexander Færøy (ahf@torproject.org)

@trinity-1686a and @dcf discussed this issue on tor-dev@ in https://lists.torproject.org/pipermail/tor-dev/2023-October/014858.html
It seems like we have a bug after we updated our connection tracking code to track incoming connections earlier. We don't handle the transport name parameter of our eager call to `geoip_note_client_seen()`.
@trinity-1686a may potentially have a patch for this. I think it would be good if we could get some testing on this before we merge it.
Would you be up for running your Tor instance with a patch that potentially fixes this issue, @dcf?
Tor: 0.4.8.x-post-stable, trinity-1686a

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41343
Onionoo backends out of disk space
2023-11-20T21:57:17Z, Hiro

Seems the onionoo backends have run out of disk space on /srv. Can we increase space? I think if we could add at least 10 more GB to each host (ideally 20) it would be ok.
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41342
Change apache log format
2023-12-11T18:23:02Z, Hiro

I was wondering if it would be a terrible idea to change apache log format to JSON? What do you all think?
/cc @gk
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41335
Increase ram on meronense to 20GB
2024-02-02T03:23:34Z, Hiro

Lately it seems like the update service for the metrics website has been generating oom errors on meronense and is being killed by the kernel. At first I thought the cap on the ram usage was being ignored for some reason. So I lowered the ram cap and Java is indeed respecting that, but that value is now too low for the service to run and process the data. Some jobs end up running for too many days, which ultimately means we aren't processing the data as we should.
I propose to increase the RAM on the VM. I know this might not be ideal, but I don't have another way to fix this at this point. Hopefully with metricsdb working correctly we will be able to migrate everything there soon.
anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41307
Deploy network status api on metricsdb-01
2023-09-06T20:18:23Z, Hiro

We need to deploy the network status api onto metricsdb-01.
This is a web-based service that reads data out of the postgres db and Victoria Metrics (https://gitlab.torproject.org/tpo/network-health/metrics/networkstatusapi/)
I am going to add it behind apache and protect it with http auth as this is not a public service yet.
\cc @gk
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41293
convert meronense @reboot cron jobs to systemd services
2023-11-08T14:40:39Z, Kez

meronense frequently has need-restart warnings for `cron.service` due to two `@reboot` cron jobs. because we can't just let needrestart take care of things, or restart the services manually, TPA needs to find someone on the metrics team to restart the services, or we need to reboot the whole server. that's disruptive to both the metrics team, and TPA. rather than living with those disruptions (or just ignoring the nagios warning), we should convert those cron jobs to a systemd service.
the jobs are `metrics-web-start` and `metrics-web-start-rserve` defined in `tor-puppet/modules/profile/manifests/metrics.pp`. they should be simple enough to convert, they each just `cd` into a directory and then run a script. i think the biggest concern with converting them is checking in with the metrics team and making sure they're okay with the change, and making sure that nothing breaks.
Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41258
materculae out of disk space
2023-09-21T01:51:41Z, Kez

previous ticket: #40826
it's been a year, and nagios is complaining about materculae's /srv partition
```
# df -h /srv
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_materculae-srv  147G  135G  4.3G  97% /srv
```
in the previous ticket (#40826) @anarcat changed the warning threshold, which is why this warning popped up now.
according to grafana, the usage has only grown by about 15G in the past year, and the growth is linear. we could add another 20G and revisit in a year, or throw 40G or 60G at it to push things further down the road.
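to compare those options, here is a minimal sketch of the arithmetic, assuming the linear growth of roughly 15G per year holds and starting from the 4.3G currently available in the `df` output. the function name is made up for illustration, and the result is the time until the partition is completely full, not until the nagios warning threshold trips again.

```python
def runway_years(avail_gib, added_gib, growth_gib_per_year):
    """Years until the partition is full, assuming constant linear growth."""
    return (avail_gib + added_gib) / growth_gib_per_year


if __name__ == "__main__":
    # 4.3G available and ~15G/year growth, as quoted in this ticket.
    for added in (20, 40, 60):
        years = runway_years(4.3, added, 15)
        print(f"+{added}G: about {years:.1f} years until /srv is full")
```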
![image](/uploads/e8ddf8b69703273f73d891586f7fc137/image.png)
anarcat, 2023-09-22

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41222
Is the web ui disabled for our VictoriaMetrics version?
2023-06-13T12:37:36Z, Hiro

I see the web ui for VictoriaMetrics at https://metrics-db.torproject.org/vmui/ is returning a 404.
\@gk
Sponsor 112 : Combating Malicious Relays, Jérôme Charaoui (lavamind@torproject.org)

https://gitlab.torproject.org/tpo/onion-services/onionspray-log-parser/-/issues/7
Add a flags on eotk-get-logs-from-s3 to select from/to dates
2023-06-05T16:09:03Z, Silvio Rhatto

* [x] Add flags to `eotk-get-logs-from-s3` to allow filtering logs by a date range or a single month. Only logs in that range (or in that month) should be copied.
* [x] Inform S123 analytics when this flag is ready to be tested.
Silvio Rhatto, 2023-05-31

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41167
connect to postgresql db on new metrics DB via tls
2023-06-27T15:13:20Z, Hiro

Would it be possible to get a read only user to connect to the postgresql db on metrics-psqlts-01 via tls?
This would be used to access it via grafana, but also allow metrics developers to query the data.
Possibly people that would access this would be:
@hiro
@gk
@mattrighetti
Jérôme Charaoui (lavamind@torproject.org)

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41161
rebuild corsicum into collector-02.torproject.org
2023-05-23T16:24:44Z, anarcat

we need to migrate out of the old Sunet cluster into the new Safespring cluster, corsicum needs to be retired and rebuilt into collector-02.
see also #40684.
anarcat