The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
Updated: 2024-03-07T14:23:37Z

Deploy onionperf files parser on metricsdb-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41526
Updated: 2024-03-07T14:23:37Z
Author: Hiro

We need to deploy https://gitlab.torproject.org/tpo/network-health/metrics/tor_fusion/ on metricsdb-01.
Basically this thing will run, download onionperf files from collector and parse them. This will happen just once a day, around 1am UTC, since midnight is when collector fetches the archives from the various onionperf clients.
It's a little Rust app and I was thinking of creating a group and user like for the metrics-api. But maybe that's a bit overkill and I should just put it in the parser space?

Assignee: Hiro

meronense OOM
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41515
Updated: 2024-02-05T19:52:19Z
Author: anarcat

today, metrics.tpo went down because the OOM killer was invoked. not sure what happened. i restarted both metrics-r and metrics-web.service, pending further investigation.
this happened before, of course. we bumped the memory on that box to 20GB in #41335 and had issues after the bullseye upgrade as well (#40814), both incidents should be investigated. those are just the incidents that pop up in the gitlab "Similar issues", further investigation in other issues probably warranted.
possibly related to the bookworm upgrade, of course (#41252).

Assignee: anarcat

Finding a reasonable backup strategy for metricsdb-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41424
Updated: 2023-12-19T20:20:19Z
Author: Hiro

Today with @anarcat we briefly discussed a change of retention policy for the metricsdb-01 postgresql database.
If I am not mistaken the policy changed from 30 days retention to 7 days. I think that is correct but we could even find a policy that is easier to maintain over time.
I think we could assume that whatever we have on metricsdb-01 could be recreated from archives that we store on collector. The only data that escape that rule would be the tags and notes that we intend to attach to relays with tagtor.
For those we could set up a timer that would dump the few tables (I think 4 in total) that store that data. The dumps wouldn't be big or slow to create, so that could be a solution.
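
A minimal sketch of what such a timer could run, assuming `pg_dump` is available on the host. The table names, database name and output directory below are hypothetical placeholders, since the issue does not name them. The function only builds the command line, so it can be reviewed before being wired into a systemd service.

```python
from datetime import datetime, timezone

# Hypothetical names: the issue does not say which four tables hold tags and notes.
TAG_TABLES = ["relay_tag", "relay_note", "tag", "note"]


def build_dump_command(dbname, tables, outdir, now=None):
    """Return a pg_dump argument list that dumps only the given tables."""
    now = now or datetime.now(timezone.utc)
    # Timestamped file name so that successive dumps do not overwrite each other.
    outfile = f"{outdir}/{dbname}-tags-{now:%Y%m%dT%H%M%SZ}.dump"
    cmd = ["pg_dump", "--format=custom", f"--file={outfile}"]
    cmd += [f"--table={table}" for table in tables]
    cmd.append(dbname)
    return cmd


if __name__ == "__main__":
    # "metrics" and "/srv/backups/tags" are assumptions made for illustration.
    print(" ".join(build_dump_command("metrics", TAG_TABLES, "/srv/backups/tags")))
```

Since the rest of the database could be recreated from the collector archives, only these small dumps would need frequent backups.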
I am not sure if there is anything else we could consider, but I am open to suggestions.
/cc @gk @micah

Assignee: anarcat

onionoo-backend-01 running filling up swap
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41380
Updated: 2024-01-10T16:47:43Z
Author: anarcat

![image](/uploads/d6877b98d7d21a676d788bb27f144e68/image.png)
https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1&var-class=All&var-node=onionoo-backend-01.torproject.org&var-node=onionoo-backend-02.torproject.org&from=now-1y&to=now
something on onionoo-backend-01 is eating up all swap. it seems to have stabilized now, but it tripped the critical warnings in nagios.
@hiro any idea what's going on here?

Assignee: Hiro

Onionoo backends out of disk space
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41343
Updated: 2023-11-20T21:57:17Z
Author: Hiro

Seems the onionoo backends have run out of disk space on /srv. Can we increase space? I think if we could add at least 10 more GB to each host (ideally 20) it would be ok.

Assignee: anarcat

Deploy network status api on metricsdb-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41307
Updated: 2023-09-06T20:18:23Z
Author: Hiro

We need to deploy the network status api onto metricsdb-01.
This is a web-based service that reads data out of the postgresql db and Victoria Metrics (https://gitlab.torproject.org/tpo/network-health/metrics/networkstatusapi/)
I am going to add it behind apache and protect it with http auth as this is not a public service yet.
\cc @gk

Assignee: Hiro

materculae out of disk space
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41258
Updated: 2023-09-21T01:51:41Z
Author: Kez

previous ticket: #40826
it's been a year, and nagios is complaining about materculae's /srv partition
```
# df -h /srv
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_materculae-srv  147G  135G  4.3G  97% /srv
```
in the previous ticket (#40826) @anarcat changed the warning threshold, which is why this warning popped up now.
according to grafana, the usage has only grown by about 15G in the past year, and the growth is linear. we could add another 20G and revisit in a year, or throw 40G or 60G at it to push things further down the road.
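
A back-of-the-envelope sketch of how long each option would last, assuming the growth stays linear at about 15G per year and starting from the 4.3G shown as available in the `df` output. The function name is made up for illustration. It computes the time until the partition is completely full, so the monitoring warning would fire earlier than these figures.

```python
def years_until_full(avail_g, added_g, growth_g_per_year):
    """Years before the partition is completely full, assuming linear growth."""
    return (avail_g + added_g) / growth_g_per_year


AVAIL_G = 4.3    # "Avail" column of the df output
GROWTH_G = 15.0  # approximate growth over the past year, per grafana

for added in (20, 40, 60):
    years = years_until_full(AVAIL_G, added, GROWTH_G)
    print(f"+{added}G: about {years:.1f} years until full")
```

Under these assumptions the three options give roughly 1.6, 3.0 and 4.3 years respectively.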
![image](/uploads/e8ddf8b69703273f73d891586f7fc137/image.png)

Assignee: anarcat
Due date: 2023-09-22

connect to postgresql db on new metrics DB via tls
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41167
Updated: 2023-06-27T15:13:20Z
Author: Hiro

Would it be possible to get a read only user to connect to the postgresql db on metrics-psqlts-01 via tls?
This would be used to access it via grafana, but also allow metrics developers to query the data.
Possibly people that would access this would be:
@hiro
@gk
@mattrighetti

Assignee: Jérôme Charaoui (lavamind@torproject.org)

rebuild corsicum into collector-02.torproject.org
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41161
Updated: 2023-05-23T16:24:44Z
Author: anarcat

we need to migrate out of the old Sunet cluster into the new Safespring cluster, corsicum needs to be retired and rebuilt into collector-02.
see also #40684.

Assignee: anarcat

Deploy new metrics database stack
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41130
Updated: 2023-07-07T08:00:20Z
Author: Hiro

I have been testing our victoriametrics + postgresql setup on metrics-psqlts-01 for a while, and now that we are close to having a prod deployment of this pipeline I'd like to have things properly in puppet.
I have a branch called metrics-deploy with a tentative setup that I'd like to have your opinion on.
This branch also has support for deploying a python web app to access and query both the postgresql db and victoria metrics.
Victoria metrics runs with docker, but without compose. I am not sure whether you'd prefer a compose setup, since this is a single service.
An alternative would be to run the full stack with compose. Would postgresql backups work in that case?
I am going to be out next week. So maybe we could discuss this in Costa Rica face to face?

Assignee: Hiro

Disk space increase on metrics-psqlts-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41114
Updated: 2023-04-04T00:02:14Z
Author: Hiro

Can we add 20 more gigabytes on metrics-psqlts-01?

data update service and timer on meronense
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41026
Updated: 2023-01-10T16:41:24Z
Author: Hiro

I would need some help figuring out why the update service on meronense doesn't wait for the previous run to finish before starting a new one.
The timer and service are in puppet and they only start this script: https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/website/run-web.sh
\cc @gk

Assignee: anarcat

CRITICAL disk usage on metrics-psqlts-01
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40910
Updated: 2022-10-04T14:03:07Z
Author: Kez

Icinga is reporting a critical issue for disk usage - all since 2022-09-29 22:57:01
```
DISK CRITICAL - free space: / 561 MB (5% inode=87%): /dev 3962 MB (100% inode=99%): /dev/shm 3978 MB (99% inode=99%): /run 795 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /tmp 3978 MB (100% inode=99%): /run/credentials 795 MB (99% inode=99%): /var/tmp 561 MB (5% inode=87%):
```

Assignee: anarcat

install newer obfs4proxy on polyanthum
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40773
Updated: 2022-06-02T14:33:08Z
Author: Roger Dingledine

This is a similar ticket to https://gitlab.torproject.org/tpo/tpa/team/-/issues/40758
We currently have obfs4proxy 0.0.8 installed on bridges.tpo. And we use that obfs4proxy to test obfs4 reachability of all the bridges.
But because of https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/40804, we are testing with an old and buggy and only partly compatible obfs4!
The obfs4 version in Tor Browser is 0.0.12, which means Tor clients are getting the new better handshake.
We should move bridgestrap so it tests obfs4 bridges using the same handshake that Tor Browser users will attempt.
And the way we do that is by upgrading the obfs4proxy package.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/obfs4/-/issues/33736#note_2786764 says that as of some months ago, obfs4proxy 0.0.13 is in bullseye-backports.
Does that mean we just add a line to the puppet stanza and we're there? :)
[Cc'ing @meskio so he knows about the ticket]

Assignee: anarcat

postgresql DB with timescale plugin installed
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40770
Updated: 2022-06-13T14:10:48Z
Author: Hiro

I'd like to start testing storing all metrics data into a DB as described in:
https://gitlab.torproject.org/tpo/network-health/team/-/wikis/metrics/collector/pipeline
For the time being I'd just need a postgresql instance with the timescaledb plugin installed, one that I could send data to over TLS.
In this first step my plan is just to start storing data into tables and eventually have this as a task in collector.

Assignee: anarcat

onionoo down, serving malformed JSON
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40764
Updated: 2022-07-26T15:41:40Z
Author: Jérôme Charaoui (lavamind@torproject.org)

About an hour after the bullseye upgrade of `onionoo-backend-01`, onionoo started failing Nagios checks:
```
# /usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080
CRITICAL: Error parsing JSON format: Expecting value: line 6 column 20 (char 98) {"version":"%s",
"build_revision":"%s",
```
Indeed, it seems to be serving malformed JSON:
```
# curl 127.0.0.1:6081/summary?limit=0
{"version":"%s",
"build_revision":"%s",
"relays_published":"%s",
"relays":[
],
"relays_truncated":%d,
"bridges_published":"2022-05-18 14:44:41",
"bridges":[
],
"bridges_truncated":%d}
```
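
The position reported by the Nagios check can be reproduced with a short sketch, assuming the response body is exactly the ten lines shown in the curl output, with no indentation. The quoted `"%s"` placeholders are still valid JSON strings, so the first thing a parser rejects is the bare `%d` after `"relays_truncated":`. The helper name `first_json_error` is made up for this illustration.

```python
import json

# Response body copied from the curl output, one list element per line.
payload = "\n".join([
    '{"version":"%s",',
    '"build_revision":"%s",',
    '"relays_published":"%s",',
    '"relays":[',
    '],',
    '"relays_truncated":%d,',
    '"bridges_published":"2022-05-18 14:44:41",',
    '"bridges":[',
    '],',
    '"bridges_truncated":%d}',
])


def first_json_error(text):
    """Return (line, column, char offset) of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.pos
    return None


print(first_json_error(payload))
```

This prints `(6, 20, 98)`, the same line, column and character offset as in the check output, which suggests the backend is emitting an unfilled format template and not a truncated or corrupted document.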
Whether the upgrade is what caused this incident is unclear at this point, because `onionoo-backend-01` was confirmed working immediately after the upgrade, and because the problem started at approximately the same time for it and `onionoo-backend-02`, which has *not* been upgraded.

Assignee: Jérôme Charaoui (lavamind@torproject.org)

materculae hits the OOM killer since bullseye upgrade
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40750
Updated: 2022-06-27T15:00:11Z
Author: anarcat

last night (from my perspective), PostgreSQL crashed on materculae. in systemd's logs, we see:
```
May 05 05:25:33 materculae systemd[1]: postgresql@13-main.service: A process of this unit has been killed by the OOM killer.
```
then a bunch of errors happened in the postgresql log:
```
2022-05-05 05:25:33 GMT LOG: server process (PID 16279) was terminated by signal 9: Killed
2022-05-05 05:25:33 GMT DETAIL: Failed process was running: select * from search_by_date_address24($1, $2) as result
2022-05-05 05:25:33 GMT LOG: terminating any other active server processes
2022-05-05 05:25:33 GMT WARNING: terminating connection because of crash of another server process
2022-05-05 05:25:33 GMT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
2022-05-05 05:25:33 GMT HINT: In a moment you should be able to reconnect to the database and repeat your command.
```
it's unclear why this is happening, but it's clearly a regression from the upgrade. here's a memory graph from the last 3 days:
![image](/uploads/839a371de3d308807d2f3394d39977c7/image.png)
https://grafana.torproject.org/d/xfpJB9FGz/1-node-exporter-for-prometheus-dashboard-en-v20201010?orgId=1&var-origin_prometheus=&var-job=node&var-hostname=All&var-node=materculae.torproject.org:9100&var-device=All&var-interval=2m&var-maxmount=%2Fhome&var-show_hostname=materculae&var-total=93&viewPanel=156&from=now-3d&to=now&refresh=1m
i *think* the upgrade completed at about 15:26 UTC yesterday, at least according to the graph. (this comment is later, but that's probably just me reporting after the fact: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40692#note_2799945).
then we can see the server restarting (the blank), and slowly reclaiming memory. then there's this unusual jump at 22:18 and things go a little out of whack for a few hours, but seem to stabilise at a somewhat reasonable pattern at 11:00 next day. that's about 1GB more memory usage than the previous normal though, so that's already a little worrisome.
but then, at 22:46 UTC, memory usage just starts to grow linearly, eventually hitting the above OOM at around 5:30 or so.
it seems we don't have prometheus instrumentation for postgresql on that host at all right now, so i guess that would be one next step.
/cc @hiro

Milestone: Debian 11 bullseye upgrade
Assignee: anarcat

colchicifolium disk full
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40535
Updated: 2023-06-07T15:45:23Z
Author: anarcat

colchicifolium's disk usage is rising steadily, this is the last year:
![image](/uploads/e781feb8a476adcb640ab6a275d25e6b/image.png)
we can see when we added 50G then 200G more.
@hiro is thinking about redesigning this service, but in the meantime, let's give this poor server a break.

Assignee: anarcat

estimate storage requirements for metricsdb and backups
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41452
Updated: 2024-01-31T19:35:07Z
Author: anarcat

in #41424, we have agreed to continue with the monolithic postgresql design for the time being, more or less -- collector will move to object storage and there's a possibility of introducing other optimizations (https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416#note_2978071) -- but for now that's the plan.
We'll need to scale up storage for metricsdb.
Right now, the storage usage is as follows:
| machine        | used    | size     |
|----------------|---------|----------|
| metricsdb-01   | 1.07TiB | 7.88TiB  |
| bungei pg      | 1.36TiB | 2.96TiB  |
| **total**      | 2.43TiB | 10.84TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=bungei.torproject.org&from=now-90d&to=now&refresh=5s&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fpg
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=metricsdb-01.torproject.org&from=now-1y&to=now&refresh=5s
The specification is that we need weekly backups of the postgresql database (*not* WAL logs) except for a subset of tables that need hourly or better backups (ideally WAL).
The estimate is that the database size *at launch* will be around 5TiB, with growth of about 500GiB per year.
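
A rough sketch of what those two numbers imply, assuming growth stays linear and counting only the live database (no backups, WAL, or temporary space). The function names are made up for illustration, and the 7.88TiB figure is the current metricsdb-01 size from the table above.

```python
GIB_PER_TIB = 1024


def projected_size_tib(years, launch_tib=5.0, growth_gib_per_year=500.0):
    """Database size in TiB after `years`, assuming linear growth from launch."""
    return launch_tib + years * growth_gib_per_year / GIB_PER_TIB


def years_until_capacity(capacity_tib, launch_tib=5.0, growth_gib_per_year=500.0):
    """Years after launch before the database alone fills `capacity_tib`."""
    return (capacity_tib - launch_tib) * GIB_PER_TIB / growth_gib_per_year


for years in (1, 3, 5):
    print(f"year {years}: {projected_size_tib(years):.2f} TiB")

# 7.88 TiB is the current size of the metricsdb-01 volume.
print(f"7.88 TiB would last about {years_until_capacity(7.88):.1f} years")
```

Under these assumptions the existing volume would hold the database for a little under six years, so the pressing sizing question is the backup storage.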
This could involve building a new storage server to handle those backups (#41364) and we feel it would be a good idea to start working with [Barman](https://pgbarman.org/) for this system.
The output of this issue is an estimate for hardware needs, a rough architectural draft, and subsequent tickets to make necessary changes to reflect said architecture.
/cc @lavamind

Assignee: anarcat

Discuss possible issue with storage for metrics services
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416
Updated: 2023-12-21T15:08:08Z
Author: Hiro

There have been various discussions about what is the best long term strategy to scale metrics services.
The long-term plan, at the time of writing, is to concentrate all our storage on metricsdb, which represents the pipeline v2.0 with postgresql and Victoria Metrics.
Some issues have been discussed regarding the long term growth rate of this [setup](https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/40023#note_2968760).
I understand tpa can now offer object storage, but we are now more than 1 year into developing the new pipeline and there are many issues that we should consider on the metrics side. This ticket, though, is not meant to discuss those issues so much as to understand what tpa can support in the long run before making a development plan from the network health perspective.

Assignee: anarcat