The Tor Project issues
https://gitlab.torproject.org/groups/tpo/-/issues
2024-02-01T05:18:15Z

https://gitlab.torproject.org/tpo/onion-services/onionspray/-/issues/35
MetricsPort support
2024-02-01T05:18:15Z
Silvio Rhatto

# Tasks
* [x] Add `MetricsPort` and `MetricsPortPolicy` support.
* [x] Document how to monitor Onion Services.
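
As a rough illustration of the first task, the sketch below renders the two plain `torrc` options named above. The address, port and allowed client are made-up example values, and the helper `render_metrics_torrc` is hypothetical; how Onionspray itself exposes these settings in its own configuration is not shown here.

```python
def render_metrics_torrc(address: str, port: int, allowed: list[str]) -> str:
    """Render torrc lines that enable MetricsPort for a list of scraper addresses."""
    lines = [f"MetricsPort {address}:{port}"]
    # One "accept" policy entry per scraper address (example values only).
    lines += [f"MetricsPortPolicy accept {client}" for client in allowed]
    return "\n".join(lines)


print(render_metrics_torrc("127.0.0.1", 9035, ["127.0.0.1"]))
```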
# Time estimation
* Complexity: very small (0.5 day)
* Uncertainty: low (x1.1)
* [Reference](https://jacobian.org/2021/may/25/my-estimation-technique/) (adapted)

Onionspray 1.6.0
Silvio Rhatto
2024-01-31

https://gitlab.torproject.org/tpo/network-health/team/-/issues/250
Capture telemetry about bootstrapping times by PT configuration in censored regions
2022-12-15T11:42:23Z
donuts

As part of the [Sponsor 96 project](https://gitlab.torproject.org/groups/tpo/-/milestones/24) we've implemented a new feature in Tor Browser called Connection Assist (historically referred to as [mostly] automatic censorship detection), which gives users the option of trying a second bootstrap after the first fails due to censorship of the Tor Network. During the second bootstrap, Tor Browser looks up the user's location via a new moat API, which returns a short shopping list of bridge configurations to try in order (see [circumvention.json](https://gitlab.torproject.org/tpo/anti-censorship/rdsys-admin/-/blob/main/conf/circumvention.json)) that should circumvent Tor Network blocking in their country.
In addition to Tor Browser, OnionShare will also implement the censorship circumvention API – and other Tor-powered apps will likely follow suit in future too.
However, bootstrapping times in the target regions for S96 (specifically China & Tibet, rather than Hong Kong) remain a source of concern. Long bootstrapping times create uncertainty over whether Tor is actually connecting or is stuck in a state of infinite bootstrapping (which we've observed too, see: https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/40970).
We're currently considering a number of workarounds to help alleviate these issues, including (for example):
- Displaying contextual hints about bootstrapping times by region and PT to help set user expectations
- Providing encouragement when Tor has been stuck at the same bootstrapping step for X amount of time
- Introducing timeouts which display non-blocking errors, the duration of which will need to be set per-region (thus providing a means to escape from the dreaded infinite bootstrap issue)
Given the above, it would be useful to measure bootstrapping times by PT/bridge configuration in censored regions. OONI already includes this measurement in their Snowflake tests, [see this example](https://explorer.ooni.org/measurement/20220615T081636Z_torsf_CN_9808_n1_kW4lyakvsSN7XhIG) for instance.
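
To make the kind of measurement concrete, here is a minimal sketch that groups bootstrap times by country and transport and summarizes them. The sample values are invented for illustration and do not come from OONI or any real measurement; the names `SAMPLES` and `summarize` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical measurements: (country code, transport, seconds to bootstrap).
SAMPLES = [
    ("CN", "snowflake", 42.0),
    ("CN", "snowflake", 95.0),
    ("CN", "snowflake", 61.0),
    ("CN", "obfs4", 18.0),
    ("CN", "obfs4", 25.0),
]


def summarize(samples):
    """Group bootstrap times by (country, transport); report count, median and max."""
    groups = defaultdict(list)
    for country, transport, seconds in samples:
        groups[(country, transport)].append(seconds)
    return {
        key: {"count": len(times), "median": median(times), "max": max(times)}
        for key, times in groups.items()
    }


summary = summarize(SAMPLES)
for key, stats in sorted(summary.items()):
    print(key, stats)
```

A per-region timeout or hint could then be derived from such a summary, for example from the maximum or a high percentile, though which statistic is appropriate is an open design question.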
In addition, there may be an opportunity to improve how we collect data about working PT/bridge configurations in order to keep the circumvention.json up to date and as effective as possible.
Three options have been proposed so far:
1. Capturing telemetry about bootstrapping at the network level, i.e. on metrics.torproject.org
2. Adding additional tests to vantage points in the target regions
3. Measuring bootstrapping at the application level, e.g. by implementing cleaninsights.org in Tor Browser, OnionShare etc.

Sponsor 96: Rapid Expansion of Access to the Uncensored Internet through Tor in China, Hong Kong, & Tibet

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41526
Deploy onionperf files parser on metricsdb-01
2024-03-07T14:23:37Z
Hiro

We need to deploy https://gitlab.torproject.org/tpo/network-health/metrics/tor_fusion/ on metricsdb-01.
Basically this thing will run, download onionperf files from collector and parse them. This will happen just once a day, around 1am UTC, since midnight is when collector fetches the archives from the various onionperf clients.
It's a little Rust app and I was thinking of creating a group and user, like for the metrics-api. But maybe that's a bit overkill and I should just put it in the parser space?

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41515
meronense OOM
2024-02-05T19:52:19Z
anarcat

today, metrics.tpo went down because the OOM killer was invoked. not sure what happened. i restarted both metrics-r and metrics-web.service, pending further investigation.
this happened before, of course. we bumped the memory on that box to 20GB in #41335 and had issues after the bullseye upgrade as well (#40814), both incidents should be investigated. those are just the incidents that pop up in the gitlab "Similar issues", further investigation in other issues probably warranted.
possibly related to the bookworm upgrade, of course (#41252).

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41424
Finding a reasonable backup strategy for metricsdb-01
2023-12-19T20:20:19Z
Hiro

Today with @anarcat we briefly discussed a change of retention policy for the metricsdb-01 postgresql database.
If I am not mistaken the policy changed from 30 days retention to 7 days. I think that is correct but we could even find a policy that is easier to maintain over time.
I think we could assume that whatever we have on metricsdb-01 could be recreated from archives that we store on collector. The only data that escape that rule would be the tags and notes that we intend to attach to relays with tagtor.
For those we could set up a timer that would dump the few tables (I think 4 in total) that store that data. The dumps wouldn't be big or slow to create, so that could be a solution.
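
A minimal sketch of what such a timer could run, assuming a plain `pg_dump` restricted to the relevant tables. The database name, table names and output path below are hypothetical placeholders, not the real tagtor schema.

```python
import shlex


def build_dump_command(dbname: str, tables: list[str], outfile: str) -> list[str]:
    """Build a pg_dump invocation that dumps only the listed tables."""
    cmd = ["pg_dump", "--dbname", dbname, "--format", "custom", "--file", outfile]
    for table in tables:
        # --table can be repeated to select several tables.
        cmd += ["--table", table]
    return cmd


# Hypothetical database and table names, for illustration only.
command = build_dump_command("metrics", ["relay_tags", "relay_notes"], "/tmp/tagtor.dump")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` from a script started by the timer.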
I am not sure whether there is anything else we could consider, but I am open to suggestions.
/cc @gk @micah

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41380
onionoo-backend-01 running filling up swap
2024-01-10T16:47:43Z
anarcat

![image](/uploads/d6877b98d7d21a676d788bb27f144e68/image.png)
https://grafana.torproject.org/d/amgrk2Qnk/memory-usage?orgId=1&var-class=All&var-node=onionoo-backend-01.torproject.org&var-node=onionoo-backend-02.torproject.org&from=now-1y&to=now
something on onionoo-backend-01 is eating up all swap. it seems to have stabilized now, but it tripped the critical warnings in nagios.
@hiro any idea what's going on here?

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41343
Onionoo backends out of disk space
2023-11-20T21:57:17Z
Hiro

It seems the onionoo backends have run out of disk space on /srv. Can we increase the space? I think adding at least 10 GB more to each host (ideally 20) would be ok.

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41307
Deploy network status api on metricsdb-01
2023-09-06T20:18:23Z
Hiro

We need to deploy the network status api onto metricsdb-01.
This is a web-based service that reads data out of the postgres db and Victoria Metrics (https://gitlab.torproject.org/tpo/network-health/metrics/networkstatusapi/)
I am going to put it behind apache and protect it with http auth, as this is not a public service yet.
/cc @gk

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41258
materculae out of disk space
2023-09-21T01:51:41Z
Kez

previous ticket: #40826
it's been a year, and nagios is complaining about materculae's /srv partition
```
# df -h /srv
Filesystem                     Size  Used Avail Use% Mounted on
/dev/mapper/vg_materculae-srv  147G  135G  4.3G  97% /srv
```
in the previous ticket (#40826) @anarcat changed the warning threshold, which is why this warning popped up now.
according to grafana, the usage has only grown by about 15G in the past year, and the growth is linear. we could add another 20G and revisit in a year, or throw 40G or 60G at it to push things further down the road.
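
as a back-of-the-envelope check of those options, here is a small sketch that assumes growth stays linear at about 15G per year and starts from the 4.3G currently available. the helper name `years_until_full` is made up for this example.

```python
def years_until_full(free_gib: float, added_gib: float, growth_gib_per_year: float) -> float:
    """Years before the partition fills up, assuming linear growth."""
    return (free_gib + added_gib) / growth_gib_per_year


# Figures from the ticket: 4.3G available, roughly 15G of growth per year.
for added in (20, 40, 60):
    print(f"+{added}G: {years_until_full(4.3, added, 15):.1f} years")
```

under these assumptions, 20G lasts roughly a year and a half, which fits the "revisit in a year" option.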
![image](/uploads/e8ddf8b69703273f73d891586f7fc137/image.png)

anarcat
2023-09-22

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41167
connect to postgresql db on new metrics DB via tls
2023-06-27T15:13:20Z
Hiro

Would it be possible to get a read-only user to connect to the postgresql db on metrics-psqlts-01 via tls?
This would be used to access it via grafana, but also allow metrics developers to query the data.
The people who would possibly access this are:
@hiro
@gk
@mattrighetti

Jérôme Charaoui
lavamind@torproject.org

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41161
rebuild corsicum into collector-02.torproject.org
2023-05-23T16:24:44Z
anarcat

we need to migrate out of the old Sunet cluster into the new Safespring cluster, corsicum needs to be retired and rebuilt into collector-02.
see also #40684.

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41130
Deploy new metrics database stack
2023-07-07T08:00:20Z
Hiro

I have been testing our victoriametrics + postgresql setup on metrics-psqlts-01 for a while, and now that we are close to having a prod deployment of this pipeline I'd like to have things properly in puppet.
I have a branch called metrics-deploy with a tentative setup that I'd like to have your opinion on.
This branch also has support for deploying a python web app to access and query both the postgresql db and victoria metrics.
Victoria metrics runs with docker, but without compose. I am not sure whether you'd prefer a compose setup, since this is a single service.
An alternative would be to run the full stack with compose. Would postgresql backups work in that case?
I am going to be out next week. So maybe we could discuss this in Costa Rica face to face?

Hiro

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41114
Disk space increase on metrics-psqlts-01
2023-04-04T00:02:14Z
Hiro

Can we add 20 more gigabytes on metrics-psqlts-01?

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41026
data update service and timer on meronense
2023-01-10T16:41:24Z
Hiro

I would need some help figuring out why the update service on meronense doesn't wait for the previous run to finish before starting a new one.
The timer and service are in puppet and they only start this script: https://gitlab.torproject.org/tpo/network-health/metrics/metrics-bin/-/blob/main/website/run-web.sh
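
One generic way to keep two runs from overlapping, independent of whatever the actual cause turns out to be, is to have the job take an exclusive lock and exit early if it is already held. The sketch below uses `fcntl.flock` from the Python standard library on Linux. The lock path and the helper `try_lock` are hypothetical, and this is not a description of how the service or `run-web.sh` currently works.

```python
import fcntl
import os
import tempfile

# Hypothetical lock file location, for illustration only.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "metrics-update-example.lock")


def try_lock(path: str):
    """Return an open, exclusively locked file, or None if another run holds the lock."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


lock = try_lock(LOCK_PATH)
if lock is None:
    print("previous run still in progress, skipping")
else:
    print("lock acquired, running update")
    lock.close()  # closing the file releases the lock
```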
/cc @gk

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40910
CRITICAL disk usage on metrics-psqlts-01
2022-10-04T14:03:07Z
Kez

Icinga is reporting a critical issue for disk usage - all since 2022-09-29 22:57:01
```
DISK CRITICAL - free space: / 561 MB (5% inode=87%): /dev 3962 MB (100% inode=99%): /dev/shm 3978 MB (99% inode=99%): /run 795 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /tmp 3978 MB (100% inode=99%): /run/credentials 795 MB (99% inode=99%): /var/tmp 561 MB (5% inode=87%):
```

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40773
install newer obfs4proxy on polyanthum
2022-06-02T14:33:08Z
Roger Dingledine

This is a similar ticket to https://gitlab.torproject.org/tpo/tpa/team/-/issues/40758
We currently have obfs4proxy 0.0.8 installed on bridges.tpo. And we use that obfs4proxy to test obfs4 reachability of all the bridges.
But because of https://gitlab.torproject.org/tpo/applications/tor-browser/-/issues/40804, we are testing with an old and buggy and only partly compatible obfs4!
The obfs4 version in Tor Browser is 0.0.12, which means Tor clients are getting the new better handshake.
We should move bridgestrap so it tests obfs4 bridges using the same handshake that Tor Browser users will attempt.
And the way we do that is by upgrading the obfs4proxy package.
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/obfs4/-/issues/33736#note_2786764 says that as of some months ago, obfs4proxy 0.0.13 is in bullseye-backports.
Does that mean we just add a line to the puppet stanza and we're there? :)
[Cc'ing @meskio so he knows about the ticket]

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40770
postgresql DB with timescale plugin installed
2022-06-13T14:10:48Z
Hiro

I'd like to start testing storing all metrics data into a DB as described in:
https://gitlab.torproject.org/tpo/network-health/team/-/wikis/metrics/collector/pipeline
For the time being I'd just need a postgresql instance with the timescaledb plugin installed, that I could send data to over TLS.
In this first step my plan is just to start storing data into tables and eventually have this as a task in collector.

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40764
onionoo down, serving malformed JSON
2022-07-26T15:41:40Z
Jérôme Charaoui
lavamind@torproject.org

About an hour after the bullseye upgrade of `onionoo-backend-01`, onionoo started failing Nagios checks:
```
# /usr/lib/nagios/plugins/tor-check-onionoo 127.0.0.1:8080
CRITICAL: Error parsing JSON format: Expecting value: line 6 column 20 (char 98) {"version":"%s",
"build_revision":"%s",
```
Indeed, it seems to be serving malformed JSON:
```
# curl 127.0.0.1:6081/summary?limit=0
{"version":"%s",
"build_revision":"%s",
"relays_published":"%s",
"relays":[
],
"relays_truncated":%d,
"bridges_published":"2022-05-18 14:44:41",
"bridges":[
],
"bridges_truncated":%d}
```
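
The parse failure can be reproduced offline with Python's standard `json` module, which produces the same style of message as the check output above. In this sketch the body is retyped from the `curl` output, joined with single newlines; the helper `first_json_error` is a name made up for this example.

```python
import json

# The response body from the ticket, with the unexpanded format placeholders.
BODY = "\n".join([
    '{"version":"%s",',
    '"build_revision":"%s",',
    '"relays_published":"%s",',
    '"relays":[',
    '],',
    '"relays_truncated":%d,',
    '"bridges_published":"2022-05-18 14:44:41",',
    '"bridges":[',
    '],',
    '"bridges_truncated":%d}',
])


def first_json_error(text: str):
    """Return (line, column, offset) of the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.pos
    return None


print(first_json_error(BODY))
```

The quoted `"%s"` placeholders are still valid JSON strings, so the parser only stops at the first bare `%d`, on line 6, column 20 (character 98), the same position the Nagios check reports.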
Whether the upgrade is what caused this incident is unclear at this point, because `onionoo-backend-01` was confirmed working immediately after the upgrade, and because the problem started at approximately the same time for it and `onionoo-backend-02`, which has *not* been upgraded.

Jérôme Charaoui
lavamind@torproject.org

https://gitlab.torproject.org/tpo/tpa/team/-/issues/40535
colchicifolium disk full
2023-06-07T15:45:23Z
anarcat

colchicifolium's disk usage is rising steadily, this is the last year:
![image](/uploads/e781feb8a476adcb640ab6a275d25e6b/image.png)
we can see when we added 50G then 200G more.
@hiro is thinking about redesigning this service, but in the meantime, let's give this poor server a break.

anarcat

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41452
estimate storage requirements for metricsdb and backups
2024-03-24T14:01:45Z
anarcat

in #41424, we have agreed to continue with the monolithic postgresql design for the time being, more or less -- collector will move to object storage and there's a possibility of introducing other optimizations (https://gitlab.torproject.org/tpo/tpa/team/-/issues/41416#note_2978071) -- but for now that's the plan.
We'll need to scale up storage for metricsdb.
Right now, the storage usage is as follows:
| machine | used | size |
|----------------|---------|----------|
| metricsdb-01 | 1.07TiB | 7.88TiB |
| bungei pg | 1.36TiB | 2.96TiB |
| **total**      | 2.43TiB | 10.84TiB |
Source:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=bungei.torproject.org&from=now-90d&to=now&refresh=5s&var-Filters=mountpoint%7C%3D%7C%2Fsrv%2Fbackups%2Fpg
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-class=All&var-instance=metricsdb-01.torproject.org&from=now-1y&to=now&refresh=5s
The specification is that we need weekly backups of the postgresql database (*not* WAL logs) except for a subset of tables that need hourly or better backups (ideally WAL).
The estimate is that the database size *at launch* will be around 5TiB, growing by about 500GiB per year.
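
A rough projection from those two figures, assuming growth stays linear. The number of full backup copies kept is an assumption made for this sketch and is not specified in this issue; compression and WAL volume are ignored.

```python
GIB_PER_TIB = 1024


def projected_size_tib(years: float, launch_tib: float = 5.0,
                       growth_gib_per_year: float = 500.0) -> float:
    """Database size after `years`, assuming the linear growth estimated above."""
    return launch_tib + years * growth_gib_per_year / GIB_PER_TIB


def backup_space_tib(years: float, full_copies: int) -> float:
    """Space for `full_copies` uncompressed full backups (the copy count is an assumption)."""
    return full_copies * projected_size_tib(years)


for years in (0, 1, 3, 5):
    print(f"year {years}: db {projected_size_tib(years):.2f}TiB, "
          f"2 full copies {backup_space_tib(years, 2):.2f}TiB")
```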
This could involve building a new storage server to handle those backups (#41364) and we feel it would be a good idea to start working with [Barman](https://pgbarman.org/) for this system.
The output of this issue is an estimate for hardware needs, a rough architectural draft, and subsequent tickets to make necessary changes to reflect said architecture.
/cc @lavamind

anarcat