# CollecTor issues

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues
Feed updated: 2021-10-05T14:44:34Z

## Issue #33502: Do not let appended descriptor files grow too large

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/33502
Karsten Loesing, last updated 2021-10-05T14:44:34Z

I revisited legacy/trac#20395 last week. The issue is that metrics-lib cannot handle large descriptor files, because it first reads the entire file into memory before splitting it into single descriptors and parsing them. While it would be possible to parse large descriptor files after making some major code changes (using `FileChannel` and doing lazy parsing), I don't think that we have to do that. After all, we're writing these large descriptor files ourselves in CollecTor, and it's up to us to stop doing that.
Going back in time, the original reason for concatenating multiple descriptors into a single file was that rsyncing many tiny files from one host to another host was just slow. So we appended server descriptors and extra-info descriptors into a single file. This works well with server descriptors or extra-info descriptors published within 1 hour or even 10 hours. It does not work that well anymore with all server descriptors or extra-info descriptors synced from another CollecTor instance when starting a new instance (legacy/trac#20335). It works even less well when importing one or more monthly tarballs containing server descriptors or extra-info descriptors (legacy/trac#27716).
My suggestion is that we define a configurable limit for appended descriptor files of, say, 20 MiB. When storing a descriptor, we would check whether appending it to an existing descriptor file would exceed this limit and, if so, start a new descriptor file.
There are some technical details to work out, but I think they can be solved. I also don't expect this to require a lot of code, or even complex code changes. The benefit would be that we could resolve legacy/trac#20395 and legacy/trac#27716 by implementing this.
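The append-with-limit check could look roughly like this. This is a minimal sketch with an invented `appendWithLimit` helper and rollover naming scheme; CollecTor's real storage code is organized differently:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch: append a descriptor to a file, rolling over to a fresh file with
 *  a numeric suffix once a size limit would be exceeded. Names and the
 *  rollover scheme are hypothetical, not actual CollecTor code. */
public class AppendLimit {

    /** Returns the path actually written to: the given file if the
     *  descriptor still fits, otherwise the first rollover file with room.
     *  (A single descriptor larger than the limit still gets its own file.) */
    static Path appendWithLimit(Path file, byte[] descriptor, long limit)
            throws IOException {
        Path target = file;
        long current = Files.exists(target) ? Files.size(target) : 0L;
        if (current > 0 && current + descriptor.length > limit) {
            int suffix = 1;
            do {  // find the first rollover file that is absent or has room
                target = file.resolveSibling(file.getFileName() + "." + suffix++);
            } while (Files.exists(target)
                    && Files.size(target) + descriptor.length > limit);
        }
        Files.write(target, descriptor,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return target;
    }
}
```

The limit would come from the module configuration rather than being passed per call; it is a parameter here only to keep the sketch self-contained.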
Thoughts on the general idea?

## Issue #31695: Allow pushing Metrics to CollecTor from trusted endpoints

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/31695
irl, last updated 2020-12-01T10:53:51Z

Switch from pull to push model for archiving OnionPerf data: another aspect related to collecting data is that, right now, data collection works by periodically pulling new .tpf files from known OnionPerf instances. This has at least two problems: there's a delay between OnionPerfs producing new files and CollecTor pulling them, and adding new instances requires editing a config file on the CollecTor host. Maybe we can switch to a push model where CollecTor accepts measurements from any OnionPerf instance, and CollecTor clients like the Tor Metrics website decide which measurements to aggregate and visualize. Note that switching to a push model requires installing some basic authentication mechanisms like cryptographic identities and signatures, in order to prevent anyone from pushing wrong data, overwriting correct data, or even storing arbitrary data.

## Issue #28324: Extend CollecTor to fetch recent, non-current consensuses and votes

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/28324
Karsten Loesing, last updated 2023-01-23T14:54:17Z

There are discussions to extend dir-spec to serve recent, non-current consensuses and votes (legacy/trac#21378).
As of now, only the most recent, current consensus and votes are available, as well as the next ones, 5-10 minutes before they become valid.
This extension would be fantastic news, because we currently rely on CollecTor to run once per hour. If it doesn't, we'd be missing the consensus and votes from that hour. We can compensate for temporary failures to some extent by having two CollecTor instances running and synchronizing missing descriptors. But ideally, we'd be able to fetch previous consensuses and votes from the Tor directories.
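If directories served past consensuses, recovering from a missed run would reduce to computing which hourly valid-after times are absent and fetching those. A rough sketch of that computation (the helper name is invented; this is not CollecTor code):

```java
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

/** Sketch: consensuses have one valid-after time per full hour. After an
 *  outage, we would fetch every hour strictly after the last archived
 *  consensus up to the current hour. */
public class MissingConsensuses {

    static List<ZonedDateTime> missingValidAfterTimes(
            ZonedDateTime lastArchived, ZonedDateTime now) {
        List<ZonedDateTime> missing = new ArrayList<>();
        // First candidate: the hour after the last archived valid-after time.
        ZonedDateTime t = lastArchived.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        ZonedDateTime latest = now.truncatedTo(ChronoUnit.HOURS);
        while (!t.isAfter(latest)) {
            missing.add(t);
            t = t.plusHours(1);
        }
        return missing;
    }
}
```

Each returned valid-after time would then be turned into a directory request once the dir-spec extension defines URLs for past consensuses.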
This is currently blocking on legacy/trac#21378. But as soon as that ticket is resolved, we can start extending CollecTor to fetch recent, non-current consensuses and votes.

## Issue #26089: collect and archive DNS resolver data of tor exits

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/26089
cypherpunks, last updated 2023-01-23T13:58:30Z

context:
https://medium.com/@nusenu/who-controls-tors-dns-traffic-a74a7632e8ca
"
5. Add DNS related information to Relay Search (a long term item)
It would be nice and probably effective to have information about DNS resolvers show up on Relay Search, because it is a popular tool for relay operators to check on their relay state. Operators could easily see if they use any less desirable DNS resolvers if that information is shown on Relay Search. That way we could even reach operators who have no or invalid ContactInfo data, but multiple steps are required before this could happen:
The currently unavailable data needs to be
collected and regularly updated
"

## Issue #24431: Provide fallback mirror lists

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/24431
iwakeh, last updated 2021-10-05T14:38:35Z

Retrieve and provide fallback mirror lists (for details see parent).
Steps:
* determine an automated way to retrieve the latest list in a timely manner, i.e., as soon as it is used.
* create a CollecTor module for retrieving and storing the lists (a parser will be provided by metrics-lib).

## Issue #21515: Add auxiliary data on Tor relays and bridges to CollecTor

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/21515
Karsten Loesing, last updated 2020-12-02T09:38:33Z

This ticket is the result of a local TODO list review and combines a few related ideas. Some of the ideas here are new, but some are really old and have been sitting on my list forever.
The general idea here is that CollecTor could provide auxiliary data on Tor relays and bridges. The main goal would be that other applications like Onionoo and Metrics but also Nyx can use this data to provide richer information on relays and bridges to their users. A secondary goal would be that CollecTor would serve as an archive for this data for future applications that don't exist yet.
Auxiliary data might include:
1. GeoIP country database: This is the same data as the Tor daemon uses internally to resolve relay IP addresses to country codes. We would be able to produce historical data by extracting `src/config/geoip` files from the Tor daemon Git repository. This data could be used by Metrics to bring back the relays by country graph.
2. GeoIP city database: This data would be the same as Onionoo uses to resolve relay IP addresses to city names. The main advantage of having this file in CollecTor would be that Onionoo could automatically pull this data instead of relying on the operator to update GeoIP files.
3. GeoIP ASN database: This is similar to item 2, but for ASN information.
4. Bridge GeoIP country database: Here's an idea to provide country information for bridges despite replacing IP addresses with hashes. CollecTor could keep a list of all bridge IP addresses in a given month and use the GeoIP country database from item 1 to produce a custom database for resolving bridge IP addresses to country codes. Basically, that database would contain hashed fingerprints, 10.x.y.z IP addresses, and country codes. CollecTor would add a new line to this file whenever it observes a new bridge IP address, which would happen once per hour, in particular at the beginning of a month. This file would change once per month when hashes for 10.x.y.z addresses change. However, this means that we'd have to reprocess the entire bridge tarball archive to generate older database files, because we have long deleted the inputs for generating those old 10.x.y.z IP addresses. Consumers of this data would be Onionoo but also Metrics, for a new bridge country graph.
5. Relay reverse DNS entries: Right now, Onionoo runs its own rDNS resolver. But we could as well run that as part of CollecTor and provide the output data in a new data format to everyone who needs it. There would also be other consumers of this data, including the relay controller Nyx, which could display rDNS entries without risking leaking who is fetching that information.
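The custom bridge database from item 4 could be modeled minimally like this. The comma-separated line format and class names are invented here for illustration; nothing in this sketch is an existing CollecTor format:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the proposed bridge GeoIP database from item 4: one line per
 *  observed bridge address, mapping a hashed fingerprint and sanitized
 *  10.x.y.z address to a country code. Line format is hypothetical. */
public class BridgeCountryDatabase {

    private final Map<String, String> countryByBridgeAddress = new HashMap<>();

    /** Records an observation; returns the line that would be appended to
     *  the database file. */
    public String observe(String hashedFingerprint, String sanitizedAddress,
            String countryCode) {
        countryByBridgeAddress.put(hashedFingerprint + "," + sanitizedAddress,
                countryCode);
        return hashedFingerprint + "," + sanitizedAddress + "," + countryCode;
    }

    /** Resolves a sanitized bridge address back to a country code, as
     *  Onionoo or Metrics would when processing sanitized descriptors. */
    public String lookup(String hashedFingerprint, String sanitizedAddress) {
        return countryByBridgeAddress.get(
                hashedFingerprint + "," + sanitizedAddress);
    }
}
```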
This is a lot, but maybe there's even more. It's probably useful to discuss these different new data sets together. Once we decide we want to provide some or even all of them, we should switch to child tickets. And just to set expectations right, it's probably going to take months to find enough time to implement these new data sets, if we think it's a good idea.

## Issue #20350: Replace create-tarball.sh shell script with Java module

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/20350
iwakeh, last updated 2023-01-23T14:51:08Z

This [script](https://gitweb.torproject.org/collector.git/tree/src/main/resources/create-tarballs.sh) should be transferred to Java.
The new `createtars` module should:
* provide at least the functionality of the script
* be configurable like other CollecTor modules
* not impede other modules
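Configuration could then follow the pattern of the existing per-module options in `collector.properties`. A hypothetical fragment — every key name below is invented for illustration; check the shipped `collector.properties` for the real option names:

```properties
# Hypothetical createtars module settings, mirroring the style of the
# other per-module options (key names are made up).
CreatetarsActivated = true
# Run once per day, offset from the descriptor-collecting modules.
CreatetarsOffsetMinutes = 300
CreatetarsPeriodMinutes = 1440
# Where finished tarballs would be placed.
CreatetarsOutputPath = /srv/collector/archive
```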
Please collect, in the comments below, more features and functionality that the script can't or doesn't provide but which should be part of this module.

Milestone: CollecTor 2.0.0

## Issue #20236: Make changes to bridgedescs module for bulk-processing tarballs

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/20236
Karsten Loesing, last updated 2020-12-01T15:08:07Z

I recently finished re-processing the entire bridge descriptor archive for legacy/trac#19317. However, I had to make some changes to avoid running out of memory or wasting time on unnecessary operations. I now went through the changes and cleaned them up a bit, because I'd like to merge some/most/all (?) of them for the next time we need to bulk-process the bridge descriptor archive. I'll post a branch once I have a ticket number.
We should discuss which of these commits should go in by default (maybe ed48f03, ae5c53c, and e514d30?), which should only be enabled in a special bulk-processing mode (maybe df96751, 27cbfc8, and 68b29c2?), which should have their own config option (ugh!), and which we drop because we don't need them as badly for processing descriptors in bulk.
Clearly, these commits need work, but I figured it's better to clean them up a bit now than attempt to do that in four or eight weeks. Branch follows in a minute.

## Issue #20098: Make reference checker more accurate

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/20098
Karsten Loesing, last updated 2023-01-23T14:51:05Z

As of February this year we're using a reference checker to spot missing descriptors; it reads files in `recent/relay-descriptors/` and warns if too many referenced descriptors cannot be found.
However, our reference checker has been too noisy for me to pay much attention.
I didn't look at the logs in detail yet, but I came up with a possible improvement: we should only count an extra-info descriptor as missing if the referencing server descriptor is referenced from a consensus or vote. This is supposed to exclude all extra-info descriptors that are referenced from server descriptors uploaded to the directory authorities by bogus relays without also uploading the corresponding extra-info descriptors.
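The proposed tweak amounts to filtering by consensus references before counting. A rough sketch with hypothetical inputs (real CollecTor works with hex digests from parsed descriptors; names here are invented):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: count an extra-info descriptor as missing only if the server
 *  descriptor referencing it is itself referenced from a consensus or
 *  vote. Method and parameter names are hypothetical. */
public class ReferenceChecker {

    /**
     * @param referencedFromConsensus server descriptor digests listed in
     *        consensuses or votes
     * @param extraInfoRefs map from server descriptor digest to the
     *        extra-info descriptor digest it references
     * @param availableExtraInfos extra-info digests present in the archive
     */
    static Set<String> missingExtraInfos(Set<String> referencedFromConsensus,
            Map<String, String> extraInfoRefs, Set<String> availableExtraInfos) {
        Set<String> missing = new HashSet<>();
        for (Map.Entry<String, String> e : extraInfoRefs.entrySet()) {
            // Ignore server descriptors never referenced from a consensus or
            // vote, e.g. uploads by bogus relays that omit extra-infos.
            if (referencedFromConsensus.contains(e.getKey())
                    && !availableExtraInfos.contains(e.getValue())) {
                missing.add(e.getValue());
            }
        }
        return missing;
    }
}
```

Only the result of this filtered count would feed into the "too many missing" warning threshold.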
Maybe there are other tweaks that would make these warnings more accurate and again worth checking by the operator.

## Issue #19834: Rethink how we handle issues while sanitizing bridge descriptors

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/19834
Karsten Loesing, last updated 2023-01-23T15:00:57Z

The bridge descriptor sanitizer parses tarballs containing non-sanitized bridge descriptors, modifies their content by removing bridge IP addresses and other sensitive parts, and writes sanitized versions of those bridge descriptors to disk.
The sanitizer needs to recognize the lines contained in bridge descriptors to distinguish between lines that must be changed and others that can be kept unchanged, and it needs to be able to understand the exact format of certain lines in order to sanitize their contents.
This process can go wrong in various ways, and we need to decide how to handle those situations. Possible situations are:
1. A tarball is malformed or can otherwise not be opened.
2. A tarball contains one or more files that cannot be opened.
3. A tarball file contains an unknown descriptor type.
4. An internal problem prohibits sanitizing descriptor parts (e.g., missing secret for sanitizing IP address).
5. A descriptor is missing parts that are required for properly sanitizing its contents.
6. A descriptor contains an unrecognized line.
7. A descriptor line doesn't follow the expected format, contains fewer or more arguments, etc.
Possible ways of handling such situations are:
A. Skip a line we don't understand and keep the rest of the descriptor.
B. Skip a descriptor.
C. Skip the file contained in the tarball and continue with the next.
D. Abort processing the tarball.
E. Skip the entire tarball, including discarding any descriptors processed before running into the problem, and attempt to process the tarball again in the next execution.
F. Abstain from processing a given descriptor type until a problem has been resolved.
G. Discard any descriptors processed in a tarball until running into the problem, abort the current execution, and refuse starting the next execution until the problem has been resolved.
H. (in addition to A-G) Inform the operator by logging the problem.
I. (in addition to A-G) Warn the operator and ask them to resolve the problem.
Looking at this list, I think that my preferred ways of handling problems would be something like:
- B+H in situations 5, 6, and 7;
- E+I in situations 1, 2, and 3; and
- G+I in situation 4.
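One way to make whichever policy we agree on explicit and reviewable in code would be a simple situation-to-handling table. A sketch encoding the preferences above (all enum and class names are invented, not existing CollecTor code):

```java
import java.util.EnumMap;
import java.util.Map;

/** Sketch: encode the preferred handling per problem situation as data,
 *  so the policy is easy to review and to change. Names are invented. */
public class SanitizerPolicy {

    enum Situation {
        MALFORMED_TARBALL,        // 1
        UNREADABLE_FILE,          // 2
        UNKNOWN_DESCRIPTOR_TYPE,  // 3
        INTERNAL_PROBLEM,         // 4, e.g. missing sanitization secret
        MISSING_REQUIRED_PARTS,   // 5
        UNRECOGNIZED_LINE,        // 6
        MALFORMED_LINE            // 7
    }

    enum Handling {
        SKIP_DESCRIPTOR_AND_LOG,          // B+H
        RETRY_TARBALL_NEXT_RUN_AND_WARN,  // E+I
        HALT_UNTIL_RESOLVED_AND_WARN      // G+I
    }

    static final Map<Situation, Handling> POLICY = new EnumMap<>(Situation.class);
    static {
        POLICY.put(Situation.MISSING_REQUIRED_PARTS, Handling.SKIP_DESCRIPTOR_AND_LOG);
        POLICY.put(Situation.UNRECOGNIZED_LINE, Handling.SKIP_DESCRIPTOR_AND_LOG);
        POLICY.put(Situation.MALFORMED_LINE, Handling.SKIP_DESCRIPTOR_AND_LOG);
        POLICY.put(Situation.MALFORMED_TARBALL, Handling.RETRY_TARBALL_NEXT_RUN_AND_WARN);
        POLICY.put(Situation.UNREADABLE_FILE, Handling.RETRY_TARBALL_NEXT_RUN_AND_WARN);
        POLICY.put(Situation.UNKNOWN_DESCRIPTOR_TYPE, Handling.RETRY_TARBALL_NEXT_RUN_AND_WARN);
        POLICY.put(Situation.INTERNAL_PROBLEM, Handling.HALT_UNTIL_RESOLVED_AND_WARN);
    }
}
```

Another operator with different preferences would then only need to edit this one table.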
That's not exactly what we're currently doing. And I'm not even sure if somebody else operating a CollecTor instance with the bridgedescs module would have the same preferences.
Let's discuss!

## Issue #18798: Analyze descriptor completeness

https://gitlab.torproject.org/tpo/network-health/metrics/collector/-/issues/18798
iwakeh, last updated 2022-02-28T14:57:22Z

I started a wiki page [here](https://trac.torproject.org/projects/tor/wiki/doc/CollecTor/AnalysisDescriptorCompleteness).

Update: This wiki page needs to be moved to our new wiki.