I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), operating system kernel, etc. All of these fields could be places with important information that helps us to interpret the results we see but every change to the headers means also a change to parsers.
There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.
For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.
For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.
Whatever we do here, we should come up with good reasons why we are doing it and then apply the decisions to both exit lists and bandwidth files if the good reasons are working for both.
I had a discussion with juga in IRC, here are some things we discussed:
The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121 (moved)
Inline metadata is better than out-of-band metadata because it is archived by Tor Metrics along with the bandwidth measurement data
The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.
In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.
We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.
We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)
We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.
I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), operating system kernel, etc. All of these fields could be places with important information that helps us to interpret the results we see
You're right: we need to define scope.
Here's the scope of the sbws data file:
questions we have asked bandwidth authority operators (or they have asked us)
data we wanted Torflow to provide, but it couldn't (or didn't)
questions that relay operators ask
factors that have a large impact on measurements
The scope you're suggesting is much larger, and potentially unlimited.
I suggest that we restrict ourselves to data that we need to know to run the network.
In this context:
we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184 (moved))
we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other
we have asked authority operators about operating systems before
but I have never wanted to know kernel versions
I have never even wondered about the versions of any other libraries
Here's another possible rule:
if it's in Tor relay descriptors, then it's useful for administering the network
tor version
operating system
but every change to the headers means also a change to parsers.
A well-written parser will present unknown headers (or all headers) in a flexible, dictionary-like structure.
There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.
Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.
For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.
Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.
If we need something, let's specify it as a separate header.
If we're not sure, then let's leave it out.
For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.
Let's not specify another unstructured contact string, please.
Instead, if we need to email someone, let's specify an operator email address.
Whatever we do here, we should come up with good reasons why we are doing it and then apply the decisions to both exit lists and bandwidth files if the good reasons are working for both.
The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121 (moved)
The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.
When we decided to include the country (#29299 (moved)) i remember arma commenting in irc to just add it in the configuration to don't make the code more complicated. I'm commenting more on this in (#30229 (moved))
In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.
Do we currently have more than one CDN?. I only know one, though would be great to have more.
We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.
We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)
We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.
Not convinced on any of the options for CDNs, but can't think of other solution.
we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184 (moved))
easy.
we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other
Is there a way to ask this to Tor? GETINFO only returns Tor version. Is it possible that python ssl.OPENSSL_VERSION might be different to the one being used by the running Tor?.
we have asked authority operators about operating systems before
For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.
Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.
If we need something, let's specify it as a separate header.
If we're not sure, then let's leave it out.
There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.
Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.
I think both the spec change and the sbws changes are good, but I have two questions:
What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?
The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?
I think both the spec change and the sbws changes are good, but I have two questions:
What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?
If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.
The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?
agree, do you have a suggestion on how to change that without having to increase minor version because changing API?
I think both the spec change and the sbws changes are good, but I have two questions:
What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?
If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.
The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?
agree, do you have a suggestion on how to change that without having to increase minor version because changing API?
The names of internal constants like HEADER_KEYS_V1_1_ORDERED aren't part of the API, so they don't need a version change.
What changes are you thinking about? How do they change the API?
I think both the spec change and the sbws changes are good, but I have two questions:
What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?
If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.
The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?
agree, do you have a suggestion on how to change that without having to increase minor version because changing API?
The names of internal constants like HEADER_KEYS_V1_1_ORDERED aren't part of the API, so they don't need a version change.
i know, that's why i did the changes
What changes are you thinking about? How do they change the API?
teor, i saw you did a review on my PR. I'm a bit confused cause you didn't change this ticket as needs revision nor add you as reviewer. I've made all (except 1) of the changes you suggested, plus others i realized about.
That's why i'm changing now this ticket to needs review again.
Note that i'm not squashing and force pushing, so the changes can be seen.
By experiences with other reviews that get a long list of fixups, conflicts with rebases increase or individual commits can start getting more confuse for myself.
Before a final merge (note for myself more than reviewer), the commit that is actually about the tor version, should be reordered to be the last one.
Sorry I forgot to put the ticket in needs_revision.
I didn't put myself as reviewer, because I didn't want this ticket to block on me,
I still think the variable naming is confusing: SELF_INITIALIZED, TO_INIT, and INT_KEYS are confusingly similar. And INITIALIZED and INIT are inconsistent.
I'll leave you to decide if you want to fix that now, or open another ticket to fix it later.
I've created #33572 (moved) to merge the patch in the spec, because this ticket component is not Tor.
I've created #33571 (moved) for the constants renaming.