Add the tor version to the sbws bandwidth file header

Trac:
Parent Ticket: #33121 (moved)

added component::core tor/sbws milestone::sbws: 1.2.x-final parent::33121 points::1 priority::high resolution::fixed reviewer::ahf sbws-roadmap severity::major status::closed type::enhancement labels

I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), operating system kernel, etc. All of these fields could be places with important information that helps us to interpret the results we see but every change to the headers means also a change to parsers.

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.

Whatever we do here, we should come up with good reasons why we are doing it and then apply the decisions to both exit lists and bandwidth files if the good reasons are working for both.

Trac:
Cc: juga to juga, irl

I had a discussion with juga in IRC, here are some things we discussed:

The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121 (moved)
Inline metadata is better than out-of-band metadata because it is archived by Tor Metrics along with the bandwidth measurement data
The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.
In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.
We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.
We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)
We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.

Replying to irl:

I can see how in practice this is really useful, but this may not be the best field to add to the headers. Are we going to add fields for all of the versions of libraries on the system (e.g. OpenSSL), operating system kernel, etc. All of these fields could be places with important information that helps us to interpret the results we see

You're right: we need to define scope.

Here's the scope of the sbws data file:

questions we have asked bandwidth authority operators (or they have asked us)
data we wanted Torflow to provide, but it couldn't (or didn't)
questions that relay operators ask
factors that have a large impact on measurements

The scope you're suggesting is much larger, and potentially unlimited.

I suggest that we restrict ourselves to data that we need to know to run the network.

In this context:

we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184 (moved))
we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other
we have asked authority operators about operating systems before
- but I have never wanted to know kernel versions
I have never even wondered about the versions of any other libraries

Here's another possible rule:

if it's in Tor relay descriptors, then it's useful for administering the network
- tor version
- operating system

but every change to the headers means also a change to parsers.

Parsers which follow the spec MUST allow unknown headers: https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt#n481

A well-written parser will present unknown headers (or all headers) in a flexible, dictionary-like structure.

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.

If we need something, let's specify it as a separate header. If we're not sure, then let's leave it out.

For quick diagnosis of problems, it would be great if the bandwidth scanners added some contact information to the output. We are planning to add contact strings to exit scanners.

Let's not specify another unstructured contact string, please.

Instead, if we need to email someone, let's specify an operator email address.

Whatever we do here, we should come up with good reasons why we are doing it and then apply the decisions to both exit lists and bandwidth files if the good reasons are working for both.

I agree.

Summarizing the metadata proposed:

Replying to irl:

The need for the inline contact information becomes evident when we look at how much trouble we've had tracking down who runs the default bridges in Tor Browser #30121 (moved)

Do we need contact information having an identifier (see https://trac.torproject.org/projects/tor/ticket/30217#comment:2) and initially knowing who runs the scanners (usually the same dirauth)?.

The need for AS as well as country allows us to correlate BGP events with bandwidth changes. Ideally we can know both the AS of the scanner and the target. Tor Metrics already knows the AS of the relay.

When we decided to include the country (#29299 (moved)) i remember arma commenting in irc to just add it in the configuration to don't make the code more complicated. I'm commenting more on this in (#30229 (moved))

In some cases, the target is on a CDN, which means the AS may not be known in advance. In these cases we might choose to omit the AS number and instead use a user-assigned country code for the target.

Do we currently have more than one CDN?. I only know one, though would be great to have more.

We could assign QM-QZ (user-assigned codes not used elsewhere in Tor to my knowledge) to CDN providers but each new provider used would require a spec change, and external researchers would end up using whatever value and not being able to interop their data with ours if we clash.

We could use the OO escape code and build codes like "OOFastly" which is probably the more maintainable solution, but would require that parsers do not panic when they see >2 characters (I know the standard is known as alpha-2 but it isn't actually limited to 2 characters!)

We could use Tor's geoip database to resolve the IP address resolved via the same circuit of the CDN into a country/IP address to add to the measurements, but this may prove to be unreliable especially in the longer term as IPv4 exhaustion causes more and more IP address swapping about.

Not convinced on any of the options for CDNs, but can't think of other solution.

Replying to teor:

we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184 (moved))

easy.

we might be interested in OpenSSL and NSS versions in future, because they have both had different bugs that stop relays connecting to each other

Is there a way to ask this to Tor? GETINFO only returns Tor version. Is it possible that python ssl.OPENSSL_VERSION might be different to the one being used by the running Tor?.

we have asked authority operators about operating systems before

What in concrete we would like to know about the operating system?, see options in https://docs.python.org/3/library/platform.html

Reminder to myself: i've implemented the operating system, ssl_version and tor_version in branch https://github.com/juga0/sbws/tree/ticket30196

I am a bit confused by this conversation about many different features in the same ticket.

Let's deal with each feature in a separate ticket?

#29355 (moved): Scanner nickname and UUID
#30196 (moved): Tor version
#30229 (moved): GeoIP: country and AS
- #30253 (moved): CDN country and AS
#30251 (moved): Operating System
#30252 (moved): OpenSSL and NSS
#30254 (moved): Contact (email or unstructured)

I made them all children of #30255 (moved).

Trac:
Parent: N/A to #30255 (moved)

Replying to teor:

Replying to irl:

For exit lists we are considering using a free-form text field with some suggested formats, e.g. something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10". If there is later something that affects the specific implementation we can adjust that string without a spec change to report library details.

Unstructured text is a nightmare to parse. Just look at the browser user-agent. Or the Tor ContactInfo field.

If we need something, let's specify it as a separate header. If we're not sure, then let's leave it out.

Replying to teor:

Replying to irl:

There is also an implicit assumption here that the bandwidth scanner uses tor at all. A future bandwidth scanner implementation may use stem.client or an alternative implementation.

Most headers are optional. We can specify that if there's no tor, then the tor version header SHOULD NOT be present.

Replying to juga:

Replying to teor:

we wanted to know the tor version to recommend tor upgrades to bandwidth authority operators (#30184 (moved))

easy.

Replying to juga:

Reminder to myself: i've implemented the operating system, ssl_version and tor_version in branch https://github.com/juga0/sbws/tree/ticket30196

Here's my review of tor_version:

the code looks good
please refactor EXTRA_ARG_KEYVALUES and BANDWIDTH_HEADER_KEY_VALUES_INIT so you're not adding tor_version in two different places
please submit a bandwidth-file-spec patch, so the reviewer knows what your code should do
does the "chg: v3bwfile: !refactor, obtain headers from state" commit belong in a separate pull request? It's quite large.

Edit:

please submit each feature in a separate pull request

Trac:
Status: new to needs_revision

We also want the tor version so we can work out if bandwidth measurements change when the tor version changes: https://lists.torproject.org/pipermail/tor-scaling/2019-June/000048.html

Trac:
Priority: Medium to High
Severity: Normal to Major

Trac:
Keywords: N/A deleted, sbws-roadmap-september added

Changing keyword of roadmapped open sbws tickets to a general sbws-roadmap one.

Trac:
Keywords: sbws-roadmap-september deleted, sbws-roadmap added

The goal is to deploy sbws in all bw authorities. We need to fix critical bugs to do this.

Trac:
Parent: #30255 (moved) to #33121 (moved)

I followed #comment:9 suggestions in https://github.com/torproject/sbws/pull/367, created #33197 (moved) for the big refactor and modified the spec in https://github.com/torproject/torspec/pull/107.

Trac:
Status: needs_revision to needs_review

Trac:
Reviewer: N/A to ahf

I think both the spec change and the sbws changes are good, but I have two questions:

What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?
The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?

Trac:
Status: needs_review to merge_ready

Replying to ahf:

I think both the spec change and the sbws changes are good, but I have two questions:

What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?

If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.

The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?

agree, do you have a suggestion on how to change that without having to increase minor version because changing API?

Hi Juga,

I made a few comments on the pull request.

Replying to juga:

Replying to ahf:

I think both the spec change and the sbws changes are good, but I have two questions:

What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?

If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.

The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?

agree, do you have a suggestion on how to change that without having to increase minor version because changing API?

The names of internal constants like HEADER_KEYS_V1_1_ORDERED aren't part of the API, so they don't need a version change.

What changes are you thinking about? How do they change the API?

Replying to teor:

Hi Juga,

I made a few comments on the pull request.

Replying to juga:

Replying to ahf:

I think both the spec change and the sbws changes are good, but I have two questions:

What is the purpose of the xxx's with 'tech-dept'? Is the goal we go back here and do something actionable and is that something that is best to have in the code rather than in tickets?

If i'd create a ticket with a tech-debt (i guess i wrote a typo there) changes, i don't know if it'll ever be solved and then in the ticket would need to point to all the parts of the code where i detected it while working on something else.

The constant renaming seems OK to me, but it seems complicated to maintain all these lists and how they are subsets/supersets of each other?

agree, do you have a suggestion on how to change that without having to increase minor version because changing API?

The names of internal constants like HEADER_KEYS_V1_1_ORDERED aren't part of the API, so they don't need a version change.

i know, that's why i did the changes

What changes are you thinking about? How do they change the API?

none, that's why i replied to ahf question in point 2. in https://trac.torproject.org/projects/tor/ticket/30196?replyto=18#comment:16 with other question.

i have ideas on changes, but they would change the API

teor, i saw you did a review on my PR. I'm a bit confused cause you didn't change this ticket as needs revision nor add you as reviewer. I've made all (except 1) of the changes you suggested, plus others i realized about. That's why i'm changing now this ticket to needs review again. Note that i'm not squashing and force pushing, so the changes can be seen. By experiences with other reviews that get a long list of fixups, conflicts with rebases increase or individual commits can start getting more confuse for myself. Before a final merge (note for myself more than reviewer), the commit that is actually about the tor version, should be reordered to be the last one.

Trac:
Status: merge_ready to needs_review

Sorry I forgot to put the ticket in needs_revision.

I didn't put myself as reviewer, because I didn't want this ticket to block on me,

I still think the variable naming is confusing: SELF_INITIALIZED, TO_INIT, and INT_KEYS are confusingly similar. And INITIALIZED and INIT are inconsistent.

I'll leave you to decide if you want to fix that now, or open another ticket to fix it later.

I think the naming confusion that Teor brought up should be fixed, but it can be fixed in a follow up ticket (IMO).

I think this is a large change that overall makes the code better so I think we should get it in. Let's go over the naming changes in another ticket?

Trac:
Status: needs_review to merge_ready

I've created #33572 (moved) to merge the patch in the spec, because this ticket component is not Tor. I've created #33571 (moved) for the constants renaming.