New version of exit list format

Added labels (Legacy / Trac): component::metrics/exit scanner, owner::metrics-team, priority::medium, severity::normal, status::new, type::task

I already started working on this as discussed yesterday. Some comments:

Replying to irl:

We need to extend the exit list format to include:

Metadata:

Source identifier

Source software name

Source software version

Source IP address

I already have these.

Source ASN

Source country

I'm not sure about these. We would basically include these by having the source look up its IP address in a database. But then the result depends on which database (version) the source uses. Of course, whoever uses this information could as well look up the source IP address in the database (version) of their choice and discard these two fields. Maybe this means we shouldn't put too much effort in the source's ability to include these two fields. Or we could just omit them from the spec. Not sure!

Source operator contact details

Measurement results:

IPv6 addresses

Yep, these make sense.

Error code

Error codes:

Success

Unknown Failure

Timeout

(others we can think of now, but we can extend this later when we actually implement it)

This one is tricky. I don't think that the current scanner includes scans that ended with unknown failures or timeouts. It includes, for each found exit IP address, the latest scan time of a successful run resulting in that IP address. It probably omits IP addresses after a given number of hours, but we'd have to look at the code in order to know.
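As a rough illustration of the behaviour described above (keep, for each exit IP address, only the latest successful scan time), here is a minimal Python sketch. The tuple layout, the function name `latest_successful_scans`, and the sample addresses are assumptions made for this example; they are not taken from the scanner's code, which would still need to be checked.

```python
from datetime import datetime


def latest_successful_scans(measurements):
    """Return {address: latest scan time}, looking at successful scans only.

    `measurements` is an iterable of (address, scanned_at, succeeded)
    tuples; this layout is hypothetical and chosen for the example.
    """
    latest = {}
    for address, scanned_at, succeeded in measurements:
        if not succeeded:
            # Failures and timeouts never reach the output.
            continue
        if address not in latest or scanned_at > latest[address]:
            latest[address] = scanned_at
    return latest


# Made-up sample data using documentation address ranges.
measurements = [
    ("192.0.2.1", datetime(2019, 3, 1, 10, 0), True),
    ("192.0.2.1", datetime(2019, 3, 1, 11, 0), True),
    ("192.0.2.1", datetime(2019, 3, 1, 12, 0), False),
    ("2001:db8::1", datetime(2019, 3, 1, 10, 30), True),
]
result = latest_successful_scans(measurements)
```

In this sketch the failed scan at 12:00 leaves no trace, which is exactly the information that a one-line-per-measurement format would preserve.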

I think we should probably have one line per measurement, so IPv4 and IPv6 results would be listed separately, not on the same line. In the future we may have differing transports to consider (TCP/QUIC/something else) so maybe we should not just have IPv4 vs IPv6 but some numeric identifier that is later extensible.

Agreed on the IPv4/IPv6 distinction. I was thinking to simply include a new ExitAddress6 line for IPv6 addresses and continue using ExitAddress for IPv4 addresses. And I'd probably simply add another keyword for the next transport or address version. What else do you have in mind?

Relatedly, I'd want to include OrAddress and OrAddress6 for the addresses found in the consensus. Background is that I'd like to use exit lists as single input document type for ExoneraTor in the future.

Exit lists are not currently included in torspec but probably should be. The specification should cover the existing format, and then also the new format. We should expect that we will later extend the new format with a signature. Maybe we should just figure that out now also.

Turns out that specifying the existing format is not trivial. Right now I'm looking at metrics-lib only, but I think I'll have to look at other code that produces/consumes these lists. For example, it would be great to know whether Published and LastStatus in the current format are considered required or optional fields, because it would be very convenient to lose them in version 2. What other code should I be looking at?

Trac:
Owner: metrics-team to karsten
Status: new to accepted
Cc: N/A to metrics-team

Replying to karsten:

Source ASN

Source country

I'm not sure about these. We would basically include these by having the source look up its IP address in a database. But then the result depends on which database (version) the source uses. Of course, whoever uses this information could as well look up the source IP address in the database (version) of their choice and discard these two fields. Maybe this means we shouldn't put too much effort in the source's ability to include these two fields. Or we could just omit them from the spec. Not sure!

I would rather have them as optional if you think that they would not be required. I would expect this to be either declared by the user, who should know best, or looked up via RIPEstat.

This one is tricky. I don't think that the current scanner includes scans that ended with unknown failures or timeouts. It includes, for each found exit IP address, the latest scan time of a successful run resulting in that IP address. It probably omits IP addresses after a given number of hours, but we'd have to look at the code in order to know.

I would like to have one line per measurement, whether it succeeds, has a duplicate result, fails, or whatever. This helps us to understand how the tool is performing and doesn't hide information that would be really useful in debugging.

I think we should probably have one line per measurement, so IPv4 and IPv6 results would be listed separately, not on the same line. In the future we may have differing transports to consider (TCP/QUIC/something else) so maybe we should not just have IPv4 vs IPv6 but some numeric identifier that is later extensible.

Agreed on the IPv4/IPv6 distinction. I was thinking to simply include a new ExitAddress6 line for IPv6 addresses and continue using ExitAddress for IPv4 addresses. And I'd probably simply add another keyword for the next transport or address version. What else do you have in mind?

This could also work, but we should do it in a way that we have defined a generalised format for the measurement result and then we have specifics for IPv4 and IPv6 which should just be that the expected address format is different.
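One way to read this suggestion is a single measurement record whose only address-family-specific parts are the keyword and the expected address syntax. The Python sketch below is an assumption-laden illustration: the class and function names are invented here, the keywords are the ExitAddress/ExitAddress6 pair proposed above, and whether IPv6 addresses would be bracketed or otherwise decorated in the real format is not decided in this thread.

```python
import ipaddress
from dataclasses import dataclass
from datetime import datetime

# Keyword per IP version, using the names proposed in this discussion.
KEYWORD_BY_VERSION = {4: "ExitAddress", 6: "ExitAddress6"}


@dataclass(frozen=True)
class ExitMeasurement:
    """Generalised measurement result (hypothetical structure)."""
    keyword: str
    address: str
    scanned_at: datetime


def make_measurement(address_text, scanned_at):
    # ip_address() raises ValueError for anything that is neither a
    # valid IPv4 nor a valid IPv6 address.
    address = ipaddress.ip_address(address_text)
    return ExitMeasurement(KEYWORD_BY_VERSION[address.version],
                           str(address), scanned_at)


def format_line(measurement):
    # Same shape for every address family; only keyword and address differ.
    return (f"{measurement.keyword} {measurement.address} "
            f"{measurement.scanned_at:%Y-%m-%d %H:%M:%S}")


when = datetime(2019, 3, 1, 10, 0, 0)
line4 = format_line(make_measurement("192.0.2.1", when))
line6 = format_line(make_measurement("2001:db8::1", when))
```

Adding a further transport or address version would then mean adding one entry to the keyword table rather than a new line grammar.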

Relatedly, I'd want to include OrAddress and OrAddress6 for the addresses found in the consensus. Background is that I'd like to use exit lists as single input document type for ExoneraTor in the future.

Perhaps we are describing Internet Address Lists and not Exit Lists?

Exit lists are not currently included in torspec but probably should be. The specification should cover the existing format, and then also the new format. We should expect that we will later extend the new format with a signature. Maybe we should just figure that out now also.

Turns out that specifying the existing format is not trivial. Right now I'm looking at metrics-lib only, but I think I'll have to look at other code that produces/consumes these lists. For example, it would be great to know whether Published and LastStatus in the current format are considered required or optional fields, because it would be very convenient to lose them in version 2. What other code should I be looking at?

I've not thought about this yet, but why would it be convenient to lose these in version 2?

Replying to irl:

Replying to karsten:

Source ASN

Source country

I'm not sure about these. We would basically include these by having the source look up its IP address in a database. But then the result depends on which database (version) the source uses. Of course, whoever uses this information could as well look up the source IP address in the database (version) of their choice and discard these two fields. Maybe this means we shouldn't put too much effort in the source's ability to include these two fields. Or we could just omit them from the spec. Not sure!

I would rather have them as optional if you think that they would not be required. I would expect this to be either declared by the user, who should know best, or looked up via RIPEstat.

We can specify these. What format would we expect the country and ASN to be in?

This one is tricky. I don't think that the current scanner includes scans that ended with unknown failures or timeouts. It includes, for each found exit IP address, the latest scan time of a successful run resulting in that IP address. It probably omits IP addresses after a given number of hours, but we'd have to look at the code in order to know.

I would like to have one line per measurement, whether it succeeds, has a duplicate result, fails, or whatever. This helps us to understand how the tool is performing and doesn't hide information that would be really useful in debugging.

I see your point. However, this would be a backward-incompatible change to the current document format where the IP address is unique for the ExitAddress lines of any given router. And it might not scale if we add lots and lots of scans all ending with the same result. Unclear.

I think we should probably have one line per measurement, so IPv4 and IPv6 results would be listed separately, not on the same line. In the future we may have differing transports to consider (TCP/QUIC/something else) so maybe we should not just have IPv4 vs IPv6 but some numeric identifier that is later extensible.

Agreed on the IPv4/IPv6 distinction. I was thinking to simply include a new ExitAddress6 line for IPv6 addresses and continue using ExitAddress for IPv4 addresses. And I'd probably simply add another keyword for the next transport or address version. What else do you have in mind?

This could also work, but we should do it in a way that we have defined a generalised format for the measurement result and then we have specifics for IPv4 and IPv6 which should just be that the expected address format is different.

Sounds good.

Relatedly, I'd want to include OrAddress and OrAddress6 for the addresses found in the consensus. Background is that I'd like to use exit lists as single input document type for ExoneraTor in the future.

Perhaps we are describing Internet Address Lists and not Exit Lists?

Possibly. Maybe this won't scale, either. Unclear.

Exit lists are not currently included in torspec but probably should be. The specification should cover the existing format, and then also the new format. We should expect that we will later extend the new format with a signature. Maybe we should just figure that out now also.

Turns out that specifying the existing format is not trivial. Right now I'm looking at metrics-lib only, but I think I'll have to look at other code that produces/consumes these lists. For example, it would be great to know whether Published and LastStatus in the current format are considered required or optional fields, because it would be very convenient to lose them in version 2. What other code should I be looking at?

I've not thought about this yet, but why would it be convenient to lose these in version 2?

Both are rather implementation-specific pieces of information that are not really relevant for exit lists. Published is used to avoid doing another scan until the next descriptor arrives, and LastStatus is used to decide when to discard a router. Both parts are contained in exit lists, because they're not primarily an output format but an internal state file used by TorDNSEL.

It doesn't hurt to have these lines, except that they eat up space. However, declaring them as required means we can never remove them from future formats without making a backward-incompatible change. But maybe this ship has sailed, and we need to consider them required, because they have always been there.
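To make the required-versus-optional question concrete, here is a hedged Python sketch of a consumer that treats Published and LastStatus as optional. It assumes the current format starts each entry with an ExitNode line followed by Published, LastStatus, and ExitAddress lines with "YYYY-MM-DD HH:MM:SS" timestamps; that is my understanding of the existing files, not something specified in this ticket, and the sample data is made up.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_entries(text):
    """Parse exit list entries, tolerating missing Published/LastStatus."""
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword, _, rest = line.partition(" ")
        if keyword == "ExitNode":
            current = {"fingerprint": rest, "published": None,
                       "last_status": None, "exit_addresses": {}}
            entries.append(current)
        elif current is None:
            continue  # header lines that precede the first entry
        elif keyword == "Published":
            current["published"] = datetime.strptime(rest, TIME_FORMAT)
        elif keyword == "LastStatus":
            current["last_status"] = datetime.strptime(rest, TIME_FORMAT)
        elif keyword == "ExitAddress":
            address, _, scanned = rest.partition(" ")
            current["exit_addresses"][address] = datetime.strptime(
                scanned, TIME_FORMAT)
    return entries


# Illustrative input only; fingerprints and addresses are invented.
SAMPLE = """\
Downloaded 2019-03-01 12:02:00
ExitNode 0000000000000000000000000000000000000001
Published 2019-03-01 08:00:00
LastStatus 2019-03-01 11:00:00
ExitAddress 192.0.2.1 2019-03-01 11:12:13
ExitNode 0000000000000000000000000000000000000002
ExitAddress 192.0.2.2 2019-03-01 11:20:00
"""
entries = parse_entries(SAMPLE)
```

A parser written this way would keep working if version 2 dropped the two lines, whereas a parser that insists on them would not, which is the backward-compatibility concern raised above.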

P.S.: Do we need a new Metrics/* subcomponent for this exit list work?

I just attached a first draft as commit 1eeefc4 in my task-29624 metrics-web branch, and I changed the component to Metrics/Website until we have a better place.

I made two decisions when writing this first draft which we can further discuss:

1. I specifically left out measurement details. While I very much agree that these might be useful for debugging, I don't see yet how they would fit into exit lists. Similarly, votes do not contain details about reachability tests, even though that would be interesting information. I'd be curious whether bandwidth files contain measurement details other than results. I'm afraid that if we start adding measurements to exit lists, we'll soon want to include more measurement details, adding more and more data that current consumers of exit lists do not really care about.
2. I took out OR addresses, despite my earlier plans to include them. If we were to include everything from consensuses that's required for ExoneraTor and similar applications, we'd also have to include exit policy summaries. Then we'd be able to ignore consensuses and instead look at exit lists only. But what if we ever want to use more data from consensuses in ExoneraTor? We'd have to add that to the exit list format, too. I think it's better to leave this data where it comes from: in consensuses.

Please review!

Trac:
Component: Metrics/ExoneraTor to Metrics/Website
Status: accepted to needs_review

Trac:
Parent: N/A to legacy/trac#29650 (moved)
Component: Metrics/Website to Metrics/Exit Scanner

Trac:
Reviewer: N/A to irl
Keywords: metrics-exit-list-project metrics-roadmap-2019-q2 deleted, metrics-roadmap-2019-q2 added

We need to work on the use of words like "may". Unless Tor already has something for this, let's refer to RFC2119.

I don't believe we need to prefix keywords with "Scanner". Was there a specific reason for this?

dir-spec uses kebab-case for keywords, not CamelCase.

For fields that are already defined in dir-spec, like "contact" we should refer to those semantics instead of making up our own.

As above, for date/time formats.

We should be specific on our use of country codes. There are extensions added by the databases we are using, and we also use our own extensions. Maybe we should talk to OONI and see what they are using too so we can be unified.

How does the "Downloaded" keyword work with signed documents? How do you see it being used?

On point 1, this sounds OK. I am starting to think of exit lists in the new scanner context as a derived format from the raw measurement results in a similar way that our current torperf files are derived from onionperf analysis results which are derived from tor/tgen logs.

As an aside, the format we are deriving from will most likely be [ndjson]. This is not important for the spec.

On point 2, this also sounds OK. Should we specify that an exit list should be used with a specific consensus in applications like ExoneraTor? I think no, we should always use the latest exit list and latest consensus to give the most up-to-date information available.

Trac:
Status: needs_review to needs_revision

Replying to notirl:

We need to work on the use of words like "may". Unless Tor already has something for this, let's refer to RFC2119.

Makes sense. However, it's been a while since I wrote specs with those keywords, and I think I didn't get it right in all cases back then. Do you mind going through the spec at the end and correcting keywords accordingly?

I don't believe we need to prefix keywords with "Scanner". Was there a specific reason for this?

My idea was to avoid future conflicts with keywords used in exit list entries, and in the header it matters the least to make keywords a bit longer. I don't feel strongly, though. Mild preference for keeping the prefix.

dir-spec uses kebab-case for keywords, not CamelCase.

For fields that are already defined in dir-spec, like "contact" we should refer to those semantics instead of making up our own.

Hmm, should we really mix CamelCase and kebab-case in a single document? I think I'd prefer to stay in CamelCase notation.

As above, for date/time formats.

Hmm? I copied over the format from dir-spec. The formats should be equivalent. Or what do you mean?

We should be specific on our use of country codes. There are extensions added by the databases we are using, and we also use our own extensions. Maybe we should talk to OONI and see what they are using too so we can be unified.

I'm not sure what to gain from defining (or linking to) a set of allowed country codes. I consider this field mostly informational. But I don't really mind. In any case we could move forward with completing this spec and writing parsers, and we could later adapt the spec to define a subset of valid two-letter country codes.

How does the "Downloaded" keyword work with signed documents? How do you see it being used?

Signed documents are certainly a challenge. The issue is that this keyword is already being used: CollecTor adds it. A better choice (back then) would have been to use an annotation for this. But I think the Created keyword will supersede this keyword anyway. Still, it's there, which is why I included it in the spec. Maybe there's a better plan?

On point 1, this sounds OK. I am starting to think of exit lists in the new scanner context as a derived format from the raw measurement results in a similar way that our current torperf files are derived from onionperf analysis results which are derived from tor/tgen logs.

As an aside, the format we are deriving from will most likely be [ndjson]. This is not important for the spec.

Makes sense.

On point 2, this also sounds OK. Should we specify that an exit list should be used with a specific consensus in applications like ExoneraTor? I think no, we should always use the latest exit list and latest consensus to give the most up-to-date information available.

Agreed, we should leave this up to the application.

Changing back to needs_review for the open questions. Thanks!

Trac:
Status: needs_revision to needs_review

Here are my notes from talking this over at today's meeting:

Replying to karsten:

Replying to notirl:

We need to work on the use of words like "may". Unless Tor already has something for this, let's refer to RFC2119.

Makes sense. However, it's been a while since I wrote specs with those keywords, and I think I didn't get it right in all cases back then. Do you mind going through the spec at the end and correcting keywords accordingly?

I don't believe we need to prefix keywords with "Scanner". Was there a specific reason for this?

My idea was to avoid future conflicts with keywords used in exit list entries, and in the header it matters the least to make keywords a bit longer. I don't feel strongly, though. Mild preference for keeping the prefix.

dir-spec uses kebab-case for keywords, not CamelCase.

For fields that are already defined in dir-spec, like "contact" we should refer to those semantics instead of making up our own.

Hmm, should we really mix CamelCase and kebab-case in a single document? I think I'd prefer to stay in CamelCase notation.

We made plans to use kebab-case keywords only in version 2. This means that it won't be backward-compatible with version 1 which only uses CamelCase keywords. The API can still provide the same methods for accessing parts of an exit list, regardless of the version. Let's try this.

Related to this change, we're going to say "contact" rather than "ScannerContact" or "scanner-contact", and we're linking to version 3 of dir-spec to say that we're using the format specified there.

As above, for date/time formats.

Hmm? I copied over the format from dir-spec. The formats should be equivalent. Or what do you mean?

Likewise, we're linking to dir-spec version 3.

We should be specific on our use of country codes. There are extensions added by the databases we are using, and we also use our own extensions. Maybe we should talk to OONI and see what they are using too so we can be unified.

I'm not sure what to gain from defining (or linking to) a set of allowed country codes. I consider this field mostly informational. But I don't really mind. In any case we could move forward with completing this spec and writing parsers, and we could later adapt the spec to define a subset of valid two-letter country codes.

For now we'll allow [A-Z][A-Z] as valid 2-alpha country code as specified in ISO 3166-1 alpha-2. We're writing these as uppercase and parsing them case-insensitively.
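A small Python sketch of that rule (write uppercase, parse case-insensitively) might look like the following. The function name is invented for this example, and the check is purely syntactic: it does not verify that a code is actually assigned in ISO 3166-1 alpha-2 or handle any database-specific extensions.

```python
import re

# Two ASCII letters in either case; the canonical form is uppercase.
_COUNTRY_CODE = re.compile(r"[A-Za-z]{2}")


def parse_country_code(text):
    """Return the uppercase two-letter code, or raise ValueError."""
    if not _COUNTRY_CODE.fullmatch(text):
        raise ValueError(f"not a two-letter country code: {text!r}")
    return text.upper()


parsed = [parse_country_code(code) for code in ("DE", "de", "De")]

# Anything that is not exactly two letters is rejected.
rejected = []
for candidate in ("DEU", "D", "D1", ""):
    try:
        parse_country_code(candidate)
    except ValueError:
        rejected.append(candidate)
```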

How does the "Downloaded" keyword work with signed documents? How do you see it being used?

Signed documents are certainly a challenge. The issue is that this keyword is already being used: CollecTor adds it. A better choice (back then) would have been to use an annotation for this. But I think the Created keyword will supersede this keyword anyway. Still, it's there, which is why I included it in the spec. Maybe there's a better plan?

We might use @downloaded-at in CollecTor, but we're not going to specify a new line like this in version 2 of the exit list specification.

On point 1, this sounds OK. I am starting to think of exit lists in the new scanner context as a derived format from the raw measurement results in a similar way that our current torperf files are derived from onionperf analysis results which are derived from tor/tgen logs.

As an aside, the format we are deriving from will most likely be [ndjson]. This is not important for the spec.

Makes sense.

On point 2, this also sounds OK. Should we specify that an exit list should be used with a specific consensus in applications like ExoneraTor? I think no, we should always use the latest exit list and latest consensus to give the most up-to-date information available.

Agreed, we should leave this up to the application.

Changing back to needs_review for the open questions. Thanks!

I'm going to make changes as outlined above, and then irl is going to adapt the MAY/MUST/etc. parts.

Trac:
Status: needs_review to needs_revision

Alright, I made a few tweaks to my earlier draft in commit f3f289f in the same branch as mentioned above. irl, please feel free to make the MAY/MUST/etc. changes now as well as any other tweaks you think would be useful. Thanks!

Trac:
Status: needs_revision to assigned
Owner: karsten to irl

Trac:
Status: assigned to accepted

I'm currently working on this. It is taking a little longer than I had hoped because I'm first getting a handle on cert-spec. I'd like us to be able to define this in a way that we don't need a version 3 to add signatures.

My thoughts so far are:

exit scanners will have Ed25519 keys
there may be one long-lived identity key and one shorter-term signing key (to allow offline master key)
there won't be any RSA keys, it will be "Ed25519-first"
we re-use the certificate formats from cert-spec
signing is optional, if there is no identity line then no signature should be expected
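The last point, that a signature is expected only when an identity line is present, can be sketched as a simple consistency check. This Python example is an assumption, not part of any draft: the keyword names "identity-ed25519" and "exit-scanner-sig-ed25519" are the ones that appear in the later review comments in this thread, the function name and return values are invented, and no cryptographic verification is attempted.

```python
IDENTITY_KEYWORD = "identity-ed25519"
SIGNATURE_KEYWORD = "exit-scanner-sig-ed25519"


def signature_status(lines):
    """Classify a document by whether identity and signature lines match up."""
    keywords = {line.split(" ", 1)[0] for line in lines if line}
    has_identity = IDENTITY_KEYWORD in keywords
    has_signature = SIGNATURE_KEYWORD in keywords
    if has_identity and not has_signature:
        return "missing-signature"
    if has_signature and not has_identity:
        return "unexpected-signature"
    return "signed" if has_identity else "unsigned"


# Hypothetical documents reduced to their keyword lines.
unsigned_doc = ["exit-list", "published 2019-04-10 12:00:00"]
signed_doc = ["exit-list", "identity-ed25519",
              "published 2019-04-10 12:00:00", "exit-scanner-sig-ed25519"]
truncated_doc = ["exit-list", "identity-ed25519"]

statuses = [signature_status(doc)
            for doc in (unsigned_doc, signed_doc, truncated_doc)]
```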

Two changes are going to be related to unifying the keywords between this spec and dir-spec. created->published, software->platform.

I think the address4|6 lines should be optional, so that we can prevent the scanners becoming targets for attack.

I need to pause on this to look at other tasks, but hopefully will return soon and we can get this bit finished off.

These all sound like good suggestions! Let me know when you have something you want me to review.

I've been digging further into this and all of my thoughts above are entirely workable. We will need to extend cert-spec with a new document type for exit lists but that's just a side detail. I wrote up my exploration today on my blog as I was going along, if it is of interest:

https://iain.learmonth.me/blog/2019/2019w151/

The graph in there might be useful if we need to show anyone why the exit scanner is useful.

Trac:
Cc: metrics-team to metrics-team, arlolra

I saw that commits are still being made to check, which means that it is being maintained. It is also a consumer of this format, so I'm adding Arlo to Cc. (:

Ok, I've made some changes here. I think we should probably not call this final until we have software that actually speaks it, but this should help us to start implementing that software.

https://gitweb.torproject.org/user/irl/metrics-web.git/plain/exit-list-spec.txt?h=task-29624

Once we have some initial implementation, we should send this to tor-dev for review and then have it included in torspec along with the update to cert-spec for the prefix string.

Trac:
Status: accepted to needs_review

Nice blog post! It's indeed interesting to see what fraction of IP addresses is only available in exit lists. If I may nitpick:

It's slightly confusing (to me) that the red and green line show single set sizes whereas the blue line shows one set minus the other one. I wonder if the lines would work better (for me) if they were: Found only in consensus, Found in consensus and exit list, and Found only in exit list. That way it would be possible to add up all three lines and obtain the total number of addresses.
Maybe this graph would work as a stacked area plot like our bandwidth flags graph, just with three stacked areas rather than four. If you look at that linked bandwidth flags graph, yellow would be "Found only in consensus", green would be "Found in consensus and exit list" and blue would be "Found only in exit list". That would show pretty quickly what fraction someone is missing when only looking at consensuses: the blue part at the top.
If you like, I can help you with making that graph. Just give me the raw data, and I'll make a quick graph out of it. Also as an experience to learn more about the tidyverse. And for the next funding proposal, of course.
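The three suggested categories are just a partition of the union of both address sets, which is why they add up to the total. A minimal Python sketch, with an invented function name and made-up documentation addresses:

```python
def partition_addresses(consensus, exit_list):
    """Split addresses into three disjoint groups covering the union."""
    consensus, exit_list = set(consensus), set(exit_list)
    return {
        "consensus_only": consensus - exit_list,
        "both": consensus & exit_list,
        "exit_list_only": exit_list - consensus,
    }


# Illustrative inputs only.
consensus = {"192.0.2.1", "192.0.2.2", "192.0.2.3"}
exit_list = {"192.0.2.2", "192.0.2.3", "198.51.100.7"}

groups = partition_addresses(consensus, exit_list)
total = sum(len(group) for group in groups.values())
```

Here `total` equals the size of the union of the two sets, and the "exit_list_only" group is the part someone misses when looking at consensuses alone.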

Anyway, I think you wanted a review of the exit list spec! Here's what I found:

"[the identity-ed25519 line] MUST appear as the first or second element in the router descriptor": Why the first? In version 2 or later the first element is always "exit-list", so it can only be the second element. And why router descriptor? You mean exit list?
Sometimes we're using two spaces between sentences. Let's be consistent and either use one or two.
In the "location" line, should we write the AS number as AS12345 rather than just 12345? Technically, it's correct to just use an integer there, but if we think of humans reading the format, we might want to include the "AS" part anyway.
Should the "platform" line suggest a possible format for exit list scanner software version, Tor software version, and operating system? Something like "ExitScanner 55.5 using Tor 1.0.0 on Windows 10"?
There are at least two places where you write "[XXX: Informational: ...]". Can we just include those informational parts without the XXX stuff?
A bit below that you refer to the "created" timestamp, which is now called the "published" timestamp.
I wonder if the "address" part of "exit-address4" and "exit-address6" lines should be unique (with respect to the exit list entry, not the whole exit list, of course). This has implications on how somebody parsing an exit list entry can store and retrieve parsed addresses. Of course, if we think it's useful to include an address several times with different scan times, we'll have to say that the address is not necessarily unique. In any case, we should say what people can expect there, or they'll make assumptions.
The "exit-scanner-sig-ed25519" should probably specify where to expect the signature part. Is it going to be part of the line, separated by SP from the keyword? Or is it going to be an object in the next line? In the latter case the subsequent description would have to say "including the newline character after" rather than "including the first space after".

Other than these comments, the plan to build it and see if it works sounds reasonable to me. Thanks!

Trac:
Status: needs_review to needs_revision

Here's a graph roughly like the one I suggested above:

Here's the source code:

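# Load the tidyverse packages used below.
library(tidyr)
library(dplyr)
library(readr)
library(ggplot2)

# Read the raw counts per timestamp. Judging from the arithmetic below,
# consensus and exitl are the set sizes for consensus and exit list, and
# exitlo counts addresses found only in the exit list.
read_csv("exitips.csv", col_types = cols(
      time = col_datetime(format = ""),
      consensus = col_double(),
      exitl = col_double(),
      exitlo = col_double())) %>%
  # Derive three disjoint categories that add up to the total.
  transmute(time, consensus_only = consensus - exitl + exitlo,
    both = exitl - exitlo,
    exitlist_only = exitlo) %>%
  # Reshape to long format: one row per time and category.
  gather(document, ips, 2:4) %>%
  # Fix the stacking order and use readable legend labels.
  mutate(document = factor(document,
    levels = c("exitlist_only", "both", "consensus_only"),
    labels = c("Found only in exit list",
      "Found in consensus and exit list", "Found only in consensus"))) %>%
  ggplot(aes(x = time, y = ips, fill = document)) +
  geom_area() +
  scale_x_datetime(name = "") +
  scale_y_continuous(name = "", limits = c(0, NA)) +
  # Blue, green, yellow, in the order of the factor levels above.
  scale_fill_manual(name = "",
    values = c("#03B3FF", "#39FF02", "#FFFF00"))
ggsave(filename = "exitips-stacked-area.png", width = 8, height = 5, dpi = 150)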

Trac:
Status: needs_revision to assigned
Owner: irl to metrics-team

Trac:
Status: assigned to new

We would like to do this one day, but it was too ambitious to have this in the roadmap now.

Trac:
Parent: legacy/trac#29650 (moved) to N/A
Keywords: metrics-roadmap-2019-q2 deleted, N/A added

This is not currently in need of review.

Trac:
Reviewer: irl to N/A

mentioned in issue legacy/trac#29650 (moved)

moved from legacy/trac#29624 (moved)

added Task label and removed 1 deleted label

removed 1 deleted label

added Backlog label

added Roadmap::Future label and removed Backlog label
