legacy/trac#30693 (moved) deleted old unsanitized server logs. Before deletion, I extracted a sanitized CSV file that is enough to make [[comment:3:ticket:30693|graphs like this]].
This ticket is to discuss whether the sanitized CSV is safe to publish, and to publish it if so. The sanitized log is currently on the snowflake-broker host under the filename:
/var/log/snowflake-broker/broker.csv.xz
The scripts used to create it are
broker-logs.zip:ticket:30693
Trac: Status: needs_review → needs_information; Reviewer: N/A → phw
These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (legacy/trac#21315 (moved)).
I agree that this information should be fine to publish. We don't have any geoip information from proxies or clients either which keeps the risk low.
The clientid column appears to be empty:
{{{
$ cut -d , -f 4 broker.csv | sort -n | uniq
clientid
}}}
Do we need it if it doesn't contain any information?
We can remove that column. I included the column to show that I would have used that information if it were present, but none of the broker's log messages has it.
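As a minimal sketch of dropping the column, locating it by header name rather than by position (the example rows and the column names other than clientid are invented for illustration, not taken from the real broker.csv):

```python
import csv  # the real input would be parsed with csv.reader

def drop_column(rows, name):
    """Yield rows with the named column removed, using the first
    row as the header. Raises ValueError if the column is absent."""
    rows = iter(rows)
    header = next(rows)
    i = header.index(name)
    yield header[:i] + header[i + 1:]
    for row in rows:
        yield row[:i] + row[i + 1:]

# Hypothetical rows standing in for the sanitized broker.csv.
rows = [
    ["timestamp", "event", "proxyid", "clientid"],
    ["2019-07-01 00:00:00", "proxy-poll", "abc123", ""],
]
print(list(drop_column(rows, "clientid")))
```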
The only problem I can think of is somebody using the data set to confirm if a given computer was a client or a proxy. Given the low resolution of the timestamps, the absence of client IDs, and the pseudonymous proxy IDs, I consider the risk low.
I was thinking, maybe the truncated timestamps don't help so much. Proxies poll every 10 seconds, so if you see a certain proxy ID polled 10 times in the 00:00:00–00:10:00 interval, and 30 times in the 00:10:00–00:20:00 interval, then it's a good guess that the proxy was active starting 00:08:40 and ending 00:15:00. In the graphs, I further truncated the timestamps to 24-hour resolution, just to reduce the size of the data (see reduce.R). Maybe we should make the timestamps even coarser than 10 minutes?
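The inference above can be sketched like this (a rough estimate assuming exactly one poll per 10 seconds; the exact start time depends on how polls align with the bin edges, which is why it comes out slightly earlier than the 00:08:40 guess in the text):

```python
from datetime import datetime, timedelta

POLL_INTERVAL = 10  # seconds between proxy polls

def guess_start(bin_end, poll_count):
    """If poll_count polls landed in the bin ending at bin_end, the
    proxy was active for roughly poll_count * POLL_INTERVAL seconds
    at the end of the bin."""
    return bin_end - timedelta(seconds=poll_count * POLL_INTERVAL)

# 10 polls in the 00:00:00-00:10:00 bin -> ~100 s of activity,
# so the proxy likely appeared a minute or two before the bin ended.
print(guess_start(datetime(2019, 7, 1, 0, 10, 0), 10))  # 2019-07-01 00:08:20
```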
These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (legacy/trac#21315 (moved)).
Yes, I think it would be great to have similar graphs using up-to-date logs. The scripts in broker-logs.zip:ticket:30693 should still work with the current sanitized logs. We should potentially think about making the logs better reflect what we want to measure, because currently they require a fair bit of inference. Here's the main classification function from the process script that tries to regularize the log messages:
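The function itself isn't reproduced in this thread. As a purely hypothetical sketch of the kind of regularization it performs (the message patterns below are invented for illustration, not the broker's actual log lines):

```python
import re

# Hypothetical patterns mapping raw log messages to coarse event types.
# The real classification function in the process script uses the
# broker's actual message formats, which differ from these.
PATTERNS = [
    ("proxy-poll",  re.compile(r"proxy .* polls")),
    ("client-poll", re.compile(r"client .* requests")),
    ("match",       re.compile(r"matched .* with")),
]

def classify(message):
    """Return the event type of a raw log message, or None if it
    matches no known pattern."""
    for event, pattern in PATTERNS:
        if pattern.search(message):
            return event
    return None

print(classify("proxy abc123 polls for clients"))  # proxy-poll
```

The point of the ticket referenced below is to make the broker log events in a form that needs no such pattern-matching step.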
I made a ticket for this: legacy/trac#30830 (moved).
I'm trying to reason out the difference between analyzing the log output like we've done here and exporting formatted stats for the metrics like we're planning to do for the geoip metrics. Are there scenarios where it's better for us to analyze data in the way we've done here, or should we figure out how to make the graphs in legacy/trac#30693 (moved) with exported data?