legacy/trac#30693 (moved) deleted old unsanitized server logs. Before deletion, I extracted a sanitized CSV file that is enough to make [[comment:3:ticket:30693|graphs like this]].
This ticket is to discuss whether the sanitized CSV is safe to publish, and to publish it if so. The sanitized log is currently on the snowflake-broker host under the filename:
/var/log/snowflake-broker/broker.csv.xz
The scripts used to create it are
broker-logs.zip:ticket:30693
Trac: Status: needs_review → needs_information; Reviewer: N/A → phw
These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (legacy/trac#21315 (moved)).
I agree that this information should be fine to publish. We don't have any geoip information from proxies or clients either which keeps the risk low.
The clientid column appears to be empty:
{{{
$ cut -d , -f 4 broker.csv | sort -n | uniq
clientid
}}}
Do we need it if it doesn't contain any information?
We can remove that column. I included the column to show that I would have used that information if it were present, but none of the broker's log messages has it.
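As a minimal sketch of dropping the column, locating it by header name rather than by position (the example rows and the column names other than clientid are invented for illustration, not taken from the real broker.csv):

```python
import csv  # the real input would be parsed with csv.reader

def drop_column(rows, name):
    """Yield rows with the named column removed, using the first
    row as the header. Raises ValueError if the column is absent."""
    rows = iter(rows)
    header = next(rows)
    i = header.index(name)
    yield header[:i] + header[i + 1:]
    for row in rows:
        yield row[:i] + row[i + 1:]

# Hypothetical rows standing in for the sanitized broker.csv.
rows = [
    ["timestamp", "event", "proxyid", "clientid"],
    ["2019-07-01 00:00:00", "proxy-poll", "abc123", ""],
]
print(list(drop_column(rows, "clientid")))
```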
The only problem I can think of is somebody using the data set to confirm if a given computer was a client or a proxy. Given the low resolution of the timestamps, the absence of client IDs, and the pseudonymous proxy IDs, I consider the risk low.
I was thinking, maybe the truncated timestamps don't help so much. Proxies poll every 10 seconds, so if you see a certain proxy ID polled 10 times in the 00:00:00–00:10:00 interval, and 30 times in the 00:10:00–00:20:00 interval, then it's a good guess that the proxy was active starting 00:08:40 and ending 00:15:00. In the graphs, I further truncated the timestamps to 24-hour resolution, just to reduce the size of the data (see reduce.R). Maybe we should make the timestamps even coarser than 10 minutes?
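The inference above can be sketched like this (a rough estimate assuming exactly one poll per 10 seconds; the exact start time depends on how polls align with the bin edges, which is why it comes out slightly earlier than the 00:08:40 guess in the text):

```python
from datetime import datetime, timedelta

POLL_INTERVAL = 10  # seconds between proxy polls

def guess_start(bin_end, poll_count):
    """If poll_count polls landed in the bin ending at bin_end, the
    proxy was active for roughly poll_count * POLL_INTERVAL seconds
    at the end of the bin."""
    return bin_end - timedelta(seconds=poll_count * POLL_INTERVAL)

# 10 polls in the 00:00:00-00:10:00 bin -> ~100 s of activity,
# so the proxy likely appeared a minute or two before the bin ended.
print(guess_start(datetime(2019, 7, 1, 0, 10, 0), 10))  # 2019-07-01 00:08:20
```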
These graphs are really nice. Thanks for doing this! I'm wondering if we want something similar to the information in these graphs for the ongoing metrics collected from the broker (legacy/trac#21315 (moved)).
Yes, I think it would be great to have similar graphs using up-to-date logs. The scripts in broker-logs.zip:ticket:30693 should still work with the current sanitized logs. We should potentially think about making the logs better reflect what we want to measure, because currently they require a fair bit of inference. Here's the main classification function from the process script that tries to regularize the log messages:
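The function itself isn't reproduced in this thread. As a purely hypothetical sketch of the kind of regularization it performs (the message patterns below are invented for illustration, not the broker's actual log lines):

```python
import re

# Hypothetical patterns mapping raw log messages to coarse event types.
# The real classification function in the process script uses the
# broker's actual message formats, which differ from these.
PATTERNS = [
    ("proxy-poll",  re.compile(r"proxy .* polls")),
    ("client-poll", re.compile(r"client .* requests")),
    ("match",       re.compile(r"matched .* with")),
]

def classify(message):
    """Return the event type of a raw log message, or None if it
    matches no known pattern."""
    for event, pattern in PATTERNS:
        if pattern.search(message):
            return event
    return None

print(classify("proxy abc123 polls for clients"))  # proxy-poll
```

The point of the ticket referenced below is to make the broker log events in a form that needs no such pattern-matching step.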
I made a ticket for this: legacy/trac#30830 (moved).
I'm trying to reason out the difference between analyzing the log output like we've done here and exporting formatted stats for the metrics like we're planning to do for the geoip metrics. Are there scenarios where it's better for us to analyze data in the way we've done here, or should we figure out how to make the graphs in legacy/trac#30693 (moved) with exported data?