2017-01: Add a visualization of Tor Browser downloads to Metrics. (Sponsor X 6.3. Metrics, part 1)
2017-01: Finalize visualizations of Tor Browser downloads on Metrics based on user feedback. (Sponsor X 6.3. Metrics, part 2)
We can re-use existing work from #20008 (moved) here where we defined a database schema and wrote an importer.
And we already have two prototypes (1 and 2) based on RStudio's Shiny, which we can't use here but which we can use for inspiration. To be clear, what we do here will be less shiny, because we're stuck with our not-very-interactive graphing engine. But we cannot solve that now.
I'll attach early results in a second.
I'm attaching a graph showing downloads and updates between January 10 and 13, 2017:
I believe that the fonts will be a bit nicer after deploying this on the metrics server. This is related to slightly different R versions. But that's not what this is about. This is just to show a sample of what kind of graph to expect here.
Please review my branch task-21236 which contains a new data-processing module "webstats" and a new graph "Tor Browser downloads and updates". That code produced the .csv and .png files above. It also adds a graph page to Tor Metrics.
One known issue that I didn't find time to fix yet is that locale detection is not always as precise as it could be. Example from the .csv file: 2017-01-10,tmup,o,,0.3.0b1-0.3.0b2-linux32-en-US.xml,,3
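For illustration, here is a hedged sketch of the kind of locale extraction involved. The class name, pattern, and extension list are invented for this example and are not the actual webstats code; the real regex has to handle many more resource-string shapes, which is where the imprecision comes from.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocaleSketch {

  // Hypothetical pattern: a locale such as "en-US" or "de" directly before
  // the file extension. Invented for this sketch only.
  static final Pattern LOCALE = Pattern.compile(
      "-([a-z]{2}(?:-[A-Z]{2})?)\\.(?:xml|exe|dmg)$");

  static String extractLocale(String resourceString) {
    Matcher matcher = LOCALE.matcher(resourceString);
    // Return the captured locale, or an empty string (the empty .csv column).
    return matcher.find() ? matcher.group(1) : "";
  }
}
```

With a sketch like this, the example resource string above would yield en-US instead of an empty locale column.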
Oh, and of course that code still contains an if block with the line // TODO take me out after testing, which we really should take out before merging. But it's useful for testing.
Ah, and there are no tests yet, but the code is in a relatively testable state. We could write tests similar to the ones for the "connbidirect" module.
A 'meta' question:
Shouldn't we use milestones for all Metrics Web and Atlas tickets related to Sponsor X? These will come in handy when writing the report in June.
As there are no releases yet, maybe milestones metrics-web-X16.17 and Atlas-X16.17?
Regarding your meta question, I was thinking of using the monthly team reports as input for the final Sponsor X report. And this ticket would certainly go into a monthly report as soon as the graph is ready. I agree that this wouldn't apply to all Atlas tickets, so my (secret) plan was to run a Trac query in June/July showing all tickets closed on the Sponsor X time frame and consider which to include in the report. I think that should cover most of what we did, and it takes less effort than manually tagging tickets now. What do you think, would that work?
And thanks in advance for looking at this ticket today. :)
The restructuring task (#19754 (moved)) hasn't begun yet, but new modules should still improve the situation a little.
tl;dr: Please see my branch with some suggestions.
It uses build-base.xml and adapts to our general project structure a little more:
- The test task is now part of build-base.xml.
- I added a testrun task and command-line parameters for integration testing.
- There is a database module, which might make testing easier and might be re-usable when doing #19754 (moved).
- The methods in Main are now accessible for unit tests.
Please reuse these, and maybe add some simple tests, if there is time?
Thanks for the quick response! And thanks a lot for making those build improvements.
I made some tweaks to your commits and pushed them to my task-21236-2 branch. I'd like to squash that commit and yours into my initial commit 597e715 at the end, if you don't mind. Unless there's a reason to keep them separate, in which case, please let me know.
But before we squash or don't squash anything, let's talk about the Database class. I like the idea of encapsulating database access---in theory. I even had a class similar to yours a day or two ago. But I took it out again, because I felt it only added complexity for no particular gain. Let me elaborate.
The Main class still imports types from java.sql, meaning that the Database class does not hide that we're talking to an SQL database via JDBC. And the methods in Database all contain just a single line of code or two, meaning that we're not really hiding much complexity. The reason is that JDBC is already an interface hiding away all the complexity that comes with talking to an SQL database.
So, let's talk about the purpose of having this separate Database class. I assume the main reason for having it is testing. Is it going to be easier to test the Database class than to test the single methods in modules that are related to database access? And maybe even more importantly, is it easier to mock the Database class than to provide a dummy database via a custom JDBC string and just test the methods contained in modules that way? If the answers are "yes and yes", would you mind writing a simple test for the new Database class, just to get us started?
Otherwise, would you mind if we postpone this change and keep all database methods in the Main class for now? It's easy to extract database functionality (and downloading functionality or .csv-file writing functionality) as part of the restructuring. Don't get me wrong, I'm a supporter of encapsulation if something is fully encapsulated and for a good reason. But I'm also a fan of self-contained and therefore often more readable code. (For example, do you think we'll remember in a year when we write the next module that we disable auto-commit when connecting to the database?!)
And here are some minor points that need discussion:
- If we keep the Database class, we should either move it to the org.torproject.metrics.webstats package, or start a shared/src/ directory for code that is supposed to be shared between modules. My favorite would be to start with the former and consider doing the latter once we have a second module using it.
- Should the arguments --filter and --db be switched around, so that people can specify their own database but omit the filter? Alternatively, we might want to say more clearly how to pass an argument for --filter that doesn't filter anything.
- I didn't run this code yet. Does writeStatistics() add a newline after writing the header?
- As you point out, we should write tests. I'm happy to do that, though I'm afraid the most interesting parts require mocking either the web server or the database. If you have suggestions for doing that, please let me know. Otherwise I'd start with some simple tests for parseLogLines() or writeStatistics().
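To illustrate the newline question about writeStatistics(): a minimal sketch of a header-then-rows writer. The column names here are invented for the example and are not necessarily the module's actual header.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.List;

public class CsvSketch {

  // Writer.write() does not append a line break on its own, so the header
  // needs an explicit "\n"; otherwise the first data row would be glued to it.
  static void writeStatistics(Writer out, List<String> rows) throws IOException {
    out.write("log_date,request_type,platform,channel,resource_string,locale,count\n");
    for (String row : rows) {
      out.write(row + "\n");
    }
  }
}
```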
And here's another remaining issue from above, listed for completeness:
One known issue that I didn't find time to fix yet is that locale detection is not always as precise as it could be. Example from the .csv file: 2017-01-10,tmup,o,,0.3.0b1-0.3.0b2-linux32-en-US.xml,,3
Introducing the Database class didn't cause any overwhelming changes, because Main is essentially functional programming. Encapsulating the database methods is a tiny first step here to unclutter Main for further refactoring, and, as you pointed out, for testing. As metrics-web is a 'living' project, I'd prefer these tiny steps that can be easily verified not to wreak havoc (until there are more tests :-).
(What is needed next are data objects. Off the top of my head, I'd say that with these data objects it is possible to abstract/encapsulate/reuse database persistence, file generation, and also R access to some extent (there is some Java integration for R). The actual redesign should of course be done by also considering all modules together in depth.)
Moving Database.java to shared/src is fine (thanks for pointing that out).
(A more radical idea: all Java code could be put into one source folder in the base project already. This would reduce complexity and prepare for adaptation to metrics-base.)
Command line args:
The suggestion I made was ad hoc, aiming at replacing the remove-after-tests approach with something lasting, because what is facilitated by that remove-after-tests code is useful. The command-line args were intended for testing, not at all as a user convenience. Thus, to bake this in more, there are two options:
- change the usage note so that these args are documented as testing-only, or
- add the code to allow the database URL as a single arg, too.
Tests:
The simple tests for parseLogLines() (or only the matchers used?) and writeStatistics() are sufficient and needed for later refactoring. (You're probably right that my code suggestion would miss a newline after the header in the .csv.)
I think the most pragmatic and efficient way will be to pick the code that produces vital output and write tests for it that can be re-used or simply adapted during refactoring. More complicated test coverage should be tackled after/while refactoring (when the need arises).
As usual, I'd love to write more code here, but currently it competes for time with my non-code writing tasks.
Regarding graph and web-site text:
The graph and data description should maybe mention that the logs used were cleaned to avoid any privacy issues, and also link to the logs?
Otherwise, people seeing these graphs might wonder how the particular data were gathered and worry about privacy.
(SQL queries: As there are so many pattern-matching queries, and in case these take really long, an operator class (varchar_pattern_ops, see the PostgreSQL manual), e.g., on resource_string, could improve the situation. Maybe useful once there are more logs to 'crunch'.)
Looks good to me. Personally, I would not worry so much about missing shinyness as long as the graphs get the numbers across, and that's what the one you attached does, IMO. The results still confuse me (like: we have 100,000 new downloads each day, but that does not show up in the update pings, which basically stay the same? We still have more than 120,000 update requests every day, even almost 6 weeks after the last release?). But that is probably another story. :)
@gk: Thanks for looking! I can't say much about those possibly confusing results. But I'm pretty sure that these particular questions (downloads not showing in update pings, lots of update requests weeks after the last release) will only be the first of a series of questions that we'll have to look into. I'd say let's deploy what we have now and discuss those questions when we have more data. They'll either turn into bugfixes or frequently asked questions. :)
@iwakeh: Thanks for all your input above! I incorporated some of your suggestions, discarded others (see below), and made a few more changes while running tests and reviewing my own code once more. Here's a summary of the changes:
After some back and forth I decided to postpone the Database class change. It needs more work before being actually useful, and I don't want to share code between modules that's not ready yet. Let's revisit this topic when we have enough time to fully encapsulate database functionality so that we can easily mock it.
I also took out the part where we're parsing command-line arguments and put in jdbc:postgresql:webstats as the database connection string. I think we can expect operators to make sure that the local user who runs the Java process has permission to connect to that database without requiring a password. And we can expect contributors to be able to edit the code and skip log files when developing.
I also tweaked the description a bit to avoid immediate concerns about privacy. Not that this would eliminate all concerns, but hey.
Optimizing SQL queries might indeed be useful, though it's not an issue yet. It probably helps that we have a separate resources table with unique resource strings. We should keep this in mind though if module performance deteriorates over time. (Such a thing has happened to other modules in the past, see https://lists.torproject.org/pipermail/tor-dev/2016-January/010139.html.)
I think I got the locale detection issues under control by tweaking regular expressions.
There, I think that addresses all suggestions and open issues. Please take a look at the fixup commit in my task-21236-2 branch, and please let me know whether that needs more work. After that, I'd want to squash, rebase, merge, deploy, and announce. Thanks!
Good to have some tests for a start!
Wondering if a test should make it explicit that the ending really doesn't matter, i.e., that SAMPLE_LOG_FILE_NAME could be almost anything and the tests would still pass?
Or should we rather change the matcher to only accept .xz files?
The result sets etc. could all be ensured to be closed by using try-with-resources (in most cases w/o any catch).
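A small self-contained sketch of the suggested pattern: since a real Statement/ResultSet needs a database, this uses stand-in AutoCloseable objects (invented for this example) to show that resources are closed automatically, in reverse order of opening, with no catch clause required.

```java
public class TryWithResourcesDemo {

  static final StringBuilder LOG = new StringBuilder();

  // Stand-in for a JDBC Statement or ResultSet; records open/close events.
  static class Tracked implements AutoCloseable {
    final String name;

    Tracked(String name) {
      this.name = name;
      LOG.append("open:").append(name).append(' ');
    }

    @Override
    public void close() {
      LOG.append("close:").append(name).append(' ');
    }
  }

  static String demo() {
    LOG.setLength(0);
    // Both resources are closed automatically when the block exits,
    // even if an exception is thrown, in reverse order of opening.
    try (Tracked statement = new Tracked("statement");
        Tracked resultSet = new Tracked("resultset")) {
      LOG.append("use ");
    }
    return LOG.toString();
  }
}
```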
The extended explanation will be useful, either by preventing questions or by being the place to point to when questions come up.
In general, metrics-web still needs quite some work. But that'll be other tickets.
> Good to have some tests for a start!
> Wondering if a test should make it explicit that the ending really doesn't matter, i.e., that SAMPLE_LOG_FILE_NAME could be almost anything and the tests would still pass?
> Or should we rather change the matcher to only accept .xz files?
Changing the matcher wouldn't work, because we're using it for .xz files as well as subdirectories. Added a test to point this out.
> The result sets etc. could all be ensured to be closed by using try-with-resources (in most cases w/o any catch).
Changed them all.
> The extended explanation will be useful, either by preventing questions or by being the place to point to when questions come up.
Hopefully!
> In general, metrics-web still needs quite some work. But that'll be other tickets.
Absolutely! Let's keep this in mind for the next funding proposal.
I just ran into this bug while importing logs on the Metrics server:
[java] 18:58:10.454 [main] DEBUG org.torproject.metrics.webstats.Main - Importing log file https://webstats.torproject.org/out/archeotrichon.torproject.org/dist.torproject.org-access.log-20151219.xz.
[java] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[java]     at java.util.Arrays.copyOfRange(Arrays.java:2694)
[java]     at java.lang.String.<init>(String.java:203)
[java]     at java.io.BufferedReader.readLine(BufferedReader.java:349)
[java]     at java.io.BufferedReader.readLine(BufferedReader.java:382)
[java]     at org.torproject.metrics.webstats.Main.downloadLogFile(Main.java:219)
[java]     at org.torproject.metrics.webstats.Main.importLogFiles(Main.java:175)
[java]     at org.torproject.metrics.webstats.Main.main(Main.java:94)
[java] Java Result: 1
That log file is 282K compressed and 1.0G uncompressed. Looks like we'll have to try harder not to keep the whole file in memory at once. I'm too tired to look into this today, but I'll try to fix it tomorrow, looking in particular at how we solved this in the ad-hoc analysis of webstats in #20008 (moved).
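For reference, a hedged sketch of such a streaming approach (class and method names are invented, and it counts plain lines instead of parsed log entries): aggregate while reading, so memory use is bounded by the number of distinct keys rather than by the uncompressed file size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class StreamingImport {

  // Processes one line at a time and never accumulates the whole file.
  // In the real importer the key would be something like (date, resource),
  // parsed from each log line; here we count whole lines for brevity.
  static Map<String, Long> countRequests(BufferedReader reader) throws IOException {
    Map<String, Long> counts = new HashMap<>();
    String line;
    while ((line = reader.readLine()) != null) {
      counts.merge(line, 1L, Long::sum);
    }
    return counts;
  }
}
```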
The reasoning behind this new branch is that it is better to make this change now (while we are both reading this code again and again) than in several months. And it will then be easier to integrate into a new metrics-web code structure.
Using my branch I imported a lot (though not all) of the logs and found one more bug:
The matcher can match without having found the wanted number of groups, which led to an addition in LogMap.accept(); other than that, the code does as before.
LogDb handles all db access, DirectoryListing does all the URL harvesting, LogMap parses log lines and counts, and Main controls only the program's flow.
Thanks for refactoring this code. I'll have to look at your patch more closely. But I'd rather want to do that after publicly deploying this new graph. On a first look I don't see any harm in splitting up existing code into multiple classes. But I also don't see a clear benefit that would urge us to make the change immediately. Maybe I'm just hesitant, because I don't know yet how the future Metrics website will use databases for aggregation, what data format it will use for graph or table data, whether metrics-lib will support webstats logs, whether CollecTor will collect and serve webstats logs, and so on. Any efforts we make in refactoring this code might be wasted if we make major changes in a few months. Maybe we can combine discussing this specific code with thinking about the vision of a future Metrics website. We can start doing that next week (or better: next month), ideally after the Metrics and the Onionoo/Atlas deliverables are completed and invoiced.
Regardless of the above, can you extract that bugfix you mentioned into a separate commit that I can cherry-pick? Or can you describe it in more detail, so that I can fix it here?
9341942 fixes a syntax error in metrics.json that broke the webapp.
Oh, the "Tor Browser downloads and updates" graph is now deployed, but not yet linked from the homepage. Please give it a try and see if anything is broken or unclear.
> Thanks for refactoring this code. I'll have to look at your patch more closely. But I'd rather want to do that after publicly deploying this new graph. On a first look I don't see any harm in splitting up existing code into multiple classes. But I also don't see a clear benefit that would urge us to make the change immediately.
The point is to have different objects/classes with clear interaction/cooperation that otherwise hide the details of things that are unimportant to the caller. The LogMap class 'knows' which log lines it counts, DirectoryListing the wanted URLs, etc.
> Maybe I'm just hesitant, because I don't know yet how the future Metrics website will use databases for aggregation, what data format it will use for graph or table data, whether metrics-lib will support webstats logs, whether CollecTor will collect and serve webstats logs, and so on. Any efforts we make in refactoring this code might be wasted if we make major changes in a few months.
This effort will pay off exactly when these future changes are to be made. Then all code is in different classes with a clear contract to the outside and all dirty implementation details hidden. Modularization is about making future changes easier to accomplish.
> Maybe we can combine discussing this specific code and thinking about the vision of a future Metrics website. We can start doing that next week (better: next month), ideally after the Metrics and the Onionoo/Atlas deliverables are completed and invoiced.
I didn't intend to start a lengthy re-coding discussion, just to raise concerns when a new module is implemented in 'the old' way.
We can just have a short session about these topics at the Berlin meeting, maybe?
> Regardless of the above, can you extract that bugfix you mentioned into a separate commit that I can cherry-pick? Or can you describe it in more detail, so that I can fix it here?
Compare this line with this one; unfortunately, I couldn't catch the log line causing the error:
- if (!logLineMatcher.matches()) {
+ if (!logLineMatcher.matches() && logLineMatcher.groupCount() != 3) {
> Oh, the "Tor Browser downloads and updates" graph is now deployed, but not yet linked from the homepage. Please give it a try and see if anything is broken or unclear.
Secret preview looks fine.
As the section "Disagreement among the directory authorities (deprecated)" is just above: shouldn't it be removed now?
Let's have the coding discussion in Berlin. I'm not opposed to the refactoring you propose. I'm mostly concerned about the timing and about not knowing in which direction we'll be heading with metrics-web. And if we do a first step in some direction, that might limit or at least influence our future choices, which I'd want to avoid. But let's talk more about this in Berlin while we all still remember this code. :)
What we should talk about, though, is the log-line matcher issue you describe above. I believe there's no bug! Our LOG_LINE_PATTERN cannot match a log line while leaving any of its capturing groups unfilled. Note that there are four capturing groups in the pattern, none of which are optional, but of which we only use three. I believe that logLineMatcher.groupCount() != 3 always returns true, because there are four capturing groups regardless of the input, so your code change doesn't actually change behavior there. What we can do is turn the fourth, unused capturing group into a non-capturing group (see 57edde6 in my task-21236-4 branch). But that's just a tweak, not a bugfix. If I overlooked a bug here, please let me know!
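A quick check of that claim (a self-contained example, not the webstats pattern itself): Matcher.groupCount() is determined by the compiled pattern alone, so only switching a group to non-capturing changes the count.

```java
import java.util.regex.Pattern;

public class GroupCountDemo {

  // groupCount() reports the number of capturing groups in the pattern,
  // independent of the input and of whether a match succeeded.
  static int groups(String regex) {
    return Pattern.compile(regex).matcher("anything").groupCount();
  }
}
```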
Alright, I'll merge and deploy later today (assuming that these trivial two changes don't require much review). And I'll write a short announcement to tor-project@ that this graph now exists. But I'll hold back the blog post until we have a more thorough analysis of this graph, per suggestion from yesterday's Vegas meeting.
Maybe off topic? Does it matter that the Linux sandbox behavior is totally different from everyone else, because it uses a different implementation to download, update check, and update? I suspect the user base isn't that big right now, but I'd hate to throw stats off...
nb: I like how it works right now and I don't see that changing for the sake of metrics...
> [snip] But let's talk more about this in Berlin while we all still remember this code. :)
All fine, I understand the timing issue.
> What we should talk about though is the log-line matcher issue you describe above. I believe there's no bug! [...] What we can do is turn the fourth, unused capturing group into a non-capturing group (see 57edde6 in my task-21236-4 branch). But that's just a tweak, not a bugfix. If I overlooked a bug here, please let me know!
Thanks for the analysis!
Maybe the import was tripped up by a huge fourth group? But actually, if nothing showed up in your import logs, I would just keep the code that made the import work. There would also be a test missing if the tweak were added to the codebase. (Those regex issues are always troublesome.)
> [snip]
> Alright, I'll merge and deploy later today (assuming that these trivial two changes don't require much review). And I'll write a short announcement to tor-project@ that this graph now exists. But I'll hold back the blog post until we have a more thorough analysis of this graph, per suggestion from yesterday's Vegas meeting.
Okay, I reverted the regex change, because the import did indeed run successfully without that change. And I pushed the other changes as reviewed earlier.