The goal is to set up web server log analysis for daily processing of all web server logs (www, blog, trac, and so on). This all started with #2489 (moved), "Set up new web server logging and log analysis infrastructure."
Karsten is analyzing and sanitizing the logs that we have, and I'm going to research tools we can use to display the logs in a user-friendly way.
Suggested tools are AWStats and Piwik. I'm sure there are more tools out there as well.
Changing this ticket to a project and adding it to the sponsor Z milestone that's due on December 31, 2011, so that it appears on the sponsor Z deliverable list.
Also note that Webalizer is another web log analysis tool we could use. We even have preliminary results generated by Webalizer.
Trac: Milestone: N/A to Sponsor Z: December 31, 2011; Type: task to project
I looked at four different web log analysis tools; here's what I found:
Piwik looks great, but is not available in Ubuntu or Debian. Setting it up manually is pretty straightforward, but you cannot import Apache logs without using a third-party script. Last time I checked, that third-party script had some issues with our sanitized log format.
AWStats is easy to set up and easy to use, but incredibly slow when importing logs. I set up AWStats on an Ubuntu EC2 instance and pulled the sanitized logs for January and February 2010 (you only get 8 GB of storage). The import of wiki.torproject.org-access.log was pretty quick, and we have some preliminary results. However, the import of www.torproject.org-access.log never completed. Maybe it's because I tried to do all this in the cloud, or maybe it's just AWStats.
Webalizer is just as easy to set up and use as AWStats. It doesn't look as pretty, but it's a lot faster at importing existing logs. I managed to set it up and import the Jan+Feb www.torproject.org-access.log without any problems.
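In case anyone wants to reproduce this, a minimal Webalizer run looks roughly like the sketch below; the paths are illustrative assumptions, not my actual setup.

```
# Rough sketch of a minimal Webalizer run; paths are illustrative
# assumptions, not the actual setup.
# -n sets the hostname shown in the report, -o the output directory,
# -p enables incremental mode so later runs pick up where they left off.
webalizer -n www.torproject.org -p \
    -o /var/www/webalizer \
    /var/log/apache2/www.torproject.org-access.log
```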
Splunk was recommended to me by someone on Twitter, so I figured I'd look into it. The free version of Splunk only lets you index 500 megabytes of data per day; we probably want more than that.
Another option is to write our own parser and use R to create graphs similar to what we have on metrics.tpo. Writing our own parser will take some time, so maybe we should just go with Webalizer for now.
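To give an idea of what "our own parser" would mean: a first cut could be as small as the awk sketch below, which assumes Apache combined log format and prints per-day request counts as CSV for R to chart. The file name and field handling are assumptions for illustration.

```
# Minimal sketch of a hand-rolled log aggregator, assuming Apache combined
# log format; prints "day,requests" lines that R (or gnuplot) can chart.
awk '{
    # Field 4 looks like "[01/Jan/2010:00:00:00"; strip the bracket and time.
    split($4, ts, ":")
    day = substr(ts[1], 2)
    count[day]++
}
END { for (d in count) print d "," count[d] }' \
    www.torproject.org-access.log | sort
```

Note that even this toy version has rough edges (dd/Mon/yyyy dates sort lexically, not chronologically), which hints at why writing our own is more work than it first appears.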
I'm curious why AWStats takes so long to process our sanitized logs. This isn't only relevant for picking a web log analysis tool for ourselves, but also for providing logs that will be useful for others.
I wonder if AWStats is confused that all requests come in at 00:00:00 and from either 0.0.0.0 or 0.0.0.1. Maybe we can teach it not to look at these data fields to reconstruct user sessions.
Can you paste your AWStats config and a little howto somewhere? I'd like to try it on my local Debian machine with 500 GiB disk space.
AWStats didn't have any problems with the log for the wiki. I haven't spent too much time looking into fine-tuning AWStats to ignore certain fields, though.
The Ubuntu AWStats howto covers the basics. I have attached the (almost default) config I used for the www log.
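For anyone following along without the attachment, the directives that matter boil down to something like the sketch below; the values here are illustrative assumptions, and the attached file is authoritative.

```
# Sketch of the relevant awstats.conf directives; values are illustrative
# assumptions -- see the attached config for what was actually used.
LogFile="/var/log/apache2/www.torproject.org-access.log"
LogFormat=1                  # 1 = Apache combined log format
SiteDomain="www.torproject.org"
HostAliases="torproject.org www.torproject.org"
DNSLookup=0                  # sanitized addresses make reverse lookups pointless
DirData="/var/lib/awstats"
```

The import itself is then one update run per config, e.g. `/usr/lib/cgi-bin/awstats.pl -config=www.torproject.org -update` on Debian/Ubuntu.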
So, I ran AWStats tonight to parse the 2010 logs. It took 3 hours and 38 minutes, which is reasonable, IMO. We only have to import a few years of logs once, and if that takes 12 hours, that's fine.
I have no idea why it took forever for you, though. Maybe it was a problem with available disk space, who knows.
Cool, can you make the results available? That means we have two options: AWStats and Webalizer. Which one's your favorite? Or should we roll our own solution?
How would I do that? I don't think the results are written to static HTML files, are they? Can I send you a tarball of some directory (that you're going to tell me), and you make the shiny stats available?
Can we run both? I don't know if we'll run into problems with the daily (?) updates once we have them. I'm waiting for our VM server to return, and then I'm going to set up the sanitizing code. The next step will be to set up a VM with AWStats and/or Webalizer. I could imagine we might run into other problems then, so I'd rather not exclude either tool yet.
If we can avoid it, let's avoid writing something ourselves at this point.
No static HTML as far as I know. You'd need to run apache2 on the same host. If you can create a tarball of the following directories (I don't think all of them are necessary, but hey), I'll make the stats available on the EC2 server:
I don't see a problem with running both, so yes. A lot of people run both because they like something from AWStats that isn't available in Webalizer and vice versa. Should we set up the web log analysis tools on the same VM as the one sanitizing the logs, or should we get a new one?
Sounds good.
I think we should get a new one. The VM that has non-sanitized logs shouldn't run a web server. It will also be the VM that sanitizes bridge descriptors.
If you want to start setting up the VM with AWStats and Webalizer, please don't wait for me to set up the VM that sanitizes logs. The 2010 logs should be sufficient to get something running. The connection to the sanitizing VM will be a cronjob rsync'ing the sanitized logs as you find them in the tarballs. Both the AWStats and the Webalizer setup should be able to handle adding new sanitized log files and removing files older than, say, one week. We'll probably want to keep back the logs from a given day until that day is over (the sorting doesn't make much sense if we're sorting requests from just a few hours), so files shouldn't change once you get them.
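To make the pull side concrete, here's a rough sketch of what that cronjob could look like; the host name, rsync module, and paths are made up for illustration.

```
#!/bin/sh
# Hypothetical /etc/cron.daily/pull-sanitized-logs; host, module, and paths
# are illustrative assumptions, not the actual setup.
# Files never change once published, so rsync only fetches new ones.
rsync -a logsanitizer.example.org::sanitized-logs/ /srv/weblogs/incoming/
# Drop sanitized logs older than one week, per the retention plan above.
find /srv/weblogs/incoming/ -type f -mtime +7 -delete
```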
Ok, I have requested a VM for AWStats and Webalizer in #4634 (closed).
We've got AWStats and Webalizer running, and we're working on adding logs for more torproject.org domains. Closing this ticket now; feel free to reopen if anything comes up that's related to the web log analysis tools.
Trac: Status: assigned to closed; Resolution: N/A to fixed