As part of #4463 (moved), I'd like to have AWStats and Webalizer set up on a VM. They are both available in Debian and easy to set up. The 2010 logs are already sanitized, so we'll start with those. I'm thinking that stats.tpo/awstats and stats.tpo/webalizer will look nice.
Karsten is setting up a VM that sanitizes logs (so that we can include 2011 logs etc). Here's his plan for how everything's going to work:
The connection to the sanitizing VM will be a cronjob rsync'ing the sanitized logs in the same form as you find them in the tarballs. Both the AWStats and the Webalizer setup should be able to handle adding new sanitized log files and removing files older than, say, one week. We'll probably want to hold back the logs from a given day until that day is over (the sorting doesn't make much sense if we're sorting requests from just a few hours), so files shouldn't change once you get them.
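A minimal sketch of the "remove files older than one week" step, in case it helps whoever sets up the stats VM. The function name `prune_old_logs` and the one-week default are illustrative assumptions, not part of Karsten's plan. It also assumes the rsync cronjob preserves modification times (for example via `rsync -a`); otherwise the age would be measured from when a file arrived rather than from when it was written.

```python
import os
import time
from pathlib import Path


def prune_old_logs(log_dir, max_age_days=7, now=None):
    """Delete regular files in log_dir whose mtime is older than max_age_days.

    Returns the names of the removed files, sorted alphabetically.
    `now` is a Unix timestamp; it defaults to the current time and is
    only a parameter so the cutoff can be pinned when testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in sorted(Path(log_dir).iterdir()):
        # Only touch plain files; leave subdirectories alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed
```

Since files are not supposed to change once they have been delivered, deleting by age is safe to run from the same cronjob right after the rsync finishes.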
Why two VMs? Why not just sanitize the logs on the same VM, purge the unsanitized logs, and let webalizer/awstats run against them? My concern is we're creating lots of admin overhead for little gain. The webserver logs are already fairly sanitized. It's not the end of the world if they are leaked. And, we don't need to keep all of this data hanging around. The output of webalizer or awstats should be sufficient to see the necessary statistics. If the machine dies and we lose all of our logs, oh well. We don't use them now anyway. We're assigning too much value to these logs and the data contained within them.
Why webalizer AND awstats? We should choose one and be done with it.
And while I'm whining, stats.tpo is too generic, how about webstats or weblogs for the domain?
> Why two VMs? Why not just sanitize the logs on the same VM, purge the unsanitized logs, and let webalizer/awstats run against them? My concern is we're creating lots of admin overhead for little gain. The webserver logs are already fairly sanitized. It's not the end of the world if they are leaked. And, we don't need to keep all of this data hanging around. The output of webalizer or awstats should be sufficient to see the necessary statistics. If the machine dies and we lose all of our logs, oh well. We don't use them now anyway. We're assigning too much value to these logs and the data contained within them.
The idea was to sanitize web logs on the same VM that we use for sanitizing bridge descriptors and have a separate VM for running AWStats and/or Webalizer.
As for throwing away original logs, I don't mind as much. My idea was to test the sanitized log format and visualization a bit before throwing away original data. But I can give up on that idea. I wouldn't want to do this for bridge descriptors, but I don't care as much for web logs. Just don't be sad if we lack history to compare download numbers or anything over time.
> Why webalizer AND awstats? We should choose one and be done with it.
We're not sure yet if both of them work as expected. For example, Runa had difficulties with AWStats on AWS. It might be that we'll run into more problems when deploying this. Both tools work, so we can try deploying both, and if one fails, we take the other. Really, setting up these two tools on the same VM working with the same data set isn't the major effort here.
> And while I'm whining, stats.tpo is too generic, how about webstats or weblogs for the domain?
I created htdocs/awstats and htdocs/webalizer for the stuff we want to have on the web. Seems like webalizer needs to have /usr/share/GeoIP/GeoIP.dat before it'll parse any logs, though.
Reopening the ticket because we still need to configure awstats. See the "Apache2 Configuration" section on this page for a few hints. I also need to know where I should put the awstats database; the default is /var/lib/awstats.
Trac: Resolution: fixed to N/A Status: closed to reopened