VisiTor
     or: a script to tell you how many of your users are probably Tor users

---------------------------------------------------------------------------

Introduction:

Some individuals and organizations wonder how many of the people visiting
their web server are using Tor to do so.

We can answer this question by comparing the web server logs containing
client IP addresses and timestamps with our bulk exit list archives.
Obviously, most people don't want to give away their web server logs,
which is why we made the exit list archives and this parsing script
available so that people can run the comparison themselves.

Note that the approach of comparing exit lists with web server logs may
lead to a small false positive rate: when the IP address of an exit relay
is assigned to someone else, and that person contacts the web server
within 24 hours after the address change, this script would falsely count
that person as a Tor user. We could improve accuracy by processing more data
about the running relays and their exit policies. But in general, the
approach taken here should be sufficient for statistics. If you need to be
more certain about a specific IP address being a relay at a specific time,
you should look at:

  https://metrics.torproject.org/exonerator.html

This script consists of a Java or Python part and an R part. The Java or
Python part parses a web server log and the downloaded exit list archives
and writes daily statistics on requests by Tor users to disk. It further
detects user-agent strings used by different Torbutton versions to count
potential Torbutton users over Tor. The optional R part can be used to
visualize the results.
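
The comparison described above can be illustrated with a minimal Python 3
sketch. It is not taken from VisiTor.java or visitor.py: the names
is_tor_request, daily_counts and exit_sightings are hypothetical, the
in-memory data stands in for parsed exit lists and log lines, and the exact
24-hour matching rule is an assumption based on the note about false
positives.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: an address counts as a Tor exit if an exit list recorded it
# at most 24 hours before the request was made.
WINDOW = timedelta(hours=24)


def is_tor_request(ip, when, exit_sightings):
    """Return True if ip was listed as an exit at most WINDOW before when."""
    for seen in exit_sightings.get(ip, ()):
        if timedelta(0) <= when - seen <= WINDOW:
            return True
    return False


def daily_counts(requests, exit_sightings):
    """Map each date to [total requests, requests from likely Tor exits]."""
    counts = defaultdict(lambda: [0, 0])
    for ip, when in requests:
        day = counts[when.date()]
        day[0] += 1
        if is_tor_request(ip, when, exit_sightings):
            day[1] += 1
    return dict(counts)


# Hypothetical sample data using documentation-only address ranges.
exit_sightings = {"192.0.2.10": [datetime(2010, 8, 1, 12, 0)]}
requests = [
    ("192.0.2.10", datetime(2010, 8, 1, 18, 0)),   # within 24 hours
    ("192.0.2.10", datetime(2010, 8, 3, 18, 0)),   # listed too long ago
    ("198.51.100.7", datetime(2010, 8, 1, 9, 0)),  # never listed
]

for day, (total, tor) in sorted(daily_counts(requests, exit_sightings).items()):
    print(day, total, tor)
```

A linear scan over the sightings of one address is enough for a sketch;
for large exit list archives a sorted list with binary search would
probably be the better choice.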

---------------------------------------------------------------------------

Java Quick Start:

In order to run this script, you need to install and download the
following software and data (please note that all instructions are written
for Linux and Mac OS X; commands for Windows may vary):

- Install Java 6 or higher. (See Section 2.2 in
  https://gitweb.torproject.org/metrics-db.git/blob_plain/HEAD:/doc/manual.pdf
  for instructions to install Java 6 on Debian Lenny.)

- Download the exit list archives of the relevant time from
  https://metrics.torproject.org/data.html#exitlist and extract them to a
  directory in your working directory, e.g. /home/you/visitor/exitlists/ .
  Note that as of August 2010, one month of exit lists is 20M compressed
  and 168M uncompressed.

- Put your .gz-compressed or decompressed web server log in your working
  directory, too, e.g. /home/you/visitor/access_log.gz .

  (The log file is expected to use Apache's combined log format:
  "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""; so if
  you changed the log format, be sure to transform your log to said format
  before passing it to VisiTor!)

- Compile the (single) Java class using this command:

  $ javac VisiTor.java

- Run the Java application, providing it with the parameters it needs.
  Passing '-' (without quotes) as web server log file name means that the
  web server log will be read from stdin. Note that the fourth parameter
  that writes out the server log part with Tor user requests is optional:

  java VisiTor <web server log> <exit list directory> <output file>
       [<server log part with Tor user requests>]

  Sample invocations might be:

  $ java VisiTor access_log exitlists/ out.csv tor_access_log

  $ java VisiTor access_log.gz exitlists/ out.csv tor_access_log

  $ gunzip -c access_log.gz | java VisiTor - exitlists/ out.csv \
        tor_access_log

- Find the results in /home/you/visitor/out.csv in a format that can be
  imported by any spreadsheet application like OpenOffice.org Calc or
  processed by R.
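
If you are unsure whether your log matches the combined log format
mentioned above, a short check like the following may help. It is a
Python 3 sketch, not the parser that VisiTor uses; the COMBINED pattern and
the parse_line helper are hypothetical, and the sample line is made up.

```python
import re
from datetime import datetime

# Pattern for Apache's combined log format:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
# Quoted fields may contain backslash-escaped characters.
COMBINED = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>(?:[^"\\]|\\.)*)" "(?P<agent>(?:[^"\\]|\\.)*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line is not combined format."""
    match = COMBINED.match(line.rstrip("\n"))
    if match is None:
        return None
    fields = match.groupdict()
    # %b (month name) is locale-dependent; this assumes English month names.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields


# Made-up sample line with a documentation-only address.
sample = ('192.0.2.10 - - [01/Aug/2010:18:00:00 +0200] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://www.example.com/" "Mozilla/5.0 (example)"')

fields = parse_line(sample)
print(fields["host"], fields["time"].isoformat(), fields["agent"])
```

Lines for which parse_line returns None most likely use a different log
format and would need to be transformed first.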

---------------------------------------------------------------------------

Python Quick Start:

The Python script is a port of the original implementation in Java to
Python. In order to run this, you need a decently up-to-date version of
Python (tested with Python 2.4 and 2.7), which you can check by typing
`python -V` in the terminal.

- Just as for the Java version, you need to download the exit list
  archives from https://metrics.torproject.org/data.html#exitlist

- Once you have extracted the archives, you can run the script by typing

  python visitor.py <access log> <exit list> [<output file>]

  where <access log> is the name of the Apache access log you wish to
  analyze, <exit list> is the uncompressed exit list folder, and <output
  file> is the optional argument specifying the name of the output file.
  If <output file> is missing, the program outputs the result to stdout.
  The program also writes warnings and other messages to stderr.

  Unlike the Java version, which accepts a gzipped access log, the Python
  version currently needs <access log> to be *uncompressed*.

  Suppose you have an Apache log named access_log and an exit list called
  exit_list, and wish to publish the statistics to a file named out.csv.
  Then, type

  $ python visitor.py access_log exit_list out.csv

  or

  $ python visitor.py access_log exit_list > out.csv
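
Because the Python version expects an uncompressed log, a gzipped log has
to be decompressed first. One way to do that without leaving Python is
sketched below (Python 3). The gunzip helper is hypothetical and not part
of visitor.py, and the file names are only examples.

```python
import gzip
import shutil


def gunzip(src, dst):
    """Decompress the gzip file src into the plain file dst."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        # Copy in chunks so large logs are never held in memory at once.
        shutil.copyfileobj(compressed, plain)


# Example with hypothetical file names:
# gunzip("access_log.gz", "access_log")
```

The resulting access_log can then be passed to visitor.py as shown above.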

---------------------------------------------------------------------------

R Quick Start:

- Install R 2.8 and ggplot2 0.8.8 or higher. (See Section 2.4 in
  https://gitweb.torproject.org/metrics-db.git/blob_plain/HEAD:/doc/manual.pdf
  for instructions to install R 2.8 and ggplot2 0.8.8 on Debian Lenny.)

- If you chose a filename other than out.csv above, edit plot.R to read
  the correct file.

- Run the script with this command:

  $ R --slave < plot.R

- Find the generated graph in /home/you/visitor/visitors.png .

