How to run the Snakes on a Tor Exit Scanner


I. Introduction

The Snakes on a Tor Exit Scanner scans the Tor network for misbehaving
and misconfigured exit nodes. It has several tests that it performs,
including HTML, JavaScript, arbitrary HTTP, SSL and DNS scans. The
mechanisms by which these scans operate will be covered in another
document. This document concerns itself only with running the scanner.


II. Prerequisites

Python 2.5+
Tor 0.2.2.x
py-openssl/pyOpenSSL
Bonus: Secondary external IP address

Having a second external IP address will allow your scanner to filter
out false positives for dynamic pages that arise due to pages encoding
your IP address in documents.


III. Setup

A. Compiling Tor

To run SoaT you will need Tor 0.2.2.x.

It is also strongly recommended that you have a custom Tor instance that
is devoted only to exit scanning, and is not performing any other
function (including serving as a relay or a directory authority).


B. Configuring SoaT for Randomized Testing

If you just want to run a simple static test, skip this section.

To configure SoaT you will need to edit soat_config.py.

In particular, you'll want to change 'refetch_ip' to be set to your 
secondary IP address. If you don't have a secondary IP, set it to None.

If you're feeling ambitious, you can edit soat_config.py to change the
set of 'scan_filetypes' and increase 'max_content_size' to something
large enough to support these filetypes. However, you should balance
this with our more immediate need for the scanner to run quickly so that
the code is exercised and can stabilize quickly.
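
As a rough illustration, the settings mentioned so far might look like
this in soat_config.py. The setting names come from this document; the
values (a placeholder address from the 192.0.2.0/24 documentation range,
the filetype list, the size limit) and their exact types are assumptions,
so check them against the defaults shipped in your copy of the file.

```python
# Sketch of the relevant soat_config.py settings. The names are taken
# from this document; every value below is an illustrative assumption.

# Secondary external IP used for refetching pages, or None if you do
# not have one.
refetch_ip = "192.0.2.10"   # placeholder address, replace with your own

# Filetypes to scan (assumed here to be a list of file extensions).
scan_filetypes = ["pdf", "doc", "html"]

# Largest document to fetch (assumed here to be a size in bytes).
max_content_size = 256 * 1024
```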

If you plan on doing search-based tests, you'll also want to edit
./wordlist.txt and change its contents to be a smattering of random
and/or commonly censored words. If you speak other languages (especially
any that have unicode characters), using keywords from them would be
especially useful for testing and scanning. Note that these queries WILL
be issued in plaintext via non-Tor, and the resulting urls fetched via
non-Tor as well, so bear that and your server's legal jurisdiction in
mind when choosing keywords.
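
If you maintain your keywords elsewhere, a small helper along these lines
can write them out. It is a standalone sketch that is independent of SoaT
and assumes a current Python 3 interpreter. The function name
write_wordlist is hypothetical, and the one-term-per-line UTF-8 layout is
an assumption about what wordlist.txt should contain, not something this
document specifies.

```python
def write_wordlist(path, words):
    """Write search terms to path, one per line, encoded as UTF-8.

    Blank entries and duplicates are dropped and first-seen order is
    kept. The one-term-per-line layout is an assumption.
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("".join(term + "\n" for term in cleaned))
    return cleaned
```

For example, write_wordlist("./wordlist.txt", ["tor", "privacy", "tor"])
would leave two lines in the file.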

You can also separate out the wordlist.txt file into three files by
changing the soat_config.py settings 'ssl_wordlist_file',
'html_wordlist_file', and 'filetype_wordlist_file'. This will allow
you to use separate keywords for obtaining SSL, HTML, and Filetype
urls. This can be useful if you believe it likely for an adversary to
target only certain keywords/concepts/sites in a particular context.
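
A split configuration might look like the following fragment. The three
setting names are the ones listed above; the file names are hypothetical
placeholders, and treating the values as relative paths is an assumption.

```python
# Sketch for soat_config.py: one wordlist per URL type.
# The file names below are hypothetical; use whatever paths suit you.
ssl_wordlist_file = "./wordlist_ssl.txt"
html_wordlist_file = "./wordlist_html.txt"
filetype_wordlist_file = "./wordlist_filetype.txt"
```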

You can edit the contents of the wordlist files while SoaT runs. It will
pick up the changes after it completes a full network scan with the old
list.


IV. Running Tor and SoaT

Once you have everything compiled and configured, you should be ready to
run the pieces. You probably want to do this as a separate, unprivileged
user.

First, start up your custom Tor with the sample torrc provided in the
TorFlow svn root:

# ~/src/tor-git/src/or/tor -f ./data/tor/torrc &

Now you're ready to run SoaT. The next section describes SoaT's different
tests and operating modes, but if you'd like to get started immediately,
you may choose to run a search based SSL and HTTP test. These are the most
complete of the currently implemented tests, and have very low false
positive rates.

# ./soat.py --ssl --http >& ./data/soat.log &

or

# ./soat.py --ssl --target=ip:port >& ./data/soat.log &


V. Tests and Operating Modes

Currently, SoaT's most developed tests are those for SSL and HTTP requests,
but SoaT is also capable of doing HTML and DNS Rebind tests. Any
combination of these tests may be performed during a SoaT run, although
DNS Rebind requires at least one other test to be performed in parallel. To
enable a test, simply pass SoaT its flag: --ssl, --http, --html, or
--dnsrebind.

By default the tests are run in search based mode. This means that the URLs
to be requested during the run are gathered by querying search engines for
the terms in your ./wordlist.txt file.

An alternative, and potentially less false positive prone, operating mode is
the fixed target mode. Fixed target mode is enabled by passing SoaT one or
more --target=<URL> flags. Only the URLs referenced by the target flags will
be requested. This operating mode has several attractive features. For
instance, you can reduce false positive rates by selecting static content, and
you can shorten the duration of runs by selecting small files on highly
responsive servers.
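
If you launch SoaT from a wrapper script, the rules in this section can
be sketched as a small argument builder. This is only an illustration:
the function name build_soat_command and the example URL are
hypothetical, and the checks simply restate what is described above (the
four test flags, repeated --target flags, and DNS Rebind needing a
companion test).

```python
VALID_TESTS = ("ssl", "http", "html", "dnsrebind")


def build_soat_command(tests, targets=()):
    """Return the argument list for a SoaT run.

    tests   -- test names without leading dashes, e.g. ["ssl", "http"]
    targets -- URLs for fixed target mode; empty means search based mode
    """
    if not tests:
        raise ValueError("select at least one test")
    for name in tests:
        if name not in VALID_TESTS:
            raise ValueError("unknown test: %s" % name)
    # DNS Rebind needs at least one other test running alongside it.
    if "dnsrebind" in tests and len(set(tests)) < 2:
        raise ValueError("dnsrebind requires at least one other test")
    argv = ["./soat.py"]
    argv.extend("--%s" % name for name in tests)
    argv.extend("--target=%s" % url for url in targets)
    return argv


# Fixed target run against a single (placeholder) URL.
print(" ".join(build_soat_command(["ssl", "http"], ["https://example.com/"])))
```

The resulting list can be handed to subprocess.Popen, with the log
redirection handled by the caller.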

It should be noted that, despite their attractive features, fixed target
scans are likely to miss many of the results which search based scans
detect. Principally this is because it is difficult as a SoaT operator
to pick a diverse set of targets. Consider, for example, that if in
selecting your targets you neglect to include a site on one of OpenDNS'
blacklists, then you're going to miss one of the most common configuration
issues that SoaT detects. Another issue is that it's quite likely some
malicious exit nodes limit their activity to a small set of sites. As
such, any reduction in your search space limits the likelihood that you'll
make a request through such an exit which triggers its malicious behavior.


VI. Monitoring and Results

A. Issues with automated search engine queries and Randomized Scans

SoaT can use Ixquick, Google, or Yahoo to perform its search queries. The
current default is Ixquick, and for most purposes this should be fine. If
you do find that you're having trouble discovering URLs (particularly for
the SSL test), then you may wish to switch to Google or Yahoo. To do so, 
open your soat_config.py and change:

default_search_mode = ixquick_search_mode

to

default_search_mode = google_search_mode

or

default_search_mode = yahoo_search_mode

Regardless of the engine used, you'll need to keep an eye on the beginning
of the soat.log to make sure it is actually retrieving URLs. Google's
servers can periodically decide that you are not worthy to query them,
especially if you restart SoaT several times in a row. If this happens,
you'll need to temporarily switch search engines.

Be warned that the Yahoo search mode is not acceptable for conducting SSL
tests as Yahoo lacks the necessary query terms. If neither Ixquick nor
Google is working, you'll either need to stop your SSL tests or
switch to a fixed target scan.

B. Handling Crashes

At this stage in the game, your primary task will be to periodically
check the scanner for exceptions and hangs. For that you'll just want
to tail the soat.log file to make sure it is putting out recent loglines
and is continuing to run. If there are any issues, please mail me your
soat.log.
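
If you would rather not watch the log by hand, the "is it still writing?"
check can be approximated by looking at the file's modification time.
The sketch below is a standalone helper that assumes a current Python 3
interpreter; the function names and the ten-minute threshold are
assumptions chosen for illustration, and a quiet but healthy scanner
could exceed that threshold.

```python
import os
import time


def seconds_since_last_write(path):
    """Return how long ago (in seconds) the file was last modified."""
    return time.time() - os.path.getmtime(path)


def log_is_stale(path, max_age_seconds=600):
    """Return True if nothing has been written to path recently.

    The default of 600 seconds is an arbitrary starting point.
    """
    return seconds_since_last_write(path) > max_age_seconds
```

Calling log_is_stale("./soat.log") from a periodic job gives a cheap hint
that the scanner may have hung; it does not replace reading the log for
exceptions.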

If/When SoaT crashes, you should be able to resume it exactly where it
left off with:

# ./soat.py --resume=-1 [other options you used last time] >& soat.log &

Keeping the same options during a --resume is a Really Good Idea.

SoaT actually saves a snapshot to a unique name each time you run it
without --resume, so you can suspend and resume arbitrary runs by
specifying their number:

# ls ./data/soat/
# ./soat.py --resume=2 --ssl --html --http --dnsrebind >& soat.log &

Using --resume=-1 indicates that SoaT should resume its most recent run.

C. Handling Results

As things stabilize, you'll want to begin grepping your soat.log for
ERROR lines. These indicate serious scanning errors and content
modifications. There will likely be false positives at first, and these
will require you to tar up your ./data directory and soat.log and send
them to me so that the filters can be improved:

# tar -jcf soat-data.tbz2 ./data/soat ./soat.log
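
The grep for ERROR lines mentioned above can also be done from Python,
which is handy if you want to post-process the matches. This sketch
assumes a current Python 3 interpreter and only looks for the literal
string ERROR anywhere in a line, as a plain grep would; it makes no other
assumption about the log format. The function name error_lines is
hypothetical.

```python
def error_lines(path):
    """Yield (line_number, text) for every log line containing ERROR."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as log:
        for number, line in enumerate(log, 1):
            if "ERROR" in line:
                yield number, line.rstrip("\n")
```

For example, len(list(error_lines("./soat.log"))) gives a quick count to
compare between runs.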

If you're feeling adventurous, you can inspect the results yourself by
running snakeinspector.py. Running it with no arguments will dump all
failures to your screen in a semi-human readable format. You can add a
--verbose to get unified diffs of content modifications, and you can
filter on specific Test Result types with --resultfilter, and on
specific exit idhexes with --exit. Ex:

# ./snakeinspector.py --verbose --exit=80972D30FE33CB8AD60726C5272AFCEBB05CD6F7 \
   --resultfilter=SSLTestResult

Other useful filters are --after, --before, --finishedafter, and
--finishedbefore. These each take a timestamp such as
"Thu Jan 1 00:00:00 1970". --after and --before are useful while a test
is in progress to see what's been discovered so far. The finishedafter
and finishedbefore flags filter results based on when the test during
which they were discovered was completed, and provide a nice way
to group all results from the same test together. If you wanted to
see all results from tests completed the week of August 9, 2010, you
could run:

# ./snakeinspector.py --verbose --finishedafter="Mon Aug 9 00:00:00 2010" \
    --finishedbefore="Mon Aug 16 00:00:00 2010"
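
Typing these timestamps by hand is error prone, so a script can generate
them. The sketch below reproduces the layout of the examples in this
section ("Thu Jan 1 00:00:00 1970") using fixed English day and month
names. Whether snakeinspector accepts other layouts is not covered here,
so the format is an assumption copied from those examples, and the
function names are hypothetical.

```python
from datetime import datetime, timedelta

DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def soat_timestamp(moment):
    """Format a datetime like the examples, e.g. Mon Aug 9 00:00:00 2010."""
    return "%s %s %d %02d:%02d:%02d %d" % (
        DAYS[moment.weekday()], MONTHS[moment.month - 1], moment.day,
        moment.hour, moment.minute, moment.second, moment.year)


def week_filters(start):
    """Return the two filter arguments covering the 7 days from start."""
    end = start + timedelta(days=7)
    return ["--finishedafter=%s" % soat_timestamp(start),
            "--finishedbefore=%s" % soat_timestamp(end)]


# The week of August 9, 2010, as in the example above.
print(week_filters(datetime(2010, 8, 9)))
```

When these strings are passed as list elements to subprocess, no shell
quoting is needed; on a shell command line they must be quoted because
they contain spaces.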

You can see the full list of available filters by running:

# ./snakeinspector.py --help

D. Verifying Results

If you would like to verify a set of results, you can use the --rescan
option of SoaT, which crawls your data directory, builds a list of nodes
to scan that consists only of nodes with recorded failures, and then
scans those with fresh URLs:

# ./soat.py --rescan --ssl --html --http --dnsrebind >& soat.log &

Rescans can also be resumed with --resume should they fail.

SoaT can also do a rescan at the end of every loop through the node
list. This is governed by the rescan_at_finish soat_config option.

Note that rescanning does not prune out geolocated URLs that differ
across the majority of exit nodes. It can thus cause many more false
positives to accumulate than a regular scan.

E. Reporting Results

You'll notice in your soat_config.py that there are several variables
prefixed by "mail_". Set appropriately, these allow you to automatically
email results to us through snakeinspector (you'll also have to add
our email address to the to_email list).
If you have a Gmail account, you can set these variables as follows:

mail_server = "smtp.gmail.com"
mail_auth = True
mail_tls = False
mail_starttls = True
mail_user = "your_username@example.com"
mail_password = "your_password"

If you're wary of leaving your email password in plaintext in the
soat_config, you can set mail_password = None, and you'll be
prompted to provide it when snakeinspector is run.

The current directory contains a cron.sh script that calls snakeinspector
to email results from tests that completed in the last hour, or since the
last time you ran it. Add it to `crontab -e` like so:

0 * * * * ~/code/torflow.git/NetworkScanners/ExitAuthority/cron.sh

Alright that covers the basics. Let's get those motherfuckin snakes off
this motherfuckin Tor!