Commit de66bed6 authored by George Kadianakis

Merge branch 'tor-github/pr/1366'

parents 93186821 b03cb0cc

changes/ticket31849

0 → 100644
+5 −0
  o Documentation:
    - The Tor source code repository now includes a (somewhat dated)
      description of Tor's modular architecture, in doc/HACKING/design.
      This is based on the old "tor-guts.git" repository, which we are
      adopting and superseding.  Closes ticket 31849.
+124 −0

## Overview ##

This document describes the general structure of the Tor codebase, how
it fits together, what functionality is available for extending Tor,
and gives some notes on how Tor got that way.

Tor remains a work in progress: We've been working on it for more than a
decade, and we've learned a lot about good coding since we first
started.  This means, however, that some of the older pieces of Tor will
have some "code smell" in them that could sure stand a brisk
refactoring.  So when I describe a piece of code, I'll sometimes give a
note on how it got that way, and whether I still think that's a good
idea.

The first drafts of this document were written in the Summer and Fall of
2015, when Tor 0.2.6 was the most recent stable version, and Tor 0.2.7
was under development.  If you're reading this far in the future, some
things may have changed.  Caveat haxxor!

This document is not an overview of the Tor protocol.  For that, see the
design paper and the specifications at https://spec.torproject.org/ .

For more information about Tor's coding standards and some helpful
development tools, see doc/HACKING in the Tor repository.

For more information about writing tests, see doc/HACKING/WritingTests.txt
in the Tor repository.

### The very high level ###

Ultimately, Tor runs as an event-driven network daemon: it responds to
network events, signals, and timers by sending and receiving things over
the network.  Clients, relays, and directory authorities all use the
same codebase: the Tor process will run as a client, relay, or authority
depending on its configuration.

Tor has a few major dependencies, including Libevent (used to tell which
sockets are readable and writable), OpenSSL (used for many encryption
functions, and to implement the TLS protocol), and zlib (used to
compress and uncompress directory information).

Most of Tor's work today is done in a single event-driven main thread.
Tor also spawns one or more worker threads to handle CPU-intensive
tasks.  (Right now, this only includes circuit encryption.)

On startup, Tor initializes its libraries, reads and responds to its
configuration files, and launches a main event loop.  At first, the only
events that Tor listens for are a few signals (like TERM and HUP), and
one or more listener sockets (for different kinds of incoming
connections).  Tor also configures a timer function to run once per
second to handle periodic events.  As Tor runs over time, other
connections will open, and new events will be scheduled.
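
To make the "callbacks plus a once-per-second timer" idea concrete, here is
a toy dispatch loop.  It is only a sketch under loud assumptions: it uses a
simulated clock instead of real sockets and signals, it is not Libevent's
API, and every name in it (sketch_loop_t, sketch_loop_add, and so on) is
hypothetical rather than taken from Tor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: invoked when an event fires. */
typedef void (*sketch_event_cb_t)(void *arg);

typedef struct sketch_event_t {
  int fire_at;          /* Simulated second at which the event fires. */
  int interval;         /* 0 for one-shot; else re-arm every N seconds. */
  sketch_event_cb_t cb; /* What to run when the event fires. */
  void *arg;            /* Passed through to the callback. */
  int active;           /* Cleared once a one-shot event has fired. */
} sketch_event_t;

#define SKETCH_MAX_EVENTS 8

typedef struct sketch_loop_t {
  sketch_event_t events[SKETCH_MAX_EVENTS];
  int n_events;
  int now; /* Simulated clock, in seconds. */
} sketch_loop_t;

void
sketch_loop_init(sketch_loop_t *loop)
{
  loop->n_events = 0;
  loop->now = 0;
}

/* Register an event; a real loop would register a socket or signal. */
void
sketch_loop_add(sketch_loop_t *loop, int fire_at, int interval,
                sketch_event_cb_t cb, void *arg)
{
  assert(loop->n_events < SKETCH_MAX_EVENTS);
  sketch_event_t *ev = &loop->events[loop->n_events++];
  ev->fire_at = fire_at;
  ev->interval = interval;
  ev->cb = cb;
  ev->arg = arg;
  ev->active = 1;
}

/* Advance the simulated clock through end_time, dispatching callbacks. */
void
sketch_loop_run(sketch_loop_t *loop, int end_time)
{
  for (; loop->now <= end_time; ++loop->now) {
    for (int i = 0; i < loop->n_events; ++i) {
      sketch_event_t *ev = &loop->events[i];
      if (!ev->active || ev->fire_at != loop->now)
        continue;
      ev->cb(ev->arg);
      if (ev->interval > 0)
        ev->fire_at += ev->interval; /* Periodic: schedule the next run. */
      else
        ev->active = 0;              /* One-shot: retire the event. */
    }
  }
}

/* Example callback: bump the counter that arg points to. */
void
sketch_count_cb(void *arg)
{
  ++*(int *)arg;
}
```

A periodic event registered with an interval of 1 plays the role of the
once-per-second timer; a one-shot event stands in for something like a
signal arriving.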

The codebase is divided into a few main subdirectories:

   src/common -- utility functions, not necessarily tor-specific.

   src/or -- implements the Tor protocols.

   src/test -- unit and regression tests

   src/ext -- Code maintained elsewhere that we include in the Tor
   source distribution.

   src/trunnel -- automatically generated code (from the Trunnel
   tool): used to parse and encode binary formats.

### Some key high-level abstractions ###

The most important high-level abstractions in Tor are Connections,
Channels, Circuits, and Nodes.

A 'Connection' represents a stream-based information flow.  Most
connections are TCP connections to remote Tor servers and clients. (But
as a shortcut, a relay will sometimes make a connection to itself
without actually using a TCP connection.  More details later on.)
Connections exist in different varieties, depending on what
functionality they provide.  The principal types of connection are
"edge" (e.g., a SOCKS connection or a connection from an exit relay to a
destination), "OR" (a TLS stream connecting to a relay), "Directory" (an
HTTP connection to learn about the network), and "Control" (a connection
from a controller).

A 'Circuit' is a persistent tunnel through the Tor network, established
with public-key cryptography, and used to send cells over one or more hops.
Clients keep track of multi-hop circuits, and the cryptography
associated with each hop.  Relays, on the other hand, keep track only of
their hop of each circuit.

A 'Channel' is an abstract view of sending cells to and from a Tor
relay.  Currently, all channels are implemented using OR connections.
If we switch to other strategies in the future, we'll have more
connection types.

A 'Node' is a view of a Tor instance's current knowledge and opinions
about a Tor relay or bridge.

### The rest of this document. ###

> **Note**: This section describes the eventual organization of this
> document, which is not yet complete.

We'll begin with an overview of the various utility functions available
in Tor's 'common' directory.  Knowing about these is key to writing
portable, simple code in Tor.

Then we'll go on and talk about the main data-flow of the Tor network:
how Tor generates and responds to network traffic.  This will occupy a
chapter for the main overview, with other chapters for special topics.

After that, we'll mention the main modules in Tor, and describe the
function of each.

We'll cover the directory subsystem next: how Tor learns about other
relays, and how relays advertise themselves.

Then we'll cover a few specialized modules, such as hidden services,
sandboxing, hibernation, accounting, statistics, guards, path
generation, pluggable transports, and how they integrate with the rest of Tor.

We'll close with a meandering overview of important pending issues in
the Tor codebase, and how they affect the future of the Tor software.
+121 −0

## Utility code in Tor

Most of Tor's utility code is in modules in the src/common subdirectory.

These are divided, broadly, into _compatibility_ functions, _utility_
functions, _containers_, and _cryptography_.  (Someday in the future, it
would be great to split these modules into separate directories.  Also, some
functions are probably put in the wrong modules.)

### Compatibility code

These functions live in src/common/compat\*.c; some corresponding macros live
in src/common/compat\*.h.  They serve as wrappers around platform-specific or
compiler-specific functionality.

In general, the rest of the Tor code *should not* be calling platform-specific
or otherwise non-portable functions.  Instead, it should call wrappers from
compat.c, which implement a common cross-platform API.  (If you don't know
whether a function is portable, it's usually good enough to see whether it
exists on OSX, Linux, and Windows.)

Other compatibility modules include backtrace.c, which generates stack traces
for crash reporting; sandbox.c, which implements the Linux seccomp2 sandbox;
and procmon.c, which handles monitoring a child process.

Parts of address.c are compatibility code for handling network addressing
issues; other parts are in util.c.

Notable compatibility areas are:

   * mmap support for mapping files into the address space (read-only)

   * Code to work around the intricacies

   * Workaround code for Windows's horrible winsock incompatibilities and
     Linux's intricate socket extensions.

   * Helpful string functions like memmem, memstr, asprintf, strlcpy, and
     strlcat that not all platforms have.

   * Locale-ignoring variants of the ctype functions.

   * Time-manipulation functions

   * File-locking functions

   * IPv6 functions for platforms that don't have enough IPv6 support

   * Endianness functions

   * OS functions

   * Threading and locking functions.
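
As an illustration of the kind of fallback a compatibility layer supplies,
here is a minimal sketch of a strlcpy-style function for platforms whose
libc lacks one.  The name sketch_strlcpy is hypothetical, and this is not
necessarily how Tor's own fallback is written; it only follows the usual
BSD strlcpy contract (always NUL-terminate when the buffer is non-empty,
and return the length of the source).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst, writing at most dst_size bytes including the
 * terminating NUL.  Returns strlen(src), so the caller can detect
 * truncation by checking whether the result is >= dst_size. */
size_t
sketch_strlcpy(char *dst, const char *src, size_t dst_size)
{
  size_t src_len = strlen(src);
  if (dst_size > 0) {
    /* Leave room for the NUL terminator. */
    size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
  }
  return src_len;
}
```

Callers elsewhere in the codebase can then use one name on every platform,
which is the point of keeping such wrappers in a compatibility module.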

### Utility functions

General-purpose utilities are in util.c; they include higher-level wrappers
around many of the compatibility functions to provide things like
file-at-once access, memory management functions, math, string manipulation,
time manipulation, filesystem manipulation, etc.

(Some functionality, like daemon-launching, would be better off in a
compatibility module.)

In util_format.c, we have code to implement stuff like base-32 and base-64
encoding.

The address.c module interfaces with the system resolver and implements
address parsing and formatting functions.  It converts sockaddrs to and from
a more compact tor_addr_t type.

The di_ops.c module provides constant-time comparison and associative-array
operations, for side-channel avoidance.
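
To show what "constant-time comparison" means in practice, here is a
minimal sketch of a memory-equality check whose running time depends only
on the length, not on where the first difference occurs.  The name
sketch_memeq_ct is hypothetical and this is not a copy of the code in
di_ops.c; note also that an aggressive compiler could in principle undo
the property, so treat it as an illustration rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return 1 if the len bytes at a and b are equal, 0 otherwise.
 * Every byte is examined; there is no early exit on mismatch. */
int
sketch_memeq_ct(const void *a, const void *b, size_t len)
{
  const uint8_t *x = a;
  const uint8_t *y = b;
  uint32_t diff = 0;
  for (size_t i = 0; i < len; ++i)
    diff |= (uint32_t)(x[i] ^ y[i]); /* Accumulate any differing bits. */
  /* diff is in 0..255.  (diff - 1) >> 8 has its low bit set only when
   * diff == 0, because 0 - 1 wraps around to 0xFFFFFFFF. */
  return (int)(1 & ((diff - 1) >> 8));
}
```

By contrast, a plain memcmp() may return as soon as it finds a mismatch,
which can leak how many leading bytes of a secret an attacker guessed
correctly.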

The logging subsystem in log.c supports logging to files, to controllers, to
stdout/stderr, or to the system log.

The abstraction in memarea.c is used in cases when a large number of
temporary objects need to be allocated, and they can all be freed at the same
time.

The torgzip.c module wraps the zlib library to implement compression.

The workqueue.c module provides a simple multithreaded work-queue implementation.

### Containers

The container.c module defines these container types, used throughout the Tor
codebase.

There is a dynamic array called **smartlist**, used as our general resizeable
array type.  It supports sorting, searching, common set operations, and so
on.  It has specialized functions for smartlists of strings, and for
heap-based priority queues.

There's a bit-array type.

There is also a set of mapping types that map strings, 160-bit digests, and
256-bit digests to void \*.  These are what we generally use when we want
O(1) lookup.

Additionally, for containers, we use the ht.h and tor_queue.h headers, in
src/ext.  These provide intrusive hashtable and linked-list macros.

### Cryptography

Once, we tried to keep our cryptography code in a single "crypto.c" file,
with an "aes.c" module containing an AES implementation for use with older
OpenSSLs.

Now, our practice has become to introduce crypto_\*.c modules when adding new
cryptography backend code.  We have modules for Ed25519, Curve25519,
secret-to-key algorithms, and password-based boxed encryption.

Our various TLS compatibility code, wrappers, and hacks are kept in
tortls.c, which is probably too full of Tor-specific kludges.  I'm
hoping we can eliminate most of those kludges when we finally remove
support for older versions of our TLS handshake.


+93 −0

## Memory management

### Heap-allocation functions

Tor imposes a few light wrappers over C's native malloc and free
functions, to improve convenience, and to allow wholesale replacement
of malloc and free as needed.

You should never use 'malloc', 'calloc', 'realloc', or 'free' on their
own; always use the variants prefixed with 'tor_'.
They are the same as the standard C functions, with the following
exceptions:

   * tor_free(NULL) is a no-op.
   * tor_free() is a macro that takes an lvalue as an argument and sets it to
     NULL after freeing it.  To avoid this behavior, you can use tor_free_()
     instead.
   * tor_malloc() and friends fail with an assertion if they are asked to
     allocate a value so large that it is probably an underflow.
   * It is always safe to tor_malloc(0), regardless of whether your libc
     allows it.
   * tor_malloc(), tor_realloc(), and friends are never allowed to fail.
     Instead, Tor will die with an assertion.  This means that you never
     need to check their return values.  See the next subsection for
     information on why we think this is a good idea.
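
The behaviors listed above can be sketched in a few lines.  This is an
illustration only: sketch_malloc and sketch_free are hypothetical
stand-ins, not Tor's real definitions, and the "too large" threshold used
here (half of SIZE_MAX) is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for tor_malloc(): never returns NULL. */
void *
sketch_malloc(size_t size)
{
  void *result;
  /* A request this large is almost certainly a size_t underflow. */
  assert(size < SIZE_MAX / 2);
  if (size == 0)
    size = 1; /* Make a zero-byte request safe on every libc. */
  result = malloc(size);
  if (result == NULL) {
    fprintf(stderr, "Out of memory; aborting.\n");
    abort(); /* Die instead of handing NULL back to the caller. */
  }
  return result;
}

/* Hypothetical stand-in for the tor_free() macro: frees the lvalue p
 * and then sets it to NULL.  Freeing NULL is a no-op, as with free(). */
#define sketch_free(p) \
  do {                 \
    free(p);           \
    (p) = NULL;        \
  } while (0)
```

Because the macro resets the pointer, a second call on the same variable is
harmless, and a stale pointer is more likely to crash loudly than to be
silently reused.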

We define additional general-purpose memory allocation functions as well:

   * tor_malloc_zero(x) behaves as calloc(1, x), except that it makes clear
     the intent to allocate a single zeroed-out value.
   * tor_reallocarray(x,y) behaves as the OpenBSD reallocarray function.
     Use it for cases when you need to realloc() in a multiplication-safe
     way.
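
Here is a sketch of what "multiplication-safe" means: the element count
and element size are checked before they are multiplied, so an overflowing
product can never turn into a too-small allocation.  The names below are
hypothetical, and aborting on overflow is this sketch's choice (in keeping
with the "never allowed to fail" rule above); OpenBSD's reallocarray()
itself reports overflow by returning NULL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Return nonzero if a * b would not fit in a size_t. */
int
sketch_mul_would_overflow(size_t a, size_t b)
{
  return b != 0 && a > SIZE_MAX / b;
}

/* Resize ptr to hold nmemb elements of the given size.  Aborts on
 * overflow or allocation failure rather than returning NULL. */
void *
sketch_reallocarray(void *ptr, size_t nmemb, size_t size)
{
  void *result;
  size_t total;
  if (sketch_mul_would_overflow(nmemb, size))
    abort();
  total = nmemb * size;
  if (total == 0)
    total = 1; /* Avoid the implementation-defined realloc(p, 0) case. */
  result = realloc(ptr, total);
  if (result == NULL)
    abort();
  return result;
}
```

Writing realloc(p, n * sizeof(*p)) directly would skip the check, which is
exactly the pattern this helper exists to replace.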

And specific-purpose functions as well:

   * tor_strdup() and tor_strndup() behave as the underlying libc functions,
     but use tor_malloc() instead of the underlying function.
   * tor_memdup() copies a chunk of memory of a given size.
   * tor_memdup_nulterm() copies a chunk of memory of a given size, then
     NUL-terminates it just to be safe.
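
The last of these is easy to picture.  The following is a sketch of the
idea, with the hypothetical name sketch_memdup_nulterm; it uses plain
malloc() so that it stands alone, whereas the real function would go
through Tor's own allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes from mem into a fresh buffer of len + 1 bytes and
 * NUL-terminate the copy.  The caller owns the returned buffer. */
char *
sketch_memdup_nulterm(const void *mem, size_t len)
{
  char *dup;
  assert(len < SIZE_MAX); /* Ensure len + 1 cannot wrap around. */
  dup = malloc(len + 1);
  if (dup == NULL)
    abort();
  memcpy(dup, mem, len);
  dup[len] = '\0';
  return dup;
}
```

This is handy when a chunk of bytes arrives without a terminator (from the
network, say) and you want to hand it to code that expects a C string.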

#### Why assert on failure?

Why don't we allow tor_malloc() and its allies to return NULL?

First, it's error-prone.  Many programmers forget to check for NULL return
values, and testing for malloc() failures is a major pain.

Second, it's not necessarily a great way to handle OOM conditions. It's
probably better (we think) to have a memory target where we dynamically free
things ahead of time in order to stay under the target.  Trying to respond to
an OOM at the point of tor_malloc() failure, on the other hand, would involve
a rare operation invoked from deep in the call stack.  (Again, that's
error-prone and hard to debug.)

Third, thanks to the rise of Linux and other operating systems that allow
memory to be overcommitted, you can't actually ever rely on getting a NULL
from malloc() when you're out of memory; instead you have to use an approach
closer to tracking the total memory usage.

#### Conventions for your own allocation functions.

Whenever you create a new type, the convention is to give it a pair of
x_new() and x_free() functions, named after the type.

Calling x_free(NULL) should always be a no-op.
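
A minimal sketch of the convention, using a made-up widget_t type (nothing
named widget exists in Tor; it is here only for illustration) and plain
libc allocation so that the example stands alone:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example type. */
typedef struct widget_t {
  char *name; /* Owned by the widget; released in widget_free(). */
  int weight;
} widget_t;

/* Allocate and initialize a new widget.  Never returns NULL. */
widget_t *
widget_new(const char *name, int weight)
{
  size_t name_len = strlen(name) + 1;
  widget_t *w = calloc(1, sizeof(*w));
  if (w == NULL)
    abort();
  w->name = malloc(name_len);
  if (w->name == NULL)
    abort();
  memcpy(w->name, name, name_len);
  w->weight = weight;
  return w;
}

/* Release a widget and everything it owns.  NULL is a no-op. */
void
widget_free(widget_t *w)
{
  if (w == NULL)
    return;
  free(w->name);
  free(w);
}
```

Making the NULL case a no-op lets cleanup code call every x_free()
unconditionally, without a thicket of "if (ptr)" checks.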


### Grow-only memory allocation: memarea.c

It's often handy to allocate a large number of tiny objects, all of which
need to disappear at the same time.  You can do this in Tor using the
memarea.c abstraction, which uses a set of grow-only buffers for allocation,
and only supports a single "free" operation at the end.

Using memareas also helps you avoid memory fragmentation.  You see, some libc
malloc implementations perform badly on the case where a large number of
small temporary objects are allocated at the same time as a few long-lived
objects of similar size.  But if you use tor_malloc() for the long-lived ones
and a memarea for the temporary objects, the malloc implementation is likelier
to do better.

To create a new memarea, use memarea_new().  To drop all the storage from a
memarea, and invalidate its pointers, use memarea_drop_all().

The allocation functions memarea_alloc(), memarea_alloc_zero(),
memarea_memdup(), memarea_strdup(), and memarea_strndup() are analogous to
the similarly-named malloc() functions.  There is intentionally no
memarea_free() or memarea_realloc().
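
To show the general technique, here is a small grow-only allocator.  It is
a sketch, not memarea.c: the sketch_area_* names, the 4096-byte chunk
size, and the 16-byte alignment are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_CHUNK_SIZE 4096 /* Assumed default chunk capacity. */
#define SKETCH_ALIGN 16        /* Assumed to suit the stored types. */

typedef struct sketch_chunk_t {
  struct sketch_chunk_t *next; /* Older chunks, kept until drop_all. */
  char *mem;
  size_t used;
  size_t capacity;
} sketch_chunk_t;

typedef struct sketch_area_t {
  sketch_chunk_t *head; /* Chunk that new allocations come from. */
} sketch_area_t;

sketch_chunk_t *
sketch_chunk_new(size_t capacity, sketch_chunk_t *next)
{
  sketch_chunk_t *c = malloc(sizeof(*c));
  if (c == NULL)
    abort();
  c->mem = malloc(capacity);
  if (c->mem == NULL)
    abort();
  c->used = 0;
  c->capacity = capacity;
  c->next = next;
  return c;
}

sketch_area_t *
sketch_area_new(void)
{
  sketch_area_t *a = malloc(sizeof(*a));
  if (a == NULL)
    abort();
  a->head = sketch_chunk_new(SKETCH_CHUNK_SIZE, NULL);
  return a;
}

/* Bump-allocate size bytes; adds a new chunk when the current is full. */
void *
sketch_area_alloc(sketch_area_t *a, size_t size)
{
  sketch_chunk_t *c = a->head;
  size_t need;
  void *result;
  if (size == 0)
    size = 1;
  /* Round up so that the next allocation stays aligned. */
  need = (size + SKETCH_ALIGN - 1) & ~(size_t)(SKETCH_ALIGN - 1);
  if (need < size)
    abort(); /* Rounding overflowed. */
  if (c->capacity - c->used < need) {
    size_t cap = need > SKETCH_CHUNK_SIZE ? need : SKETCH_CHUNK_SIZE;
    c = a->head = sketch_chunk_new(cap, c);
  }
  result = c->mem + c->used;
  c->used += need;
  return result;
}

char *
sketch_area_strdup(sketch_area_t *a, const char *s)
{
  size_t n = strlen(s) + 1;
  char *copy = sketch_area_alloc(a, n);
  memcpy(copy, s, n);
  return copy;
}

/* The only "free": release every chunk, invalidating all pointers. */
void
sketch_area_drop_all(sketch_area_t *a)
{
  sketch_chunk_t *c = a->head;
  while (c != NULL) {
    sketch_chunk_t *next = c->next;
    free(c->mem);
    free(c);
    c = next;
  }
  free(a);
}
```

Since nothing is ever freed individually, allocation is just a pointer
bump, and tearing down thousands of small objects costs one pass over a
handful of chunks.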

+43 −0

## Collections in Tor

### Smartlists: Neither lists, nor especially smart.

For historical reasons, we call our dynamically allocated array type
"smartlist_t".  It can grow or shrink as elements are added and removed.

All smartlists hold an array of void \*.  Whenever you expose a smartlist
in an API you *must* document which types its pointers actually hold.

<!-- It would be neat to fix that, wouldn't it? -NM  -->

Smartlists are created empty with smartlist_new() and freed with
smartlist_free().  See the container.h module documentation for more
information; there are many convenience functions for commonly needed
operations.
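
For readers who have not met this kind of structure before, here is a
sketch of a growable array of void \* in the same spirit.  It is not the
smartlist implementation: the sketch_ptrlist_* names and the doubling
growth policy are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of untyped pointers. */
typedef struct sketch_ptrlist_t {
  void **items;
  size_t len; /* Number of elements in use. */
  size_t cap; /* Number of elements allocated. */
} sketch_ptrlist_t;

sketch_ptrlist_t *
sketch_ptrlist_new(void)
{
  sketch_ptrlist_t *l = calloc(1, sizeof(*l));
  if (l == NULL)
    abort();
  return l;
}

/* Append item, doubling the backing array when it is full. */
void
sketch_ptrlist_add(sketch_ptrlist_t *l, void *item)
{
  if (l->len == l->cap) {
    size_t new_cap = l->cap ? l->cap * 2 : 8;
    void **items = realloc(l->items, new_cap * sizeof(void *));
    if (items == NULL)
      abort();
    l->items = items;
    l->cap = new_cap;
  }
  l->items[l->len++] = item;
}

void *
sketch_ptrlist_get(const sketch_ptrlist_t *l, size_t idx)
{
  assert(idx < l->len);
  return l->items[idx];
}

/* Free the list itself.  In this sketch the elements are not freed;
 * the caller decides who owns them.  NULL is a no-op. */
void
sketch_ptrlist_free(sketch_ptrlist_t *l)
{
  if (l == NULL)
    return;
  free(l->items);
  free(l);
}
```

Because every element is a void \*, the compiler cannot check what you put
in or take out, which is why documenting the element type at each API
boundary matters so much.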


### Digest maps, string maps, and more.

Tor makes frequent use of maps from 160-bit digests, 256-bit digests,
or nul-terminated strings to void \*. These types are digestmap_t,
digest256map_t, and strmap_t respectively.  See the container.h
module documentation for more information.


### Intrusive lists and hashtables

For performance-sensitive cases, we sometimes want to use "intrusive"
collections: ones where the bookkeeping pointers are stuck inside the
structures that belong to the collection.  If you've used the
BSD-style sys/queue.h macros, you'll be familiar with these.

Unfortunately, the sys/queue.h macros vary significantly between the
platforms that have them, so we provide our own variants in
src/ext/tor_queue.h .

We also provide an intrusive hashtable implementation in
src/ext/ht.h .  When you're using it, you'll need to define your own hash
functions. If attacker-induced collisions are a worry here, use the
cryptographic siphash24g function to extract hashes.
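
If "intrusive" is unfamiliar, the following sketch may help.  It hand-rolls
a tiny FIFO queue whose link pointer lives inside each element, rather
than using the macros from tor_queue.h; the job_t type and the
sketch_queue_* functions are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical element type.  The link pointer is a member of the
 * element itself, so enqueueing never allocates memory. */
typedef struct job_t {
  int id;
  struct job_t *next_in_queue;
} job_t;

typedef struct sketch_queue_t {
  job_t *head;
  job_t **tail; /* Points at the pointer that the next push will set. */
} sketch_queue_t;

void
sketch_queue_init(sketch_queue_t *q)
{
  q->head = NULL;
  q->tail = &q->head;
}

/* Append job at the tail in O(1). */
void
sketch_queue_push(sketch_queue_t *q, job_t *job)
{
  job->next_in_queue = NULL;
  *q->tail = job;
  q->tail = &job->next_in_queue;
}

/* Remove and return the head element, or NULL if the queue is empty. */
job_t *
sketch_queue_pop(sketch_queue_t *q)
{
  job_t *job = q->head;
  if (job == NULL)
    return NULL;
  q->head = job->next_in_queue;
  if (q->head == NULL)
    q->tail = &q->head; /* Queue is empty again; reset the tail. */
  job->next_in_queue = NULL;
  return job;
}
```

The trade-off is that an element can sit in only one such queue per
embedded link field, in exchange for avoiding a separate node allocation
and an extra pointer hop on every access.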