Merge branch 'tor-github/pr/1366' (de66bed6) · Commits · ZerXes / Tor

changes/ticket31849

0 → 100644

+5 −0

Original line number	Diff line number	Diff line
		o Documentation:
		- The Tor source code repository now includes a (somewhat dated)
		description of Tor's modular architecture, in doc/HACKING/design.
		This is based on the old "tor-guts.git" repository, which we are
		adopting and superseding. Closes ticket 31849.

doc/HACKING/design/00-overview.md

0 → 100644

+124 −0


		## Overview ##

		This document describes the general structure of the Tor codebase, how
		it fits together, what functionality is available for extending Tor,
		and gives some notes on how Tor got that way.

		Tor remains a work in progress: We've been working on it for more than a
		decade, and we've learned a lot about good coding since we first
		started. This means, however, that some of the older pieces of Tor will
		have some "code smell" in them that could sure stand a brisk
		refactoring. So when I describe a piece of code, I'll sometimes give a
		note on how it got that way, and whether I still think that's a good
		idea.

		The first drafts of this document were written in the Summer and Fall of
		2015, when Tor 0.2.6 was the most recent stable version, and Tor 0.2.7
		was under development. If you're reading this far in the future, some
		things may have changed. Caveat haxxor!

		This document is not an overview of the Tor protocol. For that, see the
		design paper and the specifications at https://spec.torproject.org/ .

		For more information about Tor's coding standards and some helpful
		development tools, see doc/HACKING in the Tor repository.

		For more information about writing tests, see doc/HACKING/WritingTests.txt
		in the Tor repository.

		### The very high level ###

		Ultimately, Tor runs as an event-driven network daemon: it responds to
		network events, signals, and timers by sending and receiving things over
		the network. Clients, relays, and directory authorities all use the
		same codebase: the Tor process will run as a client, relay, or authority
		depending on its configuration.

		Tor has a few major dependencies, including Libevent (used to tell which
		sockets are readable and writable), OpenSSL (used for many encryption
		functions, and to implement the TLS protocol), and zlib (used to
		compress and uncompress directory information).

		Most of Tor's work today is done in a single event-driven main thread.
		Tor also spawns one or more worker threads to handle CPU-intensive
		tasks. (Right now, this only includes circuit encryption.)

		On startup, Tor initializes its libraries, reads and responds to its
		configuration files, and launches a main event loop. At first, the only
		events that Tor listens for are a few signals (like TERM and HUP), and
		one or more listener sockets (for different kinds of incoming
		connections). Tor also configures a timer function to run once per
		second to handle periodic events. As Tor runs over time, other connections
		will open, and new events will be scheduled.

		The codebase is divided into a few main subdirectories:

		src/common -- utility functions, not necessarily tor-specific.

		src/or -- implements the Tor protocols.

		src/test -- unit and regression tests

		src/ext -- Code maintained elsewhere that we include in the Tor
		source distribution.

		src/trunnel -- automatically generated code (from the Trunnel
		tool): used to parse and encode binary formats.

		### Some key high-level abstractions ###

		The most important abstractions at Tor's high level are Connections,
		Channels, Circuits, and Nodes.

		A 'Connection' represents a stream-based information flow. Most
		connections are TCP connections to remote Tor servers and clients. (But
		as a shortcut, a relay will sometimes make a connection to itself
		without actually using a TCP connection. More details later on.)
		Connections exist in different varieties, depending on what
		functionality they provide. The principal types of connection are
		"edge" (e.g., a SOCKS connection or a connection from an exit relay to a
		destination), "OR" (a TLS stream connecting to a relay), "Directory" (an
		HTTP connection to learn about the network), and "Control" (a connection
		from a controller).

		A 'Circuit' is a persistent tunnel through the Tor network, established
		with public-key cryptography, and used to send cells one or more hops.
		Clients keep track of multi-hop circuits, and the cryptography
		associated with each hop. Relays, on the other hand, keep track only of
		their hop of each circuit.

		A 'Channel' is an abstract view of sending cells to and from a Tor
		relay. Currently, all channels are implemented using OR connections.
		If we switch to other strategies in the future, we'll have more
		connection types.

		A 'Node' is a view of a Tor instance's current knowledge and opinions
		about a Tor relay or bridge.

		### The rest of this document. ###

		> Note: This section describes the eventual organization of this
		> document, which is not yet complete.

		We'll begin with an overview of the various utility functions available
		in Tor's 'common' directory. Knowing about these is key to writing
		portable, simple code in Tor.

		Then we'll go on and talk about the main data-flow of the Tor network:
		how Tor generates and responds to network traffic. This will occupy a
		chapter for the main overview, with other chapters for special topics.

		After that, we'll mention the main modules in Tor, and describe the
		function of each.

		We'll cover the directory subsystem next: how Tor learns about other
		relays, and how relays advertise themselves.

		Then we'll cover a few specialized modules, such as hidden services,
		sandboxing, hibernation, accounting, statistics, guards, path
		generation, pluggable transports, and how they integrate with the rest of Tor.

		We'll close with a meandering overview of important pending issues in
		the Tor codebase, and how they affect the future of the Tor software.

doc/HACKING/design/01-common-utils.md

0 → 100644

+121 −0


		## Utility code in Tor

		Most of Tor's utility code is in modules in the src/common subdirectory.

		These are divided, broadly, into _compatibility_ functions, _utility_
		functions, _containers_, and _cryptography_. (Someday in the future, it
		would be great to split these modules into separate directories. Also, some
		functions are probably in the wrong modules.)

		### Compatibility code

		These functions live in src/common/compat\*.c; some corresponding macros live
		in src/common/compat\*.h. They serve as wrappers around platform-specific or
		compiler-specific functionality.

		In general, the rest of the Tor code should not be calling platform-specific
		or otherwise non-portable functions. Instead, it should call wrappers from
		compat.c, which implement a common cross-platform API. (If you don't know
		whether a function is portable, it's usually good enough to see whether it
		exists on OSX, Linux, and Windows.)

		Other compatibility modules include backtrace.c, which generates stack traces
		for crash reporting; sandbox.c, which implements the Linux seccomp2 sandbox;
		and procmon.c, which handles monitoring a child process.

		Parts of address.c are compatibility code for handling network addressing
		issues; other parts are in util.c.

		Notable compatibility areas are:

		* mmap support for mapping files into the address space (read-only)

		* Code to work around the intricacies

		* Workaround code for Windows's horrible winsock incompatibilities and
		Linux's intricate socket extensions.

		* Helpful string functions like memmem, memstr, asprintf, strlcpy, and
		strlcat that not all platforms have.

		* Locale-ignoring variants of the ctype functions.

		* Time-manipulation functions

		* File locking functions

		* IPv6 functions for platforms that don't have enough IPv6 support

		* Endianness functions

		* OS functions

		* Threading and locking functions.
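
To make the idea of a compatibility wrapper concrete, here is a minimal sketch of a fallback for memmem(), one of the string functions listed above that not every platform provides. The name example_memmem is hypothetical, and this is not Tor's actual implementation; it only illustrates the pattern of offering one portable entry point for callers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical portable stand-in for memmem(). Returns a pointer to the
 * first occurrence of the nlen-byte needle inside the hlen-byte haystack,
 * or NULL if there is none. */
static const void *
example_memmem(const void *haystack, size_t hlen,
               const void *needle, size_t nlen)
{
  const unsigned char *h = haystack;
  const unsigned char *first = needle;
  const unsigned char *p;
  const unsigned char *end;

  assert(nlen > 0); /* this sketch treats an empty needle as a caller bug */
  if (nlen > hlen)
    return NULL;
  /* One past the last position where a match could start. */
  end = h + (hlen - nlen) + 1;
  p = h;
  while (p < end && (p = memchr(p, first[0], (size_t)(end - p))) != NULL) {
    if (memcmp(p, needle, nlen) == 0)
      return p;
    ++p; /* first byte matched but the rest did not; keep scanning */
  }
  return NULL;
}
```

Callers would use the wrapper everywhere and never call the platform function directly, so a platform that lacks it needs no special handling at the call site.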

		### Utility functions

		General-purpose utilities are in util.c; they include higher-level wrappers
		around many of the compatibility functions to provide things like
		file-at-once access, memory management functions, math, string manipulation,
		time manipulation, filesystem manipulation, etc.

		(Some functionality, like daemon-launching, would be better off in a
		compatibility module.)

		In util_format.c, we have code to implement stuff like base-32 and base-64
		encoding.
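
As an illustration of the kind of code that lives in such a module, the following is a rough sketch of a base-32 encoder. The function name, signature, and the choice of a lowercase unpadded RFC 4648 alphabet are assumptions made for this example; consult util_format.c for the real interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Tor's actual API. Encodes srclen bytes from src
 * as lowercase, unpadded base-32 into dest and NUL-terminates the result.
 * Returns the number of characters written (not counting the NUL), or -1
 * if dest is too small. */
static int
example_base32_encode(char *dest, size_t destlen,
                      const uint8_t *src, size_t srclen)
{
  static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz234567";
  size_t nbits = srclen * 8;
  size_t nchars = (nbits + 4) / 5; /* round up to whole 5-bit groups */
  size_t bit, out = 0;

  assert(dest != NULL && src != NULL);
  if (destlen < nchars + 1)
    return -1;
  for (bit = 0; bit < nbits; bit += 5) {
    /* Load a 16-bit window starting at the byte that contains 'bit',
     * padding with zero bits past the end of the input. */
    size_t byte = bit / 8;
    unsigned v = (unsigned)src[byte] << 8;
    if (byte + 1 < srclen)
      v |= src[byte + 1];
    /* Extract the 5 bits that begin (bit % 8) bits from the top. */
    v = (v >> (11 - (bit % 8))) & 0x1f;
    dest[out++] = alphabet[v];
  }
  dest[out] = '\0';
  return (int)out;
}
```

With this sketch, the six bytes of "foobar" encode to the ten characters "mzxw6ytboi", which matches the RFC 4648 test vector once case and padding are ignored.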

		The address.c module interfaces with the system resolver and implements
		address parsing and formatting functions. It converts sockaddrs to and from
		a more compact tor_addr_t type.

		The di_ops.c module provides constant-time comparison and associative-array
		operations, for side-channel avoidance.
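
For readers unfamiliar with the technique, a constant-time comparison can be sketched roughly as follows. The name example_memeq_ct is hypothetical, and a real implementation also has to guard against compiler optimizations that this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a data-independent comparison. Returns 1 if the
 * two sz-byte buffers are equal and 0 otherwise. The loop always visits
 * every byte, so the running time does not reveal the position of the
 * first difference. */
static int
example_memeq_ct(const void *a, const void *b, size_t sz)
{
  const uint8_t *x = a;
  const uint8_t *y = b;
  uint32_t diff = 0;
  size_t i;

  for (i = 0; i < sz; ++i)
    diff |= (uint32_t)(x[i] ^ y[i]);
  /* diff is in 0..255. When diff == 0, diff - 1 wraps to 0xffffffff and
   * bit 8 is set; for any other value, diff - 1 is below 256 and bit 8
   * is clear. No branch depends on the data. */
  return (int)(1 & ((diff - 1) >> 8));
}
```

By contrast, memcmp() is usually free to return at the first mismatching byte, which is exactly the timing signal that code handling secrets wants to avoid.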

		The logging subsystem in log.c supports logging to files, to controllers, to
		stdout/stderr, or to the system log.

		The abstraction in memarea.c is used in cases when a large number of
		temporary objects need to be allocated, and they can all be freed at the same
		time.

		The torgzip.c module wraps the zlib library to implement compression.

		The workqueue.c module provides a simple multithreaded work-queue implementation.

		### Containers

		The container.c module defines these container types, used throughout the Tor
		codebase.

		There is a dynamic array called smartlist, used as our general resizeable
		array type. It supports sorting, searching, common set operations, and so
		on. It has specialized functions for smartlists of strings, and for
		heap-based priority queues.

		There's a bit-array type.

		There is also a set of mapping types that map strings, 160-bit digests, and 256-bit digests
		to void \*. These are what we generally use when we want O(1) lookup.

		Additionally, for containers, we use the ht.h and tor_queue.h headers, in
		src/ext. These provide intrusive hashtable and linked-list macros.

		### Cryptography

		Once, we tried to keep our cryptography code in a single "crypto.c" file,
		with an "aes.c" module containing an AES implementation for use with older
		OpenSSLs.

		Now, our practice has become to introduce crypto_\*.c modules when adding new
		cryptography backend code. We have modules for Ed25519, Curve25519,
		secret-to-key algorithms, and password-based boxed encryption.

		Our various TLS compatibility code, wrappers, and hacks are kept in
		tortls.c, which is probably too full of Tor-specific kludges. I'm
		hoping we can eliminate most of those kludges when we finally remove
		support for older versions of our TLS handshake.

doc/HACKING/design/01a-memory.md

0 → 100644

+93 −0


		## Memory management

		### Heap-allocation functions

		Tor imposes a few light wrappers over C's native malloc and free
		functions, to improve convenience, and to allow wholesale replacement
		of malloc and free as needed.

		You should never use 'malloc', 'calloc', 'realloc', or 'free' on their
		own; always use the variants prefixed with 'tor_'.
		They are the same as the standard C functions, with the following
		exceptions:

		* tor_free(NULL) is a no-op.
		* tor_free() is a macro that takes an lvalue as an argument and sets it to
		NULL after freeing it. To avoid this behavior, you can use tor_free_()
		instead.
		* tor_malloc() and friends fail with an assertion if they are asked to
		allocate a value so large that it is probably an underflow.
		* It is always safe to tor_malloc(0), regardless of whether your libc
		allows it.
		* tor_malloc(), tor_realloc(), and friends are never allowed to fail.
		Instead, Tor will die with an assertion. This means that you never
		need to check their return values. See the next subsection for
		information on why we think this is a good idea.
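
The behavior in the list above can be approximated with a short sketch. The names example_malloc, example_free, and EXAMPLE_SIZE_CEILING are hypothetical stand-ins, and the ceiling value is an assumption; Tor's real wrappers differ in detail.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed limit: any request this large is treated as a likely
 * arithmetic underflow in the caller. */
#define EXAMPLE_SIZE_CEILING (SIZE_MAX / 2)

/* Hypothetical never-fail allocator: it either returns usable memory or
 * terminates the process, so callers never check for NULL. */
static void *
example_malloc(size_t size)
{
  void *result;
  assert(size < EXAMPLE_SIZE_CEILING);
  if (size == 0)
    size = 1; /* make a zero-byte request well-defined on every libc */
  result = malloc(size);
  if (result == NULL) {
    fprintf(stderr, "Out of memory on malloc(). Dying.\n");
    abort();
  }
  return result;
}

/* Frees the pointer stored in the lvalue p, then sets p to NULL so that
 * a stale pointer cannot be used or freed twice by accident. Freeing a
 * NULL pointer is a no-op. */
#define example_free(p) \
  do {                  \
    free(p);            \
    (p) = NULL;         \
  } while (0)
```

Because the macro clears its argument, calling it twice on the same variable is harmless, which removes one common source of double-free bugs.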

		We define additional general-purpose memory allocation functions as well:

		* tor_malloc_zero(x) behaves as calloc(1, x), except that it makes clear
		the intent to allocate a single zeroed-out value.
		* tor_reallocarray(x,y) behaves as the OpenBSD reallocarray function.
		Use it for cases when you need to realloc() in a multiplication-safe
		way.
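
To show what "multiplication-safe" means here, the following sketch checks the product for overflow before resizing. The name example_reallocarray is hypothetical. Unlike the Tor wrappers described above, which die rather than fail, this sketch returns NULL on overflow so that the check is easy to observe.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical multiplication-safe realloc. Resizes ptr to hold nmemb
 * elements of size bytes each. Returns NULL, leaving ptr untouched, if
 * nmemb * size would overflow size_t or if the allocation fails. */
static void *
example_reallocarray(void *ptr, size_t nmemb, size_t size)
{
  size_t total;
  if (size != 0 && nmemb > SIZE_MAX / size)
    return NULL; /* the product would wrap around */
  total = nmemb * size;
  if (total == 0)
    total = 1; /* avoid the implementation-defined realloc(p, 0) case */
  return realloc(ptr, total);
}
```

Without the check, a wrapped product would silently produce a buffer far smaller than the caller expects, which is a classic route to heap overflows.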

		And specific-purpose functions as well:

		* tor_strdup() and tor_strndup() behave as the underlying libc functions,
		but use tor_malloc() instead of the underlying function.
		* tor_memdup() copies a chunk of memory of a given size.
		* tor_memdup_nulterm() copies a chunk of memory of a given size, then
		NUL-terminates it just to be safe.

		#### Why assert on failure?

		Why don't we allow tor_malloc() and its allies to return NULL?

		First, it's error-prone. Many programmers forget to check for NULL return
		values, and testing for malloc() failures is a major pain.

		Second, it's not necessarily a great way to handle OOM conditions. It's
		probably better (we think) to have a memory target where we dynamically free
		things ahead of time in order to stay under the target. Trying to respond to
		an OOM at the point of tor_malloc() failure, on the other hand, would involve
		a rare operation invoked from deep in the call stack. (Again, that's
		error-prone and hard to debug.)

		Third, thanks to the rise of Linux and other operating systems that allow
		memory to be overcommitted, you can't actually ever rely on getting a NULL
		from malloc() when you're out of memory; instead you have to use an approach
		closer to tracking the total memory usage.

		#### Conventions for your own allocation functions.

		Whenever you create a new type, the convention is to give it a pair of
		x_new() and x_free() functions, named after the type.

		Calling x_free(NULL) should always be a no-op.
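
A minimal sketch of this convention, using a made-up widget_t type, might look like the following. It uses plain calloc() and free() only to stay self-contained; code inside Tor would use the tor_ variants described earlier.

```c
#include <stdlib.h>

/* Hypothetical type, used only to illustrate the naming convention. */
typedef struct widget_t {
  int id;
  char *label; /* owned by the widget; may be NULL */
} widget_t;

/* Allocate and return a new zero-initialized widget with the given id. */
static widget_t *
widget_new(int id)
{
  widget_t *w = calloc(1, sizeof(*w));
  if (w == NULL)
    abort(); /* mirror the never-fail policy described above */
  w->id = id;
  return w;
}

/* Release a widget and everything it owns. Does nothing if w is NULL. */
static void
widget_free(widget_t *w)
{
  if (w == NULL)
    return;
  free(w->label);
  free(w);
}
```

Making widget_free(NULL) a no-op lets cleanup paths free every field unconditionally, without a chain of NULL checks.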


		### Grow-only memory allocation: memarea.c

		It's often handy to allocate a large number of tiny objects, all of which
		need to disappear at the same time. You can do this in Tor using the
		memarea.c abstraction, which uses a set of grow-only buffers for allocation,
		and only supports a single "free" operation at the end.

		Using memareas also helps you avoid memory fragmentation. You see, some libc
		malloc implementations perform badly on the case where a large number of
		small temporary objects are allocated at the same time as a few long-lived
		objects of similar size. But if you use tor_malloc() for the long-lived ones
		and a memarea for the temporary objects, the malloc implementation is likelier
		to do better.

		To create a new memarea, use memarea_new(). To drop all the storage from a
		memarea, and invalidate its pointers, use memarea_drop_all().

		The allocation functions memarea_alloc(), memarea_alloc_zero(),
		memarea_memdup(), memarea_strdup(), and memarea_strndup() are analogous to
		the similarly-named malloc() functions. There is intentionally no
		memarea_free() or memarea_realloc().
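
The grow-only idea can be sketched in a few dozen lines. Everything below (the example_area_* names, the chunk size, the alignment union) is an assumption made for illustration and is not the memarea.c implementation; it ignores overflow for absurdly large requests.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define EXAMPLE_CHUNK_SIZE 4096 /* assumed default chunk capacity */

/* Requests are rounded up to a multiple of this union's size so that
 * every returned pointer is suitably aligned for common types. */
typedef union { long double ld; long long ll; void *p; } example_align_t;

typedef struct example_chunk_t {
  struct example_chunk_t *next;
  size_t used, capacity;
  char *mem;
} example_chunk_t;

typedef struct example_area_t { example_chunk_t *head; } example_area_t;

static example_chunk_t *
example_chunk_new(size_t capacity, example_chunk_t *next)
{
  example_chunk_t *c = malloc(sizeof(*c));
  if (c == NULL) abort();
  c->mem = malloc(capacity);
  if (c->mem == NULL) abort();
  c->used = 0;
  c->capacity = capacity;
  c->next = next;
  return c;
}

static example_area_t *
example_area_new(void)
{
  example_area_t *a = malloc(sizeof(*a));
  if (a == NULL) abort();
  a->head = example_chunk_new(EXAMPLE_CHUNK_SIZE, NULL);
  return a;
}

/* Hand out sz bytes from the newest chunk, adding a chunk if needed.
 * Individual allocations are never freed or resized. */
static void *
example_area_alloc(example_area_t *a, size_t sz)
{
  const size_t align = sizeof(example_align_t);
  example_chunk_t *c;
  void *result;
  assert(a != NULL);
  c = a->head;
  sz = (sz + align - 1) / align * align;
  if (sz == 0) sz = align;
  if (sz > c->capacity - c->used) {
    size_t cap = sz > EXAMPLE_CHUNK_SIZE ? sz : EXAMPLE_CHUNK_SIZE;
    c = a->head = example_chunk_new(cap, c);
  }
  result = c->mem + c->used;
  c->used += sz;
  return result;
}

static char *
example_area_strdup(example_area_t *a, const char *s)
{
  size_t len = strlen(s) + 1;
  char *copy = example_area_alloc(a, len);
  memcpy(copy, s, len);
  return copy;
}

/* Free the area and every chunk in it, invalidating all its pointers. */
static void
example_area_drop_all(example_area_t *a)
{
  example_chunk_t *c = a->head;
  while (c != NULL) {
    example_chunk_t *next = c->next;
    free(c->mem);
    free(c);
    c = next;
  }
  free(a);
}
```

Allocation is just a pointer bump, and teardown costs one free() per chunk rather than one per object, which is where both the speed and the reduced fragmentation come from.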

doc/HACKING/design/01b-collections.md

0 → 100644

+43 −0


		## Collections in Tor

		### Smartlists: Neither lists, nor especially smart.

		For historical reasons, we call our dynamically allocated array type
		"smartlist_t". It can grow or shrink as elements are added and removed.

		All smartlists hold an array of void \*. Whenever you expose a smartlist
		in an API you must document which types its pointers actually hold.

		<!-- It would be neat to fix that, wouldn't it? -NM -->

		Smartlists are created empty with smartlist_new() and freed with
		smartlist_free(). See the container.h module documentation for more
		information; there are many convenience functions for commonly needed
		operations.
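
For readers who want a mental model, a resizable array of void pointers can be sketched as below. The example_list_* names, the initial capacity, and the doubling policy are assumptions for illustration; the real smartlist_t offers a much richer API.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a smartlist-like dynamic array. */
typedef struct example_list_t {
  void **items;    /* pointers owned by the caller, not by the list */
  size_t len;      /* number of slots in use */
  size_t capacity; /* number of slots allocated */
} example_list_t;

/* Create a new, empty list. */
static example_list_t *
example_list_new(void)
{
  example_list_t *l = calloc(1, sizeof(*l));
  if (l == NULL)
    abort();
  return l;
}

/* Append item, doubling the backing array when it is full. */
static void
example_list_add(example_list_t *l, void *item)
{
  if (l->len == l->capacity) {
    size_t cap = l->capacity ? l->capacity * 2 : 16;
    void **items = realloc(l->items, cap * sizeof(void *));
    if (items == NULL)
      abort();
    l->items = items;
    l->capacity = cap;
  }
  l->items[l->len++] = item;
}

/* Free the list itself, but not the elements it points to.
 * Does nothing if l is NULL. */
static void
example_list_free(example_list_t *l)
{
  if (l == NULL)
    return;
  free(l->items);
  free(l);
}
```

Because the elements are untyped void pointers, nothing stops a caller from mixing types in one list, which is why documenting the element type in every API matters.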


		### Digest maps, string maps, and more.

		Tor makes frequent use of maps from 160-bit digests, 256-bit digests,
		or nul-terminated strings to void \*. These types are digestmap_t,
		digest256map_t, and strmap_t respectively. See the container.h
		module documentation for more information.


		### Intrusive lists and hashtables

		For performance-sensitive cases, we sometimes want to use "intrusive"
		collections: ones where the bookkeeping pointers are stuck inside the
		structures that belong to the collection. If you've used the
		BSD-style sys/queue.h macros, you'll be familiar with these.

		Unfortunately, the sys/queue.h macros vary significantly between the
		platforms that have them, so we provide our own variants in
		src/ext/tor_queue.h .
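
If you have not met intrusive collections before, this hand-rolled sketch shows the idea without using any queue macros. The example_conn_t type and its helper functions are hypothetical; real code would use the macros from tor_queue.h instead.

```c
#include <stddef.h>

/* Hypothetical element type. The link pointer lives inside the element,
 * so adding it to a list needs no separate node allocation. */
typedef struct example_conn_t {
  int id;
  struct example_conn_t *next_in_list; /* intrusive bookkeeping pointer */
} example_conn_t;

typedef struct example_conn_list_t {
  example_conn_t *head;
} example_conn_list_t;

/* Push c onto the front of the list. */
static void
example_conn_list_push(example_conn_list_t *list, example_conn_t *c)
{
  c->next_in_list = list->head;
  list->head = c;
}

/* Unlink c from the list if it is present. */
static void
example_conn_list_remove(example_conn_list_t *list, example_conn_t *c)
{
  example_conn_t **pp;
  /* pp points at whichever pointer currently refers to the candidate
   * element: first the head, then each element's link field. */
  for (pp = &list->head; *pp != NULL; pp = &(*pp)->next_in_list) {
    if (*pp == c) {
      *pp = c->next_in_list;
      c->next_in_list = NULL;
      return;
    }
  }
}
```

The trade-off is that an element can belong to only as many lists as it has embedded link fields, in exchange for fewer allocations and better locality.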

		We also provide an intrusive hashtable implementation in
		src/ext/ht.h . When you're using it, you'll need to define your own hash
		functions. If attacker-induced collisions are a worry here, use the
		cryptographic siphash24g function to extract hashes.