@@ -12,7 +12,7 @@ The impacts of this project are wide ranging:
**Objective 1: Arti supports directory authorities**
Directory authorities are a group of special-purpose relays on the Tor network that maintain the list of currently-running relays and every hour publish a consensus view of the network. In Tor terms, a consensus is a single signed document compiled and voted on by the directory authorities once per hour, ensuring that all clients have the same information about the relays that make up the Tor network. Currently eight relays are considered directory authorities and have been chosen to be in these positions because of their operators’ long-term contributions to the Tor network and Tor community.1
Directory authorities are a group of special-purpose relays on the Tor network that maintain the list of currently-running relays and every hour publish a consensus view of the network. In Tor terms, a consensus is a single signed document compiled and voted on by the directory authorities once per hour, ensuring that all clients have the same information about the relays that make up the Tor network. Currently eight relays are considered directory authorities and have been chosen to be in these positions because of their operators’ long-term contributions to the Tor network and Tor community.
To achieve this Objective, we will port Tor’s existing directory authority implementation to Rust and establish a minimally-disruptive path to migrate existing directory authorities to the new code base. This Objective requires explicit, collaborative communication with directory authority operators, conducting tests with them, and engaging them in the creation of a transition plan.
...
...
@@ -44,11 +44,11 @@ Resolving these flag assignment issues in directory authorities has been hampere
In addressing directory authority flag assignment issues within the scope of this Activity, we will decide which issues need attention by combining our implementation of the flag assignment specifications with an evaluation of appropriate strategies for resolving outstanding assignment concerns. We will use the following criteria to prioritize issues and to evaluate our success:
• The urgency and severity of the directory authority flag assignment issue
• The clarity and simplicity of the solution to address incorrect, bad, or otherwise harmful behavior
• The balance between addressing the issue and maintaining existing functionality within the directory authority re-implementation project timeline and budget constraints
• The extent to which unresolved issues hinder or delay the implementation of flag assignments in the directory authority context
• Whether resolving the issue may introduce any privacy, security, or network disruptive implications
- The urgency and severity of the directory authority flag assignment issue
- The clarity and simplicity of the solution to address incorrect, bad, or otherwise harmful behavior
- The balance between addressing the issue and maintaining existing functionality within the directory authority re-implementation project timeline and budget constraints
- The extent to which unresolved issues hinder or delay the implementation of flag assignments in the directory authority context
- Whether resolving the issue may introduce any privacy, security, or network disruptive implications
After evaluating issues and solutions based on these criteria, we will proceed with implementing the most appropriate path forward. Our success in this process will be measured by assessing consensus documents produced by directory authorities to confirm that flag assignments are being properly implemented and assignment issues have been resolved.
...
...
@@ -85,7 +85,8 @@ Administration of directory authorities is substantially more complex than admin
In this Activity, we will first develop a phased transition plan to switch the network from C authorities to Arti authorities. Then we will consult with the authority operators, adjusting the plan as needed, to ensure that they believe the plan is achievable, sensible, and robust. If during this Activity we identify new necessary tooling for migration that we have not yet identified, we will develop those tools as part of O1.9. (We have not currently identified any such tooling, but any planning activity carries the possibility of finding unexpected second-order requirements.)
Objective 2: Arti supports all necessary relay types
**Objective 2: Arti supports all necessary relay types**
To achieve this Objective, we will build relay mode in Arti such that it can replace Tor for relays in the network. This includes supporting all necessary relay types: guards, middle relays, exits, and bridges (with and without pluggable transports).
Notably absent from the above list are directory authorities. This work is distinctly defined in Objective 1 “Arti supports Directory Authorities.” Separating the work to port directory authorities from the work to port relays allows us to implement a multi-phased approach for a smoother transition of the network. The Tor relays on the network can effectively migrate to the Rust-based Arti code independently from migrating the directory authorities.
...
...
@@ -104,19 +105,19 @@ First, we will describe the specific work that we know we must do in this Activi
The specific known features to be reimplemented in Rust, and the development efforts to do so are as follows:
• Bandwidth limits are features used to control the average and burst number of bytes a relay uses. This feature allows relay operators to control how much bandwidth their relay uses. Bandwidth limit features allow relay operators to keep their ISP bills under control and ensure they only donate bandwidth that they can afford. (This corresponds to the C implementation's `BandwidthRate`, `BandwidthBurst`, `RelayBandwidthRate`, and `RelayBandwidthBurst` options, primarily implemented in `src/core/mainloop/connection.c` and `src/lib/evloop/token_bucket.c`.)
◦ At this time, we plan to reimplement this feature by employing wrappers on Arti's existing network backend code to record the amount of traffic sent or received on the network, and to throttle traffic when it would otherwise be greater than the amount permitted. We will try to identify existing library solutions that can integrate with our network backend, if appropriate. We will include a mechanism for configuring these limits.
• Bandwidth accounting is a feature used to limit the total number of bytes produced and consumed over a longer time frame of days, weeks, or months. This feature allows relay operators to control how a specified amount of bandwidth is used over time, which is another mechanism for relay operators to control their ISP bills and ensure they only donate bandwidth that they can afford. (This corresponds to the C implementation's `AccountingMax` option, primarily implemented in `src/feature/hibernate/`.)
◦ At this time, we plan to reimplement this feature by porting the C codebase's existing mechanisms for scheduling "hibernation" intervals to Arti. We'll integrate this code to monitor our total bandwidth usage (monitored as described in the item above) to decide when to begin "hibernating". The hibernation mechanism itself will extend Arti's current "dormancy" code to minimize functional duplication. The accounting code will use Arti's existing "persistent state manager" code to record partial activity to a state file so that relays can be rebooted without exceeding their bandwidth allocations.
• Limits on the number of CPU cores to be allocated to the relay process are features used to prevent exhaustion of resources and/or excessive hosting bills. Again, this allows relay operators to ensure they only donate the CPU resources that they can afford. (This corresponds to the C implementation's `NumCpus` option, primarily implemented in the `get_num_cpus()` function and in the other functions that call it.)
◦ At this time, we plan to reimplement these features by adding a configuration option to control the number of CPUs used, by using existing library code to learn the maximum number of cores available and to tell the operating system to limit the number of cores used (when applicable). For our existing asynchronous networking backends, we will write code so that they get configured with the correct number of worker threads to use the configured number of cores.
• Integration with the `systemd` service monitoring system is a feature used to surface information about whether a relay has crashed or become unresponsive. This allows relay operators to effectively monitor the status of their relay. (In C this is primarily implemented via the blocks of code in `hibernate.c`, `mainloop.c`, and `main.c` marked with `#ifdef HAVE_SYSTEMD` and `#ifdef HAVE_SYSTEMD_209`.)
- Bandwidth limits are features used to control the average and burst number of bytes a relay uses. This feature allows relay operators to control how much bandwidth their relay uses. Bandwidth limit features allow relay operators to keep their ISP bills under control and ensure they only donate bandwidth that they can afford. (This corresponds to the C implementation's `BandwidthRate`, `BandwidthBurst`, `RelayBandwidthRate`, and `RelayBandwidthBurst` options, primarily implemented in `src/core/mainloop/connection.c` and `src/lib/evloop/token_bucket.c`.)
- At this time, we plan to reimplement this feature by employing wrappers on Arti's existing network backend code to record the amount of traffic sent or received on the network, and to throttle traffic when it would otherwise be greater than the amount permitted. We will try to identify existing library solutions that can integrate with our network backend, if appropriate. We will include a mechanism for configuring these limits.
- Bandwidth accounting is a feature used to limit the total number of bytes produced and consumed over a longer time frame of days, weeks, or months. This feature allows relay operators to control how a specified amount of bandwidth is used over time, which is another mechanism for relay operators to control their ISP bills and ensure they only donate bandwidth that they can afford. (This corresponds to the C implementation's `AccountingMax` option, primarily implemented in `src/feature/hibernate/`.)
  - At this time, we plan to reimplement this feature by porting the C codebase's existing mechanisms for scheduling "hibernation" intervals to Arti. We'll integrate this code to monitor our total bandwidth usage (monitored as described in the item above) to decide when to begin "hibernating". The hibernation mechanism itself will extend Arti's current "dormancy" code to minimize functional duplication. The accounting code will use Arti's existing "persistent state manager" code to record partial activity to a state file so that relays can be rebooted without exceeding their bandwidth allocations.
- Limits on the number of CPU cores to be allocated to the relay process are features used to prevent exhaustion of resources and/or excessive hosting bills. Again, this allows relay operators to ensure they only donate the CPU resources that they can afford. (This corresponds to the C implementation's `NumCpus` option, primarily implemented in the `get_num_cpus()` function and in the other functions that call it.)
- At this time, we plan to reimplement these features by adding a configuration option to control the number of CPUs used, by using existing library code to learn the maximum number of cores available and to tell the operating system to limit the number of cores used (when applicable). For our existing asynchronous networking backends, we will write code so that they get configured with the correct number of worker threads to use the configured number of cores.
- Integration with the `systemd` service monitoring system is a feature used to surface information about whether a relay has crashed or become unresponsive. This allows relay operators to effectively monitor the status of their relay. (In C this is primarily implemented via the blocks of code in `hibernate.c`, `mainloop.c`, and `main.c` marked with `#ifdef HAVE_SYSTEMD` and `#ifdef HAVE_SYSTEMD_209`.)
◦ At this time, we plan to reimplement this feature by adding optional "process status" features to our runtime, to expose information about Arti's PID and liveness, and then by exposing those features via APIs required by systemd.
• Resource limitation for maximum open network connection usage is a feature used to ensure that Tor relays don't exhaust system resources or crash other programs. (In C this is handled primarily via the `ConnLimit` option, and the `set_max_file_descriptors()` function and surrounding code.)
- Resource limitation for maximum open network connection usage is a feature used to ensure that Tor relays don't exhaust system resources or crash other programs. (In C this is handled primarily via the `ConnLimit` option, and the `set_max_file_descriptors()` function and surrounding code.)
◦ At this time, we plan to reimplement this feature in two pieces. First, we will write code to detect the number of network connections available, to keep track of the number we are using, and to adjust system limits as configured. Second, we will write code to handle the case where we have run out of available connections gracefully, and not crash or produce excessive errors.
• Integration with existing system logging methods is a feature used to support better integration with existing sysadmin tooling. This allows relay operators to more easily monitor and maintain the status of their relay. (In C this is handled by the `Log syslog` option, and implemented in the `logfile_deliver()` function and related invocation and configuration code.)
- Integration with existing system logging methods is a feature used to support better integration with existing sysadmin tooling. This allows relay operators to more easily monitor and maintain the status of their relay. (In C this is handled by the `Log syslog` option, and implemented in the `logfile_deliver()` function and related invocation and configuration code.)
◦ At this time, we plan to reimplement this feature by using an existing Rust "syslog" system library that integrates with our logging subsystem (currently the `tracing` library). If none exists, we will implement one.
• Ongoing improvements to logging and diagnostic messages reported by Arti are features used to help relay operators understand logs and easily monitor and maintain their relays. Improved messages also aid Tor developers to diagnose problems reported by relay operators. (In C this includes the totality of log messages throughout the codebase, and usability improvements such as rate-limiting log messages as implemented with the `log_fn_ratelim()` function.)
- Ongoing improvements to logging and diagnostic messages reported by Arti are features used to help relay operators understand logs and easily monitor and maintain their relays. Improved messages also aid Tor developers to diagnose problems reported by relay operators. (In C this includes the totality of log messages throughout the codebase, and usability improvements such as rate-limiting log messages as implemented with the `log_fn_ratelim()` function.)
◦ At this time, we plan to reimplement these features by learning, from operators in production and on the test network, which information they find overly verbose and which information they find missing. Then we will add or downgrade logging as needed. If needed, we'll reimplement or adapt a mechanism for coalescing similar log messages to avoid filling up the logs.
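To make the bandwidth-limit item in the list above more concrete, the following is a minimal sketch of the token-bucket idea behind options such as `BandwidthRate` and `BandwidthBurst`. It is an illustration only: `TokenBucket` and its methods are hypothetical names, not Arti APIs, and the final design (including any existing library we adopt) may look quite different.

```rust
use std::time::Instant;

/// Hypothetical token bucket for throttling traffic.
/// Illustrative only; not an Arti type.
pub struct TokenBucket {
    /// Average rate: bytes added to the bucket per second.
    rate: u64,
    /// Burst size: the most bytes the bucket can hold.
    burst: u64,
    /// Bytes that may be sent right now.
    tokens: u64,
    /// When the bucket was last topped up.
    last_refill: Instant,
}

impl TokenBucket {
    /// Create a bucket that starts full.
    pub fn new(rate: u64, burst: u64, now: Instant) -> Self {
        Self { rate, burst, tokens: burst, last_refill: now }
    }

    /// Add tokens for the time elapsed since the last refill, capped at `burst`.
    fn refill(&mut self, now: Instant) {
        let elapsed = now.saturating_duration_since(self.last_refill);
        let added = (elapsed.as_millis() as u64).saturating_mul(self.rate) / 1000;
        // Only advance the clock when tokens were actually added, so that
        // very short intervals accumulate instead of being lost.
        if added > 0 {
            self.tokens = self.tokens.saturating_add(added).min(self.burst);
            self.last_refill = now;
        }
    }

    /// Ask to send `wanted` bytes; returns how many bytes may be sent now.
    pub fn take(&mut self, now: Instant, wanted: u64) -> u64 {
        self.refill(now);
        let granted = wanted.min(self.tokens);
        self.tokens -= granted;
        granted
    }
}
```

A wrapper around a network stream could call `take` before each write and defer whatever was not granted. Passing `now` explicitly, rather than reading the clock inside the bucket, keeps the logic easy to test with simulated time.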
Above and elsewhere, when we describe our plans for how to reimplement a given feature, we are referring to our current best anticipated plans. We may make changes to how we reimplement these given features as needed if we find that the originally anticipated means for delivering a feature can be replaced with one that can be delivered more effectively, efficiently, or reliably. We are outlining our planned development efforts in this way in order to clarify the anticipated workflow, but if we find a more effective path to deliver it, we will do so in the interests of cost containment and product quality.
...
...
@@ -130,23 +131,29 @@ Some effort in this Activity is allocated for solving issues raised by operators
After these requests are evaluated and we decide whether or not to resolve them, we will gauge our success based on whether relay operators report that these issues have been resolved and whether relay operators are able to migrate to Arti successfully, following the targets set in the M&E plan.
Objective 3: Arti is stable enough for general usage
**Objective 3: Arti is stable enough for general usage**
To achieve this Objective, we must test and tune Arti so we can ensure the new relay implementation is stable enough for general usage on supported modern operating systems. Across this Objective we will be analyzing the privacy implications of Arti, its robustness, and ensuring that Arti builds reproducibly.
O3.1 Ensure parity with C implementation: We will begin this Objective with an Activity to ensure that Arti is in parity with the C implementation. To do so, we will reverse engineer security requirements from existing C code and translate those to Arti. Through this process we will be identifying missing features and implementing them, as well as evaluating privacy implications of Arti to ensure it does not introduce any new privacy vulnerabilities.
O3.2 Reimplement, develop, and test protections to external attacks: Our next Activity to ensure stability is to reimplement, develop, and test protections for Arti against malicious input and external attacks like DoS. We will then conduct extensive fuzzing to ensure these protections are working as expected and to improve them where they are failing. The C implementation includes numerous security features and practices that protect it from forms of active attack. These attacks are not theoretical—they have been observed on the actual network. If we did not reimplement these features, Arti would be susceptible to attacks that have already been observed on the live Tor network, including attacks that would enable an attacker to crash some or all of the network, degrade performance, or deanonymize user traffic. Relay operators and users would not be able to migrate safely under such conditions, and if they tried, we expect that active attackers would crash their relays at will.
Features that exist in the C codebase that we will port to Arti:
• Support for detection of and response to memory-related denial-of-service attacks. (The C implementation provides this feature via the `MaxMemInQueues` option and the related `cell_queues_check_size()` function.) Building this will require listing the places in our code where we allocate memory based on externally triggerable actions, and instrumenting those places to track their total amount of allocated memory. Then we'll need to implement code to monitor these total amounts and terminate connections or circuits (or take other appropriate responses) when the cell queues become too full. Our work here will be guided by the algorithms in the C implementation, which in turn are based on the work of Jansen, Tschorsch, Johnson, and Scheuermann (2014).
• Support for detection of and response to other resource-based denial-of-service attacks to prevent an attacker from opening huge numbers of connections or circuits from the same IP or network. (The C implementation provides this feature via the code in `src/core/or/dos.c`.) Building this will require analyzing and porting algorithms from that C file to Arti and refactoring them as appropriate to account for Arti's more modular code structure.
- Support for detection of and response to memory-related denial-of-service attacks. (The C implementation provides this feature via the `MaxMemInQueues` option and the related `cell_queues_check_size()` function.) Building this will require listing the places in our code where we allocate memory based on externally triggerable actions, and instrumenting those places to track their total amount of allocated memory. Then we'll need to implement code to monitor these total amounts and terminate connections or circuits (or take other appropriate responses) when the cell queues become too full. Our work here will be guided by the algorithms in the C implementation, which in turn are based on the work of Jansen, Tschorsch, Johnson, and Scheuermann (2014).
- Support for detection of and response to other resource-based denial-of-service attacks to prevent an attacker from opening huge numbers of connections or circuits from the same IP or network. (The C implementation provides this feature via the code in `src/core/or/dos.c`.) Building this will require analyzing and porting algorithms from that C file to Arti and refactoring them as appropriate to account for Arti's more modular code structure.
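The memory-tracking step described in the first item above (instrumenting allocation sites and monitoring their total) can be sketched as follows. This is a simplified illustration under stated assumptions: `MemoryTracker` and `CellQueue` are hypothetical names, not Arti types, and the sketch only detects that the limit was exceeded; choosing which circuits or connections to terminate is left out.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical shared counter of bytes queued on behalf of remote parties.
pub struct MemoryTracker {
    used: AtomicUsize,
    limit: usize,
}

impl MemoryTracker {
    pub fn new(limit: usize) -> Arc<Self> {
        Arc::new(Self { used: AtomicUsize::new(0), limit })
    }

    /// Record an allocation; returns true if the total now exceeds the limit.
    pub fn claim(&self, bytes: usize) -> bool {
        let previous = self.used.fetch_add(bytes, Ordering::Relaxed);
        previous + bytes > self.limit
    }

    /// Record that `bytes` have been freed.
    pub fn release(&self, bytes: usize) {
        self.used.fetch_sub(bytes, Ordering::Relaxed);
    }

    /// Total bytes currently accounted for.
    pub fn used(&self) -> usize {
        self.used.load(Ordering::Relaxed)
    }
}

/// Hypothetical per-circuit cell queue that reports its size to the tracker.
pub struct CellQueue {
    tracker: Arc<MemoryTracker>,
    cells: VecDeque<Vec<u8>>,
}

impl CellQueue {
    pub fn new(tracker: Arc<MemoryTracker>) -> Self {
        Self { tracker, cells: VecDeque::new() }
    }

    /// Queue a cell; returns true if the caller should start reclaiming memory.
    pub fn push(&mut self, cell: Vec<u8>) -> bool {
        let over_limit = self.tracker.claim(cell.len());
        self.cells.push_back(cell);
        over_limit
    }

    /// Remove the oldest cell and release its bytes.
    pub fn pop(&mut self) -> Option<Vec<u8>> {
        let cell = self.cells.pop_front()?;
        self.tracker.release(cell.len());
        Some(cell)
    }

    /// Drop everything queued (as when a circuit is closed); returns bytes freed.
    pub fn clear(&mut self) -> usize {
        let freed: usize = self.cells.drain(..).map(|c| c.len()).sum();
        self.tracker.release(freed);
        freed
    }
}

impl Drop for CellQueue {
    /// Make sure a dropped queue never leaves bytes on the books.
    fn drop(&mut self) {
        self.clear();
    }
}
```

Tying the release to `Drop` means a queue that goes away for any reason cannot leave stale bytes in the total, which is the kind of accounting error that would otherwise cause spurious out-of-memory responses.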
Beyond reimplementing existing protections against DoS attacks, we also must harden Arti against hostile inputs to reduce the likelihood of a successful exploitation of any programming errors in our codebase. We will do this by developing specific tests, utilizing those tests, and conducting extensive fuzzing that will allow us to detect and resolve security errors.
This is critical because the C implementation, flawed as the C language is, is the product of over 20 years of active research, refinement, and testing. As such, new successful exploits against it are rare. Although Rust is a safer language than C, and security-relevant mistakes are harder to make in Rust, Arti will not boast the same benefit of years of experience and code inspection by security researchers. As such, it is critical for the safety of Tor users that we take measures to limit the impact of any mistakes that we make during development.
Although these efforts are not reimplementing specific security features, they are efforts to bring Arti as close to parity with the C implementation as possible by replicating a process of thorough refinement and testing to discover and resolve exploits.
To accomplish this, we will be developing:
• Tests to ensure that unhandled programming errors (called "panics" in Rust) are contained to a single protocol context (a circuit, a stream, or a channel), and do not cause the termination of any more protocol objects than are needed. To develop these tests, we will add a fault injection mechanism to the stream, circuit, and channel mechanisms, in order to simulate the case where a programming error has caused a "panic," and verify that only the offending protocol object is closed. If we did not validate that the error containment mechanism worked, we would risk the possibility of shipping code without working error containment—and then having an error that otherwise would be minor in its effects turn into an opportunity for a network-wide DoS attack or worse.
- Tests to ensure that unhandled programming errors (called "panics" in Rust) are contained to a single protocol context (a circuit, a stream, or a channel), and do not cause the termination of any more protocol objects than are needed. To develop these tests, we will add a fault injection mechanism to the stream, circuit, and channel mechanisms, in order to simulate the case where a programming error has caused a "panic," and verify that only the offending protocol object is closed. If we did not validate that the error containment mechanism worked, we would risk the possibility of shipping code without working error containment—and then having an error that otherwise would be minor in its effects turn into an opportunity for a network-wide DoS attack or worse.
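The fault-injection test described above could take roughly the following shape. This is a sketch, not Arti's actual containment mechanism: `Circuit`, `inject_fault`, and `dispatch` are hypothetical, and we assume here that panics are caught with the standard library's `std::panic::catch_unwind`, which only works when the program is built with unwinding panics rather than `panic = "abort"`.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical stand-in for one protocol object (a circuit).
pub struct Circuit {
    pub id: u32,
    pub open: bool,
    /// Fault injection switch: when set, handling a cell panics.
    pub inject_fault: bool,
    pub bytes_handled: usize,
}

impl Circuit {
    pub fn new(id: u32) -> Self {
        Self { id, open: true, inject_fault: false, bytes_handled: 0 }
    }

    /// Process one cell; simulates a programming error when asked to.
    fn handle_cell(&mut self, cell: &[u8]) {
        if self.inject_fault {
            panic!("injected fault on circuit {}", self.id);
        }
        self.bytes_handled += cell.len();
    }
}

/// Deliver a cell to every open circuit, containing any panic to the
/// circuit that raised it.
pub fn dispatch(circuits: &mut [Circuit], cell: &[u8]) {
    for circ in circuits.iter_mut().filter(|c| c.open) {
        let result = panic::catch_unwind(AssertUnwindSafe(|| circ.handle_cell(cell)));
        if result.is_err() {
            // Close only the offending circuit; the others keep running.
            circ.open = false;
        }
    }
}
```

A test built on this would enable `inject_fault` on one circuit, call `dispatch`, and then check that exactly that circuit was closed while its neighbours continued to process cells.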
We will also be writing tests using:
• Existing automated “fuzzing” tools to search for inputs that can cause our program to crash. We will apply tools like `cargo-fuzz`, which is already used for Arti's existing input-handling code, to all of the new input-parsing code we write for this project. We will apply it to types of code that Arti does not currently expose for fuzzing—notably, our connection and circuit state machines. If we were to ship unfuzzed parser state machines, we would significantly raise the odds of an attacker discovering an unpatched vulnerability and using it to attack the network. (Attackers commonly use fuzzers themselves to find software vulnerabilities.)
- Existing automated “fuzzing” tools to search for inputs that can cause our program to crash. We will apply tools like `cargo-fuzz`, which is already used for Arti's existing input-handling code, to all of the new input-parsing code we write for this project. We will apply it to types of code that Arti does not currently expose for fuzzing—notably, our connection and circuit state machines. If we were to ship unfuzzed parser state machines, we would significantly raise the odds of an attacker discovering an unpatched vulnerability and using it to attack the network. (Attackers commonly use fuzzers themselves to find software vulnerabilities.)
Once these tests are in place, we will:
• Conduct testing and fuzzing on Arti, using the tools mentioned above. If we find errors or potential security issues as a result of running these tests, we will locate and resolve the bugs responsible.
- Conduct testing and fuzzing on Arti, using the tools mentioned above. If we find errors or potential security issues as a result of running these tests, we will locate and resolve the bugs responsible.
We believe that this work is essential in order to ensure that we are delivering a secure project. Since programming mistakes can't be avoided completely, it's important to limit their impact and minimize the odds that hostile parties find them before they are fixed. Because Tor is a tool used by human rights defenders, journalists, minority communities, and marginalized populations that are the target of surveillance and attempted deanonymizing attacks online, delivering an insecure Arti implementation would put these users at great risk. To remove this Activity would be to downgrade the confidence we have in the security of our tool.
Even if we are lucky enough not to have any relevant programming errors, by carrying out these mitigation strategies and being transparent about our process, we will increase confidence among relay operators that they will not lose security by migrating to Arti and increase confidence among users that the move to Arti will not decrease the efficacy of privacy and security properties they rely on when using Tor.
...
...
@@ -159,7 +166,10 @@ O3.4 Adjust to bring in alignment, as necessary, specification documents for all
We will follow the above practices to keep up-to-date the specification documents for all technologies and/or protocols being updated under this project, along with documentation for their security, privacy, and design requirements and features.
Our current specifications are maintained in our specifications and proposals repository. The specifications most relevant to Tor relay operation are: `tor-spec`, `cert-spec`, `dir-spec`, `dos-spec`, `ext-orport-spec` (for bridges), and `padding-spec`. Not every specification listed will need refinement or editing during the course of this project, but all of these specifications will be referenced and reviewed.
Additionally, some documents in the `proposals` subdirectory of the repository mentioned above (those marked with "FINISHED" status in `000-index.txt`) are proposals that are implemented in some version of Tor; the proposals themselves still need to be merged into the specifications proper. We will ensure that any of these proposals that are relevant to the work completed in this project are merged into the appropriate specification.
Objective 4: Arti relay implementation performs as well or better than C implementation
**Objective 4: Arti relay implementation performs as well or better than C implementation**
To achieve this Objective, we must profile and verify Arti relay performance and ensure that the Arti relay implementation performs as well as or better than the C implementation of Tor.
O4.1 Ensure Arti performance is measurable: In this Activity, we will port or adapt the necessary tools for us to measure Arti’s relay performance using Shadow, a tool that allows us to simulate the real Tor network. This work allows us to measure the performance of Arti reproducibly and reliably, without conducting experiments on the live Tor network (thus, protecting users from any negative impacts). We will ensure that it is possible to use data directly from Arti to analyze and reason about performance characteristics of the application.
...
...
@@ -204,7 +214,8 @@ The algorithms that we will need to re-adjust based on work in this Activity:
• Round-trip time congestion control (RTTCC) and Conflux: RTTCC is an algorithm to improve speed and reduce memory requirements for fast Tor relays by reducing queue lengths. Conflux is a dynamic traffic-splitting approach that assigns traffic to an overlay path based on its measured latency. Together, congestion control and Conflux are used to provide performance improvements. (In the C implementation, RTTCC is implemented principally via the `src/core/or/congestion_control*.c` files. Conflux is principally implemented in `src/core/or/conflux*.c`.)
◦ Congestion control and Conflux are already implemented in Arti for a specific part of the Tor protocol. We will need to complete some minor adaptation, tuning, and refactoring to both in order to maximize performance and adjust to work completed above in this Activity.
Objective 5: The Tor network is significantly transitioned to Arti Relays
**Objective 5: The Tor network is significantly transitioned to Arti Relays**
In this Objective we will create the test network needed for testing all work created in this project, create a framework for evaluating the transition from the C implementation to Rust, develop any needed tools for relay operators to migrate their relays, and engage the community in a public campaign to encourage transition to Arti. This Objective is not about transitioning all relays to Arti—as stated above, the Tor Project does not control the volunteer relay operators and cannot force them to transition—but is about making a concerted effort to encourage a significant number of operators to transition through support and easy-to-use migration tools.
O5.1 Create and maintain a test network with all Tor relays: This Activity will take place at the beginning of the project, as the test network will be used across all Objectives to test, tune, and verify the work conducted. To create a test network, we will use virtual machines to deploy an entire Tor network using test versions of relays using iterative versions of Arti. We’ll also deploy test network support infrastructure, like the metrics pipeline, Prometheus, and log monitoring. This Activity will facilitate all other work in this project as well as the ability to collect information necessary for the Monitoring & Evaluation Plan.