Alexander Færøy · a1a4b621
--- a/org/roadmaps/CoreTor/PerformanceMetrics.md
+++ b/org/roadmaps/CoreTor/PerformanceMetrics.md
 [[PageOutline]]

-= Metrics Definitions =
+# Metrics Definitions

 The following metrics are meant to be used in performance and scalability tuning, development, and research. We are attempting to capture a representative baseline and a consistent data visualization methodology, and to ensure that we are aware of which metrics require new data collection to produce.

 Our metrics are broken down into five categories: latency, throughput, reliability, capacity, and balancing.

-== Latency Metrics ==
+## Latency Metrics

-* **CDF-TTFB**: Cumulative distribution function of the time-to-first-byte of a 5MB download. [https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf See page 2 of this report]. 
+* **CDF-TTFB**: Cumulative distribution function of the time-to-first-byte of a 5MB download. [See page 2 of this report](https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf). 
  * A good CDF-TTFB should look like a cliff (very little performance variance in times) and this cliff should be close to the origin of the graph (very fast response times overall).
  * A bad CDF-TTFB will look like a long, slow climb (high variance in performance and lots of slow results), and be very far from the origin of the graph (slow overall/average case performance).
 * **CDF-RTT**: This is the CDF of round trip times to an HTTP request/response echo server.
  * XXX: Aggregating this metric and graphing it over time seems challenging, especially since we want to capture how individual circuits change over time.

-== Throughput Metrics ==
+## Throughput Metrics

-* **CDF-TTLB**: Cumulative distribution function of the time-to-last-byte of a 5MB download. [https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf See page 3 of this report]
+* **CDF-TTLB**: Cumulative distribution function of the time-to-last-byte of a 5MB download. [See page 3 of this report](https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf)
  * Good and bad results for this CDF have the same characteristics as the CDF-TTFB graph, but this graph shows us the performance of the entire download overall.
-* **CDF-DL**: This is the CDF of the average bandwidth of the second half of a 5Mb download, [https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf similar to page 4 of this report]
+* **CDF-DL**: This is the CDF of the average bandwidth of the second half of a 5Mb download, [similar to page 4 of this report](https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf)
  * Good and bad results for this CDF have the same characteristics as the CDF-TTFB and CDF-TTLB graphs, but this graph shows us the distribution of the steady-state throughput of the network for very long downloads.

-== Reliability Metrics ==
+## Reliability Metrics

-* **Failure rainbow**: The rate of stream timeouts and other connection failures [https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf similar to page 1 of this report]. XXX: Circuit timeouts and circuit failures should appear here somehow. Karsten also mentioned new failure types.
+* **Failure rainbow**: The rate of stream timeouts and other connection failures [similar to page 1 of this report](https://people.torproject.org/~karsten/volatile/onionperf-metrics-2019-02-02.pdf). XXX: Circuit timeouts and circuit failures should appear here somehow. Karsten also mentioned new failure types.
  * A good failure rainbow (i.e., one that indicates healthy network performance) has a low number of stream timeouts, no user-facing failures, and no failures during download. It should look like a single color, or be largely dominated by a single color, and not like an actual rainbow.
  * A bad failure rainbow looks more like a smeared out actual rainbow. It has lots of failure counts for lots of different colors. The onion service rainbow from that report indicates that onion services are less healthy performance-wise than the public server. To emphasize that Failure Rainbows are bad, only vomit-related color tones should be used.
 * **Circuit timeout rate**: The frequency of circuit build timeouts, observed through the BUILDTIMEOUT_SET control port event or by manual counting.
  * The circuit timeout rate should consistently match the cbtquantile consensus parameter (XXX: This could be combined with the Failure Rainbow metric).

-== Capacity Metrics ==
+## Capacity Metrics

 The following metrics come from relay extrainfo descriptors. Because relays choose different time intervals for the values in these metrics, we must use much larger on/off time windows for experiments that need these metrics (irl suggests 72 hour cycles, using only the middle 24 hours for results):

@@ -38,7 +38,7 @@ The following metrics come from relay extrainfo descriptors. Because relays choo
  * An unhealthy network operates with an average capacity that is very close to its peak possible throughput. This means most of its streams are in a congested state -- latency will build up and other performance/health metrics should show signs of stress.
 * **Bottleneck Utilization**: Compute the Guard, Exit, and Total Utilization levels at each time point, and choose the highest utilization value of the three.

-== Balancing Metrics ==
+## Balancing Metrics

 * **CDF-Relay-Utilization**: Similar to the Per-flag Utilization, it is also possible to derive a CDF of the distribution of the average read/write history divided by peak advertised bandwidth, for each relay in the network. This metric would show us what the distribution of utilization is across the network. It can also be broken down per-flag (so that there are separate CDFs generated for Guard, Middle, Exit, and Guard+Exit flagged relays).
  * A healthy network will be well load-balanced where all relays tend to be operating with similar amounts of reserve capacity in proportion to their total. Thus, this CDF should be narrow and cliff-like, and the cliff should be centered at the same location as the overall Utilization relative to its total (each relay is loaded the same as the overall network).
@@ -46,7 +46,7 @@ The following metrics come from relay extrainfo descriptors. Because relays choo
   * A healthy, balanced network will have a cliff in this CDF around 1.0. This means that all relays have the same stream bandwidth when carrying streams.
   * An unhealthy, unbalanced network will have a long, slow sloping hill, and/or lots of lumps below 1.0 and far above 1.0.

-= Data visualization Issues =
+# Data visualization Issues

 For all of the above metrics, we started with the assumption that the full CDF is what we want, to capture both the best/worst case and the full distribution of the values. However, one major downside of CDFs is that they are difficult to use to represent changes over time. Each CDF graph is a snapshot of performance over some time window.

@@ -58,7 +58,7 @@ This leads to the following visualization questions:

 Having some way to look at these metrics over time, with their full distribution, will vastly improve our ability to understand performance cycles in the network, as well as its reaction to events such as massive user arrival.

-= Sources of Model Error =
+# Sources of Model Error

 In addition to visualization problems, our metrics currently suffer from the following major sources of model error, causing us to fail to accurately represent actual user experience:

@@ -72,7 +72,7 @@ In addition to visualization problems, our metrics currently suffer from the fol
 3. Torperf does not have an accurate browser model
 * Browser-specific performance improvements (and regressions) due to Optimistic Data, HTTP/2, HTTP Prefetch, and other browser properties cannot be measured by Torperf 

-= Metrics that require new collection methodology =
+# Metrics that require new collection methodology

 1. **CDF-RTT**
 * Requires multiple HTTP Echo Servers in well-chosen geographical locations; **OR** some kind of hack, like connecting to IP:Port pairs forbidden by Exit policy and timing the rejection response (maybe this is better?)
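
As a rough illustration of how RTT samples gathered by either method could be turned into the CDF this page describes, here is a minimal Python sketch. It is not taken from Torperf or OnionPerf: the function names `empirical_cdf` and `quantile` and the sample values are hypothetical, and the nearest-rank quantile is just one simple choice among several.

```python
import math
from typing import List, Tuple


def empirical_cdf(samples: List[float]) -> List[Tuple[float, float]]:
    """Return (value, cumulative fraction) pairs for the sorted samples."""
    ordered = sorted(samples)
    n = len(ordered)
    # The i-th smallest sample (0-based) has a fraction (i + 1) / n of
    # all samples at or below it.
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]


def quantile(samples: List[float], q: float) -> float:
    """Nearest-rank quantile: smallest sample with CDF >= q, for 0 < q <= 1."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered))
    return ordered[rank - 1]


# Made-up RTT samples in seconds, for illustration only.
rtt_samples = [0.3, 0.1, 0.2, 0.4]
cdf_points = empirical_cdf(rtt_samples)

# A "cliff" near the origin means a small spread between low and high
# quantiles, and small absolute values.
spread = quantile(rtt_samples, 0.9) - quantile(rtt_samples, 0.1)
```

Comparing a low and a high quantile like this is one way to reduce a CDF to a number or two that can be tracked over time, at the cost of losing the full distribution.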