This document describes how to produce the graphs that will be on the CAPTCHA Monitor's dashboard at [dashboard.captcha.wtf](https://dashboard.captcha.wtf/). If you have any suggestions or feedback, please mention them under [ticket #41](https://gitlab.torproject.org/woswos/CAPTCHA-Monitor/-/issues/41) of this repository.

The following graph style will be used for all graphs unless otherwise specified:

* Type
  * Line chart
* Axes
  * X-axis: The dates of the last 30*24 consensuses (last 30 days), each tick
    representing a single consensus (the plotting tool automatically omits
    overlapping labels but keeps the data points in the chart)
  * Y-axis: Percentage values from 0% to 100% on a linear scale
* Sample Graph
  ![graph-style](uploads/e62c2716de6cd64e3a6bf949d1bd0726/graph-style.png)

**Table of contents**

- [Graphs for understanding CAPTCHA rates related to user decisions](#graphs-for-understanding-captcha-rates-related-to-user-decisions)
  - [Weighted CAPTCHA rate by method](#weighted-captcha-rate-by-method)
  - [Weighted CAPTCHA rate by connection security](#weighted-captcha-rate-by-connection-security)
  - [Weighted CAPTCHA rate by HTTP request quantity](#weighted-captcha-rate-by-http-request-quantity)
  - [Weighted CAPTCHA rate by CDN provider](#weighted-captcha-rate-by-cdn-provider)
- [Graphs for understanding the overall network status](#graphs-for-understanding-the-overall-network-status)
  - [Probability of a Tor client receiving CAPTCHA](#probability-of-a-tor-client-receiving-captcha)
  - [Weighted CAPTCHA rate by IP version](#weighted-captcha-rate-by-ip-version)
  - [Weighted CAPTCHA rate by exit probability](#weighted-captcha-rate-by-exit-probability)
  - [Weighted CAPTCHA rate by exit relay age](#weighted-captcha-rate-by-exit-relay-age)
  - [Weighted CAPTCHA rate by exit relay location](#weighted-captcha-rate-by-exit-relay-location)
- [Graphs about understanding the Cloudflare firewall](#graphs-about-understanding-the-cloudflare-firewall)
  - [CAPTCHA rate by Cloudflare security level/firewall settings](#captcha-rate-by-cloudflare-security-levelfirewall-settings)
  - [CAPTCHA rate by traffic origin](#captcha-rate-by-traffic-origin)
  - [Weighted CAPTCHA rate by exit relay age](#weighted-captcha-rate-by-exit-relay-age-1)
  - [Weighted CAPTCHA rate by exit relay location](#weighted-captcha-rate-by-exit-relay-location-1)
  - [Code injection rate](#code-injection-rate)
- [Graphs about Tor Browser centric data](#graphs-about-tor-browser-centric-data)
  - [Weighted CAPTCHA rate by Tor Browser version](#weighted-captcha-rate-by-tor-browser-version)
  - [Weighted CAPTCHA rate by Tor Browser security level](#weighted-captcha-rate-by-tor-browser-security-level)
- [Graphs about individual exit relays](#graphs-about-individual-exit-relays)
  - [Overall CAPTCHA rate](#overall-captcha-rate)
  - [CAPTCHA rate by CDN provider](#captcha-rate-by-cdn-provider)
# Graphs for understanding CAPTCHA rates related to user decisions

## Weighted CAPTCHA rate by method

### Purpose

Understanding the effect of using different methods (for example, web
browsers like Tor Browser, Firefox over Tor, Brave, etc.) on the probability
of seeing a CAPTCHA while browsing the internet over the public Tor network.
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Join the measurements and relay data using the relay fingerprints.
      Typically each relay maps to multiple measurements.
   6. Distribute the joined data into bins based on the value of the `method` field
   7. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.7.2.2}}{\text{Step 2.7.2.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   8. Plot the weighted percentage values for each `method` bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
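The per-bin calculation in Steps 2.6 and 2.7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the measurement keys `method`, `exit_fingerprint`, and `is_captcha_found` and the function name are hypothetical stand-ins for whatever the CAPTCHA Monitor API returns, and measurements whose relay is missing from the consensus are simply ignored.

```python
from collections import defaultdict


def weighted_captcha_rate(measurements, exit_probabilities):
    """Return {method: weighted CAPTCHA percentage} for one consensus.

    measurements: iterable of dicts with the hypothetical keys "method",
        "exit_fingerprint", and "is_captcha_found" (0 or 1).
    exit_probabilities: dict mapping relay fingerprint -> exit probability
        (the values obtained in Step 2.3).
    """
    # method -> fingerprint -> [total measurements, measurements with CAPTCHA]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        per_relay = counts[m["method"]][m["exit_fingerprint"]]
        per_relay[0] += 1
        if m["is_captcha_found"] == 1:
            per_relay[1] += 1

    total_weight = sum(exit_probabilities.values())
    rates = {}
    for method, per_relay in counts.items():
        weighted_sum = 0.0
        for fingerprint, weight in exit_probabilities.items():
            total, found = per_relay.get(fingerprint, (0, 0))
            # Relays without measurements count as 0%, as the steps specify.
            percentage = found / total * 100 if total else 0.0
            weighted_sum += weight * percentage
        rates[method] = weighted_sum / total_weight if total_weight else 0.0
    return rates
```

Dividing by the sum of the exit probabilities keeps the result within 0% to 100% even if the probabilities passed in do not add up to exactly 1.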
### Related metrics

- [(2)](home#metrics-to-track) How do the HTTP request headers affect
  Cloudflare's decision-making mechanism? [ticket:33010#comment:4]
  - [(2.1)](home#metrics-to-track) Is there a difference between using the
    actual Tor Browser itself and tor-browser-selenium in terms of the HTTP headers?
  - [(2.2)](home#metrics-to-track) How does Cloudflare react differently if the
    browser doesn't support alt-svc headers? [ticket:32915]
- [(3)](home#metrics-to-track) How do different browsers with different
  User Agents get affected? [ticket:33010#comment:2], [ticket:32924], [ticket:31404]
  - [(3.1)](home#metrics-to-track) Is there a difference between using a web
    browser or fetching web pages via cURL or other HTTP libraries?
- [(7)](home#metrics-to-track) How does the time of day affect
  Cloudflare's blocking mechanism? Does the day of the week or the time
  of day matter? [ticket:33010#comment:15]
- [(15)](home#metrics-to-track) If browsers that should not face a CAPTCHA do
  face one, why does this happen?
- [(16)](home#metrics-to-track) How do the observed patterns in the results
  change over time? [ticket:33010]
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by connection security

### Purpose

Understanding the effect of using HTTPS versus plain HTTP on the probability
of seeing a CAPTCHA.
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into 2 bins based on whether the
      `is_https` field of each entry is `1` or `0`
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
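A minimal sketch of the join and split in Steps 2.6 and 2.7, assuming that both the measurements and the URL list are available as lists of dictionaries and that they share a `url` key. The key names other than `is_https` and the function name are hypothetical, not the actual API schema.

```python
def split_by_https(measurements, url_list):
    """Join measurements with URL metadata and split them on `is_https`.

    measurements: iterable of dicts with a hypothetical "url" key.
    url_list: iterable of dicts with a hypothetical "url" key and the
        `is_https` field (1 or 0).
    Returns {1: [...], 0: [...]} where each entry is the measurement merged
    with the metadata of its URL.
    """
    metadata = {entry["url"]: entry for entry in url_list}
    bins = {1: [], 0: []}
    for measurement in measurements:
        entry = metadata.get(measurement["url"])
        if entry is None:
            # No metadata for this URL, so the measurement cannot be binned.
            continue
        joined = {**measurement, **entry}
        bins[1 if entry["is_https"] == 1 else 0].append(joined)
    return bins
```

Each of the two resulting bins can then go through the per-relay counting and weighting described in Step 2.8.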
### Related metrics

- [(14)](home#metrics-to-track) Is there a difference if the origin server has
  an SSL certificate or not?
  - [(14.1)](home#metrics-to-track) Does the blocking change if the SSL
    certificate is issued by Cloudflare or by another entity?
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by HTTP request quantity

### Purpose

Understanding the effect of connecting to websites that require single or
multiple HTTP requests to load on the probability of seeing a CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into 2 bins based on whether the
      `requires_multiple_reqs` field of each entry is `1` or `0`
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
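The final merge in Step 3 can be pictured as turning one data point per consensus into one time series per bin. The sketch below assumes the per-consensus results are kept as `(valid_after, {bin_label: percentage})` pairs; that layout and the function name are illustrative choices, not something the steps prescribe.

```python
from datetime import datetime


def merge_series(per_consensus):
    """Merge per-consensus results into one time series per bin.

    per_consensus: list of (valid_after, {bin_label: percentage}) pairs,
        where valid_after is a datetime.
    Returns {bin_label: [(valid_after, percentage or None), ...]} with the
    points ordered by valid_after.
    """
    labels = sorted({label for _, values in per_consensus for label in values})
    series = {label: [] for label in labels}
    for valid_after, values in sorted(per_consensus, key=lambda item: item[0]):
        for label in labels:
            # None marks a consensus for which this bin produced no value.
            series[label].append((valid_after, values.get(label)))
    return series


# Two consensuses, one hour apart, for the `requires_multiple_reqs` bins.
example = merge_series([
    (datetime(2020, 7, 1, 13), {"1": 40.0}),
    (datetime(2020, 7, 1, 12), {"1": 30.0, "0": 10.0}),
])
```

Keeping `None` for missing points lets the plotting tool leave a gap instead of drawing a misleading 0%.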
### Related metrics

- [(13)](home#metrics-to-track) Is there a difference between websites that load
  resources from third-party sources and websites that contain all resources on
  the origin server? [ticket:33010#comment:6]
  - [(13.1)](home#metrics-to-track) How do users of websites get affected if
    the main website is not fronted by Cloudflare, but some of the resources are
    fetched from a Cloudflare-fronted web server? [ticket:33010#comment:6], [ticket:15450]
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by CDN provider

### Purpose

Understanding the effect of connecting to websites that use CDN providers such
as Cloudflare, Akamai, Amazon Cloudfront, etc. on the probability of seeing a
CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into bins based on the value of the `cdn_provider` field
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
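Step 2.2, which every graph in this document shares, can be sketched as a small parser. The sketch assumes the standard (non-microdescriptor) consensus flavor, where the `r` line reads `r nickname identity digest date time IP ORPort DirPort`, and it interprets "running exit relay" as a relay with the `Exit` and `Running` flags and without `BadExit`. Both points, as well as the function name, are assumptions of this example.

```python
import base64


def parse_exit_relays(consensus_text):
    """Extract running exit relays from the text of a consensus document."""
    relays = []
    current = None
    for line in consensus_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r":
            # The identity is base64 without padding; restore the padding
            # and convert it to the usual hex fingerprint.
            padded = parts[2] + "=" * (-len(parts[2]) % 4)
            current = {
                "fingerprint": base64.b64decode(padded).hex().upper(),
                "address": parts[6],
                "bandwidth": 0,
                "flags": set(),
            }
            relays.append(current)
        elif current is not None and parts[0] == "s":
            current["flags"] = set(parts[1:])
        elif current is not None and parts[0] == "w":
            for item in parts[1:]:
                key, _, value = item.partition("=")
                if key == "Bandwidth":
                    current["bandwidth"] = int(value)
    return [
        relay for relay in relays
        if {"Exit", "Running"} <= relay["flags"] and "BadExit" not in relay["flags"]
    ]
```

A maintained parser such as the one in the Stem library may be a better fit in practice; the hand-written version only shows which fields the steps rely on.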
<!-- ####################################################################### -->
<!-- ####################################################################### -->

# Graphs for understanding the overall network status

## Probability of a Tor client receiving CAPTCHA

### Purpose

Understanding the probability that a Tor client, choosing an exit relay in the
normal weighted way, receives a CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Repeat the following for each running exit relay entry within the consensus:
      1. Count the total number of measurements that were completed using this
         exit relay
      2. Count the total number of measurements that were completed using this
         exit relay and have the `is_captcha_found` field set to `1`
      3. Calculate the percentage of measurements that received a CAPTCHA using
         $`\frac{\text{Step 2.5.2}}{\text{Step 2.5.1}} \times 100`$ (assume `0%` if an exit relay
         exists in the consensus but there are no corresponding measurements)
   6. Calculate the weighted average of the percentage values (obtained in
      Step 2.5.3) using exit probabilities (obtained in Step 2.3) as the
      weights
   7. Record the weighted average of the percentages under the consensus's
      `valid-after` timestamp
3. Plot the weighted percentage values for each consensus on the Y-axis and
   the `valid-after` timestamps on the X-axis
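Step 2.3 is the piece that turns consensus bandwidths into exit probabilities. The following sketch follows my reading of the linked Onionoo code: a relay with the `Exit` flag (and without `BadExit`) gets its bandwidth scaled by `Wed` if it also has the `Guard` flag and by `Wee` otherwise, and the result is normalized over all exits. Treat this reading, the weight scale of 10000, and the function name as assumptions to verify against the linked source.

```python
def exit_probabilities(relays, bandwidth_weights, weight_scale=10000.0):
    """Return {fingerprint: exit probability} for one consensus.

    relays: iterable of dicts with "fingerprint", "bandwidth" (from the `w`
        line), and "flags" (a set of flag names from the `s` line).
    bandwidth_weights: dict parsed from the `bandwidth-weights` footer line,
        for example {"Wed": 5000, "Wee": 10000}.
    """
    wed = bandwidth_weights["Wed"] / weight_scale  # exit position, Guard+Exit relay
    wee = bandwidth_weights["Wee"] / weight_scale  # exit position, Exit-only relay
    weights = {}
    for relay in relays:
        flags = relay["flags"]
        if "Exit" not in flags or "BadExit" in flags:
            continue  # not usable in the exit position
        factor = wed if "Guard" in flags else wee
        weights[relay["fingerprint"]] = relay["bandwidth"] * factor
    total = sum(weights.values())
    return {fp: (w / total if total else 0.0) for fp, w in weights.items()}
```

Because the values are normalized, they sum to 1 across all exits of a consensus, which is what allows them to be used directly as the weights in Step 2.6.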
### Related metrics

- [(12)](home#metrics-to-track) What is the chance of a Tor client getting affected
  by Cloudflare's blocking practices when choosing a Tor exit node? [ticket:33010]
- [(17)](home#metrics-to-track) Is whether you get a CAPTCHA much more probabilistic
  and transient? [ticket:33010]
- [(18)](home#metrics-to-track) The chance that a Tor client, choosing an exit
  relay in the normal weighted fashion, will get hit by a CAPTCHA [ticket:33010]
## Weighted CAPTCHA rate by IP version

### Purpose

Understanding the effect of connecting to web servers
(and consequently exit relays) that support IPv4 vs IPv6 on the probability
of seeing a CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Obtain the "details document" from Onionoo and match the Onionoo data
      with the relay entries from the consensus using the relay fingerprints. The following query is
      recommended for obtaining the "details document":
      https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,exit_policy_v6_summary
   6. Distribute the exit relay entries from the consensus into 2 bins based on
      whether they support IPv6 exiting or not. This should be decided based on
      the `exit_policy_v6_summary` field obtained from the "details document"
   7. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
         1. Count the total number of measurements that were
            completed using this exit relay
         2. Count the total number of measurements that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.7.1.2}}{\text{Step 2.7.1.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      2. Calculate the weighted average of the percentage values (obtained in
         Step 2.7.1.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   8. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
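The split in Step 2.6 might look like the sketch below. It assumes the relay entries of the "details document" have already been loaded into a list of dictionaries, and it treats a relay as supporting IPv6 exiting whenever `exit_policy_v6_summary` is present and non-empty. That decision rule is an assumption of this example and should be checked against the Onionoo protocol documentation; the bin labels and the function name are hypothetical.

```python
def split_by_ipv6_exit_support(fingerprints, details_relays):
    """Split consensus exit relays into IPv6-capable and IPv4-only bins.

    fingerprints: fingerprints of the exit relays found in the consensus.
    details_relays: list of relay dicts taken from the Onionoo details
        document, each with "fingerprint" and, optionally,
        "exit_policy_v6_summary".
    """
    by_fingerprint = {relay["fingerprint"]: relay for relay in details_relays}
    bins = {"ipv6": [], "ipv4_only": []}
    for fingerprint in fingerprints:
        relay = by_fingerprint.get(fingerprint, {})
        # Assumption: a missing or empty summary means no IPv6 exiting.
        if relay.get("exit_policy_v6_summary"):
            bins["ipv6"].append(fingerprint)
        else:
            bins["ipv4_only"].append(fingerprint)
    return bins
```

Relays that appear in the consensus but not in the details document end up in the IPv4-only bin here; dropping them instead would be an equally defensible choice.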
### Related metrics

- [(1)](home#metrics-to-track) Does Cloudflare treat IPv4 and IPv6 addresses
  differently? [ticket:33010#comment:2]
- [(9)](home#metrics-to-track) How do specific exit nodes get affected by
  Cloudflare's blocking practices?
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by exit probability

### Purpose

Understanding the effect of using smaller or larger exit relays on the
probability of seeing a CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Distribute the exit relay entries from the consensus into 10 bins (each
      bin containing probability values between n and n+0.1) based on their
      exit probabilities (calculated in Step 2.3)
   6. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
         1. Count the total number of measurements that were
            completed using this exit relay
         2. Count the total number of measurements that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.6.1.2}}{\text{Step 2.6.1.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      2. Calculate the weighted average of the percentage values (obtained in
         Step 2.6.1.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   7. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
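For Step 2.5, a relay with exit probability `p` can be assigned to the bin whose lower edge is the largest multiple of 0.1 not exceeding `p`. A minimal sketch, with a hypothetical function name and with the upper edge (a probability of exactly 1) folded into the last bin:

```python
def bin_by_exit_probability(exit_probabilities, bin_count=10):
    """Group relay fingerprints into equal-width exit probability bins.

    exit_probabilities: dict mapping fingerprint -> exit probability in [0, 1].
    Returns a list of `bin_count` lists; bin i covers probabilities from
    i / bin_count up to (but not including) (i + 1) / bin_count.
    """
    bins = [[] for _ in range(bin_count)]
    for fingerprint, probability in exit_probabilities.items():
        # min() keeps a probability of exactly 1.0 inside the last bin.
        index = min(int(probability * bin_count), bin_count - 1)
        bins[index].append(fingerprint)
    return bins
```

Values that sit exactly on a bin edge can land on either side because of floating-point rounding, so this is a sketch rather than a precise specification of the edges.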
### Related metrics

- [(9)](home#metrics-to-track) How do specific exit nodes get affected by
  Cloudflare's blocking practices?
  - [(9.1)](home#metrics-to-track) Does the size/age/location of the exit node
    play a role? [ticket:33010#comment:15]
  - [(9.2)](home#metrics-to-track) Is it always the same Tor exit nodes that get
    blocked?
- [(11)](home#metrics-to-track) What fraction of the Tor exit nodes get affected
  by Cloudflare's blocking practices? [ticket:33010], [ticket:23840#comment:22]
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by exit relay age

### Purpose

Understanding the effect of using older or younger exit relays
(based on `first_seen` date) on the probability of seeing a CAPTCHA
### Steps to produce

1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Obtain the "details document" from Onionoo and match the Onionoo data
      with the relay entries from the consensus using the relay fingerprints. The following query is
      recommended for obtaining the "details document":
      https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,first_seen
   6. Calculate the age of the exit relays in days using the `first_seen` field
      of the "details document" and the `valid-after` timestamp of the consensus
      (`exit_age` = ceil_days(`valid-after` - `first_seen`))
   7. Distribute the exit relay entries from the consensus into
      `(max(exit_age) - min(exit_age)) / 365` bins based on their ages (calculated in Step 2.6)
   8. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
         1. Count the total number of measurements that were
            completed using this exit relay
         2. Count the total number of measurements that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.1.2}}{\text{Step 2.8.1.1}} \times 100`$ (assume `0%` if an
            exit relay exists in the consensus but there are no corresponding
            measurements)
      2. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.1.3) using exit probabilities (obtained in Step 2.3) as the
         weights
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
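Steps 2.6 and 2.7 can be sketched as follows. The sketch assumes both timestamps use the `YYYY-MM-DD hh:mm:ss` (UTC) form, and it places each relay in a 365-day-wide bin counted from the youngest relay, which is one possible reading of the bin count given in Step 2.7. The function names are hypothetical.

```python
import math
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def exit_age_days(valid_after, first_seen):
    """Return ceil_days(valid-after - first_seen) for two timestamp strings."""
    delta = (datetime.strptime(valid_after, TIMESTAMP_FORMAT)
             - datetime.strptime(first_seen, TIMESTAMP_FORMAT))
    return math.ceil(delta.total_seconds() / 86400)


def bin_by_age(ages, bin_size_days=365):
    """Group fingerprints into bins of `bin_size_days`, starting at min(age).

    ages: dict mapping fingerprint -> age in days.
    Returns {bin_index: [fingerprints]}, where bin 0 holds the youngest relays.
    """
    youngest = min(ages.values())
    bins = {}
    for fingerprint, age in ages.items():
        bins.setdefault((age - youngest) // bin_size_days, []).append(fingerprint)
    return bins
```

For example, a relay first seen 23 hours before `valid-after` has an age of 1 day, because partial days are rounded up.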
### Related metrics

- [(9)](home#metrics-to-track) How do specific exit nodes get affected by
  Cloudflare's blocking practices?
  - [(9.1)](home#metrics-to-track) Does the size/age/location of the exit node
    play a role? [ticket:33010#comment:15]
  - [(9.2)](home#metrics-to-track) Is it always the same Tor exit nodes that
    get blocked?
<!-- ####################################################################### -->

## Weighted CAPTCHA rate by exit relay location

### Purpose

Understanding the effect of the exit relay's physical location
on the probability of seeing a CAPTCHA. This graph will show the top 10 countries
with the highest CAPTCHA rates.
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
1. Get consensuses from CollecTor
|
|
|
|
2. Repeat the following for each consensus:
|
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
|
consensus
|
|
|
|
5. Obtain the "details document" from Onionoo and match the Onionoo data
|
|
|
|
with the relay entries from consensus using the relay fingerprints. The following query is
|
|
|
|
recommended for obtaining the "details document":
|
|
|
|
https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,country_name
|
|
|
|
6. Distribute the exit relay entries from the consensus into bins based on
|
|
|
|
their `country_name` value (obtained in Step 2.5)
|
|
|
|
7. Repeat the following for each bin:
|
|
|
|
1. Repeat the following for each exit relay in the bin:
|
|
|
|
1. Count the total number of measurements that were
|
|
|
|
completed using this exit relay
|
|
|
|
2. Count the total number of measurements that were
|
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
|
set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 2.7.1.2}{Step 2.7.1.1} \times 100`$ (Assume `0%` if an
|
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
|
measurements)
|
|
|
|
2. Calculate the weighted average of the percentage values (obtained in
|
|
|
|
Step 2.7.1.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
|
scaling factor
|
|
|
|
8. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
|
3. Merge the graphs with the top 10 highest percentage values and discard the rest
|
|
|
|
(or keep if you want to have them as well)
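The per-country weighting and the top 10 selection described above can be sketched as follows. This is a minimal sketch, not the dashboard's actual code: the dictionary keys `fingerprint` and `exit_probability` are hypothetical names, and dividing by the bin's total exit probability is an assumption about what "weighted average" means here.

```python
from collections import defaultdict


def weighted_rate_by_country(relays, measurements):
    """Return {country_name: weighted CAPTCHA percentage} for one consensus.

    relays: dicts with hypothetical keys 'fingerprint', 'country_name',
            'exit_probability' (Steps 2.2, 2.3 and 2.5).
    measurements: dicts with 'fingerprint' and 'is_captcha_found' (Step 2.4).
    """
    totals = defaultdict(int)
    captchas = defaultdict(int)
    for m in measurements:
        totals[m["fingerprint"]] += 1
        if m["is_captcha_found"] == 1:
            captchas[m["fingerprint"]] += 1

    weighted_sum = defaultdict(float)
    weight_sum = defaultdict(float)
    for r in relays:
        fp = r["fingerprint"]
        # Assume 0% when a relay is in the consensus but has no measurements.
        pct = 100.0 * captchas[fp] / totals[fp] if totals[fp] else 0.0
        weighted_sum[r["country_name"]] += pct * r["exit_probability"]
        weight_sum[r["country_name"]] += r["exit_probability"]

    # Assumption: normalize by the total exit probability of the bin.
    return {c: weighted_sum[c] / weight_sum[c]
            for c in weight_sum if weight_sum[c] > 0}


def top_countries(rates, n=10):
    """Keep the n countries with the highest weighted percentage."""
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `top_countries(weighted_rate_by_country(relays, measurements))` for each consensus yields the series to plot against the `valid-after` timestamp.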
|
|
|
|
|
|
|
|
### Related metrics
|
|
|
|
- [(9)](home#metrics-to-track) How do specific exit nodes get affected by
|
|
|
|
Cloudflare's blocking practices?
|
|
|
|
- [(9.1)](home#metrics-to-track) Does the size/age/location of the exit node
|
|
|
|
play a role? [ticket:33010#comment:15]
|
|
|
|
- [(9.2)](home#metrics-to-track) Is it always the same Tor exit nodes that get
|
|
|
|
blocked?
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
# Graphs about understanding the Cloudflare firewall
|
|
|
|
## CAPTCHA rate by Cloudflare security level/firewall settings
|
|
|
|
### Purpose
|
|
|
|
Understanding the effect of different Cloudflare security levels and firewall
|
|
|
|
configurations on the probability of seeing a CAPTCHA.
|
|
|
|
|
|
|
|
We have a few different domains to test different configurations. Here they are:
|
|
|
|
- captcha.wtf
|
|
|
|
- IPv4 only domain, no additional Cloudflare firewall rules
|
|
|
|
- yearlight.buzz
|
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "JS Challenge" for
|
|
|
|
traffic originating from the Tor network
|
|
|
|
- bottomlesspit.xyz
|
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
|
traffic originating from the Tor network
|
|
|
|
- broccolipizza.monster
|
|
|
|
- IPv4 only domain, Cloudflare firewall is set to block all traffic
|
|
|
|
originating from the Tor network
|
|
|
|
- exit11.online
|
|
|
|
- IPv6 only domain, no additional Cloudflare firewall rules
|
|
|
|
- icanhazcaptcha.xyz
|
|
|
|
- IPv6 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
|
traffic originating from the Tor network
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
|
with a granularity of 1 hour.
|
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were *completed
|
|
|
|
using the domains specified above* and during the chosen date range
|
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
|
the following for each iteration:
|
|
|
|
1. Distribute the measurements that were completed within the interval of
|
|
|
|
this iteration into bins based on `url` field's value
|
|
|
|
2. Repeat the following for each bin:
|
|
|
|
1. Count the total number of measurements in this bin
|
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
|
`is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 5.2.2}{Step 5.2.1} \times 100`$ (Leave this bin's value
|
|
|
|
empty if there are no corresponding measurements)
|
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
|
time of this interval in the X-axis
|
|
|
|
5. Merge the graphs created for each iteration
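The per-interval binning in the steps above could look like the sketch below. It handles a single interval; the `timestamp` key is a hypothetical name for the completion time of a measurement, while `url` and `is_captcha_found` are the fields named in the steps.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def captcha_rate_by_url(measurements, start, end, urls):
    """Percentage of CAPTCHA results per URL for start <= timestamp < end.

    Returns None for a URL without measurements so that the plot can
    leave that point empty instead of drawing 0%.
    """
    total = defaultdict(int)
    found = defaultdict(int)
    for m in measurements:
        if start <= m["timestamp"] < end:
            total[m["url"]] += 1
            if m["is_captcha_found"] == 1:
                found[m["url"]] += 1
    return {u: (100.0 * found[u] / total[u] if total[u] else None)
            for u in urls}
```

Returning `None` rather than `0.0` keeps "no data" distinguishable from "no CAPTCHAs", which matches the instruction to leave the bin's value empty.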
|
|
|
|
|
|
|
|
### Related metrics
|
|
|
|
<!-- - [(3.4)](home#metrics-to-track) How does Cloudflare react to browsers with
|
|
|
|
and without JavaScript enabled? [ticket:31404] -->
|
|
|
|
- [(6)](home#metrics-to-track) How do different security levels of Cloudflare
|
|
|
|
affect the blocking mechanism? [ticket:33010#comment:5]
|
|
|
|
- [(6.1)](home#metrics-to-track) Do some of the Cloudflare security levels
|
|
|
|
block users immediately without presenting a CAPTCHA challenge at all?
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## CAPTCHA rate by traffic origin
|
|
|
|
### Purpose
|
|
|
|
Understanding how Cloudflare treats Tor traffic vs. non-Tor traffic (this one
states the obvious, but it is still good to have data to back it up)
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
|
with a granularity of 1 hour.
|
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed during the
|
|
|
|
chosen date range
|
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
|
fields
|
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
|
the following for each iteration:
|
|
|
|
1. Distribute the measurements that were completed within the interval of
|
|
|
|
this iteration into 2 bins based on `method` field's value. Put the methods
|
|
|
|
without "tor" (ex. "firefox") into the `Non-Tor Traffic` bin and the rest
|
|
|
|
(ex. "firefox_over_tor") into the `Tor Traffic` bin.
|
|
|
|
2. Repeat the following for each bin:
|
|
|
|
1. Count the total number of measurements in this bin
|
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
|
`is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 5.2.2}{Step 5.2.1} \times 100`$ (Leave this bin's value
|
|
|
|
empty if there are no corresponding measurements)
|
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
|
time of this interval in the X-axis
|
|
|
|
5. Merge the graphs created for each iteration
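The two-bin split by `method` described above can be sketched as a simple substring check. This is only an illustration: it assumes every Tor-based method name contains "tor" (as in "firefox_over_tor") and that no non-Tor method name happens to contain that substring.

```python
def traffic_bin(method):
    """Map a measurement's `method` value to one of the two bins."""
    return "Tor Traffic" if "tor" in method.lower() else "Non-Tor Traffic"


def split_by_origin(measurements):
    """Distribute the measurements of one interval into the two bins."""
    bins = {"Tor Traffic": [], "Non-Tor Traffic": []}
    for m in measurements:
        bins[traffic_bin(m["method"])].append(m)
    return bins
```

If method names are ever added that contain "tor" by coincidence, an explicit list of Tor-based methods would be a safer choice than substring matching.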
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## Weighted CAPTCHA rate by exit relay age
|
|
|
|
### Purpose
|
|
|
|
Understanding how quickly Cloudflare blocks the newer relays and if there is a
|
|
|
|
different treatment for older relays
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
1. Get consensuses from CollecTor
|
|
|
|
2. Repeat the following for each consensus:
|
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
|
consensus
|
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
6. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
7. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
|
fields
|
|
|
|
8. Obtain the "details document" from Onionoo and match the Onionoo data
|
|
|
|
with the relay entries from consensus using the relay fingerprints. The following query is
|
|
|
|
recommended for obtaining the "details document":
|
|
|
|
https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,first_seen
|
|
|
|
9. Calculate the age of the exit relays in days using the `first_seen` field
|
|
|
|
of the "details document" and `valid-after` timestamp of the consensus
|
|
|
|
(`exit_age` = ceil_days(`valid-after` - `first_seen`))
|
|
|
|
10. Distribute the exit relay entries from the consensus into
|
|
|
|
`(max(exit_age) - min(exit_age)) / 365` bins based on their ages
|
|
|
|
(calculated in Step 2.9)
|
|
|
|
11. Repeat the following for each bin:
|
|
|
|
1. Repeat the following for each exit relay in the bin:
|
|
|
|
1. Count the total number of measurements that were
|
|
|
|
completed using this exit relay
|
|
|
|
2. Count the total number of measurements that were
|
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
|
set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 2.11.1.2}{Step 2.11.1.1} \times 100`$ (Assume `0%` if an
|
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
|
measurements)
|
|
|
|
2. Calculate the weighted average of the percentage values (obtained in
|
|
|
|
Step 2.11.1.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
|
scaling factor
|
|
|
|
12. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
|
3. Merge the graphs created for each consensus
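The age calculation and the age binning in the steps above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the bins are 365 days wide and start at the youngest relay's age, which is one reading of the `(max(exit_age) - min(exit_age)) / 365` expression.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 86400


def parse_ts(text):
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp, the format used by both
    the consensus `valid-after` line and Onionoo's `first_seen` field."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def exit_age_days(valid_after, first_seen):
    """exit_age = ceil_days(valid-after - first_seen)."""
    seconds = (valid_after - first_seen).total_seconds()
    return math.ceil(seconds / SECONDS_PER_DAY)


def age_bin(exit_age, min_age, bin_width_days=365):
    """Index of the age bin (0 = youngest) for a relay.

    Assumption: bins are `bin_width_days` wide and start at `min_age`.
    """
    return (exit_age - min_age) // bin_width_days
```

For example, a relay first seen 23 hours before `valid-after` gets an age of 1 day and lands in bin 0.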
|
|
|
|
|
|
|
|
### Related metrics
|
|
|
|
- [(8)](home#metrics-to-track) How often does Cloudflare's blocking mechanism
|
|
|
|
change/update itself?
|
|
|
|
- [(10)](home#metrics-to-track) How well does Cloudflare keep track of the new
|
|
|
|
or old Tor exit nodes?
|
|
|
|
- [(10.1)](home#metrics-to-track) How frequently does Cloudflare update its Tor exit
node list?
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## Weighted CAPTCHA rate by exit relay location
|
|
|
|
### Purpose
|
|
|
|
Understanding whether Cloudflare blocks requests from exit relays in certain
countries more often than requests from exit relays in other countries
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
1. Get consensuses from CollecTor
|
|
|
|
2. Repeat the following for each consensus:
|
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
|
consensus
|
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
6. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
7. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
|
fields
|
|
|
|
8. Obtain the "details document" from Onionoo and match the Onionoo data
|
|
|
|
with the relay entries from consensus using the relay fingerprints. The following query is
|
|
|
|
recommended for obtaining the "details document":
|
|
|
|
https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,country_name
|
|
|
|
9. Distribute the exit relay entries from the consensus into bins based on
|
|
|
|
their `country_name` value (obtained in Step 2.8)
|
|
|
|
10. Repeat the following for each bin:
|
|
|
|
1. Repeat the following for each exit relay in the bin:
|
|
|
|
1. Count the total number of measurements that were completed using
|
|
|
|
this exit relay
|
|
|
|
2. Count the total number of measurements that were completed using
|
|
|
|
this exit relay and have `is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 2.10.1.2}{Step 2.10.1.1} \times 100`$ (Assume `0%` if an
|
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
|
measurements)
|
|
|
|
2. Calculate the weighted average of the percentage values (obtained in
|
|
|
|
Step 2.10.1.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
|
scaling factor
|
|
|
|
11. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
|
3. Merge the graphs with the top 10 highest percentage values and discard the rest
|
|
|
|
(or keep if you want to have them as well)
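The join and the Cloudflare filter from the steps above might be implemented along these lines. This is a sketch under assumptions: both the measurements and the URL list are taken to be lists of dictionaries that share a `url` key, and `cdn_provider` is taken to be a string (or missing) on the URL list entries.

```python
def cloudflare_measurements(measurements, url_list):
    """Join measurements with URL metadata and keep Cloudflare-hosted ones."""
    # Index the URL metadata once; each URL maps to many measurements.
    meta = {u["url"]: u for u in url_list}
    joined = []
    for m in measurements:
        info = meta.get(m["url"])
        if info is None:
            continue  # no metadata for this URL, so the CDN is unknown
        provider = info.get("cdn_provider") or ""
        if "cloudflare" in provider.lower():
            joined.append({**m, "cdn_provider": provider})
    return joined
```

Measurements whose URL has no metadata are dropped here because their CDN cannot be determined; keeping them in a separate "unknown" group would be an equally reasonable choice.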
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## Code injection rate
|
|
|
|
### Purpose
|
|
|
|
Cloudflare sometimes injects third-party code into websites without letting the
users know. This graph aims to visualize the percentage of measurements that were
affected by third-party code injection over time.
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
|
with a granularity of 1 hour.
|
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed during the
|
|
|
|
chosen date range
|
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
|
fields
|
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
|
the following for each iteration:
|
|
|
|
1. Distribute the measurements that were completed within the
|
|
|
|
interval of this iteration into 2 bins based on `is_data_modified` field's
|
|
|
|
value. Skip the measurements that do not have `is_data_modified` field.
|
|
|
|
2. Repeat the following for each bin:
|
|
|
|
1. Count the total number of measurements in this bin
|
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
|
`is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 5.2.2}{Step 5.2.1} \times 100`$ (Leave this bin's value
|
|
|
|
empty if there are no corresponding measurements)
|
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
|
time of this interval in the X-axis
|
|
|
|
5. Merge the graphs created for each iteration
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
# Graphs about Tor Browser centric data
|
|
|
|
## Weighted CAPTCHA rate by Tor Browser version
|
|
|
|
### Purpose
|
|
|
|
Understanding the effect of using different Tor Browser versions on the
|
|
|
|
probability of seeing a CAPTCHA
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
1. Get consensuses from CollecTor
|
|
|
|
2. Repeat the following for each consensus:
|
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
|
using Tor Browser (`method` field is equal to `tor_browser`) and between
|
|
|
|
the `valid-after` & `fresh-until` timestamps of the consensus
|
|
|
|
5. Join the measurements and relay data using the relay fingerprints.
|
|
|
|
Typically each relay maps to multiple measurements.
|
|
|
|
6. Distribute the joined data into bins based on `browser_version`
|
|
|
|
field's value
|
|
|
|
7. Repeat the following for each bin:
|
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
|
to perform the measurement
|
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
|
completed using this exit relay
|
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
|
set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100`$ (Assume `0%` if an
|
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
|
measurements)
|
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
|
Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
|
scaling factor
|
|
|
|
8. Plot the weighted percentage values for each `browser_version` bin in the Y-axis and
|
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
|
3. Merge the graphs created for each consensus
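Steps 2.1 and 2.2 above (reading the `r`, `w`, and `s` lines) can be sketched with a small line-based parser. This sketch assumes the full (non-microdesc) consensus format, where the `r` line reads `r nickname identity digest date time IP ORPort DirPort`; the function name and the returned keys are hypothetical. A dedicated library such as Stem could be used instead.

```python
import base64


def parse_running_exits(consensus_text):
    """Return the relay entries that carry both the Exit and Running flags."""
    relays, current = [], None
    for line in consensus_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r":
            # The identity is an unpadded base64 encoding of the 20-byte
            # fingerprint; re-add the padding before decoding.
            fingerprint = base64.b64decode(parts[2] + "=").hex().upper()
            current = {"fingerprint": fingerprint, "ipv4": parts[6],
                       "bandwidth": 0, "flags": set()}
            relays.append(current)
        elif parts[0] == "directory-footer":
            current = None  # relay entries end where the footer starts
        elif current is not None and parts[0] == "s":
            current["flags"] = set(parts[1:])
        elif current is not None and parts[0] == "w":
            for item in parts[1:]:
                key, _, value = item.partition("=")
                if key == "Bandwidth":
                    current["bandwidth"] = int(value)
    return [r for r in relays if {"Exit", "Running"} <= r["flags"]]
```

The hexadecimal fingerprint produced here is the value that can be matched against measurement data in the later join step.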
|
|
|
|
|
|
|
|
### Related metrics
|
|
|
|
- [(3.2)](home#metrics-to-track) What about different versions of the
|
|
|
|
Tor Browser? Does Cloudflare behave differently to different versions of the
|
|
|
|
same browser?
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## Weighted CAPTCHA rate by Tor Browser security level
|
|
|
|
### Purpose
|
|
|
|
Understanding the effect of using Tor Browser at different security levels
|
|
|
|
(Standard, Safer, Safest) on the probability of seeing a CAPTCHA
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
1. Get consensuses from CollecTor
|
|
|
|
2. Repeat the following for each consensus:
|
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
|
using Tor Browser (`method` field is equal to `tor_browser`) and between
|
|
|
|
the `valid-after` & `fresh-until` timestamps of the consensus
|
|
|
|
5. Join the measurements and relay data using the relay fingerprints.
|
|
|
|
Typically each relay maps to multiple measurements.
|
|
|
|
6. Distribute the joined data into 3 bins based on `tbb_security_level`
|
|
|
|
field's value
|
|
|
|
7. Repeat the following for each bin:
|
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
|
to perform the measurement
|
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
|
completed using this exit relay
|
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
|
set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100`$ (Assume `0%` if an
|
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
|
measurements)
|
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
|
Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
|
scaling factor
|
|
|
|
8. Plot the weighted percentage values for each `tbb_security_level` bin in the Y-axis and
|
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
|
3. Merge the graphs created for each consensus
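The exit probability calculation in Step 2.3 can be sketched as below. The sketch applies the exit-position weights from the consensus footer (`Wee` for relays with only the Exit flag, `Wed` for relays with both Guard and Exit) and normalizes over all relays. It has not been checked line by line against the linked Onionoo code, so treat the handling of `BadExit` and the default scale of 10000 as assumptions; the function name and input keys are hypothetical.

```python
def exit_probabilities(relays, bandwidth_weights, scale=10000.0):
    """Return {fingerprint: exit probability} for one consensus.

    relays: dicts with 'fingerprint', 'bandwidth' and 'flags' (a set).
    bandwidth_weights: dict parsed from the `bandwidth-weights` line,
                       e.g. {"Wee": 10000, "Wed": 5000, ...}.
    """
    wee = bandwidth_weights["Wee"] / scale
    wed = bandwidth_weights["Wed"] / scale

    weights = {}
    for r in relays:
        flags = r["flags"]
        # Assumption: relays flagged BadExit are not counted as exits.
        is_exit = "Exit" in flags and "BadExit" not in flags
        if not is_exit:
            weight = 0.0
        elif "Guard" in flags:
            weight = r["bandwidth"] * wed
        else:
            weight = r["bandwidth"] * wee
        weights[r["fingerprint"]] = weight

    total = sum(weights.values())
    return {fp: (w / total if total > 0 else 0.0)
            for fp, w in weights.items()}
```

The returned probabilities sum to 1 across the consensus, so they can be used directly as the scaling factor for the weighted average.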
|
|
|
|
|
|
|
|
### Related metrics
|
|
|
|
- [(3.3)](home#metrics-to-track) What about the different security levels of Tor
|
|
|
|
Browser?
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
# Graphs about individual exit relays
|
|
|
|
## Overall CAPTCHA rate
|
|
|
|
### Purpose
|
|
|
|
Seeing the overall CAPTCHA rate for a specific exit relay
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
|
with a granularity of 1 hour.
|
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed using the
|
|
|
|
target exit relay and within the chosen date range
|
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
4. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
|
the following for each iteration:
|
|
|
|
1. Count the total number of measurements completed within this interval
|
|
|
|
2. Count the total number of measurements completed within this interval
|
|
|
|
that have `is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 4.2}{Step 4.1} \times 100`$ (Leave this interval's value
|
|
|
|
empty if there are no corresponding measurements)
|
|
|
|
5. Plot the percentage values for each iteration in the Y-axis and the beginning
|
|
|
|
time for each iteration in the X-axis
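The iteration over the date range described above can be sketched as a loop that produces one `(interval start, percentage)` pair per interval. The `timestamp` key is a hypothetical name for the completion time; the measurements are assumed to be already restricted to the target exit relay.

```python
from datetime import datetime, timedelta


def hourly_captcha_rates(measurements, range_start, range_end,
                         step=timedelta(hours=1)):
    """Return [(interval_start, percentage or None), ...] for plotting."""
    series = []
    t = range_start
    while t < range_end:
        in_interval = [m for m in measurements
                       if t <= m["timestamp"] < t + step]
        total = len(in_interval)
        found = sum(1 for m in in_interval if m["is_captcha_found"] == 1)
        # Leave the value empty (None) when the interval has no measurements.
        series.append((t, 100.0 * found / total if total else None))
        t += step
    return series
```

For the last 30 days at a granularity of 1 hour this produces 720 points; rescanning the list for every interval is fine for a sketch, but sorting the measurements once would scale better.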
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
## CAPTCHA rate by CDN provider
|
|
|
|
### Purpose
|
|
|
|
Understanding how different CDN providers such as Cloudflare, Akamai,
Amazon CloudFront, etc. treat requests coming from a specific exit relay
|
|
|
|
|
|
|
|
### Steps to produce
|
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
|
with a granularity of 1 hour.
|
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed using the
|
|
|
|
target exit relay and within the chosen date range
|
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
|
URL maps to multiple measurements.
|
|
|
|
4. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
|
the following for each iteration:
|
|
|
|
1. Distribute the measurements that were completed within the
|
|
|
|
interval of this iteration into bins based on `cdn_provider` field's value
|
|
|
|
2. Repeat the following for each bin:
|
|
|
|
1. Count the total number of measurements in this bin
|
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
|
`is_captcha_found` field set to `1`
|
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
|
$`\frac{Step 4.2.2}{Step 4.2.1} \times 100`$ (Leave this bin's value
|
|
|
|
empty if there are no corresponding measurements)
|
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
|
time of this interval in the X-axis
|
|
|
|
5. Merge the graphs created for each iteration