... | ... | @@ -51,225 +51,7 @@ The following graph style will be used for all graphs unless otherwise specified |
|
|
* Sample Graph (Number of data points is reduced for simplicity)
|
|
|
![graph-style](uploads/e62c2716de6cd64e3a6bf949d1bd0726/graph-style.png)
|
|
|
|
|
|
# Graphs for understanding CAPTCHA rates related to user decisions
|
|
|
## Weighted CAPTCHA rate by method
|
|
|
### Purpose
|
|
|
Understanding the effect of using different methods (for example using
|
|
|
web browsers like Tor Browser, Firefox over Tor, Brave's Tor Tabs, etc.) on the
|
|
|
probability of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Join the measurements and relay data using the relay fingerprints.
|
|
|
Typically each relay maps to multiple measurements.
|
|
|
6. Distribute the joined data into bins based on the value of the `method` field
|
|
|
7. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
8. Plot the weighted percentage values for each `method` bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
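Steps 2.6 through 2.7.3 above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the row keys `method`, `fingerprint`, and `is_captcha_found`, the function name, and the `exit_probability` mapping are hypothetical names, and the weighted average is normalized over the relays that have measurements in the bin, which is only one possible reading of "scaling factor".

```python
from collections import defaultdict


def weighted_captcha_rate_by_method(joined_rows, exit_probability):
    """Return {method: weighted CAPTCHA percentage} for one consensus.

    joined_rows: iterable of dicts with the assumed keys "method",
        "fingerprint" and "is_captcha_found" (1 or 0).
    exit_probability: dict mapping a relay fingerprint to its exit
        probability (Step 2.3).
    """
    # counts[method][fingerprint] = [total, with_captcha]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for row in joined_rows:
        cell = counts[row["method"]][row["fingerprint"]]
        cell[0] += 1
        cell[1] += 1 if row["is_captcha_found"] == 1 else 0

    rates = {}
    for method, per_relay in counts.items():
        weighted_sum = 0.0
        weight_total = 0.0
        for fingerprint, (total, with_captcha) in per_relay.items():
            percentage = with_captcha / total * 100  # Step 2.7.2.3
            weight = exit_probability.get(fingerprint, 0.0)
            weighted_sum += percentage * weight
            weight_total += weight
        # Leave the value empty when no relay in the bin carries any weight.
        rates[method] = weighted_sum / weight_total if weight_total else None
    return rates
```

Calling the function once per consensus yields one Y-axis value per `method` bin, to be plotted against that consensus's `valid-after` timestamp.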
|
|
|
|
|
|
### Related questions
|
|
|
- [(2)](home#metrics-to-track) How do the HTTP request headers affect
|
|
|
Cloudflare's decision-making mechanism? [ticket:33010#comment:4]
|
|
|
- [(2.1)](home#metrics-to-track) Is there a difference between using the
|
|
|
actual Tor Browser itself and tor-browser-selenium in terms of the HTTP headers?
|
|
|
- [(2.2)](home#metrics-to-track) How does Cloudflare react differently if the
|
|
|
browser doesn't support alt-svc headers? [ticket:32915]
|
|
|
- [(3)](home#metrics-to-track) How do different browsers with different
|
|
|
User Agents get affected? [ticket:33010#comment:2], [ticket:32924], [ticket:31404]
|
|
|
- [(3.1)](home#metrics-to-track) Is there a difference between using a web
|
|
|
browser or fetching web pages via cURL or other HTTP libraries?
|
|
|
- [(7)](home#metrics-to-track) How does the time of day affect
|
|
|
Cloudflare's blocking mechanism? Does the day of the week or the time of
|
|
|
day matter? [ticket:33010#comment:15]
|
|
|
- [(15)](home#metrics-to-track) If browsers that should not face CAPTCHA face
|
|
|
CAPTCHA, why does this happen?
|
|
|
- [(16)](home#metrics-to-track) How do the observed patterns in the results
|
|
|
change over time? [ticket:33010]
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Weighted CAPTCHA rate by connection security
|
|
|
### Purpose
|
|
|
Understanding the effect of using TLS and not using TLS on the probability
|
|
|
of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into 2 bins based on whether the
|
|
|
`is_https` field of each entry is `1` or `0`
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
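The exit probability calculation in Step 2.3 above can be sketched as follows. This is a simplified illustration loosely modeled on the linked Onionoo code, not a copy of it. It assumes that only relays with the `Exit` flag and without the `BadExit` flag carry exit weight, that `Wed` applies to relays that are also guards and `Wee` to the rest, and that each relay is a dict with the hypothetical keys `fingerprint`, `bandwidth`, and `flags`.

```python
def exit_probabilities(relays, bandwidth_weights):
    """Return {fingerprint: exit probability} for one consensus.

    relays: list of dicts with the assumed keys "fingerprint",
        "bandwidth" (from the `w` line) and "flags" (from the `s` line).
    bandwidth_weights: dict parsed from the `bandwidth-weights` footer
        line, e.g. {"Wed": 10000, "Wee": 10000, ...}.
    """
    # Consensus bandwidth weights are integers scaled by 10000.
    wed = bandwidth_weights["Wed"] / 10000.0
    wee = bandwidth_weights["Wee"] / 10000.0

    exit_weights = {}
    for relay in relays:
        flags = set(relay["flags"])
        if "Exit" not in flags or "BadExit" in flags:
            weight = 0.0  # never selected for the exit position
        elif "Guard" in flags:
            weight = relay["bandwidth"] * wed
        else:
            weight = relay["bandwidth"] * wee
        exit_weights[relay["fingerprint"]] = weight

    total = sum(exit_weights.values())
    if total == 0:
        return {fp: 0.0 for fp in exit_weights}
    # Normalize so that the probabilities sum to 1.
    return {fp: weight / total for fp, weight in exit_weights.items()}
```

The resulting mapping is what the later weighted-average step uses as its scaling factor.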
|
|
|
|
|
|
### Related questions
|
|
|
- [(14)](home#metrics-to-track) Is there a difference if the origin server has
|
|
|
an SSL certificate or not?
|
|
|
- [(14.1)](home#metrics-to-track) Does the blocking change if the SSL
|
|
|
certificate is issued by Cloudflare or by another entity?
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Weighted CAPTCHA rate by HTTP request quantity
|
|
|
### Purpose
|
|
|
Understanding the effect of connecting to websites that require single or
|
|
|
multiple HTTP requests to load on the probability of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into 2 bins based on whether the
|
|
|
`requires_multiple_reqs` field of each entry is `1` or `0`
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
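The parsing in Step 2.2 above (the `r`, `w`, and `s` lines of each relay entry) can be sketched with the standard library alone. This is a minimal sketch that assumes the regular network status consensus format, where the `r` line carries the base64 identity as its third token and the IPv4 address as its seventh; the function name and the output keys are hypothetical, and a dedicated descriptor parser would be more robust in practice.

```python
import base64


def parse_exit_relays(consensus_text):
    """Return a list of dicts for relays with both Exit and Running flags."""
    relays = []
    current = None
    for line in consensus_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "r":
            # r <nickname> <identity> <digest> <date> <time> <IP> <ORPort> <DirPort>
            # The identity is base64 without padding; restore the padding.
            padded = parts[2] + "=" * (-len(parts[2]) % 4)
            current = {
                "fingerprint": base64.b64decode(padded).hex().upper(),
                "address": parts[6],
                "bandwidth": 0,
                "flags": [],
            }
            relays.append(current)
        elif parts[0] == "s" and current is not None:
            current["flags"] = parts[1:]
        elif parts[0] == "w" and current is not None:
            # w Bandwidth=<n> [Unmeasured=1]
            for item in parts[1:]:
                key, _, value = item.partition("=")
                if key == "Bandwidth":
                    current["bandwidth"] = int(value)
    return [r for r in relays if {"Exit", "Running"} <= set(r["flags"])]
```

The hex fingerprint produced here is the key used later to join relay data with the measurements.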
|
|
|
|
|
|
### Related questions
|
|
|
- [(13)](home#metrics-to-track) Is there a difference between websites that load
|
|
|
resources from third-party servers and websites that contain all resources on
|
|
|
the origin server? [ticket:33010#comment:6]
|
|
|
- [(13.1)](home#metrics-to-track) How do users of websites get affected if
|
|
|
the main website is not fronted by Cloudflare, but some of the resources are
|
|
|
fetched from a Cloudflare fronted web server? [ticket:33010#comment:6], [ticket:15450]
|
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Weighted CAPTCHA rate by CDN provider
|
|
|
### Purpose
|
|
|
Understanding the effect of connecting to websites that use CDN providers such
|
|
|
as Cloudflare, Akamai, Amazon Cloudfront, etc. on the probability of seeing a
|
|
|
CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into bins based on the value of the `cdn_provider` field
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
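The join and binning in Steps 2.5 through 2.7 above can be sketched as follows. The sketch is illustrative only: the measurement keys `url` and `exit_fingerprint`, the URL-list keys `url` and `cdn_provider`, and the function name are assumptions about the CAPTCHA Monitor API output, not confirmed field names.

```python
from collections import defaultdict


def bin_by_url_field(measurements, url_list, relays, field="cdn_provider"):
    """Join measurements with URL metadata and relay data, then bin them.

    Returns {field value: [joined rows]}.
    """
    url_meta = {entry["url"]: entry for entry in url_list}
    relay_by_fp = {relay["fingerprint"]: relay for relay in relays}

    bins = defaultdict(list)
    for measurement in measurements:
        meta = url_meta.get(measurement["url"])
        relay = relay_by_fp.get(measurement["exit_fingerprint"])
        if meta is None or relay is None:
            # Inner join: drop rows that lack URL metadata or relay data.
            continue
        joined = {**measurement, **meta, "relay": relay}
        bins[joined[field]].append(joined)
    return dict(bins)
```

Passing a different `field` (for example `is_https` or `requires_multiple_reqs`) gives the two-bin variants used by the other graphs on this page.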
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
# Graphs for understanding the overall network status
|
|
|
# Graphs for understanding the overall network status (by CDN)
|
|
|
## Probability of a Tor client receiving CAPTCHA
|
|
|
### Purpose
|
|
|
Understanding the probability of a Tor client choosing an exit relay in the normal
|
... | ... | @@ -534,41 +316,29 @@ Cloudflare's blocking practices? |
|
|
blocked?
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
# Graphs for understanding the Cloudflare firewall
|
|
|
## CAPTCHA rate by Cloudflare security level/firewall settings
|
|
|
## CAPTCHA rate by traffic origin (Tor traffic vs Non-Tor traffic)
|
|
|
### Purpose
|
|
|
Understanding the effect of different Cloudflare security levels and firewall
|
|
|
configurations on the probability of seeing a CAPTCHA.
|
|
|
|
|
|
We have a few different domains to test different configurations. Here they are:
|
|
|
- captcha.wtf
|
|
|
- IPv4 only domain, no additional Cloudflare firewall rules
|
|
|
- yearlight.buzz
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "JS Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
- bottomlesspit.xyz
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
- broccolipizza.monster
|
|
|
- IPv4 only domain, Cloudflare firewall is set to block all traffic
|
|
|
originating from the Tor network
|
|
|
- exit11.online
|
|
|
- IPv6 only domain, no additional Cloudflare firewall rules
|
|
|
- icanhazcaptcha.xyz
|
|
|
- IPv6 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
Understanding how Cloudflare treats Tor traffic vs. non-Tor traffic (this one
|
|
|
is stating the obvious but still good to have data to back up the obvious)
|
|
|
|
|
|
### Steps to produce
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
with a granularity of 1 hour.
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were *completed
|
|
|
using only domains specified above* and during the chosen date range and
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed during the
|
|
|
chosen date range
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
the following for each iteration:
|
|
|
1. Distribute the measurements that were completed within the interval of
|
|
|
this iteration into bins based on `url` field's value
|
|
|
this iteration into 2 bins based on `method` field's value. Put the methods
|
|
|
without "tor" (e.g. "firefox") into the `Non-Tor Traffic` bin and the rest
|
|
|
(e.g. "firefox_over_tor") into the `Tor Traffic` bin.
|
|
|
2. Repeat the following for each bin:
|
|
|
1. Count the total number of measurements in this bin
|
|
|
2. Count the total number of measurements in this bin that have
|
... | ... | @@ -580,55 +350,68 @@ the following for each iteration: |
|
|
time of this interval in the X-axis
|
|
|
6. Merge the graphs created for each iteration
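The interval loop for the Tor vs. non-Tor comparison above can be sketched as follows. It is a rough sketch that assumes the measurements were already joined and filtered to Cloudflare URLs, and that each measurement is a dict with the hypothetical keys `timestamp` (a `datetime`), `method`, and `is_captcha_found`.

```python
from collections import defaultdict
from datetime import datetime, timedelta

ORIGINS = ("Tor Traffic", "Non-Tor Traffic")


def captcha_rate_by_traffic_origin(measurements, start: datetime, end: datetime,
                                   interval=timedelta(hours=1)):
    """Return [(interval_start, {origin: percentage or None}), ...]."""
    # counts[(interval_index, origin)] = [total, with_captcha]
    counts = defaultdict(lambda: [0, 0])
    for m in measurements:
        ts = m["timestamp"]
        if not start <= ts < end:
            continue
        # Methods without "tor" in their name count as non-Tor traffic.
        origin = "Tor Traffic" if "tor" in m["method"] else "Non-Tor Traffic"
        cell = counts[((ts - start) // interval, origin)]
        cell[0] += 1
        cell[1] += 1 if m["is_captcha_found"] == 1 else 0

    series = []
    index, current = 0, start
    while current < end:
        point = {}
        for origin in ORIGINS:
            total, with_captcha = counts.get((index, origin), (0, 0))
            # Leave the value empty when the bin has no measurements.
            point[origin] = with_captcha / total * 100 if total else None
        series.append((current, point))
        index, current = index + 1, current + interval
    return series
```

Each tuple in the returned list corresponds to one X-axis position (the beginning of the interval) with one Y value per bin.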
|
|
|
|
|
|
### Related questions
|
|
|
<!-- - [(3.4)](home#metrics-to-track) How does Cloudflare react to browsers with
|
|
|
and without JavaScript enabled? [ticket:31404] -->
|
|
|
- [(6)](home#metrics-to-track) How do different security levels of Cloudflare
|
|
|
affect the blocking mechanism? [ticket:33010#comment:5]
|
|
|
- [(6.1)](home#metrics-to-track) Do some of the Cloudflare security levels
|
|
|
block users immediately without presenting a CAPTCHA challenge at all?
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
|
## CAPTCHA rate by traffic origin
|
|
|
# Graphs for understanding CAPTCHA rates related to website decisions
|
|
|
## Weighted CAPTCHA rate by connection security
|
|
|
### Purpose
|
|
|
Understanding how Cloudflare treats Tor traffic vs. non-Tor traffic (this one
|
|
|
is stating the obvious but still good to have data to back up the obvious)
|
|
|
Understanding the effect of using TLS and not using TLS on the probability
|
|
|
of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
with a granularity of 1 hour.
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed during the
|
|
|
chosen date range
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
the following for each iteration:
|
|
|
1. Distribute the measurements that were completed within the interval of
|
|
|
this iteration into 2 bins based on `method` field's value. Put the methods
|
|
|
without "tor" (e.g. "firefox") into the `Non-Tor Traffic` bin and the rest
|
|
|
(e.g. "firefox_over_tor") into the `Tor Traffic` bin.
|
|
|
2. Repeat the following for each bin:
|
|
|
1. Count the total number of measurements in this bin
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
`is_captcha_found` field set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 5.2.2}{Step 5.2.1} \times 100`$ (Leave this bin's value
|
|
|
empty if there are no corresponding measurements)
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
time of this interval in the X-axis
|
|
|
6. Merge the graphs created for each iteration
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into 2 bins based on whether the
|
|
|
`is_https` field of each entry is `1` or `0`
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
|
|
|
|
|
|
### Related questions
|
|
|
- [(14)](home#metrics-to-track) Is there a difference if the origin server has
|
|
|
an SSL certificate or not?
|
|
|
- [(14.1)](home#metrics-to-track) Does the blocking change if the SSL
|
|
|
certificate is issued by Cloudflare or by another entity?
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Weighted CAPTCHA rate by exit relay age
|
|
|
## Weighted CAPTCHA rate by HTTP request quantity
|
|
|
### Purpose
|
|
|
Understanding how quickly Cloudflare blocks the newer relays and if there is a
|
|
|
different treatment for older relays
|
|
|
Understanding the effect of connecting to websites that require single or
|
|
|
multiple HTTP requests to load on the probability of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
... | ... | @@ -647,52 +430,44 @@ different treatment for older relays |
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
7. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
8. Obtain the "details document" from Onionoo and match the Onionoo data
|
|
|
with the relay entries from consensus using the relay fingerprints. The following query is
|
|
|
recommended for obtaining the "details document":
|
|
|
https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,first_seen
|
|
|
9. Calculate the age of the exit relays in days using the `first_seen` field
|
|
|
of the "details document" and `valid-after` timestamp of the consensus
|
|
|
(`exit_age` = ceil_days(`valid-after` - `first_seen`))
|
|
|
10. Distribute the exit relay entries from the consensus into
|
|
|
`(max(exit_age) - min(exit_age)) / 365` bins based on their ages
|
|
|
(calculated in Step 2.9)
|
|
|
11. Repeat the following for each bin:
|
|
|
1. Repeat the following for each exit relay in the bin:
|
|
|
1. Count the total number of measurements that were
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into 2 bins based on whether the
|
|
|
`requires_multiple_reqs` field of each entry is `1` or `0`
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements that were
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.8.1.2}{Step 2.8.1.1} \times 100`$ (Assume `0%` if an
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
measurements)
|
|
|
2. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.1.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
7. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
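The exit relay age calculation from the steps above (`exit_age` = ceil_days(`valid-after` - `first_seen`)) and the one-year binning can be sketched as follows. The sketch assumes both timestamps are UTC strings in the `YYYY-MM-DD hh:mm:ss` form; the function names and the choice to count bins from the youngest relay are hypothetical.

```python
import math
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def exit_age_days(valid_after, first_seen):
    """Age of a relay in whole days, rounded up (ceil_days)."""
    delta = (datetime.strptime(valid_after, TIMESTAMP_FORMAT)
             - datetime.strptime(first_seen, TIMESTAMP_FORMAT))
    return math.ceil(delta.total_seconds() / 86400)


def age_bin(exit_age, min_age, bin_width_days=365):
    """Index of the bin for a relay, counted from the youngest relay's age."""
    return (exit_age - min_age) // bin_width_days
```

With this reading, relays whose ages differ by less than 365 days from the youngest relay share bin 0, the next year of ages falls into bin 1, and so on.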
|
|
|
|
|
|
### Related questions
|
|
|
- [(8)](home#metrics-to-track) How often does Cloudflare's blocking mechanism
|
|
|
change/update itself?
|
|
|
- [(10)](home#metrics-to-track) How well does Cloudflare keep track of the new
|
|
|
or old Tor exit nodes?
|
|
|
- [(10.1)](home#metrics-to-track) How frequently does Cloudflare update its Tor exit
|
|
|
node list?
|
|
|
- [(13)](home#metrics-to-track) Is there a difference between websites that load
|
|
|
resources from third-party servers and websites that contain all resources on
|
|
|
the origin server? [ticket:33010#comment:6]
|
|
|
- [(13.1)](home#metrics-to-track) How do users of websites get affected if
|
|
|
the main website is not fronted by Cloudflare, but some of the resources are
|
|
|
fetched from a Cloudflare fronted web server? [ticket:33010#comment:6], [ticket:15450]
|
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Weighted CAPTCHA rate by exit relay location
|
|
|
## Weighted CAPTCHA rate by CDN provider
|
|
|
### Purpose
|
|
|
Understanding whether Cloudflare blocks requests from exit relays in
|
|
|
certain countries more often
|
|
|
Understanding the effect of connecting to websites that use CDN providers such
|
|
|
as Cloudflare, Akamai, Amazon Cloudfront, etc. on the probability of seeing a
|
|
|
CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
1. Get consensuses from CollecTor
|
... | ... | @@ -711,73 +486,96 @@ certain countries |
|
|
consensus
|
|
|
5. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
6. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
7. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
8. Obtain the "details document" from Onionoo and match the Onionoo data
|
|
|
with the relay entries from consensus using the relay fingerprints. The following query is
|
|
|
recommended for obtaining the "details document":
|
|
|
https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,country_name
|
|
|
9. Distribute the exit relay entries from the consensus into bins based on
|
|
|
their `country_name` value (obtained in Step 2.8)
|
|
|
10. Repeat the following for each bin:
|
|
|
1. Repeat the following for each exit relay in the bin:
|
|
|
1. Count the total number of measurements that were completed using
|
|
|
this exit relay
|
|
|
2. Count the total number of measurements that were completed using
|
|
|
this exit relay and have `is_captcha_found` field set to `1`
|
|
|
6. Join the measurements, URL list, and relay data using the relay
|
|
|
fingerprints and URLs. Typically each relay and URL map to multiple measurements.
|
|
|
7. Distribute the joined data into bins based on the value of the `cdn_provider` field
|
|
|
8. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{Step 2.10.1.2}{Step 2.10.1.1} \times 100`$ (Assume `0%` if an
|
|
|
exit relay exists in the consensus but there are no corresponding
|
|
|
measurements)
|
|
|
2. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.10.1.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
$`\frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
7. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
9. Plot the weighted percentage values for each bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs with top 10 highest percentage values and discard the rest
|
|
|
(or keep if you want to have them as well)
|
|
|
3. Merge the graphs created for each consensus
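For the per-country graph, the "top 10 highest percentage values" selection mentioned in the steps above can be sketched as follows. The sketch assumes that each country's series is a list of `(timestamp, percentage)` pairs and that countries are ranked by their peak percentage; both the data shape and the ranking rule are assumptions, since the steps do not specify them.

```python
def top_series(series_by_country, limit=10):
    """Keep the `limit` countries with the highest peak percentage."""

    def peak(points):
        values = [pct for _, pct in points if pct is not None]
        # Countries without any value sort last.
        return max(values) if values else float("-inf")

    ranked = sorted(series_by_country,
                    key=lambda country: peak(series_by_country[country]),
                    reverse=True)
    return {country: series_by_country[country] for country in ranked[:limit]}
```

Ranking by the mean or by the most recent value would be equally valid readings; only the `peak` helper would change.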
|
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Code injection rate
|
|
|
|
|
|
# Graphs for understanding CAPTCHA rates related to user decisions
|
|
|
## Weighted CAPTCHA rate by method
|
|
|
### Purpose
|
|
|
Cloudflare sometimes injects third-party code into websites without letting the
|
|
|
users know. This graph aims to visualize the percentage of measurements that were
|
|
|
affected by third-party code injection over time.
|
|
|
Understanding the effect of using different methods (for example using
|
|
|
web browsers like Tor Browser, Firefox over Tor, Brave's Tor Tabs, etc.) on the
|
|
|
probability of seeing a CAPTCHA
|
|
|
|
|
|
### Steps to produce
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
with a granularity of 1 hour.
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were completed during the
|
|
|
chosen date range
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
the following for each iteration:
|
|
|
1. Distribute the measurements that were completed within the
|
|
|
interval of this iteration into 2 bins based on `is_data_modified` field's
|
|
|
value. Skip the measurements that do not have the `is_data_modified` field.
|
|
|
2. Repeat the following for each bin:
|
|
|
1. Count the total number of measurements in this bin
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
`is_captcha_found` field set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{\text{Step 5.2.2}}{\text{Step 5.2.1}} \times 100`$ (Leave this bin's value
|
|
|
empty if there are no corresponding measurements)
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
time of this interval in the X-axis
|
|
|
6. Merge the graphs created for each iteration
|
|
|
1. Get consensuses from CollecTor
|
|
|
2. Repeat the following for each consensus:
|
|
|
1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
|
|
|
consensus header and `bandwidth-weights` values from the footer
|
|
|
2. Repeat the following for each *running exit relay* entry within the consensus:
|
|
|
1. Parse the `r` line and memorize the IPv4 address and identity
|
|
|
2. Parse the `w` line and memorize the bandwidth
|
|
|
3. Parse the `s` line and memorize the relay flags
|
|
|
3. Calculate the weighted exit probabilities using the `bandwidth-weights`
|
|
|
from the consensus, `bandwidth` values, and `flags` for each exit relay
|
|
|
(see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
|
|
|
4. Use CAPTCHA Monitor API to get measurements that were completed
|
|
|
using Tor and between the `valid-after` & `fresh-until` timestamps of the
|
|
|
consensus
|
|
|
5. Join the measurements and relay data using the relay fingerprints.
|
|
|
Typically each relay maps to multiple measurements.
|
|
|
6. Distribute the joined data into bins based on `method` field's value
|
|
|
7. Repeat the following for each bin:
|
|
|
1. Further bin the measurements into sub-bins based on the exit relay used
|
|
|
to perform the measurement
|
|
|
2. Repeat the following for each exit relay in each sub-bin:
|
|
|
1. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay
|
|
|
2. Count the total number of measurements in this sub-bin that were
|
|
|
completed using this exit relay and have `is_captcha_found` field
|
|
|
set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{\text{Step 2.7.2.2}}{\text{Step 2.7.2.1}} \times 100`$
|
|
|
3. Calculate the weighted average of the percentage values (obtained in
|
|
|
Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
|
|
|
scaling factor
|
|
|
8. Plot the weighted percentage values for each `method` bin in the Y-axis and
|
|
|
the `valid-after` timestamp of the consensus in the X-axis
|
|
|
3. Merge the graphs created for each consensus
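
The per-consensus part of the steps above (Steps 2.5 to 2.7) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name and the dictionary keys `method`, `fingerprint`, and `is_captcha_found` are assumptions about how the joined data might look, and the weighted average is normalized by the total exit probability of the relays that actually have measurements, which is one possible reading of "scaling factor".

```python
from collections import defaultdict


def weighted_captcha_rate_by_method(measurements, exit_probabilities):
    """Return {method: weighted CAPTCHA percentage} for a single consensus.

    measurements: iterable of dicts with the (assumed) keys 'method',
        'fingerprint' and 'is_captcha_found' (0 or 1).
    exit_probabilities: {fingerprint: exit probability}, as in Step 2.3.
    """
    # counts[method][fingerprint] = [total, with_captcha]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        if m["fingerprint"] not in exit_probabilities:
            continue  # relay is not in this consensus, nothing to join on
        cell = counts[m["method"]][m["fingerprint"]]
        cell[0] += 1
        cell[1] += int(m["is_captcha_found"])

    rates = {}
    for method, relays in counts.items():
        weighted_sum = 0.0
        weight_total = 0.0
        for fingerprint, (total, with_captcha) in relays.items():
            percentage = with_captcha / total * 100  # per-relay CAPTCHA rate
            weight = exit_probabilities[fingerprint]
            weighted_sum += weight * percentage
            weight_total += weight
        # Assumption: normalize by the weights of the measured relays only.
        rates[method] = weighted_sum / weight_total if weight_total else None
    return rates
```

Calling this once per consensus and pairing each result with the consensus's `valid-after` timestamp gives the points to plot for every `method` bin.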
|
|
|
|
|
|
### Related questions
|
|
|
- [(2)](home#metrics-to-track) How do the HTTP request headers affect
|
|
|
Cloudflare's decision-making mechanism? [ticket:33010#comment:4]
|
|
|
- [(2.1)](home#metrics-to-track) Is there a difference between using the
|
|
|
actual Tor Browser itself and tor-browser-selenium in terms of the HTTP headers?
|
|
|
- [(2.2)](home#metrics-to-track) How does Cloudflare react differently if the
|
|
|
browser doesn't support alt-svc headers? [ticket:32915]
|
|
|
- [(3)](home#metrics-to-track) How do different browsers with different
|
|
|
User Agents get affected? [ticket:33010#comment:2], [ticket:32924], [ticket:31404]
|
|
|
- [(3.1)](home#metrics-to-track) Is there a difference between using a web
|
|
|
browser or fetching web pages via cURL or other HTTP libraries?
|
|
|
- [(7)](home#metrics-to-track) How does the time of the day affect
|
|
|
Cloudflare's blocking mechanism? Does the day of the week or the time
|
|
|
of the day matter? [ticket:33010#comment:15]
|
|
|
- [(15)](home#metrics-to-track) If browsers that should not face CAPTCHA face
|
|
|
CAPTCHA, why does this happen?
|
|
|
- [(16)](home#metrics-to-track) How do the observed patterns in the results
|
|
|
change over time? [ticket:33010]
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
# Graphs about Tor Browser centric data
|
|
|
## Weighted CAPTCHA rate by Tor Browser version
|
|
|
### Purpose
|
|
|
Understanding the effect of using different Tor Browser versions on the
|
... | ... | @@ -873,9 +671,103 @@ Understanding the effect of using Tor Browser at different security levels |
|
|
- [(3.3)](home#metrics-to-track) What about the different security levels of Tor
|
|
|
Browser?
|
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
|
# Graphs for understanding the Cloudflare firewall
|
|
|
## CAPTCHA rate by Cloudflare security level/firewall settings
|
|
|
### Purpose
|
|
|
Understanding the effect of different Cloudflare security levels and firewall
|
|
|
configurations on the probability of seeing a CAPTCHA.
|
|
|
|
|
|
We have a few different domains to test different configurations. Here they are:
|
|
|
- captcha.wtf
|
|
|
- IPv4 only domain, no additional Cloudflare firewall rules
|
|
|
- yearlight.buzz
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "JS Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
- bottomlesspit.xyz
|
|
|
- IPv4 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
- broccolipizza.monster
|
|
|
- IPv4 only domain, Cloudflare firewall is set to block all traffic
|
|
|
originating from the Tor network
|
|
|
- exit11.online
|
|
|
- IPv6 only domain, no additional Cloudflare firewall rules
|
|
|
- icanhazcaptcha.xyz
|
|
|
- IPv6 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for
|
|
|
traffic originating from the Tor network
|
|
|
|
|
|
### Steps to produce
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
with a granularity of 1 hour.
|
|
|
1. Use CAPTCHA Monitor API to get measurements that were *completed
|
|
|
using only domains specified above* and during the chosen date range
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
the following for each iteration:
|
|
|
1. Distribute the measurements that were completed within the interval of
|
|
|
this iteration into bins based on `url` field's value
|
|
|
2. Repeat the following for each bin:
|
|
|
1. Count the total number of measurements in this bin
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
`is_captcha_found` field set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{\text{Step 5.2.2}}{\text{Step 5.2.1}} \times 100`$ (Leave this bin's value
|
|
|
empty if there are no corresponding measurements)
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
time of this interval in the X-axis
|
|
|
6. Merge the graphs created for each iteration
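
The iteration over time intervals described above can be sketched like this. It is an illustrative sketch under stated assumptions: the function name and the `timestamp` key (a `datetime` marking when the measurement completed) are hypothetical, while `url` and `is_captcha_found` follow the field names used in the steps.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def captcha_rate_by_url(measurements, start, end, step=timedelta(hours=1)):
    """Return {interval_start: {url: CAPTCHA percentage}}.

    measurements: iterable of dicts with the keys 'url', 'is_captcha_found'
        (0 or 1) and the assumed key 'timestamp' (a datetime).
    Bins without measurements are simply absent, so they stay empty on
    the plot.
    """
    # counts[interval_start][url] = [total, with_captcha]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        ts = m["timestamp"]
        if not (start <= ts < end):
            continue  # outside the chosen date range
        # Snap the timestamp to the beginning of its interval.
        interval_start = start + ((ts - start) // step) * step
        cell = counts[interval_start][m["url"]]
        cell[0] += 1
        cell[1] += int(m["is_captcha_found"])

    return {
        interval: {
            url: with_captcha / total * 100
            for url, (total, with_captcha) in urls.items()
        }
        for interval, urls in counts.items()
    }
```

Each outer key is the beginning time of an interval (the X-axis value) and each inner value is the percentage to plot for that domain.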
|
|
|
|
|
|
### Related questions
|
|
|
<!-- - [(3.4)](home#metrics-to-track) How does Cloudflare react to browsers with
|
|
|
and without JavaScript enabled? [ticket:31404] -->
|
|
|
- [(6)](home#metrics-to-track) How do different security levels of Cloudflare
|
|
|
affect the blocking mechanism? [ticket:33010#comment:5]
|
|
|
- [(6.1)](home#metrics-to-track) Do some of the Cloudflare security levels
|
|
|
block users immediately without presenting a CAPTCHA challenge at all?
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
## Code injection rate
|
|
|
### Purpose
|
|
|
Cloudflare sometimes injects third-party code into websites without letting the
|
|
|
users know. This graph aims to visualize the percentage of measurements that were
|
|
|
affected by third-party code injection over time.
|
|
|
|
|
|
### Steps to produce
|
|
|
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
|
|
|
with a granularity of 1 hour.
|
|
|
1. Use the CAPTCHA Monitor API to get measurements that were completed within the
|
|
|
chosen date range
|
|
|
2. Use CAPTCHA Monitor API to get the list of URLs that are used in the
|
|
|
experiments. This list contains the metadata about the URLs.
|
|
|
3. Join the measurements and URL list using the `URL` fields. Typically each
|
|
|
URL maps to multiple measurements.
|
|
|
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
|
|
|
field
|
|
|
5. Iterate over the chosen date range with the chosen time intervals. Repeat
|
|
|
the following for each iteration:
|
|
|
1. Distribute the measurements that were completed within the
|
|
|
interval of this iteration into 2 bins based on `is_data_modified` field's
|
|
|
value. Skip the measurements that do not have the `is_data_modified` field.
|
|
|
2. Repeat the following for each bin:
|
|
|
1. Count the total number of measurements in this bin
|
|
|
2. Count the total number of measurements in this bin that have
|
|
|
`is_captcha_found` field set to `1`
|
|
|
3. Calculate the percentage of measurements that received CAPTCHA using
|
|
|
$`\frac{\text{Step 5.2.2}}{\text{Step 5.2.1}} \times 100`$ (Leave this bin's value
|
|
|
empty if there are no corresponding measurements)
|
|
|
3. Plot the percentage values for each bin in the Y-axis and the beginning
|
|
|
time of this interval in the X-axis
|
|
|
6. Merge the graphs created for each iteration
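
Steps 3, 4, and 5.1 above (joining, filtering for Cloudflare, and binning) can be sketched as follows. This is only a sketch: the function name is hypothetical, the join key is assumed to be a lowercase `url` key present in both inputs, and `is_data_modified` is assumed to hold `0` or `1` when present.

```python
def cloudflare_measurements_by_modification(measurements, url_list):
    """Join measurements with URL metadata and bin them by is_data_modified.

    measurements: iterable of dicts with the (assumed) key 'url' and an
        optional 'is_data_modified' key holding 0 or 1.
    url_list: iterable of dicts with the (assumed) keys 'url' and
        'cdn_provider'.
    Returns {False: [...], True: [...]}.
    """
    # Index the URL metadata so each measurement can be joined by URL.
    metadata = {entry["url"]: entry for entry in url_list}

    bins = {False: [], True: []}
    for m in measurements:
        entry = metadata.get(m["url"])
        if entry is None:
            continue  # no metadata to join with
        if "cloudflare" not in (entry.get("cdn_provider") or ""):
            continue  # Step 4: keep only Cloudflare-fronted URLs
        modified = m.get("is_data_modified")
        if modified is None:
            continue  # Step 5.1: skip measurements without the field
        bins[bool(modified)].append(m)
    return bins
```

Calling this on the measurements of one interval yields the two bins that Step 5.2 then counts.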
|
|
|
|
|
|
|
|
|
<!-- ####################################################################### -->
|
|
|
<!-- ####################################################################### -->
|
|
|
|
|
|
|
|
|
# Graphs about individual exit relays
|
|
|
## Overall CAPTCHA rate
|
|
|
### Purpose
|
... | ... | |