# Graphs for understanding CAPTCHA rates related to user decisions
# Graphs for understanding the overall network status (by CDN)
## Weighted CAPTCHA rate by method
### Purpose
Understanding the effect of using different methods (for example using
web browsers like Tor Browser, Firefox over Tor, Brave's Tor Tabs, etc.) on the
probability of seeing a CAPTCHA
### Steps to produce
1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Join the measurements and relay data using the relay fingerprints.
      Typically each relay maps to multiple measurements.
   6. Distribute the joined data into bins based on the value of the `method` field
   7. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.7.2.2}}{\text{Step 2.7.2.1}} \times 100`$
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the
         scaling factor
   8. Plot the weighted percentage values for each `method` bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
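Steps 2.1 to 2.3 can be sketched in Python as follows. This is a minimal illustration, not the Onionoo implementation: it assumes a full (non-microdescriptor) consensus, and it simplifies the linked calculation to "bandwidth times `Wee` for relays with only the Exit flag, bandwidth times `Wed` for relays with both Guard and Exit, nothing for BadExit relays". The function names, the dictionary layout, and the sample text are made up for the example.

```python
import base64


def parse_consensus(text):
    """Collect timestamps, bandwidth weights, and relay entries from a consensus."""
    info = {"weights": {}, "relays": []}
    relay = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        key = parts[0]
        if key in ("valid-after", "fresh-until"):
            info[key] = " ".join(parts[1:3])
        elif key == "bandwidth-weights":
            info["weights"] = {k: int(v) for k, v in (p.split("=") for p in parts[1:])}
        elif key == "r":
            # r <nickname> <identity> <digest> <date> <time> <IPv4> <ORPort> <DirPort>
            identity = parts[2] + "=" * (-len(parts[2]) % 4)  # restore base64 padding
            relay = {
                "fingerprint": base64.b64decode(identity).hex().upper(),
                "ipv4": parts[6],
                "flags": set(),
                "bandwidth": 0,
            }
            info["relays"].append(relay)
        elif key == "s" and relay is not None:
            relay["flags"] = set(parts[1:])
        elif key == "w" and relay is not None:
            for item in parts[1:]:
                if item.startswith("Bandwidth="):
                    relay["bandwidth"] = int(item.split("=", 1)[1])
    return info


def exit_probabilities(info):
    """Normalized exit weights for running exit relays (simplified assumption)."""
    weights = info["weights"]
    scores = {}
    for relay in info["relays"]:
        flags = relay["flags"]
        if "Running" not in flags or "Exit" not in flags or "BadExit" in flags:
            continue
        factor = weights["Wed"] if "Guard" in flags else weights["Wee"]
        scores[relay["fingerprint"]] = relay["bandwidth"] * factor
    total = sum(scores.values())
    return {fp: score / total for fp, score in scores.items()} if total else {}


# Hand-written sample with three relays; identities are base64 without padding.
sample = "\n".join([
    "valid-after 2020-06-01 12:00:00",
    "fresh-until 2020-06-01 13:00:00",
    "r relayA " + "A" * 27 + " digest 2020-06-01 11:00:00 192.0.2.1 9001 0",
    "s Exit Fast Running Valid",
    "w Bandwidth=3000",
    "r relayB " + "AQEB" * 6 + "AQE" + " digest 2020-06-01 11:00:00 192.0.2.2 9001 0",
    "s Exit Guard Running Valid",
    "w Bandwidth=2000",
    "r relayC " + "AgIC" * 6 + "AgI" + " digest 2020-06-01 11:00:00 192.0.2.3 9001 0",
    "s Guard Running Valid",
    "w Bandwidth=9000",
    "directory-footer",
    "bandwidth-weights Wed=5000 Wee=10000",
])
info = parse_consensus(sample)
probs = exit_probabilities(info)
```

The consensus weight scale (10000 by default) is left out because it cancels when the scores are normalized. `relayC` is parsed but receives no exit probability because it lacks the Exit flag.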
### Related questions
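- [(2)](home#metrics-to-track) How does the HTTP request headers affect
# Graphs for understanding CAPTCHA rates related to website decisions
## Weighted CAPTCHA rate by connection security
### Purpose
Understanding the effect of using TLS and not using TLS on the probability
of seeing a CAPTCHA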
### Steps to produce
1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into 2 bins based on whether the
      `is_https` field of each entry is `1` or `0`
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         scaling factor
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
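A rough Python sketch of Steps 2.6 to 2.8 for a single consensus is shown below. The field names `url` and `exit_fingerprint` are hypothetical (only `is_https` and `is_captcha_found` are named in the steps above), and dividing by the sum of the exit probabilities of the relays that actually have measurements is an assumption about what "scaling factor" means here.

```python
from collections import defaultdict


def weighted_captcha_rate(measurements, url_meta, exit_probs, bin_field="is_https"):
    """Return {bin value: weighted CAPTCHA percentage} for one consensus."""
    # counts[bin value][relay fingerprint] = [total measurements, with CAPTCHA]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        meta = url_meta.get(m["url"])      # hypothetical join key for the URL list
        fp = m["exit_fingerprint"]         # hypothetical join key for the relay data
        if meta is None or fp not in exit_probs:
            continue                       # no join partner, so skip the measurement
        cell = counts[meta[bin_field]][fp]
        cell[0] += 1
        cell[1] += int(m["is_captcha_found"] == 1)

    result = {}
    for bin_value, per_relay in counts.items():
        weight_sum = sum(exit_probs[fp] for fp in per_relay)
        if weight_sum == 0:
            continue
        weighted = sum(
            exit_probs[fp] * (100.0 * hits / total)
            for fp, (total, hits) in per_relay.items()
        )
        result[bin_value] = weighted / weight_sum
    return result


exit_probs = {"A": 0.75, "B": 0.25}
url_meta = {
    "https://example.com": {"is_https": 1},
    "http://example.com": {"is_https": 0},
}
measurements = [
    {"url": "https://example.com", "exit_fingerprint": "A", "is_captcha_found": 1},
    {"url": "https://example.com", "exit_fingerprint": "A", "is_captcha_found": 0},
    {"url": "https://example.com", "exit_fingerprint": "B", "is_captcha_found": 1},
    {"url": "http://example.com", "exit_fingerprint": "A", "is_captcha_found": 0},
    {"url": "http://example.com", "exit_fingerprint": "B", "is_captcha_found": 1},
    {"url": "http://example.com", "exit_fingerprint": "B", "is_captcha_found": 1},
]
rates = weighted_captcha_rate(measurements, url_meta, exit_probs)
```

In this toy data the HTTPS bin has per-relay rates of 50% (relay A) and 100% (relay B), so its weighted rate is 0.75 × 50 + 0.25 × 100 = 62.5. Passing a different `bin_field` reuses the same function for other URL metadata fields.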
### Related questions
- [(14)](home#metrics-to-track) Is there a difference if the origin server has
  an SSL certificate or not?
  - [(14.1)](home#metrics-to-track) Does the blocking change if the SSL
    certificate is issued by Cloudflare or by another entity?
### Purpose
Understanding the effect of connecting to websites that require single or
multiple HTTP requests to load on the probability of seeing a CAPTCHA
### Steps to produce
1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into 2 bins based on whether the
      `requires_multiple_reqs` field of each entry is `1` or `0`
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         scaling factor
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
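Steps 2.9 and 3 amount to turning the per-consensus results into one time series per bin. The sketch below shows one possible way to do that with the standard library only; the input layout (a list of `(valid-after, {bin: percentage})` pairs) and the function name are assumptions made for the example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in consensus headers


def merge_series(per_consensus):
    """Merge per-consensus results into X values and one Y series per bin."""
    # Timestamps in this format sort chronologically as plain strings.
    points = sorted(per_consensus, key=lambda item: item[0])
    bins = sorted({b for _, rates in points for b in rates})
    times = [datetime.strptime(ts, TIME_FORMAT) for ts, _ in points]
    # A bin without measurements in a consensus becomes None, i.e. a gap in the plot.
    series = {b: [rates.get(b) for _, rates in points] for b in bins}
    return times, series


per_consensus = [
    ("2020-06-01 13:00:00", {0: 20.0, 1: 60.0}),
    ("2020-06-01 12:00:00", {1: 62.5}),
]
times, series = merge_series(per_consensus)
```

`times` and each list in `series` have the same length, so they can be handed to any plotting library as the X and Y values of one line per bin.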
### Related questions
- [(13)](home#metrics-to-track) Is there a difference between websites that load
  resources from third-party resources and websites that contain all resources on
  the origin server? [ticket:33010#comment:6]
  - [(13.1)](home#metrics-to-track) How do users of websites get affected if
    the main website is not fronted by Cloudflare, but some of the resources are
    fetched from a Cloudflare fronted web server? [ticket:33010#comment:6], [ticket:15450]
### Purpose
Understanding the effect of connecting to websites that use CDN providers such
as Cloudflare, Akamai, Amazon Cloudfront, etc. on the probability of seeing a
CAPTCHA
### Steps to produce
1. Get consensuses from CollecTor
2. Repeat the following for each consensus:
   1. Parse and memorize the `valid-after` & `fresh-until` timestamps from the
      consensus header and `bandwidth-weights` values from the footer
   2. Repeat the following for each *running exit relay* entry within the consensus:
      1. Parse the `r` line and memorize the IPv4 address and identity
      2. Parse the `w` line and memorize the bandwidth
      3. Parse the `s` line and memorize the relay flags
   3. Calculate the weighted exit probabilities using the `bandwidth-weights`
      from the consensus, `bandwidth` values, and `flags` for each exit relay
      (see an example calculation [here](https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/metrics/onionoo/updater/NodeDetailsStatusUpdater.java#n597))
   4. Use the CAPTCHA Monitor API to get measurements that were completed
      using Tor and between the `valid-after` & `fresh-until` timestamps of the
      consensus
   5. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
      experiments. This list contains the metadata about the URLs.
   6. Join the measurements, URL list, and relay data using the relay
      fingerprints and URLs. Typically each relay and URL map to multiple measurements.
   7. Distribute the joined data into bins based on the value of the `cdn_provider` field
   8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used
         to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
         1. Count the total number of measurements in this sub-bin that were
            completed using this exit relay
         2. Count the total number of measurements in this sub-bin that were
            completed using this exit relay and have the `is_captcha_found` field
            set to `1`
         3. Calculate the percentage of measurements that received a CAPTCHA using
            $`\frac{\text{Step 2.8.2.2}}{\text{Step 2.8.2.1}} \times 100`$
      3. Calculate the weighted average of the percentage values (obtained in
         Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the
         scaling factor
   9. Plot the weighted percentage values for each bin on the Y-axis and
      the `valid-after` timestamp of the consensus on the X-axis
3. Merge the graphs created for each consensus
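Written out as a formula, the weighted average in Step 2.8.3 could look like the following. The notation is introduced here for illustration only: $`n_{i,b}`$ is the number of measurements through exit relay $`i`$ in bin $`b`$, $`k_{i,b}`$ is the number of those with `is_captcha_found` set to `1`, $`p_i`$ is the exit probability from Step 2.3, and $`R_b`$ is the set of relays with at least one measurement in the bin. Normalizing by the sum of $`p_i`$ over $`R_b`$ is an assumption; the steps above only say that the exit probabilities act as the scaling factor.

```math
c_{i,b} = \frac{k_{i,b}}{n_{i,b}} \times 100,
\qquad
\bar{c}_b = \frac{\sum_{i \in R_b} p_i \, c_{i,b}}{\sum_{i \in R_b} p_i}
```

For example, with two measured relays where $`p_1 = 0.75`$, $`c_{1,b} = 50`$ and $`p_2 = 0.25`$, $`c_{2,b} = 100`$, the weighted rate is $`(0.75 \times 50 + 0.25 \times 100) / (0.75 + 0.25) = 62.5`$.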
### Purpose
Understanding how Cloudflare treats Tor traffic vs. non-Tor traffic (this one
is stating the obvious, but it is still good to have data to back up the obvious)
### Steps to produce
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
   with a granularity of 1 hour.
1. Use the CAPTCHA Monitor API to get measurements that were completed during the
   chosen date range
2. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
   experiments. This list contains the metadata about the URLs.
3. Join the measurements and URL list using the `URL` fields. Typically each
   URL maps to multiple measurements.
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
   field
5. Iterate over the chosen date range with the chosen time intervals. Repeat
   the following for each iteration:
   1. Distribute the measurements that were completed within the interval of
      this iteration into 2 bins based on the value of the `method` field. Put the methods
      without "tor" (e.g. "firefox") into the `Non-Tor Traffic` bin and the rest
      (e.g. "firefox_over_tor") into the `Tor Traffic` bin.
   2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have the
         `is_captcha_found` field set to `1`
      3. Calculate the percentage of measurements that received a CAPTCHA using
         $`\frac{\text{Step 5.2.2}}{\text{Step 5.2.1}} \times 100`$ (Leave this bin's value
         empty if there are no corresponding measurements)
   3. Plot the percentage values for each bin on the Y-axis and the beginning
      time of this interval on the X-axis
6. Merge the graphs created for each iteration
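The Tor versus non-Tor comparison (Steps 4 and 5 of the date-range procedure) can be sketched as below. It assumes the measurements are already joined with the URL metadata, so each one carries `cdn_provider`, and that each has a `timestamp` field holding a `datetime`; that field name and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def tor_vs_non_tor(measurements, start, end, step=timedelta(hours=1)):
    """Return [(interval start, {bin name: CAPTCHA percentage or None})]."""
    rows = []
    t = start
    while t < end:
        bins = {"Tor Traffic": [0, 0], "Non-Tor Traffic": [0, 0]}  # [total, with CAPTCHA]
        for m in measurements:
            if not (t <= m["timestamp"] < t + step):
                continue
            if "cloudflare" not in m["cdn_provider"]:
                continue  # keep only Cloudflare-fronted URLs
            name = "Tor Traffic" if "tor" in m["method"] else "Non-Tor Traffic"
            bins[name][0] += 1
            bins[name][1] += int(m["is_captcha_found"] == 1)
        rows.append((t, {
            # None marks an interval without measurements for this bin
            name: (100.0 * hits / total if total else None)
            for name, (total, hits) in bins.items()
        }))
        t += step
    return rows


def _m(minute_offset, method, cdn, captcha):
    """Build one toy measurement relative to midnight on 2020-06-01."""
    return {
        "timestamp": datetime(2020, 6, 1) + timedelta(minutes=minute_offset),
        "method": method,
        "cdn_provider": cdn,
        "is_captcha_found": captcha,
    }


data = [
    _m(10, "firefox_over_tor", "cloudflare", 1),
    _m(20, "tor_browser", "cloudflare", 0),
    _m(30, "firefox", "cloudflare", 0),
    _m(40, "firefox_over_tor", "akamai", 1),
    _m(65, "firefox", "cloudflare", 0),
]
rows = tor_vs_non_tor(data, datetime(2020, 6, 1, 0), datetime(2020, 6, 1, 2))
```

Scanning all measurements for every interval keeps the sketch short; for 30 days of hourly intervals it would be more efficient to sort the measurements once and assign each to its interval directly.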
### Purpose
Cloudflare sometimes injects third-party code into websites without letting the
users know. This graph aims to visualize the percentage of measurements that were
affected by third-party code injection over time.
### Steps to produce
0. Determine a date range and granularity to plot. Here, we will plot the last 30 days
   with a granularity of 1 hour.
1. Use the CAPTCHA Monitor API to get measurements that were completed during the
   chosen date range
2. Use the CAPTCHA Monitor API to get the list of URLs that are used in the
   experiments. This list contains the metadata about the URLs.
3. Join the measurements and URL list using the `URL` fields. Typically each
   URL maps to multiple measurements.
4. Discard the measurements that do not have `cloudflare` in their `cdn_provider`
   field
5. Iterate over the chosen date range with the chosen time intervals. Repeat
   the following for each iteration:
   1. Distribute the measurements that were completed within the
      interval of this iteration into 2 bins based on the value of the `is_data_modified`
      field. Skip the measurements that do not have the `is_data_modified` field.
   2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have the
         `is_captcha_found` field set to `1`
      3. Calculate the percentage of measurements that received a CAPTCHA using
         $`\frac{\text{Step 5.2.2}}{\text{Step 5.2.1}} \times 100`$ (Leave this bin's value
         empty if there are no corresponding measurements)
   3. Plot the percentage values for each bin on the Y-axis and the beginning
      time of this interval on the X-axis
6. Merge the graphs created for each iteration