Last edited by Barkin Simsek Aug 12, 2020
Dashboard Graphs v0

Redesigning CAPTCHA Monitor's Data Visualization Tools [Archived v0 version]

Problem

  • The graphs currently displayed on the dashboard are somewhat misleading and difficult to understand
    • The graphing algorithm does not distinguish between exit relays that see a CAPTCHA once and those that see one fifty times; it treats them all as equal
    • The graphing algorithm does not consider an exit relay's size, even though a larger exit relay receiving CAPTCHAs has a larger effect on the network
    • The scale of the graphs is also misleading in some cases
  • There are no clear explanations of what each graph aims to achieve
  • There is also no way for curious people to reproduce the graphs or understand the process used to produce them

Solution

The solution is to redesign the graphs by addressing the problems of the first version and incorporating feedback. This document describes how to produce the v2 graphs that will be available on the CAPTCHA Monitor's dashboard. The graphs are divided into five main groups based on their function and purpose.

The document looks long, but many explanations repeat across the graphs. If you have any suggestions or feedback, please mention them under ticket #41 of this repository.

Table of contents

  • Default graph style
  • Graphs for understanding CAPTCHA rates related to user decisions
    • Weighted CAPTCHA rate by method
    • Weighted CAPTCHA rate by connection security
    • Weighted CAPTCHA rate by HTTP request quantity
    • Weighted CAPTCHA rate by CDN provider
  • Graphs for understanding the overall network status
    • Probability of a Tor client receiving CAPTCHA
    • Weighted CAPTCHA rate by IP version
    • Weighted CAPTCHA rate by exit probability
    • Weighted CAPTCHA rate by exit relay age
    • Weighted CAPTCHA rate by exit relay location
  • Graphs for understanding the Cloudflare firewall
    • CAPTCHA rate by Cloudflare security level/firewall settings
    • CAPTCHA rate by traffic origin
    • Weighted CAPTCHA rate by exit relay age
    • Weighted CAPTCHA rate by exit relay location
    • Code injection rate
  • Graphs about Tor Browser centric data
    • Weighted CAPTCHA rate by Tor Browser version
    • Weighted CAPTCHA rate by Tor Browser security level
  • Graphs about individual exit relays
    • Overall CAPTCHA rate
    • CAPTCHA rate by CDN provider

Default graph style

The following graph style will be used for all graphs unless otherwise specified:

  • Type
    • Line chart
  • Axes
    • X-axis: The dates of the last 30 days. The descriptions below say that the valid-after timestamps of the consensuses are used on the X-axis. These timestamps place the data at the correct position in the graphs, but their values are not used as axis labels. This keeps the graph labels uncluttered.
    • Y-axis: Percentage values from 0% to 100% on a linear scale
  • Sample graph (the number of data points is reduced for simplicity): [image: graph-style]

Graphs for understanding CAPTCHA rates related to user decisions

Weighted CAPTCHA rate by method

Purpose

Understanding the effect of using different methods (for example using web browsers like Tor Browser, Firefox over Tor, Brave's Tor Tabs, etc.) on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Join the measurements and relay data using the relay fingerprints. Typically each relay maps to multiple measurements.
    6. Distribute the joined data into bins based on method field's value
    7. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    8. Plot the weighted percentage values for each method bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
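
The binning and weighting in Steps 2.6 to 2.7 can be sketched roughly as follows. This is a minimal illustration, not the dashboard's actual code: the function name and the "fingerprint" key are hypothetical, while "method" and "is_captcha_found" are the fields named above. It assumes the exit probabilities from Step 2.3 are already available as a mapping, and it normalizes by the probabilities of the relays that were actually measured in each bin.

```python
from collections import defaultdict


def weighted_captcha_rate_by_method(measurements, exit_probability):
    """Return {method: weighted CAPTCHA percentage} for one consensus.

    measurements: dicts with "method", "fingerprint" (hypothetical key name)
                  and "is_captcha_found".
    exit_probability: {fingerprint: exit probability} from Step 2.3.
    """
    # counts[method][fingerprint] = [total measurements, measurements with CAPTCHA]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        if m["fingerprint"] not in exit_probability:
            continue  # relay is not in this consensus, so it has no weight
        cell = counts[m["method"]][m["fingerprint"]]
        cell[0] += 1
        if m["is_captcha_found"] == 1:
            cell[1] += 1

    rates = {}
    for method, per_relay in counts.items():
        weight_sum = sum(exit_probability[fp] for fp in per_relay)
        if weight_sum == 0:
            continue  # nothing to weight by in this bin
        weighted = sum(
            exit_probability[fp] * (captcha / total * 100)
            for fp, (total, captcha) in per_relay.items()
        )
        rates[method] = weighted / weight_sum
    return rates
```

Calling this once per consensus and storing the result under the consensus's valid-after timestamp gives one data point per method for the final plot.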

Related questions

  • (2) How do the HTTP request headers affect Cloudflare's decision-making mechanism? [ticket:33010#comment:4]
    • (2.1) Is there a difference between using the actual Tor Browser itself and tor-browser-selenium in terms of the HTTP headers?
    • (2.2) How does Cloudflare react differently if the browser doesn't support alt-svc headers? [ticket:32915]
  • (3) How do different browsers with different User Agents get affected? [ticket:33010#comment:2], [ticket:32924], [ticket:31404]
    • (3.1) Is there a difference between using a web browser or fetching web pages via cURL or other HTTP libraries?
  • (7) How does the time of day affect Cloudflare's blocking mechanism? Does the day of the week or the time of day matter? [ticket:33010#comment:15]
  • (15) If browsers that should not face CAPTCHA face CAPTCHA, why does this happen?
  • (16) How do the observed patterns in the results change over time? [ticket:33010]

Weighted CAPTCHA rate by connection security

Purpose

Understanding the effect of using or not using TLS on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
    6. Join the measurements, URL list, and relay data using the relay fingerprints and URLs. Typically each relay and URL map to multiple measurements.
    7. Distribute the joined data into 2 bins based on whether the is_https field of each entry is 1 or 0
    8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    9. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
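
Steps 2.6 and 2.7 (joining the measurements with the URL metadata and splitting them on is_https) might look like the sketch below. The function name is hypothetical, and it assumes both the measurements and the URL list carry a "url" key to join on; the real API responses may be shaped differently.

```python
def bin_by_url_field(measurements, url_list, field="is_https"):
    """Attach URL metadata to each measurement and split the result by `field`.

    measurements: dicts with at least a "url" key (assumed join key).
    url_list: dicts with "url" plus metadata such as "is_https".
    Returns {field value: [joined measurement, ...]}.
    """
    metadata = {entry["url"]: entry for entry in url_list}
    bins = {}
    for m in measurements:
        meta = metadata.get(m["url"])
        if meta is None:
            continue  # no metadata for this URL, so it cannot be binned
        joined = {**meta, **m}  # measurement values win on key clashes
        bins.setdefault(joined[field], []).append(joined)
    return bins
```

The same helper covers the other URL-metadata graphs by passing a different field name, for example requires_multiple_reqs or cdn_provider.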

Related questions

  • (14) Is there a difference if the origin server has an SSL certificate or not?
    • (14.1) Does the blocking change if the SSL certificate is issued by Cloudflare or by another entity?

Weighted CAPTCHA rate by HTTP request quantity

Purpose

Understanding the effect of connecting to websites that require single or multiple HTTP requests to load on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
    6. Join the measurements, URL list, and relay data using the relay fingerprints and URLs. Typically each relay and URL map to multiple measurements.
    7. Distribute the joined data into 2 bins based on whether the requires_multiple_reqs field of each entry is 1 or 0
    8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    9. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
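
Steps 2.1 and 2.2, which are shared by all the consensus-based graphs, can be sketched as a small line-oriented parser. This is an illustration only: it assumes the full ("ns" flavor) consensus, where an r line reads `r nickname identity digest date time IP ORPort DirPort` and the identity is base64 without padding. The function name and the returned dictionary keys are hypothetical.

```python
import base64


def parse_consensus(text):
    """Extract header timestamps, bandwidth weights and running exit relays."""
    header = {}
    bandwidth_weights = {}
    relays = []
    current = None
    for line in text.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword in ("valid-after", "fresh-until"):
            header[keyword] = rest
        elif keyword == "bandwidth-weights":
            pairs = (part.split("=") for part in rest.split())
            bandwidth_weights = {name: int(value) for name, value in pairs}
        elif keyword == "r":
            # fields: nickname, identity, digest, date, time, IP, ORPort, DirPort
            fields = rest.split()
            current = {
                # identity is 20 bytes in unpadded base64; convert to hex
                "fingerprint": base64.b64decode(fields[1] + "=").hex().upper(),
                "ipv4": fields[5],
                "bandwidth": 0,
                "flags": set(),
            }
            relays.append(current)
        elif keyword == "s" and current is not None:
            current["flags"] = set(rest.split())
        elif keyword == "w" and current is not None:
            for part in rest.split():
                if part.startswith("Bandwidth="):
                    current["bandwidth"] = int(part.split("=", 1)[1])
    # keep only relays that carry both the Exit and Running flags
    exits = [r for r in relays if {"Exit", "Running"} <= r["flags"]]
    return header, bandwidth_weights, exits
```

A dedicated descriptor-parsing library would be more robust for production use; the point here is only to show which lines feed which values.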

Related questions

  • (13) Is there a difference between websites that load resources from third-party resources and websites that contain all resources on the origin server? [ticket:33010#comment:6]
    • (13.1) How do users of websites get affected if the main website is not fronted by Cloudflare, but some of the resources are fetched from a Cloudflare fronted web server? [ticket:33010#comment:6], [ticket:15450]

Weighted CAPTCHA rate by CDN provider

Purpose

Understanding the effect of connecting to websites that use CDN providers such as Cloudflare, Akamai, Amazon CloudFront, etc. on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
    6. Join the measurements, URL list, and relay data using the relay fingerprints and URLs. Typically each relay and URL map to multiple measurements.
    7. Distribute the joined data into bins based on cdn_provider field's value
    8. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.8.2.2}{Step 2.8.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.8.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    9. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
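
Step 2.3 (the weighted exit probability) can be approximated as shown below. This is a simplified sketch under stated assumptions rather than the example calculation linked above: it assumes the exit position uses the Wee weight for relays with the Exit flag and the Wed weight for relays with both Exit and Guard, that the weights are scaled by the default of 10000, and that BadExit relays are skipped. The function name and the relay dictionary keys are hypothetical.

```python
def exit_probabilities(relays, bandwidth_weights, scale=10000):
    """Return {fingerprint: probability of being picked as the exit}.

    relays: dicts with "fingerprint", "bandwidth" and "flags" (a set of strings).
    bandwidth_weights: e.g. {"Wee": 10000, "Wed": 5000} from the consensus footer.
    scale: assumed weight scale (10000 unless the consensus says otherwise).
    """
    weights = {}
    for relay in relays:
        flags = relay["flags"]
        if "Exit" not in flags or "BadExit" in flags:
            continue  # not usable in the exit position
        key = "Wed" if "Guard" in flags else "Wee"
        weights[relay["fingerprint"]] = relay["bandwidth"] * bandwidth_weights[key] / scale
    total = sum(weights.values())
    if total == 0:
        return {}
    # normalize so that the probabilities sum to 1
    return {fp: w / total for fp, w in weights.items()}
```

For example, an Exit-only relay with bandwidth 3000 and an Exit+Guard relay with bandwidth 2000 under Wee=10000 and Wed=5000 get weights 3000 and 1000, so their exit probabilities are 0.75 and 0.25.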

Graphs for understanding the overall network status

Probability of a Tor client receiving CAPTCHA

Purpose

Understanding the probability that a Tor client, choosing an exit relay in the normal weighted way, receives a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Repeat the following for each running exit relay entry within the consensus:
      1. Count the total number of measurements that were completed using this exit relay
      2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
      3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.5.2}{Step 2.5.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
    6. Calculate the weighted average of the percentage values (obtained in Step 2.5.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    7. Map and memorize the consensus's valid-after timestamp to the weighted average of the percentages
  3. Plot the weighted percentage values for each consensus in the Y-axis and the valid-after timestamps in the X-axis
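
Steps 2.5 and 2.6 differ from the binned graphs in that every running exit relay in the consensus contributes, with 0% assumed for relays that have no measurements. A minimal sketch, with a hypothetical function name and a hypothetical "fingerprint" key on each measurement:

```python
def network_captcha_probability(exit_probability, measurements):
    """Weighted CAPTCHA percentage over all exit relays of one consensus.

    exit_probability: {fingerprint: exit probability} from Step 2.3.
    measurements: dicts with "fingerprint" (hypothetical key) and "is_captcha_found".
    Returns None when the consensus has no weighted exit relays.
    """
    totals, captchas = {}, {}
    for m in measurements:
        fp = m["fingerprint"]
        totals[fp] = totals.get(fp, 0) + 1
        if m["is_captcha_found"] == 1:
            captchas[fp] = captchas.get(fp, 0) + 1

    weight_sum = sum(exit_probability.values())
    if weight_sum == 0:
        return None

    weighted = 0.0
    for fp, probability in exit_probability.items():
        if fp in totals:
            rate = captchas.get(fp, 0) / totals[fp] * 100
        else:
            rate = 0.0  # relay is in the consensus but was never measured
        weighted += probability * rate
    return weighted / weight_sum
```

Note that the 0% assumption pulls the result down when coverage is low, so the value is best read together with the share of exit probability that was actually measured.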

Related questions

  • (12) What is the chance of a Tor client getting affected by Cloudflare's blocking practices when choosing a Tor exit node? [ticket:33010]
  • (17) Is whether you get a CAPTCHA much more probabilistic and transient? [ticket:33010]
  • (18) The chance that a Tor client, choosing an exit relay in the normal weighted fashion, will get hit by a CAPTCHA [ticket:33010]

Weighted CAPTCHA rate by IP version

Purpose

Understanding the effect of connecting to web servers (and consequently exit relays) that support IPv4 vs IPv6 on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Obtain the "details document" from Onionoo and match the Onionoo data with the relay entries from consensus using the relay fingerprints. The following query is recommended for obtaining the "details document": https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,exit_policy_v6_summary
    6. Distribute the exit relay entries from the consensus into 2 bins based on whether they support IPv6 exiting or not. This should be decided based on the exit_policy_v6_summary field obtained from the "details document"
    7. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.7.1.2}{Step 2.7.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.7.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    8. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
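
Step 2.6 (splitting the exit relays by IPv6 exiting support) could be sketched as below. The function and bin names are hypothetical, and the rule used here is an assumption: a relay is treated as supporting IPv6 exiting whenever its exit_policy_v6_summary field is present and non-empty in the "details document".

```python
def bin_by_ipv6_support(exit_fingerprints, onionoo_details):
    """Split exit relays into "ipv6" and "ipv4_only" bins.

    exit_fingerprints: fingerprints of the exit relays from the consensus.
    onionoo_details: the parsed JSON of the Onionoo "details document".
    """
    v6_policy = {
        relay["fingerprint"]: relay.get("exit_policy_v6_summary")
        for relay in onionoo_details.get("relays", [])
    }
    bins = {"ipv6": [], "ipv4_only": []}
    for fp in exit_fingerprints:
        # assumption: any non-empty summary means some IPv6 exiting is allowed
        key = "ipv6" if v6_policy.get(fp) else "ipv4_only"
        bins[key].append(fp)
    return bins
```

Relays missing from the Onionoo response land in the "ipv4_only" bin in this sketch; dropping them instead would be an equally defensible choice.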

Related questions

  • (1) Does Cloudflare treat IPv4 and IPv6 addresses differently? [ticket:33010#comment:2]
  • (9) How do specific exit nodes get affected by Cloudflare's blocking practices?

Weighted CAPTCHA rate by exit probability

Purpose

Understanding the effect of using smaller or larger exit relays on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Distribute the exit relay entries from the consensus into 10 bins (each bin containing probability values between n and n+0.1) based on their exit probabilities (calculated in Step 2.3)
    6. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.6.1.2}{Step 2.6.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.6.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    7. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
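
The binning in Step 2.5 can be expressed as a small helper. This is a sketch with hypothetical function names; it assumes half-open bins [n, n+0.1) with a probability of exactly 1.0 placed in the last bin.

```python
def probability_bin(probability, bin_count=10):
    """Return the index of the bin that a probability in [0, 1] falls into."""
    index = int(probability * bin_count)
    # a probability of exactly 1.0 would otherwise land in a non-existent bin
    return min(index, bin_count - 1)


def bin_by_exit_probability(exit_probability, bin_count=10):
    """Group relay fingerprints into bins by their exit probability."""
    bins = [[] for _ in range(bin_count)]
    for fingerprint, probability in exit_probability.items():
        bins[probability_bin(probability, bin_count)].append(fingerprint)
    return bins
```

Because the exit probabilities of all relays in a consensus sum to 1, expect most relays to fall into the lowest bin.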

Related questions

  • (9) How do specific exit nodes get affected by Cloudflare's blocking practices?
    • (9.1) Does the size/age/location of the exit node play a role? [ticket:33010#comment:15]
    • (9.2) Is it always the same Tor exit nodes that get blocked?
  • (11) What fraction of the Tor exit nodes get affected by Cloudflare's blocking practices? [ticket:33010], [ticket:23840#comment:22]

Weighted CAPTCHA rate by exit relay age

Purpose

Understanding the effect of using older or younger exit relays (based on first_seen date) on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Obtain the "details document" from Onionoo and match the Onionoo data with the relay entries from consensus using the relay fingerprints. The following query is recommended for obtaining the "details document": https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,first_seen
    6. Calculate the age of the exit relays in days using the first_seen field of the "details document" and the valid-after timestamp of the consensus: exit_age = ceil_days(valid-after - first_seen)
    7. Distribute the exit relay entries from the consensus into (max(exit_age) - min(exit_age)) / 365 bins based on their ages (calculated in Step 2.6)
    8. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.8.1.2}{Step 2.8.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.8.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    9. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
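
Steps 2.6 and 2.7 (computing relay age and assigning one-year bins) might be sketched as follows. The function names are hypothetical. The sketch assumes both timestamps use the "YYYY-MM-DD hh:mm:ss" UTC format, and it assigns bins by floor division counted from the youngest relay, which is one possible reading of the bin rule above.

```python
import math
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def exit_age_days(valid_after, first_seen):
    """exit_age = ceil_days(valid-after - first_seen)."""
    delta = datetime.strptime(valid_after, TIME_FORMAT) - datetime.strptime(
        first_seen, TIME_FORMAT
    )
    return math.ceil(delta.total_seconds() / 86400)


def age_bin(age_days, min_age_days):
    """Index of the 365-day-wide bin, counted from the youngest relay."""
    return (age_days - min_age_days) // 365
```

For example, a relay first seen at 2020-08-10 12:00:00 has an age of 2 days in the consensus valid after 2020-08-12 10:00:00, because the partial second day is rounded up.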

Related questions

  • (9) How do specific exit nodes get affected by Cloudflare's blocking practices?
    • (9.1) Does the size/age/location of the exit node play a role? [ticket:33010#comment:15]
    • (9.2) Is it always the same Tor exit nodes that get blocked?

Weighted CAPTCHA rate by exit relay location

Purpose

Understanding the effect of the exit relay's physical location on the probability of seeing a CAPTCHA. This graph will show the top 10 countries with the highest CAPTCHA rates.

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Obtain the "details document" from Onionoo and match the Onionoo data with the relay entries from consensus using the relay fingerprints. The following query is recommended for obtaining the "details document": https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,country_name
    6. Distribute the exit relay entries from the consensus into bins based on their country_name value (obtained in Step 2.5)
    7. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.7.1.2}{Step 2.7.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.7.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    8. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs of the 10 countries with the highest percentage values and discard the rest (or keep them if you want them as well)
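
Selecting the countries to keep in Step 3 can be sketched as below. The function name is hypothetical, and the sketch assumes each country has already been reduced to a single summary value (for example its most recent or its mean weighted CAPTCHA rate); how to reduce a country's time series to one value is not specified above.

```python
def top_countries(rate_by_country, limit=10):
    """Return the `limit` (country, rate) pairs with the highest rates."""
    ranked = sorted(rate_by_country.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Countries with equal rates keep their original relative order, because Python's sort is stable.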

Related questions

  • (9) How do specific exit nodes get affected by Cloudflare's blocking practices?
    • (9.1) Does the size/age/location of the exit node play a role? [ticket:33010#comment:15]
    • (9.2) Is it always the same Tor exit nodes that get blocked?

Graphs for understanding the Cloudflare firewall

CAPTCHA rate by Cloudflare security level/firewall settings

Purpose

Understanding the effect of different Cloudflare security levels and firewall configurations on the probability of seeing a CAPTCHA.

We have a few different domains to test different configurations. Here they are:

  • captcha.wtf
    • IPv4 only domain, no additional Cloudflare firewall rules
  • yearlight.buzz
    • IPv4 only domain, Cloudflare firewall is set to present "JS Challenge" for traffic originating from the Tor network
  • bottomlesspit.xyz
    • IPv4 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for traffic originating from the Tor network
  • broccolipizza.monster
    • IPv4 only domain, Cloudflare firewall is set to block all traffic originating from the Tor network
  • exit11.online
    • IPv6 only domain, no additional Cloudflare firewall rules
  • icanhazcaptcha.xyz
    • IPv6 only domain, Cloudflare firewall is set to present "CAPTCHA Challenge" for traffic originating from the Tor network

Steps to produce

  1. Determine a date range and granularity to plot. Here, we will plot the last 30 days with a granularity of 1 hour.
  2. Use CAPTCHA Monitor API to get measurements that were completed using only the domains specified above and during the chosen date range
  3. Iterate over the chosen date range with the chosen time intervals. Repeat the following for each iteration:
    1. Distribute the measurements that were completed within the interval of this iteration into bins based on url field's value
    2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have is_captcha_found field set to 1
      3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 3.2.2}{Step 3.2.1} \times 100 (Leave this bin's value empty if there are no corresponding measurements)
    3. Plot the percentage values for each bin in the Y-axis and the beginning time of this interval in the X-axis
  4. Merge the graphs created for each iteration
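
The unweighted, interval-based calculation in Step 3 can be sketched as follows. The function name and the "timestamp" key (with its "YYYY-MM-DD hh:mm:ss" format) are assumptions about the measurement records; "url" and "is_captcha_found" are the fields named above. The granularity is fixed at 1 hour to match the example.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def hourly_captcha_rate_by_url(measurements):
    """Return {(hour start, url): CAPTCHA percentage}.

    Intervals without measurements are simply absent from the result,
    which corresponds to leaving the bin's value empty.
    """
    counts = defaultdict(lambda: [0, 0])  # [total, with CAPTCHA]
    for m in measurements:
        completed = datetime.strptime(m["timestamp"], TIME_FORMAT)
        hour_start = completed.replace(minute=0, second=0, microsecond=0)
        cell = counts[(hour_start, m["url"])]
        cell[0] += 1
        if m["is_captcha_found"] == 1:
            cell[1] += 1
    return {key: captcha / total * 100 for key, (total, captcha) in counts.items()}
```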

Related questions

  • (6) How do different security levels of Cloudflare affect the blocking mechanism? [ticket:33010#comment:5]
    • (6.1) Do some of the Cloudflare security levels block users immediately without presenting a CAPTCHA challenge at all?

CAPTCHA rate by traffic origin

Purpose

Understanding how Cloudflare treats Tor traffic vs. non-Tor traffic (this one states the obvious, but it is still good to have data to back it up)

Steps to produce

  1. Determine a date range and granularity to plot. Here, we will plot the last 30 days with a granularity of 1 hour.
  2. Use CAPTCHA Monitor API to get measurements that were completed during the chosen date range
  3. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
  4. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
  5. Discard the measurements that do not have cloudflare in their cdn_provider field
  6. Iterate over the chosen date range with the chosen time intervals. Repeat the following for each iteration:
    1. Distribute the measurements that were completed within the interval of this iteration into 2 bins based on method field's value. Put the methods without "tor" in their name (e.g. "firefox") into the Non-Tor Traffic bin and the rest (e.g. "firefox_over_tor") into the Tor Traffic bin.
    2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have is_captcha_found field set to 1
      3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 6.2.2}{Step 6.2.1} \times 100 (Leave this bin's value empty if there are no corresponding measurements)
    3. Plot the percentage values for each bin in the Y-axis and the beginning time of this interval in the X-axis
  7. Merge the graphs created for each iteration
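
The binning and percentage steps above can be sketched as follows. This is only a minimal illustration, not the dashboard's actual code: it assumes the measurements are already filtered to Cloudflare and are plain dicts, and the `timestamp` key (completion time as a `datetime`) is a hypothetical name. Only `method` and `is_captcha_found` come from the steps above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BINS = ("Tor Traffic", "Non-Tor Traffic")

def captcha_rate_by_origin(measurements, start: datetime, end: datetime,
                           step=timedelta(hours=1)):
    """Return {interval_start: {bin_name: percentage or None}}."""
    # counts[interval_start][bin_name] = [total, with_captcha]
    counts = defaultdict(lambda: {name: [0, 0] for name in BINS})
    for m in measurements:
        ts = m["timestamp"]  # hypothetical field name
        if not (start <= ts < end):
            continue
        # Align the timestamp to the beginning of its interval
        interval_start = start + ((ts - start) // step) * step
        bin_name = "Tor Traffic" if "tor" in m["method"] else "Non-Tor Traffic"
        counts[interval_start][bin_name][0] += 1
        if m["is_captcha_found"] == 1:
            counts[interval_start][bin_name][1] += 1

    rates = {}
    t = start
    while t < end:
        # None marks an empty bin, so the plot can leave a gap there
        rates[t] = {
            name: (100.0 * captcha / total if total else None)
            for name, (total, captcha) in counts[t].items()
        }
        t += step
    return rates
```

Each key of the returned dict is the beginning time of an interval (the X-axis value), and each inner value is the percentage to plot for that bin.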

Weighted CAPTCHA rate by exit relay age

Purpose

Understanding how quickly Cloudflare blocks the newer relays and if there is a different treatment for older relays

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
    6. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
    7. Discard the measurements that do not have cloudflare in their cdn_provider field
    8. Obtain the "details document" from Onionoo and match the Onionoo data with the relay entries from consensus using the relay fingerprints. The following query is recommended for obtaining the "details document": https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,first_seen
    9. Calculate the age of the exit relays in days using the first_seen field of the "details document" and valid-after timestamp of the consensus (exit_age = ceil_days(valid-after - first_seen))
    10. Distribute the exit relay entries from the consensus into (max(exit_age) - min(exit_age)) / 365 bins based on their ages (calculated in Step 2.9)
    11. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.11.1.2}{Step 2.11.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.11.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    12. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
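
The age calculation, binning, and weighted average for a single consensus could look roughly like the sketch below. It rests on several assumptions: each relay is a dict with hypothetical keys `first_seen` (a `datetime`), `exit_probability`, and `captcha_rate` (0 to 100, already set to 0 for unmeasured relays); relays are grouped into one bin per 365 days of age; and the weighted average is normalized by the sum of exit probabilities within the bin, which the steps above do not spell out.

```python
import math
from collections import defaultdict
from datetime import datetime

def exit_age_days(valid_after, first_seen):
    """ceil_days(valid-after - first_seen): relay age in days, rounded up."""
    return math.ceil((valid_after - first_seen).total_seconds() / 86400)

def weighted_rate_by_age(relays, valid_after, bin_days=365):
    """Return {bin_index: weighted CAPTCHA percentage} for one consensus."""
    sums = defaultdict(lambda: [0.0, 0.0])  # bin -> [sum(p * rate), sum(p)]
    for relay in relays:
        age = exit_age_days(valid_after, relay["first_seen"])
        bin_index = age // bin_days  # 0 = younger than a year, 1 = one to two years, ...
        sums[bin_index][0] += relay["exit_probability"] * relay["captcha_rate"]
        sums[bin_index][1] += relay["exit_probability"]
    # Assumption: normalize by the total exit probability inside each bin
    return {b: (num / den if den else None) for b, (num, den) in sorted(sums.items())}
```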

Related questions

  • (8) How often does Cloudflare's blocking mechanism change/update itself?
  • (10) How well does Cloudflare keep track of the new or old Tor exit nodes?
  • (10.1) How frequently does Cloudflare update its Tor exit node list?

Weighted CAPTCHA rate by exit relay location

Purpose

Understanding whether Cloudflare blocks requests from exit relays in certain countries more often than others

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor and between the valid-after & fresh-until timestamps of the consensus
    5. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
    6. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
    7. Discard the measurements that do not have cloudflare in their cdn_provider field
    8. Obtain the "details document" from Onionoo and match the Onionoo data with the relay entries from consensus using the relay fingerprints. The following query is recommended for obtaining the "details document": https://onionoo.torproject.org/details?type=relay&flag=Exit&fields=exit_addresses,fingerprint,country_name
    9. Distribute the exit relay entries from the consensus into bins based on their country_name value (obtained in Step 2.8)
    10. Repeat the following for each bin:
      1. Repeat the following for each exit relay in the bin:
        1. Count the total number of measurements that were completed using this exit relay
        2. Count the total number of measurements that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.10.1.2}{Step 2.10.1.1} \times 100 (Assume 0% if an exit relay exists in the consensus but there are no corresponding measurements)
      2. Calculate the weighted average of the percentage values (obtained in Step 2.10.1.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    11. Plot the weighted percentage values for each bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs with top 10 highest percentage values and discard the rest (or keep if you want to have them as well)
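
As a rough sketch of the country binning and the final top-10 selection, the function below takes the relays of one consensus and an already downloaded and parsed Onionoo details document. The `relays` array with `fingerprint` and `country_name` fields follows the Onionoo query above; the relay dict keys `exit_probability` and `captcha_rate`, and the normalization by the summed exit probability per country, are assumptions made for illustration.

```python
from collections import defaultdict

def weighted_rate_by_country(relays, onionoo_details, top_n=10):
    """Return [(country_name, weighted percentage)] sorted from highest to lowest."""
    # Match Onionoo data with consensus relays using the fingerprints
    country_of = {
        r["fingerprint"]: r.get("country_name")
        for r in onionoo_details["relays"]
    }
    sums = defaultdict(lambda: [0.0, 0.0])  # country -> [sum(p * rate), sum(p)]
    for relay in relays:
        country = country_of.get(relay["fingerprint"])
        if country is None:
            continue  # no Onionoo match, so the relay cannot be binned
        sums[country][0] += relay["exit_probability"] * relay["captcha_rate"]
        sums[country][1] += relay["exit_probability"]
    rates = {c: num / den for c, (num, den) in sums.items() if den > 0}
    ranked = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

Passing a larger `top_n` keeps the remaining countries instead of discarding them.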

Code injection rate

Purpose

Cloudflare sometimes injects third-party code into websites without letting the users know. This graph aims to visualize the percentage of measurements that were affected by third-party code injection over time.

Steps to produce

  1. Determine a date range and granularity to plot. Here, we will plot the last 30 days with a granularity of 1 hour.
  2. Use CAPTCHA Monitor API to get measurements that were completed during the chosen date range
  3. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
  4. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
  5. Discard the measurements that do not have cloudflare in their cdn_provider field
  6. Iterate over the chosen date range with the chosen time intervals. Repeat the following for each iteration:
    1. Distribute the measurements that were completed within the interval of this iteration into 2 bins based on is_data_modified field's value. Skip the measurements that do not have is_data_modified field.
    2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have is_captcha_found field set to 1
      3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 6.2.2}{Step 6.2.1} \times 100 (Leave this bin's value empty if there are no corresponding measurements)
    3. Plot the percentage values for each bin in the Y-axis and the beginning time of this interval in the X-axis
  7. Merge the graphs created for each iteration
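
For one time interval, the two-way split on is_data_modified and the per-bin calculation described above might be sketched like this. It assumes measurements are dicts and treats a missing or `None` value of `is_data_modified` as "field not present"; both choices are assumptions, not something the CAPTCHA Monitor API is known to guarantee.

```python
def captcha_rate_by_data_modified(measurements):
    """Return {True: pct, False: pct} for one interval; None marks an empty bin."""
    bins = {True: [0, 0], False: [0, 0]}  # is_data_modified -> [total, with_captcha]
    for m in measurements:
        modified = m.get("is_data_modified")
        if modified is None:
            continue  # skip measurements without the is_data_modified field
        counts = bins[bool(modified)]
        counts[0] += 1
        if m["is_captcha_found"] == 1:
            counts[1] += 1
    return {
        key: (100.0 * captcha / total if total else None)
        for key, (total, captcha) in bins.items()
    }
```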

Graphs about Tor Browser centric data

Weighted CAPTCHA rate by Tor Browser version

Purpose

Understanding the effect of using different Tor Browser versions on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor Browser (method field is equal to tor_browser) and between the valid-after & fresh-until timestamps of the consensus
    5. Join the measurements and relay data using the relay fingerprints. Typically each relay maps to multiple measurements.
    6. Distribute the joined data into bins based on browser_version field's value
    7. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    8. Plot the weighted percentage values for each browser_version bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
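
Steps 2.6 and 2.7 above (bin by browser version, sub-bin by exit relay, then take the weighted average) can be illustrated with the sketch below. The `exit_fingerprint` key and the `exit_probability` mapping from fingerprint to probability are hypothetical names, and normalizing by the summed probability of the relays that appear in each bin is an assumption.

```python
from collections import defaultdict

def weighted_rate_by_field(measurements, exit_probability, field="browser_version"):
    """Return {field value: weighted CAPTCHA percentage} for one consensus."""
    # counts[field value][exit fingerprint] = [total, with_captcha]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for m in measurements:
        c = counts[m[field]][m["exit_fingerprint"]]
        c[0] += 1
        if m["is_captcha_found"] == 1:
            c[1] += 1

    result = {}
    for value, per_relay in counts.items():
        num = den = 0.0
        for fingerprint, (total, captcha) in per_relay.items():
            p = exit_probability.get(fingerprint)
            if p is None:
                continue  # relay is not part of this consensus
            num += p * (100.0 * captcha / total)
            den += p
        result[value] = num / den if den else None
    return result
```

Because the grouping field is a parameter, the same function could be reused for other per-measurement fields.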

Related questions

  • (3.2) What about different versions of the Tor Browser? Does Cloudflare behave differently to different versions of the same browser?

Weighted CAPTCHA rate by Tor Browser security level

Purpose

Understanding the effect of using Tor Browser at different security levels (Standard, Safer, Safest) on the probability of seeing a CAPTCHA

Steps to produce

  1. Get consensuses from CollecTor
  2. Repeat the following for each consensus:
    1. Parse and memorize the valid-after & fresh-until timestamps from the consensus header and bandwidth-weights values from the footer
    2. Repeat the following for each running exit relay entry within the consensus:
      1. Parse the r line and memorize the IPv4 address and identity
      2. Parse the w line and memorize the bandwidth
      3. Parse the s line and memorize the relay flags
    3. Calculate the weighted exit probabilities using the bandwidth-weights from the consensus, bandwidth values, and flags for each exit relay (see an example calculation here)
    4. Use CAPTCHA Monitor API to get measurements that were completed using Tor Browser (method field is equal to tor_browser) and between the valid-after & fresh-until timestamps of the consensus
    5. Join the measurements and relay data using the relay fingerprints. Typically each relay maps to multiple measurements.
    6. Distribute the joined data into 3 bins based on tbb_security_level field's value
    7. Repeat the following for each bin:
      1. Further bin the measurements into sub-bins based on the exit relay used to perform the measurement
      2. Repeat the following for each exit relay in each sub-bin:
        1. Count the total number of measurements in this sub-bin that were completed using this exit relay
        2. Count the total number of measurements in this sub-bin that were completed using this exit relay and have is_captcha_found field set to 1
        3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 2.7.2.2}{Step 2.7.2.1} \times 100
      3. Calculate the weighted average of the percentage values (obtained in Step 2.7.2.3) using exit probabilities (obtained in Step 2.3) as the scaling factor
    8. Plot the weighted percentage values for each tbb_security_level bin in the Y-axis and the valid-after timestamp of the consensus in the X-axis
  3. Merge the graphs created for each consensus
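
The relay-entry parsing in Step 2.2 above, which all of the weighted graphs share, could be sketched as follows. It assumes the full ("ns" flavor) consensus format, where the r line carries nickname, identity, digest, publication date and time, IPv4 address, ORPort, and DirPort in that order, and it interprets "running exit relay" as a relay with both the Exit and Running flags. The output dict keys are hypothetical.

```python
import base64

def parse_running_exits(consensus_text):
    """Return dicts for relay entries that have both the Exit and Running flags."""
    relays, current = [], None
    for line in consensus_text.splitlines():
        if line.startswith("r "):
            parts = line.split()
            identity = parts[2]
            # The identity is unpadded base64; restore padding, then hex-encode
            padded = identity + "=" * (-len(identity) % 4)
            current = {
                "fingerprint": base64.b64decode(padded).hex().upper(),
                "address": parts[6],
                "flags": [],
                "bandwidth": None,
            }
            relays.append(current)
        elif current is not None and line.startswith("s "):
            current["flags"] = line.split()[1:]
        elif current is not None and line.startswith("w "):
            for item in line.split()[1:]:
                key, _, value = item.partition("=")
                if key == "Bandwidth":
                    current["bandwidth"] = int(value)
        elif line.startswith("directory-footer"):
            current = None  # relay entries end where the footer begins
    return [r for r in relays if {"Exit", "Running"} <= set(r["flags"])]
```

Converting the identity to a hex fingerprint makes it directly comparable with the fingerprints used for the join in Step 2.5.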

Related questions

  • (3.3) What about the different security levels of Tor Browser?

Graphs about individual exit relays

Overall CAPTCHA rate

Purpose

Seeing the overall CAPTCHA rate for a specific exit relay

Steps to produce

  1. Determine a date range and granularity to plot. Here, we will plot the last 30 days with a granularity of 1 hour.
  2. Use CAPTCHA Monitor API to get measurements that were completed using the target exit relay and within the chosen date range
  3. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
  4. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
  5. Iterate over the chosen date range with the chosen time intervals. Repeat the following for each iteration:
    1. Count the total number of measurements completed within this interval
    2. Count the total number of measurements completed within this interval that have is_captcha_found field set to 1
    3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 5.2}{Step 5.1} \times 100 (Leave this interval's value empty if there are no corresponding measurements)
  6. Plot the percentage values for each iteration in the Y-axis and the beginning time for each iteration in the X-axis
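
A minimal sketch of the per-interval calculation for a single exit relay is shown below. It assumes the measurements were already restricted to the target relay and that each one is a dict with a hypothetical `timestamp` key holding a `datetime`. The function returns two parallel lists that can be handed to any plotting library.

```python
from datetime import datetime, timedelta

def captcha_rate_series(measurements, start: datetime, end: datetime,
                        step=timedelta(hours=1)):
    """Return (xs, ys): interval start times and CAPTCHA percentages (None if empty)."""
    xs, ys = [], []
    t = start
    while t < end:
        # A simple scan per interval; fine for a sketch, slow for large datasets
        in_interval = [m for m in measurements if t <= m["timestamp"] < t + step]
        total = len(in_interval)
        captcha = sum(1 for m in in_interval if m["is_captcha_found"] == 1)
        xs.append(t)
        ys.append(100.0 * captcha / total if total else None)
        t += step
    return xs, ys
```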

CAPTCHA rate by CDN provider

Purpose

Understanding how different CDN providers such as Cloudflare, Akamai, Amazon CloudFront, etc. treat requests coming from a specific exit relay

Steps to produce

  1. Determine a date range and granularity to plot. Here, we will plot the last 30 days with a granularity of 1 hour.
  2. Use CAPTCHA Monitor API to get measurements that were completed using the target exit relay and within the chosen date range
  3. Use CAPTCHA Monitor API to get the list of URLs that are used in the experiments. This list contains the metadata about the URLs.
  4. Join the measurements and URL list using the URL fields. Typically each URL maps to multiple measurements.
  5. Iterate over the chosen date range with the chosen time intervals. Repeat the following for each iteration:
    1. Distribute the measurements that were completed within the interval of this iteration into bins based on cdn_provider field's value
    2. Repeat the following for each bin:
      1. Count the total number of measurements in this bin
      2. Count the total number of measurements in this bin that have is_captcha_found field set to 1
      3. Calculate the percentage of measurements that received CAPTCHA using \frac{Step 5.2.2}{Step 5.2.1} \times 100 (Leave this bin's value empty if there are no corresponding measurements)
    3. Plot the percentage values for each bin in the Y-axis and the beginning time of this interval in the X-axis
  6. Merge the graphs created for each iteration
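
The join in Step 4 and the per-provider binning in Step 5 can be sketched for one interval as follows. The `url` key on both the measurements and the URL list is an assumed name for "the URL fields"; `cdn_provider` and `is_captcha_found` are taken from the steps above.

```python
from collections import defaultdict

def join_with_url_metadata(measurements, url_list):
    """Attach cdn_provider from the URL list to each measurement with a matching URL."""
    by_url = {u["url"]: u for u in url_list}
    joined = []
    for m in measurements:
        meta = by_url.get(m["url"])
        if meta is not None:
            joined.append({**m, "cdn_provider": meta["cdn_provider"]})
    return joined

def captcha_rate_by_cdn(joined):
    """Return {cdn_provider: CAPTCHA percentage} for the measurements of one interval."""
    counts = defaultdict(lambda: [0, 0])  # provider -> [total, with_captcha]
    for m in joined:
        c = counts[m["cdn_provider"]]
        c[0] += 1
        if m["is_captcha_found"] == 1:
            c[1] += 1
    # Providers without measurements never appear, so their value stays empty
    return {cdn: 100.0 * captcha / total for cdn, (total, captcha) in counts.items()}
```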