
Find a better way to visualize statistics

While working on https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40009 I noticed that our current graphs don't help us understand trends and seasonality in our series.

As a test, I ran the following simple decomposition analysis on bridge clients connecting from Russia between February and the end of March.

import pandas as pd

df = pd.read_csv('userstats-combined.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2457009 entries, 0 to 2457008
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   date       object 
 1   node       object 
 2   country    object 
 3   transport  object 
 4   version    float64
 5   frac       int64  
 6   low        int64  
 7   high       int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 150.0+ MB
threshold = 100 # Rows whose 'high' value is below this are dropped.
df = df[df.high >= threshold]
df = df[df.country != "??"]

date_th = '2022-02-01'

df = df[df.date >= date_th]
df
date node country transport version frac low high
2409766 2022-02-01 bridge ae obfs4 NaN 85 201 217
2409793 2022-02-01 bridge ar obfs4 NaN 85 113 124
2409799 2022-02-01 bridge at obfs4 NaN 85 184 199
2409805 2022-02-01 bridge au obfs4 NaN 85 410 438
2409824 2022-02-01 bridge bd obfs4 NaN 85 105 110
... ... ... ... ... ... ... ... ...
2456964 2022-03-30 bridge us obfs4 NaN 92 6474 6577
2456966 2022-03-30 bridge us snowflake NaN 92 681 682
2456974 2022-03-30 bridge uz obfs4 NaN 92 124 130
2456987 2022-03-30 bridge vn obfs4 NaN 92 195 199
2457000 2022-03-30 bridge za obfs4 NaN 92 162 169

3837 rows × 8 columns
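The `date` column is loaded as `object` (string) dtype, so the comparison above is lexicographic. That works for ISO-formatted dates, but the same filter can also be run on parsed timestamps. Below is a minimal sketch, assuming the column layout shown by `df.info()`; the helper name `filter_stats` is hypothetical, and the small inline frame only stands in for `userstats-combined.csv` (its values are illustrative, not real measurements).

```python
import pandas as pd

def filter_stats(df, threshold=100, start="2022-02-01"):
    """Keep rows with high >= threshold, a known country, and date >= start."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])  # object -> datetime64
    mask = (
        (out["high"] >= threshold)
        & (out["country"] != "??")
        & (out["date"] >= pd.Timestamp(start))
    )
    return out[mask]

# Toy rows standing in for the real CSV (illustrative values only).
sample = pd.DataFrame({
    "date": ["2022-01-31", "2022-02-01", "2022-02-01", "2022-02-02"],
    "country": ["ru", "ru", "??", "ru"],
    "transport": ["obfs4", "obfs4", "obfs4", "meek"],
    "high": [500, 24918, 300, 50],
})
filtered = filter_stats(sample)
print(filtered)
```

With a real datetime column, later steps such as resampling or building a date index work without further conversion.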

ru_ts = df[df['country']=='ru']
# Transports and metrics to plot
transports=['<OR>','obfs4','meek', 'snowflake']
metrics=['frac', 'high']
ru_ts
date node country transport version frac low high
2410472 2022-02-01 bridge ru <OR> NaN 85 1335 1514
2410473 2022-02-01 bridge ru meek NaN 85 2113 2120
2410475 2022-02-01 bridge ru obfs3 NaN 85 336 351
2410476 2022-02-01 bridge ru obfs4 NaN 85 24723 24918
2410478 2022-02-01 bridge ru snowflake NaN 85 2456 2456
... ... ... ... ... ... ... ... ...
2456826 2022-03-30 bridge ru <OR> NaN 92 1700 1881
2456827 2022-03-30 bridge ru meek NaN 92 1668 1675
2456828 2022-03-30 bridge ru obfs3 NaN 92 434 437
2456829 2022-03-30 bridge ru obfs4 NaN 92 32814 32994
2456831 2022-03-30 bridge ru snowflake NaN 92 5164 5165

288 rows × 8 columns

First I plot the statistics per transport, showing the frac and high metrics for each of them.

import matplotlib.pyplot as plt

# Plot each metric of each transport as a time series, with a dashed vertical line at every date
for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        _ = plt.plot(serie.date, serie[m], color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        for xc in serie.date:
            _ = plt.axvline(x=xc, color='black', linestyle='--')
        plt.show()

[Plots output_6_0 to output_6_7: frac and high time series for each of the transports <OR>, obfs4, meek, and snowflake]
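One way to compare transports side by side is to reshape the long table into one column per transport, indexed by date. The sketch below assumes the same column names as `ru_ts`; the helper `to_wide` is a hypothetical name and the toy values are illustrative. Because rows under the threshold were dropped earlier, a transport can have missing days, and `asfreq("D")` makes those gaps visible as NaN.

```python
import pandas as pd

def to_wide(ts, metric="high"):
    """Return a date-indexed frame with one column per transport."""
    wide = ts.assign(date=pd.to_datetime(ts["date"])).pivot(
        index="date", columns="transport", values=metric
    )
    # asfreq("D") inserts an all-NaN row for every missing calendar day
    return wide.asfreq("D")

# Toy rows (illustrative values); 2022-02-03 is missing on purpose.
toy = pd.DataFrame({
    "date": ["2022-02-01", "2022-02-02", "2022-02-04",
             "2022-02-01", "2022-02-04"],
    "transport": ["obfs4", "obfs4", "obfs4", "meek", "meek"],
    "high": [10, 20, 30, 5, 6],
})
wide = to_wide(toy)
print(wide)
```

With matplotlib installed, `wide.plot(subplots=True, figsize=(18, 10))` would then draw one panel per transport on a shared date axis.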

Now I run a seasonal decomposition on the snowflake transport and the high metric. I use a period of 8 days because the authors of this paper https://arxiv.org/pdf/1507.05819.pdf identified a weekly seasonality for Tor users (which is generally the case for internet users).

from statsmodels.tsa.seasonal import seasonal_decompose

serie = pd.DataFrame(ru_ts[ru_ts.transport == 'snowflake']['high'])

decompose_result_mult = seasonal_decompose(serie, model="multiplicative", extrapolate_trend='freq', period=8)

trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid

_ = plt.figure(figsize=(18,10))
_ = plt.title("trend")
_ = trend.plot()
plt.show()

_ = plt.figure(figsize=(18,10))
_ = plt.title("seasonal")
_ = seasonal.plot()
plt.show()

_ = plt.figure(figsize=(18,10))
_ = plt.title("residual")
_ = residual.plot()
plt.show()

[Plots output_7_0 to output_7_2: trend, seasonal, and residual components]
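To make it clearer what a multiplicative decomposition computes, here is a minimal pandas-only sketch of the classical moving-average approach. It is not the statsmodels implementation, and `decompose_multiplicative` is a hypothetical name. It assumes one observation per day with no gaps, and it uses `period=7` as an assumption, since a weekly cycle on daily data spans 7 observations (the analysis above uses `period=8`). The toy series is synthetic.

```python
import numpy as np
import pandas as pd

def decompose_multiplicative(series, period=7):
    """Classical decomposition: series = trend * seasonal * resid."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # Trend: centred moving average over one full cycle (NaN at both ends)
    trend = series.rolling(window=period, center=True).mean()
    detrended = series / trend
    # Seasonal factor: mean detrended value at each position in the cycle
    position = np.arange(len(series)) % period
    factors = detrended.groupby(position).mean()
    factors = factors / factors.mean()  # normalise so factors average to 1
    seasonal = pd.Series(factors.loc[position].to_numpy(), index=series.index)
    resid = series / (trend * seasonal)
    return trend, seasonal, resid

# Synthetic series: constant level times a weekly pattern that averages to 1
pattern = np.array([1.1, 1.05, 1.0, 1.0, 0.95, 0.9, 1.0])
index = pd.date_range("2022-02-01", periods=28, freq="D")
toy = pd.Series(1000 * np.tile(pattern, 4), index=index)
trend, seasonal, resid = decompose_multiplicative(toy, period=7)
```

On this synthetic input the trend is flat, the seasonal factors reproduce the weekly pattern, and the residual stays at 1, which is a quick sanity check before trying the same function on a real series.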

This last bit computes some first differences I was playing with. It needs polishing, but it gives an idea of how things change from one day to the next.

for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        X = serie[m].values
        # Day-over-day change: X[i] - X[i-1] for every consecutive pair
        diff = X[1:] - X[:-1]
        
        _ = plt.plot(diff, color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        plt.show()

[Plots output_8_0 to output_8_7: day-over-day differences of frac and high for each of the transports <OR>, obfs4, meek, and snowflake]
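The same differences can be computed with pandas so that they stay aligned with their dates, and a relative change can be added alongside the absolute one. This is a minimal sketch; `daily_changes` is a hypothetical helper, the toy values are illustrative, and it assumes the input series is sorted by date with one row per day.

```python
import pandas as pd

def daily_changes(series):
    """Return absolute and relative day-over-day changes, indexed like the input."""
    previous = series.shift(1)
    return pd.DataFrame({
        "diff": series - previous,                   # x[t] - x[t-1]
        "rel_change": (series - previous) / previous # change as a fraction of x[t-1]
    }).dropna()

# Toy series (illustrative values only)
toy = pd.Series(
    [100, 110, 99],
    index=pd.date_range("2022-02-01", periods=3, freq="D"),
)
changes = daily_changes(toy)
print(changes)
```

Relative changes make transports of very different sizes, such as obfs4 and meek, easier to compare on one scale.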

Edited by Hiro