Find a better way to visualize statistics

While working on https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40009 I noticed that our current graphs don't help us understand trends and seasonality in our series.

As a test, I ran the following simple decomposition analysis on bridge clients connecting from Russia between the beginning of February and the end of March 2022.

import pandas as pd

df = pd.read_csv('userstats-combined.csv')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2457009 entries, 0 to 2457008
Data columns (total 8 columns):
 #   Column     Dtype  
---  ------     -----  
 0   date       object 
 1   node       object 
 2   country    object 
 3   transport  object 
 4   version    float64
 5   frac       int64  
 6   low        int64  
 7   high       int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 150.0+ MB

threshold = 100 # Anything that occurs less than this will be removed.
df = df[df.high >= threshold]
df = df[df.country != "??"]

date_th = '2022-02-01'

df = df[df.date >= date_th]

df

	date	node	country	transport	version	frac	low	high
2409766	2022-02-01	bridge	ae	obfs4	NaN	85	201	217
2409793	2022-02-01	bridge	ar	obfs4	NaN	85	113	124
2409799	2022-02-01	bridge	at	obfs4	NaN	85	184	199
2409805	2022-02-01	bridge	au	obfs4	NaN	85	410	438
2409824	2022-02-01	bridge	bd	obfs4	NaN	85	105	110
...	...	...	...	...	...	...	...	...
2456964	2022-03-30	bridge	us	obfs4	NaN	92	6474	6577
2456966	2022-03-30	bridge	us	snowflake	NaN	92	681	682
2456974	2022-03-30	bridge	uz	obfs4	NaN	92	124	130
2456987	2022-03-30	bridge	vn	obfs4	NaN	92	195	199
2457000	2022-03-30	bridge	za	obfs4	NaN	92	162	169

3837 rows × 8 columns

ru_ts = df[df['country']=='ru']
# Transports and metrics to plot
transports=['<OR>','obfs4','meek', 'snowflake']
metrics=['frac', 'high']

ru_ts

	date	node	country	transport	version	frac	low	high
2410472	2022-02-01	bridge	ru	<OR>	NaN	85	1335	1514
2410473	2022-02-01	bridge	ru	meek	NaN	85	2113	2120
2410475	2022-02-01	bridge	ru	obfs3	NaN	85	336	351
2410476	2022-02-01	bridge	ru	obfs4	NaN	85	24723	24918
2410478	2022-02-01	bridge	ru	snowflake	NaN	85	2456	2456
...	...	...	...	...	...	...	...	...
2456826	2022-03-30	bridge	ru	<OR>	NaN	92	1700	1881
2456827	2022-03-30	bridge	ru	meek	NaN	92	1668	1675
2456828	2022-03-30	bridge	ru	obfs3	NaN	92	434	437
2456829	2022-03-30	bridge	ru	obfs4	NaN	92	32814	32994
2456831	2022-03-30	bridge	ru	snowflake	NaN	92	5164	5165

288 rows × 8 columns

First I plot the statistics per transport, one figure for each combination of transport and metric (frac and high).

import matplotlib.pyplot as plt

# Plot one time series per transport and metric, with a dashed vertical line at each date
for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        _ = plt.plot(serie.date, serie[m], color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        for xc in serie.date:
            _ = plt.axvline(x=xc, color='black', linestyle='--')
        plt.show()
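
The loop above produces one figure per transport and metric, which makes it hard to compare transports. One possible alternative is to reshape the long-format data into one column per transport, so that all transports share a date axis. The following is a minimal sketch on a tiny sample frame; `ru_sample` and `wide` are names introduced here for illustration, and the sample values are not real measurements.

```python
import pandas as pd

# Tiny stand-in for ru_ts; the values are illustrative only.
ru_sample = pd.DataFrame({
    "date": ["2022-02-01", "2022-02-01", "2022-02-02", "2022-02-02"],
    "transport": ["obfs4", "snowflake", "obfs4", "snowflake"],
    "high": [24918, 2456, 25100, 2600],
})

# Parse the date strings so that plots get a real date axis.
ru_sample["date"] = pd.to_datetime(ru_sample["date"])

# One row per day, one column per transport.
wide = ru_sample.pivot(index="date", columns="transport", values="high")
print(wide)
```

With the data in this shape, `wide.plot(subplots=True)` would draw every transport in a stacked panel on a shared date axis.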

Now I run a seasonal decomposition on the snowflake transport and the high metric. I use a period of 8 days, since the paper at https://arxiv.org/pdf/1507.05819.pdf identified a weekly seasonality for Tor users (which is generally the case for internet users).

from statsmodels.tsa.seasonal import seasonal_decompose

serie = pd.DataFrame(ru_ts[ru_ts.transport == 'snowflake']['high'])

decompose_result_mult = seasonal_decompose(serie, model="multiplicative", extrapolate_trend='freq', period=8)

trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid

_ = plt.figure(figsize=(18,10))
_ = plt.title("trend")
_ = trend.plot()
plt.show()

_ = plt.figure(figsize=(18,10))
_ = plt.title("seasonal")
_ = seasonal.plot()
plt.show()

_ = plt.figure(figsize=(18,10))
_ = plt.title("residual")
_ = residual.plot()
plt.show()
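
To make it clearer what a multiplicative decomposition computes, here is a hand-rolled sketch of the classical method using only pandas and numpy. It is not the statsmodels implementation, and `multiplicative_decompose` is a hypothetical helper name. The sketch assumes daily data and a period of 7, the usual choice for a weekly cycle, which differs from the period of 8 used above. The input is synthetic, not taken from userstats-combined.csv.

```python
import numpy as np
import pandas as pd

def multiplicative_decompose(series, period=7):
    """Hand-rolled classical multiplicative decomposition (odd periods only)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    # Trend: centered moving average spanning one full period.
    trend = series.rolling(window=period, center=True).mean()
    # Dividing by the trend leaves seasonal * residual.
    detrended = series / trend
    # Seasonal factor: mean detrended value at each position in the cycle.
    position = np.arange(len(series)) % period
    factors = detrended.groupby(position).mean()
    factors = factors / factors.mean()  # normalise so the factors average to 1
    seasonal = pd.Series(factors.loc[position].to_numpy(), index=series.index)
    # Residual: whatever trend and seasonality do not explain.
    residual = series / (trend * seasonal)
    return trend, seasonal, residual

# Synthetic daily series: a rising level multiplied by a weekly pattern.
dates = pd.date_range("2022-02-01", periods=56, freq="D")
weekly = np.array([1.0, 1.05, 1.1, 1.05, 1.0, 0.9, 0.9])
level = np.linspace(2000, 5000, len(dates))
synthetic = pd.Series(level * np.tile(weekly, 8), index=dates)

trend, seasonal, residual = multiplicative_decompose(synthetic, period=7)
```

Because the centered moving average is undefined for the first and last three days, `trend` and `residual` are NaN there. Passing `extrapolate_trend='freq'` to `seasonal_decompose`, as in the code above, fills those edges by extrapolation instead.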

Finally, I experimented with first differences. This part still needs polishing, but it gives an idea of how each metric changes from one day to the next.

for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        # Day-over-day change: value on day i minus value on day i - 1
        diff = serie[m].diff().dropna().to_numpy()
        
        _ = plt.plot(diff, color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        plt.show()
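
As a possible refinement of the differencing above, pandas can compute both the absolute and the relative day-over-day change while keeping the original index, so the result can be plotted against dates. The sketch below uses a three-day toy series; `users`, `abs_change` and `rel_change` are illustrative names and the numbers are made up.

```python
import pandas as pd

# Toy daily series; the values are illustrative only.
users = pd.Series(
    [10.0, 12.0, 9.0],
    index=pd.date_range("2022-02-01", periods=3, freq="D"),
)

# Absolute change: value on day i minus value on day i - 1.
abs_change = users.diff()

# Relative change: the same difference divided by the previous day's value.
rel_change = users.pct_change()

print(pd.DataFrame({"abs_change": abs_change, "rel_change": rel_change}))
```

The first entry of each result is NaN because the first day has no predecessor. The relative change is often easier to compare across transports whose user counts differ by orders of magnitude, such as obfs4 and obfs3 here.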

Edited Apr 18, 2022 by Hiro