Find a better way to visualize statistics
While working on https://gitlab.torproject.org/tpo/network-health/metrics/website/-/issues/40009
I have noticed that our current graphs don't help us understand trends and seasonality in our series.
To test this, I ran the following simple decomposition analysis on bridge clients connecting from Russia between February and the end of March.
import pandas as pd
df = pd.read_csv('userstats-combined.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2457009 entries, 0 to 2457008
Data columns (total 8 columns):
# Column Dtype
--- ------ -----
0 date object
1 node object
2 country object
3 transport object
4 version float64
5 frac int64
6 low int64
7 high int64
dtypes: float64(1), int64(3), object(4)
memory usage: 150.0+ MB
threshold = 100  # Rows whose high value is below this are removed.
df = df[df.high >= threshold]
df = df[df.country != "??"]
date_th = '2022-02-01'
df = df[df.date >= date_th]
df
(index) | date | node | country | transport | version | frac | low | high |
---|---|---|---|---|---|---|---|---|
2409766 | 2022-02-01 | bridge | ae | obfs4 | NaN | 85 | 201 | 217 |
2409793 | 2022-02-01 | bridge | ar | obfs4 | NaN | 85 | 113 | 124 |
2409799 | 2022-02-01 | bridge | at | obfs4 | NaN | 85 | 184 | 199 |
2409805 | 2022-02-01 | bridge | au | obfs4 | NaN | 85 | 410 | 438 |
2409824 | 2022-02-01 | bridge | bd | obfs4 | NaN | 85 | 105 | 110 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2456964 | 2022-03-30 | bridge | us | obfs4 | NaN | 92 | 6474 | 6577 |
2456966 | 2022-03-30 | bridge | us | snowflake | NaN | 92 | 681 | 682 |
2456974 | 2022-03-30 | bridge | uz | obfs4 | NaN | 92 | 124 | 130 |
2456987 | 2022-03-30 | bridge | vn | obfs4 | NaN | 92 | 195 | 199 |
2457000 | 2022-03-30 | bridge | za | obfs4 | NaN | 92 | 162 | 169 |
3837 rows × 8 columns
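The date filter above compares strings, which works because the dates are in ISO format (YYYY-MM-DD). A minimal sketch of the same filtering with parsed dates follows; the rows are made up for illustration and are not taken from userstats-combined.csv, and the names `sample`, `mask` and `filtered` are hypothetical.

```python
import pandas as pd

# Illustrative rows only; the real data comes from userstats-combined.csv.
sample = pd.DataFrame({
    "date": ["2022-01-31", "2022-02-01", "2022-02-01", "2022-02-02"],
    "country": ["ru", "ru", "??", "de"],
    "transport": ["obfs4", "obfs4", "obfs4", "meek"],
    "high": [500, 24918, 300, 50],
})

# Parse the date strings so comparisons and plots use real timestamps.
sample["date"] = pd.to_datetime(sample["date"])

# Same three conditions as above, combined into a single boolean mask.
mask = (
    (sample["high"] >= 100)
    & (sample["country"] != "??")
    & (sample["date"] >= pd.Timestamp("2022-02-01"))
)
filtered = sample[mask]
```

Parsed dates also let matplotlib space the x-axis by actual time rather than treating each date string as a category.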
ru_ts = df[df['country']=='ru']
# Transports and metrics to plot
transports=['<OR>','obfs4','meek', 'snowflake']
metrics=['frac', 'high']
ru_ts
(index) | date | node | country | transport | version | frac | low | high |
---|---|---|---|---|---|---|---|---|
2410472 | 2022-02-01 | bridge | ru | <OR> | NaN | 85 | 1335 | 1514 |
2410473 | 2022-02-01 | bridge | ru | meek | NaN | 85 | 2113 | 2120 |
2410475 | 2022-02-01 | bridge | ru | obfs3 | NaN | 85 | 336 | 351 |
2410476 | 2022-02-01 | bridge | ru | obfs4 | NaN | 85 | 24723 | 24918 |
2410478 | 2022-02-01 | bridge | ru | snowflake | NaN | 85 | 2456 | 2456 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2456826 | 2022-03-30 | bridge | ru | <OR> | NaN | 92 | 1700 | 1881 |
2456827 | 2022-03-30 | bridge | ru | meek | NaN | 92 | 1668 | 1675 |
2456828 | 2022-03-30 | bridge | ru | obfs3 | NaN | 92 | 434 | 437 |
2456829 | 2022-03-30 | bridge | ru | obfs4 | NaN | 92 | 32814 | 32994 |
2456831 | 2022-03-30 | bridge | ru | snowflake | NaN | 92 | 5164 | 5165 |
288 rows × 8 columns
First I plot the statistics per transport, showing the frac and high metrics for each of them.
import matplotlib.pyplot as plt
# Plot each metric's time series per transport, with a dashed vertical line at every date
for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        _ = plt.plot(serie.date, serie[m], color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        for xc in serie.date:
            _ = plt.axvline(x=xc, color='black', linestyle='--')
        plt.show()
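Since the loop filters the frame once per transport, a wide table with one column per transport may be easier to plot and compare. Here is a sketch, assuming a long-format frame shaped like ru_ts; the values and the names `long_df` and `wide` are made up for illustration.

```python
import pandas as pd

# Illustrative long-format rows shaped like ru_ts (values are made up).
long_df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-02-01", "2022-02-01", "2022-02-02", "2022-02-02"]
    ),
    "transport": ["obfs4", "snowflake", "obfs4", "snowflake"],
    "high": [24918, 2456, 25100, 2600],
})

# One row per date, one column per transport.
wide = long_df.pivot(index="date", columns="transport", values="high")
```

With this layout, `wide.plot(subplots=True, figsize=(18, 6))` would draw one panel per transport on a shared date axis.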
Now I run a seasonal decomposition on the snowflake transport and the high metric. I use a period of 7 days since this paper, https://arxiv.org/pdf/1507.05819.pdf, identifies a weekly seasonality for Tor users (which is generally the case for internet users).
from statsmodels.tsa.seasonal import seasonal_decompose
serie = pd.DataFrame(ru_ts[ru_ts.transport == 'snowflake']['high'])
decompose_result_mult = seasonal_decompose(serie, model="multiplicative", extrapolate_trend='freq', period=7)
trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid
_ = plt.figure(figsize=(18,10))
_ = plt.title("trend")
_ = trend.plot()
plt.show()
_ = plt.figure(figsize=(18,10))
_ = plt.title("seasonal")
_ = seasonal.plot()
plt.show()
_ = plt.figure(figsize=(18,10))
_ = plt.title("residual")
_ = residual.plot()
plt.show()
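To make explicit what the multiplicative model assumes (series = trend × seasonal × residual), here is a simplified sketch in plain pandas on synthetic data. It is not statsmodels' exact algorithm: the trend is a centered 7-day rolling mean and the seasonal factor is the mean detrended value per position in the week. The synthetic series and all variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

period = 7  # weekly cycle on daily data

# Synthetic daily series: linear trend times a repeating weekly pattern.
days = np.arange(56)
weekly_pattern = np.array([1.0, 1.1, 1.2, 1.1, 1.0, 0.8, 0.8])
series = pd.Series((1000 + 10 * days) * weekly_pattern[days % period])

# Trend: centered moving average over one full period (NaN at the edges).
trend = series.rolling(window=period, center=True).mean()

# Detrend by division (multiplicative model), then average per weekday slot.
detrended = series / trend
seasonal_factors = detrended.groupby(days % period).mean()

# Normalize so the factors average to 1, then tile over the whole series.
seasonal_factors = seasonal_factors / seasonal_factors.mean()
seasonal = pd.Series(seasonal_factors.values[days % period], index=series.index)

# Residual: what is left once trend and seasonality are divided out.
residual = series / (trend * seasonal)
```

On this synthetic input the recovered factors should stay close to `weekly_pattern` and the residual close to 1; on real data, large residuals would point to days that neither the trend nor the weekly cycle explains.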
This last bit computes some day-to-day differences I was playing with. It should be polished, but it gives an idea of how things change between one day and the next.
for t in transports:
    serie = ru_ts[ru_ts.transport == t]
    for m in metrics:
        _ = plt.figure(figsize=(18,3))
        X = serie[m].values
        # Difference between each day and the previous one
        diff = list()
        for i in range(1, len(X)):
            value = X[i] - X[i - 1]
            diff.append(value)
        _ = plt.plot(diff, color='blue')
        _ = plt.title("{} - {}".format(t,m))
        _ = plt.gcf().autofmt_xdate()
        plt.show()
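The manual loop can also be written with pandas' built-in `Series.diff`, and `Series.pct_change` gives the relative change, which may be easier to compare across transports of very different sizes. This is a sketch on made-up values; the names `high`, `abs_change` and `rel_change` are illustrative.

```python
import pandas as pd

# Illustrative daily values (made up), indexed by date.
high = pd.Series(
    [2000, 2200, 2100, 2310],
    index=pd.to_datetime(
        ["2022-02-01", "2022-02-02", "2022-02-03", "2022-02-04"]
    ),
)

# Absolute change from the previous day; the first entry has no predecessor.
abs_change = high.diff().dropna()

# Relative change from the previous day, as a fraction (0.10 means +10%).
rel_change = high.pct_change().dropna()
```

Keeping the date index means `abs_change.plot()` puts the dates on the x-axis, which the list-based version loses.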