Analyze Tor usage data for ways to automatically detect country-wide blockings
Every now and then, there are country-wide blockings of Tor. In most cases we learn about these events from users telling us that Tor has stopped working for them. This may work okay, but given that we already have usage data per country, we should be able to detect blockings ourselves, preferably automatically and with as few false positives as possible.
I already spent some time on a censorship detector that takes our usage data as input and tells us whenever the usage on a given day falls outside an expected interval. But I'm afraid I don't know enough math to push this further, at least not without reading more about time series analysis. Maybe someone wants to pick this up?
Here's where I am:
We take our estimated daily user numbers as input. Our goal is to raise a warning whenever the estimated user number from a given country drops below a predicted value. This predicted value is not static; it should depend on previous values, which is why we should use time series analysis. We want to model the user numbers for days 1..n-1, predict a value for day n, and warn if the actual value for day n is lower than the predicted value minus some error.
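A rough sketch of that predict-and-compare loop, in Python rather than R to keep it dependency-free. Note that this is not the ARIMA approach: as a stand-in forecast it uses the mean of the previous `window` days, and as the allowed error it uses `k` times their sample standard deviation. The function name, `window`, `k`, and the sample numbers are all made up for illustration.

```python
from statistics import mean, stdev

def detect_drops(users, window=7, k=3.0):
    """Return the indices n where users[n] is below the predicted lower bound.

    Assumption: the forecast for day n is simply the mean of the previous
    `window` days, and the allowed error is k sample standard deviations.
    A real time series model (e.g. ARIMA) would replace both.
    """
    warnings = []
    for n in range(window, len(users)):
        history = users[n - window:n]   # only days before n are used
        predicted = mean(history)
        error = k * stdev(history)
        if users[n] < predicted - error:
            warnings.append(n)
    return warnings

# Made-up daily user numbers with a sharp drop on day index 8.
series = [1000, 1020, 990, 1010, 1005, 995, 1015, 1008, 400, 1002]
print(detect_drops(series))
```

One thing this sketch makes visible: once a blocked day enters the history, it inflates the error term for the following days, so a model that is more robust to outliers would probably be needed for blockings that last longer than a day.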
I read some stuff about time series analysis and came across the ARIMA model. Thankfully, the ARIMA model is already implemented in R.
I'm going to upload some R code to the metrics-tasks repository once I have a ticket number (see comment below). The R code generates a PDF that shows on which days we'd receive a warning. I'm also going to attach the PDF to this ticket. Here's how you can run the R code yourself:
$ wget https://metrics.torproject.org/csv/direct-users.csv
$ R --slave -f detect-censorship.R
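For anyone who wants to look at the downloaded data outside of R, here is a minimal Python sketch that pulls one country's series out of the CSV. It assumes a wide layout with a `date` column plus one column per country code, and that missing values show up as empty cells or `NA`; please check this against the actual file, since the layout is an assumption here. The helper name and the sample rows are made up.

```python
import csv
import io

def read_country_series(csv_file, country):
    """Return (dates, values) for one country column.

    Assumption: the file has a 'date' column and one column per country
    code. Empty, missing, or 'NA' cells become None.
    """
    dates, values = [], []
    for row in csv.DictReader(csv_file):
        cell = row.get(country) or ""
        dates.append(row["date"])
        values.append(float(cell) if cell not in ("", "NA") else None)
    return dates, values

# Made-up sample in the assumed layout; for the real file, pass
# open("direct-users.csv", newline="") instead.
sample = io.StringIO("date,de,us\n2011-01-01,1200,3400\n2011-01-02,NA,3500\n")
dates, values = read_country_series(sample, "de")
print(dates, values)
```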
Possible next steps are a) finding good parameters for the ARIMA model, b) trying other time series models, and c) extending the approach to bridge users. Once we have a useful approach for estimated daily user numbers, we should d) try to get rid of day-based statistics, which have a delay of 1--2 days, and make the approach work for directory request stats and connecting bridge user stats to get results more quickly. The final step is to e) integrate the R code with the metrics website and execute it every few hours.