Show an aggressive user count estimate alongside our conservative user count estimate
On our various user graphs on the metrics site, we show a user count that assumes many users are online all day. In countries where many Tor users go online briefly to use Tor and then disappear again (e.g. from modems, internet cafes, etc), our approach means that our user count for those countries is an underestimate, possibly by as much as an order of magnitude.
We picked that approach originally because we wanted to publish a clearly defensible number, but also because at the time most of our users were on good internet connections (so it wasn't so clearly wrong at first).
I find myself explaining this potential inaccuracy every time I'm showing the graphs to funders. And in the anti-censorship space, I'm often talking to them about exactly the countries where people don't typically leave their Tors running 24/7 on good internet connections.
So my proposal here is to have a "high water mark" line on the user count graphs, to go with our current "low water mark" line. The reality is that the true user count lies somewhere between these two lines, and we don't know where.
Now, there are several steps remaining:
- How do we calculate our current numbers? Right now we take the total number of users that we would have if every consensus fetch were made by a different user, and divide it by 10, on the somewhat arbitrary assumption that the average client makes 10 requests per day, about 2/3 of what a client running 24/7 would make. See https://gitweb.torproject.org/metrics-web.git/tree/src/main/resources/doc/users-q-and-a.txt : "We put in the assumption that the average client makes 10 such requests per day. A tor client that is connected 24/7 makes about 15 requests per day, but not all clients are connected 24/7, so we picked the number 10 for the average client. We simply divide directory requests by 10 and consider the result as the number of users. Another way of looking at it, is that we assume that each request represents a client that stays online for one tenth of a day, so 2 hours and 24 minutes."
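- How should we calculate these new lines? [The simple version] We take the high water mark as the raw directory request count, i.e. the number before we divide it by 10 (as if every request came from a different user), and take the low water mark as that same count divided by 15 (as if every client were online 24/7).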
- How should we calculate these new lines? [The complex version] We might be able to improve the accuracy of the high water mark by, for example, looking at the number of IP addresses: if there are only a few IP addresses, then there are probably fewer users. But I would be wary of trying to get too smart at this stage. What we really want is a new project to reassess whether we're counting accurately and whether there's something fundamentally smarter we can do, and that's a different ticket.
- How do we communicate these new lines on our graphs? Two lines of different colors? A shaded area in between them? This is in part a @UX question (cc @duncan, @nah), and it gets even more complex when we consider the graphs that show more than one curve at once (e.g. https://metrics.torproject.org/userstats-bridge-transport.html).
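To make the simple version concrete, here is a rough sketch of the arithmetic in Python. The function name, the constant names, and the example request count are all made up for illustration and are not taken from metrics-web; only the divisors 10 and 15 come from the Q&A text quoted above.

```python
# Hypothetical sketch; none of these names exist in metrics-web.

# From the users Q&A: a client connected 24/7 makes about 15 requests per day.
REQUESTS_PER_DAY_ALWAYS_ON = 15
# From the users Q&A: the assumed number of requests per day for the average client.
REQUESTS_PER_DAY_AVERAGE = 10


def user_count_estimates(directory_requests: float) -> dict[str, float]:
    """Return the three candidate user counts for one day's directory requests."""
    return {
        # Assumes every request was made by a different user.
        "high_water_mark": directory_requests,
        # The number we publish today.
        "current": directory_requests / REQUESTS_PER_DAY_AVERAGE,
        # Assumes every client was online all day.
        "low_water_mark": directory_requests / REQUESTS_PER_DAY_ALWAYS_ON,
    }


# Made-up input: 300,000 directory requests in one day for one country.
estimates = user_count_estimates(300_000)
for name, value in estimates.items():
    print(f"{name}: {value:.0f}")
```

With this made-up input the high water mark is 300000, the currently published number is 30000, and the low water mark is 20000, so the band between the two new lines spans a factor of 15.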