
Closed (moved)
Opened May 07, 2018 by Karsten Loesing (@karsten)

Streamline sample quantile types used in the various modules

While documenting how to reproduce our various statistics, I noticed that we're using different methods/formulas for computing sample quantiles, that is, the median, quartiles, percentiles, and so on. Ideally, we would settle on one method and use that everywhere. The benefit is easier documentation and reproducibility.

Here is a (probably still incomplete) list of graphs for which we calculate quantiles (with the tool given in parentheses):

  • Relay users: Median and inter-quartile range of ratios in censorship detector (Python, possibly Java soon)
  • Advertised bandwidth distribution: Percentiles, including the unusual 0-th percentile (Java) and median (R)
  • Advertised bandwidth of n-th fastest relays: Median (R)
  • Fraction of connections used uni-/bidirectionally: Quartiles (Java)
  • Time to download files over Tor: Quartiles (PostgreSQL)
  • Unique .onion addresses (version 2 only): Quartiles for weighted inter-quartile mean (Java)
  • Onion-service traffic (versions 2 and 3): Quartiles for weighted inter-quartile mean (Java)

There are surprisingly many ways to compute quantiles. I found the following links to be quite helpful:

  • https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample
  • https://www.rdocumentation.org/packages/stats/versions/3.5.0/topics/quantile

Looking at the lists, we should probably pick two types: one discontinuous (R-1 to R-3) and one continuous type (R-4 to R-9). And ideally, we'd pick types that are either the defaults in the tools we're using or that we can easily select to use in those tools.
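To make the discontinuous/continuous distinction concrete, here is a minimal Python sketch of one type from each group: R-1 (inverse of the empirical distribution function) and R-7 (linear interpolation between order statistics). The function names `quantile_r1` and `quantile_r7` are made up for this illustration and are not taken from any of our code bases; the formulas follow the definitions in the two links above.

```python
import math


def quantile_r1(sample, p):
    """R-1 (discontinuous): smallest order statistic x_(k) with k/n >= p."""
    if not sample:
        raise ValueError("sample must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(sample)
    n = len(xs)
    # k is a 1-based rank; p = 0 maps to the minimum.
    k = max(1, math.ceil(n * p))
    return xs[k - 1]


def quantile_r7(sample, p):
    """R-7 (continuous): linear interpolation at 0-based position (n - 1) * p."""
    if not sample:
        raise ValueError("sample must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(sample)
    n = len(xs)
    h = (n - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])


sample = [10, 20, 30, 40]
for p in (0.25, 0.5, 0.75):
    print(p, quantile_r1(sample, p), quantile_r7(sample, p))
```

On this four-element sample the two types disagree at all three quartiles: R-1 always returns a value from the sample, while R-7 interpolates between neighbors. That gap is why mixing types across modules makes results harder to reproduce.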

Going through our tools:

  • PostgreSQL has two functions, PERCENTILE_CONT and PERCENTILE_DISC, of which we already use the first. I ran some experiments with a fairly large sample set and found that PERCENTILE_CONT produces exactly the same output as R-7, and that PERCENTILE_DISC must be either R-1 or R-2. A math person might be able to say whether it's R-1 or R-2 by looking at the PostgreSQL source code, and maybe that person could confirm the R-7 part, too. It seems we can't choose types other than these two in PostgreSQL, though, or at least not easily.
  • R has support for all nine types. After all, they're named after this language. It seems like R-7 is the default type.
  • Java with Apache Commons Math has support for all nine types, R-1 to R-9. And in theory, the two types we need shouldn't be terribly hard to re-implement, in case we want to avoid putting in this not-exactly-tiny library as dependency.
  • Python with SciPy/NumPy probably has support for some types, but I guess we're not planning to keep our Python code anyway, so this doesn't really matter.
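On the open R-1 versus R-2 question: going by the definitions linked above, the two types only differ when n * p is an integer, in which case R-2 averages the two neighboring order statistics and R-1 returns the lower one. So a small probe sample with an even number of elements and p = 0.5 should tell them apart. The following Python sketch illustrates the idea; the function names are hypothetical, and the result would still need to be compared against what PERCENTILE_DISC actually returns for the same input.

```python
import math


def quantile_r1(sample, p):
    """R-1: smallest order statistic x_(k) with k/n >= p."""
    xs = sorted(sample)
    k = max(1, math.ceil(len(xs) * p))
    return xs[k - 1]


def quantile_r2(sample, p):
    """R-2: like R-1, but averages x_(k) and x_(k+1) when n * p is an integer."""
    xs = sorted(sample)
    n = len(xs)
    np_ = n * p
    k = math.ceil(np_)
    # Exact comparison is fine for probe values like p = 0.5; general-purpose
    # implementations typically add a small tolerance here.
    if 0 < k < n and np_ == k:
        return (xs[k - 1] + xs[k]) / 2
    return xs[max(1, k) - 1]


even_probe = [1, 2, 3, 4]    # n * p = 2 is an integer: the types differ
odd_probe = [1, 2, 3, 4, 5]  # n * p = 2.5 is not: the types agree

print(quantile_r1(even_probe, 0.5), quantile_r2(even_probe, 0.5))
print(quantile_r1(odd_probe, 0.5), quantile_r2(odd_probe, 0.5))
```

If a tool returns 2 for the median of the even probe, that is consistent with R-1; a result of 2.5 would point to R-2. A function that can only ever return values present in the input cannot be R-2 under these definitions.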

Whee, long ticket. Thoughts?

Reference: legacy/trac#26035