Gaba · 67542b0f
--- a/Meetings/2018/2018MexicoCity/Notes/PrivCountTechnical.md
+++ b/Meetings/2018/2018MexicoCity/Notes/PrivCountTechnical.md
+**Session: PrivCount Technical Issues**
+
+10/03/2018
+
+12:00 -- 13:00
+
+Session lead: teor
+
+Scope: Solve remaining issues to securely implement PrivCount.
+
+Agenda:
+- Action bounds - Consistent derivation
+- Adding new statistics - how to estimate value and cost
+- Binning outputs for privacy
+- Debugging protocol implementation
+- Floating point and differential privacy issues
+- Malicious inputs - prevent them from modifying output too much
+- Metrics statistical analysis
+- Noise
+    - How much excess noise to handle malicious inputs
+    - How to divide noise among relays
+    - Issue with low per-relay noise and possible zero truncation
+    - Function to sample Gaussian noise
+- Versioning
+    - Ensure noise is correct despite multiple versions
+
+
+Initial questions:
+- Question: What is the vision for running PrivCount? Answer: Goal is every relay submits data via PrivCount.
+- Question: Do we need a minimum number of relays running PrivCount before we start outputting statistics? Answer: Not necessarily for the noise, but perhaps just to provide some level of aggregation.
+
+Discussion:
+- Action bounds
+    - How do we do consistent derivation?
+    - We should run a client and see what activity gets generated. Which user applications to test isn't clear.
+    - We could perhaps use measurements of popular client applications to determine what user activity to simulate.
+- Adding new statistics
+    - How do we estimate the value (and cost) of a new statistic?
+    - We can ramp up.
+- Binning outputs for privacy
+    - Perhaps bin size could be the action bound.
+    - Do the rounding after the aggregation to handle floating point issues. Reference: "On significance of the least significant bits for differential privacy", Mironov, https://dl.acm.org/citation.cfm?id=2382264.
+    - Some small multiple (say, 10) of the action bound seems like a reasonable bin size
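
A minimal sketch of the binning idea above, assuming the noisy aggregate has already been summed and decoded to a plain integer. The function name `bin_output` and the default multiplier of 10 are illustrative only and are not part of PrivCount. Rounding happens once, after aggregation, and uses integer arithmetic so no floating-point low bits can leak.

```python
def bin_output(aggregate: int, action_bound: int, multiple: int = 10) -> int:
    """Round a noisy aggregate to the nearest multiple of the bin size.

    The bin size is `multiple * action_bound` (hypothetical parameter names).
    """
    if action_bound <= 0 or multiple <= 0:
        raise ValueError("action_bound and multiple must be positive")
    bin_size = multiple * action_bound
    # Integer-only round-half-up. Floor division keeps this correct for
    # negative aggregates, which can occur once noise has been added.
    return ((aggregate + bin_size // 2) // bin_size) * bin_size
```

For example, with an action bound of 10 the bin size is 100, so an aggregate of 1234 is published as 1200.
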
+- Debugging protocol implementation
+    - Run protocol on a test network
+    - Debugging mode in which each relay individually adds sufficient noise to its statistic, so the results can be compared side by side with the current method
+    - Send a cell on a circuit that says "measure me" with an ID that has the experiment number, or just pin a node that isn't in the consensus
+- Issue with low per-relay noise and possible zero truncation
+    - Use fixed-point arithmetic, where the decimal point position is chosen based on:
+      - the expected value of the statistic: avoid overflowing the ≈ 60 available bits of the Shamir secret (our field has 2^62 - 2^30 - 1 elements, which can hold 61-bit two's complement integers. We use the remaining 2^61 - 2^30 - 1 values to detect overflow with probability ≈ 0.5)
+      - the noise allocated to the statistic: allow enough (how much? 8 extra bits?) precision in the smallest possible noise added by a relay (a relay with consensus weight 1 will add sqrt(1/24 hours * 1/50 million total consensus weight) * noise allocation ≈ 1 / 2^15 * noise allocation)
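
The arithmetic in the fixed-point discussion above can be checked with a short sketch. The field size, the 1/24 factor and the 50 million total consensus weight come from the notes; the 8 extra bits of precision is the open question from the notes, used here only as an assumed default, and all function and constant names are hypothetical.

```python
import math

FIELD_SIZE = 2**62 - 2**30 - 1   # number of field elements, from the notes
SIGNED_BITS = 61                 # counters are 61-bit two's complement integers

# 61-bit signed integers cover [-2^60, 2^60 - 1], i.e. 2^61 distinct values.
valid_values = 2**SIGNED_BITS
# Field elements left over; landing on one of them signals an overflow.
spare_values = FIELD_SIZE - valid_values
overflow_detect_prob = spare_values / FIELD_SIZE


def smallest_noise_fraction(hours: int = 24,
                            total_weight: int = 50_000_000) -> float:
    """Standard deviation added by a weight-1 relay, as a fraction of the
    noise allocation: sqrt(1/hours * 1/total_weight)."""
    return math.sqrt((1 / hours) * (1 / total_weight))


def fractional_bits(extra_bits: int = 8) -> int:
    """Bits needed after the point so the smallest per-relay noise still
    keeps `extra_bits` of precision (extra_bits=8 is an assumption)."""
    return math.ceil(-math.log2(smallest_noise_fraction())) + extra_bits
```

Under these assumptions the smallest noise is roughly 1/2^15 of the allocation, so about 24 of the roughly 60 usable bits would sit below the point, leaving the rest for the magnitude of the statistic.
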
+- Function to sample Gaussian noise
+    - Need to re-implement sampling from a Gaussian distribution
+    - Use a discrete Gaussian distribution; lattice-based cryptography implementations should have this, e.g. implementations of BV (Brakerski and Vaikuntanathan)
+      - one alternative: [Sampling exactly from the normal distribution](https://arxiv.org/abs/1303.6257)
+        - [Project page](http://exrandom.sourceforge.net/), [tarballs](https://sourceforge.net/projects/exrandom/files/distrib/), also implemented in MPFR as `mpfr_nrandom`
+        - Algorithm D outputs Gaussian-distributed integers with standard deviation σ, using approximately (1/0.715) log₂ σ uniformly random bits per sample
+        - Since our Gaussians will be scaled along with our counters, we will be sampling Gaussians with σ between 2^15 and 2^60, using approximately 21-84 bits per sample
+        - This method sets the low bits of the Gaussian directly from the uniformly random input bits, much like [Appendix C of the PrivCount Shamir spec](https://gitweb.torproject.org/torspec.git/tree/proposals/288-privcount-with-shamir.txt#n452)
+      - another alternative: [Lattice Signatures and Bimodal Gaussians](https://eprint.iacr.org/2013/383.pdf)
+        - [Original source code](http://bliss.di.ens.fr/), [production-grade C implementation](https://wiki.strongswan.org/projects/strongswan/wiki/BLISS)
+        - Algorithm 12 in Section 6 outputs Gaussian-distributed integers with standard deviation σ for σ = kσ₂ (k integer, σ₂ = sqrt(1/(2 ln 2)) ≈ 0.849), using approximately 4 + 2 log₂ σ uniformly random bits per sample
+        - Since our Gaussians will be scaled along with our counters, we will be sampling Gaussians with σ between 2^15 and 2^60, using approximately 34-124 bits per sample
+        - We will round σ up to the nearest kσ₂. These large σ make the difference between kσ₂ and σ negligible
+      - background: [Sampling from Discrete Gaussians for Lattice-Based Cryptography on a Constrained Device](https://www.math.auckland.ac.nz/~sgal018/gen-gaussians.pdf)
+    - [HElib (Homomorphic Encryption in C++)](https://github.com/shaih/HElib) ~~should also have this~~ [uses the Box-Muller transform in floating-point and rounds to the nearest integer](https://github.com/shaih/HElib/blob/a49dd5210037c187ca4ea05ec8f532697262db20/src/NumbTh.cpp#L694)
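
The random-bit estimates quoted for the two samplers above can be reproduced with a few lines. This sketch only evaluates the cost formulas from the notes; it does not implement either sampling algorithm, and the function names are made up for illustration.

```python
import math
from typing import Tuple

# Base standard deviation of the binary Gaussian in the BLISS sampler.
SIGMA_2 = math.sqrt(1 / (2 * math.log(2)))   # about 0.849


def bits_algorithm_d(log2_sigma: float) -> float:
    """Approximate uniform bits per sample for Algorithm D (exrandom)."""
    return log2_sigma / 0.715


def bits_bliss(log2_sigma: float) -> float:
    """Approximate uniform bits per sample for BLISS Algorithm 12."""
    return 4 + 2 * log2_sigma


def round_up_sigma(sigma: float) -> Tuple[int, float]:
    """Round sigma up to the nearest integer multiple k of SIGMA_2.

    Returns (k, k * SIGMA_2).
    """
    k = math.ceil(sigma / SIGMA_2)
    return k, k * SIGMA_2
```

For σ = 2^15 and σ = 2^60 these give roughly 21 and 84 bits for Algorithm D and 34 and 124 bits for the BLISS sampler, matching the ranges above. Rounding σ = 2^15 up to a multiple of σ₂ changes it by less than one part in 2^15.
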
+- Malicious inputs
+    - How to prevent a small number of relays from modifying the output too much
+    - Report individual statistics with a large amount of noise to check for reasonable size, then aggregate inputs with smaller noise
+    - Divide relays into buckets and report aggregate values from each bucket; choose at least one bucket using trust, bandwidth, or flags; generate the buckets from a shared random value produced *after* the inputs have been received
+    - Specific proposal: choose a few (say, 20) of the largest relays for one bucket, choose a few relays randomly based on bandwidth (say, 20), and also release the network-wide aggregate. Use the bucket values to "check" the network aggregate.
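
One way to read the specific proposal above is sketched below. It assumes relay weights are positive integers keyed by a relay ID, that the two buckets should be disjoint, and that the shared random value arrives as bytes; none of these details are fixed by the notes. Python's `random.Random` is used only to keep the sketch short and is not a cryptographically secure generator, so a real implementation would need a different PRNG.

```python
import hashlib
import random
from typing import Dict, List, Tuple


def choose_buckets(weights: Dict[str, int], shared_random: bytes,
                   n_top: int = 20,
                   n_random: int = 20) -> Tuple[List[str], List[str]]:
    """Return (largest-relay bucket, bandwidth-weighted random bucket)."""
    # Bucket 1: the n_top largest relays; ties broken by ID for determinism.
    ranked = sorted(weights, key=lambda r: (-weights[r], r))
    top, pool = ranked[:n_top], ranked[n_top:]

    # Seed the PRNG from the shared random value, which is assumed to be
    # produced only after all inputs have been received.
    seed = int.from_bytes(hashlib.sha256(shared_random).digest(), "big")
    rng = random.Random(seed)

    # Bucket 2: weighted sampling without replacement from the rest.
    sampled: List[str] = []
    while pool and len(sampled) < n_random:
        pick = rng.choices(pool, weights=[weights[r] for r in pool], k=1)[0]
        pool.remove(pick)
        sampled.append(pick)
    return top, sampled
```

Every party that holds the same weights and the same shared random value derives the same buckets, so the per-bucket aggregates can be compared against the network-wide aggregate.
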
+- Metrics statistical analysis
+    - PrivCount needs to tell Metrics *which* relays provided inputs.
+    - PrivCount also needs to tell Metrics the standard deviation of the noise.
+    - Sampling error needs to be taken into account. Ian wrote an email describing how to do this.
+- Noise
+    - How much excess noise to handle malicious inputs
+    - How to divide noise among relays
+    - How to share noise across statistics.
+    - Divide it based on consensus weight. Then make some assumption about the maximum fraction of bandwidth that is malicious or fails, which determines how much excess noise is added.
+    - Also have a minimum requirement for the number (or weight) of relays that need to submit inputs to proceed with the aggregation.
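
A minimal sketch of dividing noise by consensus weight with an excess factor, under one possible reading of the discussion above: each relay adds independent Gaussian noise, variances add, and the per-relay variance is inflated so that the target is still met if relays holding up to `max_missing_fraction` of the weight are malicious or fail. The function and parameter names are hypothetical.

```python
import math
from typing import Dict


def per_relay_sigma(weights: Dict[str, int], sigma_target: float,
                    max_missing_fraction: float = 0.0) -> Dict[str, float]:
    """Standard deviation of the noise each relay should add."""
    if not 0 <= max_missing_fraction < 1:
        raise ValueError("max_missing_fraction must be in [0, 1)")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    # Split the variance (not the standard deviation) in proportion to
    # weight, then inflate by 1 / (1 - f) to cover missing noise.
    scale = sigma_target**2 / (total * (1 - max_missing_fraction))
    return {relay: math.sqrt(w * scale) for relay, w in weights.items()}
```

With four equal-weight relays and an assumed missing fraction of 0.25, any three of them together still contribute the full target variance.
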
+- Versioning
+    - Ensure noise is correct despite multiple versions
+    - Noise parameters and scaling
+    - If relays know the noise allocated to each statistic (and the number of relays on each version?), they can calculate how much noise to add to the statistics they are collecting