Preserving hashed IP addresses in sanitized bridge descriptors
Roger mentioned in a comment of legacy/trac#2372 (moved):
One issue that comes to mind that we might want to research is how often a given bridge moves IP address. The method you describe above would lose that info, yes? Whereas if we do a keyed hash of the IP address (and never disclose the key), we could distinguish "same" from "different". I remember we had the keyed hash design in some other sanitization context, but I don't remember which one -- how is the idea working out in that other context?
(It's possible that we already do the keyed hash for the regular bridge descriptors, so we would just need to match up the sha1(fingerprint) in this file with the sha1(fingerprint) in that file and we could look up the IP address. In which case maybe there's merit in doing the same keyed hash in both places, to ease the job of future researchers.)
When we discussed this topic the last time, I suggested replacing bridge IP addresses with something very similar to this:
H(IP address + bridge identity + secret)[:3]
The input IP address is the 4-byte long binary representation of the bridge's current IP address. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is an arbitrary, sufficiently long (say, 20 bytes), secure random string that does not change over time and that is only known to the machine running the bridge descriptor sanitizer plus backups. H is SHA-1. The [:x] operator means that we pick the x most significant bytes of the result.
The original transformation used 4 bytes of the output, but I changed this to use only 3 bytes here. The idea is to write the resulting "IP addresses" as 10.x.x.x in the sanitized descriptors to make it clear that these are no public IP addresses. I want to avoid confusion with the non-sanitized IP addresses in exit policies. I'm aware of the higher collision probability, but the probability and impact of missing an IP address change are still sufficiently low.
The resulting "IP address" helps us detect whether a specific bridge has changed its IP address. It does not tell us if two bridges run on the same IP address. It also does not tell us when a bridge changes its fingerprint but keeps its IP address.
The two important pieces of this transformation are that a) someone who learns a bridge's identity cannot guess the bridge's previous IP addresses (which would have been possible without using the secret); b) someone who guesses the secret cannot guess the IP addresses of all bridges (which would have been possible without using the bridge identity).
There are more details about preserving hashed IP addresses in this thread.