Oh great, thanks!
I believe this one https://github.com/shadow/shadow-plugin-tor/issues/63 may also be relevant
Adding a reference to this pre-print since it is a potential direction for solving protocol-related side-channels.
If such a protocol enables some malicious nodes to extract more information than they already can, or extract it faster, that would indeed be a good reason to scream.
So let's see:
Assuming a blank echo (no information exchanged besides 'the current state is unexpected', and no echo-response)
This does not seem to be more helpful for an active attacker compared to what they can already do.
I believe that as soon as an answer can be triggered (even a blank response), there is a potential for adversarial exploitation. However, this is where things can become helpful as well for honest parties.
Two potential paths forward with this idea: one would be to design it while being very careful about what can be exchanged and how many echo/echo-responses can be triggered on the same circuit. The other would be to leverage some notion of Trust; e.g., some trusted relay may be allowed to fetch some information from the peer. This information could then be quite helpful for debugging honest parties or investigating byzantine clients.
I believe trust is being considered for path selection (tpo/network-health/metrics/relay-search#40001). That could also be an option here.
While discussing #40400 on irc/matrix with @nickm, we thought it might be good to have a ticket discussing potential protocol extensions that would help track down reasons behind non-compliant protocol messages.
At present, #40400 and the related merge request deal with improving the logging system:
Protocol-level extensions could be useful to fetch/share the nodes' view of the circuit and could complement the periodic Heartbeat message. The main interest would be to help distinguish between 1) bugs, 2) buggy byzantine clients/relays, and 3) actively malicious clients/relays, using information reported from one endpoint. The main problem would be to avoid sharing overly sensitive information.
An echo/echo-response type of protocol may help in exchanging/getting more information when an abnormal protocol state is detected. Upon receiving a RELAY-level protocol echo, the node (endpoint) may log information and/or echo back some information to the other endpoint. These questions yet remain:
The circuit padding framework supports negotiating padding upon various events. Among them, CIRCPAD_CIRC_OPENED states that a given padding machine should be applied to a circuit when a circuit has opened.
However, no code seems to trigger this mechanism. When a circuit has built, the function circpad_machine_event_circ_built() is called and checks whether some machine may be removed from or added to the circuit. However, at this stage of the circuit-building process, the circuit has built but is not yet marked as open.
If some machine uses `client_machine->conditions.apply_state_mask = CIRCPAD_CIRC_OPENED;`, the machine would only be applied when an event other than circuit building/opening triggers the function circpad_add_matching_machines() (e.g., an AP connection links a stream, or the circuit purpose changes from general to something else).
When circuituse.c calls circuit_has_opened(), it should also call into the circpad module; e.g., a new function circpad_machine_event_circ_opened() that checks whether machines should be added to the circuit.
Running a version forked from 0.4.5.7
The following logs show a call to circpad_machine_event_circ_built() while the circuit is still marked as building; they also include some custom log lines:
Jun 30 11:23:50.000 [info] internal (high-uptime) circ (length 3, last hop test000a): $22BA781A60C0CBB7FFAEA8858128427F67F60038(open) $7684DE04DCBB44538554E2CD1D14CDF836D5AF4D(open) $C7ADB1DBCE99F0B2ED2812B1953E4986EE9846DB(open)
Jun 30 11:23:50.000 [debug] dispatch_send_msg_unchecked(): Queued: ocirc_cevent (<gid=7 evtype=2 reason=0 onehop=0>) from or, on ocirc.
Jun 30 11:23:50.000 [debug] dispatcher_run_msg_cbs(): Delivering: ocirc_cevent (<gid=7 evtype=2 reason=0 onehop=0>) from or, on ocirc:
Jun 30 11:23:50.000 [debug] dispatcher_run_msg_cbs(): Delivering to btrack.
Jun 30 11:23:50.000 [debug] btc_cevent_rcvr(): CIRC gid=7 evtype=2 reason=0 onehop=0
Jun 30 11:23:50.000 [debug] circuit_build_times_add_time(): Adding circuit build time 43
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking condition state mask 21 vs condition: 2
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] circpad_machine_event_circ_built(): Circpad module event circ built -- circ state: 0
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking condition state mask 21 vs condition: 2
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] circpad_machine_conditions_apply(): Checking circuit purpose, 5
Jun 30 11:23:50.000 [debug] invoke_plugin_operation_or_default(): Plugin found for caller calling a plugin in the circpad module when a circuit has built
Jun 30 11:23:50.000 [info] circpad_dropmark_activate_when_built(): Looks like the client_dropmark_def machine does not exist over this circuit
Jun 30 11:23:50.000 [debug] plugin_run(): Plugin execution returned -2147483648
Jun 30 11:23:50.000 [debug] plugin_run(): vm error message: (null)
Jun 30 11:23:50.000 [info] entry_guards_note_guard_success(): Recorded success for primary confirmed guard test002r ($22BA781A60C0CBB7FFAEA8858128427F67F60038)
Jun 30 11:23:50.000 [debug] dispatch_send_msg_unchecked(): Queued: ocirc_state (<gid=7 state=4 onehop=0>) from or, on ocirc.
Jun 30 11:23:50.000 [debug] dispatcher_run_msg_cbs(): Delivering: ocirc_state (<gid=7 state=4 onehop=0>) from or, on ocirc:
Jun 30 11:23:50.000 [debug] dispatcher_run_msg_cbs(): Delivering to btrack.
Jun 30 11:23:50.000 [debug] btc_state_rcvr(): CIRC gid=7 state=4 onehop=0
Jun 30 11:23:50.000 [info] circuit_build_no_more_hops(): circuit built!
Jun 30 11:23:50.000 [info] pathbias_count_build_success(): Got success count 3.000000/3.000000 for guard test002r ($22BA781A60C0CBB7FFAEA8858128427F67F60038)
Jun 30 11:23:50.000 [debug] circuit_has_opened(): calling circuit_has_opened()
Add a new function circpad_machine_event_circ_opened() called from circuituse.c when the circuit has opened.
@mikeperry reading
See https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md
Man this is great. Are you planning to eventually give the ability to receive padding machines from the network as well? Loading the bitstring representation of its configuration and applying it?
This is neat stuff! Now, I assume this is somewhat linked to the very concern of this thread. Do you plan to model the Tor routing protocol as a finite state machine, and get this machine updated through the network via something like a signed bitstring representation of the new protocol state machine?
I guess that should effectively replace Postel's principle. The Tor routing protocol could stay conservative (and up to date when it receives a new bitstring from the authorities) by ensuring a circuit's state machine is always in a valid state; otherwise, it kills the circuit.
That looks complicated though :) I would be happy if you could confirm these thoughts.
@mikeperry sounds good! I'll try experimenting with circuit padding inside our framework. We're working on a proof-of-concept paper explaining our methodology and demonstrating the capabilities of an anonymous network built upon 1), 2) and 3) described in my previous comment.
Regarding circpad, here's the plan: I will soon work on getting this framework 'remotely' re-programmable, such that one can inject circpad machines into some peer (experimenting with alternative deployment methods) to protect their own traffic. We will try to demonstrate this capability while protecting against dropmarks. That should help you get your work tested at the same time.
I am thinking of a scenario such as the client reprogramming its middle relay at circuit establishment, so that the relay sends padding cells until the client tells it to stop because it has successfully connected to some destination.
Forward compatibility is no longer an issue for us, at least not via Postel's maxim. Tor now supports protocol versions for specific feature support at relays. There is no reason why a relay has to accept anything not listed in its protocol versions, or send anything not listed in them, other than sloppiness.
Yes, I've seen your Vanguards addon and the work to detect and react to invalid protocol messages. This is indeed very needed. But, if I understand correctly, it only answers the first part of a complex problem, i.e., we need to be both conservative AND super flexible:
Basically, any instance SHOULD NOT accept anything that is not part of its valid protocol operations.
Clearly what you did is a step forward, but I think we should be careful not to sacrifice too much ease of deployment; that's the difficult part of the problem to me:
Obviously, being so conservative while maintaining a flexible and maintainable distributed network seems like a lost cause to fight for.
I believe that protocol negotiation is not enough. I believe we need something quite crazy to eventually call the problem solved: 1) we need the full network to be conservative; 2) we need the ability to 'hit' some deploy button to make the whole network compatible with some new protocol feature at once; 3) finally, we need to be able to show cryptographic evidence of misbehavior when some relay/client sends invalid protocol messages.
I think 1) is ok, 2) is ongoing, and 3) needs some thought.
Regarding the Vanguards addon (I am totally supportive of such a design): I see you're expecting some tests of your defenses, and some thinking is ongoing regarding different threat models (e.g., malicious Guard vs. malicious ISP, and race conditions between dropmarks and destroy cells). That's great. The reasoning there is worth digging into more, and we need to test the defenses to move forward.
We would be happy to help, but we need to play the academic game and somehow produce a paper at the end of the process. That's how I can land a job in the future x) Vanguards could make an interesting paper, and we could rebase my old attack code on a more recent Tor version, test the design, and help elaborate on the various open questions.
Hello friends!
I am hijacking this thread to let you know I have resumed working on this problem very recently, with Tariq Elahi. I am happy to see you're taking the side-channel built on forward-compatibility abuses seriously. The few attacks published in the dropmark paper are just the tip of the iceberg; I remember listing a few other things an active attacker could abuse, based on Postel's principle as implemented in Tor, to guarantee end-to-end correlation. I never investigated this list further, since the root of the issue is the same, and it is far better to work on the disease itself than on the symptoms.
So, I was hoping you could keep us in the loop on the progress you make on this specific topic? I am planning to investigate an even harder anti-Postel direction than Mike's anti-Postel suggestion, and I would be happy to keep you informed if progress is made. However, that won't be very compatible with your codebase.
Basically, any instance SHOULD NOT accept anything that is not part of its valid protocol operations. Obviously, being so conservative while maintaining a flexible and maintainable distributed network seems like a lost cause to fight for. I think this assessment is correct for our current software architecture and lifecycle, i.e., the current way we specify, implement and then deploy new versions of the protocol. Basically, specification and implementation are under our control, and we can cycle and improve between these two steps. We have less control over deployment, as it is mostly under the control of the relay operator or their OS policies. I plan to work on an architecture that gives developers full and fine-grained control over all three steps, i.e., specification, implementation and deployment, and see how it goes.
The best-case scenario: we come up with some new design and implementation in a few years, and learn something from it. I expect it to blend into the Tor network as an independent implementation. The worst-case scenario: a bunch of not-so-useful academic papers.