This is for the upload-to-server variant of the sniper attack. For each exit stream at the exit relay, track a LastReadTimestamp: the last time the destination read from that stream. Then, while the OOM killer is checking all circuits for the one with the oldest cell, have it also consider the exit streams' LastReadTimestamps and kill the oldest circuit/stream accordingly.
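A rough sketch of that bookkeeping (the names and types here are illustrative, not Tor's actual code):

```c
#include <time.h>

/* Hypothetical per-stream state: remember the last time the
 * destination drained bytes from this exit stream. */
typedef struct exit_stream_t {
  time_t last_read_timestamp;
  struct exit_stream_t *next;
} exit_stream_t;

/* Called whenever the destination reads from the stream. */
static void
stream_note_read(exit_stream_t *stream)
{
  stream->last_read_timestamp = time(NULL);
}

/* While scanning circuits for the oldest queued cell, also consider
 * the streams: return the oldest last-read time among them. */
static time_t
oldest_stream_read_time(const exit_stream_t *streams)
{
  time_t oldest = time(NULL);
  for (; streams; streams = streams->next) {
    if (streams->last_read_timestamp < oldest)
      oldest = streams->last_read_timestamp;
  }
  return oldest;
}
```

The OOM killer could then compare oldest_stream_read_time() against the age of the oldest queued cell when choosing a victim.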
"How long has the oldest cell on the circuit been there" (what we did for legacy/trac#9093 (moved)) is not quite the same thing as "when did this stream last successfully write" (what I think you're proposing above. The latter check is trivially defeated by a slow but steady drain, right?
Yeah, you're right, I wasn't very clear. I was trying to suggest something similar to a timestamp per cell, like we have with circuit queues. How about we keep a timestamp queue on the streams: we append a timestamp for every N bytes written from the circuit to the edge buffer, and pop the head of the timestamp queue after we flush N bytes. Will this let us track how long every N-byte chunk has been waiting in the buffer? A rough sketch of the bookkeeping follows.
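A minimal sketch of that timestamp queue, assuming a fixed chunk size N (CHUNK_BYTES below) and hypothetical names throughout:

```c
#include <stdlib.h>
#include <time.h>

#define CHUNK_BYTES 4096  /* illustrative value of "N" */

typedef struct ts_node_t {
  time_t when;
  struct ts_node_t *next;
} ts_node_t;

typedef struct ts_queue_t {
  ts_node_t *head, *tail;
  size_t unstamped_bytes;  /* bytes written since the last timestamp */
  size_t unpopped_bytes;   /* bytes flushed since the last pop */
} ts_queue_t;

/* Call when <b>n</b> bytes move from the circuit to the edge buffer:
 * append one timestamp per CHUNK_BYTES written. */
static void
ts_queue_note_written(ts_queue_t *q, size_t n)
{
  q->unstamped_bytes += n;
  while (q->unstamped_bytes >= CHUNK_BYTES) {
    ts_node_t *node = calloc(1, sizeof(*node));
    if (!node)
      return;  /* out of memory: skip the bookkeeping */
    node->when = time(NULL);
    if (q->tail)
      q->tail->next = node;
    else
      q->head = node;
    q->tail = node;
    q->unstamped_bytes -= CHUNK_BYTES;
  }
}

/* Call when <b>n</b> bytes are flushed from the edge buffer: pop one
 * timestamp per CHUNK_BYTES flushed. */
static void
ts_queue_note_flushed(ts_queue_t *q, size_t n)
{
  q->unpopped_bytes += n;
  while (q->unpopped_bytes >= CHUNK_BYTES && q->head) {
    ts_node_t *old = q->head;
    q->head = old->next;
    if (!q->head)
      q->tail = NULL;
    free(old);
    q->unpopped_bytes -= CHUNK_BYTES;
  }
}

/* How long (in seconds) has the oldest full chunk been waiting? */
static time_t
ts_queue_oldest_age(const ts_queue_t *q)
{
  return q->head ? time(NULL) - q->head->when : 0;
}
```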
I've forward-ported to bug10169_024 and bug10169_025; I'm writing unit tests for the latter, since the unit testing framework in 0.2.5 is what I need here. I'm doing a branch that combines unit tests and fixes as bug10169_025_tmp; I'll split them up into separate commits, cherry-picking the bugfixes to bug10169_023 and the tests to bug10169_025.
total_bytes_allocated_in_chunks still has a DOCDOC in buffers.c
I wonder if END_CIRC_REASON_RESOURCELIMIT can be used to perform a modified sniper-based oracle attack if there's another way to fill the remaining 10% when the 90% memory usage is reached.
It might be useful to the relay operator if Tor says how many circuits remained alive after circuits_handle_oom() runs, e.g. by modifying the log notice at the end of that function, as in the sketch below.
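For illustration only, the expanded notice might look something like this; the helper and its names are made up for the example, and the real change would just extend the existing log_notice() call:

```c
#include <stdio.h>

/* Hypothetical shape of the expanded OOM summary notice. */
static void
log_oom_summary(unsigned long long bytes_removed, int n_killed,
                int n_alive)
{
  fprintf(stderr, "Removed %llu bytes by killing %d circuits; "
          "%d circuits remain alive.\n",
          bytes_removed, n_killed, n_alive);
}
```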
Maybe the comments for circuit_get_streams_max_data_age and marked_circuit_streams_free_bytes should note that they are helpers for circuit_max_queued_data_age and marked_circuit_free_stream_bytes, respectively?
Is it possible that when the stream buffers are "aggressively freed" using chunk_free_unchecked(), they may not actually be freed but instead prepended to a freelist, so that less memory is actually freed in circuits_handle_oom() than expected? (See the sketch below.)
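A minimal sketch of the concern, with hypothetical names (this is not Tor's actual buffers.c code):

```c
#include <stdlib.h>

typedef struct chunk_t {
  struct chunk_t *next;
  char data[4096];
} chunk_t;

static chunk_t *freelist = NULL;

/* "Freeing" here just recycles the chunk onto a freelist: the
 * process's memory footprint is unchanged, so an OOM handler that
 * counts these bytes as reclaimed would overestimate how much
 * memory it actually got back. */
static void
chunk_free_unchecked_sketch(chunk_t *chunk)
{
  chunk->next = freelist;
  freelist = chunk;
}
```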
> total_bytes_allocated_in_chunks still has a DOCDOC in buffers.c
Will fix later in 0.2.5.
> I wonder if END_CIRC_REASON_RESOURCELIMIT can be used to perform a modified sniper-based oracle attack if there's another way to fill the remaining 10% when the 90% memory usage is reached.
The idea being: cause a node to go nearly OOM, and then see which streams (as a client!) got END_STREAM_REASON_RESOURCELIMIT, so you know that you're nearly at the OOM point, and then somehow make the node consume another 0.1 * MaxMem?
If that's what you meant, it would work, but I think it only means that MaxMem needs to be set conservatively, and we need to be on the lookout for other ways to pump up a node's memory consumption. Even if we didn't send END_STREAM_REASON_RESOURCELIMIT, an attacker could still snipe a node if they know a way to make it run out of memory without its buffers and cell queues exceeding 0.9 * MaxMem.
I don't like the name buf_get_oldest_chunk_timestamp() for a function that returns an age rather than a timestamp.
Okay, let's change that in the 0.2.5 version.
Can anything horrible happen with all this if the clock gets reset?
We could kill the wrong circuits if the clock goes backwards and then doesn't catch up with itself before we hit an OOM.
Perhaps it would be wise to use clock_gettime(CLOCK_MONOTONIC, ...) where available if we aren't doing so already.
Can that be an 0.2.5-only thing? Doing a portable monotonic timer is a bit tricksy. On Linux, you want clock_gettime(CLOCK_MONOTONIC_COARSE). On OSX, you want mach_absolute_time(). On other Unix, you want clock_gettime(CLOCK_MONOTONIC) if possible. On Windows, there's a complicated mishmash of things using QueryPerformanceCounter(), GetTickCount64(), and GetTickCount(). As a fallback, you can use gettimeofday() and check the result to make sure it doesn't go backwards.
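A rough sketch of that dispatch, simplified: error handling and careful unit conversion are elided, the Windows branch is reduced to GetTickCount64() alone, and the function name is made up:

```c
#include <stdint.h>

#if defined(_WIN32)
#include <windows.h>
static uint64_t
monotime_msec(void)
{
  return (uint64_t)GetTickCount64();
}
#elif defined(__APPLE__)
#include <mach/mach_time.h>
static uint64_t
monotime_msec(void)
{
  static mach_timebase_info_data_t tb;
  if (!tb.denom)
    mach_timebase_info(&tb);
  /* ticks -> nanoseconds -> milliseconds (overflow ignored here) */
  return mach_absolute_time() * tb.numer / tb.denom / 1000000;
}
#elif defined(__linux__) || defined(__unix__)
#include <time.h>
static uint64_t
monotime_msec(void)
{
#ifdef CLOCK_MONOTONIC_COARSE
  const clockid_t id = CLOCK_MONOTONIC_COARSE;  /* cheaper on Linux */
#else
  const clockid_t id = CLOCK_MONOTONIC;
#endif
  struct timespec ts;
  clock_gettime(id, &ts);
  return (uint64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
#else
#include <stddef.h>
#include <sys/time.h>
static uint64_t
monotime_msec(void)
{
  /* Fallback: gettimeofday(), latched so it can't go backwards. */
  static uint64_t last = 0;
  struct timeval tv;
  uint64_t now;
  gettimeofday(&tv, NULL);
  now = (uint64_t)tv.tv_sec * 1000 + tv.tv_usec / 1000;
  if (now < last)
    now = last;
  else
    last = now;
  return now;
}
#endif
```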
I guess we could do just the fallback check-and-latch in 0.2.3/0.2.4, and aim for the more complex ones in 0.2.5 or later?
I've added 833d0277 to bug10169_023; if you like it, I'll merge it forwards. It implements a trivial latch to make sure that our time can't go backwards there.
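For reference, a minimal check-and-latch looks something like this (a sketch in the spirit of that commit, not its actual code):

```c
#include <time.h>

/* Return the current time, but never a value earlier than one we
 * have already returned, even if the system clock jumps backwards. */
static time_t
latched_time_now(void)
{
  static time_t latest = 0;
  time_t now = time(NULL);
  if (now < latest)
    now = latest;   /* clock went backwards: hold the latch */
  else
    latest = now;
  return now;
}
```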
Andrea said this looked good to merge into 0.2.5. I've updated the branches "bug10169_024" and "bug10169_025_v2", and merged the latter.
FYI, I have not forgotten about testing this defense. I've been trying to test bug10169_025_v2 in Shadow for the last week. I've been running into bugs that appeared after updating the version of Tor I'm using in Shadow. Stay tuned.
After merge hell, I finally got this working. I set MaxMemInQueues to 50MB, which automatically gets pushed up to the minimum of 256MB. I've attached a graph showing that circuits_handle_oom() did not appear to be triggered.
Also, what are you testing that gets you "merge hell"? I'd suggest that you just test master, or 0.2.5.3-alpha.
This was mostly due to the fact that I implemented some necessary client pieces for the attack on 0.2.3.25, and made the mistake of optimistically assuming everything would work OOTB instead of starting my testing on a minimal network. So yes, it's my own fault.
I tried this out on my small 10-node test network in Shadow, where all relays have ample 10 MiB/s connections. I merged both my sniper attack code and nickm's bug10169_025_v2 with tor-0.2.5.2-alpha. Then I tested the sniper attack using 1 team of 10 circuits (1 client instance to use a ping circuit to measure rtt, 1 client instance to launch 9 sniper circuits). I tested the attack without nickm's defense, and with nickm's defense using MaxMemInQueues 50 MB (which automatically gets adjusted up to 256MB). Then I ran a second test with 2 teams of 10 circuits.
The results are in the attack graph. Both the graph and the log file indicate that the sniper's circuits were successfully killed after memory exceeded the 256MB limit.
I'm not exactly sure why the defense was not being triggered before, but looking back at my config, I may have been using a MaxMemInQueues of 500 MB (which would have been too large to trigger the OOM killer).