Snapshot of where we are and what remains to do, as of hackweek wrapup on 2022-07-01:
Status:
My ticket40634_048-patch2 branch works (you'll need to apt install libb2-dev). Services can advertise a suggested effort, clients see it and produce and send a PoW at that effort level. Services learn the effort in incoming cells and keep a pqueue of next rend requests to respond to. But, that pqueue is typically empty because of the blocker below, so in practice PoW-solving clients don't get better priority yet.
Primary blocker:
We need a new design for how to decide how many rend requests to answer and when. If we just answer them as fast as possible, we end up with many in-flight rendezvous circuits, but we always drain the request pqueue immediately, so the PoW requests never get priority. Whereas if we delay answering requests in case a PoW request arrives soon, our overall handling rate suffers.
Secondary todo's:
Make the suggested effort dynamic on the service side based on load
Get rid of the new blake2 dependency, either by using openssl's blake or by using one of our existing hash functions instead
Nick's idea of incorporating KH-derived bytes into the PoW, to tie each PoW to the particular intro circuit it arrives on.
Unit tests for everything!
Do we want to change the design so clients compute their PoWs inside a cpuworker? Currently clients don't use cpuworkers. Need to compare how much performance gain we'd get vs the complexity we'd add.
Make sure we're obeying the Equix LGPL license properly.
Find and fix edge cases where clients with a cached onion service descriptor use the wrong effort level.
Put a cap on the max effort a client is willing to attempt.
Nick's idea of incorporating KH-derived bytes into the PoW, to tie each PoW to the particular intro circuit it arrives on.
More details:
We include a new unencrypted extension, "introduction-binding". It is 32 bytes long.
This extension MUST be present in the unencrypted part of the INTRODUCE2 whenever POW is used. It may be present if POW is not used.
If the service receives an INTRODUCE2 message without an introduction-binding extension, it MUST treat that message as if it did not include POW at all.
When negotiating an ntor handshake, clients and relays MUST derive 32 extra bytes for use as the introduction-binding for the resulting circuit hop. This is the "introduction-binding" value.
The introduction point must inspect the "introduction-binding" extension if it is present in an INTRODUCE1 cell. If it is present and set to anything other than the correct "introduction-binding" value for the circuit, the introduction point must reject the cell.
The "introduction-binding" value is used as another input to equix_solve() and blake2b() in the POW algorithm.
Results of this idea:
If the intro-point has upgraded, and it is honest, then POW replays are now impossible, and clients cannot compute POW in advance before building the circuit to the intro-point.
If an intro-point has not upgraded or is not honest, then the POW algorithm is no less reliable than before.
If the service knows that an intro-point has upgraded, it can safely use a small bloom filter with a higher false-positive rate, and just count the apparent replays from the intro point to see if they rise above a certain threshold.
If a service is under attack, it can decide to choose only upgraded intro points.
The client does not need to care whether the intro-point has upgraded: It sends the same data in either case.
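To make the mechanics a bit more concrete, here is a rough sketch of how the binding could be folded into the existing PoW inputs. The struct layout, field sizes, and encoding below are illustrative only, not the draft's exact wire format:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: the 32-byte introduction-binding (KH-derived during the intro
 * circuit's ntor handshake) is appended to the usual challenge material
 * before it is fed to both equix_solve() and the blake2b effort check,
 * so a solution is only valid on the circuit it was computed for.
 * The layout here is illustrative, not the draft's exact encoding. */
typedef struct pow_challenge_t {
  uint8_t seed[32];     /* C: seed published by the service */
  uint8_t nonce[16];    /* N: client-chosen nonce */
  uint32_t effort;      /* E: effort the client commits to */
  uint8_t binding[32];  /* KH-derived introduction-binding for this circuit */
} pow_challenge_t;

#define POW_CHALLENGE_ENCODED_LEN (32 + 16 + 4 + 32)

static void
pow_challenge_encode(const pow_challenge_t *ch,
                     uint8_t out[POW_CHALLENGE_ENCODED_LEN])
{
  size_t off = 0;
  memcpy(out + off, ch->seed, sizeof(ch->seed));   off += sizeof(ch->seed);
  memcpy(out + off, ch->nonce, sizeof(ch->nonce)); off += sizeof(ch->nonce);
  out[off++] = (uint8_t)(ch->effort >> 24);        /* big-endian effort */
  out[off++] = (uint8_t)(ch->effort >> 16);
  out[off++] = (uint8_t)(ch->effort >> 8);
  out[off++] = (uint8_t)(ch->effort);
  memcpy(out + off, ch->binding, sizeof(ch->binding));
}
```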
I pushed improvements to my ticket40634_048-patch2 branch thanks to more help from Jim and Nick.
Now there are still a bunch of rough edges to be cleaned up, but the core of the idea is there and working.
We got around the 'primary blocker' above by limiting ourselves to at most 16 "cheap" (effort < suggested) rendezvous responses in flight at once, while being willing to answer as many real (full effort) cells as we get. That way we scale as much as we want for real (full effort) intro2 cells if there are a lot of them, we still provide some service to the cheap seats, and we keep some capacity available in case a new full-effort cell comes in.
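In pseudocode, the rule looks roughly like this (the constant and names are made up for illustration, not the actual code on the branch):

```c
#include <stdint.h>

/* Sketch of the dequeue rule described above; illustrative only. */
#define MAX_CHEAP_REND_IN_FLIGHT 16

static int
should_launch_next_rend(uint32_t head_effort, uint32_t suggested_effort,
                        int rend_circs_in_flight)
{
  if (head_effort >= suggested_effort)
    return 1;  /* full-effort request: always willing to answer it */
  /* Under-effort ("cheap") request: keep some capacity free in case a
   * full-effort request arrives next. */
  return rend_circs_in_flight < MAX_CHEAP_REND_IN_FLIGHT;
}
```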
I did this one too: "Put a cap on the max effort a client is willing to attempt." I picked 500 since it seems huge. Feel free to propose a better number. :)
And, it is becoming increasingly clear to me that we are going to need to do the client-side cpuworker thing. I just told my client to visit this onion service 15 times, and it started in on its 15 pows, and even though it came up for air every so often, the next thing it did was get to work on the next pow attempt. The main loop didn't get any attention for something like 30 seconds, which is going to starve a lot of other things.
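For reference, the shape of the cpuworker-style fix is to push the whole solve loop onto a worker thread and only hand the result back to the main thread. A minimal illustration using plain pthreads follows; this is not tor's actual cpuworker API, and every name here is hypothetical:

```c
#include <pthread.h>
#include <stdint.h>

/* Illustration only: run the PoW solve loop on a worker thread so the
 * main event loop stays responsive. Plain pthreads, hypothetical names. */
typedef struct pow_job_t {
  /* challenge, effort, etc. would live here */
  uint64_t solution_nonce;
} pow_job_t;

/* Hypothetical helpers assumed by this sketch: */
extern void pow_solve_blocking(pow_job_t *job);       /* the expensive part */
extern void pow_notify_mainloop_done(pow_job_t *job); /* e.g. via a pipe/event */

static void *
pow_worker_main(void *arg)
{
  pow_job_t *job = arg;
  pow_solve_blocking(job);
  /* Hand the result back to the main thread; never touch mainloop
   * state directly from a worker thread. */
  pow_notify_mainloop_done(job);
  return NULL;
}

static int
pow_job_launch(pow_job_t *job)
{
  pthread_t thread;
  if (pthread_create(&thread, NULL, pow_worker_main, job) != 0)
    return -1;
  pthread_detach(thread);
  return 0;
}
```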
The load script sends 100000 requests total, maintaining 100 in parallel. It uses torsocks's -i option so that each request builds a new circuit.
A single instance of this saturated a single CPU on my desktop. Since I have 6 cores on my desktop I ran 6 instances of tor and of this script, which then saturated all 6 cores.
We didn't seem to use all that many resources on the onion service itself at any point, and one-off non-attack requests of "time torsocks -i -P curl http://of5kal2657x3tsnkabxj7oczy74idqapup2tlki3o6z3lotfnbg7wmad.onion/" were still getting through without POW.
We had mixed results with POW enabled. I kept the "DOS clients" using a "legacy" tor version that didn't do POW, and we tried sending one-off requests with both legacy and POW-enabled tor clients. I think with some tuning we got to the point where it seemed like it was helping the POW client get through, but not consistently.
Btw I don't think we ever actually overloaded the server machine; its CPU usage stayed low.
Ok thanks for this summary. I know @arma also did some optimizations of eventloop scheduling, but now that we have my patch that stays at 0-pow until queue overload, we can do stuff like benchmark its throughput before it switches to using non-zero pow, and compare that to the throughput when the pow system is completely disabled. This will let us tune the eventloop and queue scheduling further.
So we have two integration tests:
Stress-test and benchmark onion service throughput with the 0-pow case, and without the code enabled, to further tune eventloop scheduling
Benchmark and improve the queue and effort thresholds to activate non-zero pow effort usage
@dgoulet and I should be able to do this, when he gets back.
I have some initial partial hacks at this one on my laptop, between PETS flights. So, feel free to beat me to it, but also feel free to do other ones first and maybe in August I will reappear with some code for this one. :)
A thought @dgoulet and I had today on the bbb: if the attacker is able to send so many intro2 cells that not all of them get to the onion service, then they can successfully (probabilistically) deny service even from people who solve the PoW.
That is, as we explore the onion cannon, we might notice that there's a load level above which the honest clients don't get through because the intro2 cells don't even properly arrive.
Maybe the fix is a whole bunch of onionbalances on the service side. In any case we'll still be raising the bar.
Yeah, it seems like we should provide options to allow the use of Nick's binding idea and intro-point-transparent PoW (ie unencrypted intro point extensions) for intro-point PoW validation, if the intro point supports it.
Also note that we are seeing huge cell queues on intro circuits in the wild, which causes intermediate relays to close them. So this is another bottleneck that adversaries could deliberately target to deny service.
We could have the service send its service queue processing rate to the intro point in the form of HS DoS rate limit parameters, automatically. This would reduce that circuit teardown ability, but not eliminate it entirely (ie: burst attacks might still work, before a queue processing limit can be calculated/sent). The other alternative is to perform congestion control on intro circuits, and have the intro point signal that the congestion window is being exceeded (it is the sender that knows the congestion window size, not the receiver). This is obviously even more complex, though.
But of course, most of these ideas require intro point upgrade, so in the meantime, the main answer will have to be "use onionbalance" and "use HS DoS rate limits", if any of these things happen.
My ticket40634_048-patch3 branch now has client-side cpuworker pow support. It is built on Mike's latest branch so it should be the best of everything we've got so far.
In Limerick we decided that the right plan is to throw out the complicated, hard-to-specify, weirdly licensed, woah-is-that-really-assembly-that-it-uses, PoW scheme and use something way simpler, maybe from a common library that everybody already has.
@ahf knows the details of this plan, and/or is intending to produce further details. :)
Current blocker: Mike's dynamic effort adjustment (AIMD) approach does not seem easily compatible with my hack for how to choose the number of cells that we hold in the queue vs try to answer:
We got around the 'primary blocker' above by limiting ourselves to at most 16 "cheap" (effort < suggested) rendezvous responses in flight at once, while being willing to answer as many real (full effort) cells as we get. That way we scale as much as we want for real (full effort) intro2 cells if there are a lot of them, we still provide some service to the cheap seats, and we keep some capacity available in case a new full-effort cell comes in.
Specifically, the behavior on the current branch is that we start at effort 0, and whenever we get an intro2 cell with effort 0, that effort is sufficient, so we answer it immediately. Nothing ever remains in the queue long enough for it to grow, so the service never decides to raise the effort parameter.
We need some new design breakthrough for how to recognize that we are getting more requests than we can handle -- because we don't know how many we can handle so we don't have any orderly kind of backpressure approach currently.
Hrmm.. The difference in eventloop prioritization between cell processing (via hs_circ_handle_introduce2()) versus the dequeue callback (handle_rend_pqueue_cb(), which is a mainloop_event_postloop_new() event that should happen after other events are checked) was meant to generate this backpressure. The thinking was that incoming cell processing (the cheap part) would then pre-empt the post-loop launching of circuits (the expensive part).
But perhaps this post-loop concept is not far enough below incoming cell processing in priority to allow intro cells to build up in the queue? Or we're launching too many circuits in that post-loop event, relative to how many can come in during cell processing?
Yes, something like that. We are able to launch all of the circuits that we want to launch, because we just... do it. The process of receiving an intro2 cell is cheap, and the process of telling Tor to start answering an intro2 cell is cheap too. We end up with too many circuits if we try to answer too many of them, and that leads to not all of them succeeding, but there is no mechanism which says "we have too much going on right now, so we can't open more."
I did add a mechanism like that, on the branch here, for the case where the next intro2 cell in the queue is under-effort. But the goal there was to intentionally leave some open space in case a proper-effort intro2 cell comes in next.
Here's an idea. We could use that same mechanism again, to say "if the next cell up is under-effort then only launch it if < 16 circuits are in flight" (already implemented) plus "if the next cell up is full-effort then only launch it if < x circuits are in flight" (this new idea), for some value of x much higher than 16. This approach seems like it would provide the pushback we need.
But how should we pick x? I didn't worry that 16 is a low number because it was only for under-effort answers, and if we over-aggressively limit those it's fine, at least we're answering some. But for the full-effort case, if we pick a number that's too low we limit how well the onion service can keep up with its users, and if we pick a number that's too high we do a less good job at pushing back (and thus noticing we need to raise our requested effort).
That said, there is another feature of limiting ourselves to x in-flight responses if we can pick an adequate x: if we launch 1500 circuits at once, we're screwing ourselves and all 1500 of them will suffer, whereas if we had intentionally kept some from launching, at least the ones we do try will be more likely to succeed. So there is something to be said for having a cap, of some sort, even if we don't do this PoW thing, to improve onion service performance.
The right value of x depends not just on the local computer's capabilities (cpu for doing new handshakes, good network connection for getting the create and extend cells out, etc) but also on how the overall Tor network is doing. So tuning it once globally is unlikely to perform well. Is there some self-measurement an onion service can do at startup to help it decide a good number for its x? Like, launch 100 circuits at once and see when the 50th circuit finishes getting established. Another option is to watch our actual response circuits (from real users) and auto tune our threshold based on how many are succeeding; but we should be wary of any approach that lets the attacker control the circuits that we use to adjust our threshold. Can we get away with something simpler?
On the plus side: the "x circuits in flight" mechanism I describe is already much more dynamic and reactive than a static "x circuits per second" approach. So in that sense the mechanism is already adapting to the current network conditions, if we can find an adequate x to use.
On the minus side: if the pushback mechanism is having no more than x circuits in flight, can the adversary craft circuits that take especially long for us to finish, thus filling us up quicker and more thoroughly than we expected, and delaying the honest intro2 cells in the queue? Maybe this is ok, since if the adversary can be the first x cells in our queue then by definition we need to raise our effort parameter. But it isn't quite 'the first x cells' -- it's more like they need to get us to process x cells in each time interval, where the time interval is how long those circuits take to finish, e.g. 30 seconds to time out. In any case it makes me uncomfy that this attack has an angle where the asymmetry favors the attacker. Hm.
We just hired two excellent engineers to help with this. They start next week and next month.
The TL;DR summary is that onion service DoS activity appears to remain abnormally high and consistent. Shooting from the hip, it appears that onion service DoS attacks are the bulk of sustained DoS activity. Other aspects of the DoS are highly variable, as if adversaries are probing for things that can damage the network, and take specific classes of relays offline. We have a pipeline of tickets for those mitigations. Most are confidential, due to the strange nature of this probing.
There's one minor fix in but it's mostly just @arma and @dgoulet's work rebased onto main at this point. I've finally got a somewhat-reliable way to locally generate DoS floods that affect my service under test but not the wider Tor network, and I've been kicking the tires on the patch to see what works and what to do next. Some obvious things that need to be done still:
the client sometimes isn't detecting PoW parameter changes, getting stuck retrying repeatedly with the wrong params
need metrics: pqueue depth and effort values, in-progress rend circuits, suggested effort
split out equix dependency into a dynamically loadable module
There's also been some talk about reconsidering the hash function. The current implementation seems to be doing its job, i.e. making legit clients spend a tolerable amount of time and making it far too expensive to run the DoS with PoW enabled. Maybe we have room to do something simpler though and retain those high level properties.
I added metrics for the suggested effort as well as the queue depth, and I added an explicit rate limiter for controlling how fast we dequeue rendezvous requests. This should be useful in production to keep the number of active rend circuits under control, and I can set that artificially low to make it easier to test DDoS scenarios with less actual attack power. You can see the PoW effort suggestion oscillate; it gradually decreases until the queue begins to overload, which causes a sudden bump.
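Conceptually the rate limiter is just a token bucket in front of the dequeue step, something like the following sketch (names and structure are illustrative, not the code in the branch):

```c
#include <stdint.h>

/* Sketch of a token-bucket limiter on the dequeue side: refill
 * rate_per_sec tokens each second up to burst, and dequeue one
 * rendezvous request per token. Illustrative only. */
typedef struct dequeue_limiter_t {
  uint32_t rate_per_sec;    /* e.g. the HiddenServicePoWQueueRate setting */
  uint32_t burst;
  uint32_t tokens;
  uint64_t last_refill_sec;
} dequeue_limiter_t;

static int
dequeue_limiter_allow(dequeue_limiter_t *lim, uint64_t now_sec)
{
  if (now_sec > lim->last_refill_sec) {
    uint64_t added = (now_sec - lim->last_refill_sec) * lim->rate_per_sec;
    uint64_t total = added + lim->tokens;
    lim->tokens = (uint32_t)(total > lim->burst ? lim->burst : total);
    lim->last_refill_sec = now_sec;
  }
  if (lim->tokens == 0)
    return 0;   /* leave the request in the pqueue for now */
  lim->tokens--;
  return 1;     /* ok to dequeue and launch the rendezvous circuit */
}
```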
One of the big remaining obstacles before this can be more widely distributed: we're still relying on a hash function (Equi-X) that has no specification and just one implementation, and that implementation is LGPL'ed. The branch linked above has a kind of unsatisfying approach that includes a copy of equix (making the tor binary effectively GPL'ed, but without documenting this fact) and also adds a new external dependency for libb2. We have some options for dealing with this:
1. entire PoW strategy is pluggable:
   - probably bad to encourage proliferation of site-specific binary blobs
   - don't want to expose any internals of tor, so this still wouldn't allow plugins to do anything novel with network or threading
2. minimal pluggable module system just for the PoW function itself:
   - keeps tor dependencies minimal
   - not much value beyond using the libequix.so interface that already exists
3. link against libequix.so:
   - lib is still immature, would need to be stabilized and packaged
4. keep the bundled copy of equix inside tor, behind an explicit build option:
   - easiest to maintain, especially while stabilizing equix
   - can modify equix to share its blake2 implementation
   - when enabled, the entire 'tor' binary becomes effectively GPL'ed. is this a problem?
5. revise the protocol:
   - could use a more mature hash function
   - equix seems to work well, but it also seems like we'd get most of the PoW benefits from nearly any halfway decent client puzzle
   - we can extend the protocol later with new versions of the pow-params
   - equix could be replaced with a partial hash preimage puzzle, but it should involve a memory scratchpad for a little GPU and ASIC resistance
   - blake2b can be replaced by any convenient hash we already use in tor, like sha3
For discussion's sake I wrote up a tiny version of (2), but I think I'm leaning more toward either (4) or (5) now.
scope:
- intended only to implement the current PoW scheme, not to support a proliferation of site-specific binary blobs
- allow resources to be reused between invocations, within a single thread
- no reliance on tor internals, minimal ABI
- just enough futureproofing to handle other things we may want to call a "proof of work plugin" in the future

torrc:
```
ProofOfWorkPlugin /path/to/module_name.so
```

exports:
```c
typedef struct tor_pow_hs_v1_context_t {
  void *opaque;
} tor_pow_hs_v1_context_t;

typedef struct tor_pow_hs_v1_nonce_t {
  uint8_t bytes[16];
} tor_pow_hs_v1_nonce_t;

typedef struct tor_pow_hs_v1_seed_t {
  uint8_t bytes[32];
} tor_pow_hs_v1_seed_t;

typedef struct tor_pow_hs_v1_equix_t {
  uint8_t bytes[16];
} tor_pow_hs_v1_equix_t;

typedef struct tor_pow_hs_v1_puzzle_t {
  uint32_t effort;
  tor_pow_hs_v1_seed_t seed;
} tor_pow_hs_v1_puzzle_t;

typedef struct tor_pow_hs_v1_solution_t {
  tor_pow_hs_v1_nonce_t nonce;
  tor_pow_hs_v1_equix_t equix;
} tor_pow_hs_v1_solution_t;

typedef struct tor_pow_hs_v1_t {
  size_t struct_size;
  const char *plugin_name;
  const char *plugin_version;
  tor_pow_hs_v1_context_t *(*context_alloc)(const struct tor_pow_hs_v1_t *self);
  void (*context_free)(const struct tor_pow_hs_v1_t *self,
                       tor_pow_hs_v1_context_t *context);
  int (*solve)(const struct tor_pow_hs_v1_t *self,
               tor_pow_hs_v1_context_t *context,
               const tor_pow_hs_v1_puzzle_t *puzzle,
               tor_pow_hs_v1_solution_t *solution_out);
  int (*verify)(const struct tor_pow_hs_v1_t *self,
                tor_pow_hs_v1_context_t *context,
                const tor_pow_hs_v1_puzzle_t *puzzle,
                const tor_pow_hs_v1_solution_t *solution);
} tor_pow_hs_v1_t;

typedef struct tor_pow_plugin_host_t {
  size_t struct_size;
  void (*register_hs_v1)(const struct tor_pow_plugin_host_t *host,
                         const tor_pow_hs_v1_t *hs_v1);
} tor_pow_plugin_host_t;

extern void tor_pow_plugin_init(const tor_pow_plugin_host_t *host);
```
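For illustration, a plugin built against this ABI would look roughly like the following skeleton; the typedefs above are assumed to live in a shared header, the equix calls are stubbed out, and everything here is hypothetical:

```c
/* Hypothetical plugin skeleton for the ABI sketched above. */
#include <stdlib.h>

static tor_pow_hs_v1_context_t *
example_context_alloc(const tor_pow_hs_v1_t *self)
{
  (void)self;
  /* A real plugin would allocate an equix context into ctx->opaque. */
  return calloc(1, sizeof(tor_pow_hs_v1_context_t));
}

static void
example_context_free(const tor_pow_hs_v1_t *self,
                     tor_pow_hs_v1_context_t *context)
{
  (void)self;
  free(context);
}

static int
example_solve(const tor_pow_hs_v1_t *self, tor_pow_hs_v1_context_t *context,
              const tor_pow_hs_v1_puzzle_t *puzzle,
              tor_pow_hs_v1_solution_t *solution_out)
{
  (void)self; (void)context; (void)puzzle; (void)solution_out;
  return -1;  /* real plugin: loop over nonces and call equix_solve() */
}

static int
example_verify(const tor_pow_hs_v1_t *self, tor_pow_hs_v1_context_t *context,
               const tor_pow_hs_v1_puzzle_t *puzzle,
               const tor_pow_hs_v1_solution_t *solution)
{
  (void)self; (void)context; (void)puzzle; (void)solution;
  return -1;  /* real plugin: blake2b effort check plus equix_verify() */
}

static const tor_pow_hs_v1_t example_pow = {
  .struct_size = sizeof(tor_pow_hs_v1_t),
  .plugin_name = "example-equix",
  .plugin_version = "0.0",
  .context_alloc = example_context_alloc,
  .context_free = example_context_free,
  .solve = example_solve,
  .verify = example_verify,
};

void
tor_pow_plugin_init(const tor_pow_plugin_host_t *host)
{
  host->register_hs_v1(host, &example_pow);
}
```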
I think I still favor '5', because of the complexity of the equix stuff, because of the license mess, because it's hard to audit, and because arti isn't going to want to use it anyway. (Still the same opinion as in #40634 (comment 2844708))
Got it. I can focus on prototyping a simpler client puzzle algorithm made using the pieces we already have at hand, maybe call that "v0" to go with the equix-based "v1".
In the onion services working group meeting today we were leaning toward (4), but I can give (5) a quick go unless anyone has objections.
We really should rework this ticket into a new one with checklists for "Alpha MVP Must Have", "Want to have for arti", and "Nice to have eventually". I think we're missing the forest for the hash functions, here.
Shipping a Tor Browser Alpha with equix well may be enough to allow the services under attack to upgrade, tell their users to use the alpha to access them, and make the attackers go away, while we figure out the "perfect solution". If we can solve the licensing problem with Option 4, that seems an easy fast path to this.
Second, even if we are to replace equix (and I agree there are good reasons to do so), I don't think we should NIH our way into redesigning a new pow function before at least evaluating yespower, or releasing an alpha with equix, as both already provide GPU resistance.
If we re-invent the wheel without any evaluation, we risk re-inventing a wheel that is already broken, or can be quickly swapped for something that runs on bitcoin ASICs, GPU servers, or even just an architecture choice that does the PoW much faster than others. We also risk taking more time to do so than just doing a configure option, or using yespower.
Maybe it's easier to make a resistant pow than it looks, if you know what you're doing. But from my view, it looks like it takes a while, and if we do it blindly, we risk stepping on a landmine of a design that has already been optimized/broken elsewhere.
After we get this code shippable, we can design hash functions to our hearts content. This is the kind of thing I wanted to capture on a new, consolidated set of priorities and checklists.
I think aiming for 4 first is the best approach like we talked about in the WG meeting today, yeah.
Once we have something in tor.git, we can look at next steps. We risk blocking ourselves otherwise since this is a very open-ended problem we are trying to solve. Getting something out that we can start getting some feedback on is probably the best thing we can do for now and once that is out we can start doing more experimental stuff here :-)
Just to pull on loose threads in the convo that may be useful later...
I had been looking at yespower after Mike mentioned it in the meeting, and it did seem like a good candidate for a "v0" algorithm. Are there arguments against yespower I should know about?
I hadn't mentioned it specifically in the meeting, but one option we have at our disposal for a "v0" is to just avoid worrying about GPU and ASIC resistance at all, and use a hashcash-style scheme built on the fastest modern hash or kdf we're already including in tor. I don't know if this is a good idea; it kind of depends on the attacker's willingness to gather GPUs for their botnet. In any case, this approach seems slightly worse than just using yespower.
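To be concrete about what "hashcash-style" means here, a minimal sketch using OpenSSL's SHA-256 as a stand-in for whatever fast hash we already link; this is purely illustrative, not a proposal for the actual construction:

```c
#include <stdint.h>
#include <string.h>
#include <openssl/sha.h>

/* Hashcash-style check: a (challenge, nonce) pair passes iff the first
 * 32 bits of SHA256(challenge || nonce), read big-endian, times the
 * effort does not exceed UINT32_MAX. On average 1 in `effort` nonces
 * passes, so a solver does ~effort hashes while the verifier does one.
 * Illustrative only; nonce endianness etc. is left unspecified here. */
static int
v0_pow_check(const uint8_t challenge[32], uint64_t nonce, uint32_t effort)
{
  uint8_t msg[32 + sizeof(nonce)];
  uint8_t digest[SHA256_DIGEST_LENGTH];
  memcpy(msg, challenge, 32);
  memcpy(msg + 32, &nonce, sizeof(nonce));
  SHA256(msg, sizeof(msg), digest);
  uint32_t r = ((uint32_t)digest[0] << 24) | ((uint32_t)digest[1] << 16) |
               ((uint32_t)digest[2] << 8)  |  (uint32_t)digest[3];
  return (uint64_t)r * effort <= UINT32_MAX;
}

/* Solver side: count up nonces until one passes. */
static uint64_t
v0_pow_solve(const uint8_t challenge[32], uint32_t effort)
{
  uint64_t nonce = 0;
  while (!v0_pow_check(challenge, nonce, effort))
    nonce++;
  return nonce;
}
```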
A lot of the complexity in Equi-X comes from the hashx algorithm, which seems powerful for the level of ASIC resistance it provides (you'd effectively have to implement a CPU in your ASIC) but custom hardware doesn't seem to loom large in our threat model. We're in a tight spot with our memory requirements. A lot of the really strong GPU- and ASIC- resistant algorithms that exist for coin junk will aim to have a working set that's solidly above what will fit on anyone's on-chip cache. But a mobile Tor app might actually have to fit in so little RAM that a server's L2 cache would easily hold any working set we can afford.
In addition to being harder than just using an existing solution, the reason to avoid a DIY v0 is exactly because of the "crypto junk". There are cloud mining services and pow optimization libraries that could be repurposed for attack.
We have also seen continual adaptation from the attackers over the past year. It is no longer reasonable to assume that we can just do a quick hack and they will have trouble adapting.
It is entirely possible that the attackers could find some cheap mining service, deprecated-but-available accelerator hardware, and/or pow optimization libraries that can be swapped in to break our DIY effort, especially if we are not aware of that full landscape.
We spent a long review process eliminating hash functions and pow libraries during the proposal process, with a list of requirements, to avoid easily optimizable candidates. Trivial hash functions satisfied none of them, and even many existing and available pow candidates were also ruled out by available optimizations on certain commodity hardware. Both tevador and Solar Designer helped with this analysis.
This analysis is exactly why tevador designed EquiX. They designed it for us, for our requirements. Solar Designer was nervous about yespower being used in a way that would incentivize ASIC development, but despite the adaptation, we don't think that our adversaries will adapt to the point of making custom chips to perform their attacks. They are still cost-sensitive, and are doing this for extortion and market-capture reasons, and while their resources are considerable, they are likely not at the level of cryptocurrency ASIC design and fab, since they would be the sole users of such a chip (as opposed to an entire mining industry that would want to buy a run of them). They are also criminals, and likely prefer attack points that they can burndown and abandon, and will not want to risk leaving multi-million dollar hardware (with an identifiable supply chain) behind.
As for the RAM question, yes, you are right. We are operating under the assumption that so long as users can use desktop Tor Browser to access a site that is under attack, the incentive for adversaries to attack it goes away. So mobile support is not a major concern, at the moment. This is also why we want to get this into an alpha as soon as possible, because even that may be enough to make it no longer worthwhile behavior.
So my only concern with option 4 is that downstream Tor Browser consumers (which use our bundled tor daemon and friends in their apps) may inadvertently package tor into their apps (as they have been doing) without realizing it's now GPL. Especially since we've made it easy to automate the process with the tor-expert-bundle packages.
It would help to enable PoW for mobile if we can do so easily, but we haven't been prioritizing mobile since we don't expect it will work as well as on desktop.
The algorithm we're using has three versions: an interpreted version that will probably run fine on anything little-endian, and JIT-compiled versions for x86_64 and arm64.
Do we have any information about how many downstream apps, that are not currently GPL, would be affected by this?
Downstream apps that repackage Tor Browser Alpha binaries into non-GPL bundles are still shipping a GPL binary that has full unmodified source available (from us), even if their repackage is not GPL. There is no license violation, because that binary can be reproduced from our sources and modified, as per the GPL. The GPL is not contagious to other separate non-GPL binaries in a bundle.
Would it be wrong to assume that we wouldn't need to turn this on for mobile, because we don't expect the offenders using mobile for their attacks?
It doesn't matter if the attackers use mobile or not, in this case. They may create circuit churn, but that will not affect the reachability of the site for desktop users.
Again, it is important to realize that this whole system functions on an incentives level. If users can still reach a site via desktop, but the pow is too expensive on mobile, or unavailable on mobile, Tor Browser Android can throw up a banner that says "Site is under attack; you must use Tor Browser Desktop to access it". Then, the users can use Tor Browser Desktop, still get access with PoW, and the attacker spends attack resources just to be slightly annoying, instead of actually disabling a site fully.
Also, it is important to note that we're still lost in the weeds here, and are worrying prematurely. The actual plan is not clear from this monster disorganized ticket. I am not happy with the JIT or GPL issue either, but it will work for the incentive purpose for an alpha and get the code some exercise, while we evaluate and integrate yespower, or develop an in-house design.
Downstream apps that repackage Tor Browser Alpha binaries into non-GPL bundles are still shipping a GPL binary that has full unmodified source available (from us), even if their repackage is not GPL. There is no license violation, because that binary can be reproduced from our sources and modified, as per the GPL. The GPL is not contagious to other separate non-GPL binaries in a bundle.
Great, thanks @mikeperry and @beth, seems like option 4 makes the most sense with the optional flag. I'm all for cutting ourselves out of these weeds and figuring out what the worries are when we can see them!
Yes. I look forward to everybody worrying about the same things, together, with the full design, design history, and plan in mind. At least then we'll be on the same page.
I wanted to rework these checklists two weeks ago, but I also have other things to worry about.
Ok @beth asked about blake2b last week. I got ahold of asn and he did some design archeology for us. He said the following two mailing list posts are relevant, specifically the summary of Solar Designer's comments about EthHash's anti-DDoS trick at the end of the first post:
Hi Mike, thanks much for including these references here!
I did read through all these mailing list posts from 2020 earlier, and it helps a lot to understand the overall strategy for quickly rejecting low-effort requests. From my understanding though this only requires two properties from the hash: speed, and uniform distribution. I don't see any specific reasons why we'd choose blake2b instead of say blake3 or even sha3.
Anyway, it shouldn't matter too much for performance, since I doubt that hashing a few dozen bytes will be the main bottleneck if someone is spamming c-tor with miscomputed PoW introductions. I can test this. I'm fully in favor of just implementing the spec as it's written and not endlessly bikeshedding about hash functions, I just wanted to understand the choices in the existing design.
Hey @beth. Agreed, I think blake2b was chosen just because it's fast and probably because tevador felt comfortable with it. I think it's fine to swap it with any other fast hash function, to avoid adding an extra dependency.
I think it might also be a good idea to re-examine tevador's intuition about this trick not being particularly effective. If that's the case, perhaps it shouldn't be implemented in the first place.
My understanding is that the hashing itself isn't done purely for quick verification, but also as a method of implementing the effort check. That part seems totally reasonable to me.
I'm a bit rusty on this, but I think the effort is checked both in the top-half protection of R * E > UINT32_MAX and also in the bottom-half protection of equix_verify(C || N || E, S) != EQUIX_OK.
This makes me wonder how often we expect the "Client checks if R * E <= UINT32_MAX" event to trigger. The answer here impacts both how easily legitimate clients create valid tickets, and also how much top-half protection this blake2b feature adds in the first place.
This makes me think that the proposal is underspecified there, since R is the output of blake2b (like 32 bytes) but then it's multiplied by an unsigned integer and compared against UINT32_MAX. I think this doesn't make much sense.
Furthermore -- regardless of how the above computation is defined -- since E is attacker controlled (but MUST be above MIN_EFFORT), an attacker who wants to overwhelm the top half will set E==MIN_EFFORT and then hope that blake2b will return a small enough number when given trash. How often this happens depends on the parameter tuning above.
Also, I think this top-half protection has not been taken into account in the various parameter tuning sections of the proposal. If getting past the top-half protection is hard, the numbers will look different. If getting past the top-half protection is easy, then maybe it's not useful?
Feel free to ignore this nitpicking. I was hoping that if this blake2b top-half protection doesn't work that well, it might be removed from the proposal and hence make implementation slightly easier.
The equix problem itself isn't adjustable-effort, the point of E is to control on average how many equix solutions must be computed for a particular PoW solution. This is the effort adjustment.
The point of using E in both places (inside the challenge, and in the R * E check) is to force the PoW solution to commit to a particular effort value before calculating equix_solve().
The answer to your question about how often the client passes the R*E check is, on average, proportional to 1/E: R is a uniformly distributed uint32, so E adjusts the probability at which that test passes.
The point of this design from my understanding, is that you need both an asymmetric puzzle (equix) and a way to verify that the puzzle was run for a particular number of cycles (blake2 and the R * E check). A legit PoW solver would need to do both steps in order to find a valid solution. An attacker could skip the equix_solve and just run blake2, finding low-cost solutions that will pass the R*E check but fail equix_verify(). This should be fine, since equix_verify() is also pretty fast, but we need to test this case.
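To make the two halves concrete, here is roughly the service-side check order as I read the proposal; the helper names below are hypothetical wrappers, not the real functions:

```c
#include <stddef.h>
#include <stdint.h>

/* Rough service-side verification order as I read the proposal.
 * `challenge` here is the concatenation C || N || E; helpers are
 * hypothetical. */
extern uint32_t blake2b_32(const uint8_t *challenge, size_t challenge_len,
                           const uint8_t *solution, size_t solution_len);
extern int hs_pow_equix_verify(const uint8_t *challenge, size_t challenge_len,
                               const uint8_t *solution, size_t solution_len);

static int
hs_pow_verify_sketch(const uint8_t *challenge, size_t challenge_len,
                     const uint8_t *solution, size_t solution_len,
                     uint32_t effort)
{
  /* Top half (cheap): R = blake2b_32(C || N || E || S). R is roughly
   * uniform, so only about 1 in `effort` candidate solutions passes,
   * which forces an honest solver to grind ~effort equix solutions. */
  uint32_t r = blake2b_32(challenge, challenge_len, solution, solution_len);
  if ((uint64_t)r * effort > UINT32_MAX)
    return 0;

  /* Bottom half (heavier): the equix solution must verify against the
   * same challenge. An attacker who only grinds the blake2b check gets
   * rejected here, at the cost of one equix_verify() for us. */
  if (!hs_pow_equix_verify(challenge, challenge_len, solution, solution_len))
    return 0;

  return 1;
}
```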
Oh I should mention that blake2b has an adjustable output size, and it's specified here that we are talking about a 32-bit hash. I looked closer at this: blake2b is internally always 512-bit, and the 32-bit hash is produced by truncating to the first (least significant) 32 bits of the full hash, with this adjusted hash length also included in the blake2b initial state.
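A quick way to see that "length in the initial state" detail, assuming libb2's one-shot blake2b() helper: the 4-byte digest is not simply the first 4 bytes of the 64-byte digest of the same input.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <blake2.h>  /* libb2's one-shot API */

int
main(void)
{
  const char msg[] = "example challenge bytes";
  uint8_t short_digest[4];
  uint8_t full_digest[64];

  /* The requested digest length is part of blake2b's parameter block
   * (and hence its initial state), so the two outputs below differ. */
  blake2b(short_digest, sizeof(short_digest), msg, strlen(msg), NULL, 0);
  blake2b(full_digest, sizeof(full_digest), msg, strlen(msg), NULL, 0);

  printf("4-byte digest:        %02x%02x%02x%02x\n",
         short_digest[0], short_digest[1], short_digest[2], short_digest[3]);
  printf("64-byte digest[0..3]: %02x%02x%02x%02x\n",
         full_digest[0], full_digest[1], full_digest[2], full_digest[3]);
  return 0;
}
```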
Here's some metrics port output from the testbed I've had running the current version of this patch:
The testbed setup is not currently public but the basic specs are:
Four middle relays with other DoS mitigations disabled.
One onion service acting as defender, with HiddenServicePoWQueueRate set unusually low for testing purposes (10/sec).
One regular client that connects to the defender repeatedly in a single loop.
One desktop computer running a simulated DDoS consisting of individual attacks that overlap and periodically restart. The attacker uses a modified copy of tor, and it also participates in the PoW algorithm.
The AIMD effort estimation system is obviously oscillating here but I don't think that's necessarily a problem. This is going to be a complex dynamic system and tuning it any further is going to require a more accurate attack environment than my little simulator can provide. We do see that sometimes the single client will get unlucky with the current state of the effort estimator and pqueue, and the blank spots in the top-left graph indicate points where the client is stuck until its request times out and it tries again.
The AIMD effort estimation system is obviously oscillating here but I don't think that's necessarily a problem. This is going to be a complex dynamic system and tuning it any further is going to require a more accurate attack environment than my little simulator can provide.
Yes, the AIMD will oscillate. This is just what AIMD systems do. The good news is they are convergent even if not fully tuned (and there are actually proofs about this, wrt TCP).
My fork of tevador's sim did not commit the new graphs to that repo, but if you re-run it, you can see the oscillation there, too. Though you need to hardcode input and graph choices, and such. https://github.com/mikeperry-tor/scratchpad/tree/master/tor-pow
We do see that sometimes the single client will get unlucky with the current state of the effort estimator and pqueue, and the blank spots in the top-left graph indicate points where the client is stuck until its request times out and it tries again.
Yeah, this is a bit unfortunate, but not the end of the world. We could add metricport counters for this, and log the frequency of this to the log, to try to minimize it later.
I imagine @mikeperry might have a more accurate list of what still needs to be done, but from looking at the lists above and the current state of the patch the main to-dos I have at this time are:
Improve the reliability at which we can guess a good effort value. The current system works alright, but efforts tend to oscillate and it's not uncommon for clients to pick a lower-than-average value that turns out not to work.
Nitpick the spec a little, make sure it is clear about the various endianness decisions we make
Reconsider both our upper and lower limits on effort
Reduce caching period of pow-enabled hsdescs to 1-2 min
Test loop for pow-speed, oom, and binomial dist estimation of how long a given pow effort will take
Control port flags on circs for POW_EFFORT, POW_TIME
Control port event for "POW too high"?
Nonce replay detection bloom filter, or nickm's intro point reply idea
Re-examine the approach to event loop integration, it seems CPU-hungry
Still some lingering TODO/XXX comments in the patch
I think we have some major blockers taken care of though:
Portable across CPU architectures, big and little endian
Passes CI
Explicit GPL build mode
Stable and unit-tested wire protocol
Observable via MetricsPort on the service side
It feels like a good time to start code review, since I think the code as it stands is ready for main even if we may still want to hack on it a bit before it's really considered done. I'll attach a merge request shortly.
Been working on the above loose ends, mostly effort estimation, and trying to build a tor browser with PoW enabled to see how that integrates. The MR has changes to the effort upper/lower limits, and a change to the way efforts are estimated when the queue is full of 0-effort requests (as it often will be during an attack).
I'm not sure the current algorithm is really what we want. It's not strictly AIMD, since the first leg of update_suggested_effort() with its max_trimmed_effort test can either raise or lower the suggested effort. When the pqueue is flooded with 0-effort requests, we will never reach the normal multiplicative decrease leg of update_suggested_effort(). I'm considering replacing that mechanism with something that looks at the minimum effort of a request we successfully processed in the last time period. I'm not sure yet that this change is actually necessary.
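For comparison, textbook AIMD is just additive increase on overload and multiplicative decrease otherwise; a purely illustrative sketch, not the code in update_suggested_effort():

```c
#include <stdint.h>

/* Textbook AIMD update, for comparison only. Constants and structure
 * are illustrative, not from the branch. */
#define EFFORT_ADD_STEP   10  /* additive increase when overloaded */
#define EFFORT_DECAY_NUM  2   /* multiplicative decrease: 2/3 */
#define EFFORT_DECAY_DEN  3

static uint32_t
aimd_update_effort(uint32_t suggested_effort, int queue_overloaded)
{
  if (queue_overloaded) {
    /* Additive increase: demand exceeds what we can serve. */
    return suggested_effort + EFFORT_ADD_STEP;
  }
  /* Multiplicative decrease: we kept up, so gently lower the bar. */
  return (suggested_effort * EFFORT_DECAY_NUM) / EFFORT_DECAY_DEN;
}
```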
I'll likely remove the XXX comments about the event loop integration, as it seems fine now that we have rate limiting on the dequeue side.
I'm not sure the current algorithm is really what we want. It's not strictly AIMD, since the first leg of update_suggested_effort() with its max_trimmed_effort test can either raise or lower the suggested effort. When the pqueue is flooded with 0-effort requests, we will never reach the normal multiplicative decrease leg of update_suggested_effort(). I'm considering replacing that mechanism with something that looks at the minimum effort of a request we successfully processed in the last time period. I'm not sure yet that this change is actually necessary.
If you do make some changes, please also try them in that python sim so we can have consistency between the two, for spot checking that the issues we see are not just C-Tor being C-Tor :)
hmm! one big difference between this sim and what we have implemented now is that the sim assumes clients will retry with a doubled effort after a timeout. I haven't seen that on the todo lists yet, so:
Client side support for increasing (doubling?) effort after a timeout
As I mentioned in the meeting: nice find! This is very important, because a timeout is also a congestion signal to the client (similar to TCP). Doubling every timeout also helps ramp up the difficulty quickly, similar to slow start.
I just checked, and this doubling is mentioned in the spec. However, we should probably clean up the spec XXX's, and make the spec much more clear that it is deliberately managing this queue similar to slow start+AIMD congestion control, and provide a link to this new sim, so this behavior is not replaced with something "simpler" in arti, that breaks the convergence properties.
cool! I haven't done much with the effort sim yet but it's still on my todo list. The effort estimation seems to be behaving much better now, and it seems to even work fine when the token bucket limiter is disabled and the dequeue rate is instead limited by CPU time.
With the changes settling down on the c-tor side I had a chance to think more carefully about this effort_sim, and I think we're still basically doing the same algorithm here and in c-tor, within the bounds of what's represented well in the sim anyway. I think it captures the macro-scale activity pretty well. The most obvious difference is that the scale is very different. In the real system right now, an effort of '1' is small but still useful, and an effort of 1000 is enough that the user will wait maybe 10 seconds. The efforts in the 6 digits are well above our current client-side limit (10000) and would take hours to run.
The client-side effort increase is done a little differently. The effort sim doubles each time, whereas the implementation in !702 (merged) right now uses (suggested_effort * (1 + retry_counter)). I think doubling for every retry is too aggressive and the checked in code is closer to what we want, but this whole business isn't very comparable between sim and c-tor because c-tor's retry behavior is still confusing and possibly buggy.
The other small change is that I modified the 'increase' strategy to ensure we are always adding at least 1 to the effort. This change would be hard to observe with the current sim settings, because of the scale difference.
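Side by side, the two client-side retry rules being compared look like this (illustrative only):

```c
#include <stdint.h>

/* The two client-side retry rules being compared; retry_counter is 0 on
 * the first attempt and grows by 1 per timeout. Illustrative only. */
static uint32_t
sim_retry_effort(uint32_t suggested_effort, unsigned retry_counter)
{
  /* effort_sim: double the effort on every timeout. */
  return suggested_effort << retry_counter;
}

static uint32_t
ctor_retry_effort(uint32_t suggested_effort, unsigned retry_counter)
{
  /* !702: linear ramp, suggested_effort * (1 + retry_counter). */
  return suggested_effort * (1 + retry_counter);
}
```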