Faster Ed25519 implementation.

changed milestone to %Tor: 0.2.7.x-final in legacy/trac

added component::core tor/tor in Legacy / Trac milestone::Tor: 0.2.7.x-final in Legacy / Trac parent::9663 in Legacy / Trac performance in Legacy / Trac priority::medium in Legacy / Trac resolution::implemented in Legacy / Trac status::closed in Legacy / Trac tor-relay in Legacy / Trac type::enhancement in Legacy / Trac version::tor 0.2.7 in Legacy / Trac labels

Making legacy/trac#9663 (moved) the parent for now, since there's a cross-dependency there, and I've been doing the integration as: https://github.com/Yawning/tor/compare/feature9663

For those that are too lazy to look at legacy/trac#9663 (moved) to pull out things relevant here:

The "on my desktop with TurboBoost etc" x86_64 performance gains look like:

Generate public key: 60.96 usec -> 22.22 usec
Sign a short message: 62.16 usec -> 23.84 usec
Verify signature: 161.53 usec -> 77.68 usec
Blind a public key: 139.10 usec -> 70.78 usec
Convert public point from curve25519: 9.81 usec -> 5.30 usec

TODO/Open questions:

Write the self-test/fallback code.
Actually use the batch verification code.
Is it worth potential pain to enable SSE2 on ia32 targets? (SSE2 performance appears to be worse with x86_64). Benchmarks needed, it is probably an ok assumption that the "portable" 32 bit code outperforms ref10 though.
Someone who cares maybe should unbreak the Visual Studio build (Apparently the VS version of -donna doesn't have SSE2 assembly, but it should build).

Ok, first pass at: https://github.com/Yawning/tor/compare/feature9663

It may be sort of annoying to review since the branch also does legacy/trac#9663 (moved), however only b61fa4c59c4d3c9226f4514521a5193b77a911dd really is the Curve25519 ntor optimization, the rest falling under this ticket (pulling in donna, adding support for our constructs, adding test cases etc).

I'm not sure if the struct full of function pointers approach I took for runtime Ed25519 implementation selection was the best way to do things, but this is both easy to rip back out, and easy to add another Ed25519 implementation if someone happens to write ed25519-super-turbo-2 or whatever.

For Ed25519, the unit tests fuzz all the things, the runtime sanity check calls the donna internal test code, and validates that public key derivation/signing/checking signatures appears to be working (Maybe this should fuzz as well, but I don't think it will really catch anything).

TODO: Batch verification, leaving it up to nickm (I left a comment in the #if 0ed out code where there's a bug...)

Trac:
Status: new to needs_review

Code review using clang (Xcode 6.4 (6E35b) Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)) on OS X 10.10.4.

Build & static analysis:

clang would prefer iters to be a #define

test_crypto_ed25519_fuzz_donna(void *arg)
{
  const int iters = 1024;
  uint8_t msg[iters];

tor/src/test/test_crypto.c:1643:15: Variable length array folded to constant array as an extension

Other than that, there are no build or static analyzer complaints from clang.

(I spoke too soon, it turns out I hadn't updated my Xcode project to include the new source files.)

OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase? tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined

expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.

The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:

ed25519_donna_keygen
ed25519_donna_pubkey
ge25519_scalarmult_base_niels
curve25519_sub_reduce
tor/src/ext/ed25519/donna/ed25519.c:202:3: The left operand of '+' is a garbage value (within a call to 'ge25519_scalarmult_base_niels')
- which is actually line 85 in curve25519-donna-64bit.h:

out[0] = a[0] + fourP0    - b[0]    ; c = (out[0] >> 51); out[0] &= reduce_mask_51;

I've tried zeroing out most of the variables involved, but I might have missed some. I can't to work out how to fix this analysis issue. I wonder if the assembly is confusing clang, but it's worked fine with other assembly in the past.

Should we be using the di_ops functions for memset, memcpy and similar?

clang's static analyzer also complains about the same issue when ed25519_donna_sign calls ge25519_scalarmult_base_niels.

There's also a typo in README.tor: "next-gen hidden srevice descriptors"

Replying to teor:

clang would prefer iters to be a #define {{{ test_crypto_ed25519_fuzz_donna(void *arg) { const int iters = 1024; uint8_t msg[iters]; }}} tor/src/test/test_crypto.c:1643:15: Variable length array folded to constant array as an extension

Fixed. (116a56a234631978c46a1bf9ac1649eef8ae6bf4)

OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase? tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined

I'd rather not (OSX defining ALIGN to me is "shitting up the global namespace), since it makes it even harder to fold in changes to upstream. There's a pull request open that undef's ALIGN (https://github.com/floodyberry/ed25519-donna/pull/21), which keeps the codebases closer.

Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)

expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.

Huh? unsigned char work[64] = {0}; The entirety of work is initialized to 0. This is language standard behavior. All of out gets set as well (typedef bignum256modm_element_t bignum256modm[5];).

Should we be using the di_ops functions for memset, memcpy and similar?

None of the memsets are in danger of being optimized away (I use memwipe where this is a possibility).

The one place where a data dependent memcmp is used outside of tests is in the batch verify code, in a routine explicitly tagged as vartime I will assume the implementer knew what he was doing.

There's also a typo in README.tor: "next-gen hidden srevice descriptors"

Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)

The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:

Are these actual issues or just false positives? If they're false positives, I'm inclined not to change anything (squashing things just to appease tools isn't behavior I particularly like).

Apologies for misreading the code on expand256_modm. I was trying to work out why the static analyzer was complaining.

I can't tell whether they are actual issues, or false positives, just from reading the code.

I built using this incredibly convoluted set of compiler command-line flags with clang-3.7: clang -g -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -DPARANOIA -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fsanitize=address -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -O0 -arch x86_64

And I ran out of registers:

In file included from src/ext/ed25519/donna/ed25519.c:18:
In file included from src/ext/ed25519/donna/ed25519-donna.h:101:
src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available
  "movq %0, %%rax                  ;\n"
  ^
src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available
2 errors generated.
make[1]: *** [src/ext/ed25519/donna/src_ext_ed25519_donna_libed25519_donna_a-ed25519.o] Error 1

But a minimal configure with no extra clang options works fine: ./configure --with-openssl-dir=/opt/local --with-libevent-dir=/opt/local

This appears to be an issue with the options I'm using taking up registers. I bet it's either the sanitizers, or the debugging / optimization-disabling options.

I'll try to work out which. Any pointers in the right direction would be appreciated.

This set of flags also works fine: -O2 -funroll-loops -ffp-contract=on -fno-omit-frame-pointer -mrelax-all -fstrict-aliasing -Wstrict-aliasing -arch x86_64

Leaving me with the following (deduplicated) potential problematic options: -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize=address -O0

I'll do some builds with each option and see.

If the sanitizers are the issue, I'll add the problematic code to: -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt

With the first set of flags in comment 10, I see the following performance gains on OS X 10.10.4, clang 3.7, x86_64, 2 GHz Intel Core i7:

##### ed25519
Ed25519-donna = disabled.
Generate public key: 66.08 usec
Sign a short message: 85.41 usec
Verify signature: 251.77 usec
Convert public point from curve25519: 13.07 usec
Blind a public key: 230.85 usec
Ed25519-donna = enabled.
Generate public key: 25.10 usec
Sign a short message: 26.55 usec
Verify signature: 108.35 usec
Convert public point from curve25519: 5.65 usec
Blind a public key: 120.96 usec

-fsanitize=address causes clang to run out of registers when compiling ge25519_scalarmult_base_choose_niels in src/ext/ed25519/donna/ed25519-donna-64bit-x86.h.

Adding the function to the sanitizer blacklist resolves this issue, as long as people use it.

I can make a branch for it if you'd like. It would add:

# The optimised x86_64 assembly version of ge25519_scalarmult_base_choose_niels
# runs out of registers when compiled with ASan                                 
fun:ge25519_scalarmult_base_choose_niels

to the end of contrib/clang/sanitize_blacklist.txt

If that all that's needed I can add that and save you the trouble (Especially since I'll probably squash/rebase my branch anyway). Sorry for the extra work, and thanks.

I'm somewhat curious to see if the ntor benchmark also is faster with my additions, but since the donna code appears to be getting built correctly by clang, I'm 99% sure the table based precomputation stuff is faster (as intended).

No worries, better to catch incompatibilities at this stage.

By the way, .gitignore probably needs an entry for src/ext/ed25519/donna/libed25519_donna.a It seems to already have wildcard entries for other object files produced during the process.

Take 2: https://github.com/Yawning/tor/compare/feature16467_9663

Major changes are, that I squash/rebased things, and restructured how to handle the tor specific code in donna slightly (ed25591_donna_tor.h, ed25519_tor.c instead of chainsawing up the latter).

This should have all the fixes for the issues found by teor except for the clang static analysis garbage value thing which I assume is a false positive. I opted to solve the asan issue by disabling inline assembly entirely in donna for asan enabled builds. Performance is going to be horrific anyway, so might as well sanitize something.

I'll wait for nickm to get back from vacation at this point, unless the various random people find more issues (unlikely?). I'm inclined to say "use my branch or something that approximates it for the entirely of 0.2.7's lifespan, then yank ref10 in the 0.2.8-alpha timeframe", but that would be more of a nickm call than anything else.

I believe that clang static analyzer complains about garbage values because it doesn't know the length of the char * buffers passed in to ed25519_donna_keygen and ed25519_donna_sign.

As far as I can tell, clang assumes that char *'s are NUL-terminated strings, and expects to see a strlen or similar call. So it gets confused when the byte buffers are walked in 4-byte chunks, up to 64 bytes. (And I'm not sure it's a simple matter to convince clang how that the buffers are 64 bytes, nor do I feel it's a particularly productive activity.)

Yawning, would you mind if we added a comment to those two functions about the spurious errors? It would save the effort of someone tracking them down again. But I really don't mind either way.

Replying to teor:

Yawning, would you mind if we added a comment to those two functions about the spurious errors? It would save the effort of someone tracking them down again. But I really don't mind either way.

I don't mind at this point since I split off the tor version of the ed25519 code into it's own file when I did the squash/rebase, mostly for my sanity (In general I want to try to minimize changes to the code that's still common with upstream, but this stuff is mostly tor specific). There's a giant comment describing what our external interface does, is that where something like this should be, or should the comment(s) live inside each function?

They'd be most helpful in each function, where the errors are logged by clang sanitizer.

Replying to teor:

They'd be most helpful in each function, where the errors are logged by clang sanitizer.

I'm now confused as to which functions exactly produce the errors, possibly because I moved code around etc. Additionally, I'm not exactly sure if I buy the explanation that Clang wants a strlen, since assuming unsigned char * is a NUL-terminated string seems like a rather poor assumption to be making in the first place.

I'm going to leave things as is, we can always add comments later as needed.

I'm merging, but I'll be glad to take more comments in it. Can somebody open another ticket for batch verification?

Trac:
Status: needs_review to closed
Resolution: N/A to implemented

closed

mentioned in issue legacy/trac#16533 (moved)

mentioned in issue legacy/trac#16535 (moved)

moved from legacy/trac#16467 (moved)

added Feature label and removed 1 deleted label

added Relay label and removed Performance + 1 deleted label

Faster Ed25519 implementation.

Child items 0

Activity