The ref10 Ed25519 code currently being used is rather slow, and does not support batch verification. Andrew Moon's Ed25519-donna (https://github.com/floodyberry/ed25519-donna) is both faster and provides batch verification, in addition to the optimized Curve25519 scalar basepoint multiply that legacy/trac#9663 (moved) depends on.
We should use Ed25519-donna instead of ref10 for our Ed25519 implementation, at first with cross-checking/fallback, and eventually as our sole Ed25519 provider.
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information
Child items
0
Show closed items
No child items are currently assigned. Use child items to break down this issue into smaller parts.
Linked items
0
Link issues together to show that they're related.
Learn more.
For those that are too lazy to look at legacy/trac#9663 (moved) to pull out things relevant here:
The "on my desktop with TurboBoost etc" x86_64 performance gains look like:
Generate public key: 60.96 usec -> 22.22 usec
Sign a short message: 62.16 usec -> 23.84 usec
Verify signature: 161.53 usec -> 77.68 usec
Blind a public key: 139.10 usec -> 70.78 usec
Convert public point from curve25519: 9.81 usec -> 5.30 usec
TODO/Open questions:
Write the self-test/fallback code.
Actually use the batch verification code.
Is it worth potential pain to enable SSE2 on ia32 targets? (SSE2 performance appears to be worse with x86_64). Benchmarks needed, it is probably an ok assumption that the "portable" 32 bit code outperforms ref10 though.
Someone who cares maybe should unbreak the Visual Studio build (Apparently the VS version of -donna doesn't have SSE2 assembly, but it should build).
It may be sort of annoying to review since the branch also does legacy/trac#9663 (moved), however only b61fa4c59c4d3c9226f4514521a5193b77a911dd really is the Curve25519 ntor optimization, the rest falling under this ticket (pulling in donna, adding support for our constructs, adding test cases etc).
I'm not sure if the struct full of function pointers approach I took for runtime Ed25519 implementation selection was the best way to do things, but this is both easy to rip back out, and easy to add another Ed25519 implementation if someone happens to write ed25519-super-turbo-2 or whatever.
For Ed25519, the unit tests fuzz all the things, the runtime sanity check calls the donna internal test code, and validates that public key derivation/signing/checking signatures appears to be working (Maybe this should fuzz as well, but I don't think it will really catch anything).
TODO: Batch verification, leaving it up to nickm (I left a comment in the #if 0ed out code where there's a bug...)
(I spoke too soon, it turns out I hadn't updated my Xcode project to include the new source files.)
OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase?
tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined
expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.
The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:
ed25519_donna_keygen
ed25519_donna_pubkey
ge25519_scalarmult_base_niels
curve25519_sub_reduce
tor/src/ext/ed25519/donna/ed25519.c:202:3: The left operand of '+' is a garbage value (within a call to 'ge25519_scalarmult_base_niels')
which is actually line 85 in curve25519-donna-64bit.h:
I've tried zeroing out most of the variables involved, but I might have missed some.
I can't to work out how to fix this analysis issue. I wonder if the assembly is confusing clang, but it's worked fine with other assembly in the past.
Should we be using the di_ops functions for memset, memcpy and similar?
clang would prefer iters to be a #define
{{{
test_crypto_ed25519_fuzz_donna(void *arg)
{
const int iters = 1024;
uint8_t msg[iters];
}}}
tor/src/test/test_crypto.c:1643:15: Variable length array folded to constant array as an extension
Fixed. (116a56a234631978c46a1bf9ac1649eef8ae6bf4)
OS X defines an ALIGN macro to calculate pointer alignment based on a passed pointer value. It's not the same as the compiler alignment attribute. Can we rename the macro across the ed25519/donna codebase?
tor/src/ext/ed25519/donna/ed25519-donna-portable.h:23:10: 'ALIGN' macro redefined
I'd rather not (OSX defining ALIGN to me is "shitting up the global namespace), since it makes it even harder to fold in changes to upstream. There's a pull request open that undef's ALIGN (https://github.com/floodyberry/ed25519-donna/pull/21), which keeps the codebases closer.
Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)
expand256_modm doesn't appear to initialise the bytes of work above len, or, if len is less than 32, the bytes of out above 32.
Huh? unsigned char work[64] = {0}; The entirety of work is initialized to 0. This is language standard behavior. All of out gets set as well (typedef bignum256modm_element_t bignum256modm[5];).
Should we be using the di_ops functions for memset, memcpy and similar?
None of the memsets are in danger of being optimized away (I use memwipe where this is a possibility).
The one place where a data dependent memcmp is used outside of tests is in the batch verify code, in a routine explicitly tagged as vartime I will assume the implementer knew what he was doing.
There's also a typo in README.tor: "next-gen hidden srevice descriptors"
Fixed. (d83a65041b648a30835be29a9a54cfe34ab5608b)
The clang static analyzer to complain about garbage values being passed to the + operator in the call stack:
Are these actual issues or just false positives? If they're false positives, I'm inclined not to change anything (squashing things just to appease tools isn't behavior I particularly like).
I built using this incredibly convoluted set of compiler command-line flags with clang-3.7:
clang -g -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -DPARANOIA -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fsanitize=address -fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt -fno-omit-frame-pointer -fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -O0 -arch x86_64
And I ran out of registers:
In file included from src/ext/ed25519/donna/ed25519.c:18:In file included from src/ext/ed25519/donna/ed25519-donna.h:101:src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available "movq %0, %%rax ;\n" ^src/ext/ed25519/donna/ed25519-donna-64bit-x86.h:14:3: error: inline assembly requires more registers than available2 errors generated.make[1]: *** [src/ext/ed25519/donna/src_ext_ed25519_donna_libed25519_donna_a-ed25519.o] Error 1
But a minimal configure with no extra clang options works fine:
./configure --with-openssl-dir=/opt/local --with-libevent-dir=/opt/local
This appears to be an issue with the options I'm using taking up registers. I bet it's either the sanitizers, or the debugging / optimization-disabling options.
I'll try to work out which. Any pointers in the right direction would be appreciated.
This set of flags also works fine:
-O2 -funroll-loops -ffp-contract=on -fno-omit-frame-pointer -mrelax-all -fstrict-aliasing -Wstrict-aliasing -arch x86_64
Leaving me with the following (deduplicated) potential problematic options:
-fasynchronous-unwind-tables -fno-optimize-sibling-calls -fno-inline -fstack-protector-all -fsanitize=undefined -fno-sanitize-recover=all -fsanitize=address -O0
I'll do some builds with each option and see.
If the sanitizers are the issue, I'll add the problematic code to:
-fsanitize-blacklist=contrib/clang/sanitize_blacklist.txt
With the first set of flags in comment 10, I see the following performance gains on OS X 10.10.4, clang 3.7, x86_64, 2 GHz Intel Core i7:
##### ed25519Ed25519-donna = disabled.Generate public key: 66.08 usecSign a short message: 85.41 usecVerify signature: 251.77 usecConvert public point from curve25519: 13.07 usecBlind a public key: 230.85 usecEd25519-donna = enabled.Generate public key: 25.10 usecSign a short message: 26.55 usecVerify signature: 108.35 usecConvert public point from curve25519: 5.65 usecBlind a public key: 120.96 usec
-fsanitize=address causes clang to run out of registers when compiling ge25519_scalarmult_base_choose_niels in src/ext/ed25519/donna/ed25519-donna-64bit-x86.h.
Adding the function to the sanitizer blacklist resolves this issue, as long as people use it.
I can make a branch for it if you'd like.
It would add:
# The optimised x86_64 assembly version of ge25519_scalarmult_base_choose_niels# runs out of registers when compiled with ASan fun:ge25519_scalarmult_base_choose_niels
to the end of contrib/clang/sanitize_blacklist.txt
If that all that's needed I can add that and save you the trouble (Especially since I'll probably squash/rebase my branch anyway). Sorry for the extra work, and thanks.
I'm somewhat curious to see if the ntor benchmark also is faster with my additions, but since the donna code appears to be getting built correctly by clang, I'm 99% sure the table based precomputation stuff is faster (as intended).
No worries, better to catch incompatibilities at this stage.
By the way, .gitignore probably needs an entry for
src/ext/ed25519/donna/libed25519_donna.a
It seems to already have wildcard entries for other object files produced during the process.
Major changes are, that I squash/rebased things, and restructured how to handle the tor specific code in donna slightly (ed25591_donna_tor.h, ed25519_tor.c instead of chainsawing up the latter).
This should have all the fixes for the issues found by teor except for the clang static analysis garbage value thing which I assume is a false positive. I opted to solve the asan issue by disabling inline assembly entirely in donna for asan enabled builds. Performance is going to be horrific anyway, so might as well sanitize something.
I'll wait for nickm to get back from vacation at this point, unless the various random people find more issues (unlikely?). I'm inclined to say "use my branch or something that approximates it for the entirely of 0.2.7's lifespan, then yank ref10 in the 0.2.8-alpha timeframe", but that would be more of a nickm call than anything else.
I believe that clang static analyzer complains about garbage values because it doesn't know the length of the char * buffers passed in to ed25519_donna_keygen and ed25519_donna_sign.
As far as I can tell, clang assumes that char *'s are NUL-terminated strings, and expects to see a strlen or similar call. So it gets confused when the byte buffers are walked in 4-byte chunks, up to 64 bytes. (And I'm not sure it's a simple matter to convince clang how that the buffers are 64 bytes, nor do I feel it's a particularly productive activity.)
Yawning, would you mind if we added a comment to those two functions about the spurious errors?
It would save the effort of someone tracking them down again.
But I really don't mind either way.
Yawning, would you mind if we added a comment to those two functions about the spurious errors?
It would save the effort of someone tracking them down again.
But I really don't mind either way.
I don't mind at this point since I split off the tor version of the ed25519 code into it's own file when I did the squash/rebase, mostly for my sanity (In general I want to try to minimize changes to the code that's still common with upstream, but this stuff is mostly tor specific). There's a giant comment describing what our external interface does, is that where something like this should be, or should the comment(s) live inside each function?
They'd be most helpful in each function, where the errors are logged by clang sanitizer.
I'm now confused as to which functions exactly produce the errors, possibly because I moved code around etc. Additionally, I'm not exactly sure if I buy the explanation that Clang wants a strlen, since assuming unsigned char * is a NUL-terminated string seems like a rather poor assumption to be making in the first place.
I'm going to leave things as is, we can always add comments later as needed.