This is the ticket for implementing proposal 250, the proposal for having the Tor dirauths collectively generate a fresh random value every day.
OK. Some very initial work can be found at shared-random-refactor1 in my repo.
The main contribution here is refactoring dirvote_compute_consensuses(). This is the function that will eventually create the SR doc, and it was a big clusterfuck. The first three commits of my branch try to refactor it in a way that will allow us to add code to it without complicating it even more.
The final commit (80bd3da) adds in some of the SR stuff that we wrote during camp, plus a few more things. These are just ideas and scaffolding, so don't take them too seriously.
However, I didn't manage to get much further than this, because the consensus signature code is very thick. It seems to me that more refactoring is needed.
The main problem is that when dirvote.c thinks "consensus", it actually means "networkstatus". So all consensus-related functions are written with networkstatus documents in mind, which is totally different from the SR doc. You can see this in how pending_consensus_t is used around the code, and by taking a look at the following functions:
Also, grep for N_CONSENSUS_FLAVORS to see the various places where tor loops over all the flavors and expects to find networkstatus documents underneath. All the above functions will need to be refactored, or their callers will need to handle the SR doc totally differently.
We will need to find a nice solution to this. We could, for example, introduce a new data structure called consensus_t that is a union hosting either a networkstatus or an SR doc, and then write various methods for interacting with it. I made a start on this, but it required shitloads of changes all around the codebase, so I dropped it.
A lame fix here would be to just stuff all the SR info into a networkstatus_t, which might solve a few problems, but it's terrible.
Seems like the "Ah, let's just use a new consensus flavor. What can go wrong." idea is starting to show its complexity :)
So David and I discussed a possible refactoring here that would allow us to introduce SR doc logic into functions like networkstatus_add_detached_signatures().
The idea is that we will make a new data structure called consensus_t, which will be used in places where we don't care about the specific consensus flavor. This data structure will be able to contain a networkstatus_t, a shared_random_doc_t, or any other weird flavor we come up with in the future.
Then we can have some methods for consensus_t which, based on the flavor, return the appropriate data structure.
Then we will need to refactor all the functions mentioned in the previous comment (and a few more) so they can do their thing on a networkstatus_t and also do the same for a shared_random_doc_t. This will require lots of hairy code to be refactored, which is scary, but doable.
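As a rough sketch of the consensus_t idea, the wrapper could be a tagged union with flavor-aware accessors. All type and function names below are hypothetical illustrations, not actual Tor code:

```c
#include <stddef.h>

/* Stub document types standing in for Tor's networkstatus_t and the
 * proposed shared_random_doc_t. Everything here is illustrative. */
typedef struct networkstatus_t { int n_routers; } networkstatus_t;
typedef struct shared_random_doc_t { unsigned char srv[32]; } shared_random_doc_t;

typedef enum {
  CONSENSUS_KIND_NS,   /* a networkstatus-style consensus flavor */
  CONSENSUS_KIND_SR,   /* the shared-random document flavor */
} consensus_kind_t;

/* Flavor-agnostic wrapper: holds exactly one of the two document types. */
typedef struct consensus_t {
  consensus_kind_t kind;
  union {
    networkstatus_t *ns;
    shared_random_doc_t *sr_doc;
  } u;
} consensus_t;

/* Accessor: return the networkstatus, or NULL if this is an SR doc, so
 * generic callers can dispatch on the flavor instead of assuming one. */
networkstatus_t *
consensus_get_networkstatus(const consensus_t *c)
{
  return (c->kind == CONSENSUS_KIND_NS) ? c->u.ns : NULL;
}
```

Callers that today assume a networkstatus_t would go through the accessor and handle the NULL (SR doc) case explicitly.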
Authorities can keep track of the protocol run. Authorities know which protocol phase they are in at any given time, both in the real network and in the testing network. (Unit tests included for the real network.)
Authorities know when a protocol run starts and can do some basic housekeeping, like wiping counters and generating fresh commitment values for themselves.
Authorities can generate actual reveal values. Authorities can generate
some dummy commitment values.
Initial work has started on encoding commits/reveals to votes. Scaffolding
work has also started on parsing commits/reveals from votes.
I also looked at your branch, David. I see you have code for parsing the binary representation of commitments and reveals, and also code for parsing the state file.
I noticed that the state file parsing code is quite similar to the code that we will have to write for parsing commits/reveals from votes, but with statefile code patterns instead of routerparse code patterns.
I'm talking about this type of code:
{{{
+ alg = crypto_digest_algorithm_parse_name(value);
+ if (alg == -1 || alg != SR_DIGEST_ALG) {
+ }
+ value = smartlist_get(args, 1);
+ if (digest256_from_base64(identity, value) < 0) {
}}}
It would be nice to be able to write utility parsing functions that can be used
by both subsystems so that we don't have duplicate code. I wrote some skeleton
code for the routerparse.c code in case you want to think more of how we could
merge those two logics. Check out commit 39cb7ab.
Please see my branch prop250-voting-day5 for further work on
generating commitments and reveals. This is the first part of
switching from the dummy integer commitments/reveals of
prop250-voting-day4, to the signature-based commitments of prop250.
This is still work in progress, but commit e3b6f4f introduces a very
important unit test that shows a skeleton of how commitments and
reveals are supposed to be generated, parsed and verified. Let me know
if this logic seems acceptable to you, or we should change it.
To move forward now, we need to introduce the SR-specific ed25519 auth
key and include its certificate in votes, so that we stop using the
ancient RSA fingerprints and be able to start verifying signatures.
Tasks that need to be done next:
Create and start using the ed25519 SR key and certificate chains. Needs proposal!
Include the shared secret in the consensus.
Use weasel's algorithm in get_testing_network_protocol_phase().
Implement the conflict line (see add_conflict_to_sr_state()).
Implement the maintenance function and has_majority logic (see state_phase_transition()).
Plug ALL the memory leaks (many!)
Fix the disk state (but let's do this later so that all the data structures have stabilized).
[FWIW I left the stupid integer-based commits/reveals as
generate_sr_commitment_stupid() in case we want to do some chutney
testing. Eventually, generate_sr_commitment_stupid() will be replaced
by generate_sr_commitment().]
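To illustrate the commit/reveal flow that the unit test in e3b6f4f is meant to exercise, here is a minimal self-contained sketch: the commit phase publishes only a hash of the reveal value, and verification later recomputes it. The stub hash below is a stand-in for the SHA-256 digest the real code uses; none of this is actual Tor code:

```c
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 8

/* Stub 64-bit FNV-1a digest, standing in for a real crypto hash. */
static void
stub_digest(uint8_t out[DIGEST_LEN], const uint8_t *in, size_t len)
{
  uint64_t h = 0xcbf29ce484222325ULL;
  size_t i;
  for (i = 0; i < len; i++) {
    h ^= in[i];
    h *= 0x100000001b3ULL;
  }
  for (i = 0; i < DIGEST_LEN; i++)
    out[i] = (uint8_t)(h >> (8 * i));
}

/* Commit phase: publish only H(reveal), never the reveal itself. */
void
make_commitment(uint8_t commit[DIGEST_LEN],
                const uint8_t *reveal, size_t reveal_len)
{
  stub_digest(commit, reveal, reveal_len);
}

/* Reveal phase: check the revealed value against the commitment. */
int
verify_commitment(const uint8_t commit[DIGEST_LEN],
                  const uint8_t *reveal, size_t reveal_len)
{
  uint8_t expected[DIGEST_LEN];
  stub_digest(expected, reveal, reveal_len);
  return memcmp(commit, expected, DIGEST_LEN) == 0;
}
```

The real implementation additionally signs commits and binds them to an authority identity, which this sketch omits.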
In my tor branch prop250-nosrkeys-day1 you can find the corresponding code, where we use the ed25519 master key for referencing, and the RSA identity key to detect multiple commits by a single authority. I also fixed the get_state_valid_until_time() function and wrote some unittests.
I think the next step in our simplification process here is to refactor the decide() step which is not really useful anymore now that we don't do majority or conflicts. I imagine that we could make it so that there is no decide step or voted_commits, and we just move valid authoritative commits directly to the sr state during vote parsing.
Latest branch with code rebased on master and improvement: prop250_v4.
This branch has a few new things:
A random number is now always hashed (H(RN)) so we avoid leaking bytes from our PRNG to the network.
Add an SR flag to the vote indicating whether the vote participates in the protocol.
Parse votes and populate an object in the networkstatus_t object. Once done, we only handle those commits once we are sure a vote is valid and from a known authority (we weren't doing that before).
Refactor some functions to be able to support the last bullet above, and improve naming of some functions.
A few things are still missing that I'll get to:
Consensus params for the number of participants.
Torrc option to disable SR support (basically making the dirauth not add the SR flag to the vote).
Some issues with the disk state as well, because we now rely on the RSA fingerprint of a dirauth.
Latest branch with code rebased on master and improvement: prop250_v4.
Great! Please also check my prop250_v4 branch. Based on yours with a few mods.
Specifically:
Some small code tweaks and variable/function renames.
I removed the disaster formula. It's not useful anymore. Please see commit message for more info.
Made the vote parsing logic a bit more robust, I hope.
I renamed and changed the semantics of NumSRParticipants. Now, instead of using it just to count the number of participants, we use it to reject SRVs that have been voted by less than NumSRParticipants dirauths. This makes us more robust against edge cases like the one where the 5th SR dirauth appears during the reveal phase and calculates a different SRV than the other 4.
Also, in the majority calculation I now use the total number of dirauths instead of the number of voters. I did this to mirror the way n_required works in networkstatus_check_consensus_signature().
Let me know if you find these things reasonable!!!
More future things:
We should start actually voting the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths that is now.
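The ((n_auths / 2) + 1) default suggested above amounts to a strict-majority check over all dirauths. A sketch of that check (function name hypothetical):

```c
/* Sketch of the suggested NumSRVAgreements default: keep an SRV only if
 * a strict majority of ALL dirauths (not just the voters) agreed on it. */
int
should_keep_srv_sketch(int vote_count, int n_auths)
{
  int threshold = (n_auths / 2) + 1;
  return vote_count >= threshold;
}
```

For example, with 9 dirauths the threshold would be 5, so an SRV voted by only 4 authorities would be rejected.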
Also, I've been getting these interesting log messages for a while now:
[SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.
Great! Please also check my prop250_v4 branch. Based on yours with a few mods.
Specifically:
Some small code tweaks and variable/function renames.
Commit: d653f8cc17d6e278e16acb8a3101a821f9f42f6e
- Macroify previous/current SRV strings.
Not sure I agree with that. I much prefer typed variables here instead of macros for strings that we only use locally in this file. With variables we get compiler protection, whereas macros are pretty yolo in terms of compiler warnings. Do you see a win for macros?
More future things:
We should start actually voting the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths that is now.
So you mean before merging 250, we should make an ad hoc patch right now? Most dirauths won't upgrade until a stable release, and since we plan to have this in 028-stable, we can make this a big block.
Hrm, consensus params are options in a file for dirauths, so we have to decide on a "hard" number, which in this case would be 5. And we can see how the SR behaves over time and increase it if needed?
Also, I've been getting these interesting log messages for a while now:
{{{
[SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
}}}
but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.
Yeah, I have yet to understand this one, but I think that when the state file is read that option can be missing, and thus we get this warning...
Hello! asn and I are very happy to present to you wonderful reviewers the implementation for proposal 250 along with the final specification:
Both branches are in tor.git and torspec.git: dgoulet/prop250_final_v1
Some notes: we've separated this into 7 commits prefixed with "prop250:", except the first one, which adds a needed tor_htonll/ntohll function to tor utils. The code is mostly contained in two new files (with their headers): shared-random.{c|h} and shared-random-state.{c|h}.
Also, there are attacks on this protocol that we are well aware of, but all are easily detectable, so we've written a DocTor specification that atagar will help us deploy once this is merged.
Finally, we expect this code to run for a long time before the shared random values generated by the authorities are used, so for now you will NOT find anything using them.
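For context, the tor_htonll/ntohll pair mentioned above converts 64-bit values between host and network (big-endian) byte order. A portable sketch of the idea, not the actual Tor implementation:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of host-to-network 64-bit conversion: serialize the value
 * most-significant byte first, then reinterpret in host memory order.
 * (On big-endian builds this is the identity; cf. the WORDS_BIGENDIAN
 * review comment below.) */
uint64_t
sketch_htonll(uint64_t v)
{
  uint8_t b[8];
  uint64_t out;
  int i;
  for (i = 0; i < 8; i++)
    b[i] = (uint8_t)(v >> (8 * (7 - i)));
  memcpy(&out, b, 8);
  return out;
}

/* Inverse: read the in-memory bytes back most-significant first. */
uint64_t
sketch_ntohll(uint64_t v)
{
  uint8_t b[8];
  uint64_t out = 0;
  int i;
  memcpy(b, &v, 8);
  for (i = 0; i < 8; i++)
    out = (out << 8) | b[i];
  return out;
}
```

The round-trip property ntohll(htonll(x)) == x holds regardless of host endianness.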
Please ask questions! This won't be that trivial to review :).
Trac: Type: defect to enhancement; Priority: Medium to High; Severity: Blocker to Normal; Status: new to needs_review
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
Every other multi-word file name in src/or uses underscores.
If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?
There are a lot of missed opportunities for const pointer arguments.
e282227ebc03ae71f1cd6490a2adf2058140c66e:
If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.
72baec6805aaab8d62f2deed8fbc658c82a7d086:
shared-random-state.c:
If we use an asymmetric magic number, we can detect when the two halves are swapped.
{{{
#define SR_DISK_STATE_MAGIC 42424242
}}}
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase
{{{
int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;
}}}
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?
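The refactoring suggested above (one shared round-number function that honors the voting offset, so the same mistake can't be made in several places) might look roughly like this. Constants and signatures are assumptions for the sketch, not Tor's actual code:

```c
#include <time.h>

#define SR_N_ROUNDS 12  /* rounds per phase, cf. SHARED_RANDOM_N_ROUNDS */
#define SR_N_PHASES 2   /* commit + reveal, cf. SHARED_RANDOM_N_PHASES */

typedef enum { SR_PHASE_COMMIT, SR_PHASE_REVEAL } sr_phase_sketch_t;

/* Single place that maps a valid-after time to a round number, so the
 * offset (cf. TestingV3AuthVotingStartOffset) can't be honored in one
 * caller but forgotten in another. */
int
get_round_number_sketch(time_t valid_after, int voting_interval, int offset)
{
  int total_rounds = SR_N_ROUNDS * SR_N_PHASES;
  return (int)(((valid_after - offset) / voting_interval) % total_rounds);
}

/* Phase is derived from the round number, never computed separately. */
sr_phase_sketch_t
get_phase_sketch(time_t valid_after, int voting_interval, int offset)
{
  int round = get_round_number_sketch(valid_after, voting_interval, offset);
  return (round < SR_N_ROUNDS) ? SR_PHASE_COMMIT : SR_PHASE_REVEAL;
}
```

With this shape, get_state_valid_until_time() and get_sr_protocol_phase() would both call the same round calculation instead of each re-deriving it.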
Why do we have disk_state_validate() and disk_state_validate_cb()?
Should one call the other? Should they check _magic?
Should they check ValidAfter < ValidUntil?
disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.
disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.
I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.
disk_state_parse_sr_values() can assert its arguments are not NULL.
disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.
In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?
disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.
I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)
We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.
In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.
Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.
shared-random-state.h:
I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?
{{{
/* Private methods (only used by shared-random.c): */
}}}
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
Every other multi-word file name in src/or uses underscores.
Fixed. Extra commit here since this applies to almost all commits.
If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?
There are a lot of missed opportunities for const pointer arguments.
Fixed most of those. See "fixup" commits
e282227ebc03ae71f1cd6490a2adf2058140c66e:
If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.
Fixed!
72baec6805aaab8d62f2deed8fbc658c82a7d086:
shared-random-state.c:
If we use an asymmetric magic number, we can detect when the two halves are swapped.
{{{
#define SR_DISK_STATE_MAGIC 42424242
}}}
Fixed
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge in a separate ticket.
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Changing this part of the code after the extensive testing we did could be very delicate. asn is looking at possible solution.
Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase
{{{
int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;
}}}
Definitely better.
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?
No, a valid_after time that is now is valid; it's what we set at startup, since we have no valid_after time until the first consensus.
Why do we have disk_state_validate() and disk_state_validate_cb()?
We must have a callback for the "state file" API; NULL is not accepted. Furthermore, if the callback returns an error, right now tor does a wonderful assert(0)... so it has been ignored.
Should one call the other? Should they check _magic?
This is done with CONFIG_CHECK(), which is called multiple times when you use config_assign().
Should they check ValidAfter < ValidUntil?
Fixed! (But the correct check is >= here :)
disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.
Fixed
disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.
Fixed
I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.
Fixed with the commit above.
disk_state_parse_sr_values() can assert its arguments are not NULL.
Fixed with commit above
disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.
Fixed.
In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?
Indeed. I think you are right. We do put our commit on disk during the commit phase so better to cleanup memory here to be safe.
disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.
Fixed.
I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)
Not much. We avoid an alloc/free every time we update the state, which during a voting period happens often. (Probably not enough to be noticeable, but still.)
We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.
We assume that everything in our memory state is coherent, so the disk update can only use values that are usable. I think it's fine: if the memory state has crazy stuff in it, things will go bad whether we put it on disk or not.
In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.
I've added a needed memwipe in reveal_encode, since we do encode our commitment during the commit phase, so the reveal value should only live in the commit object's memory and not on the stack.
Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.
For the get_, yes, I fixed it, but not for the set_, since digestmap doesn't handle a const.
shared-random-state.h:
I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?
{{{
/* Private methods (only used by shared-random.c): */
}}}
Ah, by that we mean methods that are ONLY used by shared-random.c, so "private" to the subsystem, but not static because they need to be visible.
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
{{{
#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];
}}}
I would really want to fix legacy/trac#17868 (moved) instead of a temporary workaround macro. You'll notice that there are still XXX: at that place and it's to show the obvious bad thing of +2.
Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.
Hrm, both the code and comment do the same thing, or am I missing something?
Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?
We'd prefer not to, since this is data from the network and thus potentially untrusted.
reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
Plausible. I would do that in a later iteration, though. Having them separated gives us more room for better logging, and if one of them changes in some way, we can adapt easily.
In sr_parse_srv(), why do we use base16 for the SRV, but base64 for the commits?
No reason... We'll make it base64.
In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.
We do, and the top comment of the function details the order of args. What extra comment do you have in mind?
I have 5 more commits to review. (Please add fixups to the end of the branch so I don't lose my place.)
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Hmm, I would like to do that, but it doesn't seem so easy. dirvote_recalculate_timing() has not been called yet when we call state_new(), since we call that at boot and we haven't voted at that point.
I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure if it's actually correct. That function updates the global voting_schedule structure, which is then used by other parts of the codebase, and I'm not sure if initializing it with 'now' set to the next 'valid_after' would be correct here :x
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge in a separate ticket.
Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)
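A sketch of that replacement: the macro becomes a function that consults a testing-network option and falls back to the default. The option name and struct are invented for this sketch, not actual Tor options:

```c
#define SR_N_ROUNDS_DEFAULT 12  /* current SHARED_RANDOM_N_ROUNDS value */

/* Stub for the relevant bits of or_options_t; the option name
 * TestingSharedRandomNRounds is hypothetical. */
typedef struct {
  int TestingTorNetwork;
  int TestingSharedRandomNRounds;
} options_sketch_t;

/* Replace the hard-coded macro with an option lookup, only honored in
 * test networks so the live protocol behavior is unchanged. */
int
get_sr_n_rounds(const options_sketch_t *opts)
{
  if (opts->TestingTorNetwork && opts->TestingSharedRandomNRounds > 0)
    return opts->TestingSharedRandomNRounds;
  return SR_N_ROUNDS_DEFAULT;
}
```

This keeps 12 as the default everywhere while letting chutney networks run a much shorter protocol cycle.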
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
{{{
#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];
}}}
I would really want to fix legacy/trac#17868 (moved) instead of a temporary workaround macro. You'll notice that there are still XXX: at that place and it's to show the obvious bad thing of +2.
Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.
Hrm, both the code and comment do the same thing, or am I missing something?
The commit_decode() function header comment says:
{{{
base64-encode( H(REVEAL) || TIMESTAMP )
}}}
The code says:
{{{
/* First is the timestamp (8 bytes). */
commit->commit_ts = (time_t) tor_ntohll(get_uint64(b64_decoded));
offset += sizeof(uint64_t);
/* Next is hashed reveal. */
memcpy(commit->hashed_reveal, b64_decoded + offset,
       sizeof(commit->hashed_reveal));
}}}
> > Can we assert that `decoded_len == SR_COMMIT_LEN`? Is that useful?
>
> We prefer not since this is data from the network so potentially untrusted.

OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN? Can we make that explicit in a comment?

> > reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
>
> Plausible. I would do that in a later iteration though. Having both separated gives us more room for better logging and if one of those change in some way, we can adapt easily.

The risk of leaving them duplicated is that bugs get fixed in one and not the other. Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.

This has already happened to these two functions:
* commit_decode() uses offset, reveal_decode() uses a hard-coded 8.
* the "too small" warning in reveal_decode() is less detailed, but could be identical to commit_decode()

So I have a more radical suggestion for you:
* you have two groups of 3 struct fields that are the same types, and have the same operators performed on them;
* abstract these 3 fields into their own struct type (ts, hashed, encoded)
* add operations for encoding and decoding this type (and perhaps, turning this authority's reveal into its commit)
* refactor the existing code to call these operations

That way, the interface stays the same, you can add any extra logging you need in the commit- or reveal-specific functions, but you get the benefit of using the same code for encoding/decoding commits and reveals.

> > In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional? A comment either way would help.
>
> We do, and the top comment of the function details the order of args. What extra comment do you have in mind?

If an authority supplies 5 arguments in its vote, we will ignore the fifth. There's no comment documenting whether this is intentional or not. And should we ignore the entire line instead?
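The shared-struct suggestion could be sketched like this (types, names, and field sizes are illustrative only): one wire type with one encoder and one decoder, so the commit and reveal code paths cannot drift apart.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 32
#define SKETCH_ENCODED_LEN (8 + SKETCH_HASH_LEN)

/* One struct for both commits and reveals: a timestamp plus a
 * hash-sized payload (H(REVEAL) for commits, the value for reveals). */
typedef struct sr_wire_blob_t {
  uint64_t ts;
  uint8_t hashed[SKETCH_HASH_LEN];
} sr_wire_blob_t;

/* Encode: big-endian timestamp first, payload second, matching the
 * TIMESTAMP || H(REVEAL) layout discussed above. */
void
blob_encode(uint8_t out[SKETCH_ENCODED_LEN], const sr_wire_blob_t *b)
{
  int i;
  for (i = 0; i < 8; i++)
    out[i] = (uint8_t)(b->ts >> (8 * (7 - i)));
  memcpy(out + 8, b->hashed, SKETCH_HASH_LEN);
}

/* Decode is the exact inverse; both commit and reveal parsing would
 * call this single function instead of duplicating offsets. */
void
blob_decode(sr_wire_blob_t *b, const uint8_t in[SKETCH_ENCODED_LEN])
{
  int i;
  b->ts = 0;
  for (i = 0; i < 8; i++)
    b->ts = (b->ts << 8) | in[i];
  memcpy(b->hashed, in + 8, SKETCH_HASH_LEN);
}
```

The commit- and reveal-specific wrappers would add their own logging and validation on top, as the review suggests.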
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Hmm, I would like to do that, but it doesn't seem so easy. dirvote_recalculate_timing() has not been called yet when we call state_new(), since we call that at boot and we haven't voted at that point.
I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure if it's actually correct. That function updates the global voting_schedule structure, which is then used by other parts of the codebase, and I'm not sure if initializing it with 'now' set to the next 'valid_after' would be correct here :x
Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?
My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.
Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.
Nitpick: why add a phase_str for "unknown" when you assert rather than use it?
commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.
It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.
I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).
reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.
sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.
sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.
{{{
tor_assert(reveal_num < UINT8_MAX);
}}}
Is this OK?
It should be documented somewhere.
In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.
In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?
47ade62b1811c16c28d73ebe59f47daa16cd4522:
AuthDirSharedRandomness needs a tor manual page entry.
There are missed opportunities for const pointers in this commit.
srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.
get_majority_srv_from_votes() could be significantly simplified by adding to container.c:
{{{
/** Return the most frequent member of the sorted list of DIGEST256_LEN
 * digests in <b>sl</b>. If count_out is non-NULL, set it to the count
 * of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
  return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}
}}}
Then using:
{{{
int count = 0;
value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
/* Was this SRV voted by enough auths for us to keep it? */
if (!should_keep_srv(count)) {
  goto end;
}
}}}
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
8e180aa061f0ea39968f05c85796b26f10a81744:
extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
 * to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.
Let's also make sure we don't assert() on untrusted input in other parsing functions.
extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?
verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.
In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)
8f3f5667754037cccd758eb8631636e2e78fed93:
Looks ok.
3f2014d7aabe2e10163f4f44a8ca46fd5a91f04a:
I'm not going to review the unit tests in detail. (We generally accept unit tests if they work, right?)
src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.
I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
Five times:
{{{
shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
One time:
{{{
shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
8 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
And I also see 12/720 TESTS FAILED. (1 skipped).
The dgoulet/prop250_final_v4 branch also has similar unit test issues.
Now on to the extra commits in dgoulet/prop250_final_v4:
There are twelve extra commits in this branch.
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for test networks is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge, in a separate ticket.
Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)
I'll comment on the ticket from now on. thanks! :)
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.
I'll wait on asn's work on this before refactoring anything.
Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?
We prefer not since this is data from the network so potentially untrusted.
OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN?
Can we make that explicit in a comment?
I think this check at the start of the function should catch it. Past that, there can't be anything past SR_COMMIT_LEN, I think:
{{{
if (strlen(encoded) > SR_COMMIT_BASE64_LEN)
}}}
> > > reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
> > Plausible. I would do that in a later iteration though. Having both separated gives us more room for better logging, and if one of those changes in some way, we can adapt easily.
> The risk of leaving them duplicated is that bugs get fixed in one and not the other.
> Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.
> This has already happened to these two functions:
> * commit_decode() uses offset, reveal_decode() uses a hard-coded 8.
The hardcoded 8 is _BAD_... that is a good find.
[snip]
I'll work on something here.
> > > In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
> > > A comment either way would help.
> > We do, and the top comment of the function details the order of args. What extra argument do you have in mind?
> If an authority supplies 5 arguments in its vote, we will ignore the fifth.
> There's no comment documenting whether this is intentional or not.
> And should we ignore the entire line instead?
True. We should check if we exceed the number of accepted arguments and, if so, reject.
I'll apply the fixes for the above and submit them in the next comment. I would like to avoid multiple threads in this ticket :).
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?
My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.
Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.
The refactoring was a bit more messy than I expected, but the branch seems to work in chutney. The tests still pass without any changes, so that's good.
I also moved get_next_valid_after_time() into dirvote.c as you suggested.
I've reviewed asn's prop250_final_v4 up to ab5a2ffed4fa02d3ac8a2e46b8f3c38a318a7d3a.
Thanks for the refactoring, I am much more comfortable that the new timing code uses the same values as the existing timing code.
Still waiting to hear about the unit tests failures and chutney template.
This is my 3rd comment in a row. Please read all 3 comments.
Ok, FYI I've autosquashed the fixups from the last comment. It took me a while to get
it done cleanly, but now it's done. You have a series of fixups now for this
comment.
Nitpick: why add a phase_str for "unknown" when you assert rather than use it?
get_phase_str() uses the enum value as the index in the phase_str array.
"unknown" is there as a taint if we ever use an uninit phase value.
commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.
I've refactored that a bit to print the encoded value instead, so we can
match them to strings in votes or other logs that print the encoded part.
It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.
Fixed (for both this and the commit_log)
I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).
reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.
I've discussed this with asn, and after coding two different ways of achieving
this, we don't feel super comfortable, for two reasons. First, neither option
makes it very clean, tbh. Second, reveal and commit values are two different
logical objects that just happen to have the same encoding format, so we are
fine leaving them like that: it's little code duplication, and they are simply
different things...
sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.
Fixed! (3 different commits to be squashed on different master commit)
sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.
{{{
tor_assert(reveal_num < UINT8_MAX);
}}}
Is this OK? It should be documented somewhere.
This is indeed a limitation. Proposal 250 does force INT_1(REVEAL_NUM). I've
pushed an update to torspec mentioning the limitation.
In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.
Good catch! Fixed! (two fixup commits)
In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?
Hrm... an assert sounds a bit scary, to just kill an authority. Most of tor's
code just ignores the return value. So I'm guessing having an empty SRV is
valid here in case of an epic error like that, and the authority should
recover after a while.
47ade62b1811c16c28d73ebe59f47daa16cd4522:
AuthDirSharedRandomness needs a tor manual page entry.
Fixed in b37214db72d42af8b25f7f87a5d3c6894c032115.
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.
Fixed.
get_majority_srv_from_votes() could be significantly simplified by adding to container.c:
{{{
/** Return the most frequent member of the sorted list of DIGEST256_LEN
digests in sl. If count_out is non-NULL, set it to the count
of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}
}}}
Then using:
{{{
int count = 0;
value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
/* Was this SRV voted by enough auths for us to keep it? */
if (!should_keep_srv(count)) {
goto end;
}
}}}
Indeed. I've created smartlist_get_most_frequent_digest256_ and changed
around that function. Fixed.
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
8e180aa061f0ea39968f05c85796b26f10a81744:
extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.
Fixed.
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.
Let's also make sure we don't assert() on untrusted input in other parsing functions.
Indeed, very bad. Fixed!
extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?
sr_info in the networkstatus_t structure is allocated with the ns, so on
error the parsing stops and the object disappears on its own.
{{{
if (extract_shared_random_srvs(ns, tokens) < 0) {
log_warn(LD_DIR, "SR: Unable to parse SRV(s)");
goto err;
}
}}}
> verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.
Fixed.
> In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)
True, should be a `LD_DIR` here. Fixed.
> src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.
Yeah, that python one is out of date, and SHA3 is not in python hashlib as far as I know, so a fix to that thing is pending until we can do it in python...
> I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
> Five times:
> {{{
> shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> }}}
> {{{
> 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> 3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
> 4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
> 5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
> 6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
> 7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> 8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> 9 test 0x00000001019df43f main + 639 (testing_common.c:298)
> }}}
I think it's fixed. Since I can't reproduce, I hope you can confirm.
> One time:
> {{{
> shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> }}}
> {{{
> 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> 3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
> 4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
> 5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
> 6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> 7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> 8 test 0x00000001019df43f main + 639 (testing_common.c:298)
> }}}
This one is a bit more confusing to me. I don't see any possible way that assert can be triggered apart from _already_ having a commit in our state when adding it. You'll have to tell me if you hit it again after the fixes, and probably we'll have to dig deeper.
> And I also see `12/720 TESTS FAILED. (1 skipped)`.
[snip]
> I'd still like to run the SRV code in chutney, is there a chutney template for that?
> (I think I'll need to wait for the unit tests to be fixed first.)
You can with that branch. We'll simply use the default values for the supermajority. grep for prefix "SR:" logs and see the sr-state file.
**Trac**: **Status**: needs_revision **to** needs_review
This is my 3rd comment in a row. Please read all 3 comments.
Ok, FYI I've autosquash the fixup from last comment. It took me a while to have
it done cleanly but now it's done. You have a series of fixup now for this
comment.
See branch: prop250_final_v5
The last commit I reviewed was c219b23d47e9c742532354eedae10ca0509a1d78 in prop250_final_v4.
This appears to correspond to 4f5e212cbd34c8cfb701b25a007e88eacbac1f1f "Rename shared random files to use underscore" in prop250_final_v5, which includes an autosquash and a rebase to master.
I'm only going to comment on new commits, and any issues I still have with the old code.
(I'll remove any questions, fixes, and explanations I'm ok with.)
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
 * to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
Most clients use the microdesc consensus, some still use NS.
I don't mind which consensus we use, but:
I think we should have a comment explaining the arbitrary choice, and
I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).
I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
Five times:
{{{
shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
I think it's fixed. Since I can't reproduce, I hope you can confirm.
I still see 5 of these. One example is:
{{{
3 test 0x00000001053d4045 sr_generate_our_commit + 709 (shared_random.c:870)
4 test 0x00000001053d60f1 sr_state_update + 177 (shared_random_state.c:1068)
5 test 0x00000001053d67d3 sr_state_init + 483 (shared_random_state.c:1243)
6 test 0x00000001052c9573 test_sr_get_majority_srv_from_votes + 51 (test_shared_random.c:811)
}}}
> > One time:
> > {{{
> > shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> > }}}
> > {{{
> > 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> > 3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
> > 4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
> > 5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
> > 6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> > 7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> > 8 test 0x00000001019df43f main + 639 (testing_common.c:298)
> > }}}
> This one is a bit more confusing to me. I don't see any possible way that assert can be triggered apart from _already_ having a commit in our state when adding it. You'll have to tell me if you hit it again after the fixes, and probably we'll have to dig deeper.
I still see this one, same location; the line numbers have changed slightly:
{{{
3 test 0x00000001053d6f4f commit_add_to_state + 175 (shared_random_state.c:212)
4 test 0x00000001053d5b50 sr_state_add_commit + 48 (shared_random_state.c:953)
5 test 0x00000001052cac6e test_state_transition + 574 (test_shared_random.c:1000)
}}}
And I also see 12/720 TESTS FAILED. (1 skipped).
Now I see 7/724 TESTS FAILED. (1 skipped).
6 of these are the crashes above, the final one is:
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)
Fixed 136da34.
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)
Fair enough. Fixed 279764a
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
Most clients use the microdesc consensus, some still use NS.
I don't mind which consensus we use, but:
I think we should have a comment explaining the arbitrary choice, and
I actually changed it to use FLAV_NS, which is the main one: if we can't
create it, that's a problem! I also checked that the consensus pointer exists
before using it.
See 63abee2
I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).
Yeah, quite unsure but plausible from what I can see.
[snip]
Now I see 7/724 TESTS FAILED. (1 skipped).
6 of these are the crashes above, the final one is:
Thanks to asn, we found out why the failures happen! See 30edde4.
I've reviewed all the changes up to 30edde4bce7c5f34abb673604cc6e08c43195425.
The code changes are trivially correct, the unit test changes are proven by execution.
All unit tests pass ok with no failures, as does make test-network-all.
I would be ok with this being merged, but I'd like to run it through coverity or clang-scan eventually.
Compiling with clang (with some extra warnings) yields the following integer size complaints:
Fixed in 185189a in my branch prop250_final_v5 in https://github.com/teor2345/tor.git
(You might want to split this into a series of fixups.)
next_valid_after_time should be time_t in:
{{{
time_t
get_next_valid_after_time(time_t now)
{
  int next_valid_after_time;
}}}
sr_parse_srv has the only remaining implicit integer conversion warning - it would be nice to silence it by casting the result of tor_parse_long() to int:
{{{
sr_srv_t *
sr_parse_srv(const smartlist_t *args)
{
  char *value;
  int num_reveals, ok, ret;
  sr_srv_t *srv = NULL;

  tor_assert(args);

  if (smartlist_len(args) < 2) {
    goto end;
  }

  /* First argument is the number of reveal values */
  num_reveals = tor_parse_long(smartlist_get(args, 0),
                               10, 0, INT32_MAX, &ok, NULL);
}}}
ts should be time_t:
{{{
static void
test_encoding(void *arg)
{
  (void) arg;
  int ret, duper_rand = 42;
  /* Random number is 32 bytes. */
  char raw_rand[32];
  uint64_t ts = 1454333590;
}}}
In sr_act_post_consensus, this code can be executed when consensus is NULL - it should be surrounded by if (consensus) like some of the code above it:
{{{
/* Update our state with the valid_after time of the next consensus so once
 * the next voting period start we are ready to receive votes. */
time_t next_consensus_valid_after =
  get_next_valid_after_time(consensus->valid_after);
sr_state_update(next_consensus_valid_after);
}}}
I also get an assertion failure:
{{{
shared-random/utils: [forking]
FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
[utils FAILED]
}}}
But only when compiling under certain clang options (OS X):
I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!
I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):
{{{
2 libsystem_c.dylib 0x9d7f5c34 abort + 156
3 test 0x0057e5bf crash_handler + 383
4 libsystem_platform.dylib 0x990fb01b sigtramp + 43
5 ??? 0xffffffff 0 + 4294967295
6 test 0x0057e440 0x1000 + 5755968
7 test 0x007d20eb parse_iso_time + 1739 (util.c:1713)
8 test 0x007d2134 parse_iso_time + 52 (util.c:1723)
9 test 0x00702157 config_assign_value + 7239 (confparse.c:346)
10 test 0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11 test 0x006fa435 config_assign + 2133 (confparse.c:828)
12 test 0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13 test 0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)
}}}
I'm not sure what to do about this one. I wonder if this also happens with our current state file, since this is plain ISO time parsing. There's nothing different between this state file and the one we have in terms of timestamp parsing...
I've fixed the rest; that is, I took your commit 185189a. Function sr_act_post_consensus now has a check on the pointer. This function can't be called with a NULL consensus for now, but better safe for the future.
Since the above are easy to confirm without fixup commits, I've created a new branch with everything squashed. You can find the fixup commits in prop250_final_v5.
Fully squashed and rebased on git master "ready for merge" branch is: prop250_final_v6
I also get an assertion failure:
{{{
shared-random/utils: [forking]
FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
[utils FAILED]
}}}
I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!
The memcpy(fp, commit.rsa_identity_fpr, sizeof(fp)); copies the uninitialised commit.rsa_identity_fpr into fp. Therefore, clang can assume that commit.rsa_identity_fpr is all zero (or, alternately, commit.rsa_identity_fpr uses memory that is initialised zero). Therefore, you get the same result for both commit_is_authoritative calls.
I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):
{{{
2 libsystem_c.dylib 0x9d7f5c34 abort + 156
3 test 0x0057e5bf crash_handler + 383
4 libsystem_platform.dylib 0x990fb01b sigtramp + 43
5 ??? 0xffffffff 0 + 4294967295
6 test 0x0057e440 0x1000 + 5755968
7 test 0x007d20eb parse_iso_time + 1739 (util.c:1713)
8 test 0x007d2134 parse_iso_time + 52 (util.c:1723)
9 test 0x00702157 config_assign_value + 7239 (confparse.c:346)
10 test 0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11 test 0x006fa435 config_assign + 2133 (confparse.c:828)
12 test 0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13 test 0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)
}}}
I'm not sure what to do about this one. I wonder if this also happens with our current state file, since this is plain ISO time parsing. There's nothing different between this state file and the one we have in terms of timestamp parsing...
You pass 2666 as the year. Any years after 2037 overflow a 32 bit time_t.
See also legacy/trac#18383 (moved) for a related issue in the dirauth code that was causing some assertion failures.