This is the ticket for implementing proposal 250, the proposal for having the Tor dirauths collectively generate a fresh random value every day.
OK. Some very initial work can be found at shared-random-refactor1 in my repo.
The main contribution here is refactoring dirvote_compute_consensuses(). This is the function that will eventually create the SR doc, and it was a big clusterfuck. The first three commits of my branch try to refactor it in a way that will allow us to add code to it without complicating it even more.
The final commit (80bd3da) adds in some of the SR stuff that we wrote during camp, plus a few more things. These are just ideas and scaffolding, so don't take them too seriously.
However, I didn't manage to get much further than this, because the consensus signature code is very thick. It seems to me that more refactoring is needed.
The main problem is that when dirvote.c thinks "consensus", it actually means "networkstatus". So all consensus-related functions are written with networkstatus documents in mind, which is totally different from the SR doc. You can see this in how pending_consensus_t is used around the code, and by taking a look at the following functions:
Also, grep for N_CONSENSUS_FLAVORS to see the various places where tor loops over all the flavors and expects to find networkstatus documents underneath. All the above functions will need to be refactored, or their callers will need to handle the SR doc totally differently.
We will need to find a nice solution to this. We could, for example, introduce a new data structure called consensus_t that is a union hosting either a networkstatus or an SR doc, and then write various methods for interacting with it. I made a start on this, but it required shitloads of changes all around the codebase, so I dropped it.
A lame fix here would be to just stuff all the SR info into a networkstatus_t, which might solve a few problems, but it's terrible.
Seems like the "Ah, let's just use a new consensus flavor. What can go wrong." idea is starting to show its complexity :)
So David and I discussed a possible refactoring here that would allow us to introduce SR doc logic into functions like networkstatus_add_detached_signatures().
The idea is that we will make a new data structure called consensus_t, which will be used in places where we don't care about the specific consensus flavor. This data structure will be able to contain a networkstatus_t, a shared_random_doc_t, or any other weird flavor we come up with in the future.
Then we can have some methods for consensus_t which, based on the flavor, return the appropriate data structure.
Then we will need to refactor all the functions mentioned in the previous comment (and a few more) so they can do their thing on a networkstatus_t and also do the same for a shared_random_doc_t. This will require lots of hairy code to be refactored, which is scary, but doable.
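As a rough sketch of the consensus_t idea, the wrapper could be a tagged union with flavor-aware accessors. All type and function names below are hypothetical illustrations, not actual Tor code:

```c
#include <stddef.h>

/* Stub document types standing in for Tor's networkstatus_t and the
 * proposed shared_random_doc_t. Everything here is illustrative. */
typedef struct networkstatus_t { int n_routers; } networkstatus_t;
typedef struct shared_random_doc_t { unsigned char srv[32]; } shared_random_doc_t;

typedef enum {
  CONSENSUS_KIND_NS,   /* a networkstatus-style consensus flavor */
  CONSENSUS_KIND_SR,   /* the shared-random document flavor */
} consensus_kind_t;

/* Flavor-agnostic wrapper: holds exactly one of the two document types. */
typedef struct consensus_t {
  consensus_kind_t kind;
  union {
    networkstatus_t *ns;
    shared_random_doc_t *sr_doc;
  } u;
} consensus_t;

/* Accessor: return the networkstatus, or NULL if this is an SR doc, so
 * generic callers can dispatch on the flavor instead of assuming one. */
networkstatus_t *
consensus_get_networkstatus(const consensus_t *c)
{
  return (c->kind == CONSENSUS_KIND_NS) ? c->u.ns : NULL;
}
```

Callers that today assume a networkstatus_t would go through the accessor and handle the NULL (SR doc) case explicitly.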
Authorities can keep track of the protocol run. Authorities know which protocol phase they are in at any given time, both in the real network and in the testing network. (Unit tests included for the real network.)
Authorities know when a protocol run starts and can do some basic housekeeping, like wiping counters and generating fresh commitment values for themselves.
Authorities can generate actual reveal values. Authorities can generate
some dummy commitment values.
Initial work has started on encoding commits/reveals to votes. Scaffolding
work has also started on parsing commits/reveals from votes.
I also looked at your branch, David. I see you have code for parsing the binary representation of commitments and reveals, and also code for parsing the state file.
I noticed that the state file parsing code is quite similar to the code that we will have to write for parsing commits/reveals from votes, but with statefile code patterns instead of routerparse code patterns.
I'm talking about this type of code:
{{{
+ alg = crypto_digest_algorithm_parse_name(value);
+ if (alg == -1 || alg != SR_DIGEST_ALG) {
+ }
+ value = smartlist_get(args, 1);
+ if (digest256_from_base64(identity, value) < 0) {
}}}
It would be nice to be able to write utility parsing functions that can be used
by both subsystems so that we don't have duplicate code. I wrote some skeleton
code for the routerparse.c code in case you want to think more of how we could
merge those two logics. Check out commit 39cb7ab.
Please see my branch prop250-voting-day5 for further work on
generating commitments and reveals. This is the first part of
switching from the dummy integer commitments/reveals of
prop250-voting-day4, to the signature-based commitments of prop250.
This is still work in progress, but commit e3b6f4f introduces a very
important unit test that shows a skeleton of how commitments and
reveals are supposed to be generated, parsed and verified. Let me know
if this logic seems acceptable to you, or we should change it.
To move forward now, we need to introduce the SR-specific ed25519 auth
key and include its certificate in votes, so that we stop using the
ancient RSA fingerprints and be able to start verifying signatures.
Tasks that need to be done next:
Create and start using the ed25519 SR key and certificate chains. Needs proposal!
Include the shared secret in the consensus.
Use weasel's algorithm in get_testing_network_protocol_phase().
Implement the conflict line (see add_conflict_to_sr_state()).
Implement the maintenance function and has_majority logic (see state_phase_transition()).
Plug ALL the memory leaks (many!)
Fix the disk state (but let's do this later so that all the data structures have stabilized).
[FWIW I left the stupid integer-based commits/reveals as
generate_sr_commitment_stupid() in case we want to do some chutney
testing. Eventually, generate_sr_commitment_stupid() will be replaced
by generate_sr_commitment().]
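To illustrate the commit/reveal flow that the unit test in e3b6f4f is meant to exercise, here is a minimal self-contained sketch: the commit phase publishes only a hash of the reveal value, and verification later recomputes it. The stub hash below is a stand-in for the SHA-256 digest the real code uses; none of this is actual Tor code:

```c
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 8

/* Stub 64-bit FNV-1a digest, standing in for a real crypto hash. */
static void
stub_digest(uint8_t out[DIGEST_LEN], const uint8_t *in, size_t len)
{
  uint64_t h = 0xcbf29ce484222325ULL;
  size_t i;
  for (i = 0; i < len; i++) {
    h ^= in[i];
    h *= 0x100000001b3ULL;
  }
  for (i = 0; i < DIGEST_LEN; i++)
    out[i] = (uint8_t)(h >> (8 * i));
}

/* Commit phase: publish only H(reveal), never the reveal itself. */
void
make_commitment(uint8_t commit[DIGEST_LEN],
                const uint8_t *reveal, size_t reveal_len)
{
  stub_digest(commit, reveal, reveal_len);
}

/* Reveal phase: check the revealed value against the commitment. */
int
verify_commitment(const uint8_t commit[DIGEST_LEN],
                  const uint8_t *reveal, size_t reveal_len)
{
  uint8_t expected[DIGEST_LEN];
  stub_digest(expected, reveal, reveal_len);
  return memcmp(commit, expected, DIGEST_LEN) == 0;
}
```

The real implementation additionally signs commits and binds them to an authority identity, which this sketch omits.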
In my tor branch prop250-nosrkeys-day1 you can find the corresponding code, where we use the ed25519 master key for referencing, and the RSA identity key to detect multiple commits by a single authority. I also fixed the get_state_valid_until_time() function and wrote some unittests.
I think the next step in our simplification process here is to refactor the decide() step which is not really useful anymore now that we don't do majority or conflicts. I imagine that we could make it so that there is no decide step or voted_commits, and we just move valid authoritative commits directly to the sr state during vote parsing.
Latest branch with code rebased on master and improvement: prop250_v4.
This branch has a few new things:
A random number is now always hashed (H(RN)) so we avoid leaking bytes from our PRNG to the network.
Add an SR flag to the vote indicating whether the vote participates in the protocol.
Parse votes and populate an object in the networkstatus_t object. Once done, we only handle those commits once we are sure a vote is valid and from a known authority (we weren't doing that before).
Refactor some functions to be able to support the last bullet above, and improve naming of some functions.
A few things are still missing that I'll get to:
Consensus params for the number of participants.
Torrc option to disable SR support (basically making the dirauth not add the SR flag to the vote).
Some issues with the disk state as well, because we now rely on the RSA fingerprint of a dirauth.
Latest branch with code rebased on master and improvement: prop250_v4.
Great! Please also check my prop250_v4 branch. Based on yours with a few mods.
Specifically:
Some small code tweaks and variable/function renames.
I removed the disaster formula. It's not useful anymore. Please see commit message for more info.
Made the vote parsing logic a bit more robust, I hope.
I renamed and changed the semantics of NumSRParticipants. Now, instead of using it just to count the number of participants, we use it to reject SRVs that have been voted by less than NumSRParticipants dirauths. This makes us more robust against edge cases like the one where the 5th SR dirauth appears during the reveal phase and calculates a different SRV than the other 4.
Also, in the majority calculation I now use the total number of dirauths instead of the number of voters. I did this to mirror the way n_required works in networkstatus_check_consensus_signature().
Let me know if you find these things reasonable!!!
More future things:
We should start actually voting the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths that is now.
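The ((n_auths / 2) + 1) default suggested above amounts to a strict-majority check over all dirauths. A sketch of that check (function name hypothetical):

```c
/* Sketch of the suggested NumSRVAgreements default: keep an SRV only if
 * a strict majority of ALL dirauths (not just the voters) agreed on it. */
int
should_keep_srv_sketch(int vote_count, int n_auths)
{
  int threshold = (n_auths / 2) + 1;
  return vote_count >= threshold;
}
```

For example, with 9 dirauths the threshold would be 5, so an SRV voted by only 4 authorities would be rejected.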
Also, I've been getting these interesting log messages for a while now:
[SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.
Great! Please also check my prop250_v4 branch. Based on yours with a few mods.
Specifically:
Some small code tweaks and variable/function renames.
Commit: d653f8cc17d6e278e16acb8a3101a821f9f42f6e
- Macroify previous/current SRV strings.
Not sure I agree with that. I much prefer typed variables here instead of macros for strings that we only use locally in this file. With variables we get compiler protection, whereas macros are pretty yolo in terms of compiler warnings. Do you see a win for macros?
More future things:
We should start actually voting the NumSRVAgreements parameter. And maybe for the real network the default value should be ((n_auths / 2) + 1), instead of n_auths that is now.
So you mean before merging 250, we should make an ad hoc patch right now? Most dirauths won't upgrade until a stable release, and since we plan to have this in 028-stable, we can make this a big block.
Hrm, consensus params are options in a file for dirauths, so we have to decide on a "hard" number, which in this case would be 5. And we can see how the SR behaves over time and increase it if needed?
Also, I've been getting these interesting log messages for a while now:
{{{
[SR] You may not provide a value for virtual option 'SharedRandCurrentValue'
}}}
but I'm not sure what they mean. I think they started appearing when we switched from LINELIST_S to LINELIST_V for SRVs.
Yeah, I have yet to understand this one, but I think that when the state file is read that option can be missing, and thus we get this warning...
Hello! asn and I are very happy to present to you wonderful reviewers the implementation for proposal 250 along with the final specification:
Both branches are in tor.git and torspec.git: dgoulet/prop250_final_v1
Some notes: we've separated this into 7 commits prefixed with "prop250:", except the first one, which adds a needed tor_htonll/ntohll function to tor utils. The code is mostly contained in two new files (with their headers): shared-random.{c|h} and shared-random-state.{c|h}.
Also, there are attacks on this protocol that we are well aware of, but all are easily detectable, so we've written a DocTor specification that atagar will help us deploy once this is merged.
Finally, we expect this code to run for a long time before the shared random values generated by the authorities are used, so for now you will NOT find anything using them.
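For context, the tor_htonll/ntohll pair mentioned above converts 64-bit values between host and network (big-endian) byte order. A portable sketch of the idea, not the actual Tor implementation:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of host-to-network 64-bit conversion: serialize the value
 * most-significant byte first, then reinterpret in host memory order.
 * (On big-endian builds this is the identity; cf. the WORDS_BIGENDIAN
 * review comment below.) */
uint64_t
sketch_htonll(uint64_t v)
{
  uint8_t b[8];
  uint64_t out;
  int i;
  for (i = 0; i < 8; i++)
    b[i] = (uint8_t)(v >> (8 * (7 - i)));
  memcpy(&out, b, 8);
  return out;
}

/* Inverse: read the in-memory bytes back most-significant first. */
uint64_t
sketch_ntohll(uint64_t v)
{
  uint8_t b[8];
  uint64_t out = 0;
  int i;
  memcpy(b, &v, 8);
  for (i = 0; i < 8; i++)
    out = (out << 8) | b[i];
  return out;
}
```

The round-trip property ntohll(htonll(x)) == x holds regardless of host endianness.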
Please ask questions! This won't be that trivial to review :).
Trac: Type: defect to enhancement; Priority: Medium to High; Severity: Blocker to Normal; Status: new to needs_review
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
Every other multi-word file name in src/or uses underscores.
If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?
There are a lot of missed opportunities for const pointer arguments.
e282227ebc03ae71f1cd6490a2adf2058140c66e:
If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.
72baec6805aaab8d62f2deed8fbc658c82a7d086:
shared-random-state.c:
If we use an asymmetric magic number, we can detect when the two halves are swapped.
{{{
#define SR_DISK_STATE_MAGIC 42424242
}}}
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase
{{{
int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;
}}}
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?
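The refactoring suggested above (one shared round-number function that honors the voting offset, so the same mistake can't be made in several places) might look roughly like this. Constants and signatures are assumptions for the sketch, not Tor's actual code:

```c
#include <time.h>

#define SR_N_ROUNDS 12  /* rounds per phase, cf. SHARED_RANDOM_N_ROUNDS */
#define SR_N_PHASES 2   /* commit + reveal, cf. SHARED_RANDOM_N_PHASES */

typedef enum { SR_PHASE_COMMIT, SR_PHASE_REVEAL } sr_phase_sketch_t;

/* Single place that maps a valid-after time to a round number, so the
 * offset (cf. TestingV3AuthVotingStartOffset) can't be honored in one
 * caller but forgotten in another. */
int
get_round_number_sketch(time_t valid_after, int voting_interval, int offset)
{
  int total_rounds = SR_N_ROUNDS * SR_N_PHASES;
  return (int)(((valid_after - offset) / voting_interval) % total_rounds);
}

/* Phase is derived from the round number, never computed separately. */
sr_phase_sketch_t
get_phase_sketch(time_t valid_after, int voting_interval, int offset)
{
  int round = get_round_number_sketch(valid_after, voting_interval, offset);
  return (round < SR_N_ROUNDS) ? SR_PHASE_COMMIT : SR_PHASE_REVEAL;
}
```

With this shape, get_state_valid_until_time() and get_sr_protocol_phase() would both call the same round calculation instead of each re-deriving it.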
Why do we have disk_state_validate() and disk_state_validate_cb()?
Should one call the other? Should they check _magic?
Should they check ValidAfter < ValidUntil?
disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.
disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.
I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.
disk_state_parse_sr_values() can assert its arguments are not NULL.
disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.
In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?
disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.
I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)
We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.
In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.
Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.
shared-random-state.h:
I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?
{{{
/* Private methods (only used by shared-random.c): */
}}}
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
Every other multi-word file name in src/or uses underscores.
Fixed. Extra commit here since this applies to almost all commits.
If a function moves data from (dtype* dst, const stype* src), can the arguments be in that order, and can the src argument be const?
There are a lot of missed opportunities for const pointer arguments.
Fixed most of those. See "fixup" commits
e282227ebc03ae71f1cd6490a2adf2058140c66e:
If you use #ifdef WORDS_BIGENDIAN (from orconfig.h), rather than a runtime check against htonl, some compilers will produce better code.
Fixed!
72baec6805aaab8d62f2deed8fbc658c82a7d086:
shared-random-state.c:
If we use an asymmetric magic number, we can detect when the two halves are swapped.
{{{
#define SR_DISK_STATE_MAGIC 42424242
}}}
Fixed
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge in a separate ticket.
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Changing this part of the code after the extensive testing we did could be very delicate. asn is looking at possible solution.
Nitpick: why 2? Can it be a named constant like SHARED_RANDOM_N_PHASES?
get_state_valid_until_time, get_sr_protocol_phase
{{{
int total_rounds = SHARED_RANDOM_N_ROUNDS * 2;
}}}
Definitely better.
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Does get_sr_protocol_phase() need to check it's getting an actual valid_after time, rather than now?
No, a valid_after time that is now is valid; it's what we set at startup, since we have no valid_after time until the first consensus.
Why do we have disk_state_validate() and disk_state_validate_cb()?
We must have a callback for the "state file" API; NULL is not accepted. Furthermore, if the callback returns an error, right now tor does a wonderful assert(0)... so it has been ignored.
Should one call the other? Should they check _magic?
This is done with CONFIG_CHECK(), which is called multiple times when you use config_assign().
Should they check ValidAfter < ValidUntil?
Fixed! (But the correct check is >= here :)
disk_state_parse_commits() fails to SMARTLIST_FOREACH(args, char *, cp, tor_free(cp)); if commit is NULL.
Fixed
disk_state_parse_sr_values() contains duplicate code in the if/else block. The function comment should say it parses both SharedRandPreviousValue and SharedRandCurrentValue.
Fixed
I think disk_state_parse_sr_values() might be able to leak memory if disk_state_parse_srv() fails. At the very least, it leaves an allocated but zero pointer, which could confuse the rest of the code.
Fixed with the commit above.
disk_state_parse_sr_values() can assert its arguments are not NULL.
Fixed with commit above
disk_state_put_commit_line() can have commit as a const pointer, and in the opposite order for consistency.
Fixed.
In disk_state_put_commit_line(), do we need to memwipe reveal_str?
Or are we about to make it public anyway?
Indeed. I think you are right. We do put our commit on disk during the commit phase so better to cleanup memory here to be safe.
disk_state_put_srv_line() can have srv as a const pointer, and in the opposite order for consistency.
Fixed.
I think disk_state_reset() is the moral equivalent of disk_state_free() followed by disk_state_new(). Why are they different functions? (Apart from the memory thing.)
Not much. We avoid an alloc/free every time we update the state, which during a voting period happens often. (Probably not enough to be noticeable, but still.)
We don't seem to check for errors in disk_state_update(). I think this is ok, but I'm not sure exactly why.
We assume that everything in our memory state is coherent, so the disk update can only use values that are usable. I think it's fine: if the memory state has crazy stuff in it, things will go bad whether we put it on disk or not.
In disk_state_save_to_disk() and children (and other functions), should we memwipe any temporary memory that contains the reveal value, or the encoded reveal value? (before it is revealed)
We do this some places, but not others.
I've added a needed memwipe in reveal_encode, since we do encode our commitment during the commit phase, so the reveal value should only live in the commit object's memory and not on the stack.
Can data in state_query_get_() be a const pointer?
Some of the sr_state_set_* functions can take const pointers.
For the get_, yes, I fixed it, but not for the set_, since digestmap doesn't handle a const.
shared-random-state.h:
I think this comment is in the wrong place, shouldn't it correspond to SHARED_RANDOM_STATE_PRIVATE?
{{{
/* Private methods (only used by shared-random.c): */
}}}
Ah, by that we mean methods that are ONLY used by shared-random.c, so "private" to the subsystem, but not static because they need to be visible.
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
{{{
#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];
}}}
I would really want to fix legacy/trac#17868 (moved) instead of a temporary workaround macro. You'll notice that there are still XXX: at that place and it's to show the obvious bad thing of +2.
Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.
Hrm, both the code and comment do the same thing, or am I missing something?
Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?
We'd prefer not to, since this is data from the network and thus potentially untrusted.
reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
Plausible. I would do that in a later iteration, though. Having them separated gives us more room for better logging, and if one of them changes in some way, we can adapt easily.
In sr_parse_srv(), why do we use base16 for the SRV, but base64 for the commits?
No reason... We'll make it base64.
In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
A comment either way would help.
We do, and the top comment of the function details the order of args. What extra comment do you have in mind?
I have 5 more commits to review. (Please add fixups to the end of the branch so I don't lose my place.)
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Hmm, I would like to do that, but it doesn't seem so easy. dirvote_recalculate_timing() has not been called yet when we call state_new(), since we call that at boot and we haven't voted at that point.
I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure if it's actually correct. That function updates the global voting_schedule structure, which is then used by other parts of the codebase, and I'm not sure if initializing it with 'now' set to the next 'valid_after' would be correct here :x
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for a testing network is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge in a separate ticket.
Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)
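A sketch of that replacement: the macro becomes a function that consults a testing-network option and falls back to the default. The option name and struct are invented for this sketch, not actual Tor options:

```c
#define SR_N_ROUNDS_DEFAULT 12  /* current SHARED_RANDOM_N_ROUNDS value */

/* Stub for the relevant bits of or_options_t; the option name
 * TestingSharedRandomNRounds is hypothetical. */
typedef struct {
  int TestingTorNetwork;
  int TestingSharedRandomNRounds;
} options_sketch_t;

/* Replace the hard-coded macro with an option lookup, only honored in
 * test networks so the live protocol behavior is unchanged. */
int
get_sr_n_rounds(const options_sketch_t *opts)
{
  if (opts->TestingTorNetwork && opts->TestingSharedRandomNRounds > 0)
    return opts->TestingSharedRandomNRounds;
  return SR_N_ROUNDS_DEFAULT;
}
```

This keeps 12 as the default everywhere while letting chutney networks run a much shorter protocol cycle.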
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.
shared-random.c:
{{{
char b64_decoded[SR_COMMIT_LEN + 2];
}}}
Can we either fix legacy/trac#17868 (moved) as part of this branch, or add a #define for the rounding-up that has to occur when the base64 value isn't padded?
{{{
#define B64_DECODE_PAD(len) ((len) + ((len) % 3))
char b64_decoded[B64_DECODE_PAD(SR_COMMIT_LEN)];
}}}
I would really want to fix legacy/trac#17868 (moved) instead of a temporary workaround macro. You'll notice that there are still XXX: at that place and it's to show the obvious bad thing of +2.
Is TIMESTAMP first, or H(REVEAL)?
commit_decode says one thing in the comments, and another in the code.
Hrm, both the code and comment do the same thing, or am I missing something?
The commit_decode() function header comment says:
{{{
base64-encode( H(REVEAL) || TIMESTAMP )
}}}
The code says:
{{{
/* First is the timestamp (8 bytes). */
commit->commit_ts = (time_t) tor_ntohll(get_uint64(b64_decoded));
offset += sizeof(uint64_t);
/* Next is hashed reveal. */
memcpy(commit->hashed_reveal, b64_decoded + offset,
       sizeof(commit->hashed_reveal));
}}}
> > Can we assert that `decoded_len == SR_COMMIT_LEN`? Is that useful?
>
> We prefer not since this is data from the network so potentially untrusted.

OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN? Can we make that explicit in a comment?

> > reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
>
> Plausible. I would do that in a later iteration though. Having both separated gives us more room for better logging and if one of those change in some way, we can adapt easily.

The risk of leaving them duplicated is that bugs get fixed in one and not the other. Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.

This has already happened to these two functions:
* commit_decode() uses offset, reveal_decode() uses a hard-coded 8.
* the "too small" warning in reveal_decode() is less detailed, but could be identical to commit_decode()

So I have a more radical suggestion for you:
* you have two groups of 3 struct fields that are the same types, and have the same operators performed on them;
* abstract these 3 fields into their own struct type (ts, hashed, encoded)
* add operations for encoding and decoding this type (and perhaps, turning this authority's reveal into its commit)
* refactor the existing code to call these operations

That way, the interface stays the same, you can add any extra logging you need in the commit- or reveal-specific functions, but you get the benefit of using the same code for encoding/decoding commits and reveals.

> > In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional? A comment either way would help.
>
> We do, and the top comment of the function details the order of args. What extra comment do you have in mind?

If an authority supplies 5 arguments in its vote, we will ignore the fifth. There's no comment documenting whether this is intentional or not. And should we ignore the entire line instead?
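The shared-struct suggestion could be sketched like this (types, names, and field sizes are illustrative only): one wire type with one encoder and one decoder, so the commit and reveal code paths cannot drift apart.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 32
#define SKETCH_ENCODED_LEN (8 + SKETCH_HASH_LEN)

/* One struct for both commits and reveals: a timestamp plus a
 * hash-sized payload (H(REVEAL) for commits, the value for reveals). */
typedef struct sr_wire_blob_t {
  uint64_t ts;
  uint8_t hashed[SKETCH_HASH_LEN];
} sr_wire_blob_t;

/* Encode: big-endian timestamp first, payload second, matching the
 * TIMESTAMP || H(REVEAL) layout discussed above. */
void
blob_encode(uint8_t out[SKETCH_ENCODED_LEN], const sr_wire_blob_t *b)
{
  int i;
  for (i = 0; i < 8; i++)
    out[i] = (uint8_t)(b->ts >> (8 * (7 - i)));
  memcpy(out + 8, b->hashed, SKETCH_HASH_LEN);
}

/* Decode is the exact inverse; both commit and reveal parsing would
 * call this single function instead of duplicating offsets. */
void
blob_decode(sr_wire_blob_t *b, const uint8_t in[SKETCH_ENCODED_LEN])
{
  int i;
  b->ts = 0;
  for (i = 0; i < 8; i++)
    b->ts = (b->ts << 8) | in[i];
  memcpy(b->hashed, in + 8, SKETCH_HASH_LEN);
}
```

The commit- and reveal-specific wrappers would add their own logging and validation on top, as the review suggests.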
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
General comments:
get_start_time_of_current_round() doesn't take TestingV3AuthVotingStartOffset into account. Why not use:
{{{
time_t next_start = voting_schedule.interval_starts;
time_t curr_start = dirvote_get_start_of_next_interval(next_start - interval - 1,interval,
options->TestingV3AuthVotingStartOffset);
}}}
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Hmm, I would like to do that, but it doesn't seem so easy. dirvote_recalculate_timing() has not been called yet when we call state_new(), since we call that at boot and we haven't voted at that point.
I then called dirvote_recalculate_timing() myself and everything worked fine, but I'm not sure if it's actually correct. That function updates the global voting_schedule structure, which is then used by other parts of the codebase, and I'm not sure if initializing it with 'now' set to the next 'valid_after' would be correct here :x
Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?
My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.
Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.
Nitpick: why add a phase_str for "unknown" when you assert rather than use it?
commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.
It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.
I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).
reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.
sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.
sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.
{{{
tor_assert(reveal_num < UINT8_MAX);
}}}
Is this OK?
It should be documented somewhere.
In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.
In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?
47ade62b1811c16c28d73ebe59f47daa16cd4522:
AuthDirSharedRandomness needs a tor manual page entry.
There are missed opportunities for const pointers in this commit.
srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.
get_majority_srv_from_votes() could be significantly simplified by adding to container.c:
{{{
/** Return the most frequent member of the sorted list of DIGEST256_LEN
 * digests in <b>sl</b>. If count_out is non-NULL, set it to the count
 * of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
  return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}
}}}
Then using:
{{{
int count = 0;
value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
/* Was this SRV voted by enough auths for us to keep it? */
if (!should_keep_srv(count)) {
  goto end;
}
}}}
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
8e180aa061f0ea39968f05c85796b26f10a81744:
extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
 * to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.
Let's also make sure we don't assert() on untrusted input in other parsing functions.
extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?
verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.
In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)
8f3f5667754037cccd758eb8631636e2e78fed93:
Looks ok.
3f2014d7aabe2e10163f4f44a8ca46fd5a91f04a:
I'm not going to review the unit tests in detail. (We generally accept unit tests if they work, right?)
src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.
I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
Five times:
{{{
shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
One time:
{{{
shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
8 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
And I also see 12/720 TESTS FAILED. (1 skipped).
The dgoulet/prop250_final_v4 branch also has similar unit test issues.
Now on to the extra commits in dgoulet/prop250_final_v4:
There are twelve extra commits in this branch.
A hard-coded SHARED_RANDOM_N_ROUNDS is going to make it really hard to test hidden services quickly using chutney. (We'll always be testing them using the default initial shared random value.) Can we make this configurable in test networks?
{{{
#define SHARED_RANDOM_N_ROUNDS 12
}}}
The part I do not like about changing this value for test networks is that we do NOT get the real behavior of the protocol... I'm not against a testing value, but I would do that after merge, in a separate ticket.
Not necessarily - 12 can still be the default number of rounds.
(I'm pretty sure we haven't hard-coded this macro value anywhere. So it should be easy to replace with a function that reads an option for the number of rounds.)
I'll comment on the ticket from now on. thanks! :)
get_state_valid_until_time() also ignores TestingV3AuthVotingStartOffset. I think this should fix it:
{{{
current_round = (beginning_of_current_round / voting_interval) % total_rounds;
}}}
get_sr_protocol_phase() also ignores TestingV3AuthVotingStartOffset. Perhaps passing around the time isn't the best way to do this? It seems like if we can make this same mistake in multiple places, then we should refactor the round number calculation so it's in a single function.
Can we refactor the round number / time calculations so they're in one place?
I feel nervous we'll break them if we try to fix them later, and I'd rather do that now when we're not live.
I'll wait on asn's work on this before refactoring anything.
Can we assert that decoded_len == SR_COMMIT_LEN? Is that useful?
We prefer not since this is data from the network so potentially untrusted.
OK, then we're deliberately ignoring any characters after SR_COMMIT_LEN?
Can we make that explicit in a comment?
I think this check at the start of the function should catch it. Past that, there can't be anything past SR_COMMIT_LEN, I think:
{{{
if (strlen(encoded) > SR_COMMIT_BASE64_LEN)
}}}
> > > reveal_decode() duplicates commit_decode(). Can we refactor to avoid that?
> > Plausible. I would do that in a later iteration though. Having both separated gives us more room for better logging, and if one of those changes in some way, we can adapt easily.
> The risk of leaving them duplicated is that bugs get fixed in one and not the other.
> Or they fall out of sync accidentally. This is a real mess to clean up later - see router_pick_directory_server_impl() and router_pick_trusteddirserver_impl() for an example.
> This has already happened to these two functions:
> * commit_decode() uses offset, reveal_decode() uses a hard-coded 8.
The hardcoded 8 is _BAD_... that is a good find.
[snip]
I'll work on something here.
> > > In sr_parse_commit(), we ignore extra arguments after the fourth. Is this intentional?
> > > A comment either way would help.
> > We do, and the top comment of the function details the order of args. What extra argument do you have in mind?
> If an authority supplies 5 arguments in its vote, we will ignore the fifth.
> There's no comment documenting whether this is intentional or not.
> And should we ignore the entire line instead?
True. We should check if we exceed the number of accepted arguments and, if so, reject.
I'll apply the fixes for the above and submit them in the next comment. I would like to avoid multiple threads in this ticket :).
After the proposal review that happened a few days ago on #tor-dev, we made some adjustments to the specification and code.
Please find the latest code here: prop250_final_v3
Proposal 250 has been updated with the latest in torspec.git
Code review:
get_next_valid_after_time() duplicates code in dirvote_recalculate_timing().
Why not use voting_schedule.interval_starts?
(You'd need to make sure dirvote_recalculate_timing() had been called before either of these functions used voting_schedule.interval_starts.)
Can we refactor the relevant calculations out of dirvote_recalculate_timing(), and call them from dirvote_recalculate_timing() and get_next_valid_after_time() and get_start_time_of_current_round()?
My concern is that we'll break test networks that use TestingV3AuthVotingStartOffset by ignoring it during shared random calculations. And that these 3 functions should produce the same results.
Do get_next_valid_after_time() and get_start_time_of_current_round() belong in the core dirserv file instead? They don't seem to be specific to shared random.
The refactoring was a bit more messy than I expected, but the branch seems to work in chutney. The tests still pass without any changes, so that's good.
I also moved get_next_valid_after_time() into dirvote.c as you suggested.
I've reviewed asn's prop250_final_v4 up to ab5a2ffed4fa02d3ac8a2e46b8f3c38a318a7d3a.
Thanks for the refactoring, I am much more comfortable that the new timing code uses the same values as the existing timing code.
Still waiting to hear about the unit tests failures and chutney template.
This is my 3rd comment in a row. Please read all 3 comments.
Ok, FYI I've autosquashed the fixups from the last comment. It took me a while to get
it done cleanly, but now it's done. You have a series of fixups now for this
comment.
Nitpick: why add a phase_str for "unknown" when you assert rather than use it?
get_phase_str() uses the enum value as the index in the phase_str array.
"unknown" is there as a taint if we ever use an uninit phase value.
commit_log() should probably use safe_str() on the reveal value, as it is sensitive.
Other debug messages also have this issue.
I've refactored that a bit to print the encoded value instead, so we can
match them to strings in votes or other logs that print the encoded part.
It also logs commits and reveals in hex, but votes use base64. This will make them difficult to compare. generate_srv() also does this with the SRV.
Fixed (for both this and the commit_log)
I see you fixed the commit_decode() function comment to base64-encode( TIMESTAMP || H(REVEAL) ).
reveal_encode() and commit_encode() are duplicate code, with similar code drift issues to the *decode() functions. See comment 25 for a suggestion on this.
I've discussed this with asn, and after coding two different ways of achieving
this, we don't feel super comfortable, for two reasons. First, neither option
makes it very clean, tbh. Second, reveal and commit values are two different
logical objects that just happen to have the same encoding format, so we are
fine leaving them like that: it's little code duplication, and they are simply
different things...
sr_generate_our_commit() can use crypto_strongest_rand() rather than doing the hash itself.
It can also use a const pointer.
Fixed! (3 different commits to be squashed on different master commit)
sr_compute_srv() places a hard limit of 254 (SR-voting) authorities in a tor network.
{{{
tor_assert(reveal_num < UINT8_MAX);
}}}
Is this OK? It should be documented somewhere.
This is indeed a limitation. Proposal 250 does force INT_1(REVEAL_NUM). I've
pushed an update to torspec mentioning the limitation.
In sr_generate_our_commit() and sr_compute_srv(), crypto_digest256 never returns < 0. It returns 1 on failure.
In generate_srv(), you don't check the return value of crypto_digest256.
You might want to check how you handle the return values of all your crypto_* calls.
Good catch! Fixed! (two fixup commits)
In sr_compute_srv, should we assert on crypto_digest256 failure? Or should we do something with the state?
Hrm... an assert sounds a bit scary, to just kill an authority. Most of tor's
code just ignores the return value. So I'm guessing having an empty SRV is
valid here in case of an epic error like that, and the authority should
recover after a while.
47ade62b1811c16c28d73ebe59f47daa16cd4522:
AuthDirSharedRandomness needs a tor manual page entry.
Fixed in b37214db72d42af8b25f7f87a5d3c6894c032115.
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
srv_to_ns_string() and get_majority_srv_from_votes() also encode the SRV as hex rather than base64.
Fixed.
get_majority_srv_from_votes() could be significantly simplified by adding to container.c:
{{{
/** Return the most frequent member of the sorted list of DIGEST256_LEN
digests in sl. If count_out is non-NULL, set it to the count
of the most frequent member. */
const uint8_t *
smartlist_get_most_frequent_digest256_(smartlist_t *sl, int *count_out)
{
return smartlist_get_most_frequent_(sl, compare_digests256_, count_out);
}
}}}
Then using:
{{{
int count = 0;
value = smartlist_get_most_frequent_digest256_(sr_digests, &count);
/* Was this SRV voted by enough auths for us to keep it? */
if (!should_keep_srv(count)) {
goto end;
}
}}}
Indeed. I've created smartlist_get_most_frequent_digest256_ and changed
around that function. Fixed.
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
8e180aa061f0ea39968f05c85796b26f10a81744:
extract_shared_random_commits() and sr_state_copy_reveal_info() and should_keep_commit() and possibly other functions could use const pointers.
Fixed.
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
extract_shared_random_commits() does tor_assert(tok->n_args >= 3); on untrusted input. That lets a single authority crash all the other authorities by including a line with only 2 arguments. Let's skip bad commits instead.
Let's also make sure we don't assert() on untrusted input in other parsing functions.
Indeed, very bad. Fixed!
extract_shared_random_srvs() can error out on the current SRV, and leave the previous SRV allocated and initialised. What happens then?
sr_info in the networkstatus_t structure is allocated with the ns, so on
error the parsing stops and the object disappears on its own.
{{{
if (extract_shared_random_srvs(ns, tokens) < 0) {
log_warn(LD_DIR, "SR: Unable to parse SRV(s)");
goto err;
}
}}}
> verify_commit_and_reveal() also expects crypto_digest256() to return -1. It never will.
Fixed.
> In should_keep_commit(), we warn using LD_BUG when another authority gets the protocol wrong. This could be misleading, and leave authority operators thinking they have a tor version with a bug. (Or do we do this elsewhere?)
True, should be a `LD_DIR` here. Fixed.
> src/test/sr_srv_calc_ref.py looks like it uses SHA256, and not SHA3-256.
Yeah, that python one is out of date, and SHA3 is not in python hashlib as far as I know, so a fix to that thing is pending until we can do it in python...
> I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
> Five times:
> {{{
> shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> }}}
> {{{
> 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> 3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
> 4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
> 5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
> 6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
> 7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> 8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> 9 test 0x00000001019df43f main + 639 (testing_common.c:298)
> }}}
I think it's fixed. Since I can't reproduce, I hope you can confirm.
> One time:
> {{{
> shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> }}}
> {{{
> 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> 3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
> 4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
> 5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
> 6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> 7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> 8 test 0x00000001019df43f main + 639 (testing_common.c:298)
> }}}
This one is a bit more confusing to me. I don't see any possible way that assert can be triggered apart from _already_ having a commit in our state when adding it. You'll have to tell me if you hit it again after the fixes, and probably we'll have to dig deeper.
> And I also see `12/720 TESTS FAILED. (1 skipped)`.
[snip]
> I'd still like to run the SRV code in chutney, is there a chutney template for that?
> (I think I'll need to wait for the unit tests to be fixed first.)
You can with that branch. We'll simply use the default values for the supermajority. grep for prefix "SR:" logs and see the sr-state file.
**Trac**: **Status**: needs_revision **to** needs_review
This is my 3rd comment in a row. Please read all 3 comments.
Ok, FYI I've autosquash the fixup from last comment. It took me a while to have
it done cleanly but now it's done. You have a series of fixup now for this
comment.
See branch: prop250_final_v5
The last commit I reviewed was c219b23d47e9c742532354eedae10ca0509a1d78 in prop250_final_v4.
This appears to correspond to 4f5e212cbd34c8cfb701b25a007e88eacbac1f1f "Rename shared random files to use underscore" in prop250_final_v5, which includes an autosquash and a rebase to master.
I'm only going to comment on new commits, and any issues I still have with the old code.
(I'll remove any questions, fixes, and explanations I'm ok with.)
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
 * to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
Most clients use the microdesc consensus, some still use NS.
I don't mind which consensus we use, but:
I think we should have a comment explaining the arbitrary choice, and
I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).
I get the following crashes when I run the unit tests on OS X 10.11.3, Apple LLVM version 7.0.2 (clang-700.1.81), x86_64.
Five times:
{{{
shared-random/utils: [forking] Feb 10 16:13:46.122 [err] tor_assertion_failed_: Bug: src/or/shared-random.c:854: sr_generate_our_commit: Assertion my_rsa_cert failed; aborting. (on Tor 0.2.8.1-alpha-dev )
}}}
{{{
2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
3 test 0x0000000101a8af29 sr_generate_our_commit + 825 (shared-random.c:854)
4 test 0x0000000101a8cf11 sr_state_update + 177 (shared-random-state.c:1056)
5 test 0x0000000101a8d56d sr_state_init + 349 (shared-random-state.c:1231)
6 test 0x00000001018e43c3 test_a_networkstatus + 339 (test_dir.c:1800)
7 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
8 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
9 test 0x00000001019df43f main + 639 (testing_common.c:298)
}}}
I think it's fixed. Since I can't reproduce, I hope you can confirm.
I still see 5 of these. One example is:
{{{
3 test 0x00000001053d4045 sr_generate_our_commit + 709 (shared_random.c:870)
4 test 0x00000001053d60f1 sr_state_update + 177 (shared_random_state.c:1068)
5 test 0x00000001053d67d3 sr_state_init + 483 (shared_random_state.c:1243)
6 test 0x00000001052c9573 test_sr_get_majority_srv_from_votes + 51 (test_shared_random.c:811)
}}}
> > One time:
> > {{{
> > shared-random/state_transition: [forking] Feb 10 16:13:46.389 [err] tor_assertion_failed_: Bug: src/or/shared-random-state.c:210: commit_add_to_state: Assertion saved_commit == NULL failed; aborting. (on Tor 0.2.8.1-alpha-dev )
> > }}}
> > {{{
> > 2 libsystem_c.dylib 0x00007fff975b06e7 abort + 129
> > 3 test 0x0000000101a8e00f commit_add_to_state + 175 (shared-random-state.c:210)
> > 4 test 0x0000000101a8c970 sr_state_add_commit + 48 (shared-random-state.c:941)
> > 5 test 0x0000000101982dfe test_state_transition + 574 (test_shared_random.c:976)
> > 6 test 0x00000001019dfadd testcase_run_one + 845 (tinytest.c:106)
> > 7 test 0x00000001019dffd3 tinytest_main + 499 (tinytest.c:432)
> > 8 test 0x00000001019df43f main + 639 (testing_common.c:298)
> > }}}
> This one is a bit more confusing to me. I don't see any possible way that assert can be triggered apart from _already_ having a commit in our state when adding it. You'll have to tell me if you hit it again after the fixes, and probably we'll have to dig deeper.
I still see this one, same location; the line numbers have changed slightly:
{{{
3 test 0x00000001053d6f4f commit_add_to_state + 175 (shared_random_state.c:212)
4 test 0x00000001053d5b50 sr_state_add_commit + 48 (shared_random_state.c:953)
5 test 0x00000001052cac6e test_state_transition + 574 (test_shared_random.c:1000)
}}}
And I also see 12/720 TESTS FAILED. (1 skipped).
Now I see 7/724 TESTS FAILED. (1 skipped).
6 of these are the crashes above, the final one is:
There are missed opportunities for const pointers in this commit.
All fixed (hopefully).
Except for the new smartlist_get_most_frequent_srv, which can take const smartlist_t *sl.
(It was added in 3b04ebdb94e29c0b2569ff8c64256f37f1389180.)
Fixed 136da34.
Nitpick: get_majority_srv_from_votes() uses unsigned int current for a boolean value. Tor typically uses int for boolean values.
Should we really continue that "tradition"? :)
I'm in favour of consistently using int, but it doesn't matter that much.
(We could decide to move to stdbool.h, and assert that our booleans are actually 0 or 1, but I think that's a topic for another ticket.)
Fair enough. Fixed 279764a
Why pick the microdesc consensus to update the state? Any particular reason?
Should we fall back to the NS consensus if there's no microdesc consensus or SR values?
{{{
/* A new consensus has been created: pass it to the shared random subsystem
to update the SR state. */
sr_act_post_consensus(pending[FLAV_MICRODESC].consensus);
}}}
It's the consensus that clients use, so we use that consensus to set our SRV
value post-consensus, thus having the same view as clients. Does that make sense?
Do you think we should use FLAV_NS?
Most clients use the microdesc consensus, some still use NS.
I don't mind which consensus we use, but:
I think we should have a comment explaining the arbitrary choice, and
I actually changed it to use FLAV_NS, which is the main one: if we can't
create it, that's a problem! I also checked that the consensus pointer exists
before using it.
See 63abee2
I wonder what happens if one consensus fails, but the other doesn't (can this even happen?).
Yeah, quite unsure but plausible from what I can see.
[snip]
Now I see 7/724 TESTS FAILED. (1 skipped).
6 of these are the crashes above, the final one is:
Thanks to asn, we found out why the failures happen! See 30edde4.
I've reviewed all the changes up to 30edde4bce7c5f34abb673604cc6e08c43195425.
The code changes are trivially correct, the unit test changes are proven by execution.
All unit tests pass ok with no failures, as does make test-network-all.
I would be ok with this being merged, but I'd like to run it through coverity or clang-scan eventually.
Compiling with clang (with some extra warnings) yields the following integer size complaints:
Fixed in 185189a in my branch prop250_final_v5 in https://github.com/teor2345/tor.git
(You might want to split this into a series of fixups.)
next_valid_after_time should be time_t in:
{{{
time_t
get_next_valid_after_time(time_t now)
{
  int next_valid_after_time;
}}}
sr_parse_srv has the only remaining implicit integer conversion warning - it would be nice to silence it by casting the result of tor_parse_long() to int:
{{{
sr_srv_t *
sr_parse_srv(const smartlist_t *args)
{
  char *value;
  int num_reveals, ok, ret;
  sr_srv_t *srv = NULL;

  tor_assert(args);

  if (smartlist_len(args) < 2) {
    goto end;
  }

  /* First argument is the number of reveal values */
  num_reveals = tor_parse_long(smartlist_get(args, 0),
                               10, 0, INT32_MAX, &ok, NULL);
}}}
ts should be time_t:
{{{
static void
test_encoding(void *arg)
{
  (void) arg;
  int ret, duper_rand = 42;
  /* Random number is 32 bytes. */
  char raw_rand[32];
  uint64_t ts = 1454333590;
}}}
In sr_act_post_consensus, this code can be executed when consensus is NULL - it should be surrounded by if (consensus) like some of the code above it:
{{{
/* Update our state with the valid_after time of the next consensus so once
 * the next voting period start we are ready to receive votes. */
time_t next_consensus_valid_after =
  get_next_valid_after_time(consensus->valid_after);
sr_state_update(next_consensus_valid_after);
}}}
I also get an assertion failure:
{{{
shared-random/utils: [forking]
FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
[utils FAILED]
}}}
But only when compiling under certain clang options (OS X):
I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!
I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):
{{{
2 libsystem_c.dylib 0x9d7f5c34 abort + 156
3 test 0x0057e5bf crash_handler + 383
4 libsystem_platform.dylib 0x990fb01b sigtramp + 43
5 ??? 0xffffffff 0 + 4294967295
6 test 0x0057e440 0x1000 + 5755968
7 test 0x007d20eb parse_iso_time + 1739 (util.c:1713)
8 test 0x007d2134 parse_iso_time + 52 (util.c:1723)
9 test 0x00702157 config_assign_value + 7239 (confparse.c:346)
10 test 0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11 test 0x006fa435 config_assign + 2133 (confparse.c:828)
12 test 0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13 test 0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)
}}}
I'm not sure what to do about this one. I wonder if this also happens with our current state file, since this is plain ISO time parsing. There's nothing different between this state file and the one we have in terms of timestamp parsing...
I've fixed the rest; that is, I took your commit 185189a. Function sr_act_post_consensus now has a check on the pointer. This function can't be called with a NULL consensus for now, but better safe for the future.
Since the above are easy to confirm without fixup commits, I've created a new branch with everything squashed. You can find the fixup commits in prop250_final_v5.
Fully squashed and rebased on git master "ready for merge" branch is: prop250_final_v6
I also get an assertion failure:
{{{
shared-random/utils: [forking]
FAIL /Users/twilsonb/tor/tor-sr/src/test/test_shared_random.c:957: assert(commit_is_authoritative(&commit, fp) == 0): 1 vs 0
[utils FAILED]
}}}
I wonder if one of those options somehow removes the memset(0) just above the assert, which would explain why 1 is returned instead of 0; otherwise I can't see why this would return 1!
The memcpy(fp, commit.rsa_identity_fpr, sizeof(fp)); copies the uninitialised commit.rsa_identity_fpr into fp. Therefore, clang can assume that commit.rsa_identity_fpr is all zero (or, alternately, commit.rsa_identity_fpr uses memory that is initialised zero). Therefore, you get the same result for both commit_is_authoritative calls.
I see undefined behaviour under 32 bit, possibly in tor_gmtime (which isn't mentioned below, but is in the tor backtrace):
{{{
2 libsystem_c.dylib 0x9d7f5c34 abort + 156
3 test 0x0057e5bf crash_handler + 383
4 libsystem_platform.dylib 0x990fb01b sigtramp + 43
5 ??? 0xffffffff 0 + 4294967295
6 test 0x0057e440 0x1000 + 5755968
7 test 0x007d20eb parse_iso_time + 1739 (util.c:1713)
8 test 0x007d2134 parse_iso_time + 52 (util.c:1723)
9 test 0x00702157 config_assign_value + 7239 (confparse.c:346)
10 test 0x006fbba7 config_assign_line + 4663 (confparse.c:529)
11 test 0x006fa435 config_assign + 2133 (confparse.c:828)
12 test 0x009be851 disk_state_load_from_disk_impl + 417 (shared_random_state.c:651)
13 test 0x0073b9f5 test_state_load_from_disk + 1525 (test_shared_random.c:617)
}}}
I'm not sure what to do about this one. I wonder if this also happens with our current state file, since this is plain ISO time parsing. There's nothing different between this state file and the one we have in terms of timestamp parsing...
You pass 2666 as the year. Any years after 2037 overflow a 32 bit time_t.
See also legacy/trac#18383 (moved) for a related issue in the dirauth code that was causing some assertion failures.