Contrary to our testing results, the .dmg bundles we produce for 8.0a9 differ on two machines. And it turns out Linux bundles are affected as well (see #27937 (moved) for the original Linux bug report).
As I expected, this is a problem introduced by our switch to 1.26.1. With version 1.25.0 of the Rust compiler, XUL is still reproducible. So, to sum up:
Starting with the switch to Rust 1.26.1, Tor Browser for macOS (and only for that platform) is no longer built reproducibly once we enable Stylo. There are small differences visible in XUL stemming from gkrust (see comment:1 and comment:4). Using either Rust 1.25.0 or disabling Stylo (which is what we are doing right now) "solves" this problem.
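For reference, the Stylo workaround boils down to a single mozconfig line along these lines (shown for illustration; the exact spelling in our build configuration may differ):

{{{
# mozconfig (illustrative): build without Stylo so gkrust output stays reproducible
ac_add_options --disable-stylo
}}}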
Alex, Manish: Do you know of any commit that could have caused this? If not, I can bisect I guess.
Trac:
Cc: boklm changed to boklm, manishearth@gmail.com, acrichton@mozilla.com
Summary: "ESR60-based .dmg images are not built reproducibly with Stylo enabled" changed to "ESR60-based .dmg images are not built reproducibly with Stylo enabled using rustc > 1.25.0"
Status: new changed to needs_information
I know of https://github.com/rust-lang/rust/issues/47086 as a macOS-only deterministic-compilation problem, but other than that I unfortunately don't know of any macOS-specific deterministic-compilation issues in the toolchain itself.
So, I tried for a while to get the bisecting going, but it seems non-trivial. Alex: what is the recommended way of bisecting the Rust compiler? What I tried was using git bisect with the commits for 1.25.0 and 1.26.1, updating the submodules and tarring the result up + using it on my build machines, but that breaks for different reasons (the first one being that no properly vendored crates are included (we compile with --enable-vendor); trying to get crates via crates.io-index fails with an SSL error...).
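For reference, the general shape of such a bisection would be something like this (a sketch only, not the exact commands used; packaging the sources is described below):

{{{
# Sketch: bisect rust-lang/rust between the 1.25.0 and 1.26.1 release tags.
git clone https://github.com/rust-lang/rust && cd rust
git bisect start 1.26.1 1.25.0
# at each revision git bisect checks out: package the sources, build Tor Browser
# on two machines, compare the resulting XUL/gkrust objects, then run
git bisect good    # or: git bisect bad
}}}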
Maybe easier: are there nightly source tarballs available somewhere (like there are for the official releases) which I could test? I did not find any so far, but that might already narrow down the problem sufficiently.
Okay, I tried the nightlies from 2018-02-25, 2018-03-28, and 2018-05-08 to find a regression range, but to my surprise they are all good, meaning I get the same result on different build machines (which does not happen with stable versions > 1.25.0). So, it seems there is something in the stable code but not in the nightlies that is causing this, which is confusing to me. We build with
Oh dear that is indeed worrisome! The only difference between nightly and stable builds is --release-channel passed to ./configure so you should already be emulating our stable builds. It looks like you're cross-compiling rustc from Linux though? For us we compile natively on OSX and I wonder if that causes differences?
I avoided that by taking the respective nightly sources and compiling them the same way we compile the stable compiler.
So, the plan right now is to take the nightly sources that are closest to 1.26.1 and remove parts of the diff until I find the problem.
Ok sure, yeah, you may want to go via git bisection in the repo perhaps? That'll need to download other sources, but they're done via lock files and such, so you shouldn't be downloading or using any different sources than we used to build the releases.
Okay, some good news here: I double-checked my setup and it seems I made a mistake which led to no bad nightly being found. I fixed that and the last nightly that is good is the one from March 07. Then there are no nightly sources between 03/08 and 03/14 (inclusive) and the nightly from March 15 is the first bad one.
I fought a bit with using the repo for bisecting (I need to vendor all the crates before creating a tarball, making sure that I really have all the submodules initialized and updated, no .git dirs bundled but .gitmodules files available etc.) but it seems to work now. Bisecting...
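The tarball preparation mentioned above amounts to roughly the following (paths and the vendoring invocation are assumptions about that tree layout, not the exact commands used):

{{{
# Sketch: turn a rust checkout at the current bisect revision into a
# self-contained source tarball: submodules checked out, crates vendored
# for --enable-vendor, .git directories dropped but .gitmodules kept.
git submodule update --init --recursive
cargo vendor --manifest-path src/Cargo.toml src/vendor
tar -cJf ../rustc-bisect-src.tar.xz --exclude=.git .
}}}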
Bisecting seems tricky as I get intermittent differences which sometimes go away if one just tries several times. I've attached the disassembly diff of gkrust-d3a9de07b53ab691.gkrust0.rcgu.o as it might help a bit in narrowing things down. I wonder whether we actually got the differences due to an LLVM update...
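In case it helps reproduce the attachment, a diff like that can be generated roughly as follows (the use of llvm-objdump on the Mach-O objects and the directory names are assumptions):

{{{
# Sketch: compare the disassembly of the same gkrust object from two builds.
llvm-objdump -d build-a/gkrust-d3a9de07b53ab691.gkrust0.rcgu.o > a.asm
llvm-objdump -d build-b/gkrust-d3a9de07b53ab691.gkrust0.rcgu.o > b.asm
diff -u a.asm b.asm > gkrust-disassembly.diff
}}}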
Hm interesting! I wonder if this is perhaps related to https://github.com/rust-lang/rust/issues/52044? That claims it was fixed with the most recent LLVM upgrade. Are you able to reproduce the non-determinism on the most recent nightly?
Aha! That sounds promising and I certainly feel glandium's "This is driving me crazy", so this should be the issue then, right? ;)
That said, I compiled the nightly from 2018-07-13, which should contain the LLVM upgrade, and I can't reproduce the problem anymore. However, I can't reproduce it either when compiling the one from 2018-07-11, which should not contain the LLVM upgrade (it's based on commit e5f6498d3d5c9dac841009d7b49738923826af75). So, it seems the LLVM upgrade (alone) is not enough to explain this bug, or am I missing something?
Trying to figure out where all this started, I am pretty sure that 2018-02-15 is good and 2018-03-07 is bad.
Ah yeah for the LLVM 7 change we're gonna let that ride the trains (aka not backport). If you can temporarily use a nightly that'd work but otherwise this may have to wait until that's released.
Knowing for sure though what was causing the nondeterminism in LLVM would be great!
So, here is another interesting bit: While running the script (https://github.com/rust-lang/rust/issues/52044#issuecomment-402349038) on a Linux box does replicate the issue for a Linux build using the nightly from 2018-07-11, it does not replicate it using the nightly from 2018-03-07 (I ran the script a couple of times). On the other hand, using the nightly from 2018-03-07 to generate a Firefox build for macOS does show differences (while, as mentioned above, I seemingly can't reproduce the problem with the nightly from 2018-07-11 anymore when cross-compiling for macOS).
To test better, I adapted your repro script, adding a respective --target x86_64-apple-darwin, and with that setup it's easy to see the bug with the nightly from 2018-03-07.
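The adapted script looks roughly like this (reconstructed as a sketch from the trace quoted further down; test.rs stands in for the Rust snippet from the upstream issue):

{{{
#!/bin/sh -ex
# Sketch: repro script from rust-lang/rust#52044 with a macOS cross-target added.
for i in `seq 1 100`; do
  rm -rf a b
  mkdir a b
  rustc /dev/stdin -O -C lto -C panic=abort -C codegen-units=1 --emit llvm-ir,obj \
        --crate-type staticlib --out-dir a --target x86_64-apple-darwin < test.rs
  rustc /dev/stdin -O -C lto -C panic=abort -C codegen-units=1 --emit llvm-ir,obj \
        --crate-type staticlib --out-dir b --target x86_64-apple-darwin < test.rs
  a=$(md5sum a/stdin.ll | awk '{print $1}')
  b=$(md5sum b/stdin.ll | awk '{print $1}')
  if [ "$a" != "$b" ]; then
    echo "IR is different"
    exit 1
  fi
done
}}}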
So, to sum up, the problem is not just the LLVM version bump; it seems to be somewhat target-dependent, too.
Interesting, look what I get after trying to run the repro script for the macOS target after the bump to LLVM 7 (i.e. with the nightly from 2018-07-13):
{{{
+ for i in `seq 1 100`
+ rm -rf a b
+ mkdir a b
+ rustc /dev/stdin -O -C lto -C panic=abort -C codegen-units=1 --emit llvm-ir,obj --crate-type staticlib --out-dir a --target x86_64-apple-darwin
+ rustc /dev/stdin -O -C lto -C panic=abort -C codegen-units=1 --emit llvm-ir,obj --crate-type staticlib --out-dir b --target x86_64-apple-darwin
++ md5sum a/stdin.ll
++ awk '{print $1}'
+ a=d3665a14a5cf9bf1c4630c001e8d4dfc
++ md5sum b/stdin.ll
++ awk '{print $1}'
+ b=4d3336f9f9c5bc4b91653c1fc51d0bf6
+ '[' d3665a14a5cf9bf1c4630c001e8d4dfc '!=' 4d3336f9f9c5bc4b91653c1fc51d0bf6 ']'
+ echo IR is different
IR is different
+ exit 1
}}}
Linux is fine (which is probably what glandium found).
FWIW: I see different objects for macOS with the repro script when using the nightly from 2018-07-11, so, indeed, I did not do enough Firefox compilations to trigger the problem for macOS there.
Hm fascinating! That's a good data point towards "maybe fixed" in LLVM but only accidentally for one platform rather than across the board?
I actually think the issue glandium had and the one we have are not the same, but probably related: the macOS one is happening with the upgrade to LLVM 6 (I am about to start bisecting those changes), while Linux is unaffected by that. I'll look for clues for the latter as well, but macOS first.
Alright, so, the problematic commit is 6b7b6b63a928479a29d9fc1282e553e409c66934. I tried to bisect LLVM to find the bad revision. I compiled just LLVM separately and used --llvm-root to point to it during the rust build. It turns out that doing that, even with just the llvm code copied out of rust/src/llvm at commit 6b7b6b63a928479a29d9fc1282e553e409c66934, is fine. However, double-checking, building LLVM during the usual rust build (i.e. without --llvm-root) still reproduces the bug.
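For reference, the external-LLVM comparison described above amounts to roughly this setup (paths are illustrative; the full cmake flag set actually used is quoted in a later comment):

{{{
# Sketch: build the LLVM sources shipped with rust out of tree and point
# the rust build at the result instead of the in-tree LLVM.
cp -r rust/src/llvm llvm-src
mkdir llvm-src/build && cd llvm-src/build
cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/llvm-dist
make -j4 && make install
cd ../../rust
./configure --llvm-root=$HOME/llvm-dist    # reproducible; the in-tree LLVM build is not
}}}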
Thus, I can only think of two reasons causing the reproducibility issue:
1) The LLVM part is compiled differently within rust than the way I did it. The compiler used is the same, though; not sure what other flags could cause this. I am doing
Nice! Bisecting to the LLVM upgrade definitely makes sense to me. That was a massive LLVM upgrade though (from LLVM 4.0 to LLVM 6.0), so that would be quite the bisection range for a regression to be introduced in :(
If it works when you build LLVM yourself, though, that's quite curious. The command we use to build LLVM is pretty huge. Looking at one of our recent builds (https://travis-ci.org/rust-lang/rust/builds/411219658) the command we use on OSX is:
{{{
"cmake" "/Users/travis/build/rust-lang/rust/src/llvm" "-DLLVM_ENABLE_ASSERTIONS=OFF" "-DLLVM_TARGETS_TO_BUILD=X86;ARM;AArch64;Mips;PowerPC;SystemZ;MSP430;Sparc;NVPTX;Hexagon" "-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=WebAssembly;RISCV" "-DLLVM_INCLUDE_EXAMPLES=OFF" "-DLLVM_INCLUDE_TESTS=OFF" "-DLLVM_INCLUDE_DOCS=OFF" "-DLLVM_ENABLE_ZLIB=OFF" "-DWITH_POLLY=OFF" "-DLLVM_ENABLE_TERMINFO=OFF" "-DLLVM_ENABLE_LIBEDIT=OFF" "-DLLVM_ENABLE_LIBXML2=OFF" "-DLLVM_PARALLEL_COMPILE_JOBS=4" "-DLLVM_TARGET_ARCH=x86_64" "-DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-apple-darwin" "-DLLVM_OCAML_INSTALL_PATH=usr/lib/ocaml" "-DCMAKE_EXE_LINKER_FLAGS=-static-libstdc++" "-DCMAKE_C_COMPILER=sccache" "-DCMAKE_C_COMPILER_ARG1=/Users/travis/build/rust-lang/rust/clang+llvm-6.0.0-x86_64-apple-darwin/bin/clang" "-DCMAKE_CXX_COMPILER=sccache" "-DCMAKE_CXX_COMPILER_ARG1=/Users/travis/build/rust-lang/rust/clang+llvm-6.0.0-x86_64-apple-darwin/bin/clang++" "-DCMAKE_C_FLAGS=-ffunction-sections -fdata-sections -fPIC --target=x86_64-apple-darwin -stdlib=libc++" "-DCMAKE_CXX_FLAGS=-ffunction-sections -fdata-sections -fPIC --target=x86_64-apple-darwin -stdlib=libc++" "-DCMAKE_INSTALL_PREFIX=/Users/travis/build/rust-lang/rust/build/x86_64-apple-darwin/llvm" "-DCMAKE_BUILD_TYPE=Release"
}}}
I wonder if perhaps the way we compile LLVM is affecting this? Maybe some flag, or maybe our own compiler we use on automation is introducing bugs? Or maybe it has to do with the C++ standard library and which one is used?
Well, I don't know whether the problem we have is happening when using the "official" binaries. This is happening when cross-compiling rust ourselves for macOS. That said, I think I can exclude 1) from comment:33. I compiled LLVM with exactly the same arguments as used during the rust build (using the same runc container, compiler, etc.), which are:
{{{
cmake .. -G "Unix Makefiles" -DLLVM_ENABLE_ASSERTIONS=OFF -DLLVM_TARGETS_TO_BUILD="X86;ARM;AArch64;Mips;PowerPC;SystemZ;MSP430;Sparc;NVPTX;Hexagon" -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=WebAssembly -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_DOCS=OFF -DLLVM_ENABLE_ZLIB=OFF -DWITH_POLLY=OFF -DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_LIBEDIT=OFF -DLLVM_PARALLEL_COMPILE_JOBS=4 -DLLVM_TARGET_ARCH=x86_64 -DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-unknown-linux-gnu -DLLVM_LINK_LLVM_DYLIB=ON -DCMAKE_C_COMPILER=cc -DCMAKE_CXX_COMPILER=c++ -DCMAKE_C_FLAGS="-ffunction-sections -fdata-sections -fPIC -m64" -DCMAKE_CXX_FLAGS="-ffunction-sections -fdata-sections -fPIC -m64" -DCMAKE_INSTALL_PREFIX=$distdir -DCMAKE_BUILD_TYPE:String=Release -DLLVM_INSTALL_UTILS=on $LLVM_HOME
}}}
(`-DLLVM_INSTALL_UTILS=on` is only used when compiling LLVM outside of the rust compilation, but I doubt this makes a difference with respect to this bug.) I used that LLVM with `--llvm-root`, and the test script runs fine.

Which leaves 2). I reverted the `dlmalloc` and `libcompiler_builtins/compiler-rt` submodule updates, but compiling LLVM during the rust build still exhibits the failing test script. Thus, I think those updates do not cause this bug. I wonder what else is different between the in-tree LLVM compilation and the one outside of it (I am just tarring up src/llvm, excluding `.git`, and using that tarball for compilation)...
Not setting LLVM_RUSTLLVM does not change things, as you expected. So, I could do the LLVM bisecting with LLVM built during the rust compilation to give us a better hint at what is going wrong. However, I'd like to avoid that as it's probably a tedious pain in the rear, and while the LLVM upgrade is triggering the bug, it's probably nothing we should fix on the LLVM side. I'll think about something smarter and will go back to this bug after we get Tor Browser 8 out.
We might stick to just building LLVM outside of rust for macOS cross-compilation for the time being to work around this issue.
#27937 (moved) is a duplicate: We have similar issues on Linux even though it seems they are even harder to reproduce. I assume for now the underlying issue is the same for both platforms.
Trac:
Description: "Contrary to our testing results the .dmg bundles we produce for 8.0a9 are differing on two machines." changed to "Contrary to our testing results the .dmg bundles we produce for 8.0a9 are differing on two machines. And it turns out Linux bundles are affected as well (see: #27937 (moved) for the original Linux bug report)."
Summary: "ESR60-based .dmg images are not built reproducibly with Stylo enabled using rustc > 1.25.0" changed to "ESR60-based Tor Browser bundles are not built reproducibly with Stylo enabled using rustc > 1.25.0"
Priority: Very High changed to Immediate
There are additional pieces getting compiled in/used during the LLVM compilation done during the rust build that are causing the problem.
That's what's happening in this case. The relevant part is
{{{
if self.config.rust_codegen_units.is_none() &&
    self.build.is_rust_llvm(compiler.host) &&
    self.config.rust_thinlto
{
    cargo.env("RUSTC_THINLTO", "1");
} else if self.config.rust_codegen_units.is_none() {
    // Generally, if ThinLTO has been disabled for some reason, we
    // want to set the codegen units to 1. However, we shouldn't do
    // this if the option was specifically set by the user.
    cargo.env("RUSTC_CODEGEN_UNITS", "1");
}
}}}
in builder.rs.
For some reason RUSTC_THINLTO is only set if one does not specify an LLVM with --llvm-root, and that is the difference I hit. With it set, the reproducibility problem emerges (even for LLVMs provided by --llvm-root, if I rip out the `self.build.is_rust_llvm(compiler.host) &&` part); without it, it does not.
So, to sum up so far: compiling with -C lto is fine unless RUSTC_THINLTO is used when compiling rust for macOS. Or, to be more precise: unless RUSTC_THINLTO is used for the apple-target libstd and related libraries. I could not yet pinpoint the exact lib that is causing this issue, though (I can't easily replace them one by one, as otherwise rustc complains about libstd being in need of recompilation).
I guess more bisecting is next. :) Alex, do you think we could just avoid setting RUSTC_THINLTO for now when compiling the rust compiler? Or does that have any serious, known downsides?
Oh sorry about this, that's a good discovery! Those settings definitely engage a lot of LLVM infrastructure that's not otherwise engaged, which could help explain why something nondeterministic is coming out the other end.
The settings in bootstrap are pretty confusing, but what's happening here is either rustc is compiled with 16 codegen units (each crate turns into 16 object files) which are then optimized as a set with ThinLTO or each compiler crate is compiled into one object file and no ThinLTO is used.
To clarify, is the compiler's own binary nondeterministic when the compiler's crates are built with 16 CGUs + ThinLTO? Or is the compiler's binary deterministic but its output nondeterministic?
In terms of impact of these settings:
* 1 CGU builds are basically equivalent to 16 CGUs + ThinLTO. The 1 CGU build is slower to compile the compiler (fewer opportunities for parallelism), but is likely 1%-ish faster than the 16 CGUs + ThinLTO one.
* 16 CGUs + ThinLTO is the default setting (as you've found).
* If 16 CGUs are used without ThinLTO, the resulting compiler is probably horrendously slow (lots of missed inlining opportunities).
Or put another way, if you disable ThinLTO you'll want to be sure to also compile with one codegen unit, which should happen in rustbuild via the above block.
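In rustc-flag terms, the two sane configurations correspond roughly to the following invocations (flag spellings are the current ones and are only illustrative; the bootstrap of that era drove this through the RUSTC_THINLTO and RUSTC_CODEGEN_UNITS environment variables instead):

{{{
# 16 codegen units per crate, recombined and cross-optimized with ThinLTO (the default)
rustc -O -C codegen-units=16 -C lto=thin some_crate.rs

# one codegen unit per crate, no ThinLTO needed
rustc -O -C codegen-units=1 some_crate.rs
}}}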
When the compiler's crates (in fact just the macOS libstd + the code it depends on) are built with 16 CGUs + ThinLTO, its output is non-deterministic, but only if I use -C lto (if I drop -C lto all is fine, too). The binary might be non-deterministic as well; I have not checked. We are currently not concerned with getting a reproducibly built rust compiler. Right now, just the output that compiler gives us matters.
Okay, thanks for those explanations, really helpful.
One way I might be able to still help as well is narrowing down where the nondeterminism is introduced. If you use -C save-temps when compiling, the compiler should spew dozens of files all over the place. Each of these files in theory represents the various stages of compilation and provides snapshots into the compiler's pipeline. If you could find the set of files that are nondeterministic (we know it's at least the object files!), then that may help narrow this down as well!
If a 16 CGU libstd + full crate LTO is the issue it sounds like this may be an issue with the LLVM "linker", but that's just a guess!
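A rough sketch of that suggestion, building on the repro script above (flags and file names are illustrative):

{{{
# Sketch: run the same -C lto compilation twice with -C save-temps and diff
# the intermediate artifacts to find the first stage that is nondeterministic.
rustc test.rs -O -C lto -C codegen-units=1 -C save-temps --emit obj \
      --crate-type staticlib --out-dir a --target x86_64-apple-darwin
rustc test.rs -O -C lto -C codegen-units=1 -C save-temps --emit obj \
      --crate-type staticlib --out-dir b --target x86_64-apple-darwin
diff -rq a b    # shows which intermediate .bc/.o files already differ
}}}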
I have a patch that sets codegen-units to 1, which seems to be the way to disable ThinLTO. Alas, there is no direct option for disabling LTO yet in 1.26.1. Moreover, using the --set configure option to adjust the codegen-units does not seem to work, as the value gets interpreted as a string but we need an integer. So, I resorted to good ol' sed.
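The sed workaround is essentially the following (the config file name and key are illustrative; the actual patch lands this in our rust build configuration rather than running it by hand):

{{{
# Sketch: force codegen-units = 1 in the rustbuild config so ThinLTO is not used.
sed -i 's/^#\?codegen-units = .*/codegen-units = 1/' config.toml
}}}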
FWIW: I am actually not sure whether I solved #27937 (moved) with the patch, as I was never able to reproduce the Linux issue. Looking at the diff, my best guess is that it might not be panic-related but still LTO-related. The idea is to have the patch up for review covering both cases.
Glad that ended up working out! From my testing oh-so-long-ago at this point I suspected this was an LLVM issue, and disabling ThinLTO effectively serializes all of LLVM's work, so I wonder if this is a race somewhere in LLVM between some shared data structure across threads which affects ordering...
Maybe. I am still bisecting LLVM (both what introduced the issue and what fixed it, as the latter is not clear either), which is... non-trivial. I'll update this ticket with my findings.
After doing two separate testbuilds using this branch, I got two non-matching TorBrowser-8.5a3-osx64_en-US.dmg files. After checking the content of the dmg files, only the snowflake files are different, so this is probably #27827 (moved). So it seems the issue from this ticket is fixed.
If you want to compare with your build, those mar-tools-mac64.zip files have been generated with make testbuild on branch bug_26475_v2, with testbuild configured to do alpha builds.
Thanks for the review and, Alex, thanks for your help! I merged the patch to master (commit 0166d3d70e81c6c1072e03f6b950c8b2eb181343), marking it for backport as we want Stylo enabled for macOS and want to potentially fix the reproducibility issue on Linux as well.
I'll update this ticket with my bisecting results once I am done, so that we have a chance to understand what actually went wrong and how it got fixed later on.
Finally, after bisecting through compiler crashes (aka https://bugs.llvm.org/show_bug.cgi?id=33917), too, I got the problematic commit. It is r304593 (together with r301649). Let me know if that does not make sense at first glance; I am happy to help figure out what is going wrong here. :)
That said, I'll start figuring out why the issue vanished, as I could not reproduce it anymore with a nightly from mid-September (I found that different IRs showed up after the switch to LLVM 7, see comment:28, which replaces the issue in this bug with another reproducibility problem).
The first commit you mentioned - https://reviews.llvm.org/D33320 - looks like it's just some minor renamings? The second though - https://reviews.llvm.org/D32653 - definitely looks more suspicious. Since the second commit is turning a flag on by default, could you take a "good LLVM" just before that commit, turning the flag on, and seeing if it has the same reproducibility issue?
Otherwise, do you have a standalone LLVM test case you were testing with? Or was it largely always through rustc? I don't mind helping out to file a bug in LLVM!
Oops, right. That's the last good commit. The problem is r304594 (which is enabling r301649).
The second though - https://reviews.llvm.org/D32653 - definitely looks more suspicious. Since the second commit is turning a flag on by default, could you take a "good LLVM" just before that commit, turning the flag on, and seeing if it has the same reproducibility issue?
I can do that with r304593 (i.e. the last good commit), but given that r304594 is just flipping that feature on by default (and removing unused code), I doubt this will give us new insights. Let me know, though, if you think otherwise.
Otherwise, do you have a standalone LLVM test case you were testing with? Or was it largely always through rustc? I don't mind helping out to file a bug in LLVM!
Ok so just to make sure I understand, LLVM is completely deterministic up to and including r304593 - https://reviews.llvm.org/D33320. When you go one more commit to r304594 - https://reviews.llvm.org/D32653 - this LLVM is no longer deterministic. The commit in question here that makes LLVM nondeterministic is enabling r301649 - https://reviews.llvm.org/D31085 - a heuristic for something.
This was all tested with an adapted script, where you're compiling a fixed version of rustc against a varying version of LLVM. The rustc linked with LLVM from r304593 is deterministic and the rustc linked with LLVM from r304594 is nondeterministic. The script is then an adaptation of the comment you mentioned.
Does that all sound right? If so I think that's definitely enough to open an issue on LLVM itself and try to get the ball rolling there! It'd be maximally useful to have a standalone test case, but I sort of suspect this may not have an easy standalone test case and is related to how rustc is using LLVM on multiple threads internally, which LLVM's CLI tools don't do.
Oh so if that's all true, another question, what version of rustc is this using?
And additionally, y'all are seeing reproducibility issues on the current nightly release of Rust, right? (latest LLVM we're using as well as latest rustc). IIRC though those issues haven't been reduced yet and it's suspected that this one is the cause?
Ok so just to make sure I understand, LLVM is completely deterministic up to and including r304593 - https://reviews.llvm.org/D33320. When you go one more commit to r304594 - https://reviews.llvm.org/D32653 - this LLVM is no longer deterministic. The commit in question here that makes LLVM nondeterministic is enabling r301649 - https://reviews.llvm.org/D31085 - a heuristic for something.
Yes.
This was all tested with an adapted script, where you're compiling a fixed version of rustc against a varying version of LLVM. The rustc linked with LLVM from r304593 is deterministic and the rustc linked with LLVM form r304594 is nondeterministic. The script is then an adaptation of the comment you mentioned.
Oh so if that's all true, another question, what version of rustc is this using?
The one at 6b7b6b63a928479a29d9fc1282e553e409c66934 with a bunch of patches for the LLVM bisecting.
And additionally, y'all are seeing reproducibility issues on the current nightly release of Rust, right? (latest LLVM we're using as well as latest rustc). IIRC though those issues haven't been reduced yet and it's suspected that this one is the cause?
I think we could be lucky in that it got fixed more or less recently during the upgrade to LLVM 7. The result is not conclusive yet, as another reproducibility issue popped up (see comment:28).
So, I'd suggest holding off on filing an LLVM bug until I have bisected that other issue.