Identifying the cause of a non-deterministic build is a challenge on every platform, but on Android it is further complicated by Fenix's layered architecture, in which every layer is under active development.
We will use tor-browser-build#40485 while describing how to investigate and resolve this problem.
Needed Tools:
- `unzip`, provided by the `unzip` package on Debian
- `apktool`, provided by the `apktool` package on Debian
- a command for creating a collision-resistant hash of files
  - this guide will use `sha256sum` and `sha256deep`
  - on Debian-based systems, these are available in `coreutils` and `hashdeep`, respectively
- (optionally) `find`, provided by `findutils` on Debian
- (optionally) `diffoscope`, provided by `diffoscope` on Debian
Note: `diffoscope` is a powerful all-in-one solution for diffing two archives, but this guide follows a more manual process at each stage for finer-grained control over the investigation. Feel free to use `diffoscope` if you feel comfortable skipping some of the more manual steps and want a tool that tells you exactly how two archives differ.
### Step 0: Obtain two differing build packages
The initial step is obtaining two packages that were built from the same source code but differ in some way. Often at least one of these packages is built by another person, but you may create both packages yourself if building the same code at different times results in differing packages.
### Step 1: Identify which files differ within the package, or if the archives themselves are a problem
We have seen [situations](https://gitlab.torproject.org/tpo/applications/tor-browser-build/-/issues/32360) where ZIP archives are created non-deterministically. Please amend this guide with additional steps regarding solving this issue.
Identifying non-matching elements within the apk is as simple as extracting the files within the apks (or aars) and comparing them.
The use of `$PWD` is important because `sha256deep` includes the absolute file path in its output, so two outputs of `sha256deep` taken in different directories will always differ unless that prefix is accounted for. Replace `$PWD` with something different, if you prefer.
Some of the differing files can be ignored because they are derived from other files: if all of the input files are identical, then these derived files should be identical as well. However, we can see that the following two files differ:
- `unzipped/classes2.dex`
- `unzipped/classes.dex`
Therefore, we start our investigation with those. If none of the other files differ, then begin looking closer at the content and file formats of the differing files - possibly using `xxd` or `hexdump`.
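The comparison in this step can also be sketched with only `find` and `sha256sum`, as an alternative to `sha256deep`. Hashing from inside each extracted directory yields relative paths, so no absolute prefix appears in the output. The function names `hash_tree` and `differing_files` are hypothetical helpers for illustration and are not part of tor-browser-build.

```shell
# Hash every file below a directory. Paths are printed relative to that
# directory, so trees extracted in different locations compare cleanly.
hash_tree() {
    (cd "$1" && find . -type f -exec sha256sum {} + | LC_ALL=C sort -k 2)
}

# Print the relative paths whose hashes differ between two trees,
# or that exist in only one of them.
differing_files() {
    # A matching file produces the same "hash  path" line in both listings;
    # `uniq -u` keeps only the lines that appear exactly once.
    # sha256sum prints 64 hex digits plus two separator characters,
    # so the path starts at column 67.
    { hash_tree "$1"; hash_tree "$2"; } \
        | LC_ALL=C sort | uniq -u | cut -c 67- | LC_ALL=C sort -u
}
```

For example, after extracting each package with something like `unzip -q first.apk -d 1/unzipped` (file names are placeholders), `differing_files 1/unzipped 2/unzipped` lists the files to investigate.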
`.dex` files are Android's byte code object files. `apktool` provides a way to "decompile" these into a more readable format.
```
$ cd 11.5a10-build2/1/
$ mkdir decompiled
$ cd decompiled
$ apktool d ../tor-browser-11.5a10-android-x86_64-multi-qa.apk
```
While not immediately obvious, looking at this diff we see that the content is identical, but the location within the files of the content is different. For example, `HOMESCREEN_BANNER` is on line 53 in one file, but on line 75 in another file.
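One quick way to confirm that two files hold the same content in a different order is to compare them after sorting their lines. This is a minimal sketch: `same_lines_any_order` and `classify_diff` are hypothetical helpers, and the arguments are whichever pair of decompiled files you are inspecting.

```shell
# Succeed when two files contain exactly the same lines, in any order.
same_lines_any_order() {
    a=$(LC_ALL=C sort "$1" | sha256sum)
    b=$(LC_ALL=C sort "$2" | sha256sum)
    [ "$a" = "$b" ]
}

# Classify a pair of files as identical, reordered, or different.
classify_diff() {
    if cmp -s "$1" "$2"; then
        echo identical
    elif same_lines_any_order "$1" "$2"; then
        echo reordered
    else
        echo different
    fi
}
```

A result of `reordered` points towards an ordering problem in whatever generated the file, while `different` means the content itself changed.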
`11.5a10-build2` was non-deterministic due to *both* of these issues in two different components. Looking at the smali files that don't match, we see two groupings: Nimbus and Glean. Nimbus is part of [application-services](https://github.com/mozilla/application-services.git) and [Glean](https://github.com/mozilla/glean.git) is a standalone project. However, the situation is even more complicated than it seems.
#### Nimbus Investigation
Let's start with Nimbus. As above, we have these four files:
1. If you diff each of these files, you see that they all have the same content, but they are in a different order. This likely means the code is generated somehow.
2. These files are in the `org.mozilla.fenix.nimbus` namespace, which means that we need to look in Fenix for more information about them: they are not in the Nimbus repository. If you don't know where to find these files, then `git grep` is your friend, but keep in mind that if these files are generated code then you may not find any files in the fenix repository matching these class names. Therefore, you may need to be creative when searching for the source of these files: try different name mangling techniques. For example, if we look for `HomeScreenSection`, try searching for:
   - `HomeScreenSection`
   - `homeScreenSection`
   - `Home-Screen-Section`
   - `Home-screen-section`
   - `Home_Screen_Section`
   - `home_screen_section`
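These variants can be folded into a single case-insensitive search by allowing an optional `-` or `_` at each word boundary. This is a minimal sketch: `mangled_pattern` and `search_mangled` are hypothetical helper names, and the `sed` expression assumes a CamelCase input without runs of capitals (it would not split `URLParser`).

```shell
# Turn a CamelCase name into an extended regex that tolerates "-" or "_"
# between its words.
mangled_pattern() {
    printf '%s\n' "$1" | sed -E 's/([a-z0-9])([A-Z])/\1[-_]?\2/g'
}

# Search the current git checkout for any of the mangled spellings,
# ignoring case.
search_mangled() {
    git grep -n -i -E "$(mangled_pattern "$1")"
}
```

Here `mangled_pattern HomeScreenSection` prints `Home[-_]?Screen[-_]?Section`, and `search_mangled HomeScreenSection` run inside the fenix checkout would match every spelling in the list.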
Using case insensitive search (`git grep -i`) helps, too. In this case, the class name was not mangled:
`nimbus.fml.yaml` seems like it is the important file here. Indeed, that is the file that declares the `HomeScreenSection` type. Next, find how this file is consumed:
We see it is consumed within `app/build.gradle` and there appears to be a gradle plugin for this: `org.mozilla.components.nimbus-gradle-plugin`
Finding the source code for this is not always trivial and may require some searching on your favorite search engine and asking Mozilla folks. In this case, we can find the source code in the `application-services` repo under [components/support/nimbus-fml/](https://github.com/mozilla/application-services/tree/main/components/support/nimbus-fml). The most important observation to remember while you look through this code is that you are looking at code-generating code. In particular, this means that you must look for how data is input into this program and how the result is output from it. From past experience, we know that the choice of data structure plays a significant role in deterministic results, so look for code that parses data and stores it in data structures with non-deterministic iteration order, such as Rust's `HashMap`. For example, in `nimbus-fml` we see the use of `HashMap` in its [parser.rs](https://github.com/mozilla/application-services/blob/bf2bce239ab13e9216e6f5190d2bb2c3771f6049/components/support/nimbus-fml/src/parser.rs#L27).
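A rough way to build the list of candidates is to search the Rust sources for hash-based collections. This sketch assumes a grep that supports `--include` (GNU and BSD grep both do). `find_hash_collections` is a hypothetical helper, and including `HashSet` is an addition here, on the grounds that its iteration order is unspecified in the same way.

```shell
# List uses of hash-based collections in the Rust sources below a directory.
# -w matches whole words only, so identifiers such as MyHashMapper are skipped.
find_hash_collections() {
    grep -rnwE --include='*.rs' 'HashMap|HashSet' "$1"
}
```

Not every hit matters: only collections that are iterated while producing output can change the ordering of the generated code.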
In #40420 we found that `BTreeMap` is a reasonable deterministic replacement for `HashMap`. Next, walk through the parsing and code generation logic and replace instances of `HashMap` with `BTreeMap` where iterating over the data structure may cause different ordering between the input file and output file. Create a patch and test it in tor-browser-build:
1. Copy the patch into the correct project (e.g., `fenix`), let's call it `bugXXXXX.patch`
1. Add the patch file as an input file of the project
   1. `- filename: bugXXXXX.patch`
1. Apply the patch in `build`, likely close to where the compile/build command is executed
   1. `patch -p1 < $rootdir/bugXXXXX.patch`
If it compiles, then back up the results and run the build again. Compare the resulting packages and investigate further if differences remain.
#### Glean Investigation
Glean is a little trickier because it is a dependency at multiple layers of the architecture and because we don't build it ourselves.
This looks like a timestamp issue created at build time. From prior knowledge of Glean, we know that its code generation is contained in a separate repository: [glean_parser](https://github.com/mozilla/glean_parser). Searching that repo for `BuildInfo` reveals:
```
$ git grep -n BuildInfo
CHANGELOG.md:74:- For Kotlin skip generating `GleanBuildInfo.kt` when requested (with `with_buildinfo=false`) ([#341](https://github.com/mozilla/glean_parser/pull/341))
glean_parser/kotlin.py:272: - `with_buildinfo`: If "true" a `GleanBuildInfo.kt` file is generated.
glean_parser/kotlin.py:299: with (output_dir / "GleanBuildInfo.kt").open("w", encoding="utf-8") as fd:
glean_parser/swift.py:137:class BuildInfo:
glean_parser/swift.py:187: - with_buildinfo: If "true" the `GleanBuildInfo` is generated.
tests/test_swift.py:53: assert "BuildInfo(buildDate:" in content
tests/test_swift.py:100: assert "BuildInfo(buildDate:" in content
```
`glean_parser/templates/kotlin.buildinfo.jinja2` seems to be the definition (e.g., `internal object GleanBuildInfo {`). Indeed, we can see [buildDate = {{ build_date }}](https://github.com/mozilla/glean_parser/blob/main/glean_parser/templates/kotlin.buildinfo.jinja2#L28). Next, we find where that value is created. We can take at least two routes:
1. search for it and try walking the code paths leading here
1. find the patch(set) that introduced this line and work forward from there
We already discussed the first option earlier, so let's try the second option. We can try using `git blame` to identify the commit that introduced this line, but that won't always be successful because the line may have been modified by later commits. Feel free to follow that path if you feel comfortable with it.
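Git's pickaxe search (`git log -S`) lists the commits that changed the number of occurrences of a string, which makes it more robust than `git blame` against later edits to the same line. The sketch below assumes you are inside a checkout of `glean_parser`. `commits_touching_string` is a hypothetical helper, and this is one way to run the search, not necessarily the command used in the original investigation.

```shell
# List the commits that added or removed occurrences of a string in one file.
# $1 is the string to look for, $2 is the path to restrict the search to.
commits_touching_string() {
    git log --oneline -S"$1" -- "$2"
}
```

For this investigation the call would be `commits_touching_string build_date glean_parser/templates/kotlin.buildinfo.jinja2`.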
This shows us all of the commits that contained `build_date` in the template file. In this case, there was only one commit - so `git blame` would've worked just as well.
At this point, read through the commit and understand how the timestamp is generated and investigate paths to providing or injecting your own timestamp. In this specific case Mozilla explicitly gives us a solution to this problem:
```diff
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,16 @@
 - Support global file-level tags in metrics.yaml ([bug 1745283](https://bugzilla.mozilla.org/show_bug.cgi?id=1745283))
 - Glinter: Reject metric files if they use `unit` by mistake. It should be `time_unit` ([#432](https://github.com/mozilla/glean_parser/pull/432)).
+- Automatically generate a build date when generating build info ([#431](https://github.com/mozilla/glean_parser/pull/431)).
+ Enabled for Kotlin and Swift.
+ This can be changed with the `build_date` command line option.
+ `build_date=0` will use a static unix epoch time.
+ `build_date=2022-01-03T17:30:00` will parse the ISO8601 string to use (as a UTC timestamp).
```
Next, we must find how we can use this information. This may take some brute-force searching: search for `build_date` in all of the projects that consume, use, or depend on `glean_parser`. Here we see that `glean` uses `build_date` in some way:
`samples/android/app/src/androidTest/java/org/mozilla/samples/gleancore/pings/BaselinePingTest.kt:51: var buildDate = clientInfo.getString("build_date")`
`gradle-plugin` seems interesting and relevant because that is the package that `fenix` depends on (in `app/build.gradle`: `apply plugin: "org.mozilla.telemetry.glean-gradle-plugin"`). Looking at [gradle-plugin/src/main/groovy/mozilla/telemetry/glean-gradle-plugin/GleanGradlePlugin.groovy](https://github.com/mozilla/glean/blob/b213cdd168b6d7b559eaf7a65fc573062482e239/gradle-plugin/src/main/groovy/mozilla/telemetry/glean-gradle-plugin/GleanGradlePlugin.groovy#L219) :
```
// For applications check if they overwrote the build date.
```
Finally, in `glean`'s [sample app](https://github.com/mozilla/glean/blob/b213cdd168b6d7b559eaf7a65fc573062482e239/samples/android/app/build.gradle#L61) we see how they inject this in a `build.gradle`:
```
// Fixed build date so we can test for it
ext.gleanBuildDate = "2020-11-06T11:30:50+00:00"
```
Now, we can add something like this in `fenix`'s `build.gradle` and try resolving the issue.
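If the build already has a fixed timestamp available as a Unix epoch, it can be converted into the ISO8601 form shown above. This is a minimal sketch assuming GNU `date` (the `-d @<epoch>` syntax is not portable to BSD `date`). `iso_build_date` is a hypothetical helper, and how the resulting value reaches `build.gradle` is left open.

```shell
# Convert a Unix epoch (seconds) into a UTC ISO8601 timestamp with an
# explicit +00:00 offset, matching the format of the sample app's value.
iso_build_date() {
    date -u -d "@$1" '+%Y-%m-%dT%H:%M:%S+00:00'
}
```

For example, `iso_build_date 0` prints `1970-01-01T00:00:00+00:00`. Any constant value works for reproducibility, as long as every builder uses the same one.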