hashx: Performance improvements for program generation
The Rust implementation of Equi-X is similar to the C implementation in solve performance, but about 40% slower at verification. The applicable bottleneck is the HashX program generator implementation. This MR is an optimization pass against that bottleneck, with a few incidental fixes included where it was convenient.
Pairs well with the cachegrind benchmark from !1513 (merged). I'll give results from the Criterion benchmark here in wallclock time though, since that's ultimately more relevant than cache footprint.
Results vary by CPU type. The current approach with a lot of inlining works pretty well on high-end CPUs with a lot of cache, since we don't pay a penalty for the large code size and the savings on inter-procedure glue are quite good. For example, I saw better results in wallclock time on my Intel i5-2500 by inlining the entire Rng, but if we were optimizing for systems with less L1 cache it would be better to mark the Rng entry points as never-inline.
Some hashx benchmarks comparing before and after this patch stack on three CPU types:
- aarch64, Qualcomm SDM845 @ 1.766GHz
generate/generate-interp
time: [94.484 µs 94.522 µs 94.562 µs]
change: [-4.0463% -3.9576% -3.8729%] (p = 0.00 < 0.05)
Performance has improved.
generate/generate-aarch64
time: [111.83 µs 111.86 µs 111.89 µs]
change: [-5.1945% -4.8115% -4.3872%] (p = 0.00 < 0.05)
Performance has improved.
- x86_64, Intel i5-2500 CPU @ 3.30GHz
generate/generate-interp
time: [61.817 µs 61.886 µs 61.964 µs]
change: [-16.796% -16.329% -15.929%] (p = 0.00 < 0.05)
Performance has improved.
generate/generate-x86_64
time: [78.018 µs 78.214 µs 78.478 µs]
change: [-15.474% -14.711% -13.894%] (p = 0.00 < 0.05)
Performance has improved.
- x86_64, Intel i9-7900X CPU @ 3.30GHz
generate/generate-interp
time: [63.052 µs 63.139 µs 63.250 µs]
change: [-19.550% -19.422% -19.285%] (p = 0.00 < 0.05)
Performance has improved.
generate/generate-x86_64
time: [77.429 µs 77.682 µs 77.983 µs]
change: [-17.433% -17.048% -16.721%] (p = 0.00 < 0.05)
Performance has improved.
And an Equi-X benchmark, from the latter CPU, to compare before/after as well as showing how far we are from the C implementation now:
verify/interp-verify time: [87.916 µs 88.142 µs 88.417 µs]
change: [-12.485% -12.075% -11.688%] (p = 0.00 < 0.05)
Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
3 (3.00%) low mild
3 (3.00%) high mild
10 (10.00%) high severe
verify/interp-verify-c time: [69.555 µs 69.756 µs 70.081 µs]
change: [+0.5184% +1.0026% +1.5766%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
verify/interp-verify-c-reuse
time: [70.774 µs 70.949 µs 71.118 µs]
change: [-0.0992% +0.3256% +0.7107%] (p = 0.13 > 0.05)
No change in performance detected.
verify/x86_64-verify time: [83.616 µs 83.853 µs 84.108 µs]
change: [-13.792% -13.506% -13.143%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
4 (4.00%) high mild
1 (1.00%) high severe
verify/x86_64-verify-c time: [60.784 µs 60.921 µs 61.087 µs]
change: [-0.8599% -0.5826% -0.3230%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
5 (5.00%) high mild
5 (5.00%) high severe
verify/x86_64-verify-c-reuse
time: [53.673 µs 53.797 µs 53.967 µs]
change: [-2.7278% -1.2479% -0.0823%] (p = 0.06 > 0.05)
No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
3 (3.00%) high mild
10 (10.00%) high severe