Flesh out arti-bench a bit more to make it suitable for long-term tracking

There are a few features I think are needed for arti-bench to be good enough that we can use it for release-to-release tracking:

support for doing multiple tests in parallel
support for writing results out in a machine-readable format (JSON, or something similarly standard), together with the test parameters and the arti-bench version
maybe standardising the results somehow, for example by running the benchmarks in CI?
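
To make the first two items more concrete, here is a rough sketch of what parallel runs plus a JSON record could look like. It is not based on the current arti-bench code: `BenchParams`, `BenchResult`, `run_one`, `run_parallel`, `to_json`, and all the JSON field names are hypothetical, and the workload is a stand-in rather than a real Tor transfer. It only uses the standard library so that it stands alone; a real implementation would more likely derive the serialisation with a crate such as `serde_json`.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical parameters for one benchmark run (field names are assumptions).
#[derive(Clone, Debug)]
pub struct BenchParams {
    pub name: String,
    pub payload_bytes: usize,
    pub samples: usize,
}

/// Hypothetical result: the parameters used, plus one timing per sample.
#[derive(Debug)]
pub struct BenchResult {
    pub params: BenchParams,
    pub elapsed: Vec<Duration>,
}

/// Stand-in workload: allocates and sums a buffer instead of doing a real transfer.
fn run_one(params: &BenchParams) -> BenchResult {
    let mut elapsed = Vec::with_capacity(params.samples);
    for _ in 0..params.samples {
        let start = Instant::now();
        let buf = vec![0u8; params.payload_bytes];
        let sum: u64 = buf.iter().map(|b| u64::from(*b)).sum();
        std::hint::black_box(sum); // keep the optimiser from removing the work
        elapsed.push(start.elapsed());
    }
    BenchResult { params: params.clone(), elapsed }
}

/// Runs every parameter set on its own thread; results keep the input order.
pub fn run_parallel(all: Vec<BenchParams>) -> Vec<BenchResult> {
    let handles: Vec<_> = all
        .into_iter()
        .map(|p| thread::spawn(move || run_one(&p)))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("benchmark thread panicked"))
        .collect()
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Serialises the tool version, test parameters, and timings into one JSON object.
/// The version is passed in; under Cargo it could come from env!("CARGO_PKG_VERSION").
pub fn to_json(tool_version: &str, results: &[BenchResult]) -> String {
    let runs: Vec<String> = results
        .iter()
        .map(|r| {
            let micros: Vec<String> =
                r.elapsed.iter().map(|d| d.as_micros().to_string()).collect();
            format!(
                "{{\"name\":{},\"payload_bytes\":{},\"samples\":{},\"elapsed_us\":[{}]}}",
                json_string(&r.params.name),
                r.params.payload_bytes,
                r.params.samples,
                micros.join(",")
            )
        })
        .collect();
    format!(
        "{{\"arti_bench_version\":{},\"runs\":[{}]}}",
        json_string(tool_version),
        runs.join(",")
    )
}
```

Calling `run_parallel` with a list of `BenchParams` and passing the results to `to_json` gives one JSON object per invocation, which could be stored per release and compared later. One caveat worth keeping in mind: tests that run in parallel on the same machine may perturb each other's timings, so whether a run was parallel probably ought to be recorded alongside the other parameters.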