Flesh out arti-bench a bit more to make it suitable for long-term tracking

I think arti-bench needs a few more features before it is good enough to use for release-to-release tracking:

  • support for running multiple tests in parallel
  • support for writing results out in a machine-readable format (JSON or similar, ideally something standard), together with the test parameters and the arti-bench version
  • maybe some way to standardise the results so they are comparable across releases, for example by running them in CI?
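
To make the first two items a bit more concrete, here is a rough sketch of what they might look like. Everything in it is an assumption for illustration and not existing arti-bench code: the `BenchParams` and `BenchResult` types, their field names, the `run_parallel` helper, and the JSON layout are all hypothetical. It also reads "in parallel" as "one thread per test", which is only one possible interpretation. The JSON is built by hand to keep the sketch free of dependencies; a real implementation would more likely use a serialisation crate such as serde_json.

```rust
use std::thread;
use std::time::Instant;

/// Hypothetical parameters for one benchmark (field names are illustrative).
#[derive(Clone, Debug)]
struct BenchParams {
    name: String,
    payload_bytes: u64,
    samples: u32,
}

/// Hypothetical result record: the parameters, the tool version that
/// produced the numbers, and one wall-clock timing per sample.
#[derive(Debug)]
struct BenchResult {
    params: BenchParams,
    tool_version: String,
    elapsed_secs: Vec<f64>,
}

/// Minimal string escaping for the hand-rolled JSON below.
/// Only handles backslashes and double quotes, not control characters.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

impl BenchResult {
    /// Serialise the record as a single JSON object (assumed layout).
    fn to_json(&self) -> String {
        let timings: Vec<String> = self
            .elapsed_secs
            .iter()
            .map(|t| format!("{:.6}", t))
            .collect();
        format!(
            "{{\"tool_version\":\"{}\",\"name\":\"{}\",\"payload_bytes\":{},\"samples\":{},\"elapsed_secs\":[{}]}}",
            escape(&self.tool_version),
            escape(&self.params.name),
            self.params.payload_bytes,
            self.params.samples,
            timings.join(",")
        )
    }
}

/// Run every job on its own thread, timing `work` once per sample.
/// Results come back in the same order as `jobs`.
fn run_parallel(
    jobs: Vec<BenchParams>,
    version: &str,
    work: fn(&BenchParams),
) -> Vec<BenchResult> {
    let handles: Vec<_> = jobs
        .into_iter()
        .map(|params| {
            let tool_version = version.to_string();
            thread::spawn(move || {
                let elapsed_secs: Vec<f64> = (0..params.samples)
                    .map(|_| {
                        let start = Instant::now();
                        work(&params);
                        start.elapsed().as_secs_f64()
                    })
                    .collect();
                BenchResult { params, tool_version, elapsed_secs }
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("benchmark thread panicked"))
        .collect()
}
```

A caller would build a `Vec<BenchParams>`, pass it to `run_parallel` along with a version string (for example `env!("CARGO_PKG_VERSION")`) and a function that performs one transfer, then write each `to_json()` line to a file. One caveat with this approach: tests that share a network path will compete for bandwidth when run concurrently, so the timings are only comparable between runs that use the same set of parallel jobs.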