arti-bench: do not allocate individual receive buffers for every receiver
When we're running a huge number of arti-bench timing cases in parallel, it's wasteful to allocate (e.g.) 10MiB for each test case just for the receive buffer in run_timing.  It would probably be better to allocate at most 4-16 KiB in run_timing, and then to read the data in chunks of that size.
We should still compare what we receive against the data we expected to receive.
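
A minimal sketch of the idea, using only the standard library's blocking `Read` trait rather than whatever stream type run_timing actually uses. The helper name `read_and_compare` and its parameters are hypothetical and are not taken from arti-bench; the point is only that one small scratch buffer is reused for every read, and each chunk is checked against the matching slice of the expected payload as it arrives.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of arti-bench): read `expected.len()` bytes
/// from `reader` through a small reusable buffer, comparing each chunk
/// against the expected payload as it arrives.
///
/// Returns Ok(true) if every byte matched, and Ok(false) on a mismatch or if
/// the stream ended early.
fn read_and_compare<R: Read>(
    mut reader: R,
    expected: &[u8],
    chunk_size: usize,
) -> io::Result<bool> {
    assert!(chunk_size > 0, "chunk_size must be nonzero");

    // One small scratch buffer, instead of one as large as the payload.
    let mut buf = vec![0u8; chunk_size];
    let mut offset = 0;
    let mut matches = true;

    while offset < expected.len() {
        // Never ask for more than is left of the expected payload.
        let want = chunk_size.min(expected.len() - offset);
        let n = reader.read(&mut buf[..want])?;
        if n == 0 {
            // End of stream before the full payload arrived.
            return Ok(false);
        }
        if buf[..n] != expected[offset..offset + n] {
            // Remember the mismatch, but keep draining the stream so that a
            // timing measurement would still cover the whole transfer.
            matches = false;
        }
        offset += n;
    }
    Ok(matches)
}
```

With this shape, the memory used per receiver is bounded by `chunk_size` (for example 16 * 1024) no matter how large the payload is. Continuing to read after a mismatch is a design choice made for this sketch; returning early would be equally reasonable if the timing result is going to be discarded anyway.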
Found while doing #87.