Analyze unusual distribution of time to extend to first hop in circuit

I spent some time looking at OnionPerf measurements today. I found something that I did not expect: It seems like the time required to build the first hop in a circuit has a huge variance and rather unusual distribution. I'll attach a graph shortly that visualizes that.

Looking at that graph, we can see a few things:

The three OnionPerf instances have very different performance regarding circuit extension. (I checked the earlier instances op-hk, op-nl, and op-us, and they had the same characteristics.)
There are huge plateaus for op-us2 and op-nl2 in the first hop graph where some circuits have been successfully extended and others not. Typically, we'd expect a distribution like op-nl2's, just pulled to the right. But that's not the case here. The blue line is special at around 0.6 seconds and the red line at around 0.9 seconds. In fact, the green line is also a bit special at around 0.3 seconds when it almost flattens, only to increase linearly until it reaches 100% at around 0.6 seconds.
If we assume that the U.S. and Hong Kong are simply far away from many relays in a geographical sense, that doesn't explain why extending to the middle node and to the exit goes relatively fast even for those two hosts. Keep in mind that extending to the second hop requires a round-trip to the first hop, and that extending to the third hop requires a round-trip to the first and the second hop.

What's going on here? What properties of these relays should we be looking at? I already looked at:

consensus weight,
date/time of building these circuits, and
whether these are just a small number of guards being reused over and over; but nothing of these explained the shape of these ECDFs. I'm going to attach the data file that this graph is based on, if others want to take a look.

And would it make sense to try out running an OnionPerf instance on another set of hosts that are geographically close to our current hosts? Maybe it's related to how these hosts are set up, including their network? (Just in case that other set of hosts produces different results, we would still have to investigate how that affects our overall measurements of things like time to first byte or throughput.)

This is relevant to Sponsor 59, because we need to make sure that our current measurements are going to be a solid baseline for future experiments. Classifying as potential defect.