#40095 is about creating a hackish setup for Windows/Mac runners. let's create runners for normal linux containers without all that hackery, inside a normal VM inside the ganeti cluster. we do have access to the f-droid runners, but those are a little overwhelmed right now and we have spare cycles, so let's just do this.
the CI setup was completed in Puppet. one big caveat is that the kernel module loading lockdown needs to be turned off before docker can be successfully installed, so i had to touch /etc/no_modules_disabled before doing the build. this should be documented in the wiki, along with the rest of this.
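concretely, the workaround boils down to something like this (a sketch; it assumes the flag file is all that Puppet checks before locking down module loading, and that Puppet drives the docker install itself):

```
# opt this host out of the module loading lockdown, then let puppet
# finish the docker/gitlab-runner setup (sketch, exact steps may differ)
touch /etc/no_modules_disabled
puppet agent --test
```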
also, a bunch of jobs crashed because the runner was misconfigured, e.g.
okay, this is almost done. a little bit of auditing and documentation and i think we can consider this done. naturally, once the queue gets clogged, we can start another instance, but for now it seems to have resolved the bottleneck quite nicely, which was the urgent thing.
regarding the cleanup task, @pollo does this, which just removes everything:
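(for context, and as an illustration only rather than the actual script: a "remove everything" cleanup on a docker executor typically amounts to a full prune, along these lines:)

```
# illustration, not the actual cleanup script: prune stopped containers,
# unused images, networks and volumes left behind by CI jobs
docker system prune --all --force --volumes
```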
i raised the concurrency for the runner to 4, to saturate the CPUs on the host a little more... it was kind of idle, and the queue was piling up again. we seem to be draining it a bit faster now: getting from 25 to 10 jobs took 1h instead of 3h, and that's before we raised concurrency.
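for reference, that's the global `concurrent` setting in the runner config (sketch; the value is just whatever keeps the host busy without overcommitting it):

```
# /etc/gitlab-runner/config.toml, global section
concurrent = 4
```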
and now of course, people are happily throwing a lot of jobs in the pipelines so it just saturates more! :)
still, we should be able to drain that pipeline faster now, and that's only with one runner: we can make more.
in any case, our hardware capacity is probably nowhere near where we want to be when we actually have everything on gitlab. we'll certainly need a few more runners to keep up with surges like this.
then, on the other hand, this is a tpo/core.git release, so maybe it's a rather special moment?
after the concurrency change, 15 jobs were processed in... 20 minutes!!! that is pretty awesome, it's a nine-fold improvement over our previous metric (3h), almost an order of magnitude!
parallelism. turns out it works. ;) (oh, and "just throw hardware at the problem", that works too...)
compare the three slopes here:
first annotation on the left, regular single f-droid runner, 3 hours for 15 jobs (25 to 10). then, second annotation, new runner comes in, we go to about 15 jobs / 1h. and finally, the third annotation is where concurrency was changed, and we go down to 20 minutes to process a similar workload.
oh, and i discussed possible improvements to the core.git pipeline: they could use an already existing docker image with deps (e.g. buildpack-deps:buster-scm) to reduce the number of packages to install, and build their own docker image with tor already compiled in, to run the later stages of the pipeline (and therefore reuse the compiled code, instead of rebuilding at every stage).
even without building their own image, they should be reusing their artifacts between stages...
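to make that concrete, here's a rough .gitlab-ci.yml sketch of both ideas (hypothetical job names, package list and paths, not their actual pipeline): the build stage compiles once and hands its output to later stages as artifacts.

```yaml
# hypothetical sketch, not the real core.git pipeline
image: buildpack-deps:buster-scm   # reuse an existing image instead of a bare one

stages:
  - build
  - test

build:
  stage: build
  script:
    # fewer packages to install than on a bare image, but still some (assumed list)
    - apt-get update && apt-get install -y build-essential automake libssl-dev zlib1g-dev libevent-dev
    - ./autogen.sh && ./configure && make -j"$(nproc)"
  artifacts:
    untracked: true      # carry the build outputs over to later stages
    expire_in: 1 hour

test:
  stage: test
  script:
    - make check         # runs against the build stage's artifacts instead of rebuilding from scratch
```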
FYI I've been discussing with OSUOSL about running a pool of x86_64 gitlab runners for F-Droid, Tor Project, Debian, and Guardian Project. They've agreed to the idea, but no timeline yet. They also said they'd provide at least one runner on aarch64.
that would/will be awesome! i'll subscribe to that issue, thanks. :) the arm stuff will be particularly important for us because we don't have any of those runners right now, and i had no clear plan on where to run one (but had ideas, as we do have jenkins builders on arm64 now, which we could recycle...)
audit the helper image stuff, make sure it's somewhat sane in the debian package
it's still unclear. the debian/postinst definitely builds a /var/lib/gitlab-runner/gitlab-runner-helper.tar.xz (with this script from this dockerfile, which is based on a cdebootstrapped system, which is nice) but it's not clear it actually overrides the helper image, because there's no `helper_image` setting in /etc/gitlab-runner/config.toml. this might be something we need to set up when we register the runner?
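if we do need to set it ourselves, upstream's runner config has a `helper_image` option in the docker executor section; presumably something like this (the value here is made up):

```
# /etc/gitlab-runner/config.toml, per-runner docker section (hypothetical value)
[runners.docker]
  helper_image = "registry.example.com/gitlab-runner-helper:latest"
```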