#40095 is about creating a hackish setup for Windows/Mac runners. let's create runners for normal linux containers without all that hackery, inside a normal VM inside the ganeti cluster. we do have access to the f-droid runners, but those are a little overwhelmed right now and we have spare cycles, so let's just do this.
the CI setup was completed in Puppet. one big caveat is that the kernel module loading lockdown needs to be turned off before docker can be successfully installed, so i had to touch /etc/no_modules_disabled before doing the build. this should be documented in the wiki, along with the rest of this.
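concretely, the workaround boils down to something like this (a sketch; it assumes the flag file is all that Puppet checks before locking down module loading, and that Puppet drives the docker install itself):

```
# opt this host out of the module loading lockdown, then let puppet
# finish the docker/gitlab-runner setup (sketch, exact steps may differ)
touch /etc/no_modules_disabled
puppet agent --test
```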
also, a bunch of jobs crashed because the runner was misconfigured, e.g.
okay, this is almost done. a little bit of auditing and documentation and i think we can consider this done. naturally, once the queue gets clogged, we can start another instance, but for now it seems to have resolved the bottleneck quite nicely, which was the urgent thing.
regarding the cleanup task, @pollo does this, which just removes everything:
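(for context, and as an illustration only rather than the actual script: a "remove everything" cleanup on a docker executor typically amounts to a full prune, along these lines:)

```
# illustration, not the actual cleanup script: prune stopped containers,
# unused images, networks and volumes left behind by CI jobs
docker system prune --all --force --volumes
```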
i raised the concurrency for the runner to 4, to saturate the CPUs on the host a little more... it was kind of idle, and the queue was piling up again. we seem to be draining it a bit faster now: getting from 25 to 10 jobs took 1h instead of 3h, and that's before we raised concurrency.
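for reference, that's the global `concurrent` setting in the runner config (sketch; the value is just whatever keeps the host busy without overcommitting it):

```
# /etc/gitlab-runner/config.toml, global section
concurrent = 4
```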
and now of course, people are happily throwing a lot of jobs in the pipelines so it just saturates more! :)
still, we should be able to drain that pipeline faster now, and that's only with one runner: we can make more.
in any case, our hardware capacity is probably nowhere near where we want to be when we actually have everything on gitlab. we'll certainly need a few more runners to keep up with surges like this.
then, on the other hand, this is a tpo/core.git release, so maybe it's a rather special moment?
after the concurrency change, 15 jobs were processed in... 20 minutes!!! that is pretty awesome, it's a nine-fold improvement over our previous metric (3h), almost an order of magnitude!
parallelism. turns out it works. ;) (oh, and "just throw hardware at the problem", that works too...)
compare the three slopes here:
first annotation on the left, regular single f-droid runner, 3 hours for 15 jobs (25 to 10). then, second annotation, new runner comes in, we go to about 15 jobs / 1h. and finally, the third annotation is where concurrency was changed, and we go down to 20 minutes to process a similar workload.
oh, and i discussed possible improvements to the core.git pipeline: they could use an already existing docker image with deps (e.g. buildpack-deps:buster-scm) to reduce the number of packages to install, and build their own docker image with tor already compiled in, to run the later stages of the pipeline (and therefore reuse the compiled code, instead of rebuilding at every stage).
even without building their own image, they should be reusing their artifacts between stages...
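to make that concrete, here's a rough .gitlab-ci.yml sketch of both ideas (hypothetical job names, package list and paths, not their actual pipeline): the build stage compiles once and hands its output to later stages as artifacts.

```yaml
# hypothetical sketch, not the real core.git pipeline
image: buildpack-deps:buster-scm   # reuse an existing image instead of a bare one

stages:
  - build
  - test

build:
  stage: build
  script:
    # fewer packages to install than on a bare image, but still some (assumed list)
    - apt-get update && apt-get install -y build-essential automake libssl-dev zlib1g-dev libevent-dev
    - ./autogen.sh && ./configure && make -j"$(nproc)"
  artifacts:
    untracked: true      # carry the build outputs over to later stages
    expire_in: 1 hour

test:
  stage: test
  script:
    - make check         # runs against the build stage's artifacts instead of rebuilding from scratch
```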
FYI I've been discussing with OSUOSL about running a pool of x86_64 gitlab runners for F-Droid, Tor Project, Debian, and Guardian Project. They've agreed to the idea, but no timeline yet. They also said they'd provide at least one runner on aarch64.
that would/will be awesome! i'll subscribe to that issue, thanks. :) the arm stuff will be particularly important for us because we don't have any of those runners right now, and i had no clear plan on where to run one (but had ideas, as we do have jenkins builders on arm64 now, which we could recycle...)
audit the helper image stuff, make sure it's somewhat sane in the debian package
it's still unclear. the debian/postinst definitely builds a /var/lib/gitlab-runner/gitlab-runner-helper.tar.xz (with this script from this dockerfile, which is based on a cdebootstrapped system, which is nice) but it's not clear it actually overrides the helper image, because there's no `helper_image` setting in /etc/gitlab-runner/config.toml. this might be something we need to set up when we register the runner?
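if we do need to set it ourselves, upstream's runner config has a `helper_image` option in the docker executor section; presumably something like this (the value here is made up):

```
# /etc/gitlab-runner/config.toml, per-runner docker section (hypothetical value)
[runners.docker]
  helper_image = "registry.example.com/gitlab-runner-helper:latest"
```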