Note that even though the current system is Jenkins, this page mostly documents GitLab CI, as it is the likely long-term replacement.
The GitLab CI quickstart should get you started here. Note that
there are some "shared runners" you can already use, which should
be available to all projects. So your main task here is basically to
write a .gitlab-ci.yml file for your project.
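As a rough illustration (not an officially supported template; the image and job names here are arbitrary assumptions), a minimal .gitlab-ci.yml that runs a test suite on the shared Docker runners could look like this:

```
# minimal pipeline: one stage, one job, run on the shared Docker runners
stages:
  - test

test:
  stage: test
  image: debian:bullseye        # any Docker image available to the runner
  script:
    - apt-get update && apt-get install -y build-essential
    - make check                # replace with your project's actual test command
```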
Why is my CI job not running?
There might be too many jobs in the queue. You can monitor the queue in our Grafana dashboard.
If a runner is misbehaving, it might be worth "pausing" it while we investigate, so that jobs don't all fail on that runner. For this, head for the runner admin interface and hit the "pause" button on the runner.
Registering more runners
Anyone can run their own personal runner in their own infrastructure and register it inside a project on our GitLab instance. For this you need to first install a runner and register it in GitLab. But we already have shared runners; if they are not sufficient, it might be best to request a new one from TPA.
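As a sketch, registering a project-specific Docker runner non-interactively looks roughly like this (the registration token comes from the project's Settings > CI/CD > Runners page; the description and default image are placeholders):

```
gitlab-runner register \
  --non-interactive \
  --url https://gitlab.torproject.org/ \
  --registration-token TOKEN_FROM_PROJECT_SETTINGS \
  --executor docker \
  --docker-image debian:bullseye \
  --description "my-project-runner"
```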
Converting a Jenkins job
Upstream has generic documentation on how to migrate from Jenkins which could be useful for us. We have yet to write a more complete guide on how to migrate jobs to GitLab CI.
- do runners have network access? yes, but that might eventually change
- how do I build from multiple git repositories? install git and clone the extra repositories; using git submodules might work around eventual network access restrictions (see the sketch after this list)
- how do I trust runners? you can set up your own runner for your own project in the GitLab app, but in any case you need to trust the GitLab app. we are considering options for this, see security
- how do I control the image used by the runners? the Docker image is specified in the .gitlab-ci.yml file, but through Docker image policies it might be possible to restrict specific runners to specific, controlled Docker images (see the sketch after this list)
- do we provide, build, or host our own Docker images? not yet. ideally, we would never use images straight from hub.docker.com and would build our own ecosystem of images, built FROM scratch or from a trusted base image
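To illustrate the image-selection and multiple-repositories points above, here is a rough .gitlab-ci.yml sketch; the image name, repository URL and submodule setting are illustrative assumptions, not a prescribed setup:

```
# pick the Docker image the job runs in, and pull submodules automatically
variables:
  GIT_SUBMODULE_STRATEGY: recursive   # fetch submodules without cloning by hand

build:
  image: debian:bullseye              # image used by this job's container
  script:
    # alternatively, clone an extra repository explicitly:
    - apt-get update && apt-get install -y git
    - git clone https://gitlab.torproject.org/tpo/some/other-project.git
    - make
```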
A runner fails all jobs
Jobs pile up
If too many jobs pile up in the queue, consider inspecting which jobs those are in the job admin interface. Jobs can be canceled there by GitLab admins. For really long jobs, consider talking with the project maintainers and see how those jobs can be optimized.
Runner disk fills up
If you see a warning like:
DISK WARNING - free space: /srv 6483 MB (11% inode=82%):
It's because the runner is taking up all the disk space. This is usually containers, images, or caches from the runner. Those are normally purged regularly but some extra load on the CI system might use up too much space all of a sudden.
To diagnose this issue better, you can see the running containers with:

docker ps
... and include stopped or dead containers with:
docker ps -a
Images are visible with:

docker image ls
And volumes with:
docker volume ls
... although that output is often not very informative because GitLab runner uses volumes to cache data and uses opaque volume names.
If there are any obvious offenders, they can be removed with
docker rm (for containers),
docker image rm (for images) and
docker volume rm (for volumes). But usually, you should probably just run
the cleanup jobs by hand, in order:
docker system prune --filter until=72h
The time frame can be lowered for a more aggressive cleanup.
Alternatively, this will also clean old containers:
DNS resolution failures
Under certain circumstances (upgrades?) Docker loses DNS resolution (and possibly all of networking?). A symptom is that it simply fails to clone the repository at the start of the job, for example:
fatal: unable to access 'https://gitlab-ci-token:[MASKED]@gitlab.torproject.org/tpo/network-health/sbws.git/': Could not resolve host: gitlab.torproject.org
A workaround is to reboot the runner's virtual machine. It might be that we need to do some more configuration of Docker, see upstream issue 6644, although it's unclear why this problem is happening right now. This still needs to be investigated more fully, see tpo/tpa/gitlab#93.
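If it turns out Docker-level configuration is needed, one possible direction (purely a sketch, not something we have deployed; the DNS server address is a placeholder) would be to pin DNS servers in the Docker daemon configuration:

```
# write /etc/docker/daemon.json with explicit DNS servers, then restart Docker
cat > /etc/docker/daemon.json <<'EOF'
{
  "dns": ["192.0.2.53"]
}
EOF
systemctl restart docker
```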
Runners should be disposable: if a runner is destroyed, at most the jobs it is currently running will be lost. Otherwise artifacts should be present on the GitLab server, so to recover a runner is as "simple" as creating a new one.
Since GitLab CI is basically GitLab with external runners hooked up to it, this section documents how to install and register runners into GitLab.
Docker on Debian
A first runner (ci-runner-01) was set up by Puppet in the gnt-chi Ganeti cluster, using this command:
gnt-instance add \
  -o debootstrap+buster \
  -t drbd --no-wait-for-sync \
  --net 0:ip=pool,network=gnt-chi-01 \
  --no-ip-check \
  --no-name-check \
  --disk 0:size=10G \
  --disk 1:size=2G,name=swap \
  --disk 2:size=60G \
  --backend-parameters memory=64g,vcpus=8 \
  ci-runner-01.torproject.org
The roles::gitlab::runner::docker Puppet class deploys the GitLab
runner code and hooks it into GitLab. It uses the GitLab CI runner
module from Voxpupuli to avoid reinventing the wheel. But before
enabling it on the instance, the following operations need to be
performed:
The shared runner token needs to be set up in Trocla, using:
trocla create profile::gitlab::runner::token plain
NOTE: this was probably already done. If you need a more specific runner (say group- or project-specific), a new role (e.g. roles::gitlab::runner::docker::tpa) could be created and passed a different token (set in Trocla like the above).
TODO: this is one case where the Trocla Hiera support (which we do not currently use), could come in handy. See our Puppet Trocla docs for more details.
Set up the large partition in /srv, and bind-mount it to cover for Docker:

mkfs -t ext4 -j /dev/sdc
echo "/dev/sdc /srv ext4 defaults 1 2" >> /etc/fstab
echo "/srv/docker /var/lib/docker none bind 0 0" >> /etc/fstab
mount /srv
mkdir -p /srv/docker /var/lib/docker
mount /var/lib/docker
disable module loading:
touch /etc/no_modules_disabled
reboot
... otherwise the Docker package will fail to install because it will try to load extra kernel modules.
ONLY THEN should you deploy the roles::gitlab::runner::docker class on the instance.
NOTE: we originally used the Debian packages (docker.io and gitlab-runner) instead of the upstream official packages, because those have a somewhat messed up installer and weird key deployment policies. In other words, we would rather avoid having to trust the upstream packages for runners, even though we use them for the GitLab omnibus install. The Debian packages are both somewhat out of date, and the latter is not available in Debian buster (current stable), so it had to be installed from bullseye.
UPDATE: the above turned out to fail during the bullseye freeze (2021-04-27), as gitlab-runner was removed from bullseye, because of an unpatched security issue. We have switched to the upstream Debian packages, since they are used for GitLab itself anyways, which is unfortunate, but will have to do for now.
We also avoided using the puppetlabs/docker module because we "only" need to setup Docker, and not specifically deal with containers, volumes and so on right now. All that is (currently) handled by GitLab runner.
A special machine (currently
chi-node-13) was built to allow builds
to run on MacOS and Windows virtual machines. The machine was
installed in the Cymru cluster (so following
new-machine-cymru). On top of that procedure, the following extra
steps were taken on the machine:
- a bridge (br0) was set up
- a basic libvirt configuration was built in Puppet, making sure the gitlab-ci-admin role user and group have access to the libvirt daemon
TODO: The remaining procedure still needs to be implemented and documented here, and eventually converted into a Puppet manifest, see issue 40095. @ahf: document how MacOS/Windows images are created and how runners are set up; don't hesitate to create separate headings for Windows vs MacOS and for image creation vs runner setup.
The GitLab CI service is offered on a "best effort" basis and might not be fully available.
The CI service is currently provided by Jenkins, but we are looking at replacing it with GitLab CI in the 2021 roadmap. This section therefore mostly documents how the new GitLab CI service is built. See the Jenkins section below for more information about the old Jenkins service.
GitLab CI architecture
GitLab CI sits somewhat outside of the main GitLab architecture, in that it is not featured prominently in the GitLab architecture documentation. In practice, it is a core component of GitLab in that the continuous integration and deployment features of GitLab have become a key feature and selling point for the project.
GitLab CI works by scheduling "pipelines" which are made of one or
many "jobs", defined in a project's git repository (the
.gitlab-ci.yml file). Those jobs then get picked up by one of
many "runners". Those runners are separate processes, usually running
on a different host than the main GitLab server.
GitLab runner is a program written in Go which clocks in at about 800,000 SLOC including vendored dependencies, 80,000 SLOC without.
Runners regularly poll the central GitLab for jobs and execute those inside an "executor". We currently support only "Docker" as an executor but are working on different ones, like a custom "podman" (for more trusted runners, see below) or KVM executor (for foreign platforms like MacOS or Windows).
What the runner effectively does is basically this:
- it fetches the git repository of the project
- it runs a sequence of shell commands on the project inside the executor (e.g. inside a Docker container) with specific environment variables populated from the project's settings
- it collects artifacts and logs and uploads those back to the main GitLab server
The jobs are therefore affected by the
.gitlab-ci.yml file but also
the configuration of each project. It's a simple yet powerful design.
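To make the above concrete, here is a rough example (the job and path names are arbitrary) of how a job defines the commands to run inside the executor and the artifacts to upload back to the GitLab server:

```
# one job: run commands inside the executor, then upload artifacts
build:
  image: debian:bullseye
  script:
    - ./configure && make           # runs inside the Docker container
  artifacts:
    paths:
      - build/                      # collected and uploaded to the GitLab server
    expire_in: 1 week
```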
Types of runners
There are three types of runners:
- shared: "shared" across all projects, they will pick up any job from any project
- group: those are restricted to run jobs only within a specific group
- project: those will only run jobs within a specific project
In addition, jobs can be targeted at specific runners by assigning them a "tag".
Whether a runner will pick a job depends on a few things:
- if it is a "shared", "project" or "group"-specific runner (above)
- if it has a tag matching the tags field in the job's configuration
We currently use the following tags:
- amd64: runs on the normal 64-bit Intel/AMD architecture; new tags like this may be introduced when other architectures are supported
- linux: usually implicit, but other tags might eventually be added for other operating systems
- docker: the typical runners
- kvm: those runners are possibly more powerful and can, for example, run Docker-inside-Docker (DinD)
- privileged: those containers have actual root access and should only be used by jobs that explicitly need it
- interactive web terminal: supports interactively debugging jobs
- fdroid: provided as a courtesy by the F-Droid project
Use tags in your configuration only if your job can be fulfilled by only some of those runners. For example, only specify a memory tag if your job requires a lot of memory.
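For example (a sketch; whether a given tag is actually available depends on the runners currently registered), a job that needs Docker-in-Docker could ask for a privileged KVM runner like this:

```
# ask for a runner carrying both the "kvm" and "privileged" tags
integration-test:
  tags:
    - kvm
    - privileged
  script:
    - ./run-integration-tests.sh
```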
Upstream release schedules
GitLab CI is an integral part of GitLab itself and gets released along with the core releases. GitLab runner is a separate software project but usually gets released alongside GitLab.
We do not currently trust GitLab runners for security purposes: at most we trust them to correctly report errors in test suites, but we do not trust them with compiling and publishing artifacts, so they have a low value in our trust chain.
This might change eventually: we may want to build artifacts (e.g. tarballs, binaries, Docker images!) through GitLab CI and even deploy code, at which point GitLab runners could actually become important "trust anchors" with a smaller attack surface than the entire GitLab infrastructure.
The tag-, group-, and project- based allocation of runners is based on a secret token handled on the GitLab server. It is technically possible for an attacker to compromise the GitLab server and access a runner, which makes those restrictions depend on the security of the GitLab server as a whole. Thankfully, the permission model of runners now actually reflects the permissions in GitLab itself, so there are some constraints in place.
Inversely, if a runner's token is leaked, it could be used to impersonate the runner and "steal" jobs from projects. Normally, runners do not leak their own token, but this could happen through, for example, a virtualization or container escape.
Runners currently have full network access: this could be abused by a hostile contributor to use the runner as a starting point for scanning or attacking other entities on the network, even outside our network. We might eventually want to firewall runners to prevent them from accessing certain network resources, but that is not currently implemented.
Image, volume and container storage and caching
GitLab runner creates quite a few containers, volumes and images in the course of its regular work. Those tend to pile up, unless they get cleaned. Upstream suggests a fairly naive shell script to do this cleanup, but it has a number of issues:
- it is noisy (patched locally with this MR)
- it might be too aggressive
So we only run it weekly, and instead run a more "gentle"
docker system prune command to clean up orphaned stuff after 3 days.
We are considering podman for running containers more securely: because it can run containers "rootless" (without running as root on the host), it is generally thought to offer better protection against container escapes. See those instructions. Do note that custom executors have limitations that the default Docker executor does not, see for example the lack of ENTRYPOINT support.
This could also possibly make it easier to build containers inside GitLab CI, which would otherwise require docker-in-docker (DinD), unsupported by upstream. This can be done with buildah using, for example, those instructions.
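As a very rough sketch of what a buildah-based image build job could look like (not something we have deployed; the base image, registry variables and storage driver are assumptions, and rootless builds may need extra runner configuration):

```
# build and push a container image without Docker-in-Docker, using buildah
build-image:
  image: quay.io/buildah/stable      # assumption: any image shipping buildah works
  variables:
    STORAGE_DRIVER: vfs              # avoids needing overlayfs inside the container
  script:
    - buildah bud -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - buildah login -u gitlab-ci-token -p "$CI_JOB_TOKEN" "$CI_REGISTRY"
    - buildah push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```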
GitLab CI, at TPO, currently runs the following services:
- continuous integration: mostly testing after commit
This is currently used by many teams and is quickly becoming a critical service.
It could eventually also run those services:
- web page hosting through GitLab pages or the existing static site system. this is a requirement to replace Jenkins
- continuous deployment: applications and services could be deployed directly from GitLab CI/CD, for example through a Kubernetes cluster or just with plain Docker
- artifact publication: tarballs, binaries and Docker images could be built by GitLab runners and published on the GitLab server (or elsewhere). this is a requirement to replace Jenkins
Monitoring and testing
To test a runner, it can be registered with only a single project, to run non-critical jobs against it. See the installation section for details on the setup.
Monitoring is otherwise done through Prometheus, on an as-needed basis, see the logs and metrics section below.
Logs and metrics
GitLab runners send logs to systemd. They contain minimal
private information: the most I could find were Git repository and
Docker image URLs, which do contain usernames. Those end up in
/var/log/daemon.log, which gets rotated daily, with a one-week
retention period.
The GitLab instance exports a set of metrics to monitor CI. For
example, ci_pending_builds shows the size of the queue,
ci_running_builds shows the number of currently running builds,
etc. Those are visible in the GitLab Grafana dashboard,
particularly in this view.
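As an illustration only (the threshold and duration are arbitrary assumptions, not an alert we have deployed), a Prometheus alerting rule on the queue size could look like this:

```
# alert when the CI job queue stays large for a while
groups:
  - name: gitlab-ci
    rules:
      - alert: GitLabCIPendingBuildsHigh
        expr: ci_pending_builds > 50
        for: 30m
        annotations:
          summary: "GitLab CI queue has been backed up for 30 minutes"
```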
Other metrics might become available in the future: for example,
runners can export their own Prometheus metrics, but currently do
not. They are, naturally, monitored like all other TPO servers,
however.
We may eventually monitor GitLab runners directly; they can be
configured to expose metrics through a Prometheus exporter. The Puppet
module supports this through a configuration parameter, but we would
need to hook it into our Prometheus server as well. See
also the upstream documentation. Right now, it feels like the existing
"node"-level and GitLab-level monitoring in Prometheus is sufficient.
This service requires no backups: all configuration should be performed by Puppet and/or documented in this wiki page. A lost runner should be rebuilt from scratch, as per disaster recovery.
Tor currently uses Jenkins to run tests, builds and various automated jobs. This discussion is about if and how to replace this with GitLab CI.
Ever since the GitLab migration, we have discussed the possibility of replacing Jenkins with GitLab CI, or at least using GitLab CI in some way.
Tor currently utilizes a mixture of different CI systems to ensure some form of quality assurance as part of the software development process:
- Jenkins (provided by TPA)
- GitLab CI (currently Docker builders kindly provided by the F-Droid project via Hans from The Guardian Project)
- Travis CI (used by some of our projects such as tpo/core/tor.git for Linux and MacOS builds)
- Appveyor (used by tpo/core/tor.git for Windows builds)
By the end of 2020 however, pricing changes at Travis
CI made it difficult for the network team to continue running the
Mac OS builds there. Furthermore, it was felt that Appveyor was too
slow to be useful for builds, so it was proposed (issue 40095) to
create a pair of bare metal machines to run those builds, through a
libvirt architecture. This is an exception to TPA-RFC-7: tools;
the exception was formally proposed in TPA-RFC-8.
In general, the idea here is to evaluate GitLab CI as a unified platform to replace Travis and Appveyor in the short term, but also, in the longer term, Jenkins itself.
- automated configuration: setting up new builders should be done through Puppet
- the above requires excellent documentation of the setup procedure in the development stages, so that TPA can transform that into a working Puppet manifest
- Linux, Windows, Mac OS support
- x86-64 architecture ("64-bit version of the x86 instruction set", AKA x64, AMD64, Intel 64, what most people use on their computers)
- Travis replacement
- autonomy: users should be able to set up new builds without intervention from the service (or system!) administrators
- clean environments: each build should run in a clean VM
Nice to have
- fast: the runners should be fast (as in: powerful CPUs, good disks, lots of RAM to cache filesystems, CoW disks) and impose little overhead above running the code natively (as in: no emulation)
- ARM64 architecture
- Apple M-1 support
- Jenkins replacement
- Appveyor replacement
- BSD support (FreeBSD, OpenBSD, and NetBSD in that order)
- in the short term, we don't aim at doing "Continuous Deployment". this is one of the possible goals of the GitLab CI deployment, but it is considered out of scope for now. see also the LDAP proposed solutions section
TPA's approbation required for the libvirt exception, see TPA-RFC-8.
[...] Reserve two (ideally) "fast" Debian-based machines on TPO infrastructure to build the following:
- Run Gitlab CI runners via KVM (initially with focus on Windows x86-64 and macOS x86-64). This will replace the need for Travis CI and Appveyor. This should allow both the network team, application team, and anti-censorship team to test software on these platforms (either by building in the VMs or by fetching cross-compiled binaries on the hosts via the Gitlab CI pipeline feature). Since none(?) of our engineering staff are working full-time on MacOS and Windows, we rely quite a bit on this for QA.
- Run Gitlab CI runners via KVM for the BSD's. Same argument as above, but is much less urgent.
- Spare capacity (once we have measured it) can be used as a generic GitLab CI Docker runner in addition to the F-Droid builders.
- The faster the CPU the faster the builds.
- Lots of RAM allows us to do things such as having CoW filesystems in memory for the ephemeral builders and should speed up builds due to faster I/O.
This is an excerpt from the proposal sent to TPA:
[TPA would] build two (bare metal) machines (in the Cymru cluster) to manage those runners. The machines would grant the GitLab runner (and also @ahf) access to the libvirt environment (through a role user).
ahf would be responsible for creating the base image and deploying the first machine, documenting every step of the way in the TPA wiki. The second machine would be built with Puppet, using those instructions, so that the first machine can be rebuilt or replaced. Once the second machine is built, the first machine should be destroyed and rebuilt, unless we are absolutely confident the machines are identical.
The machines used were donated, but that is still a "hardware opportunity cost" that is currently undefined.
Staff costs, naturally, should be counted. It is estimated the initial runner setup should take less than two weeks.
Ganeti has been considered as an orchestration/deployment platform for the runners, but there is no known integration between GitLab CI runners and Ganeti.
If we find the time or an existing implementation, this would still be a nice improvement.
This works by using an existing machine as a place to run the jobs. The problem is that it doesn't run in a clean environment, so it's not a good fit.
Note: couldn't figure out what the difference is between Parallels and VirtualBox, nor if it matters.
Obviously, VirtualBox could be used to run Windows (and possibly MacOS?) images (and maybe BSDs?) but unfortunately, Oracle has made a mess of VirtualBox which keeps it out of Debian, so this could be a problematic deployment as well.
Support in Debian has improved, but is still hit-and-miss. No support for Windows or MacOS, as far as I know, so not a complete solution, but it could be used for Linux runners.
This was abandoned upstream and is considered irrelevant.
@anarcat has been thinking about setting up a Kubernetes cluster for GitLab. There are high hopes that it will help us not only with GitLab CI, but also the "CD" (Continuous Deployment) side of things. This approach was briefly discussed in the LDAP audit, but basically the idea would be to replace the "SSH + role user" approach we currently use for services with GitLab CI.
As explained in the goals section above, this is currently out of scope, but could be considered instead of Docker for runners.
See the Jenkins replacement discussion for more details about that alternative.