finish documenting CI, for now (authored by anarcat)
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
TODO: @ahf what happens if there's trouble with the F-Droid runners? who to
ping? anything we can do to diagnose the problem? what kind of
information to send them?
### A runner fails all jobs
[Pause the runner](#enabling/disabling-runners).
### Jobs pile up
If too many jobs pile up in the queue, consider inspecting which jobs
those are in the [job admin interface](https://gitlab.torproject.org/admin/jobs). Jobs can be canceled there
by GitLab admins. For really long jobs, consider talking with the
project maintainers to see how those jobs can be optimized.
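Jobs can also be canceled through the [jobs API](https://docs.gitlab.com/ee/api/jobs.html); a minimal
sketch, where the project ID, job ID and token are placeholders:

    # cancel a single job; the token needs admin (or at least
    # project maintainer) rights on the project
    curl --request POST --header "PRIVATE-TOKEN: <token>" \
        "https://gitlab.torproject.org/api/v4/projects/<project-id>/jobs/<job-id>/cancel"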
### Runner disk fills up
If you see a warning like:
DISK WARNING - free space: /srv 6483 MB (11% inode=82%):
This is usually because the runner is taking up all the disk space
with containers, images, or caches. Those are normally
[purged regularly](#image-volume-and-container-storage-and-caching), but some extra load on the CI system
might use up too much space all of a sudden.
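Before digging into individual containers, a quick overview can
confirm it is actually Docker filling the disk; a sketch, assuming the
warning is about the `/srv` partition:

    # disk usage on the affected partition
    df -h /srv

    # summary of space used by images, containers, local volumes and build cache
    docker system df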
To diagnose this issue better, you can see the running containers
with:
docker ps
... and include stopped or dead containers with:
docker ps -a
Images are visible with:
docker images
And volumes with:
docker volume ls
... although that output is often not very informative because GitLab
runner uses volumes to cache data and uses opaque volume names.
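If a volume looks suspicious, it can still be inspected, and unused
("dangling") volumes can be listed; the volume name below is a
placeholder:

    # show metadata (mountpoint, labels, creation time) for a single volume
    docker volume inspect <volume-name>

    # list volumes not referenced by any container
    docker volume ls --filter dangling=true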
If there are any obvious offenders, they can be removed with `docker
rm` (for containers), `docker image rm` (for images) and `docker
volume rm` (for volumes). But usually, you should probably just run
the cleanup jobs by hand, in order:
docker system prune --filter until=72h
The timeframe can be lowered for a more aggressive cleanup.
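For example, this (hypothetical) more aggressive variant would remove
anything unused that is more than a day old:

    # remove stopped containers, unused networks and dangling images
    # older than 24 hours
    docker system prune --filter until=24h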
Alternatively, this will also clean old containers:
/usr/local/sbin/tpo-docker-clean-cache
## Disaster recovery
We also avoided using the [puppetlabs/docker](https://forge.puppet.com/modules/puppetlabs/docker) module to manage
containers, volumes and so on right now. All that is (currently)
handled by GitLab runner.
### F-Droid runners
TODO: @ahf document how the F-Droid runners were hooked up to GitLab
CI. Anything special on top of [the official docs](https://docs.gitlab.com/runner/register/)?
### MacOS/Windows
A special machine (currently `chi-node-13`) was built to allow MacOS
and Windows builds. The steps taken on the machine are only partly
captured in Puppet (see the `roles::gitlab::ci::foreign` role).
The `gitlab-ci-admin` role user and group have access to the
machine.

TODO: The remaining procedure still needs to be implemented and
documented here, and eventually converted into a Puppet manifest, see
[issue 40095](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40095). @ahf: document how MacOS/Windows images are created
and runners are set up. Don't hesitate to create separate headings for
Windows vs MacOS and for image creation vs runner setup.
## SLA
### Security
We do not currently trust GitLab runners for security purposes: at
most we trust them to correctly report errors in test suites, but we
do not trust them with compiling and publishing artifacts, so they
have a low value in our trust chain.

This might eventually change: we may eventually want to build
artifacts (e.g. tarballs, binaries, Docker images!) through GitLab CI
and even deploy code, at which point GitLab runners could actually
become important "trust anchors" with a smaller attack surface than
the entire GitLab infrastructure.
The tag-, group-, and project-based allocation of runners relies on
a secret token handled on the GitLab server. It is technically
possible for an attacker to compromise the GitLab server and access a
runner, which makes those restrictions depend on the security of the
GitLab server as a whole. Thankfully, the [permission model](https://docs.gitlab.com/ee/user/project/new_ci_build_permissions_model.html) of
runners now actually reflects the permissions in GitLab itself, so
there are some constraints in place.
Conversely, if a runner's token is leaked, it could be used to
impersonate the runner and "steal" jobs from projects. Normally,
runners do not leak their own token, but this could happen through,
for example, a virtualization or container escape.

Runners currently have full network access: this could be abused by a
hostile contributor to use the runner as a starting point for scanning
or attacking other entities on our network, or even outside of it. We
might eventually want to firewall runners to prevent them from
accessing certain network resources, but that is currently not
implemented.
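If we ever do, a rough sketch with nftables (entirely hypothetical,
not deployed anywhere, and assuming the default `docker0` bridge)
could look like:

    # hypothetical example: block forwarded traffic from the Docker
    # bridge towards an internal network range
    nft add table inet ci-filter
    nft add chain inet ci-filter forward '{ type filter hook forward priority 0 ; }'
    nft add rule inet ci-filter forward iifname "docker0" ip daddr 10.0.0.0/8 drop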
The [runner documentation](https://docs.gitlab.com/runner/) has a [section on security](https://docs.gitlab.com/runner/security/) which
this section is based on.
### Image, volume and container storage and caching
### Rootless containers
We are considering [podman](https://podman.io/) for running containers more securely:
because it can run containers "rootless" (without running as root on
the host), it is generally thought to be better protected against
container escapes. See [those instructions](https://github.com/jonasbb/podman-gitlab-runner).

This could also possibly make it easier to build containers inside
GitLab CI, which would otherwise require docker-in-docker (DinD),
unsupported by upstream. This can be done with [buildah](https://buildah.io/) using, for
example, [those instructions](https://medium.com/prgcont/using-buildah-in-gitlab-ci-9b529af19e42); see also [this podman issue](https://github.com/containers/podman/issues/7982).
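As a rough sketch, such a job script could boil down to something
like this (the registry and image names are placeholders, and
credentials would come from CI variables):

    # build an image from the Dockerfile/Containerfile in the repository
    buildah bud --tag registry.example.org/group/project:latest .

    # push the resulting image to the registry
    buildah push registry.example.org/group/project:latest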
### Current services
GitLab runners can export their own Prometheus metrics, but currently
do not. They are, naturally, monitored through the `node-exporter` like
all other TPO servers, however.
We may eventually monitor GitLab runners directly; they can be
configured to expose metrics through a Prometheus exporter. The Puppet
module supports this through the `gitlab_ci_runner::metrics_server`
variable, but we would need to hook it into our Prometheus server as
well. See also [the upstream documentation](https://docs.gitlab.com/runner/monitoring/README.html). Right now, it feels like the
existing "node"-level and GitLab-level monitoring in Prometheus is
sufficient.
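If the exporter does get enabled some day, a quick check that it
answers could look like this (9252 is the port conventionally used by
gitlab-runner for metrics, not something currently configured on our
side):

    # fetch a few runner-specific metrics from the local metrics endpoint
    curl --silent http://localhost:9252/metrics | grep '^gitlab_runner' | head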
## Backups