finish documenting CI, for now (authored by anarcat)
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
TODO: @ahf what happens if there's trouble with the F-Droid runners? who to
ping? anything we can do to diagnose the problem? what kind of
information to send them?
### A runner fails all jobs
[Pause the runner](#enabling/disabling-runners).
### Jobs pile up
If too many jobs pile up in the queue, consider inspecting which jobs
those are in the [job admin interface](https://gitlab.torproject.org/admin/jobs). Jobs can be canceled there
by GitLab admins. For really long jobs, consider talking with the
project maintainers to see how those jobs can be optimized.
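Jobs can also be canceled through the [jobs API](https://docs.gitlab.com/ee/api/jobs.html); a minimal
sketch, where the project ID, job ID and token are placeholders:

    # cancel a single job; the token needs admin (or at least
    # project maintainer) rights on the project
    curl --request POST --header "PRIVATE-TOKEN: <token>" \
        "https://gitlab.torproject.org/api/v4/projects/<project-id>/jobs/<job-id>/cancel"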
### Runner disk fills up
If you see a warning like:
DISK WARNING - free space: /srv 6483 MB (11% inode=82%):
This is usually because the runner is taking up all the disk space
with containers, images, or caches. Those are normally
[purged regularly](#image-volume-and-container-storage-and-caching), but some extra load on the CI system
might use up too much space all of a sudden.
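Before digging into individual containers, a quick overview can
confirm it is actually Docker filling the disk; a sketch, assuming the
warning is about the `/srv` partition:

    # disk usage on the affected partition
    df -h /srv

    # summary of space used by images, containers, local volumes and build cache
    docker system df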
To diagnose this issue better, you can see the running containers
with:
docker ps
... and include stopped or dead containers with:
docker ps -a
Images are visible with:
docker images
And volumes with:
docker volume ls
... although that output is often not very informative because GitLab
runner uses volumes to cache data and uses opaque volume names.
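If a volume looks suspicious, it can still be inspected, and unused
("dangling") volumes can be listed; the volume name below is a
placeholder:

    # show metadata (mountpoint, labels, creation time) for a single volume
    docker volume inspect <volume-name>

    # list volumes not referenced by any container
    docker volume ls --filter dangling=true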
If there are any obvious offenders, they can be removed with `docker
rm` (for containers), `docker image rm` (for images) and `docker
volume rm` (for volumes). But usually, you should probably just run
the cleanup jobs by hand, in order:
docker system prune --filter until=72h
The timeframe can be lowered for a more aggressive cleanup.
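For example, this (hypothetical) more aggressive variant would remove
anything unused that is more than a day old:

    # remove stopped containers, unused networks and dangling images
    # older than 24 hours
    docker system prune --filter until=24h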
Alternatively, this will also clean old containers:
/usr/local/sbin/tpo-docker-clean-cache
## Disaster recovery
We also avoided using the [puppetlabs/docker](https://forge.puppet.com/modules/puppetlabs/docker) module to manage
containers, volumes and so on right now. All that is (currently)
handled by GitLab runner.
### F-Droid runners
TODO: @ahf document how the F-Droid runners were hooked up to GitLab
CI. Anything special on top of [the official docs](https://docs.gitlab.com/runner/register/)?
### MacOS/Windows
A special machine (currently `chi-node-13`) was built to allow MacOS
and Windows builds. The steps taken on the machine are only partly
captured in Puppet (see the `roles::gitlab::ci::foreign` role).
The `gitlab-ci-admin` role user and group have access to the
machine.

TODO: The remaining procedure still needs to be implemented and
documented here, and eventually converted into a Puppet manifest, see
[issue 40095](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40095). @ahf: document how MacOS/Windows images are created
and runners are set up. Don't hesitate to create separate headings for
Windows vs MacOS and for image creation vs runner setup.
## SLA
### Security
We do not currently trust GitLab runners for security purposes: at
most we trust them to correctly report errors in test suites, but we
do not trust them with compiling and publishing artifacts, so they
have a low value in our trust chain.

This might eventually change: we may eventually want to build
artifacts (e.g. tarballs, binaries, Docker images!) through GitLab CI
and even deploy code, at which point GitLab runners could actually
become important "trust anchors" with a smaller attack surface than
the entire GitLab infrastructure.
The tag-, group-, and project-based allocation of runners relies on
a secret token handled on the GitLab server. It is technically
possible for an attacker to compromise the GitLab server and access a
runner, which makes those restrictions depend on the security of the
GitLab server as a whole. Thankfully, the [permission model](https://docs.gitlab.com/ee/user/project/new_ci_build_permissions_model.html) of
runners now actually reflects the permissions in GitLab itself, so
there are some constraints in place.
Conversely, if a runner's token is leaked, it could be used to
impersonate the runner and "steal" jobs from projects. Normally,
runners do not leak their own token, but this could happen through,
for example, a virtualization or container escape.

Runners currently have full network access: this could be abused by a
hostile contributor to use the runner as a starting point for scanning
or attacking other entities on our network, or even outside of it. We
might eventually want to firewall runners to prevent them from
accessing certain network resources, but that is currently not
implemented.
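If we ever do, a rough sketch with nftables (entirely hypothetical,
not deployed anywhere, and assuming the default `docker0` bridge)
could look like:

    # hypothetical example: block forwarded traffic from the Docker
    # bridge towards an internal network range
    nft add table inet ci-filter
    nft add chain inet ci-filter forward '{ type filter hook forward priority 0 ; }'
    nft add rule inet ci-filter forward iifname "docker0" ip daddr 10.0.0.0/8 drop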
The [runner documentation](https://docs.gitlab.com/runner/) has a [section on security](https://docs.gitlab.com/runner/security/) which
this section is based on.
### Image, volume and container storage and caching
### Rootless containers
We are considering [podman](https://podman.io/) for running containers more securely:
because it can run containers "rootless" (without running as root on
the host), it is generally thought to be better protected against
container escapes. See [those instructions](https://github.com/jonasbb/podman-gitlab-runner).

This could also possibly make it easier to build containers inside
GitLab CI, which would otherwise require docker-in-docker (DinD),
unsupported by upstream. This can be done with [buildah](https://buildah.io/) using, for
example, [those instructions](https://medium.com/prgcont/using-buildah-in-gitlab-ci-9b529af19e42); see also [this podman issue](https://github.com/containers/podman/issues/7982).
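As a rough sketch, such a job script could boil down to something
like this (the registry and image names are placeholders, and
credentials would come from CI variables):

    # build an image from the Dockerfile/Containerfile in the repository
    buildah bud --tag registry.example.org/group/project:latest .

    # push the resulting image to the registry
    buildah push registry.example.org/group/project:latest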
### Current services
GitLab runners can export their own Prometheus metrics, but currently
do not. They are, naturally, monitored through the `node-exporter` like
all other TPO servers, however.
We may eventually monitor GitLab runners directly; they can be
configured to expose metrics through a Prometheus exporter. The Puppet
module supports this through the `gitlab_ci_runner::metrics_server`
variable, but we would need to hook it into our Prometheus server as
well. See also [the upstream documentation](https://docs.gitlab.com/runner/monitoring/README.html). Right now, it feels like the
existing "node"-level and GitLab-level monitoring in Prometheus is
sufficient.
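If the exporter does get enabled some day, a quick check that it
answers could look like this (9252 is the port conventionally used by
gitlab-runner for metrics, not something currently configured on our
side):

    # fetch a few runner-specific metrics from the local metrics endpoint
    curl --silent http://localhost:9252/metrics | grep '^gitlab_runner' | head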
## Backups