anarcat · e17121b1
--- a/service/ci.md
+++ b/service/ci.md
+[Continuous Integration](https://en.wikipedia.org/wiki/Continuous_integration) is the system that allows tests to be run
+and packages to be built, automatically, when new code is pushed to
+the version control system (currently [git](howto/git)).
+Note that even though the current system is [Jenkins][], this page mostly documents GitLab
+CI, as that is the likely long-term replacement.
+[Jenkins]: https://jenkins.torproject.org
+[[_TOC_]]
+# Tutorial
+[GitLab CI][GitLab CI splash] has [good documentation upstream][GitLab CI upstream]. This section
+documents frequently asked questions about the service.
+[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
+[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
+[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
+<!-- simple, brainless step-by-step instructions requiring little or -->
+<!-- no technical background -->
+## Getting started
+The [GitLab CI quickstart][] should get you started here. Note that
+there are some "shared runners" you can already use, and which should
+be available to all projects.
+TODO: time limits? should we say how to enable the shared runners?
+# How-to
+<!-- more in-depth procedure that may require interpretation -->
+## Pager playbook
+<!-- information about common errors from the monitoring system and -->
+<!-- how to deal with them. this should be easy to follow: think of -->
+<!-- your future self, in a stressful situation, tired and hungry. -->
+TODO: what happens if there's trouble with the f-droid runners? who to
+ping? anything we can do to diagnose the problem? what kind of
+information to send them?
+## Disaster recovery
+Runners should be disposable: if a runner is destroyed, at most the
+jobs it is currently running will be lost. Otherwise artifacts should
+be present on the GitLab server, so recovering a runner is as "simple"
+as creating a new one.
+# Reference
+## Installation
+Since GitLab CI is basically GitLab with external runners hooked up to
+it, this section documents how to install and register runners into
+GitLab.
+### Linux
+TODO: document how the F-Droid runners were hooked up to GitLab
+CI. Anything special on top of [the official docs](https://docs.gitlab.com/runner/register/)?
+### MacOS/Windows
+TODO: @ahf document how MacOS/Windows images are created and runners
+are set up. Don't hesitate to create separate headings for Windows vs
+MacOS and for image creation vs runner setup.
+## SLA
+The GitLab CI service is offered on a "best effort" basis and might
+not be fully available.
+## Design
+<!-- how this is built -->
+<!-- should reuse and expand on the "proposed solution", it's a -->
+<!-- "as-built" document, whereas the "Proposed solution" is an -->
+<!-- "architectural" document, which the final result might differ -->
+<!-- from, sometimes significantly -->
+<!-- a good guide to "audit" an existing project's design: -->
+<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+## Issues
+[File][] or [search][] for issues in the [GitLab issue tracker][search].
+ [File]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/new
+ [search]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues
+## Monitoring and testing
+TODO: @ahf how do we monitor the runners? maybe the prometheus
+exporter has something? should we hook it inside nagios to get alerts
+when runners get overwhelmed? 
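As a possible starting point, the GitLab REST API can list runners along with their reported status. The sketch below is a minimal, unofficial example: it assumes an access token allowed to list runners, reads only the first page of results, and treats any status other than `online` as worth a look. The function names and the choice of fields are illustrative assumptions, not an existing TPA tool.

```python
import json
import urllib.request


def fetch_runners(base_url, token):
    """Return the runners visible to `token` (first page only, no pagination)."""
    request = urllib.request.Request(
        f"{base_url}/api/v4/runners",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def runners_needing_attention(runners):
    """Return (id, description, status) for every runner not reported online."""
    return [
        (r.get("id"), r.get("description", ""), r.get("status", "unknown"))
        for r in runners
        if r.get("status") != "online"
    ]
```

For example, `runners_needing_attention(fetch_runners("https://gitlab.torproject.org", token))` would return the runners to investigate; that list could then feed whichever alerting system ends up being chosen.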
+## Logs and metrics
+TODO: do runners keep logs? where? does it matter? any PII?
+TODO: how about performance metrics? how do we know when we'll run out
+of capacity in the runner network since we don't host the f-droid
+stuff?
+## Backups
+This service requires no backups: all configuration should be
+performed by Puppet and/or documented in this wiki page. A lost runner
+should be rebuilt from scratch, as per [disaster recovery](#disaster-recovery).
+## Other documentation
+ * [GitLab CI promotional page][GitLab CI splash]
+ * [GitLab CI upstream documentation portal][GitLab CI upstream]
+   * [GitLab CI quickstart][]
+[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
+[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
+[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
+# Discussion
+Tor currently uses [Jenkins][] to run tests, builds and various
+automated jobs. This discussion is about if and how to replace this
+with GitLab CI.
+## Overview
+<!-- describe the overall project. should include a link to a ticket -->
+<!-- that has a launch checklist -->
+Ever since the [GitLab migration](howto/gitlab), we have discussed the
+possibility of replacing Jenkins with GitLab CI, or at least using
+GitLab CI in some way. 
+Tor currently utilizes a mixture of different CI systems to ensure
+some form of quality assurance as part of the software development
+process:
+- Jenkins (provided by TPA)
+- GitLab CI (currently Docker builders kindly provided by the F-Droid
+  project via Hans from The Guardian Project)
+- Travis CI (used by some of our projects such as tpo/core/tor.git for
+  Linux and MacOS builds)
+- Appveyor (used by tpo/core/tor.git for Windows builds)
+By the end of 2020, however, [pricing changes at Travis
+CI](https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing) made it difficult for the network team to continue running the
+Mac OS builds there. Furthermore, it was felt that Appveyor was too
+slow to be useful for builds, so it was proposed ([issue 40095][]) to
+create a pair of bare metal machines to run those builds, through a
+`libvirt` architecture. This is an exception to [TPA-RFC 7: tools](policy/tpa-rfc-7-tools);
+the exception was formally proposed in [TPA-RFC-8][].
+[issue 40095]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40095
+[TPA-RFC-8]: policy/tpa-rfc-8-gitlab-ci-libvirt
+## Goals
+In general, the idea here is to evaluate GitLab CI as a unified
+platform to replace Travis and Appveyor in the short term, but also,
+in the longer term, Jenkins itself.
+### Must have
+ * automated configuration: setting up new builders should be done
+   through Puppet
+ * the above requires excellent documentation of the setup procedure
+   in the development stages, so that TPA can transform that into a
+   working Puppet manifest
+ * Linux, Windows, Mac OS support
+ * x86-64 architecture ("64-bit version of the x86 instruction set",
+   AKA x64, AMD64, Intel 64, what most people use on their computers)
+ * Travis replacement
+ * autonomy: users should be able to set up new builds without
+   intervention from the service (or system!) administrators
+ * clean environments: each build should run in a clean VM
+### Nice to have
+ * fast: the runners should be fast (as in: powerful CPUs, good disks,
+   lots of RAM to cache filesystems, CoW disks) and impose little
+   overhead above running the code natively (as in: no emulation)
+ * ARM64 architecture
+ * Apple M1 support
+ * Jenkins replacement
+ * Appveyor replacement
+ * BSD support (FreeBSD, OpenBSD, and NetBSD in that order)
+### Non-Goals
+ * in the short term, we don't aim at doing "Continuous
+   Deployment". This is one of the possible goals of the GitLab CI
+   deployment, but it is considered out of scope for now. See also the
+   [LDAP proposed solutions section][]
+[LDAP proposed solutions section]: howto/ldap#Proposed-Solution
+## Approvals required
+TPA's approval is required for the libvirt exception, see
+[TPA-RFC-8][].
+## Proposed Solution
+The [original proposal][issue 40095] from @ahf was as follows:
+> [...] Reserve two (ideally) "fast" Debian-based machines on TPO infrastructure to build the following:
+>
+> * Run Gitlab CI runners via KVM (initially with focus on Windows
+>   x86-64 and macOS x86-64). This will replace the need for Travis CI
+>   and Appveyor. This should allow both the network team, application
+>   team, and anti-censorship team to test software on these platforms
+>   (either by building in the VMs or by fetching cross-compiled
+>   binaries on the hosts via the Gitlab CI pipeline feature). Since
+>   none(?) of our engineering staff are working full-time on MacOS
+>   and Windows, we rely quite a bit on this for QA.
+> * Run Gitlab CI runners via KVM for the BSD's. Same argument as
+>   above, but is much less urgent.
+> * Spare capacity (once we have measured it) can be used a generic
+>   Gitlab CI Docker runner in addition to the FDroid builders.
+> * The faster the CPU the faster the builds.
+> * Lots of RAM allows us to do things such as having CoW filesystems
+>   in memory for the ephemeral builders and should speed up builds
+>   due to faster I/O.
+All this would be implemented through a GitLab [custom executor][]
+using [libvirt](https://libvirt.org/) (see [this example implementation](https://docs.gitlab.com/runner/executors/custom_examples/libvirt.html)).
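To illustrate the per-job lifecycle such a custom executor implies (clone a base image, boot it, then destroy it once the job ends), here is a minimal sketch modeled loosely on the upstream libvirt example. It only builds the command lines and does not run them. The base image name `ci-base-image` and the function names are hypothetical; the `CUSTOM_ENV_*` variables are the job variables GitLab Runner exports to custom executor scripts. The `run` stage, which would execute the job script inside the VM (typically over SSH), is left out.

```python
def vm_name(env):
    """Derive a unique per-job VM name from the exported job variables."""
    return "runner-{}-project-{}-concurrent-{}-job-{}".format(
        env["CUSTOM_ENV_CI_RUNNER_ID"],
        env["CUSTOM_ENV_CI_PROJECT_ID"],
        env["CUSTOM_ENV_CI_CONCURRENT_PROJECT_ID"],
        env["CUSTOM_ENV_CI_JOB_ID"],
    )


def stage_commands(stage, env, base_image="ci-base-image"):
    """Return the commands for one executor stage as argument lists.

    `base_image` is a hypothetical libvirt domain used as a template.
    """
    name = vm_name(env)
    if stage == "prepare":
        # Clone the template so every job starts from a clean VM, then boot it.
        return [
            ["virt-clone", "--original", base_image, "--name", name, "--auto-clone"],
            ["virsh", "start", name],
        ]
    if stage == "cleanup":
        # Force the VM off, then remove its definition and its disks.
        return [
            ["virsh", "destroy", name],
            ["virsh", "undefine", name, "--remove-all-storage"],
        ]
    raise ValueError(f"stage not covered by this sketch: {stage}")
```

A real `prepare_exec` or `cleanup_exec` script would pass `os.environ` as `env` and hand each argument list to `subprocess.run(..., check=True)`. Cloning per job is what provides the "clean environment" requirement listed in the goals.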
+This is an excerpt from the [proposal sent to TPA][TPA-RFC-8]:
+> [TPA would] build two (bare metal) machines (in the Cymru cluster)
+> to manage those runners. The machines would grant the GitLab runner
+> (and also @ahf) access to the libvirt environment (through a role
+> user).
+> 
+> ahf would be responsible for creating the base image and deploying the
+> first machine, documenting every step of the way in the TPA wiki. The
+> second machine would be built with Puppet, using those instructions,
+> so that the first machine can be rebuilt or replaced. Once the second
+> machine is built, the first machine should be destroyed and rebuilt,
+> unless we are absolutely confident the machines are identical.
+> 
+> [custom executor]: https://docs.gitlab.com/runner/executors/custom.html
+## Cost
+The machines used were donated, but that is still a "hardware
+opportunity cost" that is currently undefined.
+Staff costs, naturally, should be counted. It is estimated the initial
+runner setup should take less than two weeks.
+## Alternatives considered
+### Ganeti
+Ganeti has been considered as an orchestration/deployment platform for
+the runners, but there is no known integration between GitLab CI
+runners and Ganeti.
+If we find the time or an existing implementation, this would still be
+a nice improvement.
+### SSH/shell executors
+This works by using an existing machine as a place to run the
+jobs. The problem is that jobs don't run in a clean environment, so
+it's not a good fit.
+### Parallels/VirtualBox
+Note: couldn't figure out what the difference is between Parallels and
+VirtualBox, nor if it matters.
+Obviously, VirtualBox could be used to run Windows (and possibly
+MacOS?) images (and maybe BSDs?) but unfortunately, Oracle has made a
+mess of VirtualBox, which [keeps it out of Debian](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=794466), so this could be
+a problematic deployment as well.
+### Docker
+[Support in Debian](https://tracker.debian.org/pkg/docker.io) has improved, but is still hit-and-miss. There is no
+support for Windows or MacOS, as far as I know, so it is not a complete
+solution, but it could be used for Linux runners.
+### Docker machine
+This was abandoned upstream and is considered irrelevant.
+### Kubernetes
+@anarcat has been thinking about setting up a Kubernetes cluster for
+GitLab. There are high hopes that it will help us not only with GitLab
+CI, but also the "CD" (Continuous Deployment) side of things. This
+approach was briefly [discussed in the LDAP audit][LDAP proposed solutions section], but basically the
+idea would be to replace the "SSH + role user" approach we currently
+use for services with GitLab CI.
+As explained in the [goals](#Goals) section above, this is currently out of
+scope, but could be considered instead of Docker for runners. 
+### Jenkins
+[Jenkins][Jenkins CI] was a fine piece of software when it came out: builds! We
+can easily do builds! On multiple machines too! And a nice web
+interface with [weird blue balls](https://www.jenkins.io/blog/2012/03/13/why-does-jenkins-have-blue-balls/)! It was great. But then Travis
+came along, and then GitLab CI, and then GitHub actions, and it turns
+out it's much, much easier and more intuitive to delegate the build
+configuration to the project as opposed to keeping it in the CI
+system.
+The design of Jenkins, in other words, feels dated now. It imposes an
+unnecessary burden on the service admins, who are responsible for
+configuring and monitoring builds for their users.
+It is also believed that installing GitLab runners will be easier on
+the sysadmins, although that remains to be verified.
+In the short term, Jenkins can keep doing what it does, but in the
+long term, we would greatly benefit from retiring yet another service,
+since it basically duplicates what GitLab CI can do.
+GitLab CI also has the advantage of being able to easily integrate
+with GitLab pages, making it easier for people to build static
+websites than the current combination of Jenkins and our [static sites
+system](howto/static-component). See the [alternatives to the static site
+system](static-component#Alternatives-considered) for more information.
+[Jenkins CI]: https://en.wikipedia.org/wiki/Jenkins_(software)