Verified Commit e17121b1 authored by anarcat

document the heck out of the ci project

parent d6f9f64f
@@ -6,13 +6,9 @@ uses GitLab mainly for issue tracking, wiki hosting and code review
for now, at <https://gitlab.torproject.org>, after migrating from
[howto/trac](howto/trac).
[[_TOC_]]
Note that continuous integration is documented separately, in [the CI page](service/ci).
<!-- note: this template was designed based on multiple sources: -->
<!-- https://www.divio.com/blog/documentation/ -->
<!-- http://opsreportcard.com/section/9-->
<!-- http://opsreportcard.com/section/11 -->
<!-- comments like this one should be removed on instanciation -->
[[_TOC_]]
# Tutorial
@@ -539,6 +539,8 @@ of copies of the sites we have to keep around.
* the [cache system](cache) could be used as a replacement in the
front-end
TODO: benchmark gitlab pages vs (say) apache or nginx.
<!-- LocalWords: atomicity DDOS YAML Hiera webserver NFS CephFS TLS
-->
<!-- LocalWords: filesystem GitLab scalable frontend CDN HTTPS DNS
[Continuous Integration](https://en.wikipedia.org/wiki/Continuous_integration) is the system that allows tests to be run
and packages to be built automatically when new code is pushed to
the version control system (currently [git](howto/git)).
Note that even though the current system is [Jenkins][], this page mostly documents GitLab
CI, as that is the likely long-term replacement.
[Jenkins]: https://jenkins.torproject.org
[[_TOC_]]
# Tutorial
[GitLab CI][GitLab CI splash] has [good documentation upstream][GitLab CI upstream]. This section
answers frequent questions we might get about the service.
[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
<!-- simple, brainless step-by-step instructions requiring little or -->
<!-- no technical background -->
## Getting started
The [GitLab CI quickstart][] should get you started here. Note that
there are some "shared runners" you can already use, which should
be available to all projects.
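
As an example, a minimal `.gitlab-ci.yml` committed at the root of a
project could look something like the following sketch (the Docker
image and the commands are placeholders to adapt to your project):

```yaml
# minimal pipeline sketch; image and commands are placeholders
image: debian:stable

stages:
  - test

test:
  stage: test
  script:
    - apt-get update && apt-get install -y build-essential
    - make check
```

Once that file is pushed, GitLab should pick it up and run the
pipeline on every subsequent push, assuming shared runners are enabled
for the project.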
TODO: time limits? should we say how to enable the shared runners?
# How-to
<!-- more in-depth procedure that may require interpretation -->
## Pager playbook
<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
TODO: what happens if there's trouble with the f-droid runners? who to
ping? anything we can do to diagnose the problem? what kind of
information to send them?
## Disaster recovery
Runners should be disposable: if a runner is destroyed, at most the
jobs it is currently running will be lost. Artifacts should otherwise
be present on the GitLab server, so recovering a runner is as "simple"
as creating a new one.
# Reference
## Installation
Since GitLab CI is basically GitLab with external runners hooked up to
it, this section documents how to install runners and register them
with GitLab.
### Linux
TODO: document how the F-Droid runners were hooked up to GitLab
CI. Anything special on top of [the official docs](https://docs.gitlab.com/runner/register/)?
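
In the meantime, here is a rough sketch of what registering a Docker
runner against our GitLab looks like, based on the official docs
above; the registration token, executor and description below are
examples, not the actual values used for the F-Droid runners:

```sh
# sketch only: token, executor and description are placeholders
gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.torproject.org/" \
  --registration-token "REGISTRATION_TOKEN" \
  --executor "docker" \
  --docker-image "debian:stable" \
  --description "example-docker-runner"
```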
### MacOS/Windows
TODO: @ahf document how MacOS/Windows images are created and runners
are set up. Don't hesitate to create separate headings for Windows vs
MacOS and for image creation vs runner setup.
## SLA
The GitLab CI service is offered on a "best effort" basis and might
not be fully available.
## Design
<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" documented, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
## Issues
[File][] or [search][] for issues in the [GitLab issue tracker][search].
[File]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/new
[search]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues
## Monitoring and testing
TODO: @ahf how do we monitor the runners? maybe the prometheus
exporter has something? should we hook it inside nagios to get alerts
when runners get overwhelmed?
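
One possible building block: the `gitlab-runner` daemon has an
embedded Prometheus metrics endpoint that can be enabled through the
global `listen_address` setting in `config.toml`; the port below is
just an example, and the scraping and alerting side would still need
to be configured in our monitoring:

```toml
# /etc/gitlab-runner/config.toml (global section) -- sketch only
# exposes runner metrics (job counts, errors, etc.) for Prometheus
listen_address = ":9252"
```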
## Logs and metrics
TODO: do runners keep logs? where? does it matter? any PII?
TODO: how about performance metrics? how do we know when we'll run out
of capacity in the runner network since we don't host the f-droid
stuff?
## Backups
This service requires no backups: all configuration should be
performed by Puppet and/or documented in this wiki page. A lost runner
should be rebuilt from scratch, as per [disaster recovery](#disaster-recovery).
## Other documentation
* [GitLab CI promotional page][GitLab CI splash]
* [GitLab CI upstream documentation portal][GitLab CI upstream]
* [GitLab CI quickstart][]
[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
# Discussion
Tor currently uses [Jenkins][] to run tests, builds and various
automated jobs. This discussion is about whether and how to replace this
with GitLab CI.
## Overview
<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
Ever since the [GitLab migration](howto/gitlab), we have discussed the
possibility of replacing Jenkins with GitLab CI, or at least using
GitLab CI in some way.
Tor currently utilizes a mixture of different CI systems to ensure
some form of quality assurance as part of the software development
process:
- Jenkins (provided by TPA)
- Gitlab CI (currently Docker builders kindly provided by the FDroid
project via Hans from The Guardian Project)
- Travis CI (used by some of our projects such as tpo/core/tor.git for
Linux and MacOS builds)
- Appveyor (used by tpo/core/tor.git for Windows builds)
By the end of 2020, however, [pricing changes at Travis
CI](https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing) made it difficult for the network team to continue running the
Mac OS builds there. Furthermore, it was felt that Appveyor was too
slow to be useful for builds, so it was proposed ([issue 40095][]) to
create a pair of bare metal machines to run those builds, through a
`libvirt` architecture. This is an exception to [TPA-RFC 7: tools](policy/tpa-rfc-7-tools);
the exception was formally proposed in [TPA-RFC-8][].
[issue 40095]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40095
[TPA-RFC-8]: policy/tpa-rfc-8-gitlab-ci-libvirt
## Goals
In general, the idea here is to evaluate GitLab CI as a unified
platform to replace Travis CI and Appveyor in the short term, and,
in the longer term, Jenkins itself.
### Must have
* automated configuration: setting up new builders should be done
through Puppet
* the above requires excellent documentation of the setup procedure
in the development stages, so that TPA can transform that into a
working Puppet manifest
* Linux, Windows, Mac OS support
* x86-64 architecture ("64-bit version of the x86 instruction set",
AKA x64, AMD64, Intel 64, what most people use on their computers)
* Travis replacement
* autonomy: users should be able to set up new builds without
intervention from the service (or system!) administrators
* clean environments: each build should run in a clean VM
### Nice to have
* fast: the runners should be fast (as in: powerful CPUs, good disks,
lots of RAM to cache filesystems, CoW disks) and impose little
overhead above running the code natively (as in: no emulation)
* ARM64 architecture
* Apple M1 support
* Jenkins replacement
* Appveyor replacement
* BSD support (FreeBSD, OpenBSD, and NetBSD in that order)
### Non-Goals
* in the short term, we don't aim at doing "Continuous
Deployment": this is one of the possible goals of the GitLab CI
deployment, but it is considered out of scope for now. See also the
[LDAP proposed solutions section][]
[LDAP proposed solutions section]: howto/ldap#Proposed-Solution
## Approvals required
TPA's approval is required for the libvirt exception, see
[TPA-RFC-8][].
## Proposed Solution
The [original proposal][issue 40095] from @ahf went as follows:
> [...] Reserve two (ideally) "fast" Debian-based machines on TPO infrastructure to build the following:
>
> * Run Gitlab CI runners via KVM (initially with focus on Windows
> x86-64 and macOS x86-64). This will replace the need for Travis CI
> and Appveyor. This should allow both the network team, application
> team, and anti-censorship team to test software on these platforms
> (either by building in the VMs or by fetching cross-compiled
> binaries on the hosts via the Gitlab CI pipeline feature). Since
> none(?) of our engineering staff are working full-time on MacOS
> and Windows, we rely quite a bit on this for QA.
> * Run Gitlab CI runners via KVM for the BSD's. Same argument as
> above, but is much less urgent.
> * Spare capacity (once we have measured it) can be used as a generic
> Gitlab CI Docker runner in addition to the FDroid builders.
> * The faster the CPU the faster the builds.
> * Lots of RAM allows us to do things such as having CoW filesystems
> in memory for the ephemeral builders and should speed up builds
> due to faster I/O.
All this would be implemented through a GitLab [custom executor][]
using [libvirt](https://libvirt.org/) (see [this example implementation](https://docs.gitlab.com/runner/executors/custom_examples/libvirt.html)).
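
Concretely, and based on the upstream libvirt example linked above,
the runner's `config.toml` would point the custom executor at a set of
driver scripts, roughly like this (the paths, names and token are
placeholders, not our actual configuration):

```toml
# excerpt of /etc/gitlab-runner/config.toml -- sketch based on the
# upstream libvirt custom executor example; values are placeholders
[[runners]]
  name = "libvirt-runner-example"
  url = "https://gitlab.torproject.org/"
  token = "RUNNER_TOKEN"
  executor = "custom"
  builds_dir = "/home/gitlab-runner/builds"
  cache_dir = "/home/gitlab-runner/cache"
  [runners.custom]
    prepare_exec = "/opt/libvirt-driver/prepare.sh" # clone and boot a fresh VM
    run_exec = "/opt/libvirt-driver/run.sh"         # run the job script in the VM
    cleanup_exec = "/opt/libvirt-driver/cleanup.sh" # destroy the VM
```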
This is an excerpt from the [proposal sent to TPA][TPA-RFC-8]:
> [TPA would] build two (bare metal) machines (in the Cymru cluster)
> to manage those runners. The machines would grant the GitLab runner
> (and also @ahf) access to the libvirt environment (through a role
> user).
>
> ahf would be responsible for creating the base image and deploying the
> first machine, documenting every step of the way in the TPA wiki. The
> second machine would be built with Puppet, using those instructions,
> so that the first machine can be rebuilt or replaced. Once the second
> machine is built, the first machine should be destroyed and rebuilt,
> unless we are absolutely confident the machines are identical.

[custom executor]: https://docs.gitlab.com/runner/executors/custom.html
## Cost
The machines used were donated, but that is still a "hardware
opportunity cost" that is currently undefined.
Staff costs, naturally, should be counted. It is estimated that the
initial runner setup should take less than two weeks.
## Alternatives considered
### Ganeti
Ganeti has been considered as an orchestration/deployment platform for
the runners, but there is no known integration between GitLab CI
runners and Ganeti.
If we find the time or an existing implementation, this would still be
a nice improvement.
### SSH/shell executors
This works by using an existing machine as a place to run the
jobs. The problem is that jobs do not run in a clean environment, so
it's not a good fit.
### Parallels/VirtualBox
Note: couldn't figure out what the difference is between Parallels and
VirtualBox, nor if it matters.
Obviously, VirtualBox could be used to run Windows (and possibly
MacOS?) images (and maybe BSDs?) but unfortunately, Oracle has made a
mess of VirtualBox which [keeps it out of Debian](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=794466), so this could be
a problematic deployment as well.
### Docker
[Support in Debian](https://tracker.debian.org/pkg/docker.io) has improved, but is still hit-and-miss. There is no
support for Windows or MacOS, as far as I know, so it's not a complete
solution, but it could be used for Linux runners.
### Docker machine
This was abandoned upstream and is considered irrelevant.
### Kubernetes
@anarcat has been thinking about setting up a Kubernetes cluster for
GitLab. There are high hopes that it will help us not only with GitLab
CI, but also the "CD" (Continuous Deployment) side of things. This
approach was briefly [discussed in the LDAP audit][LDAP proposed solutions section], but basically the
idea would be to replace the "SSH + role user" approach we currently
use for services with GitLab CI.
As explained in the [goals](#goals) section above, this is currently out of
scope, but could be considered instead of Docker for runners.
### Jenkins
[Jenkins][Jenkins CI] was a fine piece of software when it came out: builds! We
can easily do builds! On multiple machines too! And a nice web
interface with [weird blue balls](https://www.jenkins.io/blog/2012/03/13/why-does-jenkins-have-blue-balls/)! It was great. But then Travis
came along, and then GitLab CI, and then GitHub Actions, and it turns
out it's much, much easier and more intuitive to delegate the build
configuration to the project as opposed to keeping it in the CI
system.
The design of Jenkins, in other words, feels dated now. It imposes an
unnecessary burden on the service admins, who are responsible for
configuring and monitoring builds for their users.
It is also believed that installing GitLab runners will be easier on
the sysadmins, although that remains to be verified.
In the short term, Jenkins can keep doing what it does, but in the
long term, we would greatly benefit from retiring yet another service,
since it basically duplicates what GitLab CI can do.
GitLab CI also has the advantage of being able to easily integrate
with GitLab pages, making it easier for people to build static
websites than the current combination of Jenkins and our [static sites
system](howto/static-component). See the [alternatives to the static site
system](static-component#Alternatives-considered) for more information.
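
As a sketch, publishing a site then boils down to a `pages` job in the
project's `.gitlab-ci.yml` that leaves the generated site in a
`public/` artifact (the image and build command below are
placeholders):

```yaml
# sketch of a GitLab Pages job; image and build command are placeholders
pages:
  image: debian:stable
  script:
    - ./build.sh public/   # generate the static site into public/
  artifacts:
    paths:
      - public
```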
[Jenkins CI]: https://en.wikipedia.org/wiki/Jenkins_(software)