diff --git a/howto/gitlab.md b/howto/gitlab.md
index 7ba8d879829494bc3d1daf8a7585a4d9755958b0..ffe4ff45871c6ff729a22ebc3860a3a763e89f09 100644
--- a/howto/gitlab.md
+++ b/howto/gitlab.md
@@ -6,13 +6,9 @@
 uses GitLab mainly for issue tracking, wiki hosting and code review for now, at
 <https://gitlab.torproject.org>, after migrating from [howto/trac](howto/trac).

-[[_TOC_]]
+Note that continuous integration is documented separately, in [the CI page](service/ci).

-<!-- note: this template was designed based on multiple sources: -->
-<!-- https://www.divio.com/blog/documentation/ -->
-<!-- http://opsreportcard.com/section/9-->
-<!-- http://opsreportcard.com/section/11 -->
-<!-- comments like this one should be removed on instanciation -->
+[[_TOC_]]

 # Tutorial
diff --git a/howto/static-component.md b/howto/static-component.md
index 61f01ac3e2597eb90b6c45711dcbeba73fa785d2..91b565a25882a0298e60753d89395af03831e0e4 100644
--- a/howto/static-component.md
+++ b/howto/static-component.md
@@ -539,6 +539,8 @@
 of copies of the sites we have to keep around.

 * the [cache system](cache) could be used as a replacement in the front-end

+TODO: benchmark gitlab pages vs (say) apache or nginx.
+
 <!-- LocalWords: atomicity DDOS YAML Hiera webserver NFS CephFS TLS -->
 <!-- LocalWords: filesystem GitLab scalable frontend CDN HTTPS DNS
diff --git a/service/ci.md b/service/ci.md
new file mode 100644
index 0000000000000000000000000000000000000000..3eea0613a5212e23a81f2b1ed37a08e6de27db46
--- /dev/null
+++ b/service/ci.md
@@ -0,0 +1,332 @@
+[Continuous Integration](https://en.wikipedia.org/wiki/Continuous_integration) is the system that allows tests to be run
+and packages to be built automatically when new code is pushed to
+the version control system (currently [git](howto/git)).
+
+Note that even though the current system is [Jenkins][], this page mostly documents GitLab
+CI, as that is the likely long-term replacement.
+
+[Jenkins]: https://jenkins.torproject.org
+
+[[_TOC_]]
+
+# Tutorial
+
+[GitLab CI][GitLab CI splash] has [good documentation upstream][GitLab CI upstream]. This section
+documents frequently asked questions about this work.
+
+[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
+[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
+[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
+
+<!-- simple, brainless step-by-step instructions requiring little or -->
+<!-- no technical background -->
+
+## Getting started
+
+The [GitLab CI quickstart][] should get you started here. Note that
+there are some "shared runners" you can already use, which should
+be available to all projects.
+
+TODO: time limits? should we say how to enable the shared runners?
+
+# How-to
+
+<!-- more in-depth procedure that may require interpretation -->
+
+## Pager playbook
+
+<!-- information about common errors from the monitoring system and -->
+<!-- how to deal with them. this should be easy to follow: think of -->
+<!-- your future self, in a stressful situation, tired and hungry. -->
+
+TODO: what happens if there's trouble with the F-Droid runners? who to
+ping? anything we can do to diagnose the problem? what kind of
+information to send them?
+
+## Disaster recovery
+
+Runners should be disposable: if a runner is destroyed, at most the
+jobs it is currently running will be lost. Artifacts should otherwise
+be present on the GitLab server, so recovering a runner is as "simple"
+as creating a new one.
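+
+As a minimal sketch (not a tested procedure), rebuilding a lost Linux
+runner could look something like the following, assuming the Docker
+executor and a registration token taken from the project or group
+CI/CD settings; the description, tags and image are only examples, and
+the exact flags depend on the gitlab-runner version:
+
+```sh
+# install the gitlab-runner package (from Debian or from upstream's apt
+# repository), then register the new runner against our GitLab instance
+gitlab-runner register \
+  --non-interactive \
+  --url https://gitlab.torproject.org/ \
+  --registration-token "$REGISTRATION_TOKEN" \
+  --executor docker \
+  --docker-image debian:stable \
+  --description "replacement-runner-01" \
+  --tag-list "linux,docker"
+```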
+
+# Reference
+
+## Installation
+
+Since GitLab CI is basically GitLab with external runners hooked up to
+it, this section documents how to install runners and register them
+with GitLab.
+
+### Linux
+
+TODO: document how the F-Droid runners were hooked up to GitLab
+CI. Anything special on top of [the official docs](https://docs.gitlab.com/runner/register/)?
+
+### macOS/Windows
+
+TODO: @ahf document how macOS/Windows images are created and runners
+are set up. Don't hesitate to create separate headings for Windows vs
+macOS and for image creation vs runner setup.
+
+## SLA
+
+The GitLab CI service is offered on a "best effort" basis and might
+not be fully available.
+
+## Design
+
+<!-- how this is built -->
+<!-- should reuse and expand on the "proposed solution", it's an -->
+<!-- "as-built" document, whereas the "Proposed solution" is an -->
+<!-- "architectural" document, which the final result might differ -->
+<!-- from, sometimes significantly -->
+
+<!-- a good guide to "audit" an existing project's design: -->
+<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+
+## Issues
+
+[File][] or [search][] for issues in the [GitLab issue tracker][search].
+
+ [File]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues/new
+ [search]: https://gitlab.torproject.org/tpo/tpa/gitlab/-/issues
+
+## Monitoring and testing
+
+TODO: @ahf how do we monitor the runners? maybe the Prometheus
+exporter has something? should we hook it into Nagios to get alerts
+when runners get overwhelmed?
+
+## Logs and metrics
+
+TODO: do runners keep logs? where? does it matter? any PII?
+
+TODO: how about performance metrics? how do we know when we'll run out
+of capacity in the runner network, since we don't host the F-Droid
+runners ourselves?
+
+## Backups
+
+This service requires no backups: all configuration should be
+performed by Puppet and/or documented in this wiki page. A lost runner
+should be rebuilt from scratch, as per [disaster recovery](#disaster-recovery).
+
+## Other documentation
+
+ * [GitLab CI promotional page][GitLab CI splash]
+ * [GitLab CI upstream documentation portal][GitLab CI upstream]
+ * [GitLab CI quickstart][]
+
+[GitLab CI upstream]: https://docs.gitlab.com/ee/ci/
+[GitLab CI splash]: https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/
+[GitLab CI quickstart]: https://docs.gitlab.com/ee/ci/quick_start/README.html
+
+# Discussion
+
+Tor currently uses [Jenkins][] to run tests, builds and various
+automated jobs. This discussion is about whether and how to replace it
+with GitLab CI.
+
+## Overview
+
+<!-- describe the overall project. should include a link to a ticket -->
+<!-- that has a launch checklist -->
+
+Ever since the [GitLab migration](howto/gitlab), we have discussed the
+possibility of replacing Jenkins with GitLab CI, or at least using
+GitLab CI in some way.
+
+Tor currently uses a mixture of different CI systems to ensure
+some form of quality assurance as part of the software development
+process:
+
+- Jenkins (provided by TPA)
+- GitLab CI (currently Docker builders kindly provided by the F-Droid
+  project via Hans from The Guardian Project)
+- Travis CI (used by some of our projects such as tpo/core/tor.git for
+  Linux and macOS builds)
+- Appveyor (used by tpo/core/tor.git for Windows builds)
+
+By the end of 2020, however, [pricing changes at Travis
+CI](https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing) made it difficult for the network team to continue running the
+macOS builds there.
+Furthermore, it was felt that Appveyor was too
+slow to be useful for builds, so it was proposed ([issue 40095][]) to
+create a pair of bare metal machines to run those builds, through a
+`libvirt` architecture. This is an exception to [TPA-RFC 7: tools](policy/tpa-rfc-7-tools);
+the exception was formally proposed in [TPA-RFC-8][].
+
+[issue 40095]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/40095
+[TPA-RFC-8]: policy/tpa-rfc-8-gitlab-ci-libvirt
+
+## Goals
+
+In general, the idea is to evaluate GitLab CI as a unified
+platform to replace Travis and Appveyor in the short term, but also,
+in the longer term, Jenkins itself.
+
+### Must have
+
+ * automated configuration: setting up new builders should be done
+   through Puppet
+ * the above requires excellent documentation of the setup procedure
+   in the development stages, so that TPA can transform that into a
+   working Puppet manifest
+ * Linux, Windows, macOS support
+ * x86-64 architecture ("64-bit version of the x86 instruction set",
+   AKA x64, AMD64, Intel 64, what most people use on their computers)
+ * Travis replacement
+ * autonomy: users should be able to set up new builds without
+   intervention from the service (or system!) administrators
+ * clean environments: each build should run in a clean VM
+
+### Nice to have
+
+ * fast: the runners should be fast (as in: powerful CPUs, good disks,
+   lots of RAM to cache filesystems, CoW disks) and impose little
+   overhead above running the code natively (as in: no emulation)
+ * ARM64 architecture
+ * Apple M1 support
+ * Jenkins replacement
+ * Appveyor replacement
+ * BSD support (FreeBSD, OpenBSD, and NetBSD, in that order)
+
+### Non-Goals
+
+ * in the short term, we do not aim to do "Continuous
+   Deployment". This is one of the possible goals of the GitLab CI
+   deployment, but it is considered out of scope for now. See also the
+   [LDAP proposed solutions section][]
+
+[LDAP proposed solutions section]: howto/ldap#Proposed-Solution
+
+## Approvals required
+
+TPA's approval is required for the libvirt exception; see
+[TPA-RFC-8][].
+
+## Proposed Solution
+
+The [original proposal][issue 40095] from @ahf went as follows:
+
+> [...] Reserve two (ideally) "fast" Debian-based machines on TPO infrastructure to build the following:
+>
+> * Run Gitlab CI runners via KVM (initially with focus on Windows
+>   x86-64 and macOS x86-64). This will replace the need for Travis CI
+>   and Appveyor. This should allow both the network team, application
+>   team, and anti-censorship team to test software on these platforms
+>   (either by building in the VMs or by fetching cross-compiled
+>   binaries on the hosts via the Gitlab CI pipeline feature). Since
+>   none(?) of our engineering staff are working full-time on MacOS
+>   and Windows, we rely quite a bit on this for QA.
+> * Run Gitlab CI runners via KVM for the BSD's. Same argument as
+>   above, but is much less urgent.
+> * Spare capacity (once we have measured it) can be used as a generic
+>   Gitlab CI Docker runner in addition to the FDroid builders.
+> * The faster the CPU the faster the builds.
+> * Lots of RAM allows us to do things such as having CoW filesystems
+>   in memory for the ephemeral builders and should speed up builds
+>   due to faster I/O.
+
+All this would be implemented through a GitLab [custom executor][]
+using [libvirt](https://libvirt.org/) (see [this example implementation](https://docs.gitlab.com/runner/executors/custom_examples/libvirt.html)).
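+
+Concretely, on the runner host this would mean registering a runner
+with the `custom` executor and pointing it at a set of driver scripts
+that create, use and tear down the libvirt VMs. The following
+`config.toml` excerpt is only a rough sketch, loosely based on the
+upstream libvirt example; the driver script paths are placeholders,
+not the names of an actual deployment:
+
+```toml
+# /etc/gitlab-runner/config.toml (excerpt)
+[[runners]]
+  name = "ci-libvirt-01"
+  url = "https://gitlab.torproject.org/"
+  token = "REDACTED"
+  executor = "custom"
+  # the custom executor requires these to be set explicitly
+  builds_dir = "/home/gitlab-runner/builds"
+  cache_dir = "/home/gitlab-runner/cache"
+  [runners.custom]
+    config_exec  = "/opt/libvirt-driver/config.sh"   # emit driver configuration
+    prepare_exec = "/opt/libvirt-driver/prepare.sh"  # clone the base image, boot the VM
+    run_exec     = "/opt/libvirt-driver/run.sh"      # run each job stage in the VM over SSH
+    cleanup_exec = "/opt/libvirt-driver/cleanup.sh"  # shut down and delete the VM
+```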
+
+This is an excerpt from the [proposal sent to TPA][TPA-RFC-8]:
+
+> [TPA would] build two (bare metal) machines (in the Cymru cluster)
+> to manage those runners. The machines would grant the GitLab runner
+> (and also @ahf) access to the libvirt environment (through a role
+> user).
+>
+> ahf would be responsible for creating the base image and deploying the
+> first machine, documenting every step of the way in the TPA wiki. The
+> second machine would be built with Puppet, using those instructions,
+> so that the first machine can be rebuilt or replaced. Once the second
+> machine is built, the first machine should be destroyed and rebuilt,
+> unless we are absolutely confident the machines are identical.
+
+[custom executor]: https://docs.gitlab.com/runner/executors/custom.html
+
+## Cost
+
+The machines used were donated, but that is still a "hardware
+opportunity cost" that is currently unquantified.
+
+Staff costs, naturally, should be counted. It is estimated that the
+initial runner setup should take less than two weeks.
+
+## Alternatives considered
+
+### Ganeti
+
+Ganeti has been considered as an orchestration/deployment platform for
+the runners, but there is no known integration between GitLab CI
+runners and Ganeti.
+
+If we find the time or an existing implementation, this would still be
+a nice improvement.
+
+### SSH/shell executors
+
+These work by using an existing machine as a place to run the
+jobs. The problem is that jobs do not run in a clean environment, so
+this is not a good fit.
+
+### Parallels/VirtualBox
+
+Note: it is unclear what the difference is between the Parallels and
+VirtualBox executors, or whether it matters.
+
+Obviously, VirtualBox could be used to run Windows (and possibly
+macOS?) images (and maybe BSDs?), but unfortunately Oracle has made a
+mess of VirtualBox, which [keeps it out of Debian](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=794466), so this could be
+a problematic deployment as well.
+
+### Docker
+
+[Support in Debian](https://tracker.debian.org/pkg/docker.io) has improved, but is still hit-and-miss. There is
+no support for Windows or macOS, as far as I know, so it is not a
+complete solution, but it could be used for Linux runners.
+
+### Docker machine
+
+This was abandoned upstream and is considered irrelevant.
+
+### Kubernetes
+
+@anarcat has been thinking about setting up a Kubernetes cluster for
+GitLab. There are high hopes that it will help us not only with GitLab
+CI, but also with the "CD" (Continuous Deployment) side of things. This
+approach was briefly [discussed in the LDAP audit][LDAP proposed solutions section], but basically the
+idea would be to replace the "SSH + role user" approach we currently
+use for services with GitLab CI.
+
+As explained in the [goals](#Goals) section above, this is currently out of
+scope, but could be considered instead of Docker for runners.
+
+### Jenkins
+
+[Jenkins][Jenkins CI] was a fine piece of software when it came out: builds! We
+can easily do builds! On multiple machines too! And a nice web
+interface with [weird blue balls](https://www.jenkins.io/blog/2012/03/13/why-does-jenkins-have-blue-balls/)! It was great. But then Travis
+came along, and then GitLab CI, and then GitHub Actions, and it turns
+out it's much, much easier and more intuitive to delegate the build
+configuration to the project as opposed to keeping it in the CI
+system.
+
+The design of Jenkins, in other words, feels dated now.
+It imposes an
+unnecessary burden on the service admins, who are responsible for
+configuring and monitoring builds for their users.
+
+It is also believed that installing GitLab runners will be easier on
+the sysadmins, although that remains to be verified.
+
+In the short term, Jenkins can keep doing what it does, but in the
+long term, we would greatly benefit from retiring yet another service,
+since it basically duplicates what GitLab CI can do.
+
+GitLab CI also has the advantage of integrating easily with GitLab
+Pages, making it easier for people to build static websites than the
+current combination of Jenkins and our [static sites
+system](howto/static-component). See the [alternatives to the static site
+system](howto/static-component#Alternatives-considered) for more
+information; a minimal Pages pipeline is sketched below.
+
+[Jenkins CI]: https://en.wikipedia.org/wiki/Jenkins_(software)
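+
+As an illustration of that GitLab Pages integration, here is a minimal
+sketch of a `.gitlab-ci.yml` that publishes a static site; `hugo` and
+the Debian image are only examples, and any job named `pages` that
+writes the site to `public/` and saves it as an artifact should work:
+
+```yaml
+# .gitlab-ci.yml -- minimal GitLab Pages pipeline (example only)
+pages:
+  image: debian:stable
+  script:
+    - apt-get update && apt-get install -y hugo
+    - hugo --destination public
+  artifacts:
+    paths:
+      - public
+  rules:
+    - if: '$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH'
+```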