diff --git a/tsa/howto/gitlab.mdwn b/tsa/howto/gitlab.mdwn
index a8a4674bf05faa36dd715019d67463ece0397077..3a2d05f4ca73fdacacda6fcc5ce295ea45234b5b 100644
--- a/tsa/howto/gitlab.mdwn
+++ b/tsa/howto/gitlab.mdwn
@@ -51,16 +51,41 @@ lists: <tor-dev@lists.torproject.org> would be best.
 <!-- how to deal with them. this should be easy to follow: think of -->
 <!-- your future self, in a stressful situation, tired and hungry. -->
+ * Grafana Dashboards:
+   * [GitLab overview](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus)
+   * [Gitaly](https://grafana.torproject.org/d/x6Z50y-iz/gitlab-gitaly)
+
 ## Disaster recovery
 
-<!-- what to do if all goes to hell. e.g. restore from backups? -->
-<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
-<!-- "Installation" below but some pointers. -->
+In case the entire GitLab machine is destroyed, a new server should
+be provisioned in the [[ganeti]] cluster (or elsewhere) and the
+backups should be restored using the procedure below.
+
+### Running an emergency backup
+
+TBD
+
+### Bare-metal recovery
+
+TBD
 
 # Reference
 
 ## Installation
 
-<!-- how to setup the service from scratch -->
+
+The current GitLab server was set up as a regular virtual machine in
+the [[ganeti]] cluster. It was configured through [[puppet]] with the
+`roles::gitlab` role.
+
+This installs the [GitLab Omnibus](https://docs.gitlab.com/omnibus/) distribution, which duplicates a
+lot of resources we would otherwise manage elsewhere in Puppet,
+including (but possibly not limited to):
+
+ * [[prometheus]]
+ * [[postgresql]]
+
+This leads to an unusual situation for monitoring and PostgreSQL
+backups in particular.
 
 ## SLA
 
 <!-- this describes an acceptable level of service for this service -->
@@ -278,6 +303,38 @@ around.
 <!-- a good guide to "audit" an existing project's design: -->
 <!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
 
+GitLab is a fairly large program with multiple components. The
+[upstream documentation](https://docs.gitlab.com/ee/development/architecture.html) has a good description of the architecture,
+but this section aims to provide a shorter summary. Here's an
+overview diagram, first:
+
+
+
+The web frontend is Nginx (which we incidentally also use in our
+[[cache]] system), but GitLab wrote their own reverse proxy called
+[GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse/), which in turn talks to the underlying GitLab
+Rails application, served by the [Unicorn](https://yhbt.net/unicorn/) application
+server. The Rails app stores its data in a [[postgresql]] database
+(although not our own deployment, for now: TODO). GitLab also offloads
+long-term background tasks to a tool called [sidekiq](https://github.com/mperham/sidekiq).
+
+Those all serve HTTP(S) requests, but GitLab is of course also
+accessible over SSH to push/pull git repositories. This is handled by
+a separate component called [gitlab-shell](https://gitlab.com/gitlab-org/gitlab-shell), which acts as a shell
+for the `git` user.
+
+Workhorse, Rails, sidekiq and gitlab-shell all talk to Redis to store
+caches, session data and other temporary information. They can also
+communicate with the [Gitaly](https://gitlab.com/gitlab-org/gitaly) server, which handles all
+communication with the git repositories themselves.
+
+Finally, GitLab also features GitLab Pages and Continuous Integration
+("pages" and CI, neither of which we currently use). CI is handled by
+[GitLab runners](https://gitlab.com/gitlab-org/gitlab-runner/), which can be deployed by anyone and registered in
+the Rails app to pull CI jobs. [GitLab Pages](https://gitlab.com/gitlab-org/gitlab-pages) is "a simple HTTP
+server written in Go, made to serve GitLab Pages with CNAMEs and SNI
+using HTTP/HTTP2".
+
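+As a rough map of those components onto the actual server, the
+services managed by the Omnibus package can be listed directly on the
+GitLab host. This is only a sketch: the exact list of services varies
+with the GitLab version and the features enabled in
+`/etc/gitlab/gitlab.rb`.
+
+    # list the services managed by the Omnibus package
+    # (nginx, unicorn, sidekiq, gitaly, postgresql, redis, exporters, ...)
+    sudo gitlab-ctl status
+
+    # follow the logs of a single component, for example Gitaly
+    sudo gitlab-ctl tail gitaly
+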
 ## Issues
 
 <!-- such projects are never over. add a pointer to well-known issues -->
@@ -289,10 +346,20 @@ There is no issue tracker specifically for this project, [File][] or
 [File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
 [search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
 
+TODO.
+
 ## Monitoring and testing
 
-<!-- describe how this service is monitored and how it can be tested -->
-<!-- after major changes like IP address changes or upgrades -->
+Monitoring is currently minimal: normal host-level metrics like disk
+space and CPU usage, along with the web port and TLS certificates,
+are monitored by Nagios through our normal infrastructure, as a
+black box.
+
+Prometheus monitoring is built into the GitLab Omnibus package, so it
+is *not* configured through Puppet like our other Prometheus
+servers. It has still been (manually) integrated into our Prometheus
+setup, and Grafana dashboards (see the [pager playbook](#Pager_playbook)) have been deployed.
+
+More work is underway to improve monitoring in [issue 33921](https://gitlab.torproject.org/tpo/tpa/services/-/issues/33921).
 
 ## Backups
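+Since the Omnibus package ships its own PostgreSQL (see the
+Installation section above), our usual database backup tooling does
+not apply directly; GitLab provides its own backup command instead.
+A minimal sketch, assuming a stock Omnibus layout (the backup
+destination and retention are set in `/etc/gitlab/gitlab.rb`; the
+copy destination below is only illustrative):
+
+    # dump repositories, database, uploads, etc. to a tarball in the
+    # configured backup directory
+    sudo gitlab-rake gitlab:backup:create
+
+    # the configuration and secrets are not included in that tarball
+    # and must be saved separately (destination path is illustrative)
+    sudo cp -a /etc/gitlab/gitlab.rb /etc/gitlab/gitlab-secrets.json /srv/backups/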