## Pager playbook
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
* Grafana Dashboards:
  * [GitLab overview](https://grafana.torproject.org/d/QrDJktiMz/gitlab-omnibus)
  * [Gitaly](https://grafana.torproject.org/d/x6Z50y-iz/gitlab-gitaly)
## Disaster recovery
<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below) but some pointers. -->
In case the entire GitLab machine is destroyed, a new server should be
provisioned in the [[ganeti]] cluster (or elsewhere) and backups
should be restored using the procedures below.
### Running an emergency backup
TBD
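
In the meantime, a minimal sketch using the stock Omnibus backup task
(this assumes the default backup path of `/var/opt/gitlab/backups` and
enough free disk space there):

```sh
# create a full application backup (repositories, database, uploads,
# etc.) under /var/opt/gitlab/backups by default
sudo gitlab-backup create

# the main configuration and secrets are *not* included above and
# must be saved separately
sudo tar -C /etc -czf /root/gitlab-config-$(date -I).tar.gz gitlab/
```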
### Bare-metal recovery
TBD
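
Again, only a rough sketch of the standard Omnibus restore steps,
assuming a fresh machine running the same GitLab version (for example
installed through the `roles::gitlab` Puppet role) with the backup
archive and configuration tarball copied over:

```sh
# restore configuration and secrets first, then regenerate the
# bundled services' configuration
sudo tar -C /etc -xzf gitlab-config-YYYY-MM-DD.tar.gz
sudo gitlab-ctl reconfigure

# stop the processes that talk to the database during the restore
sudo gitlab-ctl stop unicorn
sudo gitlab-ctl stop sidekiq

# restore the backup archive sitting in /var/opt/gitlab/backups,
# identified by its timestamp prefix (placeholder below)
sudo gitlab-backup restore BACKUP=<timestamp_of_backup>

# bring everything back up and run the built-in sanity checks
sudo gitlab-ctl restart
sudo gitlab-rake gitlab:check SANITIZE=true
```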
# Reference
## Installation
<!-- how to setup the service from scratch -->
The current GitLab server was set up in the [[ganeti]] cluster as a
regular virtual machine. It was configured with [[puppet]], using the
`roles::gitlab` class.
This installs the [GitLab Omnibus](https://docs.gitlab.com/omnibus/) distribution, which duplicates a
lot of resources we would otherwise manage elsewhere in Puppet,
including (but possibly not limited to):
* [[prometheus]]
* [[postgresql]]
This leads to a somewhat unusual situation for monitoring and
PostgreSQL backups in particular, which are handled differently here
than on the rest of our infrastructure.
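
To see exactly which bundled services Omnibus manages on the host (and
therefore which ones overlap with what Puppet would normally deploy),
the following can help; this is just an illustration and the output
will vary:

```sh
# list the runit-managed services shipped with the Omnibus bundle
# (postgresql, redis, gitaly, the Prometheus exporters, etc.)
sudo gitlab-ctl status

# dump the effective configuration generated from /etc/gitlab/gitlab.rb
sudo gitlab-ctl show-config
```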
## SLA
<!-- this describes an acceptable level of service for this service -->
## Design
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
GitLab is a fairly large program with multiple components. The
[upstream documentation](https://docs.gitlab.com/ee/development/architecture.html) describes the architecture in good detail,
but this section aims to provide a shorter summary. Here's an overview
diagram, first:
![GitLab's architecture diagram](https://docs.gitlab.com/ee/development/img/architecture_simplified.png)
The web frontend is Nginx (which we incidentally also use in our
[[cache]] system), but GitLab wrote their own reverse proxy called
[GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse/), which in turn talks to the underlying GitLab
Rails application, served by the [Unicorn](https://yhbt.net/unicorn/) application
server. The Rails app stores its data in a [[postgresql]] database
(although not our own deployment, for now: TODO). GitLab also offloads
long-running background tasks to a tool called [sidekiq](https://github.com/mperham/sidekiq).
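
A quick way to exercise that whole HTTP path (Nginx, Workhorse, then
Rails) from the outside is a simple request against the login page;
just an illustrative check:

```sh
# a "200 OK" here means Nginx, Workhorse and the Rails application
# all answered; anything else warrants a closer look
curl -sSI https://gitlab.torproject.org/users/sign_in | head -n1
```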
These all serve HTTP(S) requests, but GitLab is of course also
accessible over SSH to push and pull git repositories. This is handled
by a separate component called [gitlab-shell](https://gitlab.com/gitlab-org/gitlab-shell), which acts as a shell
for the `git` user.
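
The SSH side can be tested in a similar way: gitlab-shell answers a
plain SSH connection with a greeting instead of a shell (the exact
message varies between GitLab versions):

```sh
# should print something like "Welcome to GitLab, @username!" and
# close the connection, since no interactive shell is provided
ssh -T git@gitlab.torproject.org
```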
Workhorse, Rails, sidekiq and gitlab-shell all talk to Redis to store
temporary data, caches and session information. They can also
communicate with the [Gitaly](https://gitlab.com/gitlab-org/gitaly) server, which handles all
communication with the git repositories themselves.
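
GitLab ships rake tasks that check those internal connections, which
can be handy after an upgrade; for example (a sketch, to be run on the
GitLab server itself):

```sh
# verify that the Rails application can reach the Gitaly server(s)
sudo gitlab-rake gitlab:gitaly:check

# check that the bundled Redis instance is up
sudo gitlab-ctl status redis
```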
Finally, GitLab also features GitLab Pages and Continuous Integration
("pages" and CI, neither of which we currently use). CI is handled by
[GitLab runners](https://gitlab.com/gitlab-org/gitlab-runner/), which can be deployed by anyone and registered in
the Rails app to pull CI jobs. [GitLab pages](https://gitlab.com/gitlab-org/gitlab-pages) is "a simple HTTP server
written in Go, made to serve GitLab Pages with CNAMEs and SNI using
HTTP/HTTP2".
## Issues
<!-- such projects are never over. add a pointer to well-known issues -->
There is no issue tracker specifically for this project, [File][] or
[search][] for issues in the Tor Sysadmin Team Trac component.
[File]: https://trac.torproject.org/projects/tor/newticket?component=Internal+Services%2FTor+Sysadmin+Team
[search]: https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team
TODO.
## Monitoring and testing
<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->
Monitoring is currently minimal: normal host-level metrics (disk
space, CPU usage) and black-box checks on the web port and TLS
certificate are handled by Nagios through our normal infrastructure.
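
Those checks can be reproduced by hand after a change, for example to
confirm the TLS certificate is not about to expire; a sketch:

```sh
# show the certificate expiry date as seen from the outside
echo | openssl s_client -connect gitlab.torproject.org:443 \
    -servername gitlab.torproject.org 2>/dev/null \
  | openssl x509 -noout -enddate
```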
Prometheus monitoring is built into the GitLab Omnibus package, so it
is *not* configured through Puppet like our other Prometheus
exporters. It has nevertheless been (manually) integrated into our
Prometheus setup, and Grafana dashboards have been deployed (see the
[pager playbook](#Pager_playbook)).
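
Since those exporters come from the Omnibus bundle rather than from
Puppet, they can also be inspected directly on the host; a sketch,
assuming the default Omnibus listen addresses and ports:

```sh
# node-level metrics from the bundled node_exporter
curl -s http://localhost:9100/metrics | head

# application metrics from the bundled gitlab-exporter
curl -s http://localhost:9168/metrics | head
```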
More work is underway to improve monitoring in [issue 33921](https://gitlab.torproject.org/tpo/tpa/services/-/issues/33921).
## Backups