complete the static-shim documentation (team#40364)

Verified Commit 8349c724 authored 3 years ago by anarcat
--- a/service/static-shim.md
+++ b/service/static-shim.md
@@ -5,31 +5,22 @@ hosted in the static mirror infrastructure.

 # Tutorial

-<!-- simple, brainless step-by-step instructions requiring little or -->
-<!-- no technical background -->
-
-TODO: "how do users add/remove sites"
-
-# How-to
-
-<!-- more in-depth procedure that may require interpretation -->
-
-TODO: review ticket for possible howtos
-
 ## Deploying a static site from GitLab CI

-First, you will need to make sure the site builds in GitLab CI. A
+First, you will need to make sure the site builds in [GitLab CI][]. A
 `build` stage MUST be used that will produce artifacts that can be
 used by the `deploy` job provided in the [`static-shim-deploy.yml`
 template][]. How to build the website will vary according to the site,
-obviously. See the [hugo build instructions below](#building-a-hugo-site).
+obviously. See the [Hugo build instructions below](#building-a-hugo-site) for that
+specific generator.

 [`static-shim-deploy.yml` template]: https://gitlab.torproject.org/tpo/tpa/ci-templates/-/blob/main/static-shim-deploy.yml

-TODO: link to documentation on how to build Lektor sites.
+TODO: link to documentation on how to build Lektor sites in GitLab CI.

 It is a good idea to also add a `pages` stage to preview the
-build. The above template has an example `pages` stage.
+build. The above template has an example `pages` stage; see also the
+[publishing GitLab pages](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/gitlab/#publishing-gitlab-pages) section of our [GitLab documentation](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/gitlab/).

 Then include the deploy job template in the `.gitlab-ci.yml` with a
 snippet like this:
@@ -79,14 +70,16 @@ variable`, with the following parameters:

 Then the *public* part of that key needs to be added in Puppet. This
 can only be done by TPA, so file a ticket there if you need
-assistance. For TPA, see below for the remaining instructions.
+assistance. For TPA, [see below](#adding-a-new-static-site-shim-in-puppet) for the remaining instructions.

 You can commit the above changes to the `.gitlab-ci.yml` file, but
 when pushed, it is normal for the pipeline's `deploy` stage to fail
 until TPA has done its magic for the deploy to work. Make sure the
 build works in GitLab pages before requesting the deploy in the
 static mirror system.

-### TPA: adding a new static site shim in Puppet
+# How-to
+
+## Adding a new static site shim in Puppet

 The public key mentioned above should be added in the `tor-puppet.git` repository, in the
 `hiera/common.yaml` file, in the `staticsync::gitlab_shim::ssh::sites`
@@ -251,19 +244,40 @@ template][].

 ## Pager playbook

-<!-- information about common errors from the monitoring system and -->
-<!-- how to deal with them. this should be easy to follow: think of -->
-<!-- your future self, in a stressful situation, tired and hungry. -->
-
-TODO: pager?
+A typical failure will be that users complain that their
+`deploy_static` job fails. We have yet to see such a failure occur,
+but if it does, users should provide a link to the job log, which
+should provide more information.

 ## Disaster recovery

-<!-- what to do if all goes to hell. e.g. restore from backups? -->
-<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
-<!-- "Installation" below but some pointers. -->
+The service is "cattle" in that it can easily be rebuilt from scratch
+if the server is completely lost. Naturally it strongly depends on
+GitLab for operation. If GitLab fails, it should still be
+possible to deploy sites to the static mirror system by deploying them
+by hand to the static shim and calling `static-update-component`
+there. It would be preferable to build the site outside of the
+static-shim server to avoid adding any extra packages we do not need
+there.
+
+The status site is particularly vulnerable to disasters here, see the
+[status-site disaster recovery documentation](service/status#disaster-recovery) for pointers on where
+to go in case things really go south.
+
+Another possible disaster that could happen is a complete GitLab
+compromise or hostile GitLab admin. Such an attacker could deploy any
+site they wanted and therefore deface or sabotage critical websites,
+exposing thousands of users to hostile code. If such an event
+occurs:
+
+ 1. **remove all SSH keys from the Puppet configuration**,
+    specifically in the `staticsync::gitlab_shim::ssh::sites`
+    variable, defined in `hiera/common.yaml`.

-TODO: DR
+ 2. restore sites from a known good backup. The [backup service](howto/backup) should
+    have a copy of the static-shim content.
+
+ 3. redeploy the sites manually (`static-update-component $URL`)

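The manual deployment mentioned above (for a GitLab outage, and in step 3) could be sketched roughly as follows. This is an illustration under stated assumptions, not the documented procedure: the `manual_deploy` helper, the `public` build directory and the remote directory layout are hypothetical; only the host name, `rsync` over SSH and `static-update-component` come from this page. By default the commands are only printed.

```shell
#!/bin/sh
# Hypothetical sketch: push a locally built site to the static shim by hand.
# SHIM_HOST is taken from the design section; everything else is an assumption.
SHIM_HOST="static-gitlab-shim.torproject.org"

manual_deploy() {
    site_url="$1"    # e.g. research.torproject.org
    build_dir="$2"   # local directory holding the built site (assumed name)
    # Dry run by default: "echo" prints the commands instead of running them.
    run="${RUN:-echo}"
    # The remote path "$site_url/" is an assumed layout, not a documented one.
    $run rsync --checksum --recursive --delete "$build_dir/" "$SHIM_HOST:$site_url/"
    # Ask the static mirror system to pull the updated component.
    $run ssh "$SHIM_HOST" static-update-component "$site_url"
}

manual_deploy research.torproject.org public
```

Setting `RUN=command` in the environment would execute the two commands instead of printing them.
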
 # Reference

@@ -283,30 +297,35 @@ during downtimes, updates to websites are not possible.

 ## Design

-<!-- how this is built -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
+The static shim was built to allow [GitLab CI][] to deploy content to the
+[static mirror system][]. 
+
+[GitLab CI]: service/ci
+
+The way it works is that GitLab CI jobs (defined in the
+`.gitlab-ci.yml` file) build the site and then push it to a static
+source (currently `static-gitlab-shim.torproject.org`) with rsync over
+SSH. Then the CI job also calls the `static-update-component` script
+for the master to pull the content just like any other static
+component.

-<!-- a good guide to "audit" an existing project's design: -->
-<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
+![SSH deploy design of the static-shim](static-shim/architecture-static-shim-ssh.png)

-<!-- things to evaluate here:
+A [previous design](#webhook-deployment) involved a webhook written in Python, but now most
+of the business logic resides in the [`static-shim-deploy.yml` template][],
+which is basically a shell script embedded in a YAML
+file. The CI hooks are deployed by users, who will typically include
+the above template in their own `.gitlab-ci.yml` file.

- * services
- * storage (databases? plain text files? cloud/S3 storage?)
- * queues (e.g. email queues, job queues, schedulers)
- * interfaces (e.g. webserver, commandline)
- * authentication (e.g. SSH, LDAP?)
- * programming languages, frameworks, versions
- * dependent services (e.g. authenticates against LDAP, or requires
-   git pushes) 
- * deployments: how is code for this deployed (see also Installation)
+[static mirror system]: howto/static-component

-how is this thing built, basically? -->
+### Storage

-TODO: design still in flux, see "alternatives considered" below.
+Files are generated in GitLab CI as artifacts and stored there, which
+makes it possible for them to be deployed by hand as well. A copy is
+also kept on the static-shim server to make future deployments
+faster. We use `rsync --checksum` to avoid updating the timestamps of
+unchanged files, even if the source files were just regenerated from scratch.

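To illustrate why comparing checksums keeps timestamps stable, here is a small sketch that mimics the effect for a single file with POSIX tools. It is not how `rsync` is implemented, and `sync_if_changed` is a hypothetical helper introduced only for this example.

```shell
#!/bin/sh
# Illustration only: copy a file only when its content differs, so that an
# unchanged destination keeps its old modification time.
sync_if_changed() {
    src="$1"
    dst="$2"
    # "cksum < file" prints the CRC and size without the file name.
    if [ -f "$dst" ] && [ "$(cksum < "$src")" = "$(cksum < "$dst")" ]; then
        echo "unchanged: $dst"   # destination is left untouched
    else
        cp "$src" "$dst"
        echo "updated: $dst"
    fi
}

tmp=$(mktemp -d)
echo "hello" > "$tmp/build.html"
sync_if_changed "$tmp/build.html" "$tmp/deployed.html"   # first deploy copies
echo "hello" > "$tmp/build.html"                         # regenerated, same content
sync_if_changed "$tmp/build.html" "$tmp/deployed.html"   # skipped
rm -rf "$tmp"
```

A size-and-mtime comparison would treat the regenerated file as changed; the content comparison does not.
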
 ### Authentication

@@ -351,12 +370,6 @@ There is no issue tracker specifically for this project, [File][] or

 This service was designed in [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364).

- * the webhook implementation fails if sites take more than 10 seconds
-   to deploy.
- * the webhook implementation doesn't provide much visibility on
-   failures or progress, to see the list of recent webhook calls, head
-   to Settings -> Webhooks -> Edit -> Recent deliveries
-
 ## Maintainer, users, and upstream

 The shim was written by anarcat and is maintained by TPA. It is used
@@ -364,36 +377,43 @@ by all "critical" websites managed in GitLab.

 ## Monitoring and testing

-<!-- describe how this service is monitored and how it can be tested -->
-<!-- after major changes like IP address changes or upgrades. describe -->
-<!-- CI, test suites, linting, how security issues and upgrades are -->
-<!-- tracked -->
+There is no specific monitoring for this service, other than the
+usual server-level monitoring. If the service should fail, users will
+notice because their pipelines start failing.

-TODO: write unit tests
-TODO: how is this monitored?
+Good sites to test that the deployment works are
+<https://research.torproject.org/> ([pipeline link](https://gitlab.torproject.org/tpo/web/research/-/pipelines), not critical)
+or <https://status.torproject.org/> ([pipeline link](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines),
+semi-critical).

 ## Logs and metrics

-<!-- where are the logs? how long are they kept? any PII? -->
-<!-- what about performance metrics? same questions -->
+Jobs in GitLab CI have their own logs and retention policies. The
+static shim should not add anything special to this, in theory. In
+practice, a private key could leak if a user displays the content of
+their own private SSH key in the job log. If they use the provided
+template, this should not occur.

-The webhook logs are available through `journalctl -u webhook` and in
-`/var/log/daemon.log`. They should not contain PII that is not already
-present in GitLab itself. Specifically, they might contain webhook
-payloads, artifacts URL and webpages contents.
-
-TODO: metrics?
+We do not maintain any metrics on this service, other than the
+usual server-level metrics.

 ## Backups

 No specific backup procedure is necessary for this server, outside of
 the automated basics. In fact, data on this host is mostly ephemeral
-and could be reconstructed from pipelines in case of a total disaster.
+and could be reconstructed from pipelines in case of a total server
+loss.

-## Other documentation
+As mentioned in the [disaster recovery section](#disaster-recovery), if the GitLab
+server gets compromised, the backup should still contain previous
+good copies of the websites, in any case.

-<!-- references to upstream documentation, if relevant -->
+## Other documentation

+ * GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/) blog post
+ * [Design and launch ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364)
+ * our [static mirror system][] documentation
+ * our [GitLab CI documentation][GitLab CI]
 * [Webhook homepage](https://github.com/adnanh/webhook)
   * [hook definition documentation](https://github.com/adnanh/webhook/blob/master/docs/Hook-Definition.md)
   * [hook examples](https://github.com/adnanh/webhook/blob/master/docs/Hook-Examples.md)
@@ -401,51 +421,51 @@ and could be reconstructed from pipelines in case of a total disaster.
   * [how to refer to payload in hook configuration](https://github.com/adnanh/webhook/blob/master/docs/Referencing-Request-Values.md)
   * [usage](https://github.com/adnanh/webhook/blob/master/docs/Webhook-Parameters.md)
 * [GitLab webhook documentation](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html)
- * [Design and launch ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364)

 # Discussion

 ## Overview

-<!-- describe the overall project. should include a link to a ticket -->
-<!-- that has a launch checklist -->
-
-<!-- if this is an old project being documented, summarize the known -->
-<!-- issues with the project. to quote the "audit procedure":
-
- 5. When was the last security review done on the project? What was
-    the outcome? Are there any security issues currently? Should it
-    have another security review?
-
- 6. When was the last risk assessment done? Something that would cover
-    risks from the data stored, the access required, etc.
-
- 7. Are there any in-progress projects? Technical debt cleanup?
-    Migrations? What state are they in? What's the urgency? What's the
-    next steps?
-
- 8. What urgent things need to be done on this project?
-
-->
+The static shim was built to unblock the [Jenkins retirement
+project](https://gitlab.torproject.org/groups/tpo/-/milestones/27). A key blocker was that the [static mirror system][] was
+strongly coupled with Jenkins: many high traffic and critical websites
+are built and deployed by Jenkins. Unless we wanted to completely
+retire the static mirror system (in favor, say, of GitLab Pages), we
+had to create a way for GitLab CI to deploy content to the static
+mirror system.

 ## Goals

-<!-- include bugs to be fixed -->
-
 ### Must have

+ * deploy sites from GitLab CI to the static mirror system
+ * site A cannot deploy to site B without being explicitly granted
+   permissions
+ * server-side (i.e. in Puppet) access control (e.g. user X can only
+   deploy site B)
+
 ### Nice to have

+ * automate migration from Jenkins to avoid manually doing many sites
+ * reusable GitLab CI templates
+
 ### Non-Goals

+ * static mirror system replacement
+
 ## Approvals required

-<!-- for example, legal, "vegas", accounting, current maintainer -->
+TPA

 ## Proposed Solution

+We have decided to deploy sites over SSH from GitLab CI, see below for
+a discussion.
+
 ## Cost

+One VM, 20-30 hours of work, see [tpo/tpa/team#40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) for time tracking.
+
 ## Alternatives considered

 This shim was designed to replace Jenkins with GitLab CI. As such,
@@ -478,6 +498,18 @@ webhooks, but originally decided against it for the following reasons:
   exception and is more error prone (e.g. if we somehow forget the
   `command=` override, we open full shell access)

+After trying the webhook deployment mechanism (below), we decided to
+go back to the SSH deployment mechanism instead, because:
+
+ * the webhook implementation fails if sites take more than 10 seconds
+   to deploy.
+ * the webhook implementation doesn't provide much visibility on
+   failures or progress; to see the list of recent webhook calls, head
+   to Settings -> Webhooks -> Edit -> Recent deliveries
+
+See below for details on that, and above for the full design of the
+current deployment.
+
 ### webhook deployment

 A design based on GitLab webhooks was established, with a workflow
@@ -489,8 +521,7 @@ that goes something like this:
    artifacts back to GitLab
 4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events)
 5. webhook receives the ping and authenticates against a
-    configuration, mapping to a given `static-component` (TODO: allow
-    list for gitlab?)
+    configuration, mapping to a given `static-component`
 6. after authentication, the webhook fires a script
    (`static-gitlab-shim-pull`)
 7. `static-gitlab-shim-pull` parses the payload from the webhook and

--- a/service/static-shim/architecture-static-shim-ssh.dot
+++ b/service/static-shim/architecture-static-shim-ssh.dot
+digraph static {
+        label="GitLab / static mirror integration architecture, SSH deploy design\nanarcat@torproject.org, september 2021"
+        subgraph "clustergitlab" {
+                label="GitLab"
+                labelloc=top
+
+                CI [ label="CI runners" ]
+                GitLab [ label="GitLab rails\n app" shape=box ]
+                GitLab -> CI [ label="dispatches jobs" ]
+        }
+        subgraph "clustersource" {
+                label="static source"
+                labelloc=bottom
+                rsync [ label="files" shape=cylinder ]
+                update [ label="static-update-component" ]
+        }
+        subgraph "clusterlegend" {
+                service [ shape=box ]
+                files [ shape=cylinder ]
+                process [ shape=oval ]
+                label="legend"
+                labelloc=bottom
+        }
+        CI -> rsync [ label="rsync" ]
+        CI -> update [ label="runs" ]
+        master [ label="static master\nand mirrors..." ]
+        update -> master [ label="notifies" ]
+        rsync -> master [ label="pulls" ]
+}
--- a/service/static-shim/architecture-static-shim-ssh.png
+++ b/service/static-shim/architecture-static-shim-ssh.png