document major issues with the webhook implementation

5452f437 · anarcat · 799ac90a · 5452f437
Verified Commit 5452f437 authored 3 years ago by anarcat
--- a/service/static-shim.md
+++ b/service/static-shim.md
@@ -18,14 +18,26 @@ TODO: review ticket for possible howtos

 ## Deploy artifacts manually

-If there are concerns that the webhook is not behaving as it should,
-it's possible to call the "puller" directly by hand. For example,
-given the [Pipeline 13285](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines/13285) has [job 38077](https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077), we can tell the
-puller to deploy in debugging mode with this command:
+If a site is not deploying normally, it's possible to deploy a site by
+hand by downloading and extracting the artifacts using the
+`static-gitlab-shim-pull` script.
+
+For example, given the [Pipeline 13285](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines/13285) has [job 38077](https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077), we can
+tell the puller to deploy in debugging mode with this command:

    sudo -u static-gitlab-shim /usr/local/bin/static-gitlab-shim-pull --artifacts-url https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077/artifacts/download --site-url status.torproject.org --debug

-The `--artifacts-url` is the `Download` link in the job page.
+The `--artifacts-url` is the `Download` link in the job page. This
+will:
+ 
+ 1. download the artifacts (which is a ZIP file)
+ 2. extract them in a temporary directory
+ 3. `rsync --checksum` them to the actual source directory (to avoid
+    spurious timestamp changes)
+ 4. call `static-update-component` to deploy the site
+
+Note that this script is part of the webhook implementation and might
+eventually be retired if that implementation is completely removed.

 ## Pager playbook

@@ -84,70 +96,24 @@ during downtimes, updates to websites are not possible.

 how is this thing built, basically? -->

-TODO: expand and review design.
-
-NOTE: this is worded as if this was already implement, but this
-implementation might be incomplete or even inexistent. See [ticket
-40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) for more information on progress.
-
-GitLab can publish pages in the static component mirror network
-through the use of a "static shim" deployed on a static source. The
-workflow goes a little like this:
-
- 1. user pushes a change to GitLab, which ...
- 2. triggers a CI pipeline
- 3. CI runner picks up the jobs and builds the website, pushes the
-    artifacts back to GitLab
- 4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events)
- 5. webhook receives the ping and authenticates against a
-    configuration, mapping to a given `static-component` (TODO: allow
-    list for gitlab?)
- 6. after authentication, the webhook fires a script
-    (`static-gitlab-shim-pull`)
- 7. `static-gitlab-shim-pull` parses the payload from the webhook and
-    finds the URL for the artifacts
- 8. it extracts the artifacts in a temporary directory
- 9. it runs `rsync -c` into the local static source, to avoid
-    resetting timestamps
- 10. it fires the static-update-component command to propagate changes
-     to the rest of the static-component system
-
-A subset of those steps can be seen in the following design:
-
-![Design of the static-shim](static-shim/architecture-static-shim.png)
-
-The shim components run on a separate static-source, called
-`static-gitlab-shim-source`. This is done to avoid adding complexity
-to the already complicated, general purpose static source
-(`staticiforme`). This has the added benefit that the source can be
-hardened in the sense that access is restricted to TPA (which is not
-the case of `staticiforme`).
-
-The mapping between webhooks and static components is established in
-Puppet, which generates the secrets and writes it to the webhook
-configuration, along with the `site_url` which corresponds to the site
-URL in the `static-components.yaml` file. This is done to ensure that
-a given GitLab project only has access to a single site and cannot
-overwrite other sites.
-
-This involves that each site configured in this way must have a
-secret token (in Trocla) and configuration (in Hiera) created by TPA
-in Puppet. The secret token must also be configured in the GitLab
-project. This could be automated by the judicious use of the GitLab
-API using admin credentials, but considering that new sites are not
-created very frequently, it could also be done by hand.
+TODO: design still in flux, see "alternatives considered" below.

 ## Issues

-<!-- such projects are never over. add a pointer to well-known issues -->
-<!-- and show how to report problems. usually a link to the bugtracker -->
-
 There is no issue tracker specifically for this project, [File][] or
 [search][] for issues in the [team issue tracker][search].

 [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new
 [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues

+This service was designed in [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364).
+
+ * the webhook implementation fails if sites take more than 10 seconds
+   to deploy.
+ * the webhook implementation doesn't provide much visibility on
+   failures or progress, to see the list of recent webhook calls, head
+   to Settings -> Webhooks -> Edit -> Recent deliveries
+
 ## Maintainer, users, and upstream

 The shim was written by anarcat and is maintained by TPA. It is used
@@ -239,8 +205,14 @@ and could be reconstructed from pipelines in case of a total disaster.

 ## Alternatives considered

-We briefly considered using GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/)
-instead of webhooks, but decided against it for the following reasons:
+This shim was designed to replace Jenkins with GitLab CI. As suche,
+other options were considered, see also the [Jenkins documentation](service/jenkins#gitlab-ci)
+and [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) on those.
+
+### CI deployment
+
+We considered using GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/) instead of
+webhooks, but originally decided against it for the following reasons:

 * the complexity is similar: both need a shared token between GitLab
   and the static source
@@ -263,6 +235,70 @@ instead of webhooks, but decided against it for the following reasons:
   exception and is more error prone (e.g. if we somehow forget the
   `command=` override, we open full shell access)

-NOTE: See also the [Jenkins documentation](service/jenkins#gitlab-ci) and [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364)
-for more information on the discussion on the different options that
-were considered in the Jenkins retirement.
+### webhook deployment
+
+A designed based on GitLab webhooks was established, with a workflow
+that goes something like this:
+
+ 1. user pushes a change to GitLab, which ...
+ 2. triggers a CI pipeline
+ 3. CI runner picks up the jobs and builds the website, pushes the
+    artifacts back to GitLab
+ 4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events)
+ 5. webhook receives the ping and authenticates against a
+    configuration, mapping to a given `static-component` (TODO: allow
+    list for gitlab?)
+ 6. after authentication, the webhook fires a script
+    (`static-gitlab-shim-pull`)
+ 7. `static-gitlab-shim-pull` parses the payload from the webhook and
+    finds the URL for the artifacts
+ 8. it extracts the artifacts in a temporary directory
+ 9. it runs `rsync -c` into the local static source, to avoid
+    resetting timestamps
+ 10. it fires the static-update-component command to propagate changes
+     to the rest of the static-component system
+
+A subset of those steps can be seen in the following design:
+
+![Design of the static-shim](static-shim/architecture-static-shim.png)
+
+The shim components runs on a separate static-source, called
+`static-gitlab-shim-source`. This is done to avoid adding complexity
+to the already complicated, general purpose static source
+(`staticiforme`). This has the added benefit that the source can be
+hardened in the sense that access is restricted to TPA (which is not
+the case of `staticiforme`).
+
+The mapping between webhooks and static components is established in
+Puppet, which generates the secrets and writes it to the webhook
+configuration, along with the `site_url` which corresponds to the site
+URL in the `static-components.yaml` file. This is done to ensure that
+a given GitLab project only has access to a single site and cannot
+overwrite other sites.
+
+This involves that each site configured in this way must have a
+secret token (in Trocla) and configuration (in Hiera) created by TPA
+in Puppet. The secret token must also be configured in the GitLab
+project. This could be automated by the judicious use of the GitLab
+API using admin credentials, but considering that new sites are not
+created very frequently, it could also be done by hand.
+
+Unfortunately this design has two major flaws:
+
+ 1. webhooks are designed to be fast and short-lived: most site
+    deployments take longer than the preconfigured webhook timeout (10
+    seconds) and therefore cannot be deployed synchronously, which
+    implies that...
+
+ 2. webhook cannot propagate deployment errors back to the user
+    meaningfully: even if they run synchronously, errors in webhooks
+    do not show up in the CI pipeline, assuming the webhook manages to
+    complete at all. if the webhook fails to complete in time, no
+    output is available to the user at all. running asynchronously is
+    even worse as deployment errors do not show up in GitLab at all
+    and would require special monitoring by TPA, instead of delegating
+    that management to users.
+
+In the short term, the webhook system might be used asynchronously,
+but in the long term, we're considering switching back to the
+deployment system documented above.