From 5452f437b48a8e7522252c5c48f2f312b0d948d4 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org> Date: Tue, 28 Sep 2021 10:34:07 -0400 Subject: [PATCH] document major issues with the webhook implementation --- service/static-shim.md | 166 +++++++++++++++++++++++++---------------- 1 file changed, 101 insertions(+), 65 deletions(-) diff --git a/service/static-shim.md b/service/static-shim.md index 428584b2..12309ec3 100644 --- a/service/static-shim.md +++ b/service/static-shim.md @@ -18,14 +18,26 @@ TODO: review ticket for possible howtos ## Deploy artifacts manually -If there are concerns that the webhook is not behaving as it should, -it's possible to call the "puller" directly by hand. For example, -given the [Pipeline 13285](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines/13285) has [job 38077](https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077), we can tell the -puller to deploy in debugging mode with this command: +If a site is not deploying normally, it's possible to deploy a site by +hand by downloading and extracting the artifacts using the +`static-gitlab-shim-pull` script. + +For example, given the [Pipeline 13285](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines/13285) has [job 38077](https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077), we can +tell the puller to deploy in debugging mode with this command: sudo -u static-gitlab-shim /usr/local/bin/static-gitlab-shim-pull --artifacts-url https://gitlab.torproject.org/tpo/tpa/status-site/-/jobs/38077/artifacts/download --site-url status.torproject.org --debug -The `--artifacts-url` is the `Download` link in the job page. +The `--artifacts-url` is the `Download` link in the job page. This +will: + + 1. download the artifacts (which is a ZIP file) + 2. extract them in a temporary directory + 3. `rsync --checksum` them to the actual source directory (to avoid + spurious timestamp changes) + 4. call `static-update-component` to deploy the site + +Note that this script is part of the webhook implementation and might +eventually be retired if that implementation is completely removed. ## Pager playbook @@ -84,70 +96,24 @@ during downtimes, updates to websites are not possible. how is this thing built, basically? --> -TODO: expand and review design. - -NOTE: this is worded as if this was already implement, but this -implementation might be incomplete or even inexistent. See [ticket -40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) for more information on progress. - -GitLab can publish pages in the static component mirror network -through the use of a "static shim" deployed on a static source. The -workflow goes a little like this: - - 1. user pushes a change to GitLab, which ... - 2. triggers a CI pipeline - 3. CI runner picks up the jobs and builds the website, pushes the - artifacts back to GitLab - 4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events) - 5. webhook receives the ping and authenticates against a - configuration, mapping to a given `static-component` (TODO: allow - list for gitlab?) - 6. after authentication, the webhook fires a script - (`static-gitlab-shim-pull`) - 7. `static-gitlab-shim-pull` parses the payload from the webhook and - finds the URL for the artifacts - 8. it extracts the artifacts in a temporary directory - 9. it runs `rsync -c` into the local static source, to avoid - resetting timestamps - 10. it fires the static-update-component command to propagate changes - to the rest of the static-component system - -A subset of those steps can be seen in the following design: - - - -The shim components run on a separate static-source, called -`static-gitlab-shim-source`. This is done to avoid adding complexity -to the already complicated, general purpose static source -(`staticiforme`). This has the added benefit that the source can be -hardened in the sense that access is restricted to TPA (which is not -the case of `staticiforme`). - -The mapping between webhooks and static components is established in -Puppet, which generates the secrets and writes it to the webhook -configuration, along with the `site_url` which corresponds to the site -URL in the `static-components.yaml` file. This is done to ensure that -a given GitLab project only has access to a single site and cannot -overwrite other sites. - -This involves that each site configured in this way must have a -secret token (in Trocla) and configuration (in Hiera) created by TPA -in Puppet. The secret token must also be configured in the GitLab -project. This could be automated by the judicious use of the GitLab -API using admin credentials, but considering that new sites are not -created very frequently, it could also be done by hand. +TODO: design still in flux, see "alternatives considered" below. ## Issues -<!-- such projects are never over. add a pointer to well-known issues --> -<!-- and show how to report problems. usually a link to the bugtracker --> - There is no issue tracker specifically for this project, [File][] or [search][] for issues in the [team issue tracker][search]. [File]: https://gitlab.torproject.org/tpo/tpa/team/-/issues/new [search]: https://gitlab.torproject.org/tpo/tpa/team/-/issues +This service was designed in [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364). + + * the webhook implementation fails if sites take more than 10 seconds + to deploy. + * the webhook implementation doesn't provide much visibility on + failures or progress, to see the list of recent webhook calls, head + to Settings -> Webhooks -> Edit -> Recent deliveries + ## Maintainer, users, and upstream The shim was written by anarcat and is maintained by TPA. It is used @@ -239,8 +205,14 @@ and could be reconstructed from pipelines in case of a total disaster. ## Alternatives considered -We briefly considered using GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/) -instead of webhooks, but decided against it for the following reasons: +This shim was designed to replace Jenkins with GitLab CI. As suche, +other options were considered, see also the [Jenkins documentation](service/jenkins#gitlab-ci) +and [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) on those. + +### CI deployment + +We considered using GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/) instead of +webhooks, but originally decided against it for the following reasons: * the complexity is similar: both need a shared token between GitLab and the static source @@ -263,6 +235,70 @@ instead of webhooks, but decided against it for the following reasons: exception and is more error prone (e.g. if we somehow forget the `command=` override, we open full shell access) -NOTE: See also the [Jenkins documentation](service/jenkins#gitlab-ci) and [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) -for more information on the discussion on the different options that -were considered in the Jenkins retirement. +### webhook deployment + +A designed based on GitLab webhooks was established, with a workflow +that goes something like this: + + 1. user pushes a change to GitLab, which ... + 2. triggers a CI pipeline + 3. CI runner picks up the jobs and builds the website, pushes the + artifacts back to GitLab + 4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events) + 5. webhook receives the ping and authenticates against a + configuration, mapping to a given `static-component` (TODO: allow + list for gitlab?) + 6. after authentication, the webhook fires a script + (`static-gitlab-shim-pull`) + 7. `static-gitlab-shim-pull` parses the payload from the webhook and + finds the URL for the artifacts + 8. it extracts the artifacts in a temporary directory + 9. it runs `rsync -c` into the local static source, to avoid + resetting timestamps + 10. it fires the static-update-component command to propagate changes + to the rest of the static-component system + +A subset of those steps can be seen in the following design: + + + +The shim components runs on a separate static-source, called +`static-gitlab-shim-source`. This is done to avoid adding complexity +to the already complicated, general purpose static source +(`staticiforme`). This has the added benefit that the source can be +hardened in the sense that access is restricted to TPA (which is not +the case of `staticiforme`). + +The mapping between webhooks and static components is established in +Puppet, which generates the secrets and writes it to the webhook +configuration, along with the `site_url` which corresponds to the site +URL in the `static-components.yaml` file. This is done to ensure that +a given GitLab project only has access to a single site and cannot +overwrite other sites. + +This involves that each site configured in this way must have a +secret token (in Trocla) and configuration (in Hiera) created by TPA +in Puppet. The secret token must also be configured in the GitLab +project. This could be automated by the judicious use of the GitLab +API using admin credentials, but considering that new sites are not +created very frequently, it could also be done by hand. + +Unfortunately this design has two major flaws: + + 1. webhooks are designed to be fast and short-lived: most site + deployments take longer than the preconfigured webhook timeout (10 + seconds) and therefore cannot be deployed synchronously, which + implies that... + + 2. webhook cannot propagate deployment errors back to the user + meaningfully: even if they run synchronously, errors in webhooks + do not show up in the CI pipeline, assuming the webhook manages to + complete at all. if the webhook fails to complete in time, no + output is available to the user at all. running asynchronously is + even worse as deployment errors do not show up in GitLab at all + and would require special monitoring by TPA, instead of delegating + that management to users. + +In the short term, the webhook system might be used asynchronously, +but in the long term, we're considering switching back to the +deployment system documented above. -- GitLab