Verified Commit 8349c724 authored by anarcat's avatar anarcat

complete the static-shim documentation (team#40364)

parent 4439d5ff
......@@ -5,31 +5,22 @@ hosted in the static mirror infrastructure.
# Tutorial
<!-- simple, brainless step-by-step instructions requiring little or -->
<!-- no technical background -->
TODO: "how do users add/remove sites"
# How-to
<!-- more in-depth procedure that may require interpretation -->
TODO: review ticket for possible howtos
## Deploying a static site from GitLab CI
First, you will need to make sure the site builds in GitLab CI. A
First, you will need to make sure the site builds in [GitLab CI][]. A
`build` stage MUST be used that will produce artifacts that can be
used by the `deploy` job provided in the [`static-shim-deploy.yml`
template][]. How to build the website will vary according to the site,
obviously. See the [hugo build instructions below](#building-a-hugo-site).
obviously. See the [Hugo build instructions below](#building-a-hugo-site) for that
specific generator.
[`static-shim-deploy.yml` template]: https://gitlab.torproject.org/tpo/tpa/ci-templates/-/blob/main/static-shim-deploy.yml
TODO: link to documentation on how to build Lektor sites.
TODO: link to documentation on how to build Lektor sites in GitLab CI.
It is a good idea to also add a `pages` stage to preview the
build. The above template has an example `pages` stage.
build. The above template has an example `pages` stage; see also the
[publishing GitLab pages](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/gitlab/#publishing-gitlab-pages) section of our [GitLab documentation](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/howto/gitlab/).
Then include the deploy job template in the `.gitlab-ci.yml` with a
snippet like this:
......@@ -79,14 +70,16 @@ variable`, with the following parameters:
Then the *public* part of that key needs to be added in Puppet. This
can only be done by TPA, so file a ticket there if you need
assistance. For TPA, see below for the remaining instructions.
assistance. For TPA, [see below](#adding-a-new-static-site-shim-in-puppet) for the remaining instructions.
You can commit the above changes to the `.gitlab-ci.yml` file, but
when pushed, the pipeline's `deploy` stage will fail. This is normal: TPA needs to do
its magic for the deploy to work. Make sure the build works in GitLab
pages before requesting the deploy in the static mirror system.
### TPA: adding a new static site shim in Puppet
# How-to
## Adding a new static site shim in Puppet
The public key mentioned above should be added in the `tor-puppet.git` repository, in the
`hiera/common.yaml` file, in the `staticsync::gitlab_shim::ssh::sites`
......@@ -251,19 +244,40 @@ template][].
## Pager playbook
<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->
TODO: pager?
A typical failure will be that users complain that their
`deploy_static` job fails. We have yet to see such a failure occur,
but if it does, users should provide a link to the job log, which
should provide more information.
## Disaster recovery
<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below but some pointers. -->
The service is "cattle" in that it can easily be rebuilt from scratch
if the server is completely lost. Naturally it strongly depends on
GitLab for operation. If GitLab fails, it should still be
possible to publish sites to the static mirror system by copying them
by hand to the static shim and calling `static-update-component`
there. It would be preferable to build the site outside of the
static-shim server to avoid adding any extra packages we do not need
there.
The status site is particularly vulnerable to disasters here, see the
[status-site disaster recovery documentation](service/status#disaster-recovery) for pointers on where
to go in case things really go south.
Another possible disaster that could happen is a complete GitLab
compromise or hostile GitLab admin. Such an attacker could deploy any
site they wanted and therefore deface or sabotage critical websites,
exposing thousands of users to hostile code. If such an event
occurs:
1. **remove all SSH keys from the Puppet configuration**,
specifically in the `staticsync::gitlab_shim::ssh::sites`
variable, defined in `hiera/common.yaml`.
TODO: DR
2. restore sites from a known backup. The [backup service](howto/backup) should
   have a copy of the static-shim content
3. redeploy the sites manually (`static-update-component $URL`)
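The manual redeploy in the last step could be scripted along these lines. This is only a sketch: the `redeploy_all` function and the `DRY_RUN` switch are hypothetical helpers introduced for illustration, the two site names are borrowed from the testing section of this page, and it assumes `static-update-component` is run on the static shim with the site URL as its only argument, as described above.

```shell
#!/bin/sh
# Hypothetical helper: redeploy a list of sites after a restore.
# Assumes static-update-component takes the site URL as its only argument.
set -eu

# Default to a dry run so nothing is published by accident.
DRY_RUN="${DRY_RUN:-1}"

redeploy_all() {
    for url in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "would run: static-update-component $url"
        else
            static-update-component "$url"
        fi
    done
}

# Example site list; replace with the sites that were actually restored.
redeploy_all research.torproject.org status.torproject.org
```

Running it with `DRY_RUN=0` on the static shim would execute the commands instead of printing them.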
# Reference
......@@ -283,30 +297,35 @@ during downtimes, updates to websites are not possible.
## Design
<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" documented, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->
The static shim was built to allow [GitLab CI][] to deploy content to the
[static mirror system][].
[GitLab CI]: service/ci
The way it works is that GitLab CI jobs (defined in the
`.gitlab-ci.yml` file) build the site and then push it to a static
source (currently `static-gitlab-shim.torproject.org`) with rsync over
SSH. Then the CI job also calls the `static-update-component` script
for the master to pull the content just like any other static
component.
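The two steps described above can be sketched as a small shell script. This is an illustration under stated assumptions, not the content of the actual template: the host name comes from the paragraph above, but `SITE_URL`, `BUILD_DIR`, the remote path layout, the rsync options other than `--checksum`, and the `DRY_RUN` switch are all hypothetical. The real deployment may also restrict what the SSH key is allowed to run on the server side.

```shell
#!/bin/sh
# Hypothetical sketch of the deploy flow: push files, then trigger the update.
set -eu

SHIM_HOST="${SHIM_HOST:-static-gitlab-shim.torproject.org}"
SITE_URL="${SITE_URL:-example.torproject.org}"   # assumed site name
BUILD_DIR="${BUILD_DIR:-public}"                 # assumed artifact directory
DRY_RUN="${DRY_RUN:-1}"                          # print commands by default

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

deploy() {
    # Step 1: copy the built site to the static source with rsync over SSH.
    # The remote path layout is an assumption made for this sketch.
    run rsync --checksum --recursive --delete -e ssh \
        "$BUILD_DIR/" "$SHIM_HOST:$SITE_URL/"
    # Step 2: ask the static source to notify the master about the new content.
    run ssh "$SHIM_HOST" static-update-component "$SITE_URL"
}

deploy
```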
<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
![SSH deploy design of the static-shim](static-shim/architecture-static-shim-ssh.png)
<!-- things to evaluate here:
A [previous design](#webhook-deployment) involved a webhook written in Python, but now most
of the business logic resides in the [`static-shim-deploy.yml` template][],
which is basically a shell script embedded in a YAML
file. The CI hooks are deployed by users, which will typically include
the above template in their own `.gitlab-ci.yml` file.
* services
* storage (databases? plain text files? cloud/S3 storage?)
* queues (e.g. email queues, job queues, schedulers)
* interfaces (e.g. webserver, commandline)
* authentication (e.g. SSH, LDAP?)
* programming languages, frameworks, versions
* dependent services (e.g. authenticates against LDAP, or requires
git pushes)
* deployments: how is code for this deployed (see also Installation)
[static mirror system]: howto/static-component
how is this thing built, basically? -->
### Storage
TODO: design still in flux, see "alternatives considered" below.
Files are generated in GitLab CI as artifacts and stored there, which
makes it possible for them to be deployed by hand as well. A copy is
also kept on the static-shim server to make future deployments
faster. We use `rsync --checksum` so that files whose content has not
changed are skipped and keep their timestamps, even if the source files
were just regenerated from scratch.
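To illustrate the idea behind comparing by checksum, here is a small stand-in written with the POSIX `cksum` utility. It is not how rsync is implemented and not part of the deployment: the `needs_update` function and the file names are hypothetical, and the sketch only shows that a freshly regenerated file with identical content does not need to be copied again.

```shell
#!/bin/sh
# Stand-in for content-based comparison, using cksum instead of rsync itself.
set -eu

# Succeed (return 0) when dst is missing or its content differs from src.
needs_update() {
    src="$1"
    dst="$2"
    [ -f "$dst" ] || return 0
    [ "$(cksum < "$src")" != "$(cksum < "$dst")" ]
}

workdir="$(mktemp -d)"
printf 'hello\n' > "$workdir/deployed.html"    # copy already on the server
printf 'hello\n' > "$workdir/rebuilt.html"     # regenerated, same content
printf 'changed\n' > "$workdir/edited.html"    # regenerated, new content

for f in rebuilt.html edited.html; do
    if needs_update "$workdir/$f" "$workdir/deployed.html"; then
        echo "$f: content differs, would copy"
    else
        echo "$f: same content, skipped"
    fi
done

rm -rf "$workdir"
```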
### Authentication
......@@ -351,12 +370,6 @@ There is no issue tracker specifically for this project, [File][] or
This service was designed in [ticket 40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364).
* the webhook implementation fails if sites take more than 10 seconds
to deploy.
* the webhook implementation doesn't provide much visibility on
failures or progress, to see the list of recent webhook calls, head
to Settings -> Webhooks -> Edit -> Recent deliveries
## Maintainer, users, and upstream
The shim was written by anarcat and is maintained by TPA. It is used
......@@ -364,36 +377,43 @@ by all "critical" websites managed in GitLab.
## Monitoring and testing
<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades. describe -->
<!-- CI, test suites, linting, how security issues and upgrades are -->
<!-- tracked -->
There is no specific monitoring for this service, other than the
usual server-level monitoring. If the service should fail, users will
notice because their pipelines start failing.
TODO: write unit tests
TODO: how is this monitored?
Good sites to test that the deployment works are
<https://research.torproject.org/> ([pipeline link](https://gitlab.torproject.org/tpo/web/research/-/pipelines), not critical)
or <https://status.torproject.org/> ([pipeline link](https://gitlab.torproject.org/tpo/tpa/status-site/-/pipelines),
semi-critical).
## Logs and metrics
<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->
Jobs in GitLab CI have their own logs and retention policies. The
static shim should not add anything special to this, in theory. In
practice it's possible some private key leakage occurs if a user
displays the content of their own private SSH key in the job log. If
they use the provided template, this should not occur.
The webhook logs are available through `journalctl -u webhook` and in
`/var/log/daemon.log`. They should not contain PII that is not already
present in GitLab itself. Specifically, they might contain webhook
payloads, artifacts URL and webpages contents.
TODO: metrics?
We do not maintain any metrics on this service, other than the
usual server-level metrics.
## Backups
No specific backup procedure is necessary for this server, outside of
the automated basics. In fact, data on this host is mostly ephemeral
and could be reconstructed from pipelines in case of a total disaster.
and could be reconstructed from pipelines in case of a total server
loss.
## Other documentation
As mentioned in the [disaster recovery section](#disaster-recovery), if the GitLab
server gets compromised, the backup should still contain previous
good copies of the websites, in any case.
<!-- references to upstream documentation, if relevant -->
## Other documentation
* GitLab's [CI deployment mechanism](https://about.gitlab.com/blog/2021/02/05/ci-deployment-and-environments/) blog post
* [Design and launch ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364)
* our [static mirror system][] documentation
* our [GitLab CI documentation][GitLab CI]
* [Webhook homepage](https://github.com/adnanh/webhook)
* [hook definition documentation](https://github.com/adnanh/webhook/blob/master/docs/Hook-Definition.md)
* [hook examples](https://github.com/adnanh/webhook/blob/master/docs/Hook-Examples.md)
......@@ -401,51 +421,51 @@ and could be reconstructed from pipelines in case of a total disaster.
* [how to refer to payload in hook configuration](https://github.com/adnanh/webhook/blob/master/docs/Referencing-Request-Values.md)
* [usage](https://github.com/adnanh/webhook/blob/master/docs/Webhook-Parameters.md)
* [GitLab webhook documentation](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html)
* [Design and launch ticket](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364)
# Discussion
## Overview
<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
<!-- if this is an old project being documented, summarize the known -->
<!-- issues with the project. to quote the "audit procedure":
5. When was the last security review done on the project? What was
the outcome? Are there any security issues currently? Should it
have another security review?
6. When was the last risk assessment done? Something that would cover
risks from the data stored, the access required, etc.
7. Are there any in-progress projects? Technical debt cleanup?
Migrations? What state are they in? What's the urgency? What's the
next steps?
8. What urgent things need to be done on this project?
-->
The static shim was built to unblock the [Jenkins retirement
project](https://gitlab.torproject.org/groups/tpo/-/milestones/27). A key blocker was that the [static mirror system][] was
strongly coupled with Jenkins: many high traffic and critical websites
are built and deployed by Jenkins. Unless we wanted to completely
retire the static mirror system (in favor, say, of GitLab Pages), we
had to create a way for GitLab CI to deploy content to the static
mirror system.
## Goals
<!-- include bugs to be fixed -->
### Must have
* deploy sites from GitLab CI to the static mirror system
* site A cannot deploy to site B without being explicitly granted
permissions
* server-side (ie. in Puppet) access control (ie. user X can only
deploy site B)
### Nice to have
* automate migration from Jenkins to avoid manually doing many sites
* reusable GitLab CI templates
### Non-Goals
* static mirror system replacement
## Approvals required
<!-- for example, legal, "vegas", accounting, current maintainer -->
TPA
## Proposed Solution
We have decided to deploy sites over SSH from GitLab CI, see below for
a discussion.
## Cost
One VM, 20-30 hours of work, see [tpo/tpa/team#40364](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40364) for time tracking.
## Alternatives considered
This shim was designed to replace Jenkins with GitLab CI. As such,
......@@ -478,6 +498,18 @@ webhooks, but originally decided against it for the following reasons:
exception and is more error prone (e.g. if we somehow forget the
`command=` override, we open full shell access)
After trying the webhook deployment mechanism (below), we decided to
go back to the SSH deployment mechanism instead, because:
* the webhook implementation fails if sites take more than 10 seconds
to deploy.
* the webhook implementation doesn't provide much visibility on
failures or progress, to see the list of recent webhook calls, head
to Settings -> Webhooks -> Edit -> Recent deliveries
See below for details on that, and above for the full design of the
current deployment.
### webhook deployment
A design based on GitLab webhooks was established, with a workflow
......@@ -489,8 +521,7 @@ that goes something like this:
artifacts back to GitLab
4. GitLab fires a [webhook](https://gitlab.torproject.org/help/user/project/integrations/webhooks#pipeline-events), typically on [pipeline events](https://docs.gitlab.com/ee/user/project/integrations/webhooks.html#pipeline-events)
5. webhook receives the ping and authenticates against a
configuration, mapping to a given `static-component` (TODO: allow
list for gitlab?)
configuration, mapping to a given `static-component`
6. after authentication, the webhook fires a script
(`static-gitlab-shim-pull`)
7. `static-gitlab-shim-pull` parses the payload from the webhook and
......
digraph static {
label="GitLab / static mirror integration architecture, SSH deploy design\nanarcat@torproject.org, september 2021"
subgraph "clustergitlab" {
label="GitLab"
labelloc=top
CI [ label="CI runners" ]
GitLab [ label="GitLab rails\n app" shape=box ]
GitLab -> CI [ label="dispatches jobs" ]
}
subgraph "clustersource" {
label="static source"
labelloc=bottom
rsync [ label="files" shape=cylinder ]
update [ label="static-update-component" ]
}
subgraph "clusterlegend" {
service [ shape=box ]
files [ shape=cylinder ]
process [ shape=oval ]
label="legend"
labelloc=bottom
}
CI -> rsync [ label="rsync" ]
CI -> update [ label="runs" ]
master [ label="static master\nand mirrors..." ]
update -> master [ label="notifies" ]
rsync -> master [ label="pulls" ]
}
service/static-shim/architecture-static-shim-ssh.png
