Commit 07a72723 authored by anarcat

quick run at basic static site mirror docs (tpo/tpa/team#34436)

still lots todo but it's a start
parent 95b0ced5
+124 −29
@@ -5,12 +5,6 @@ distributed, a sort of content distribution network (CDN).

[[_TOC_]]

<!-- note: this template was designed based on multiple sources: -->
<!-- https://www.divio.com/blog/documentation/ -->
<!-- http://opsreportcard.com/section/9-->
<!-- http://opsreportcard.com/section/11 -->
<!-- comments like this one should be removed on instantiation -->

# Tutorial

This documentation is about administrating the static site components,
@@ -145,12 +139,16 @@ from a sysadmin perspective. User documentation lives in [doc/static-sites](doc/

## Pager playbook

TODO: add a pager playbook.

<!-- information about common errors from the monitoring system and -->
<!-- how to deal with them. this should be easy to follow: think of -->
<!-- your future self, in a stressful situation, tired and hungry. -->

## Disaster recovery

TODO: add a disaster recovery.

<!-- what to do if all goes to hell. e.g. restore from backups? -->
<!-- rebuild from scratch? not necessarily those procedures (e.g. see -->
<!-- "Installation" below) but some pointers. -->
@@ -158,25 +156,70 @@ from a sysadmin perspective. User documentation lives in [doc/static-sites](doc/
# Reference

## Installation
<!-- how to setup the service from scratch -->

Unclear. A new set of servers would need to be built, probably using
[Puppet](puppet) although it's not quite clear whether the
configuration in Puppet is sufficient to establish new servers.

TODO: check with weasel to see what is not in Puppet for a new host
(source/mirror/master) setup.

## SLA
<!-- this describes an acceptable level of service for this service -->

This service is designed to be highly available. All web sites should
keep working (maybe with some performance degradation) even if one of
the hosts goes down. It should also absorb and tolerate moderate
denial of service attacks.

## Design
<!-- how this is built -->
<!-- should reuse and expand on the "proposed solution", it's a -->
<!-- "as-built" document, whereas the "Proposed solution" is an -->
<!-- "architectural" document, which the final result might differ -->
<!-- from, sometimes significantly -->

<!-- a good guide to "audit" an existing project's design: -->
<!-- https://bluesock.org/~willkg/blog/dev/auditing_projects.html -->
The static mirror system is built of three kinds of hosts:

 * `source` - builds and hosts the original content
 * `master` - receives the contents from the source, dispatches it
   (atomically) to the mirrors
 * `mirror` - serves the contents to the user

<!-- this is a rephrased copy of -->
<!-- https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/master/modules/roles/README.static-mirroring.txt -->

A key advantage of that infrastructure is the higher availability it
provides: whereas individual virtual machines are power-cycled for
scheduled maintenance (e.g. kernel upgrades) and are unreachable in
the meantime, static mirroring machines are removed from the DNS
during their maintenance, so users are not sent to a host that is down.

The term static mirroring infrastructure includes:

<!-- such projects are never over. add a pointer to well-known issues -->
<!-- and show how to report problems. usually a link to the bugtracker -->
 * components, specifying the data source and other config options
   (see `modules/roles/misc/static-components.yaml`)
 * a `master` host for each component, responsible only for distributing data,
   not for serving data to end users
 * machines with the `static_mirror` Puppet role
 * a few scripts around `rsync(1)`

When data changes, the `source` is responsible for running
`static-update-component`, which instructs the `master` via SSH to run
`static-master-update-component`. A new copy of the source data is
then transferred to the `master` using `rsync(1)` and, upon successful
copy, swapped with the current copy.

The current copy on the `master` is then distributed to all actual
`mirror`s, again placing a new copy alongside their current copy using
`rsync(1)`.

Once the data has successfully made it to all mirrors, the mirrors are
instructed to swap the new copy with their current copy, at which
point the updated data will be served to end users.

<!-- end of the copy -->
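
The two-phase update described above can be sketched in Python. This is only an illustration under stated assumptions, not the logic of `static-master-update-component`: the real scripts use `rsync(1)` over SSH, whereas this sketch copies local directories and uses a `current` symlink as the swap mechanism. The names `stage`, `activate`, `update_mirrors` and the `current` symlink are all hypothetical.

```python
import os
import shutil
import tempfile


def stage(mirror_root, source_dir):
    """Copy source_dir into a new directory next to the live copy."""
    staged = tempfile.mkdtemp(prefix="new-", dir=mirror_root)
    try:
        shutil.copytree(source_dir, staged, dirs_exist_ok=True)
    except OSError:
        # Do not leave a half-written copy behind.
        shutil.rmtree(staged, ignore_errors=True)
        raise
    return staged


def activate(mirror_root, staged):
    """Repoint the (hypothetical) 'current' symlink at the staged copy."""
    tmp_link = os.path.join(mirror_root, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.basename(staged), tmp_link)
    # Renaming over the old link is atomic on POSIX filesystems.
    os.replace(tmp_link, os.path.join(mirror_root, "current"))


def update_mirrors(mirror_roots, source_dir):
    """Phase 1: stage on every mirror. Phase 2: swap only if all staged."""
    staged = {}
    try:
        for root in mirror_roots:
            staged[root] = stage(root, source_dir)
    except OSError:
        # One mirror failed: discard the staged copies, swap nothing.
        for path in staged.values():
            shutil.rmtree(path, ignore_errors=True)
        return False
    for root, path in staged.items():
        activate(root, path)
    return True
```

Staging everywhere before swapping anywhere means that a single failing mirror blocks the whole update, so no mirror starts serving the new version while another one cannot. Old copies are not cleaned up in this sketch.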

TODO: expand design. talk about mininag and walk through the [scripts overview](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/master/modules/staticsync/files/OVERVIEW)

TODO: make a diagram?

TODO: "audit" the static site mirror design as per https://bluesock.org/~willkg/blog/dev/auditing_projects.html

## Issues

There is no issue tracker specifically for this project. [File][] or
[search][] for issues in the [team issue tracker][search].
@@ -186,32 +229,61 @@ There is no issue tracker specifically for this project, [File][] or

## Monitoring and testing

<!-- describe how this service is monitored and how it can be tested -->
<!-- after major changes like IP address changes or upgrades -->
Static site synchronisation is monitored in Nagios, using a block in
`nagios-master.cfg` which looks like:

    -
        name: mirror static sync - extra
        check: "dsa_check_staticsync!extra.torproject.org"
        hosts: global
        servicegroups: mirror

That check (the script is actually called `dsa-check-mirrorsync`) makes
an HTTP request to every mirror and checks the timestamp inside a "trace"
file (`.serial`) to make sure every mirror has the same copy of the site.
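
The idea behind that check can be sketched as follows. This is not the `dsa-check-mirrorsync` code, only a minimal illustration assuming that `.serial` is a small plain-text file at the root of the site and that each mirror can be queried over plain HTTP with the site name in the `Host` header. The function names and the URL layout are assumptions.

```python
from urllib.request import Request, urlopen


def fetch_serial(mirror, site, timeout=10):
    """Fetch the trace file for `site` from one mirror (assumed URL layout)."""
    req = Request(f"http://{mirror}/.serial", headers={"Host": site})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode().strip()


def check_sync(mirrors, site, fetch=fetch_serial):
    """Return (ok, serials); ok is True when all mirrors agree."""
    serials = {}
    for mirror in mirrors:
        try:
            serials[mirror] = fetch(mirror, site)
        except OSError:
            # Network and HTTP errors from urllib are OSError subclasses.
            serials[mirror] = None
    values = set(serials.values())
    ok = bool(serials) and None not in values and len(values) == 1
    return ok, serials
```

The `fetch` parameter exists so the comparison logic can be exercised without network access, for example with a function that returns canned serials.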

## Logs and metrics

<!-- where are the logs? how long are they kept? any PII? -->
<!-- what about performance metrics? same questions -->
All tor webservers keep a minimal amount of logs. The IP address and
the time of day (but not the date) are zeroed: the time is always
logged as `00:00:00`. Referrer information is suppressed on the client
side by sending the `Referrer-Policy "no-referrer"` header.

The IP addresses are replaced with:

 * `0.0.0.0` - HTTP request
 * `0.0.0.1` - HTTPS request
 * `0.0.0.2` - hidden service request
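
As a rough illustration of the scheme above, the following sketch shows which address and timestamp would end up in a log entry. The real anonymisation happens in the webserver configuration, not in Python. The request kind names (`http`, `https`, `onion`) and the timestamp format are assumptions made for this example.

```python
from datetime import datetime

# Placeholder addresses, as listed above; the keys are hypothetical names.
PLACEHOLDER_IPS = {
    "http": "0.0.0.0",
    "https": "0.0.0.1",
    "onion": "0.0.0.2",
}


def sanitize(kind, when):
    """Return the (address, timestamp) pair that would be logged."""
    # Keep the date, drop the time of day.
    return PLACEHOLDER_IPS[kind], when.date().isoformat() + " 00:00:00"
```

For example, `sanitize("https", datetime(2020, 9, 1, 13, 37, 5))` yields `("0.0.0.1", "2020-09-01 00:00:00")`.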

Logs are kept for two weeks.

Metrics are scraped by [Prometheus](prometheus) using the "apache"
exporter.

## Backups

<!-- does this service need anything special in terms of backups? -->
<!-- e.g. locking a database? special recovery procedures? -->
The `source` hosts are backed up with [bacula](backups) without any special
provision. 

TODO: check if master / mirror nodes need to be backed up. Probably not?

## Other documentation

<!-- references to upstream documentation, if relevant -->
 * [DSA wiki](https://dsa.debian.org/howto/static-mirroring/)
 * [scripts overview](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/master/modules/staticsync/files/OVERVIEW)
 * [README.static-mirroring](https://salsa.debian.org/dsa-team/mirror/dsa-puppet/-/blob/master/modules/roles/README.static-mirroring.txt)

# Discussion

## Overview

<!-- describe the overall project. should include a link to a ticket -->
<!-- that has a launch checklist -->
The goal of this discussion section is to consider improvements to the
static site mirror system at torproject.org. It might also apply to
debian.org, but the focus is currently on TPO.

## Goals
<!-- include bugs to be fixed -->

TODO: document requirements

### Must have

@@ -220,12 +292,35 @@ There is no issue tracker specifically for this project, [File][] or
### Non-Goals

## Approvals required
<!-- for example, legal, "vegas", accounting, current maintainer -->

Should be approved by TPA.

## Proposed Solution

TODO: propose improvements to the current static mirror system.

brainstorm:

 * replace source with gitlab CI/runners
 * get rid of master altogether? becomes gitlab pages?
 * replace mirrors with the caching system?

One concern with using GitLab pages is that it uses a custom webserver
(to get and issue TLS certs for the custom domains) and requires a
shared filesystem to deploy content. GitLab.com uses NFS to decouple
the pages host from the main GitLab host. Maybe we could use CephFS
instead? In any case, it's a little clunky and doesn't immediately
fulfill the high availability requirement.

## Cost

Staff, mostly. We expect a reduction in cost if we reduce the number
of copies of the sites we have to keep around.

## Alternatives considered

<!-- include benchmarks and procedure if relevant -->

 * [GitLab pages](https://docs.gitlab.com/ee/administration/pages/) could be used as a source?
 * the [cache system](cache) could be used as a replacement in the
   frontend