From f12d790d5cfcc830f14f23d36513adff1088dae6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Mon, 11 Nov 2019 15:43:16 -0500
Subject: [PATCH] include a few more maintenance instructions of the cache

---
 tsa/howto/cache.mdwn | 94 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 82 insertions(+), 12 deletions(-)

diff --git a/tsa/howto/cache.mdwn b/tsa/howto/cache.mdwn
index a0a49cdd..d9283ef5 100644
--- a/tsa/howto/cache.mdwn
+++ b/tsa/howto/cache.mdwn
@@ -6,14 +6,51 @@ server.
 
 # Tutorial
 
-<!-- simple, brainless step-by-step instructions requiring little or -->
-<!-- no technical background -->
+To inspect the current cache hit ratio, head over to the [cache health
+dashboard](https://grafana.torproject.org/d/p21-cvJWk/cache-health) in [[grafana]]. It should be at least 75% and is
+generally close to or above 90%.
 
 # How-to
 
-<!-- more in-depth procedure that may require interpretation -->
+## Traffic inspection
+
+A quick way to see how much traffic is flowing through the cache is to
+fire up [slurm](https://screenshots.debian.net/package/slurm) on the public interface:
+
+    slurm -i eth0
+
+This will display a real-time graph of the traffic going in and out
+of the server. It should stay below 1Gbit/s (or around 120MB/s).
+
+Another way to see throughput is to use [iftop](https://screenshots.debian.net/package/iftop), in a similar way:
+
+    iftop -i eth0 -n
+
+This will show *per host* traffic statistics, which might allow
+pinpointing possible abusers. Hit the `L` key to turn on the
+logarithmic scale, without which the display quickly becomes
+unreadable.
+
+Log files are in `/var/log/nginx` (although those might eventually go
+away, see [ticket #32461](https://trac.torproject.org/projects/tor/ticket/32461)). The [lnav](https://screenshots.debian.net/package/lnav) program can be used to
+show those log files in a pretty way and do extensive queries on
+them. Hit the `i` key to flip to the "histogram" view and `z`
+multiple times to zoom all the way into a per-second hit rate
+view. Hit `q` to go back to the normal view, which is useful to
+inspect individual hits and diagnose why they fail to be cached, for
+example.
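+
+For a quick hit-ratio estimate without `lnav`, the cache status field
+can be tallied with `awk`. This is a sketch only: the sample lines
+below are synthetic, and the field position (and the presence of
+`$upstream_cache_status` at all) is an assumption to verify against
+the deployed nginx `log_format`:
+
```shell
# Tally cache statuses and print the share of each. In practice, pipe
# in the real access log and adjust the field number ($3 here) to
# wherever $upstream_cache_status sits in the log_format.
printf '%s\n' \
  'GET /a HIT' \
  'GET /b MISS' \
  'GET /c HIT' \
  'GET /d HIT' |
awk '{ count[$3]++; total++ }
     END { for (s in count) printf "%s %.0f%%\n", s, 100 * count[s] / total }'
```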
 
 ## Pager playbook
+
+The only monitoring for this service checks that the proper number of
+nginx processes is running. If this check gets triggered, the fix
+might be to simply restart nginx:
+
+    service nginx restart
+
+... although it might be a sign of a deeper issue requiring further
+traffic inspection.
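+
+Before reaching for a restart, the worker count can be compared
+against what is expected. The helper below is a sketch: the
+`www-data` user is the Debian default, and the expected count is a
+hypothetical value to match against the configured `worker_processes`:
+
```shell
# Sketch: decide whether nginx may need a restart based on the worker
# count. In real use, get the count with:
#   actual=$(pgrep -c -u www-data nginx)
# (www-data is the Debian default; adjust if the deployment differs.)
check_nginx_workers() {
    expected=$1
    actual=$2
    if [ "$actual" -lt "$expected" ]; then
        echo "only $actual/$expected workers: consider 'service nginx restart'"
    else
        echo "ok: $actual/$expected workers"
    fi
}

check_nginx_workers 4 2   # fewer workers than expected
check_nginx_workers 4 4   # all workers present
```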
+
 ## Disaster recovery
 
 In case of fire, head to the `torproject.org` zone in the
@@ -32,20 +69,53 @@ Include `roles::cache` in Puppet.
 
 TODO: document how to add new sites in the cache.
 
-## Design
-<!-- how this is build -->
-<!-- should reuse and expand on the "proposed solution", it's a -->
-<!-- "as-built" documented, whereas the "Proposed solution" is an -->
-<!-- "architectural" document, which the final result might differ -->
-<!-- from, sometimes significantly -->
-
-To be clarified.
-
 ## SLA
 <!-- this describes an acceptable level of service for this service -->
 
 TBD.
 
+## Design
+
+The cache service generally consists of two or more servers in
+geographically distinct areas that run a webserver acting as a
+[reverse proxy](https://en.wikipedia.org/wiki/Reverse_proxy). In our case, we run the [Nginx webserver](https://nginx.org/) with
+the [proxy module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html) for the <https://blog.torproject.org/> website
+(and eventually others, see [ticket #32462](https://trac.torproject.org/projects/tor/ticket/32462)). DNS for the site
+points to `cache.torproject.org`, an alias for the caching servers,
+which are currently two: `cache01.torproject.org` [sic] and
+`cache-02`. An HTTPS certificate for the site was issued through
+[[letsencrypt]]. Like the Nginx configuration, the certificate is
+deployed by Puppet in the `roles::cache` class.
+
+When a user hits the cache server, content is served from the cache
+stored in `/var/cache/nginx`, with a filename derived from the
+[`proxy_cache_key`](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_key) and [`proxy_cache_path`](https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_cache_path) settings. Those
+files should end up in the kernel's page cache, which should make
+those accesses fast. If the cached object is present and valid, it is
+returned directly to the user. If it is missing or
+invalid, it is fetched from the backend immediately. The backend is
+configured in Puppet as well.
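+
+As an illustration of the naming scheme, nginx stores a cached object
+under the MD5 hash of its expanded `proxy_cache_key`. The key string
+below is hypothetical, and the `levels` value mentioned in the
+comments is only an example; read the real values from the deployed
+configuration:
+
```shell
# nginx names each cache file after the MD5 of the expanded
# proxy_cache_key; with e.g. levels=1:2 the trailing characters of the
# hash become the subdirectory names under /var/cache/nginx.
# The key below is hypothetical -- check the deployed config.
key='https://blog.torproject.org/some-post'
hash=$(printf '%s' "$key" | md5sum | awk '{ print $1 }')
echo "$hash"
```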
+
+Requests to the cache are logged to the disk in
+`/var/log/nginx/ssl.$hostname.access.log`, with IP address and user
+agent removed. Then [mtail](https://github.com/google/mtail) parses those log files, increments
+various counters, and exposes them as metrics that are scraped by
+[[prometheus]]. We use [[grafana]] to display the hit ratio which, at
+the time of writing, is about 88% for the blog.
+
+## Issues
+
+ * logs should not be written to disk, but instead piped directly into
+   mtail (or through syslog as a buffer), see [ticket #32461](https://trac.torproject.org/projects/tor/ticket/32461)
+ * we use other caches and load balancers elsewhere (haproxy and
+   varnish), we should converge over nginx everywhere for consistency,
+   see [ticket #32462](https://trac.torproject.org/projects/tor/ticket/32462) for the varnish conversion
+ * the cipher suite is an old hardcoded copy derived from Apache, see
+   [ticket #32351](https://trac.torproject.org/projects/tor/ticket/32351)
+
+There is no issue tracker specifically for this project; file and
+search for issues in [internal services](https://trac.torproject.org/projects/tor/query?status=!closed&component=Internal+Services%2FTor+Sysadmin+Team).
+
 # Discussion
 
 This section regroups notes that were gathered during the research,
-- 
GitLab