[...]

TBD.

# Discussion

A discussion of the design of the new service, mostly.

## Overview

The original goal of this project is to create a pair of caching
servers in front of the blog to reduce the bandwidth costs we're being
charged there.

## Goals

### Must have

* reduce the traffic on the blog, hosted at a costly provider (#32090)
* HTTPS support in the frontend and backend
* deployment through Puppet
* anonymized logs
* hit rate stats

### Nice to have

* provide a frontend for our existing mirror infrastructure, a
  home-made CDN for TBB and other releases
* no on-disk logs
* cute dashboard or grafana integration
* well-maintained upstream Puppet module

### Approvals required

* approved and requested by vegas

## Non-Goals

* global CDN for users outside of TPO
* geoDNS

## Cost

Somewhere between 11EUR and 100EUR/mth for bandwidth and hardware.

We're apparently getting around 2.2M "page views" per month at
Pantheon. That is about 1 hit per second and 12 terabytes per month,
or 36Mbit/s on average:

    $ qalc
    > 2 200 000 ∕ (30d) to hertz
    2200000 / (30 * day) = approx. 0.84876543 Hz
    > 2 200 000 * 5Mibyte
    2200000 * (5 * mebibyte) = 11.534336 terabytes
    > 2 200 000 * 5Mibyte/(30d) to megabit / s
    (2200000 * (5 * mebibyte)) / (30 * day) = approx. 35.599802 megabits / s

Hetzner charges 1EUR/TB/month over our 1TB quota, so bandwidth would
cost 11EUR/month on average. If costs become prohibitive, we could
switch to a Hetzner VM which includes 20TB of traffic per month, at
costs ranging from 3EUR/mth to 30EUR/mth depending on the VPS size
(between 1 vCPU, 2GB RAM, 20GB SSD and 8 vCPU, 32GB RAM, 240GB SSD).

Dedicated servers start at 34EUR/mth (`EX42`, 64GB RAM, 2x4TB HDD) for
unlimited gigabit.

## Proposed Solution

Nginx will be deployed on two servers. ATS was found to be somewhat
difficult to configure and debug, while Nginx has a more "regular"
configuration file format. Furthermore, performance was equivalent or
better in Nginx.

Finally, there is the possibility of converging all HTTP services
towards Nginx if desired, which would reduce the number of moving
parts in the infrastructure.

## Launch checklist

See [#32239](https://trac.torproject.org/projects/tor/ticket/32239).

## Benchmarking procedures

This will require a test VM (or two?) to hit the caches.

### Common procedure

1. punch a hole in the firewall to allow cache2 to access cache1:

        iptables -I INPUT -s 78.47.61.104 -j ACCEPT
        ip6tables -I INPUT -s 2a01:4f8:c010:25ff::1 -j ACCEPT

2. point the blog to cache1 on cache2 in `/etc/hosts`:

        116.202.120.172 blog.torproject.org
        2a01:4f8:fff0:4f:266:37ff:fe26:d6e1 blog.torproject.org

3. disable Puppet:

        puppet agent --disable 'benchmarking requires /etc/hosts override'

4. launch the benchmark, as detailed in the tool-specific sections
   below
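Once done, the overrides can be reverted; a sketch, simply inverting
the commands above (Puppet should restore `/etc/hosts` on its next
run):

    iptables -D INPUT -s 78.47.61.104 -j ACCEPT
    ip6tables -D INPUT -s 2a01:4f8:c010:25ff::1 -j ACCEPT
    puppet agent --enable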
### Siege

Siege configuration sample:

```
verbose = false
fullurl = true
concurrent = 100
time = 2M
url = http://www.example.com/
delay = 1
internet = false
benchmark = true
```

Might require this, which might work only with varnish:

```
proxy-host = 209.44.112.101
proxy-port = 80
```

Alternative is to hack `/etc/hosts`.
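For the record, a run with that configuration would look something
like this (a sketch: it assumes the sample above is saved as
`~/.siegerc`, passed explicitly here with `-R`):

    siege -R ~/.siegerc -c 100 -t 2M https://blog.torproject.org/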
### apachebench

Classic commandline:

    ab2 -n 1000 -c 100 -X cache01.torproject.org https://example.com/

`-X` also doesn't work with ATS, so we hacked `/etc/hosts` instead.

### bombardier

Unfortunately, the [bombardier package in Debian](https://tracker.debian.org/pkg/bombardier) is *not* the HTTP
benchmarking tool but a commandline game. It's still possible to
install it in Debian with:

    export GOPATH=$HOME/go
    apt install golang
    go get -v github.com/codesenberg/bombardier

Then running the benchmark is as simple as:

    ./go/bin/bombardier --duration=2m --latencies https://blog.torproject.org/

Baseline benchmark, from cache02:

    anarcat@cache-02:~$ ./go/bin/bombardier --duration=2m --latencies https://blog.torproject.org/ -c 100
    Bombarding https://blog.torproject.org:443/ for 2m0s using 100 connection(s)
    [=========================================================================] 2m0s
    Done!
    Statistics        Avg      Stdev        Max
      Reqs/sec      2796.01     716.69    6891.48
      Latency       35.96ms    22.59ms      1.02s
      Latency Distribution
         50%    33.07ms
         75%    40.06ms
         90%    47.91ms
         95%    54.66ms
         99%    75.69ms
      HTTP codes:
        1xx - 0, 2xx - 333646, 3xx - 0, 4xx - 0, 5xx - 0
        others - 0
      Throughput:   144.79MB/s

This is strangely much higher, in terms of throughput, and faster, in
terms of latency, than testing against our own servers. Different
avenues were explored to explain that disparity with our servers:

* jumbo frames? nope, both connexions see packets larger than 1500
  bytes
* protocol differences? nope, both go over IPv6 and (probably) HTTP/2
  (at least not over UDP)
* different link speeds

The last theory is currently the only one standing. Indeed, 144.79MB/s
should not be possible on regular gigabit ethernet (GigE), as it is
actually *more* than 1000Mbit/s (1158.32Mbit/s). Sometimes the above
benchmark even gives 152MB/s (1222Mbit/s), way beyond what a regular
GigE link should be able to provide.
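The conversion behind that claim, using the same `qalc` convention as
the Cost section above (output format approximate):

    > 144.79 Mbyte/s to Mbit/s
    (144.79 * megabyte) / second = 1158.32 megabits / s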
### Other tools

Siege has trouble going above ~100 concurrent clients because of its
design (and ulimit) limitations. Its interactive features are also
limited; here's a set of interesting alternatives:

* [bombardier](https://github.com/codesenberg/bombardier) - golang, HTTP/2, better performance than siege in
  my (2017) tests, not in Debian
* [boom](https://github.com/tarekziade/boom) - python rewrite of apachebench, supports duration,
  HTTP/2, not in Debian, unsearchable name
* [go-wrk](https://github.com/adjust/go-wrk/) - golang rewrite of wrk with HTTPS, had performance
  issues in my first tests (2017), [no duration target](https://github.com/adjust/go-wrk/issues/2), not in
  Debian
* [hey](https://github.com/rakyll/hey) - golang rewrite of apachebench, similar to boom, not in
  Debian ([ITP #943596](https://bugs.debian.org/943596)), unsearchable name
* [Jmeter](https://jmeter.apache.org/) - interactive behavior, can replay recorded sessions
  from browsers
* [Locust](https://locust.io/) - distributed, can model login and interactive
  behavior, not in Debian
* [Tsung](http://tsung.erlang-projects.org/1/01/about/) - multi-protocol, distributed, erlang
* [wrk](https://github.com/wg/wrk/) - multithreaded, epoll, Lua scriptable, no HTTPS, only in
  Debian unstable

## Alternatives considered

Four alternatives were seriously considered:

* Apache Traffic Server
* Nginx proxying + caching
* Varnish + stunnel
* Fastly

Other alternatives were not:

* [Apache HTTPD caching](https://httpd.apache.org/docs/2.4/caching.html) - performance expected to be sub-par
* [Envoy][] - [not designed for caching](https://github.com/envoyproxy/envoy/issues/868), [external cache support
  planned in 2019](https://blog.getambassador.io/envoy-proxy-in-2019-security-caching-wasm-http-3-and-more-e5ba82da0197?gi=82c1a78157b8)
* [HAproxy](https://www.haproxy.com/) - [not designed to cache large objects](https://www.haproxy.com/documentation/aloha/9-5/traffic-management/lb-layer7/caching-small-objects/)
* [Ledge](https://github.com/ledgetech/ledge) - caching extension to Nginx with ESI, Redis, and cache
  purge support, not packaged in Debian
* [Nuster](https://github.com/jiangwenyuan/nuster) - new project, not packaged in Debian (based on
  HAproxy), performance [comparable with nginx and varnish](https://github.com/jiangwenyuan/nuster/wiki/Web-cache-server-performance-benchmark:-nuster-vs-nginx-vs-varnish-vs-squid#results)
  according to upstream, although with impressive improvements
* [Polipo](https://en.wikipedia.org/wiki/Polipo) - not designed for production use
* [Squid](http://www.squid-cache.org/) - not designed as a reverse proxy
* [Traefik](https://traefik.io/) - [not designed for caching](https://github.com/containous/traefik/issues/878)

[Envoy]: https://www.envoyproxy.io/

### Apache Traffic Server

#### Summary of online reviews

Pros:

* HTTPS
* HTTP/2
* industry leader (behind cloudflare)
* out of the box clustering support

Cons:

* load balancing is an experimental plugin (at least in 2016)
* no static file serving? or slower?
* no commercial support

Used by Yahoo, Apple and Comcast.

#### First impressions

Pros:

* [Puppet module available](https://forge.puppet.com/brainsware/trafficserver)
* no query logging by default (good?)
* good documentation, but a bit lacking in tutorials
* nice little dashboard shipped by default (`traffic_top`), although
  it could be more useful (doesn't seem to show the hit ratio clearly)

Cons:

* configuration spread out over many different configuration files
* complex and arcane configuration language (e.g. try to guess what
  this actually does: `CONFIG proxy.config.http.server_ports STRING
  8080:ipv6:tr-full 443:ssl
  ip-in=192.168.17.1:80:ip-out=[fc01:10:10:1::1]:ip-out=10.10.10.1`)
* configuration syntax varies across config files and plugins
* <del>couldn't decouple backend hostname and passed `Host`
  header</del> bad random tutorial found on the internet
* couldn't figure out how to make HTTP/2 work
* no prometheus exporters

#### Configuration

    apt install trafficserver

Default Debian config seems sane when compared to the [Cicimov
tutorial][cicimov]. One thing we will need to change is the [default
listening port][], which is by default:

[default listening port]: https://docs.trafficserver.apache.org/en/8.0.x/admin-guide/files/records.config.en.html#proxy.config.http.server_ports

    CONFIG proxy.config.http.server_ports STRING 8080 8080:ipv6
We want something more like this:

[...]

And finally curl is able to talk to the proxy:

    curl --proxy-cacert /etc/ssl/torproject-auto/servercerts/ca.crt --proxy https://cache01.torproject.org/ https://blog.torproject.org
#### Troubleshooting

##### Proxy fails to hit backend

    curl: (56) Received HTTP code 404 from proxy after CONNECT

[...] the `Foo` header appears in the request.

The solution to this is the `proxy.config.url_remap.pristine_host_hdr`
documented above.

##### HTTP/2 support missing

Next hurdle: no HTTP/2 support, even when using `proto=http2;http`
(falls back on `HTTP/1.1`) and `proto=http2` only (fails with
`WARNING: Unregistered protocol type 0`).

#### Benchmarks

##### Same host tests

With `blog.tpo` in `/etc/hosts`, because `proxy-host` doesn't work, and
running on the same host as the proxy (!), cold cache:

ab:

[...]

      99%    172
     100%    196 (longest request)

##### Separate host

Those tests were performed from one cache server to the other, to
avoid the benchmarking tool fighting for resources with the server.

[...]

         75%    57.98ms
         90%    69.05ms
         95%    78.44ms
         99%   128.34ms
      HTTP codes:
        1xx - 0, 2xx - 241187, 3xx - 0, 4xx - 0, 5xx - 0
        others - 0
      Throughput:   104.67MB/s
It might be because it supports doing HTTP/2 requests and, indeed, the
`Throughput` drops down to `14MB/s` when we use the `--http1` flag,
along with rates closer to ab:

    anarcat@cache-02:~$ ./go/bin/bombardier --duration=2m --latencies https://blog.torproject.org/ --http1 -c 100
    Bombarding https://blog.torproject.org:443/ for 2m0s using 100 connection(s)
    [=========================================================================] 2m0s
    Done!
    Statistics        Avg      Stdev        Max
      Reqs/sec      1322.21     253.18    1911.21
      Latency       78.40ms    18.65ms   688.60ms
      Latency Distribution
         50%    75.53ms
         75%    88.52ms
         90%   101.30ms
         95%   110.68ms
         99%   132.89ms
      HTTP codes:
        1xx - 0, 2xx - 153114, 3xx - 0, 4xx - 0, 5xx - 0
        others - 0
      Throughput:    14.22MB/s

Inter-server communication is good, according to `iperf3`:

    [ ID] Interval           Transfer     Bitrate
    [  5]   0.00-10.04  sec  1.00 GBytes   859 Mbits/sec   receiver
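For reference, a plain client/server `iperf3` run is enough to
reproduce that measurement; a sketch, assuming cache01 is the server
side (the exact invocation wasn't recorded):

    iperf3 -s                          # on cache01
    iperf3 -c cache01.torproject.org   # on cache02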
So we see the roundtrip does add significant overhead to ab and
siege. It's possible this is due to the nature of the virtual server,
which is much less powerful than the real server. This seems to be
confirmed by `bombardier`'s success, since it's possibly better
designed than the other two to maximize resources on the client side.
### Nginx

[...]

Cons:

[...]

* [detailed cache stats][] are only in the "plus" version

[detailed cache stats]: https://docs.nginx.com/nginx/admin-guide/monitoring/live-activity-monitoring/
#### Configuration

Picking the "light" Debian package; the modules that would be
interesting in the other variants are "cache purge" (from extras) and
"geoip" (from full):

    apt install nginx-light

Then drop this config file in `/etc/nginx/sites-available` and symlink
into `sites-enabled`:

    server_names_hash_bucket_size 64;
    proxy_cache_path /var/cache/nginx/ levels=1:2 keys_zone=blog:10m;

    server {
        listen 80;
        listen [::]:80;
        listen 443 ssl;
        listen [::]:443 ssl;

        ssl_certificate /etc/ssl/torproject/certs/blog.torproject.org.crt-chained;
        ssl_certificate_key /etc/ssl/private/blog.torproject.org.key;

        server_name blog.torproject.org;

        proxy_cache blog;

        location / {
            proxy_pass https://live-tor-blog-8.pantheonsite.io;
            proxy_set_header Host $host;

            # cache 304 responses
            proxy_cache_revalidate on;
            # add cookie to cache key
            #proxy_cache_key "$host$request_uri$cookie_user";
            # not sure what the cookie name is
            proxy_cache_key $scheme$proxy_host$request_uri;
            # allow serving stale content on error, timeout, or refresh
            proxy_cache_use_stale error timeout updating;
            # allow only the first request through to the backend
            proxy_cache_lock on;
            # expose the cache status as a response header
            add_header X-Cache-Status $upstream_cache_status;
        }
    }

... and reload nginx.
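For example, assuming a stock Debian install with systemd:

    nginx -t && systemctl reload nginx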
I tested that logged-in users bypass the cache and that things
generally work well.
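A quick way to spot-check the cache is the `X-Cache-Status` header
added above; a sketch (a second request for the same page should
normally show a `HIT`):

    $ curl -sI https://blog.torproject.org/ | grep -i x-cache-status
    x-cache-status: HIT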
A key problem with Nginx is getting decent statistics out of it. The
[upstream nginx exporter](https://github.com/nginxinc/nginx-prometheus-exporter) basically only supports hits per second,
through the [stub status module](http://nginx.org/en/docs/http/ngx_http_stub_status_module.html), a very limited module shipped with
core Nginx. The commercial version, Nginx Plus, supports a [more
extensive API](https://nginx.org/en/docs/http/ngx_http_api_module.html#api) which includes the hit rate, but that's not an
option for us.
There are three ways to work around this problem:

* create our own metrics using the [Nginx Lua Prometheus module](https://github.com/knyar/nginx-lua-prometheus):
  this can have performance impacts and involves a custom
  configuration
* write and parse log files; that's the way the [munin plugin](https://github.com/munin-monitoring/contrib/blob/master/plugins/nginx/nginx-cache-hit-rate)
  works - this could possibly be fed *directly* into [mtail](https://github.com/google/mtail) to
  avoid storing logs on disk but still get the data (include
  [`$upstream_cache_status`](http://nginx.org/en/docs/http/ngx_http_upstream_module.html#var_upstream_cache_status) in the logs)
* use a third-party module like [vts](https://github.com/vozlt/nginx-module-vts) or [sts](https://github.com/vozlt/nginx-module-sts) and the
  [exporter](https://github.com/hnlq715/nginx-vts-exporter) to expose those metrics - the vts module doesn't seem
  to be very well maintained (no release since 2018) and it's unclear
  if this will work for our use case
Here's an example of how to do the mtail hack. First, tell nginx to
write to syslog, which acts as a buffer so that log parsing doesn't
slow down request processing; excerpt from the [nginx.conf snippet](https://git.autistici.org/ai3/float/blob/master/roles/nginx/templates/config/nginx.conf#L34):

    # Log response times so that we can compute latency histograms
    # (using mtail). Works around the lack of Prometheus
    # instrumentation in NGINX.
    log_format extended '$server_name:$server_port '
                        '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent" '
                        '$upstream_addr $upstream_response_time $request_time';

    access_log syslog:server=unix:/dev/log,facility=local3,tag=nginx_access extended;
(We would also need to add `$upstream_cache_status` in that format.)
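A minimal sketch of that change, simply appending the variable to the
format above:

    log_format extended '$server_name:$server_port '
                        '$remote_addr - $remote_user [$time_local] '
                        '"$request" $status $body_bytes_sent '
                        '"$http_referer" "$http_user_agent" '
                        '$upstream_addr $upstream_response_time $request_time '
                        '$upstream_cache_status';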
Then count the different stats using mtail, excerpt from the [mtail
config snippet](https://git.autistici.org/ai3/float/blob/master/roles/base/files/mtail/nginx.mtail):

    # Define the exported metrics.
    counter nginx_http_request_total
    counter nginx_http_requests by host, vhost, method, code, backend
    counter nginx_http_bytes by host, vhost, method, code, backend
    counter nginx_http_requests_ms by le, host, vhost, method, code, backend

    /(?P<hostname>[-0-9A-Za-z._:]+) nginx_access: (?P<vhost>[-0-9A-Za-z._:]+) (?P<remote_addr>[0-9a-f\.:]+) - - \[[^\]]+\] "(?P<request_method>[A-Z]+) (?P<request_uri>\S+) (?P<http_version>HTTP\/[0-9\.]+)" (?P<status>\d{3}) ((?P<response_size>\d+)|-) "[^"]*" "[^"]*" (?P<upstream_addr>[-0-9A-Za-z._:]+) ((?P<ups_resp_seconds>\d+\.\d+)|-) (?P<request_seconds>\d+)\.(?P<request_milliseconds>\d+)/ {
      nginx_http_request_total++
      [...]
We'd also need to check the cache status in that parser.
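A rough sketch of what that could look like in mtail, assuming
`$upstream_cache_status` gets appended to the log format as discussed
above (the metric name here is hypothetical):

    # Count responses by nginx cache status (HIT, MISS, EXPIRED, ...).
    counter nginx_http_responses by cache_status

    /nginx_access: .* (?P<cache_status>HIT|MISS|EXPIRED|STALE|UPDATING|REVALIDATED|BYPASS|-)$/ {
      nginx_http_responses[$cache_status]++
    }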
References:
* [NGINX Alphabetical index of variables](https://nginx.org/en/docs/varindex.html)
* [NGINX Module ngx_http_proxy_module](https://nginx.org/en/docs/http/ngx_http_proxy_module.html)
* [NGINX Content Caching](https://docs.nginx.com/nginx/admin-guide/content-cache/content-caching/)
* [NGINX Reverse Proxy](https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/)
* [perusio@github.com: Nginx configuration for running Drupal](https://github.com/perusio/drupal-with-nginx) -
interesting [snippet](https://github.com/perusio/drupal-with-nginx/blob/D7/apps/drupal/map_cache.conf) for cookies handling, not required
* [NGINX: Maximizing Drupal 8 Performance with NGINX, Part 2: Caching and Load Balancing](https://www.nginx.com/blog/maximizing-drupal-8-performance-nginx-part-ii-caching-load-balancing/)
#### Benchmarks
ab:
    root@cache-02:~# ab -c 100 -n 1000 https://blog.torproject.org/
    [...]
    Server Software:        nginx/1.14.2
    Server Hostname:        blog.torproject.org
    Server Port:            443
    SSL/TLS Protocol:       TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,4096,256
    Server Temp Key:        X25519 253 bits
    TLS Server Name:        blog.torproject.org

    Document Path:          /
    Document Length:        53313 bytes

    Concurrency Level:      100
    Time taken for tests:   3.083 seconds
    Complete requests:      1000
    Failed requests:        0
    Total transferred:      54458000 bytes
    HTML transferred:       53313000 bytes
    Requests per second:    324.31 [#/sec] (mean)
    Time per request:       308.349 [ms] (mean)
    Time per request:       3.083 [ms] (mean, across all concurrent requests)
    Transfer rate:          17247.25 [Kbytes/sec] received

    Connection Times (ms)
                  min  mean[+/-sd] median   max
    Connect:       30  255  78.0    262     458
    Processing:    18   35  19.2     28     119
    Waiting:        7   19   7.4     18      58
    Total:         81  290  88.3    291     569

    Percentage of the requests served within a certain time (ms)
      50%    291
      66%    298
      75%    303
      80%    306
      90%    321
      95%    533
      98%    561
      99%    562
     100%    569 (longest request)
About 50% faster than ATS.
Siege:
    Transactions:              32246 hits
    Availability:             100.00 %
    Elapsed time:             119.57 secs
    Data transferred:        1639.49 MB
    Response time:              0.37 secs
    Transaction rate:         269.68 trans/sec
    Throughput:                13.71 MB/sec
    Concurrency:               99.60
    Successful transactions:   32246
    Failed transactions:           0
    Longest transaction:        1.65
    Shortest transaction:       0.23
Almost an order of magnitude faster than ATS.
Bombardier:
    anarcat@cache-02:~$ ./go/bin/bombardier --duration=2m --latencies https://blog.torproject.org/ -c 100
    Bombarding https://blog.torproject.org:443/ for 2m0s using 100 connection(s)
    [=========================================================================] 2m0s
    Done!
    Statistics        Avg      Stdev        Max
      Reqs/sec      2116.74     506.01    5495.77
      Latency       48.42ms    34.25ms      2.15s
      Latency Distribution
         50%    37.19ms
         75%    50.44ms
         90%    89.58ms
         95%   109.59ms
         99%   169.69ms
      HTTP codes:
        1xx - 0, 2xx - 247827, 3xx - 0, 4xx - 0, 5xx - 0
        others - 0
      Throughput:   107.43MB/s
Almost maxes out the gigabit connexion as well, but only marginally
faster (~3%?) than ATS.
It does not reach the theoretical gigabit maximum either, which [is
apparently](http://rickardnobel.se/actual-throughput-on-gigabit-ethernet/) around 118MB/s without jumbo frames (and 123MB/s
with).
### Varnish

Pros:
[...]