document possible keepalive snafu (#41902) authored by anarcat
@@ -2679,6 +2679,20 @@ general, and our setup in particular:
resolved](https://github.com/prometheus/alertmanager/issues/226) ([PR pending since 2022](https://github.com/prometheus/alertmanager/pull/3034))
- [Alertmanager doesn't send notifications when silences are
posted](https://github.com/prometheus/alertmanager/issues/730)
- Prometheus uses [keep-alive](https://en.wikipedia.org/wiki/Keepalive) HTTP connections to probe
targets. An established connection keeps using the address it was
opened with, so DNS changes might take longer to take effect than
expected. In particular, some servers (e.g. [Nginx](https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests))
allow a *lot* of requests per keep-alive connection (e.g. 1000), which
means Prometheus can take a long time to switch to the new host
(e.g. about 16 hours at one probe per minute).
A workaround is to shut down the previous host during a rotation,
which forces Prometheus to connect to the new one, or to reduce the
number of keep-alive requests allowed on the server
([`keepalive_requests`](https://nginx.org/en/docs/http/ngx_http_core_module.html#keepalive_requests) on Nginx, [`MaxKeepAliveRequests`](https://httpd.apache.org/docs/2.4/mod/core.html#maxkeepaliverequests) on
Apache).
See [41902](https://gitlab.torproject.org/tpo/tpa/team/-/issues/41902) for further information.
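
As a rough sanity check on the 16 hours figure, the delay can be
estimated as the server's request limit multiplied by the probe
interval. The sketch below is only an illustration: the
`keepalive_lifetime` helper is hypothetical, the one-minute interval
is an assumption, and it supposes the server's idle timeout is long
enough that the connection survives between probes.

```python
from datetime import timedelta


def keepalive_lifetime(max_requests: int, probe_interval: timedelta) -> timedelta:
    """Estimate how long one keep-alive connection can stay open.

    Assumes exactly one request per probe and that the server only
    closes the connection once the request limit is reached.
    """
    return max_requests * probe_interval


# Assumed values: a limit of 1000 requests, one probe per minute.
print(keepalive_lifetime(1000, timedelta(minutes=1)))  # 16:40:00
# Lowering the limit to 100 shortens the delay accordingly.
print(keepalive_lifetime(100, timedelta(minutes=1)))  # 1:40:00
```

Under those assumptions, lowering the request limit shortens the
worst-case delay proportionally.
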
In general, the service is still being launched, see [TPA-RFC-33][]
for the full deployment plan.