We have changed the IP address of the snowflake broker. Prometheus2 is polling metrics from it using the domain name snowflake-broker.torproject.net, which has been updated, but I believe it is still polling the old broker. Might it be that Prometheus doesn't update the DNS resolution without a restart? If so, can you restart prometheus2?
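One way to check what it is actually talking to, without restarting anything, is to look at the established sockets on prometheus2. A sketch; the port is an assumption, use whatever the snowflake-broker scrape job actually targets:

```
# List established connections owned by the prometheus process; the peer
# address column shows which broker IP is actually being scraped.
# ":443" is an assumption about the scrape port.
ss -tnp state established '( dport = :443 )' | grep prometheus
```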
```
root@hetzner-nbg1-02:~# ping snowflake-broker.torproject.net
PING snowflake-broker.torproject.net(2a00:c6c0:0:151:4:ae99:c0a9:d585 (2a00:c6c0:0:151:4:ae99:c0a9:d585)) 56 data bytes
64 bytes from 2a00:c6c0:0:151:4:ae99:c0a9:d585 (2a00:c6c0:0:151:4:ae99:c0a9:d585): icmp_seq=1 ttl=55 time=10.2 ms
root@hetzner-nbg1-02:~# ping -4 snowflake-broker.torproject.net
PING (37.218.242.175) 56(84) bytes of data.
64 bytes from 37.218.242.175 (37.218.242.175): icmp_seq=1 ttl=54 time=9.90 ms
```
according to the metric names in the description, the job that's not polling the right place is snowflake-broker, so not the blackbox one. so the faulty thing would be the prometheus daemon itself in this case.
if I poke directly at prometheus2 for the value of the metric, I can confirm that the number is currently 77625112, so it grew since @meskio last poked at it.
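For the record, that kind of poke can be done against the Prometheus HTTP API. A rough sketch, where `snowflake_broker_metric_total` is only a placeholder for the actual metric name mentioned in the description:

```
# Ask prometheus2 for the current value of the broker metric.
# "snowflake_broker_metric_total" is a placeholder, not the real metric name.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=snowflake_broker_metric_total' | jq .
```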
if I query the old IP directly, I can see the 77M number:
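Something along these lines, with OLD_IP standing in for the previous broker address (the /metrics path and the HTTPS port are assumptions about how the exporter is exposed):

```
# Pin the broker name to the old address so curl bypasses the DNS change,
# then fetch the metrics from the old machine directly.
curl -s --resolve snowflake-broker.torproject.net:443:OLD_IP \
  https://snowflake-broker.torproject.net/metrics | grep -i snowflake
```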
so prom keeps the http connections alive, which means it won't follow dns changes correctly. that's pretty annoying; if the connection was reset every so often, it would remove the issue of dns changes going unnoticed without putting too much pressure on http connection renewals.
@meskio could it be that the server allows an infinite number of Keep-Alive requests? that would let Prometheus keep the socket open forever and never check DNS...
@anarcat you mean on the snowflake-broker? I think it's whatever the default is for nginx in Debian.
Yes. Then I think that's what's happening: nginx allows a surprisingly large number of keep-alive requests (1000) before closing the socket on the client. My theory is that Prometheus was still scraping the old server simply because it was still on the same TCP connection to the web server, which was never closed because of the keep-alive.

With a 1000-request limit, at one scrape per minute, this means up to 1000 minutes of delay before Prom notices the rotation, which is over 16 hours!

If you want a faster rotation, you can lower the keep-alive limit or kill the socket, either by resetting the connection in the kernel or restarting the webserver entirely.
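If lowering the limit were the route taken, a minimal sketch on the broker could look like this, assuming a stock Debian nginx that includes /etc/nginx/conf.d/*.conf in the http block (the path and the value 100 are assumptions, not the broker's actual configuration):

```
# Drop-in that lowers the per-connection request limit (the default is 1000),
# forcing clients to reopen connections, and thus re-resolve DNS, sooner.
cat > /etc/nginx/conf.d/keepalive.conf <<'EOF'
keepalive_requests 100;
EOF
nginx -t && systemctl reload nginx
```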
No, it doesn't. It was not 16h: we waited 2 days before asking you to fix it. But maybe it's just your calculation that is wrong.
I'm not totally understanding how keep-alive will affect this: if I understand correctly, prometheus just opens 1 connection and keeps it alive forever, so even setting the number in nginx to 1 will not fix this problem, as only 1 connection will be kept alive.
That's the nginx keepalive_requests setting, which defaults to 1000. I would assume Debian doesn't change that default configuration.

1000 minutes, if I count this right, is 16 hours and 40 minutes (1000 / 60 == 100 / 6 == 16 and two thirds hours). Happy to be proven wrong.

If you waited two days, then something else is going on, because 16h is a maximum.

Honestly, I'm really surprised to read "prometheus not updating DNS": I think Prom follows DNS and, according to upstream, doesn't keep its own DNS cache.

I'll note that the DNS TTL for that domain is 3600 (1h), not that it changes anything for "no update in two days".
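For reference, the TTL is easy to double-check from any host:

```
# Show the addresses the name points to; the second column is the TTL
# (remaining TTL when answered from a caching resolver).
dig +noall +answer snowflake-broker.torproject.net A
dig +noall +answer snowflake-broker.torproject.net AAAA
```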
> I'm not totally understanding how keep-alive will affect this: if I understand correctly, prometheus just opens 1 connection and keeps it alive forever, so even setting the number in nginx to 1 will not fix this problem, as only 1 connection will be kept alive.
Hm... The setting I refer to above is a counter on the number of requests a client is allowed to issue before the server closes that "1 connection". It doesn't say "how many parallel requests a client can open". So I think it is exactly what I am describing here.

To reuse your example, if you set this to 1, then it's pretty much equivalent to disabling keep-alive: clients are allowed to make a single request per socket, and, in theory, we wouldn't have had this problem.
I'd also point out that if the old broker had been shut down, Prometheus would have likely switched over to the new one without issues.

Again, at least in theory. If you're confident about your claims and want us to investigate further, we could create a test scaffolding on our end to confirm the behavior.
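A very rough way to see the connection-reuse behaviour from a client, with curl standing in for Prometheus and assuming the broker serves its metrics at /metrics over HTTPS (both assumptions):

```
# Two requests in one curl invocation share a TCP connection when the server
# allows keep-alive: the verbose output shows "Re-using existing connection"
# for the second request, and the name is only resolved once.
curl -sv -o /dev/null -o /dev/null \
  https://snowflake-broker.torproject.net/metrics \
  https://snowflake-broker.torproject.net/metrics \
  2>&1 | grep -iE 'connected to|re-?using'
```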
I see how it works, but I'm not sure how to predict a good number to configure there. When in doubt, I think I prefer to keep it as it is and add to our setup documentation that we need to restart prometheus.
> I see how it works, but I'm not sure how to predict a good number to configure there. When in doubt, I think I prefer to keep it as it is and add to our setup documentation that we need to restart prometheus.
Right. So for the record, I am not arguing that we restart prometheus for this. I would say that you should restart the old server, so that it kicks out that prometheus socket; then prometheus will check DNS and rotate to the new server.
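Concretely, that could be either of these on the old broker (a sketch; it assumes the old broker runs nginx, and PROMETHEUS2_IP is a placeholder for prometheus2's address):

```
# Option 1: restart the webserver, which closes all established connections,
# including the one Prometheus is holding open.
systemctl restart nginx

# Option 2: kill just the established connection(s) from prometheus2 in the
# kernel (needs a kernel built with sock_diag destroy support).
ss -K dst PROMETHEUS2_IP
```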
notice that nginx also has keepalive_time, which defaults to 1h. so I think this means that even if the 1000-request limit was not hit, prom should have been forced to reconnect after just 1h.
actually, maybe we can do something about this. if we set the keepalive for http connections, it could possibly force prometheus to reset connections after that time has passed.
@meskio I can't find access to that host. can I ask you to check, in the webserver configuration (nginx?), whether there's a connection keepalive configured?
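If it is nginx, something like this on the broker would show whatever keep-alive settings are in effect (a sketch; nginx -T needs root):

```
# Dump the full effective nginx configuration and pull out the keep-alive knobs;
# anything not listed here is running with the compiled-in defaults.
nginx -T 2>/dev/null | grep -nE 'keepalive_(requests|timeout|time)'
nginx -v   # the keepalive_time directive only exists in nginx >= 1.19.10
```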