We have changed the IP address of the snowflake broker. Prometheus2 is polling metrics from it using the domain name snowflake-broker.torproject.net, which has been updated, but I believe it is still polling the old broker. Might it be that Prometheus doesn't update the DNS resolution without a restart? If so, can you restart prometheus2?
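One way to check what it is actually talking to, without restarting anything, is to look at the established sockets on prometheus2. A sketch; the port is an assumption, use whatever the snowflake-broker scrape job actually targets:

```
# List established connections owned by the prometheus process; the peer
# address column shows which broker IP is actually being scraped.
# ":443" is an assumption about the scrape port.
ss -tnp state established '( dport = :443 )' | grep prometheus
```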
```
root@hetzner-nbg1-02:~# ping snowflake-broker.torproject.net
PING snowflake-broker.torproject.net(2a00:c6c0:0:151:4:ae99:c0a9:d585 (2a00:c6c0:0:151:4:ae99:c0a9:d585)) 56 data bytes
64 bytes from 2a00:c6c0:0:151:4:ae99:c0a9:d585 (2a00:c6c0:0:151:4:ae99:c0a9:d585): icmp_seq=1 ttl=55 time=10.2 ms
root@hetzner-nbg1-02:~# ping -4 snowflake-broker.torproject.net
PING (37.218.242.175) 56(84) bytes of data.
64 bytes from 37.218.242.175 (37.218.242.175): icmp_seq=1 ttl=54 time=9.90 ms
```
according to the metric names in the description, the job that's not polling the right place is snowflake-broker, so not the blackbox one. so the faulty thing would be the prometheus daemon itself in this case.
if I poke directly at prometheus2 for the value of the metric, I can confirm that the number is currently 77625112, so it grew since @meskio last poked at it.
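For the record, that kind of poke can be done against the Prometheus HTTP API. A rough sketch, where `snowflake_broker_metric_total` is only a placeholder for the actual metric name mentioned in the description:

```
# Ask prometheus2 for the current value of the broker metric.
# "snowflake_broker_metric_total" is a placeholder, not the real metric name.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=snowflake_broker_metric_total' | jq .
```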
if I query the old IP directly, I can see the 77M number:
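Something along these lines, with OLD_IP standing in for the previous broker address (the /metrics path and the HTTPS port are assumptions about how the exporter is exposed):

```
# Pin the broker name to the old address so curl bypasses the DNS change,
# then fetch the metrics from the old machine directly.
curl -s --resolve snowflake-broker.torproject.net:443:OLD_IP \
  https://snowflake-broker.torproject.net/metrics | grep -i snowflake
```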
so prom keeps the http connections alive, which means it won't follow dns changes correctly. that's pretty annoying; if the connection was reset every so often, it would remove the issue of dns changes going unnoticed without putting too much pressure on http connection renewals.
@meskio could it be that the server allows an infinite number of Keep-Alive requests? that would let Prometheus keep the socket open forever and never check DNS...
@anarcat you mean on the snowflake-broker? I think it's whatever the default is for nginx in Debian.
Yes. Then I think that's what's happening: nginx allows a surprisingly large number of keep-alive requests (1000) before closing the socket on the client. My theory is that Prometheus was still scraping the old server simply because it was still on the same TCP connection to the web server, which was never closed because of the keep-alive.

With a 1000-request limit, at one scrape per minute, this means up to 1000 minutes of delay before Prom notices the rotation, which is over 16 hours!

If you want a faster rotation, you can lower the keep-alive limit or kill the socket, either by resetting the connection in the kernel or restarting the webserver entirely.
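If lowering the limit were the route taken, a minimal sketch on the broker could look like this, assuming a stock Debian nginx that includes /etc/nginx/conf.d/*.conf in the http block (the path and the value 100 are assumptions, not the broker's actual configuration):

```
# Drop-in that lowers the per-connection request limit (the default is 1000),
# forcing clients to reopen connections, and thus re-resolve DNS, sooner.
cat > /etc/nginx/conf.d/keepalive.conf <<'EOF'
keepalive_requests 100;
EOF
nginx -t && systemctl reload nginx
```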
No, it doesn't. It was not 16h: we waited 2 days before asking you to fix it. But maybe it's just your calculation that is wrong.
I'm not totally understanding how keep-alive will affect this: if I understand correctly, prometheus just opens 1 connection and keeps it alive forever, so even setting the number in nginx to 1 will not fix this problem, as only 1 connection will be kept alive.
That's the nginx keepalive_requests setting, which defaults to 1000. I would assume Debian doesn't change that default configuration.

1000 minutes, if I count this right, is 16 hours and 40 minutes (1000 / 60 == 100 / 6 == 16 and two thirds hours). Happy to be proven wrong.

If you waited two days, then something else is going on, because 16h is a maximum.

Honestly, I'm really surprised to read "prometheus not updating DNS": I think Prom follows DNS and, according to upstream, doesn't keep its own DNS cache.

I'll note that the DNS TTL for that domain is 3600 (1h), not that it changes anything for "no update in two days".
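For reference, the TTL is easy to double-check from any host:

```
# Show the addresses the name points to; the second column is the TTL
# (remaining TTL when answered from a caching resolver).
dig +noall +answer snowflake-broker.torproject.net A
dig +noall +answer snowflake-broker.torproject.net AAAA
```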
> I'm not totally understanding how keep-alive will affect this: if I understand correctly, prometheus just opens 1 connection and keeps it alive forever, so even setting the number in nginx to 1 will not fix this problem, as only 1 connection will be kept alive.
Hm... The setting I refer to above is a counter on the number of requests a client is allowed to issue before the server closes that "1 connection". It doesn't say "how many parallel requests a client can open". So I think it is exactly what I am describing here.

To reuse your example, if you set this to 1, then it's pretty much equivalent to disabling keep-alive: clients are allowed to make a single request per socket, and, in theory, we wouldn't have had this problem.
I'd also point out that if the old broker had been shut down, Prometheus would have likely switched over to the new one without issues.

Again, at least in theory. If you're confident about your claims and want us to investigate further, we could create a test scaffolding on our end to confirm the behavior.
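A very rough way to see the connection-reuse behaviour from a client, with curl standing in for Prometheus and assuming the broker serves its metrics at /metrics over HTTPS (both assumptions):

```
# Two requests in one curl invocation share a TCP connection when the server
# allows keep-alive: the verbose output shows "Re-using existing connection"
# for the second request, and the name is only resolved once.
curl -sv -o /dev/null -o /dev/null \
  https://snowflake-broker.torproject.net/metrics \
  https://snowflake-broker.torproject.net/metrics \
  2>&1 | grep -iE 'connected to|re-?using'
```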
I see how it works, but I'm not sure how to predict a good number to configure there. When in doubt, I think I prefer to keep it as it is and add to our setup documentation that we need to restart prometheus.
> I see how it works, but I'm not sure how to predict a good number to configure there. When in doubt, I think I prefer to keep it as it is and add to our setup documentation that we need to restart prometheus.
Right. So for the record, I am not arguing that we restart prometheus for this. I would say that you should restart the old server, so that it kicks out that prometheus socket; then prometheus will check DNS and rotate to the new server.
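Concretely, that could be either of these on the old broker (a sketch; it assumes the old broker runs nginx, and PROMETHEUS2_IP is a placeholder for prometheus2's address):

```
# Option 1: restart the webserver, which closes all established connections,
# including the one Prometheus is holding open.
systemctl restart nginx

# Option 2: kill just the established connection(s) from prometheus2 in the
# kernel (needs a kernel built with sock_diag destroy support).
ss -K dst PROMETHEUS2_IP
```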
notice that nginx also has keepalive_time, which defaults to 1h. so I think this means that even if the 1000-request limit was not hit, prom should have been forced to reconnect after just 1h.
actually, maybe we can do something about this. if we set the keepalive for http connections, it could possibly force prometheus to reset connections after that time has passed.
@meskio I can't find access to that host. can I ask you to check, in the webserver configuration (nginx?), whether there's a connection keepalive configured?
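If it is nginx, something like this on the broker would show whatever keep-alive settings are in effect (a sketch; nginx -T needs root):

```
# Dump the full effective nginx configuration and pull out the keep-alive knobs;
# anything not listed here is running with the compiled-in defaults.
nginx -T 2>/dev/null | grep -nE 'keepalive_(requests|timeout|time)'
nginx -v   # the keepalive_time directive only exists in nginx >= 1.19.10
```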