prometheus2 "device or resource busy" errors
dgoulet reported on IRC:
<dgoulet> anarcat, lavamind, kez: I think prom2 is unhappy, it is screaming with "device or resource busy" errors
<+dgoulet> https://prometheus2.torproject.org/classic/targets
#40954 (closed) is possibly related. i don't have prometheus creds so i can't check
- anarcat marked this issue as related to #40954 (closed)
- anarcat mentioned in issue #40954 (closed)
looks like we have a pretty low fd_limit there...
Nov 01 02:58:49 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T02:58:49.159Z caller=main.go:362 fd_limits="(soft=8192, hard=8192)"
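for the record, the limits of the running process can be double-checked directly against the kernel (PID 788 as per the log above; exact output will vary):

grep 'Max open files' /proc/788/limits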
logically, we'd also have prometheus2's self monitoring available in prometheus1, but it seems that wasn't correctly configured. each prometheus server is monitoring itself and that's it, which is ... not great.
but interestingly, it seems we're still far from the 8192 limit here:
root@hetzner-nbg1-02:~# systemctl status prometheus | cat
● prometheus.service - Monitoring system and time series database
     Loaded: loaded (/lib/systemd/system/prometheus.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/prometheus.service.d
             └─override.conf
     Active: active (running) since Tue 2022-11-01 02:58:48 UTC; 16h ago
       Docs: https://prometheus.io/docs/introduction/overview/
             man:prometheus(1)
   Main PID: 788 (prometheus)
      Tasks: 25 (limit: 4554)
     Memory: 1.8G
        CPU: 1h 25min 31.595s
     CGroup: /system.slice/prometheus.service
             └─788 /usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/metrics2 --storage.tsdb.retention.time=365d --web.console.libraries=/usr/local/share/prometheus/console_libraries --web.console.templates=/usr/local/share/prometheus/consoles

Nov 01 17:01:06 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:06.159Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01GGSME17799S1981HXV1RAN3P
Nov 01 17:01:06 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:06.160Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01GGSDJ9Z85BXK0DJ9QYF50E2E
Nov 01 17:01:07 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:07.077Z caller=compact.go:448 component=tsdb msg="compact blocks" count=3 mint=1667239200747 maxt=1667304000000 ulid=01GGT27D7M63T29F8QWTR22D7F sources="[01GGRS129GCFMQGF4XA2M7QDFR 01GGS6RG2M73TYS42K07KSV713 01GGT27D38ZTB5VM5W84THRC19]" duration=912.569507ms
Nov 01 17:01:07 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:07.099Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01GGT27D38ZTB5VM5W84THRC19
Nov 01 17:01:07 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:07.106Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01GGS6RG2M73TYS42K07KSV713
Nov 01 17:01:07 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T17:01:07.114Z caller=db.go:1191 component=tsdb msg="Deleting obsolete block" block=01GGRS129GCFMQGF4XA2M7QDFR
Nov 01 19:00:03 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T19:00:03.254Z caller=compact.go:507 component=tsdb msg="write block" mint=1667318400747 maxt=1667325600000 ulid=01GGT916Z8Q5P2RZ182P2J3JHH duration=77.476209ms
Nov 01 19:00:03 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T19:00:03.433Z caller=head.go:824 component=tsdb msg="Head GC completed" duration=170.644744ms
Nov 01 19:00:03 hetzner-nbg1-02 prometheus[788]: level=info ts=2022-11-01T19:00:03.623Z caller=checkpoint.go:95 component=tsdb msg="Creating checkpoint" from_segment=4702 to_segment=21694 mint=1667325600000
Nov 01 19:00:03 hetzner-nbg1-02 prometheus[788]: level=error ts=2022-11-01T19:00:03.984Z caller=db.go:745 component=tsdb msg="compaction failed" err="WAL truncation in Compact: create checkpoint: create segment reader: open segment:00012752 in dir:/var/lib/prometheus/metrics2/wal: open /var/lib/prometheus/metrics2/wal/00012752: too many open files"

root@hetzner-nbg1-02:~# ls /proc/788/fd | wc -l
139
root@hetzner-nbg1-02:~# lsof -p 788 | wc -l
205
root@hetzner-nbg1-02:~#
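my guess (not verified) is that the steady-state fd count is misleading: the failing checkpoint covers WAL segments 4702 through 21694, and if the checkpointing code opens a reader per segment, that's far more than 8192 files in flight during the checkpoint even though the process normally sits at ~140 fds. a quick sanity check would be to count the segments on disk (path taken from the error message above):

ls /var/lib/prometheus/metrics2/wal | wc -l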
yet this hack seems to fix the problem:
prlimit --pid 788 --nofile=100000:100000
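and to confirm the bump actually stuck (both soft and hard should now read 100000):

prlimit --pid 788 --nofile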
... so I guess i'm going to go ahead and tweak the systemd unit file to bump the resource limit here.
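concretely, that's just a drop-in raising LimitNOFILE; something like this (a sketch, the value simply mirrors the prlimit hack above):

# /etc/systemd/system/prometheus.service.d/override.conf
[Service]
LimitNOFILE=100000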
@dgoulet can you confirm this problem is fixed for you now?
Yes, I see my data has come back to my Grafana :).
- anarcat added Prometheus label
i restarted prometheus a few times since then, trying to get the override in place (and often forgetting a
systemctl daemon-reload
and having puppet overwrite my changes) but this should be permanently fixed. i'll file an MR against the debian package to fix the unit file, but it's fixed in puppet for now...
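for anyone retracing this, the sequence after editing the drop-in is roughly the following (the last command just confirms systemd picked up the override):

systemctl daemon-reload
systemctl restart prometheus
systemctl show prometheus --property=LimitNOFILE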
i filed https://salsa.debian.org/go-team/packages/prometheus/-/merge_requests/6 to push the unit file tweak in the debian package.
also note that i've noticed rather slow startup times here: it takes a full minute to replay the WAL on startup... there's an upstream feature flag to speed this up (memory-snapshot-on-shutdown), but it's only available starting from 2.30.0, which is not available in stable (but is in backports).
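once 2.30+ lands here (e.g. from backports), enabling it should just be a matter of passing the feature flag to the daemon, something like this in /etc/default/prometheus (untested sketch; the ARGS variable is how the debian package passes extra flags, as far as i know):

ARGS="--enable-feature=memory-snapshot-on-shutdown"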
otherwise this issue is fixed, i believe.
- anarcat closed
- anarcat mentioned in commit wiki-replica@d14fc32c
oh, and one last bit of info here, this resource was quite useful to diagnose and fix the problem as well:
https://www.robustperception.io/dealing-with-too-many-open-files/