prometheus2 ran out of disk space
Date: Mon, 24 Oct 2022 11:30:23 +0000
From: nagios@hetzner-hel1-01.torproject.org
To: anarcat+rapports@orangeseeds.org
Subject: ** PROBLEM Service Alert: hetzner-nbg1-02/disk usage - all is CRITICAL **
***** Icinga *****
Notification Type: PROBLEM
Service: disk usage - all
Host: hetzner-nbg1-02
Address: 116.203.55.206
State: CRITICAL
Date/Time: Mon Oct 24 11:30:23 UTC 2022
Additional Info:
DISK CRITICAL - free space: / 1955 MB (5% inode=95%): /dev 1897 MB (100% inode=99%): /dev/shm 1917 MB (99% inode=99%): /run 383 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /tmp 512 MB (100% inode=99%): /boot 327 MB (77% inode=99%): /run/credentials 383 MB (99% inode=99%): /var/tmp 1955 MB (5% inode=95%):
anecdotal reports of "prometheus 2 is down" as well
action points:
- lint rules before merge (already filed as prometheus-alerts#1 (closed), will followup there)
- don't restart prometheus forever (filed https://bugs.debian.org/1022724, https://salsa.debian.org/go-team/packages/prometheus/-/merge_requests/5, deployed an override through puppet)
- prometheus shouldn't flood its logs with WAL notices if there's a syntax error (filed https://github.com/prometheus/prometheus/issues/11486)
- anarcat added Doing Prometheus labels
- anarcat mentioned in merge request prometheus-alerts!14 (merged)
it suddenly ate up all disk space a few days ago:
logs have filled up the drive:
ncdu 1.15.1 ~ Use the arrow keys to navigate, press ? for help
--- /var/log ----------------------------------------------------------------
                            /..
   10.1 GiB [##########]  daemon.log.1
    5.4 GiB [#####     ]  daemon.log
    3.8 GiB [###       ] /journal
    3.2 GiB [###       ]  syslog
    2.2 GiB [##        ]  syslog.1
  164.2 MiB [          ]  syslog.2.gz
root@hetzner-nbg1-02:/var/log# wc -l daemon.log.1
55311822 daemon.log.1
it's mostly prometheus:
root@hetzner-nbg1-02:/var/log# awk '{print $5}' daemon.log.1 | sed 's/\[.*//' | head -10000000 | sort | uniq -c | sort -n
      1 bacula-fd
      1 fstrim
      1 lldpcli
      1 sysctl
      1 systemd-fsck
      2 systemd-modules-load
      2 systemd-shutdown:
      3 systemd-cryptsetup
      4 prometheus-apache-exporter
      4 prometheus-blackbox-exporter
      4 syslog-ng
      4 systemd-udevd
      5 haveged
      8 lldpd
      8 prometheus-alertmanager
      9 tor
     13 ifdown
     14 sh
     40 ferm
     42 ulogd
     53 prometheus-node-exporter
     67 dhclient
     78 systemd:
     88 puppet-agent
    112 Tor
    121 ntpd
    329 unbound
    509 systemd-journald
   2249 grafana-server
  47294 systemd
  51637 nrpe
9897296 prometheus
3 days ago, at 3:21:05, puppet noticed prometheus was down, and tried to restart it:
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 puppet-agent[34600]: (/Stage[main]/Prometheus::Run_service/Service[prometheus]/ensure) ensure changed 'stopped' to 'running' (corrective)
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.441Z caller=main.go:355 msg="Starting Prometheus" version="(version=2.24.1+ds, branch=debian/sid, revision=2.24.1+ds-1+b7)"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.441Z caller=main.go:360 build_context="(go=go1.15.9, user=pkg-go-maintainers@lists.alioth.debian.org, date=20210804-20:00:23)"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.441Z caller=main.go:361 host_details="(Linux 5.10.0-19-amd64 #1 SMP Debian 5.10.149-1 (2022-10-17) x86_64 hetzner-nbg1-02 (none))"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.442Z caller=main.go:362 fd_limits="(soft=8192, hard=8192)"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.442Z caller=main.go:363 vm_limits="(soft=unlimited, hard=unlimited)"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.446Z caller=web.go:468 component=web msg="Start listening for connections" address=0.0.0.0:9090
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.447Z caller=main.go:722 msg="Starting TSDB ..."
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.448Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1634126400000 maxt=1635876000000 ulid=01FKH7786PJNNMXGZYF7M0DPY8
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.448Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1635876000000 maxt=1637625600000 ulid=01FN5BRTG8Z76G9FBG0W2BSV6Q
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.449Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1637625600000 maxt=1639375200000 ulid=01FPSGACWEDEQ8RDD0BKN41A28
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.449Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1639375200000 maxt=1641124800000 ulid=01FRDMVXTK0H1NBDVT9TDM2EPS
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.449Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1641124800000 maxt=1642874400000 ulid=01FT1SDFXK9QG2VWR696X4MXYP
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.449Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1642874400000 maxt=1644624000000 ulid=01FVNXZ2ESZCWMJ33HDM5N3QJ3
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.450Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1644624000000 maxt=1646373600000 ulid=01FXA2GTJD4RA5HMFF980ZBZSB
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.450Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1646373600000 maxt=1648123200000 ulid=01FYY72A5BH9TNFBEVWV45YAZ1
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.450Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1648123200000 maxt=1649872800000 ulid=01G0JBKSZ26PCB9YF0GFAD0K9W
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.451Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1649872801112 maxt=1651622400000 ulid=01G26Q15M6W7NZ8EQR0GY2VA8Y
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.451Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1651622400267 maxt=1653372000000 ulid=01G3TVJHYPBVP078DH933Y69HF
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.452Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1653372000206 maxt=1655121600000 ulid=01G5F04B2R8J9Y0CBEYMJEX9B5
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.452Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1655121600747 maxt=1656871200000 ulid=01G734NWGMZWKC6YJ13A0K2FK8
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.452Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1656871200747 maxt=1658620800000 ulid=01G8Q97GJNQ0Q0Y0R8CV5CC3NN
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.452Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1658620800747 maxt=1660370400000 ulid=01GABDS1FEM0K5M458GPH5X805
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.453Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1660370400087 maxt=1662120000000 ulid=01GBZJAJFTFBYHHTEXPG4NRX7H
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.453Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1662120003174 maxt=1663869600000 ulid=01GDKPW3E73AHR3FAYPP9CPSKY
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.453Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1663869601859 maxt=1665619200000 ulid=01GF7VDR42FYHGQAQTR5D1J9TJ
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.454Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1665619200747 maxt=1666202400000 ulid=01GFS7KFRPN8696D3H2AA14K50
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.455Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666267200747 maxt=1666274400000 ulid=01GFTYH4K5NJT43BNQXRDEM7TY
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.456Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666274400747 maxt=1666281600000 ulid=01GFV5CVVDA0MTB3Y0FDP0B6G1
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.456Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666202400747 maxt=1666267200000 ulid=01GFV5CXBR7ZH5WQQJ10PS5KTN
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.456Z caller=tls_config.go:191 component=web msg="TLS is disabled." http2=false
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.456Z caller=repair.go:57 component=tsdb msg="Found healthy block" mint=1666281600747 maxt=1666288800000 ulid=01GFVC8K35EWJM24WVPEHDR8B6
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.580Z caller=head.go:645 component=tsdb msg="Replaying on-disk memory mappable chunks if any"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.593Z caller=head.go:659 component=tsdb msg="On-disk memory mappable chunks replay completed" duration=12.072812ms
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.593Z caller=head.go:665 component=tsdb msg="Replaying WAL, this may take a while"
Oct 21 03:21:05 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:05.857Z caller=head.go:691 component=tsdb msg="WAL checkpoint loaded"
Oct 21 03:21:06 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:06.449Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4702 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.457Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4703 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.647Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4704 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.647Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4705 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.649Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4706 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.650Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4707 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.650Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4708 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.650Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4709 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.651Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4710 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.651Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4711 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.651Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4712 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.651Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4713 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.653Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4714 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.654Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4715 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.654Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4716 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.655Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4717 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.655Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4718 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.655Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4719 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.655Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4720 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.662Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4721 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.662Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4722 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.663Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4723 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.663Z caller=head.go:717 component=tsdb msg="WAL segment loaded" segment=4724 maxSegment=4724
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.663Z caller=head.go:722 component=tsdb msg="WAL replay completed" checkpoint_replay_duration=264.4415ms wal_replay_duration=1.805117901s total_replay_duration=2.082034151s
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.748Z caller=main.go:742 fs_type=EXT4_SUPER_MAGIC
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.749Z caller=main.go:745 msg="TSDB started"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.749Z caller=main.go:871 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=error ts=2022-10-21T03:21:07.757Z caller=manager.go:946 component="rule manager" msg="loading groups failed" err="/etc/prometheus-alerts/rules.d/rdsys.rules: 0:0: group \"rdsys\", rule 1, \"\": one of 'record' or 'alert' must be set"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=error ts=2022-10-21T03:21:07.757Z caller=main.go:891 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.757Z caller=main.go:595 msg="Stopping scrape discovery manager..."
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.757Z caller=main.go:609 msg="Stopping notify discovery manager..."
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.757Z caller=main.go:631 msg="Stopping scrape manager..."
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.757Z caller=main.go:605 msg="Notify discovery manager stopped"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.760Z caller=main.go:591 msg="Scrape discovery manager stopped"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.761Z caller=manager.go:924 component="rule manager" msg="Stopping rule manager..."
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.761Z caller=manager.go:934 component="rule manager" msg="Rule manager stopped"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.761Z caller=main.go:625 msg="Scrape manager stopped"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.764Z caller=notifier.go:601 component=notifier msg="Stopping notification manager..."
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=info ts=2022-10-21T03:21:07.764Z caller=main.go:799 msg="Notifier manager stopped"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=error ts=2022-10-21T03:21:07.765Z caller=main.go:808 err="error loading config from \"/etc/prometheus/prometheus.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/prometheus.yml\")"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Main process exited, code=exited, status=1/FAILURE
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Failed with result 'exit-code'.
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Consumed 2.534s CPU time.
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 1.
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Consumed 2.534s CPU time.
Oct 21 03:21:08 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[35222]: level=info ts=2022-10-21T03:21:08.082Z caller=main.go:355 msg="Starting Prometheus" version="(version=2.24.1+ds, branch=debian/sid, revision=2.24.1+ds-1+b7)"
this would go on for a while: every time puppet ran, it would start prometheus, which would then fail with the same error:
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=error ts=2022-10-21T03:21:07.757Z caller=manager.go:946 component="rule manager" msg="loading groups failed" err="/etc/prometheus-alerts/rules.d/rdsys.rules: 0:0: group \"rdsys\", rule 1, \"\": one of 'record' or 'alert' must be set"
Oct 21 03:21:07 hetzner-nbg1-02/hetzner-nbg1-02 prometheus[34774]: level=error ts=2022-10-21T03:21:07.765Z caller=main.go:808 err="error loading config from \"/etc/prometheus/prometheus.yml\": one or more errors occurred while applying the new configuration (--config.file=\"/etc/prometheus/prometheus.yml\")"
... or, more clearly:
err="/etc/prometheus-alerts/rules.d/rdsys.rules: 0:0: group \"rdsys\", rule 1, \"\": one of 'record' or 'alert' must be set"
so this is probably due to a syntax error introduced in prometheus-alerts!14 (merged), but that's beside the point... what's interesting is that this went chugging along for a while... still not affecting disk space usage significantly.
systemd actually tried to restart the service repeatedly like this for a long time. it's been going on for 3 days, actually, and would have kept on going if i didn't do
systemctl stop prometheus
a few minutes ago:

root@hetzner-nbg1-02:/var/log# tac /var/log/daemon.log.1 | grep 'Scheduled restart job, restart counter' | head -1
Oct 23 00:00:00 hetzner-nbg1-02/hetzner-nbg1-02 systemd[1]: prometheus.service: Scheduled restart job, restart counter is at 19746.
that ridiculous process somehow peaked yesterday:
root@hetzner-nbg1-02:/var/log# cat /var/log/daemon.log | grep 'Scheduled restart job, restart counter' | awk 'BEGIN { m=0} { if ($NF > m) { d = $1 " " $2 " " $3; m = $NF } } END { print d " " m }'
Oct 23 08:08:12 22215.
what's strange is that none of this significantly impacted disk space until around 11:30, according to grafana, when disk usage exploded first exponentially, and then linearly:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-instance=hetzner-nbg1-02.torproject.org&from=1666347239427&to=1666452688246
one thing that is peculiar though is that even though prometheus would fail to start, it would seem to move its WAL pointer every time, which meant it had a longer and longer log to replay at every startup. take for example:
root@hetzner-nbg1-02:/var/log# zgrep -o 'maxSegment=[0-9]*' syslog.3.gz | head -1
maxSegment=4724
root@hetzner-nbg1-02:/var/log# zgrep -o 'maxSegment=[0-9]*' syslog.3.gz | tail -1
maxSegment=14937
root@hetzner-nbg1-02:/var/log# grep -o 'maxSegment=[0-9]*' syslog | head -1
maxSegment=26972
root@hetzner-nbg1-02:/var/log# tac syslog | grep -o 'maxSegment=[0-9]*' | head -1
maxSegment=30073
so while this can't quite explain why disk space usage didn't start to explode at 3:21:05, it certainly gives us a good idea of why it did eventually explode. maybe there's something else going on here, but it's pretty slow grepping tens of gigabytes of logs. so i'm not sure what the root cause is, because of the discrepancy between the log dates and the actual disk space rise in grafana. but it sure as hell seems like systemd's restart procedures (or the unit configuration, to be more accurate) are to blame.
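for what it's worth, the WAL growth can also be confirmed directly on disk; a quick sketch (it assumes the debian packaging default of /var/lib/prometheus/metrics2 for --storage.tsdb.path, adjust if that differs):

du -sh /var/lib/prometheus/metrics2/wal
# the highest-numbered segment file should roughly match the maxSegment seen in the logs
ls /var/lib/prometheus/metrics2/wal | tail -3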
it seems we need to limit the number of retries systemd does to restart this service. 2-3 should be enough, it certainly shouldn't retry forever. i'll file a ticket for this in the debian package.
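a systemd drop-in along these lines would do it (just a sketch of the idea with illustrative values; the override actually deployed through puppet and the eventual debian fix may differ):

mkdir -p /etc/systemd/system/prometheus.service.d
cat > /etc/systemd/system/prometheus.service.d/override.conf <<'EOF'
[Unit]
# give up after 5 failed starts within 10 minutes instead of retrying forever
StartLimitIntervalSec=600
StartLimitBurst=5
EOF
systemctl daemon-reload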
we should also have linting in the rules repo so we catch errors like this before deployment. this is another ticket to make.
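a pre-merge check with promtool would catch this class of error; a sketch (the rules.d/*.rules glob is an assumption about the repo layout):

# promtool ships with the prometheus package
promtool check rules rules.d/*.rules

# for illustration, a minimal made-up rule with an expr but neither 'record:'
# nor 'alert:' fails the same validation seen in the logs above:
cat > /tmp/bad.rules <<'EOF'
groups:
  - name: rdsys
    rules:
      - expr: up == 0
EOF
promtool check rules /tmp/bad.rules   # fails: one of 'record' or 'alert' must be set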
to restore the service, i think i'll have no choice but to trash at least some part of the logs and move on, otherwise we're never going to get out of this hell hole.
fixed the syntax error on the server; restarting the server didn't work. prometheus still stops with error 1 after trying to replay its log.
maybe 2.3G is not enough room, i'll clear out more logs:
logrotate -f /etc/logrotate.d/syslog-ng
logrotate -f /etc/logrotate.d/syslog-ng
root@hetzner-nbg1-02:~# df -h /
Filesystem          Size  Used Avail Use% Mounted on
/dev/mapper/croot    37G   30G  5.3G  85% /
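the journal was also sitting at ~3.2 GiB in the ncdu output above; vacuuming it would have been another way to claw back space, though that's not what was done here:

# shrink the systemd journal down to a target size
journalctl --vacuum-size=500M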
- anarcat mentioned in commit prometheus-alerts@430fd7bd
syntax was fixed in prometheus-alerts@430fd7bd
next up is to file followup tickets.
- anarcat mentioned in issue prometheus-alerts#1 (closed)
- anarcat marked this issue as related to prometheus-alerts#1 (closed)
- anarcat changed the description
- anarcat changed the description
- anarcat marked the checklist item don't restart prometheus forever (filed https://bugs.debian.org/1022724, https://salsa.debian.org/go-team/packages/prometheus/-/merge_requests/5) as completed
- anarcat changed the description
- anarcat closed
- anarcat changed the incident status to Resolved by closing the incident
disk space is almost back to normal:
https://grafana.torproject.org/d/zbCoGRjnz/disk-usage?orgId=1&var-instance=hetzner-nbg1-02.torproject.org&from=now-4d&to=now
we have a disk usage warning on loghost that might be related to this, but i'll let the new star of the week (@kez) handle it normally.
- anarcat mentioned in issue #40942 (closed)
- anarcat marked this issue as related to #40942 (closed)
- anarcat mentioned in issue #40935 (closed)
- anarcat mentioned in commit wiki-replica@d14fc32c
- anarcat mentioned in issue #40417 (closed)