- Sep 23, 2024
-
-
anarcat authored
-
anarcat authored
-
anarcat authored
We'd use our own image here, but I can't find a stable-backports image (base-images#13).
-
anarcat authored
This will reduce the impact on Docker hub rate limiting and improve our supply chain, among many other things.
-
anarcat authored
-
anarcat authored
Those are queries I often find myself having to dig out of dashboards and alerts, but that are useful on their own.
-
- Sep 22, 2024
-
- Sep 21, 2024
-
-
Sebastian Hahn authored
-
- Sep 20, 2024
- Sep 19, 2024
-
-
lelutin authored
-
anarcat authored
I didn't find the result to be particularly legible, and it had lost the separation of source code references I had before. I am not sure we should dig too much into implementation details (like "threads"), but I kept that anyways. This is mostly reformulations.
-
lelutin authored
Starting with the "fourth" timer, knowing about alert grouping and how it's done is useful to understand the rest of the timers, so I've added a bit of context there. During discussions on IRC, we took the time to dig into the alertmanager code and we've confirmed what @cks was mentioning in their blog post: once a group is created for a route, a thread is launched for processing new notifications every `group_interval`, so that setting is really like a group-specific ticker for new notifications.
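A rough illustration of the ticker behaviour described above, as a sketch only: this is not Alertmanager's code (which is written in Go), and the function names and parameters are hypothetical. It assumes the first flush of a group happens `group_wait` seconds after the group is created, and that later flushes happen on a fixed tick every `group_interval`, regardless of when new alerts join the group.

```python
def group_flush_times(created_at, group_wait, group_interval, until):
    """Return the times (in seconds) at which a group would be flushed.

    Hypothetical model: first flush after group_wait, then one tick
    every group_interval for as long as the group exists.
    """
    if group_interval <= 0:
        raise ValueError("group_interval must be positive")
    times = []
    t = created_at + group_wait
    while t <= until:
        times.append(t)
        t += group_interval
    return times


def next_notification(alert_time, flush_times):
    """Return the first flush tick at or after alert_time, or None."""
    for t in flush_times:
        if t >= alert_time:
            return t
    return None


# A group created at t=0 with group_wait=30s and group_interval=300s.
ticks = group_flush_times(0, 30, 300, 1000)
# An alert joining the group at t=100 waits for the next tick of that group.
when = next_notification(100, ticks)
```

In this model, an alert that joins an existing group is not sent right away: it waits for that group's next tick, which is what makes `group_interval` behave like a per-group ticker.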
-
anarcat authored
-
anarcat authored
I don't understand wtf is going on here, but it looks like edits done through the wiki interface somehow rewrite the entire file with DOS line endings.
-
anarcat authored
-
anarcat authored
-
anarcat authored
-
anarcat authored
-
groente authored
-
- Sep 18, 2024
-
-
lelutin authored
Using cumin's batch size is still a possibility to avoid issues, but it is preferred to configure yourself for direct ssh connections and avoid using the batch size if not necessary. If a direct ssh connection is not possible, the batch size hack remains available, though it has some side effects that one should be aware of. Small correction in the text after my tests today: the limitation is imposed by the MaxStartups setting, not MaxSessions.
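For context on why MaxStartups matters here, a minimal sketch of how sshd's `MaxStartups start:rate:full` setting is documented to behave. The function name is hypothetical, the values shown are the OpenSSH defaults (`10:30:100`), and the actual settings on our hosts may differ.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Probability that sshd refuses a new connection.

    Model of MaxStartups start:rate:full: below `start` pending
    unauthenticated connections nothing is dropped, at `start` the
    drop probability is rate/100, and it grows linearly until every
    new connection is refused at `full`.
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    base = rate / 100
    return base + (1 - base) * (unauthenticated - start) / (full - start)


# Drop probability for a few counts of pending connections.
for pending in (5, 10, 55, 100):
    print(pending, round(drop_probability(pending), 2))
```

Under this model, a burst of simultaneous connections through a single sshd can start being refused well before the hard limit is reached, which is the effect a smaller batch size works around.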
-
lelutin authored
Without this, if you have some blocks in your ssh config that set you up for connecting to certain hosts as an unprivileged user, you'll end up running cumin commands as that user and very probably failing. Cumin is mostly used for running ad-hoc admin commands on hosts, so it makes sense to force it to connect as root.
-
anarcat authored
-
anarcat authored
-
anarcat authored
I was looking for the answer to "what are the metrics here".
-
anarcat authored
Amazingly, those two didn't know each other... At least now we can find one another when looking at one, but perhaps the triage stuff could be merged in the labels proposal?
-
anarcat authored
-
anarcat authored
This has been bugging me since basically forever: the howto/template is not a template for the "howto" section (which is now poorly defined anyways) at *all*. It's precisely the template for *services*, and really just belongs there. I've been hesitant in performing that rename for a long time. First because GitLab wikis didn't support redirects (they do now, and we add one here), but also because we probably link to the wiki-replica version of this in a few places. I've tried to fix the links inside the wiki, but there are certainly others that will break. We'll fix those as we go. For now it seems better and more intuitive to have this at the right place than preserve the legacy location.
-
anarcat authored
This is so we have a runbook to link to in a new alert about this.
-
anarcat authored
-
anarcat authored
-
anarcat authored
That was relevant in 2019, when we actually were replacing Munin (which "died in a fire"), but we're really far past that now. Perhaps we could have a "migrating from Nagios" section here as well though, see also team#41655.
-
anarcat authored
We were missing key bits about the firewall rules and a simpler example for `collect_scrape_jobs`.
-
- Sep 17, 2024
-
-
lelutin authored
Clearer distinction between recipients and routes. They're set in different hiera keys. The definitions are now in hiera, not in puppet manifests, so the examples need to be refreshed.
-
lelutin authored
Currently rules are *not* defined in puppet. However, scrape jobs and targets should be for all TPA-related services.
-
lelutin authored
By using a numbered list with unnumbered subpoints, we can convey a good sense of all the systems that collaborate for monitoring. The higher-level list is numbered since it follows the path of what happens in time: first the alert is created by prom, then received by alertmanager, and finally consulted by sysadmins via karma/grafana.
-
lelutin authored
The first paragraph says that we are not using prom alerting, and while it's still technically true that we haven't fully switched to it yet, we do have alerts for TPA services in prometheus now and we're slowly moving towards switching to that completely. So we might as well change that now to say that we do indeed use this for our monitoring. The "Looking for alerts" paragraph gives a better overview of things if we present the URLs that one needs to know about as a list, with reduced verbosity.
-