- Sep 26, 2024
-
- Sep 25, 2024
-
- Sep 24, 2024
-
-
Jérôme Charaoui authored
- Sep 23, 2024
-
- Sep 20, 2024
-
- Sep 19, 2024
-
-
lelutin authored
Starting with the "fourth" timer, knowing about alert grouping and how it's done is useful to understand the rest of the timers, so I've added a bit of context there. During discussions on IRC, we took the time to dig into the alertmanager code and we've confirmed what @cks was mentioning in their blog post: once a group is created for a route, a thread is launched for processing new notifications every `group_interval`, so that setting is really like a group-specific ticker for new notifications.
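As a rough illustration of that ticker behaviour, here is a minimal Python sketch (not alertmanager's actual Go code). It assumes the group's ticker starts at `group_start` and fires every `group_interval` seconds, and it deliberately ignores `group_wait` and `repeat_interval`. The function name `next_flush` and its parameters are made up for this example.

```python
import math


def next_flush(arrival: float, group_start: float, group_interval: float) -> float:
    """Return the time of the first group tick at or after `arrival`.

    Simplified model (an assumption, not alertmanager's real logic): the
    group ticks at group_start, group_start + group_interval, and so on,
    and a new alert is only sent out on the next tick.
    """
    if group_interval <= 0:
        raise ValueError("group_interval must be positive")
    if arrival <= group_start:
        # The alert was already present when the group was created.
        return group_start
    # Number of whole intervals needed to reach or pass the arrival time.
    ticks = math.ceil((arrival - group_start) / group_interval)
    return group_start + ticks * group_interval


if __name__ == "__main__":
    # Alert joins an existing group 130s after creation, group_interval=300s.
    print(next_flush(130, 0, 300))
```

With a `group_interval` of 300 seconds, an alert that joins the group 130 seconds after its creation is only sent out at the 300-second tick in this model, no matter how early within the interval it arrived.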
- Sep 18, 2024
-
-
lelutin authored
Using cumin's batch size is still a possibility to avoid issues, but it is preferable to set yourself up for direct ssh connections and to avoid the batch size unless necessary. If a direct ssh connection is not possible, the batch size hack remains an option. Using it does have some side effects that one should be aware of, though. Small correction in the text after my tests today: the limitation is imposed by the MaxStartups setting, not MaxSessions.
-
lelutin authored
Without this, if you have some blocks in your ssh config that set you up for connecting to certain hosts as an unprivileged user, you'll end up running cumin commands as that user, and very probably failing. Cumin is mostly used for running ad-hoc admin commands on hosts, so it makes sense to force it to connect as root.
-
anarcat authored
This has been bugging me since basically forever: the howto/template is not a template for the "howto" section (which is now poorly defined anyway) at *all*. It's precisely the template for *services*, and really just belongs there. I've been hesitant to perform that rename for a long time. First because GitLab wikis didn't support redirects (they do now, and we add one here), but also because we probably link to the wiki-replica version of this in a few places. I've tried to fix the links inside the wiki, but there are certainly others that will break. We'll fix those as we go. For now it seems better and more intuitive to have this in the right place than to preserve the legacy location.
-
- Sep 17, 2024
-
-
lelutin authored
By using a numbered list with unnumbered subpoints, we can convey a good sense of all the systems that collaborate on monitoring. The higher-level list is numbered since it follows the order in which things happen: first the alert is created by prom, then it is received by alertmanager, and finally it is consulted by sysadmins via karma/grafana.
-
lelutin authored
The first paragraph says that we are not using prom alerting, and while it's still technically true that we haven't fully switched to it yet, we do have alerts for TPA services in prometheus now and we're slowly moving towards switching to that completely. So we might as well change that now to say that we do indeed use this for our monitoring. The "Looking for alerts" paragraph gives a better overview of things if we present the URLs that one needs to know about as a list, with reduced verbosity.