From ae04389353d2d315dfeff260dd01391dddc06ce1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?=
Date: Wed, 27 Jul 2022 10:54:03 -0400
Subject: [PATCH 1/4] remove status and notification event history from
 dashboard

The rationale here is that the former is covered by the "trending"
requirement: previous metrics *will* have status history baked in.

I also strongly feel that notification history doesn't *necessarily*
belong in the dashboard. I'm thinking of a design where notifications
end up as tickets, because notifications *really* do signal an
important issue with the system (as opposed to the noise we're getting
now, which would be unmanageable as tickets).

If we want this in the dashboard, let's make it a SHOULD (i.e. not a
must-have), because "history" is already a requirement, and there are
many ways to implement it (including, but not limited to, a
dashboard).
---
 policy/tpa-rfc-33-monitoring.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 62f8811..1834a9d 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -185,7 +185,6 @@ monitoring system, as provided by TPA.
   execute certain tasks via the system dashboard:
   - silence an alert
   - schedule ad-hoc silences
-  - display status and notification event history
   - trigger a service check update
 
 * **automatic configuration**: monitoring MUST NOT require a manual
-- 
GitLab

From 1861e82a0a88d98b92ef0edbc216dd86e3e77b10 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?=
Date: Wed, 27 Jul 2022 11:01:58 -0400
Subject: [PATCH 2/4] remove manual service checks from MUST

Instead, I think what we really want is timely service checks.

The current system is peculiar, in my experience with Nagios, in that
it's *slow* for many things. Much of that slowness seems to be because
Nagios has never been tuned to do all checks within a certain period
(say, one minute). Instead, some checks can take many minutes to
finally go through, which means we need to (for example) manually
trigger checks to make sure things have recovered after an operation.

This is particularly painful for security updates, where there's a
second level of state kept: the `dsa-update-apt-status` hack, which
keeps its own state file that needs to be manually refreshed, outside
of the monitoring system, when updates are performed (manually or
automatically).

The new system MUST NOT fall into those traps; instead, it should
provide timely updates. There are many ways to do this with the
current system, from Icinga tuning, to removing or improving the
dsa-update-apt-status hack, to tweaking certain check frequencies, and
so on.

In Prometheus, all checks are done within a minute. Checks are assumed
to be "cheap", which means some checks are actually pre-computed
client-side. This makes it impossible to trigger those checks server
side, but I believe that's not a problem: when an update happens now
(for example), Prometheus picks it up immediately because it doesn't
have the dsa-update-apt-status hack. For needrestart, which is
admittedly more expensive, it would need a client-side trigger, but
that's nothing Cumin can't fix.
---
 policy/tpa-rfc-33-monitoring.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 1834a9d..280526b 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -185,7 +185,6 @@ monitoring system, as provided by TPA.
   execute certain tasks via the system dashboard:
   - silence an alert
   - schedule ad-hoc silences
-  - trigger a service check update
 
 * **automatic configuration**: monitoring MUST NOT require a manual
   intervention from TPA when a new server is provisioned, and new
   components added during the server lifetime should be picked up
@@ -205,6 +204,10 @@ monitoring system, as provided by TPA.
   not "load too high on runners"), which should help with alert
   fatigue and auto-configuration
 
+* **timely service checks**: the monitoring system should notice
+  issues promptly (within a minute or so), without us having to
+  trigger checks manually to verify service recovery, for example
+
 ### Nice to have
 
 * **alert notifications**: it SHOULD be possible for operators to
-- 
GitLab

From 503d14d493ae0f4a6ba0804e440585c5ca633b84 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?=
Date: Wed, 27 Jul 2022 11:08:11 -0400
Subject: [PATCH 3/4] decouple the silence requirement from its implementation

I don't believe we absolutely need silences to be bound to the
dashboard. There are many useful ways those could be implemented,
*including* through tools like fabric that could silence the
monitoring system on reboot. In that case, what we would need is more
of an API than a dashboard, for example.

In general, it's better to focus on what we actually want out of the
requirement rather than on its implementation.
---
 policy/tpa-rfc-33-monitoring.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 280526b..7231730 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -181,11 +181,6 @@ monitoring system, as provided by TPA.
   service admins SHOULD also have access to their own
   service-specific dashboards
 
-* **operations dashboard**: it MUST be possible for TPA operators to
-  execute certain tasks via the system dashboard:
-  - silence an alert
-  - schedule ad-hoc silences
-
 * **automatic configuration**: monitoring MUST NOT require a manual
   intervention from TPA when a new server is provisioned, and new
   components added during the server lifetime should be picked up
@@ -198,6 +193,9 @@ monitoring system, as provided by TPA.
   silencing expected alerts ahead of time (for planned maintenance)
   or on a schedule (eg. high i/o load during the backup window)
 
+* **alert silences**: operators should be able to silence ongoing
+  alerts or plan silences in advance
+
 * **performance-level alerting**: alerts MUST focus on user-visible
   performance metrics instead of underlying assumptions about
   architecture (e.g. alert on "CI jobs waiting for more than X hours"
-- 
GitLab

From 9762041cc272d1db09437eed75eb9ab69643e029 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?=
Date: Wed, 27 Jul 2022 11:13:01 -0400
Subject: [PATCH 4/4] move silences to a nice-to-have

Notifications are not actually a requirement, so it doesn't make sense
for silences to be one either.
---
 policy/tpa-rfc-33-monitoring.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 7231730..dfd0f38 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -193,9 +193,6 @@ monitoring system, as provided by TPA.
   silencing expected alerts ahead of time (for planned maintenance)
   or on a schedule (eg. high i/o load during the backup window)
 
-* **alert silences**: operators should be able to silence ongoing
-  alerts or plan silences in advance
-
 * **performance-level alerting**: alerts MUST focus on user-visible
   performance metrics instead of underlying assumptions about
   architecture (e.g. alert on "CI jobs waiting for more than X hours"
@@ -239,6 +236,9 @@ monitoring system, as provided by TPA.
   DNSSEC records by following this playbook"; counter-example: "disk
   80% full", "security delegations is WARNING")
 
+* **notification silences**: operators should be able to silence
+  ongoing alerts or plan silences in advance
+
 * **long term storage**: it should be possible to store metrics
   indefinitely, possibly with downsampling, to make long term
   (multi-year) analysis
-- 
GitLab
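As an aside on patch 3's argument: once silences are decoupled from a dashboard, a silence is just an API call, so a tool like fabric could set one before a reboot. A minimal sketch, assuming Prometheus Alertmanager's v2 silences endpoint; the matcher label, hostname, and 30-minute window are illustrative, not part of the patches above:

```python
# Sketch: building the JSON body for Alertmanager's POST /api/v2/silences,
# as a reboot task in fabric (or similar) might do. Matcher names and the
# 30-minute window are illustrative assumptions.
import json
from datetime import datetime, timedelta, timezone

def make_silence(matchers, minutes, comment, author="tpa"):
    """Build the payload Alertmanager expects for a new silence."""
    now = datetime.now(timezone.utc)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False}
            for name, value in matchers.items()
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }

body = make_silence({"instance": "host.example.com"}, 30, "planned reboot")
print(json.dumps(body, indent=2))
# POSTing this to the Alertmanager API (e.g. requests.post(
# "http://alertmanager:9093/api/v2/silences", json=body)) would create it.
```

The same effect is available from the command line via `amtool`; the point of the requirement is the capability, not any particular tool.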
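On patch 2's point about the `dsa-update-apt-status` state file: a stateless check, regenerated on every run and scraped by node_exporter's textfile collector, needs no manual refresh at all. A sketch under assumptions — the metric name, collector directory, and apt output parsing are illustrative, not the actual TPA setup:

```python
# Sketch: a stateless "pending upgrades" metric for node_exporter's
# textfile collector, replacing a manually refreshed state file.
# Directory and metric name are assumptions for illustration.
import os
import tempfile

TEXTFILE_DIR = "/var/lib/prometheus/node-exporter"  # assumed collector dir

def count_pending(simulate_output):
    """Count 'Inst' lines in `apt-get --simulate dist-upgrade` output."""
    return sum(1 for line in simulate_output.splitlines()
               if line.startswith("Inst "))

def write_metric(count, directory=TEXTFILE_DIR):
    """Write the metric atomically so the exporter never reads a torn file."""
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        f.write("# HELP apt_upgrades_pending Packages with pending upgrades.\n")
        f.write("# TYPE apt_upgrades_pending gauge\n")
        f.write("apt_upgrades_pending %d\n" % count)
    os.rename(tmp, os.path.join(directory, "apt.prom"))

# A cron job or systemd timer would then run something like:
#   out = subprocess.run(["apt-get", "--simulate", "dist-upgrade"],
#                        capture_output=True, text=True).stdout
#   write_metric(count_pending(out))
```

Because the file is rewritten from scratch on each run, Prometheus sees the post-upgrade state on the next scrape with no server-side check trigger needed.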