diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index 62f8811374666b2aacf5353b517577ab06ef6f3a..dfd0f38ae2120e48114b7a0380d91fecc7d8581b 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -181,13 +181,6 @@ monitoring system, as provided by TPA.
    service admins SHOULD also have access to their own
    service-specific dashboards
 
- * **operations dashboard**: it MUST be possible for TPA operators to
-   execute certain tasks via the system dashboard:
-   - silence an alert
-   - schedule ad-hoc silences
-   - display status and notification event history
-   - trigger a service check update
-
  * **automatic configuration**: monitoring MUST NOT require a manual
    intervention from TPA when a new server is provisioned, and new
    components added during the server lifetime should be picked up
@@ -206,6 +199,10 @@ monitoring system, as provided by TPA.
    not "load too high on runners"), which should help with alert
    fatigue and auto-configuration
 
+ * **timely service checks**: the monitoring system should notice
+   issues promptly (within a minute or so), without us having to
+   trigger checks manually to verify service recovery, for example
+
 ### Nice to have
 
  * **alert notifications**: it SHOULD be possible for operators to
@@ -239,6 +236,9 @@ monitoring system, as provided by TPA.
    DNSSEC records by following this playbook"; counter-example:
    "disk 80% full", "security delegations is WARNING")
 
+ * **notification silences**: operators should be able to silence
+   ongoing alerts or plan silences in advance
+
  * **long term storage**: it should be possible to store metrics
    indefinitely, possibly with downsampling, to make long term
    (multi-year) analysis