From 8377ddb4f0c39451e11694c395329d06e7837f5d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Wed, 8 May 2024 22:26:02 -0400
Subject: [PATCH] more tpa-rfc-33 ideas (tpo/tpa/team#40755)

---
 policy/tpa-rfc-33-monitoring.md | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index fc25703c..5e5be6d8 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -332,6 +332,11 @@ monitoring system, as provided by TPA.
 
 # Personas
 
+TODO: document impact on personas
+
+TODO: review previous policies sections on personas to see if we're
+missing anything
+
 ## Ethan, the TPA admin
 
 Ethan is a member of the TPA team. He has access to the Puppet
@@ -504,6 +509,8 @@ services:
 
 ### Planned
 
+The eventual architecture for the system might look something like this:
+
 ![Diagram of the new infrastructure showing two redundant prom/grafana
 servers](tpa-rfc-33-monitoring/architecture-after.png)
 
@@ -544,16 +551,16 @@ setup. Each server has its own set of services running:
  * **Karma**: alerting dashboard which pulls alerts from Alertmanager
    and can issue silences.
 
-The current prometheus1/prometheus2 server will actually be retired in
-favor of two *new* servers which will be rebuilt from scratch,
-entirely from Puppet, LDAP, and GitLab repository, ensuring they are
-properly reproducible.
+The current prometheus1/prometheus2 servers may actually be retired in
+favor of two *new* servers to be rebuilt from scratch, entirely from
+Puppet, LDAP, and GitLab repository, ensuring they are properly
+reproducible.
 
 Experiments can be done manually on the current servers to speed up
 development and replacement of the legacy infrastructure, but the goal
-is to merge the two current server in a single cluster.
-
-TODO: start with a single merged server at first and HA later?
+is to merge the two current servers into a single cluster. This might
+also be accomplished by retiring one of the two servers and migrating
+everything to the other.
 
 ## Metrics: Prometheus
 
@@ -895,8 +902,10 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
  * turn off the Icinga server
  * remove all traces of NRPE on all nodes
 
-TODO: how to merge the two databases? maybe adopt the prom2 data and
-drop old TPA data?
+TODO: multiple stages: emergency buster retirement, then alerting
+improvements, then HA, then long-term retention
+
+TODO: consider merging prom2 into prom1
 
 ## Timeline
 
-- 
GitLab