more tpa-rfc-33 ideas (team#40755)

Verified Commit 8377ddb4 authored 10 months ago by anarcat
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -332,6 +332,11 @@ monitoring system, as provided by TPA.

 # Personas

+TODO: document impact on personas
+
+TODO: review the personas sections of previous policies to see if
+we're missing anything
+
 ## Ethan, the TPA admin

 Ethan is a member of the TPA team. He has access to the Puppet
@@ -504,6 +509,8 @@ services:

 ### Planned

+The eventual architecture for the system might look something like this:
+
 ![Diagram of the new infrastructure showing two redundant prom/grafana
 servers](tpa-rfc-33-monitoring/architecture-after.png)

@@ -544,16 +551,16 @@ setup. Each server has its own set of services running:
 * **Karma**: alerting dashboard which pulls alerts from Alertmanager
   and can issue silences.

-The current prometheus1/prometheus2 server will actually be retired in
-favor of two *new* servers which will be rebuilt from scratch,
-entirely from Puppet, LDAP, and GitLab repository, ensuring they are
-properly reproducible.
+The current prometheus1/prometheus2 servers may actually be retired in
+favor of two *new* servers to be rebuilt from scratch, entirely from
+Puppet, LDAP, and GitLab repository, ensuring they are properly
+reproducible.

 Experiments can be done manually on the current servers to speed up
 development and replacement of the legacy infrastructure, but the goal
-is to merge the two current server in a single cluster.
-
-TODO: start with a single merged server at first and HA later?
+is to merge the two current servers into a single cluster. This might
+also be accomplished by retiring one of the two servers and migrating
+everything to the other.

 ## Metrics: Prometheus

@@ -895,8 +902,10 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
 * turn off the Icinga server
 * remove all traces of NRPE on all nodes

-TODO: how to merge the two databases? maybe adopt the prom2 data and
-drop old TPA data?
+TODO: split into multiple stages: emergency buster retirement, then
+alerting improvements, then HA, then long-term retention
+
+TODO: consider merging prom2 into prom1

 ## Timeline