diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index fc25703c1cc88fb2fbe4f09fb79ac53c1e694695..5e5be6d83d3799ee5f7a6594ed717183dbc1d2cd 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -332,6 +332,11 @@ monitoring system, as provided by TPA.
 
 # Personas
 
+TODO: document impact on personas
+
+TODO: review previous policies sections on personas to see if we're
+missing anything
+
 ## Ethan, the TPA admin
 
 Ethan is a member of the TPA team. He has access to the Puppet
@@ -504,6 +509,8 @@ services:
 
 ### Planned
 
+The eventual architecture for the system might look something like this:
+
@@ -544,16 +551,16 @@ setup. Each server has its own set of services running:
 * **Karma**: alerting dashboard which pulls alerts from Alertmanager
   and can issue silences.
 
-The current prometheus1/prometheus2 server will actually be retired in
-favor of two *new* servers which will be rebuilt from scratch,
-entirely from Puppet, LDAP, and GitLab repository, ensuring they are
-properly reproducible.
+The current prometheus1/prometheus2 server may actually be retired in
+favor of two *new* servers to be rebuilt from scratch, entirely from
+Puppet, LDAP, and GitLab repository, ensuring they are properly
+reproducible.
 
 Experiments can be done manually on the current servers to speed up
 development and replacement of the legacy infrastructure, but the goal
-is to merge the two current server in a single cluster.
-
-TODO: start with a single merged server at first and HA later?
+is to merge the two current server in a single cluster. This might
+also be accomplished by retiring one of the two servers and migrating
+everything on the other.
 
 ## Metrics: Prometheus
 
@@ -895,8 +902,10 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
 * turn off the Icinga server
 * remove all traces of NRPE on all nodes
 
-TODO: how to merge the two databases? maybe adopt the prom2 data and
-drop old TPA data?
+TODO: multiple stages; emergency buster retirement, then alerting
+improvements, then HA, then long term retention
+
+TODO: consider merging prom2 into prom1
 
 ## Timeline