Commit 8377ddb4 (verified), authored by anarcat

more tpa-rfc-33 ideas (team#40755)

parent e322f978
Pipeline #167161 passed with warnings
@@ -332,6 +332,11 @@ monitoring system, as provided by TPA.
 # Personas
+TODO: document impact on personas
+TODO: review previous policies sections on personas to see if we're
+missing anything
 ## Ethan, the TPA admin
 Ethan is a member of the TPA team. He has access to the Puppet
@@ -504,6 +509,8 @@ services:
 ### Planned
+The eventual architecture for the system might look something like this:
 ![Diagram of the new infrastructure showing two redundant prom/grafana
 servers](tpa-rfc-33-monitoring/architecture-after.png)
@@ -544,16 +551,16 @@ setup. Each server has its own set of services running:
 * **Karma**: alerting dashboard which pulls alerts from Alertmanager
   and can issue silences.
-The current prometheus1/prometheus2 server will actually be retired in
-favor of two *new* servers which will be rebuilt from scratch,
-entirely from Puppet, LDAP, and GitLab repository, ensuring they are
-properly reproducible.
+The current prometheus1/prometheus2 server may actually be retired in
+favor of two *new* servers to be rebuilt from scratch, entirely from
+Puppet, LDAP, and GitLab repository, ensuring they are properly
+reproducible.
 Experiments can be done manually on the current servers to speed up
 development and replacement of the legacy infrastructure, but the goal
-is to merge the two current server in a single cluster.
-TODO: start with a single merged server at first and HA later?
+is to merge the two current server in a single cluster. This might
+also be accomplished by retiring one of the two servers and migrating
+everything on the other.
 ## Metrics: Prometheus
@@ -895,8 +902,10 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
 * turn off the Icinga server
 * remove all traces of NRPE on all nodes
-TODO: how to merge the two databases? maybe adopt the prom2 data and
-drop old TPA data?
+TODO: multiple stages; emergency buster retirement, then alerting
+improvements, then HA, then long term retention
+TODO: consider merging prom2 into prom1
 ## Timeline