Verified Commit b29461cb authored by anarcat

tpa-rfc-33: single Grafana/Karma setup (team#40755)

This feels simpler, and easier to manage. I couldn't figure out where
to fit Karma in the fully redundant setup, which is a tell that
something was wrong.

We keep the other idea as a rejected plan.
parent f89be8e7
@@ -507,27 +507,27 @@ services:
![Diagram of the new infrastructure showing two redundant prom/grafana
servers](tpa-rfc-33-monitoring/architecture-after.png)
TODO: where does karma sit?
The above shows a diagram of a highly available Prometheus/Grafana
server setup. Each server has its own set of services running:
* Prometheus: pulls metrics from exporters including a node exporter
on every machine but also other exporters defined by service
admins, for which configuration is a mix of Puppet and a GitLab
repository pulled by Puppet. One server keeps long term metrics and
has a longer scrape interval. The Prometheus servers monitor each
other.
* blackbox exporter: this exporter runs on every Prometheus server
and is scraped by that Prometheus server for arbitrary metrics like
ICMP, HTTP or TLS response times
* Grafana: each server runs its own Grafana service which should be
The above shows a diagram of a highly available Prometheus server
setup. Each server has its own set of services running:
* Prometheus: the primary pulls metrics from exporters including a
node exporter on every machine but also other exporters defined by
service admins, for which configuration is a mix of Puppet and a
GitLab repository pulled by Puppet. The secondary server keeps long
term metrics and pulls all the metrics from the primary server
using a longer scrape interval. Both Prometheus servers monitor
each other.
* blackbox exporter: this exporter runs on the primary Prometheus
server and is scraped by the primary Prometheus server for
arbitrary metrics like ICMP, HTTP or TLS response times
* Grafana: the primary server runs a Grafana service which should be
fully configured in Puppet, with some dashboards being pulled from
a GitLab repository. Local configuration is completely ephemeral
and discouraged. Each Grafana server browses metrics from the local
Prometheus database.
and discouraged. It pulls metrics from the local Prometheus server,
which has a "remote read" interface to pull backlog from the
secondary server (a configuration sketch follows below).
* Alertmanager: each server also runs its own Alertmanager which
fires off notifications to IRC, email, or (eventually) GitLab,
@@ -606,6 +606,8 @@ TODO: review https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
TODO: there's something about an upper limit to scrape interval, check that.
TODO: double-check that remote read and pull from the other actually works
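To make that last double-check concrete, here is a minimal sketch of
what the two sides could look like, reusing the prometheusN and
prometheusN+1 names from the diagrams; the ports, URLs, intervals and
match expression are illustrative assumptions, not decided values:

```yaml
# Fragment for prometheusN (primary): read historical data back from
# the long-term server so that Grafana, which only talks to this
# instance, still sees the full backlog.
remote_read:
  - url: "http://prometheusN+1.torproject.org:9090/api/v1/read"
    read_recent: false   # only reach out for data no longer held locally

# Fragment for prometheusN+1 (secondary, long-term storage): pull
# everything from the primary through the federation endpoint, at a
# longer scrape interval than the primary uses.
scrape_configs:
  - job_name: "federate-primary"
    honor_labels: true
    scrape_interval: 5m            # illustrative only
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'            # everything; a tighter match may be wiser
    static_configs:
      - targets:
          - "prometheusN.torproject.org:9090"
```

Whether the remote read endpoint actually lets the primary serve that
older data transparently to Grafana is exactly what the TODO above
needs to validate.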
### Self-monitoring
Prometheus should monitor itself and its [Alertmanager][] for
@@ -809,6 +811,9 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
* turn off the Icinga server
* remove all traces of NRPE on all nodes
TODO: how to merge the two databases? maybe adopt the prom2 data and
drop old TPA data?
## Timeline
TODO: retire nagios first, HA later?
@@ -827,6 +832,47 @@ TODO: evaluate https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_29
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2968812
## Fully redundant Grafana/Karma instances
We have also briefly considered setting up the same, complete stack on
both servers:
![Diagram of an alternative infrastructure showing two fully redundant prom/grafana
servers](tpa-rfc-33-monitoring/architecture-reject.png)
The above shows a diagram of a highly available Prometheus/Grafana
server setup. Each server has its own set of services running:
* Prometheus: both servers pull metrics from all exporters, including
a node exporter on every machine but also other exporters defined
by service admins
* blackbox exporter: this exporter runs on every Prometheus server
and is scraped by that Prometheus server for arbitrary metrics like
ICMP, HTTP or TLS response times
* Grafana: each server runs its own Grafana service, each Grafana
server browses metrics from the local Prometheus database.
* Alertmanager: each server also runs its own Alertmanager which
fires off notifications to IRC, email, or (eventually) GitLab,
deduplicating alerts between the two servers using its gossip
protocol (see the sketch after this list).
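For what it's worth, that deduplication works the same way in either
variant, since both plans run an Alertmanager on each server: every
Prometheus is pointed at both Alertmanagers, and the Alertmanagers
mesh together over their gossip protocol. A rough sketch of the
Prometheus side, with hostnames taken from the diagrams and the
default Alertmanager port assumed:

```yaml
# Fragment shared by both Prometheus servers: send every alert to both
# Alertmanagers, which then deduplicate notifications between
# themselves over the gossip protocol. Port 9093 is an assumption
# (the upstream default).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "prometheusN.torproject.org:9093"
            - "prometheusN+1.torproject.org:9093"
```

The Alertmanagers themselves would additionally be started with
`--cluster.peer` pointing at each other so that the mesh forms.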
This feels impractical and overloaded. Grafana, in particular, would
be tricky to configure, as there is necessarily a bit of manual
configuration on the server. Having two different retention policies
would also be confusing: you would never quite know which server to
use to browse data.
The idea of having a single Grafana/Karma pair (with Karma watching
both Alertmanagers, as sketched below) is that if they are down, you
have bigger problems anyway: the Alertmanagers will let operators
know, and the underlying problem needs to be fixed regardless.
If this becomes a problem over time, the setup *could* be expanded
into such a fully redundant configuration, but it feels superfluous
for now.
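As an illustration of what that single Karma instance could look
like, here is a minimal configuration sketch pointing the dashboard
at both Alertmanagers, as in the diagram; server names, URIs and
timeouts are placeholders, not decided values:

```yaml
# Karma configuration sketch (placement and values not decided): one
# Karma instance aggregating alerts from both Alertmanagers.
alertmanager:
  interval: 60s
  servers:
    - name: "prometheusN"
      uri: "http://prometheusN.torproject.org:9093"
      timeout: 20s
    - name: "prometheusN+1"
      uri: "http://prometheusN+1.torproject.org:9093"
      timeout: 20s
```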
## Other dashboards
### Grafana
@@ -11,26 +11,24 @@ digraph before {
Alertmanager1 [ label="Alertmanager" ]
Grafana1 [ label="Grafana" ]
blackbox1 [ label="Blackbox" ]
karma1 [ label="Karma" ]
}
subgraph "clusterprom2" {
label="prometheusN+1.torproject.org"
label="prometheusN+1.torproject.org\nlong term storage and HA"
Prometheus2 [ label="Prometheus" ]
Alertmanager2 [ label="Alertmanager" ]
Grafana2 [ label="Grafana" ]
blackbox2 [ label="Blackbox" ]
}
email
IRC
GitLab
"node exporters"
{ "other exporters", "node exporters" } -> { Prometheus1, Prometheus2} [arrowtail=inv, dir=back]
{ "other exporters", "node exporters" } -> Prometheus1 [arrowtail=inv, dir=back]
blackbox1 -> Prometheus1 [ arrowtail=inv dir=back]
blackbox2 -> Prometheus2 [ arrowtail=inv dir=back]
Prometheus1 -> Alertmanager1 -> { email, IRC, GitLab }
Prometheus2 -> Alertmanager2 -> { email, IRC, GitLab }
Prometheus1 -> Prometheus2 [ dir=both ]
Alertmanager1 -> Alertmanager2 [ dir=both ]
Prometheus1 -> Grafana1 [ arrowtail=inv dir=back]
Prometheus2 -> Grafana2 [ arrowtail=inv dir=back]
{ Alertmanager1, Alertmanager2 } -> karma1 [ arrowtail=inv dir=back ]
}
policy/tpa-rfc-33-monitoring/architecture-after.png: image updated (107 KiB → 91.6 KiB)
digraph before {
label="TPA monitoring infrastructure, planned 2024-2025\nGrafana, Prometheus, Alertmanager configurations pulled from GitLab and Puppet, not shown\nOther configuration pulled from Puppet and LDAP, not shown"
labelloc=bottom
graph [ fontname=Liberation fontsize=14 ];
node [ fontname=Liberation ];
edge [ fontname=Liberation ];
subgraph "clusterprom1" {
label="prometheusN.torproject.org"
Prometheus1 [ label="Prometheus" ]
Alertmanager1 [ label="Alertmanager" ]
Grafana1 [ label="Grafana" ]
blackbox1 [ label="Blackbox" ]
}
subgraph "clusterprom2" {
label="prometheusN+1.torproject.org"
Prometheus2 [ label="Prometheus" ]
Alertmanager2 [ label="Alertmanager" ]
Grafana2 [ label="Grafana" ]
blackbox2 [ label="Blackbox" ]
}
email
IRC
GitLab
"node exporters"
{ "other exporters", "node exporters" } -> { Prometheus1, Prometheus2} [arrowtail=inv, dir=back]
blackbox1 -> Prometheus1 [ arrowtail=inv dir=back]
blackbox2 -> Prometheus2 [ arrowtail=inv dir=back]
Prometheus1 -> Alertmanager1 -> { email, IRC, GitLab }
Prometheus2 -> Alertmanager2 -> { email, IRC, GitLab }
Prometheus1 -> Prometheus2 [ dir=both ]
Alertmanager1 -> Alertmanager2 [ dir=both ]
Prometheus1 -> Grafana1 [ arrowtail=inv dir=back]
Prometheus2 -> Grafana2 [ arrowtail=inv dir=back]
}
policy/tpa-rfc-33-monitoring/architecture-reject.png: new image (107 KiB)