Verified Commit b29461cb authored by anarcat

tpa-rfc-33: single Grafana/Karma setup (team#40755)

This feels simpler, and easier to manage. I couldn't figure out where
to fit Karma in the fully redundant setup, which is a tell that
something was wrong.

We keep the other idea as a rejected plan.
parent f89be8e7
@@ -507,27 +507,27 @@ services:
![Diagram of the new infrastructure showing two redundant prom/grafana
servers](tpa-rfc-33-monitoring/architecture-after.png)
TODO: where does karma sit?
The above shows a diagram of a highly available Prometheus/Grafana
server setup. Each server has its own set of services running:
* Prometheus: pulls metrics from exporters including a node exporter
on every machine but also other exporters defined by service
admins, for which configuration is a mix of Puppet and a GitLab
repository pulled by Puppet. One server keeps long term metrics and
has a longer scrape interval. The Prometheus servers monitor each
other.
* blackbox exporter: this exporter runs on every Prometheus server
and is scraped by that Prometheus server for arbitrary metrics like
ICMP, HTTP or TLS response times
* Grafana: each server runs its own Grafana service which should be
The above shows a diagram of a highly available Prometheus server
setup. Each server has its own set of services running:
* Prometheus: the primary pulls metrics from exporters including a
node exporter on every machine but also other exporters defined by
service admins, for which configuration is a mix of Puppet and a
GitLab repository pulled by Puppet. The secondary server keeps long
term metrics and pulls all the metrics from the primary server
using a longer scrape interval. Both Prometheus servers monitor
each other.
* blackbox exporter: this exporter runs on the primary Prometheus
server and is scraped by the primary Prometheus server for
arbitrary metrics like ICMP, HTTP or TLS response times
* Grafana: the primary server runs a Grafana service which should be
fully configured in Puppet, with some dashboards being pulled from
a GitLab repository. Local configuration is completely ephemeral
and discouraged. Each Grafana server browses metrics from the local
Prometheus database.
and discouraged. It pulls metrics from the local Prometheus server,
which has a "remote read" interface to pull backlog from the
secondary server (a configuration sketch follows below).
* Alertmanager: each server also runs its own Alertmanager which
fires off notifications to IRC, email, or (eventually) GitLab,
@@ -606,6 +606,8 @@ TODO: review https://gitlab.torproject.org/tpo/tpa/team/-/issues/40330
TODO: there's something about an upper limit to scrape interval, check that.
TODO: double-check that remote read and pull from the other actually works
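To make that last double-check concrete, here is a minimal sketch of
what the two sides could look like, reusing the prometheusN and
prometheusN+1 names from the diagrams; the ports, URLs, intervals and
match expression are illustrative assumptions, not decided values:

```yaml
# Fragment for prometheusN (primary): read historical data back from
# the long-term server so that Grafana, which only talks to this
# instance, still sees the full backlog.
remote_read:
  - url: "http://prometheusN+1.torproject.org:9090/api/v1/read"
    read_recent: false   # only reach out for data no longer held locally

# Fragment for prometheusN+1 (secondary, long-term storage): pull
# everything from the primary through the federation endpoint, at a
# longer scrape interval than the primary uses.
scrape_configs:
  - job_name: "federate-primary"
    honor_labels: true
    scrape_interval: 5m            # illustrative only
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'            # everything; a tighter match may be wiser
    static_configs:
      - targets:
          - "prometheusN.torproject.org:9090"
```

Whether the remote read endpoint actually lets the primary serve that
older data transparently to Grafana is exactly what the TODO above
needs to validate.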
### Self-monitoring
Prometheus should monitor itself and its [Alertmanager][] for
@@ -809,6 +811,9 @@ TODO: review https://gitlab.com/gitlab-com/gl-infra/helicopter
* turn off the Icinga server
* remove all traces of NRPE on all nodes
TODO: how to merge the two databases? maybe adopt the prom2 data and
drop old TPA data?
## Timeline
TODO: retire nagios first, HA later?
@@ -827,6 +832,47 @@ TODO: evaluate https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_29
https://gitlab.torproject.org/tpo/tpa/team/-/issues/40755#note_2968812
## Fully redundant Grafana/Karma instances
We have also briefly considered setting up the same, complete stack on
both servers:
![Diagram of an alternative infrastructure showing two fully redundant prom/grafana
servers](tpa-rfc-33-monitoring/architecture-reject.png)
The above shows a diagram of a highly available Prometheus/Grafana
server setup. Each server has its own set of services running:
* Prometheus: both servers pull metrics from all exporters, including
a node exporter on every machine but also other exporters defined
by service admins
* blackbox exporter: this exporter runs on every Prometheus server
and is scraped by that Prometheus server for arbitrary metrics like
ICMP, HTTP or TLS response times
* Grafana: each server runs its own Grafana service, each Grafana
server browses metrics from the local Prometheus database.
* Alertmanager: each server also runs its own Alertmanager which
fires off notifications to IRC, email, or (eventually) GitLab,
deduplicating alerts between the two servers using its gossip
protocol (see the sketch after this list).
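For what it's worth, that deduplication works the same way in either
variant, since both plans run an Alertmanager on each server: every
Prometheus is pointed at both Alertmanagers, and the Alertmanagers
mesh together over their gossip protocol. A rough sketch of the
Prometheus side, with hostnames taken from the diagrams and the
default Alertmanager port assumed:

```yaml
# Fragment shared by both Prometheus servers: send every alert to both
# Alertmanagers, which then deduplicate notifications between
# themselves over the gossip protocol. Port 9093 is an assumption
# (the upstream default).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "prometheusN.torproject.org:9093"
            - "prometheusN+1.torproject.org:9093"
```

The Alertmanagers themselves would additionally be started with
`--cluster.peer` pointing at each other so that the mesh forms.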
This feels impractical and overloaded. Grafana, in particular, would
be tricky to configure, as there is necessarily a bit of manual
configuration on the server. Having two different retention policies
would also be confusing: you would never quite know which server to
use to browse data.
The idea of having a single Grafana/Karma pair (with Karma watching
both Alertmanagers, as sketched below) is that if they are down, you
have bigger problems anyway: the Alertmanagers will let operators
know, and the underlying problem needs to be fixed regardless.
If this becomes a problem over time, the setup *could* be expanded
into such a fully redundant configuration, but it feels superfluous
for now.
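As an illustration of what that single Karma instance could look
like, here is a minimal configuration sketch pointing the dashboard
at both Alertmanagers, as in the diagram; server names, URIs and
timeouts are placeholders, not decided values:

```yaml
# Karma configuration sketch (placement and values not decided): one
# Karma instance aggregating alerts from both Alertmanagers.
alertmanager:
  interval: 60s
  servers:
    - name: "prometheusN"
      uri: "http://prometheusN.torproject.org:9093"
      timeout: 20s
    - name: "prometheusN+1"
      uri: "http://prometheusN+1.torproject.org:9093"
      timeout: 20s
```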
## Other dashboards
### Grafana
@@ -11,26 +11,24 @@ digraph before {
Alertmanager1 [ label="Alertmanager" ]
Grafana1 [ label="Grafana" ]
blackbox1 [ label="Blackbox" ]
karma1 [ label="Karma" ]
}
subgraph "clusterprom2" {
label="prometheusN+1.torproject.org"
label="prometheusN+1.torproject.org\nlong term storage and HA"
Prometheus2 [ label="Prometheus" ]
Alertmanager2 [ label="Alertmanager" ]
Grafana2 [ label="Grafana" ]
blackbox2 [ label="Blackbox" ]
}
email
IRC
GitLab
"node exporters"
{ "other exporters", "node exporters" } -> { Prometheus1, Prometheus2} [arrowtail=inv, dir=back]
{ "other exporters", "node exporters" } -> Prometheus1 [arrowtail=inv, dir=back]
blackbox1 -> Prometheus1 [ arrowtail=inv dir=back]
blackbox2 -> Prometheus2 [ arrowtail=inv dir=back]
Prometheus1 -> Alertmanager1 -> { email, IRC, GitLab }
Prometheus2 -> Alertmanager2 -> { email, IRC, GitLab }
Prometheus1 -> Prometheus2 [ dir=both ]
Alertmanager1 -> Alertmanager2 [ dir=both ]
Prometheus1 -> Grafana1 [ arrowtail=inv dir=back]
Prometheus2 -> Grafana2 [ arrowtail=inv dir=back]
{ Alertmanager1, Alertmanager2 } -> karma1 [ arrowtail=inv dir=back ]
}
policy/tpa-rfc-33-monitoring/architecture-after.png: image updated (107 KiB → 91.6 KiB)
digraph before {
label="TPA monitoring infrastructure, planned 2024-2025\nGrafana, Prometheus, Alertmanager configurations pulled from GitLab and Puppet, not shown\nOther configuration pulled from Puppet and LDAP, not shown"
labelloc=bottom
graph [ fontname=Liberation fontsize=14 ];
node [ fontname=Liberation ];
edge [ fontname=Liberation ];
subgraph "clusterprom1" {
label="prometheusN.torproject.org"
Prometheus1 [ label="Prometheus" ]
Alertmanager1 [ label="Alertmanager" ]
Grafana1 [ label="Grafana" ]
blackbox1 [ label="Blackbox" ]
}
subgraph "clusterprom2" {
label="prometheusN+1.torproject.org"
Prometheus2 [ label="Prometheus" ]
Alertmanager2 [ label="Alertmanager" ]
Grafana2 [ label="Grafana" ]
blackbox2 [ label="Blackbox" ]
}
email
IRC
GitLab
"node exporters"
{ "other exporters", "node exporters" } -> { Prometheus1, Prometheus2} [arrowtail=inv, dir=back]
blackbox1 -> Prometheus1 [ arrowtail=inv dir=back]
blackbox2 -> Prometheus2 [ arrowtail=inv dir=back]
Prometheus1 -> Alertmanager1 -> { email, IRC, GitLab }
Prometheus2 -> Alertmanager2 -> { email, IRC, GitLab }
Prometheus1 -> Prometheus2 [ dir=both ]
Alertmanager1 -> Alertmanager2 [ dir=both ]
Prometheus1 -> Grafana1 [ arrowtail=inv dir=back]
Prometheus2 -> Grafana2 [ arrowtail=inv dir=back]
}
policy/tpa-rfc-33-monitoring/architecture-reject.png: new image (107 KiB)