Verified Commit 117fef9e authored by anarcat

tpa-rfc-33: draft timeline (team#40755)

parent 3bc1d6c5
@@ -739,6 +739,10 @@ kept as an implementation detail to be researched later. [Thanos is
not packaged in Debian](https://bugs.debian.org/1032842) which would probably mean deploying it with
a container.
There are other proxies too, like [promxy](https://github.com/jacksontj/promxy) and [trickster](https://trickstercache.org/), which
might be easier to deploy because their scope is more limited than
Thanos, but neither is packaged in Debian either.
### Self-monitoring
Prometheus should monitor itself and its [Alertmanager][] for outages,
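
As an illustration, such a self-monitoring rule could be as small as the
sketch below; the `alertmanager` job name, delay and severity label are
placeholders, not part of this proposal:

```yaml
# Hypothetical Prometheus rule file: fire when the Alertmanager scrape
# target stops responding. Job name and delay are assumptions.
groups:
  - name: self-monitoring
    rules:
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager on {{ $labels.instance }} has been down for 5 minutes"
```
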
@@ -1065,31 +1069,61 @@ operators for open issues, but we do not believe this is necessary.
## Timeline
* deploy Alertmanager on prometheus1
* reimplement the Nagios alerting commands (optional?)
* send Nagios alerts through the alertmanager (optional?)
* rewrite (non-NRPE) commands (9) as Prometheus alerts
* scrape the NRPE metrics from Prometheus (optional)
* create a dashboard and/or alerts for the NRPE metrics (optional)
* review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
* turn off the Icinga server
* remove all traces of NRPE on all nodes
We will deploy this in three phases:
* Phase A: short-term conversion to retire Icinga and avoid running
  buster out of support for too long
* Phase B: mid-term work to expand the number of exporters and set up
  a high availability configuration
* Phase C: further exporter and metrics expansion, long-term metrics
  storage
TODO: put actual dates in there, estimates?
### Phase A: emergency Nagios retirement
In this phase we prioritize emergency work to replace core components
of the Nagios server, so the machine can be retired.
These are the tasks required in this phase:
* LDAP web password addition
* new authentication deployment on prometheus1
* deploy Alertmanager and email notifications on prometheus1 (see the sketch after this list)
* deploy alertmanager-irc-relay on prometheus1
* deploy Karma on prometheus1
* priority A metrics and alerts deployment
* Icinga server retirement
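
To make the notification tasks above more concrete, here is a minimal
sketch of what the Alertmanager routing and the alertmanager-irc-relay
hook could look like; the address, channel name and listener port are
assumptions, not decisions:

```yaml
# Hypothetical Alertmanager configuration sketch: fan alerts out to
# email and to the IRC relay. Assumes SMTP settings live in the global
# section, which is omitted here.
route:
  receiver: tpa-email
  group_by: ['alertname', 'instance']
  routes:
    # first child route matches everything and keeps evaluation going...
    - receiver: tpa-email
      continue: true
    # ...so this second route also fires, relaying the alert to IRC
    - receiver: tpa-irc

receivers:
  - name: tpa-email
    email_configs:
      - to: 'alerts@example.org'   # placeholder, not the real TPA address
  - name: tpa-irc
    webhook_configs:
      # alertmanager-irc-relay maps the URL path to an IRC channel;
      # listener address and channel name are assumptions
      - url: 'http://localhost:8000/tor-alerts'
```
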
TODO: multiple stages; emergency buster retirement, then alerting
improvements, then HA, then long term retention
### Phase B: more exporters
The current prometheus1/prometheus2 servers may actually be retired in
favor of two *new* servers rebuilt from scratch, entirely from
Puppet, LDAP, and the GitLab repositories, ensuring they are properly
reproducible.
In this phase, we integrate more exporters and services in the
infrastructure, which includes merging the second Prometheus
server for the service admins.
Experiments can be done manually on the current servers to speed up
development and replacement of the legacy infrastructure, but the goal
is to merge the two current servers into a single cluster. This might
also be accomplished by retiring one of the two servers and migrating
everything onto the other.
We *may* retire the existing servers and build two new servers
instead, but the more likely outcome is to progressively integrate the
targets and alerting rules from prometheus2 into prometheus1 and then
eventually retire prometheus2, rebuilding a copy of prometheus1.
TODO: how to merge prom2 into prom1
Here are the tasks required in this phase:
* prometheus2 merged into prometheus1 (one possible approach is sketched below)
* priority B metrics and alerts deployment
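
The merge mechanism is still an open question (see the TODO above), but
one possible transition aid would be to temporarily federate
prometheus2 into prometheus1, so its metrics stay visible while targets
are moved over. A rough sketch, with an assumed hostname and the
default Prometheus port:

```yaml
# Hypothetical scrape job on prometheus1: pull everything prometheus2
# currently collects through the /federate endpoint during the transition.
scrape_configs:
  - job_name: 'federate-prometheus2'
    honor_labels: true           # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'          # select all series; narrow this in practice
    static_configs:
      - targets: ['prometheus2.torproject.org:9090']   # assumed FQDN and port
```
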
### Phase C: high availability, long term metrics, other exporters
At this point, the vast majority of checks have been converted to
Prometheus and we have reached feature parity. We are looking for
"nice to have" improvements.
* prometheus3 server built for high availability
* GitLab alert integration
* long term metrics: high retention, lower scrape interval on
secondary server
* additional proxy setup as data source for Grafana (promxy or Thanos, sketched below)
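
For the last item, a promxy configuration merging the short-retention
and long-retention servers behind a single Grafana data source could
look roughly like the sketch below; the hostnames and ports are only
placeholders, and the exact promxy schema should be checked against its
documentation before deployment:

```yaml
# Hypothetical promxy sketch: expose both Prometheus servers behind a
# single query endpoint that Grafana can use as one data source.
promxy:
  server_groups:
    # short retention, frequent scrapes
    - static_configs:
        - targets: ['prometheus1.torproject.org:9090']
    # long retention, lower scrape interval
    - static_configs:
        - targets: ['prometheus3.torproject.org:9090']
```
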
# Challenges
@@ -1098,6 +1132,8 @@ TODO: how to merge prom2 into prom1
TODO: name each server according to retention? say mon-short-01 and
the other mon-long-02?
TODO: nagios vs icinga
# Alternatives considered
## Flap detection
@@ -1149,6 +1185,28 @@ anyway.
If this becomes a problem over time, the setup *could* be expanded to
such a stage, but it feels superfluous for now.
## Progressive conversion timeline
We originally wrote this timeline, a long time ago, when we had more
time to do the conversion:
* deploy Alertmanager on prometheus1
* reimplement the Nagios alerting commands (optional?)
* send Nagios alerts through the alertmanager (optional?)
* rewrite (non-NRPE) commands (9) as Prometheus alerts
* scrape the NRPE metrics from Prometheus (optional)
* create a dashboard and/or alerts for the NRPE metrics (optional)
* review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
* turn off the Icinga server
* remove all traces of NRPE on all nodes
In that abandoned approach, we would have progressively migrated from
Nagios to Prometheus by scraping Nagios from Prometheus. The
progressive nature allowed for a possible rollback in case we couldn't
make things work in Prometheus. It was ultimately abandoned because it
seemed to take more time, and we had mostly decided to do the migration
anyway, without the need for a rollback.
## Other dashboards
### Grafana