From 117fef9ea4033c03a96dccdfe1196d5b854e7786 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Antoine=20Beaupr=C3=A9?= <anarcat@debian.org>
Date: Sun, 12 May 2024 16:52:36 -0400
Subject: [PATCH] tpa-rfc-33: draft timeline (tpo/tpa/team#40755)

---
 policy/tpa-rfc-33-monitoring.md | 100 +++++++++++++++++++++++++-------
 1 file changed, 79 insertions(+), 21 deletions(-)

diff --git a/policy/tpa-rfc-33-monitoring.md b/policy/tpa-rfc-33-monitoring.md
index a5a281d2..1f7a4a8d 100644
--- a/policy/tpa-rfc-33-monitoring.md
+++ b/policy/tpa-rfc-33-monitoring.md
@@ -739,6 +739,10 @@ kept as an implementation detail to be researched later.
 [Thanos is not packaged in Debian](https://bugs.debian.org/1032842) which
 would probably mean deploying it with a container.
 
+There are other proxies too, like [promxy](https://github.com/jacksontj/promxy) and [trickster](https://trickstercache.org/), which
+might be easier to deploy because their scope is more limited than
+that of Thanos, but neither is packaged in Debian either.
+
 ### Self-monitoring
 
 Prometheus should monitor itself and its [Alertmanager][] for outages,
@@ -1065,31 +1069,61 @@ operators for open issues, but we do not believe this is necessary.
 
 ## Timeline
 
- * deploy Alertmanager on prometheus1
- * reimplement the Nagios alerting commands (optional?)
- * send Nagios alerts through the alertmanager (optional?)
- * rewrite (non-NRPE) commands (9) as Prometheus alerts
- * scrape the NRPE metrics from Prometheus (optional)
- * create a dashboard and/or alerts for the NRPE metrics (optional)
- * review the NRPE commands (300+) to see which one to rewrite as Prometheus alerts
- * turn off the Icinga server
- * remove all traces of NRPE on all nodes
+We will deploy this in three phases:
+
+ * Phase A: short-term conversion to retire Icinga, to avoid running
+   buster out of support for too long
+
+ * Phase B: mid-term work to expand the number of exporters and
+   configure high availability
+
+ * Phase C: further exporter and metrics expansion, long-term metrics
+   storage
+
+TODO: put actual dates (or at least estimates) in here.
+
+### Phase A: emergency Nagios retirement
+
+In this phase, we prioritize emergency work to replace the core
+components of the Nagios server so that the machine can be retired.
+
+The tasks required in this phase are:
+
+ * LDAP web password addition
+ * new authentication deployment on prometheus1
+ * deploy Alertmanager and email notifications on prometheus1
+ * deploy alertmanager-irc-relay on prometheus1
+ * deploy Karma on prometheus1
+ * priority A metrics and alerts deployment
+ * Icinga server retirement
 
-TODO: multiple stages; emergency buster retirement, then alerting
-improvements, then HA, then long term retention
+### Phase B: more exporters
 
-The current prometheus1/prometheus2 server may actually be retired in
-favor of two *new* servers to be rebuilt from scratch, entirely from
-Puppet, LDAP, and GitLab repository, ensuring they are properly
-reproducible.
+In this phase, we integrate more exporters and services into the
+infrastructure, which includes merging in the second Prometheus
+server used by the service admins.
 
-Experiments can be done manually on the current servers to speed up
-development and replacement of the legacy infrastructure, but the goal
-is to merge the two current server in a single cluster. This might
-also be accomplished by retiring one of the two servers and migrating
-everything on the other.
+We *may* retire the existing servers and build two new servers
+instead, but the more likely outcome is to progressively integrate the
+targets and alerting rules from prometheus2 into prometheus1 and then
+eventually retire prometheus2, rebuilding a copy of prometheus1.
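+
+One way to make that merge concrete, shown here only as a sketch (the
+hostname, port and `match[]` selector are assumptions, not decisions),
+would be to have prometheus1 temporarily federate the time series
+still scraped by prometheus2 while its targets are moved over:
+
+```yaml
+# hypothetical transitional scrape job on prometheus1
+scrape_configs:
+  - job_name: "federate-prometheus2"
+    honor_labels: true          # keep the original job/instance labels
+    metrics_path: "/federate"   # Prometheus federation endpoint
+    params:
+      "match[]":
+        - '{job=~".+"}'         # pull all series; narrow this in practice
+    static_configs:
+      - targets:
+          - "prometheus2.torproject.org:9090"  # assumed hostname and port
+```
+
+Once all targets and alerting rules have been carried over, such a
+federation job would simply be dropped.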
 
-TODO: how to merge prom2 into prom1
+The tasks required in this phase are:
+
+ * prometheus2 merged into prometheus1
+ * priority B metrics and alerts deployment
+
+### Phase C: high availability, long-term metrics, other exporters
+
+At this point, the vast majority of checks have been converted to
+Prometheus and we have reached feature parity; we are now looking at
+"nice to have" improvements:
+
+ * prometheus3 server built for high availability
+ * GitLab alert integration
+ * long-term metrics: high retention, larger scrape interval on the
+   secondary server
+ * additional proxy setup as a data source for Grafana (promxy or Thanos)
 
 # Challenges
 
@@ -1098,6 +1132,8 @@ TODO: how to merge prom2 into prom1
 TODO: name each server according to retention? say mon-short-01 and
 the other mon-long-02?
 
+TODO: Nagios vs Icinga
+
 # Alternatives considered
 
 ## Flap detection
@@ -1149,6 +1185,28 @@ anyway.
 If this becomes a problem over time, the setup *could* be expanded to
 such a stage, but it feels superfluous for now.
 
+## Progressive conversion timeline
+
+We originally wrote this timeline a long time ago, when we had more
+time to do the conversion:
+
+ * deploy Alertmanager on prometheus1
+ * reimplement the Nagios alerting commands (optional?)
+ * send Nagios alerts through the Alertmanager (optional?)
+ * rewrite (non-NRPE) commands (9) as Prometheus alerts
+ * scrape the NRPE metrics from Prometheus (optional)
+ * create a dashboard and/or alerts for the NRPE metrics (optional)
+ * review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
+ * turn off the Icinga server
+ * remove all traces of NRPE on all nodes
+
+In that abandoned approach, we would have progressively migrated from
+Nagios to Prometheus by scraping Nagios from Prometheus. The gradual
+nature would have allowed a rollback in case we couldn't make things
+work in Prometheus. It was ultimately abandoned because it seemed to
+take more time, and we had mostly decided to do the migration anyway,
+without the need for a rollback path.
+
 ## Other dashboards
 
 ### Grafana
-- 
GitLab