Verified Commit 117fef9e authored by anarcat

tpa-rfc-33: draft timeline (team#40755)

parent 3bc1d6c5
@@ -739,6 +739,10 @@ kept as an implementation detail to be researched later. [Thanos is
not packaged in Debian](https://bugs.debian.org/1032842) which would probably mean deploying it with
a container.
There are other proxies too, like [promxy](https://github.com/jacksontj/promxy) and [trickster](https://trickstercache.org/), which
might be easier to deploy because their scope is more limited than
Thanos, but neither is packaged in Debian either.
### Self-monitoring
Prometheus should monitor itself and its [Alertmanager][] for outages,
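
As an illustration, such a self-monitoring rule could be as small as the
sketch below; the `alertmanager` job name, delay and severity label are
placeholders, not part of this proposal:

```yaml
# Hypothetical Prometheus rule file: fire when the Alertmanager scrape
# target stops responding. Job name and delay are assumptions.
groups:
  - name: self-monitoring
    rules:
      - alert: AlertmanagerDown
        expr: up{job="alertmanager"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Alertmanager on {{ $labels.instance }} has been down for 5 minutes"
```
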
@@ -1065,31 +1069,61 @@ operators for open issues, but we do not believe this is necessary.
## Timeline
* deploy Alertmanager on prometheus1
* reimplement the Nagios alerting commands (optional?)
* send Nagios alerts through the alertmanager (optional?)
* rewrite (non-NRPE) commands (9) as Prometheus alerts
* scrape the NRPE metrics from Prometheus (optional)
* create a dashboard and/or alerts for the NRPE metrics (optional)
* review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
* turn off the Icinga server
* remove all traces of NRPE on all nodes
We will deploy this in three phases:
* Phase A: short-term conversion to retire Icinga and avoid running
  buster out of support for too long
* Phase B: mid-term work to expand the number of exporters and set up
  a high availability configuration
* Phase C: further exporter and metrics expansion, long-term metrics
  storage
TODO: put actual dates in there, estimates?
### Phase A: emergency Nagios retirement
In this phase we prioritize emergency work to replace core components
of the Nagios server, so the machine can be retired.
These are the tasks required in this phase:
* LDAP web password addition
* new authentication deployment on prometheus1
* deploy Alertmanager and email notifications on prometheus1 (see the sketch after this list)
* deploy alertmanager-irc-relay on prometheus1
* deploy Karma on prometheus1
* priority A metrics and alerts deployment
* Icinga server retirement
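
To make the notification tasks above more concrete, here is a minimal
sketch of what the Alertmanager routing and the alertmanager-irc-relay
hook could look like; the address, channel name and listener port are
assumptions, not decisions:

```yaml
# Hypothetical Alertmanager configuration sketch: fan alerts out to
# email and to the IRC relay. Assumes SMTP settings live in the global
# section, which is omitted here.
route:
  receiver: tpa-email
  group_by: ['alertname', 'instance']
  routes:
    # first child route matches everything and keeps evaluation going...
    - receiver: tpa-email
      continue: true
    # ...so this second route also fires, relaying the alert to IRC
    - receiver: tpa-irc

receivers:
  - name: tpa-email
    email_configs:
      - to: 'alerts@example.org'   # placeholder, not the real TPA address
  - name: tpa-irc
    webhook_configs:
      # alertmanager-irc-relay maps the URL path to an IRC channel;
      # listener address and channel name are assumptions
      - url: 'http://localhost:8000/tor-alerts'
```
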
TODO: multiple stages; emergency buster retirement, then alerting
improvements, then HA, then long term retention
### Phase B: more exporters
The current prometheus1/prometheus2 servers may actually be retired in
favor of two *new* servers rebuilt from scratch, entirely from
Puppet, LDAP, and the GitLab repositories, ensuring they are properly
reproducible.
In this phase, we integrate more exporters and services in the
infrastructure, which includes merging the second Prometheus
server for the service admins.
Experiments can be done manually on the current servers to speed up
development and replacement of the legacy infrastructure, but the goal
is to merge the two current servers into a single cluster. This might
also be accomplished by retiring one of the two servers and migrating
everything onto the other.
We *may* retire the existing servers and build two new servers
instead, but the more likely outcome is to progressively integrate the
targets and alerting rules from prometheus2 into prometheus1 and then
eventually retire prometheus2, rebuilding a copy of prometheus1.
TODO: how to merge prom2 into prom1
Here are the tasks required in this phase:
* prometheus2 merged into prometheus1 (one possible approach is sketched below)
* priority B metrics and alerts deployment
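
The merge mechanism is still an open question (see the TODO above), but
one possible transition aid would be to temporarily federate
prometheus2 into prometheus1, so its metrics stay visible while targets
are moved over. A rough sketch, with an assumed hostname and the
default Prometheus port:

```yaml
# Hypothetical scrape job on prometheus1: pull everything prometheus2
# currently collects through the /federate endpoint during the transition.
scrape_configs:
  - job_name: 'federate-prometheus2'
    honor_labels: true           # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'          # select all series; narrow this in practice
    static_configs:
      - targets: ['prometheus2.torproject.org:9090']   # assumed FQDN and port
```
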
### Phase C: high availability, long term metrics, other exporters
At this point, the vast majority of checks have been converted to
Prometheus and we have reached feature parity. We are looking for
"nice to have" improvements.
* prometheus3 server built for high availability
* GitLab alert integration
* long term metrics: high retention, lower scrape interval on
secondary server
* additional proxy setup as data source for Grafana (promxy or Thanos, sketched below)
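
For the last item, a promxy configuration merging the short-retention
and long-retention servers behind a single Grafana data source could
look roughly like the sketch below; the hostnames and ports are only
placeholders, and the exact promxy schema should be checked against its
documentation before deployment:

```yaml
# Hypothetical promxy sketch: expose both Prometheus servers behind a
# single query endpoint that Grafana can use as one data source.
promxy:
  server_groups:
    # short retention, frequent scrapes
    - static_configs:
        - targets: ['prometheus1.torproject.org:9090']
    # long retention, lower scrape interval
    - static_configs:
        - targets: ['prometheus3.torproject.org:9090']
```
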
# Challenges
@@ -1098,6 +1132,8 @@ TODO: how to merge prom2 into prom1
TODO: name each server according to retention? say mon-short-01 and
the other mon-long-02?
TODO: nagios vs icinga
# Alternatives considered
## Flap detection
@@ -1149,6 +1185,28 @@ anyway.
If this becomes a problem over time, the setup *could* be expanded to
such a stage, but it feels superfluous for now.
## Progressive conversion timeline
We originally wrote this timeline, a long time ago, when we had more
time to do the conversion:
* deploy Alertmanager on prometheus1
* reimplement the Nagios alerting commands (optional?)
* send Nagios alerts through the alertmanager (optional?)
* rewrite (non-NRPE) commands (9) as Prometheus alerts
* scrape the NRPE metrics from Prometheus (optional)
* create a dashboard and/or alerts for the NRPE metrics (optional)
* review the NRPE commands (300+) to see which ones to rewrite as Prometheus alerts
* turn off the Icinga server
* remove all traces of NRPE on all nodes
In that abandoned approach, we would have progressively migrated from
Nagios to Prometheus by scraping Nagios from Prometheus. The
progressive nature allowed for a possible rollback in case we couldn't
make things work in Prometheus. It was ultimately abandoned because it
seemed to take more time, and we had mostly decided to do the migration
anyway, without the need for a rollback.
## Other dashboards
### Grafana