Commit 117fef9e (verified), authored 10 months ago by anarcat

tpa-rfc-33: draft timeline (team#40755)

parent 3bc1d6c5
policy/tpa-rfc-33-monitoring.md (+79 additions, −21 deletions)
...
...
@@ -739,6 +739,10 @@ kept as an implementation detail to be researched later.

[Thanos is not packaged in Debian](https://bugs.debian.org/1032842), which
would probably mean deploying it with a container.

There are other proxies too, like [promxy](https://github.com/jacksontj/promxy)
and [trickster](https://trickstercache.org/), which might be easier to deploy
because their scope is more limited than Thanos, but neither is packaged in
Debian either.
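To make the proxy idea more concrete, here is a minimal sketch of a promxy
configuration fragment that would present several Prometheus servers behind a
single query endpoint for Grafana. This is an assumption, not a decision: the
hostnames are placeholders, not actual TPA server names.

```yaml
# promxy.yml fragment -- hypothetical sketch, hostnames are placeholders
promxy:
  server_groups:
    # servers in the same group are treated as replicas: promxy queries
    # them all and merges/deduplicates the results into a single view
    - static_configs:
        - targets:
            - 'prometheus1.example.org:9090'
            - 'prometheus3.example.org:9090'
    # a separate server group could be added for a long-term retention
    # server holding different data
```

Grafana would then be pointed at promxy instead of an individual Prometheus
server as its data source.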
### Self-monitoring

Prometheus should monitor itself and its [Alertmanager][] for outages,
...
...
@@ -1065,31 +1069,61 @@ operators for open issues, but we do not believe this is necessary.
## Timeline
We will deploy this in three phases:

* Phase A: short-term conversion to retire Icinga, to avoid running
  buster out of support for too long
* Phase B: mid-term work to expand the number of exporters, high
  availability configuration
* Phase C: further exporter and metrics expansion, long-term metrics
  storage
TODO: put actual dates in there, estimates?
### Phase A: emergency Nagios retirement
In this phase we prioritize emergency work to replace core components
of the Nagios server, so the machine can be retired.
The tasks required in this phase are:
* LDAP web password addition
* new authentication deployment on prometheus1
* deploy Alertmanager and email notifications on prometheus1 (see the
  configuration sketch after this list)
* deploy alertmanager-irc-relay on prometheus1
* deploy Karma on prometheus1
* priority A metrics and alerts deployment (an example alert rule is
  sketched below)
* Icinga server retirement
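To make the Alertmanager and notification items above more concrete, here is a
minimal configuration sketch. It is an assumption-laden illustration, not the
actual TPA configuration: the sender, destination address, relay port and
channel name are placeholders, and IRC delivery is assumed to go through
alertmanager-irc-relay's webhook endpoint.

```yaml
# alertmanager.yml -- hypothetical sketch, addresses are placeholders
global:
  smtp_smarthost: 'localhost:25'
  smtp_require_tls: false   # assuming a plain local MTA
  smtp_from: 'alertmanager@prometheus1.example.org'

route:
  receiver: tpa-email
  group_by: ['alertname', 'instance']
  routes:
    # also send critical alerts to IRC, through alertmanager-irc-relay
    - receiver: tpa-irc
      match:
        severity: critical
      continue: true

receivers:
  - name: tpa-email
    email_configs:
      - to: 'tpa-alerts@example.org'   # placeholder destination
  - name: tpa-irc
    webhook_configs:
      # alertmanager-irc-relay receives Alertmanager webhooks; the
      # port and channel path here are assumptions
      - url: 'http://localhost:8000/example-channel'
```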
TODO: multiple stages; emergency buster retirement, then alerting
improvements, then HA, then long term retention
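As an example of what a "priority A" alert could look like, here is a minimal
Prometheus alerting rule that fires when a scrape target disappears; the group
name, duration and labels are placeholders rather than the actual TPA rules.

```yaml
# priority-a.rules.yml -- hypothetical example rule
groups:
  - name: priority-a
    rules:
      - alert: JobDown
        # fires when a scrape target has been unreachable for 5 minutes
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is down"
```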
### Phase B: more exporters

The current prometheus1/prometheus2 servers may actually be retired in
favor of two *new* servers rebuilt from scratch, entirely from Puppet,
LDAP, and the GitLab repository, ensuring they are properly
reproducible.
In this phase, we integrate more exporters and services into the
infrastructure, which includes merging in the second Prometheus server
used by the service admins.

Experiments can be done manually on the current servers to speed up
development and replacement of the legacy infrastructure, but the goal
is to merge the two current servers into a single cluster. This might
also be accomplished by retiring one of the two servers and migrating
everything onto the other.
We *may* retire the existing servers and build two new servers instead,
but the more likely outcome is to progressively integrate the targets
and alerting rules from prometheus2 into prometheus1 and then
eventually retire prometheus2, rebuilding a copy of prometheus1.
TODO: how to merge prom2 into prom1
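One possible interim mechanism for that merge, sketched here as an assumption
rather than a decision, is to have prometheus1 pull prometheus2's data through
the standard `/federate` endpoint until all targets and rules have been moved
over; the hostname and match expression below are placeholders.

```yaml
# prometheus.yml fragment on prometheus1 -- hypothetical interim setup
scrape_configs:
  - job_name: 'federate-prometheus2'
    honor_labels: true        # keep the original job/instance labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'       # pull all series; narrow this in practice
    static_configs:
      - targets:
          - 'prometheus2.example.org:9090'   # placeholder hostname
```

Once every target is scraped directly by prometheus1, the federation job and
prometheus2 itself could be retired.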
The tasks required in this phase are:

* prometheus2 merged into prometheus1
* priority B metrics and alerts deployment
### Phase C: high availability, long-term metrics, other exporters

At this point, the vast majority of checks have been converted to
Prometheus and we have reached feature parity. We are now looking at
"nice to have" improvements:
* prometheus3 server built for high availability
* GitLab alert integration
* long-term metrics: higher retention, lower scrape frequency on the
  secondary server (see the configuration sketch after this list)
* additional proxy setup as data source for Grafana (promxy or Thanos)
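To illustrate the long-term metrics item above, here is a sketch of how the
secondary server's configuration could differ from the primary's; the interval
and retention values are assumptions, not decisions.

```yaml
# prometheus.yml fragment on the long-term server -- hypothetical values
global:
  # scrape less often than the primary; intervals much longer than
  # Prometheus' 5 minute staleness window need extra care
  scrape_interval: 2m
  evaluation_interval: 2m
```

On top of that, retention would be raised from the 15 day default with the
`--storage.tsdb.retention.time` flag (for example
`--storage.tsdb.retention.time=5y`), which on Debian typically goes into the
`ARGS` variable in `/etc/default/prometheus`.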
# Challenges
...
...
@@ -1098,6 +1132,8 @@ TODO: how to merge prom2 into prom1
TODO: name each server according to retention? say mon-short-01 and
the other mon-long-02?
TODO: nagios vs icinga
# Alternatives considered
## Flap detection
...
...
@@ -1149,6 +1185,28 @@ anyway.
If this becomes a problem over time, the setup *could* be expanded to
such a stage, but it feels superfluous for now.
## Progressive conversion timeline
We originally wrote this timeline, back when we had more time to do the
conversion:
* deploy Alertmanager on prometheus1
* reimplement the Nagios alerting commands (optional?)
* send Nagios alerts through the alertmanager (optional?)
* rewrite (non-NRPE) commands (9) as Prometheus alerts
* scrape the NRPE metrics from Prometheus (optional)
* create a dashboard and/or alerts for the NRPE metrics (optional)
* review the NRPE commands (300+) to see which ones to rewrite as
  Prometheus alerts
* turn off the Icinga server
* remove all traces of NRPE on all nodes
In that abandoned approach, we would have progressively migrated from
Nagios to Prometheus by scraping Nagios from Prometheus. The
progressive nature allowed for a possible rollback in case we couldn't
make things work in Prometheus. This was ultimately abandoned because
it seemed like it would take more time, and we had mostly decided to do
the migration anyway, without the need for a rollback.
## Other dashboards
### Grafana
...
...