Verified Commit 79e1b7b7 authored by lelutin's avatar lelutin

New playbook for when a host is unreachable

We're splitting the alert for hosts being unreachable / unresponsive so
that we can have a better severity level and instructions for how to
handle this case.

Unreachable hosts can be caused by a number of things so it's important
to systematically investigate what's happening.
parent 1e3cd286
Pipeline #229906 passed with warnings
......@@ -1948,6 +1948,49 @@ exporter configuration. Look in `tor-puppet.git`, the
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
### Host reported as unreachable
Servers can stop responding for several different reasons. To bring a host
back online as soon as possible, we need to identify what's preventing it from
responding and then address that problem.
#### Node exporter stopped or crashed
See the section about [Job Down errors](#exporter-job-down-warnings).
#### Can't connect with SSH
This situation can be caused by:
1. a network outage at our provider
2. RAM and swap being full
3. the host being offline
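Before reaching for a console, a quick TCP probe can help tell a crashed
exporter apart from a host that is really unreachable. The following is only a
sketch under stated assumptions, not part of our tooling: the hostname is a
placeholder, port 22 assumes sshd listens on its default port, port 9100
assumes the node exporter's default port, and the `port_open` and `triage`
helpers are hypothetical names.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def triage(ssh_ok, exporter_ok):
    """Map the two probe results to a suggested next step."""
    if ssh_ok and exporter_ok:
        return "host answers on both ports; the alert may be stale"
    if ssh_ok:
        return "host is up but the exporter port is closed; see Job Down errors"
    if exporter_ok:
        return "exporter answers but SSH does not; inspect sshd from the console"
    return "no TCP response; use the console and check the provider's status"


if __name__ == "__main__":
    host = "host.example.org"  # placeholder, replace with the alerting host
    # 22 and 9100 are assumed defaults; adjust if the host deviates from them.
    print(triage(port_open(host, 22), port_open(host, 9100)))
```

A failed probe from a single vantage point does not prove the host is down, so
it is worth repeating it from a second machine before concluding anything.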
If the host in question is a VM in our clusters, try to reach the ganeti node
that hosts the instance. From there you can use the [serial
console](howto/ganeti#accessing-serial-console) to identify what's happening
with the instance.
If the host is _not_ an instance in our ganeti clusters, use the console
offered on the corresponding provider's site.
If the machine is running but the network is unreachable, check with our
hosting provider whether there are any known network issues and, if not, open
a support ticket with them.
If the machine is running but you have difficulty even logging into the TTY, try
to figure out what's happening. For example, look for messages on the console
about processes getting OOM-killed, disk failures, kernel panics or other
critical problems.
Once you have some information, if the errors you see are not related to disk
failures, you'll want to forcefully restart the machine, either with ganeti if
it's an instance or with the help of the hosting provider's website or support.
If the errors are related to disk failures, you'll want to enlist the help of
our hosting provider to get a disk replacement and fix any RAID arrays that are
now degraded.
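The decision described above (disk-related errors lead to a disk replacement,
other critical errors lead to a forced restart) can be sketched as a small
classifier over lines copied from the console. This is an illustration only:
the pattern lists are assumptions based on common Linux kernel message
fragments, they are not exhaustive, and `suggest_action` is a hypothetical
helper rather than an existing tool.

```python
import re

# Assumed fragments of disk-related kernel messages; extend as needed.
DISK_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"I/O error", r"Disk failure", r"medium error")
]

# Assumed fragments of other critical kernel messages.
OTHER_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"Out of memory", r"oom-kill", r"Kernel panic")
]


def suggest_action(console_lines):
    """Return a suggested next step for a list of console message lines."""
    lines = list(console_lines)
    # Disk problems take priority: a restart will not fix failing hardware.
    if any(p.search(line) for line in lines for p in DISK_PATTERNS):
        return "replace-disk"
    if any(p.search(line) for line in lines for p in OTHER_PATTERNS):
        return "force-restart"
    # Nothing recognised: keep investigating before acting.
    return "investigate"
```

Checking the disk patterns first mirrors the order of the text: a forced
restart is only suggested once disk failures have been ruled out.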
### Apache exporter scraping failed
If you get the error `Apache Exporter cannot monitor web server on
......