Merge some information I had added to prometheus into incident-response
Authored by lelutin
There's a section about finding information in Nagios which I don't know
how to translate to prometheus, so I left it there, but we'll want to
remove and/or review that.
@@ -1948,49 +1948,6 @@ exporter configuration. Look in `tor-puppet.git`, the
`hiera/common/prometheus.yaml`, where credentials should be defined
(although they should actually be stored in Trocla).
### Host reported as unreachable
A server that stops responding can do so for several different reasons. To bring
it back online as soon as possible, we need to identify what's preventing it
from responding and then address that problem.
#### Node exporter stopped or crashed
See the section about [Job Down errors](#exporter-job-down-warnings).
#### Can't connect with SSH
This situation can be caused by:
1. a network outage at our provider
2. RAM and swap being full
3. the host being offline
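
As a rough first check before reaching for a console, the causes above can sometimes be narrowed down by looking at how a plain TCP connection to the SSH port behaves. The following is only a minimal sketch: the function name `probe_ssh`, the result labels and the `HINTS` table are hypothetical and not part of our tooling, and the hints are heuristics rather than a diagnosis.

```python
import socket

# Hypothetical result labels with a heuristic hint for each one.
HINTS = {
    "ssh-banner": "sshd answers; if logins still hang, suspect full RAM and swap",
    "connected-no-banner": "TCP connects but sshd stays silent; host may be overloaded",
    "refused": "host is up but nothing is listening on the port",
    "timeout": "no answer; possible network outage or host offline",
    "error": "lookup or routing error; check DNS and the network path",
}


def probe_ssh(host, port=22, timeout=5.0):
    """Classify how a TCP connection to the SSH port behaves."""
    try:
        # The timeout applies to the connection and to the later recv().
        with socket.create_connection((host, port), timeout=timeout) as sock:
            try:
                # An SSH server sends its identification string first.
                banner = sock.recv(64)
            except OSError:
                # Connected, but the read timed out or was reset.
                return "connected-no-banner"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError:
        return "error"
    if banner.startswith(b"SSH-"):
        return "ssh-banner"
    return "connected-no-banner"
```

For example, `HINTS[probe_ssh("host.example.org")]` (with a placeholder hostname) returns the hint for that host. A `timeout` result does not distinguish a provider outage from a powered-off machine, so the console checks described below are still needed.
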
If the host in question is a VM in our clusters, try to reach the ganeti node
containing the instance. From there you can use the [serial
console](howto/ganeti#accessing-serial-console) to identify what's happening
with the instance.
If the host is _not_ an instance in our ganeti clusters, then use the console
available on the corresponding provider's site.
If the machine is running but the network is unreachable, check with our
hosting provider whether there are any known network issues. If there are none,
open a support ticket with them.
If the machine is running but you have difficulty even logging into the TTY, try
to figure out what's happening. For example, check whether the console shows
messages about processes getting OOM-killed, disk failures, kernel panics or
other critical problems.
Once you have some information, if the errors you see are not related to disk
failures, you'll want to forcefully restart the machine, either with ganeti if
it's an instance or with the help of the hosting provider's website or support.
If the errors are related to disk failures, you'll want to enlist the help of
our hosting providers to get a disk replacement and fix any RAID arrays that are
now degraded.
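
The triage described above (read the console, then either restart or ask for a disk replacement) can be sketched as a small classifier over console lines. This is an illustration only: the names `PATTERNS`, `classify` and `next_action` are hypothetical, and the regular expressions are assumptions based on common Linux kernel messages, whose exact wording varies between kernel versions and drivers.

```python
import re

# Assumed message fragments; adjust to what the console actually prints.
PATTERNS = {
    "oom": re.compile(r"Out of memory|oom-kill", re.IGNORECASE),
    "disk": re.compile(r"I/O error|Disk failure|medium error", re.IGNORECASE),
    "panic": re.compile(r"Kernel panic", re.IGNORECASE),
}


def classify(lines):
    """Return the set of problem categories seen in the console lines."""
    found = set()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                found.add(name)
    return found


def next_action(found):
    """Map the categories to the next step described in this section."""
    if "disk" in found:
        # Disk problems take priority over a restart.
        return "ask the hosting provider for a disk replacement, then fix RAID"
    if found:
        return "force a restart with ganeti or through the hosting provider"
    # Assumption: with no recognised message, keep investigating.
    return "no known pattern matched; keep reading the console"
```

Disk failures are checked first because, per the text above, a forced restart is only the right move when the errors are not disk related.
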
### Apache exporter scraping failed
If you get the error `Apache Exporter cannot monitor web server on
......
......