diff --git a/service/prometheus.md b/service/prometheus.md
index 66af83a17e0bb4a5dfcbbfb4cbfeb8b481ce314d..cf901686690a4e25b3804d2a6019c877c7894cef 100644
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -1948,6 +1948,49 @@ exporter configuration.
 Look in `tor-puppet.git`, the `hiera/common/prometheus.yaml`, where credentials
 should be defined (although they should actually be stored in Trocla).
 
+### Host reported as unreachable
+
+A server can stop responding for several different reasons. To bring it
+back online as soon as possible, we need to identify what's preventing it from
+responding and then address that problem.
+
+#### Node exporter stopped or crashed
+
+See the section about [Job Down errors](#exporter-job-down-warnings).
+
+#### Can't connect with SSH
+
+This situation can be caused by:
+
+1. a network outage at our provider
+2. RAM and swap being full
+3. the host being offline
+
+If the host in question is a VM in our clusters, try to reach the ganeti node
+containing the instance. From there you can use the [serial
+console](howto/ganeti#accessing-serial-console) to identify what's happening
+with the instance.
+
+If the host is _not_ an instance in our ganeti clusters, then access the
+console through the corresponding provider's site.
+
+If the machine is running but the network is unreachable, check whether our
+hosting provider has reported any ongoing network issues and, if not, open
+a support ticket with them.
+
+If the machine is running but you have difficulty even logging into the TTY, try
+to figure out what's happening, for example by looking for console messages
+about processes getting OOM-killed, disk failures, kernel panics or other
+critical problems.
+
+Once you have some information, if the errors you see are not related to disk
+failures, you'll want to forcefully restart the machine, either with ganeti if
+it's an instance or with the help of the hosting provider's website or support.
+
+If the errors are related to disk failures, you'll want to enlist the help of
+our hosting providers to get a disk replacement and fix any RAID arrays that are
+now degraded.
+
 ### Apache exporter scraping failed
 
 If you get the error `Apache Exporter cannot monitor web server on