From 79e1b7b759409b93e4cdbd9b5090fe2952602a69 Mon Sep 17 00:00:00 2001
From: Gabriel Filion <lelutin@torproject.org>
Date: Thu, 5 Dec 2024 16:36:52 -0500
Subject: [PATCH] New playbook for when a host is unreachable

We're splitting the alert for hosts being unreachable / unresponsive so
that we can have a better severity level and instructions for how to
handle this case.

Unreachable hosts can be caused by a number of things so it's important
to systematically investigate what's happening.
---
 service/prometheus.md | 43 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/service/prometheus.md b/service/prometheus.md
index 66af83a1..cf901686 100644
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -1948,6 +1948,49 @@ exporter configuration. Look in `tor-puppet.git`, the
 `hiera/common/prometheus.yaml`, where credentials should be defined
 (although they should actually be stored in Trocla).
 
+### Host reported as unreachable
+
+A server can stop responding for several different reasons. To bring it
+back online as soon as possible, we need to identify what's preventing it from
+responding and then address that problem.
+
+#### Node exporter stopped or crashed
+
+See the section about [Job Down errors](#exporter-job-down-warnings).
+
+#### Can't connect with SSH
+
+This situation can be caused by:
+
+1. a network outage at our provider
+2. RAM and swap being full
+3. the host being offline
+
+If the host in question is a VM in our clusters, try to reach the ganeti node
+containing the instance. From there you can use the [serial
+console](howto/ganeti#accessing-serial-console) to identify what's happening
+with the instance.
+
+If the host is _not_ an instance in our ganeti clusters, then access the
+console through the corresponding provider's site.
+
+If the machine is running but the network is unreachable, check with our
+hosting provider whether there are any known network issues and, if not, open
+a support ticket with them.
+
+If the machine is running but you have difficulty even logging into the TTY, try
+to figure out what's happening: look for messages on the console about
+processes getting OOM-killed, disk failures, kernel panics or other
+critical problems.
+
+Once you have some information, if the errors you see are not related to disk
+failures, you'll want to forcefully restart the machine, either with ganeti if
+it's an instance or with the help of the hosting provider's website or support.
+
+If the errors are related to disk failures, you'll want to enlist the help of
+our hosting providers to get a disk replacement and fix any RAID arrays that are
+now degraded.
+
 ### Apache exporter scraping failed
 
 If you get the error `Apache Exporter cannot monitor web server on
-- 
GitLab