From 75c9d9b9c4e284729bdc590271e29fe2e38839c5 Mon Sep 17 00:00:00 2001
From: Gabriel Filion <lelutin@torproject.org>
Date: Thu, 5 Dec 2024 17:42:50 -0500
Subject: [PATCH] Merge some information I had added to prometheus into
 incident-response

There's a section about finding information in Nagios that I don't know
how to translate to prometheus, so I left it in place, but we'll want to
review and/or remove it.
---
 howto/incident-response.md | 40 +++++++++++++++++++++++++++--------
 service/prometheus.md      | 43 --------------------------------------
 2 files changed, 31 insertions(+), 52 deletions(-)

diff --git a/howto/incident-response.md b/howto/incident-response.md
index 98ebbb0a..f7fb0477 100644
--- a/howto/incident-response.md
+++ b/howto/incident-response.md
@@ -14,25 +14,47 @@ to wake up someone that can deal with them. See the
 
 ## Server down
 
-If a server is non-responsive, you can first check if it is actually
-reachable over the network:
+If a server is reported as non-responsive, this can be caused by:
 
-    ping -c 10 server.torproject.org
+1. a network outage at our provider
+   * sometimes the outage only affects the network path between two of our
+     providers, so make sure to test network reachability from more than one
+     place on the internet.
+2. RAM and swap being full
+3. the host being offline or crashed
 
-If it does respond, you can try to diagnose the issue by looking at
-[Nagios][] and/or [Grafana](https://grafana.torproject.org) and analyse what, exactly is going on.
+You can first check if it is actually reachable over the network:
 
-[Nagios]: https://nagios.torproject.org
+    ping -4 -c 10 server.torproject.org
+    ping -6 -c 10 server.torproject.org
+    ssh server.torproject.org
+
+If it does respond from at least one point on the internet, you can try to
+diagnose the issue by looking at [prometheus][] and/or [Grafana][] and analyse
+what, exactly, is going on. If you're lucky enough to have SSH access, you can
+dive deeper into the logs and systemd unit status: for example, it might just
+be that the node exporter has crashed.
+
+[prometheus]: https://prometheus.torproject.org
+[Grafana]: https://grafana.torproject.org
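+
+If SSH works, a quick first check is the exporter's systemd unit and its recent
+logs. This is a minimal sketch, assuming the unit is named
+`prometheus-node-exporter` as in the Debian packaging (adjust if the exporter
+is deployed under a different name):
+
+    # is the node exporter running? when did it last restart or crash?
+    systemctl status prometheus-node-exporter
+    # recent log lines from the exporter, e.g. to spot a crash loop
+    journalctl -u prometheus-node-exporter --since "1 hour ago"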
 
 If the host does *not* respond, you should see if it's a virtual
 machine, and in this case, which server is hosting it. This
 information is available in [howto/ldap](howto/ldap) (or [the web
-interface](https://db.torproject.org/machines.cgi), under the `physicalHost` field). Then login to that
-server to diagnose this issue.
+interface](https://db.torproject.org/machines.cgi), under the `physicalHost`
+field). Then log in to that server to diagnose the issue. If the physical host
+is a ganeti node, you can use the [serial console](howto/ganeti#accessing-serial-console);
+if it's not a ganeti node, you can try to access the console on the hosting
+provider's website.
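+
+As a rough sketch (assuming the instance is named `server.torproject.org` and
+that you run this on the ganeti cluster's master node; see
+[howto/ganeti](howto/ganeti#accessing-serial-console) for the authoritative
+procedure):
+
+    # find which node currently hosts the instance
+    gnt-instance list server.torproject.org
+    # attach to the instance's serial console from the master node
+    gnt-instance console server.torproject.org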
+
+Once you have access to the console, look out for signs of errors such as OOM
+kills, disk failures, kernel panics or network-related errors. If you're still
+able to log in and investigate, you might be able to bring the machine back
+online. Otherwise, see the subsections below for how to perform hard resets.
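+
+If you can still log in, that inspection could look like the following minimal
+sketch (assuming the systemd journal is available; the exact messages will
+vary):
+
+    # kernel-level errors since the last boot (OOM kills, I/O errors, panics)
+    journalctl -k -p err -b
+    # or, without the journal, scan the kernel ring buffer directly
+    dmesg -T | grep -iE 'out of memory|i/o error|panic'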
 
 If the physical host is not responding or is empty (in which case it
 *is* a physical host), you need to file a ticket with the upstream
-provider. This information is available in [Nagios][]:
+provider. This information is available in Nagios:
 
  1. search for the server name in the search box
  2. click on the server
diff --git a/service/prometheus.md b/service/prometheus.md
index cf901686..66af83a1 100644
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -1948,49 +1948,6 @@ exporter configuration. Look in `tor-puppet.git`, the
 `hiera/common/prometheus.yaml`, where credentials should be defined
 (although they should actually be stored in Trocla).
 
-### Host reported as unreachable
-
-Servers that stop responding can have multiple different causes. To bring it
-back online as soon as possible, we need to identify what's preventing it from
-responding and then to address that problem.
-
-#### Node exporter stopped or crashed
-
-See the section about [Job Down errors](#exporter-job-down-warnings)
-
-#### Can't connect with SSH
-
-This situation can be caused by:
-
-1. a network outage at our provider
-2. RAM and swap being full
-3. the host being offline
-
-If the host in question is a VM in our clusters try to reach the ganeti node
-containing the instance. From there you can use the [serial
-console](howto/ganeti#accessing-serial-console) to identify what's happening
-with the instance.
-
-If the host is _not_ an instance in our ganeti clusters, then reach out for the
-console at the corresponding provider's site.
-
-If the machine is running but the network is unreachable, check with our
-hosting provider if any known network issues are currently known and if not open
-a support ticket with them.
-
-If the machine is running but you have difficulty even logging into the TTY, try
-and figure out what's happening like if you can see messages on the console
-about processes getting OOM-Killed, disk failures, kernel panics or other
-critical problems.
-
-Once you have some information, if the errors you see are not related to disk
-failures you'll want to forcefully restart the machine, either with ganeti if
-it's an instance or with the help of the hosting provider's website or support.
-
-If the errors are related to disk failures, you'll want to enlist the help of
-our hosting providers to get a disk replacement and fix any RAID arrays that are
-now degraded.
-
 ### Apache exporter scraping failed
 
 If you get the error `Apache Exporter cannot monitor web server on
-- 
GitLab