diff --git a/howto/incident-response.md b/howto/incident-response.md
index 98ebbb0a3618bda35407dd510ccbedc55f4cb8ed..f7fb04774949444c567d2ba92a13eb9377cd2f85 100644
--- a/howto/incident-response.md
+++ b/howto/incident-response.md
@@ -14,25 +14,47 @@ to wake up someone that can deal with them. See the
 ## Server down
 
-If a server is non-responsive, you can first check if it is actually
-reachable over the network:
+If a server is reported as non-responsive, this situation can be caused by:
 
-    ping -c 10 server.torproject.org
+1. a network outage at our provider
+   * sometimes the network outage happens between two of our providers, so
+     make sure to test network reachability from more than one place on the
+     internet.
+2. RAM and swap being full
+3. the host being offline or crashed
 
-If it does respond, you can try to diagnose the issue by looking at
-[Nagios][] and/or [Grafana](https://grafana.torproject.org) and analyse what, exactly is going on.
+You can first check if it is actually reachable over the network:
 
-[Nagios]: https://nagios.torproject.org
+    ping -4 -c 10 server.torproject.org
+    ping -6 -c 10 server.torproject.org
+    ssh server.torproject.org
+
+If it does respond from at least one point on the internet, you can try to
+diagnose the issue by looking at [prometheus][] and/or [Grafana][] and analyse
+what, exactly, is going on. If you're lucky enough to have SSH access, you can
+dive deeper into the logs and systemd unit status: for example, it might just
+be that the node exporter has crashed.
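+
+For example, assuming the exporter runs under Debian's default
+`prometheus-node-exporter` unit name, the following shows the unit status and
+its recent logs:
+
+    # unit name below assumes the Debian prometheus-node-exporter package
+    systemctl status prometheus-node-exporter
+    journalctl -eu prometheus-node-exporter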
+
+[prometheus]: https://prometheus.torproject.org
+[Grafana]: https://grafana.torproject.org
 
 If the host does *not* respond, you should see if it's a virtual
 machine, and in this case, which server is hosting it. This
 information is available in [howto/ldap](howto/ldap) (or [the web
-interface](https://db.torproject.org/machines.cgi), under the `physicalHost` field). Then login to that
-server to diagnose this issue.
+interface](https://db.torproject.org/machines.cgi), under the `physicalHost`
+field). Then log in to that server to diagnose this issue. If the physical host
+is a ganeti node, you can use the [serial console](howto/ganeti#accessing-serial-console);
+if it's not a ganeti node, you can try to access the console on the hosting
+provider's web site.
+
+Once you have access to the console, look out for signs of errors such as
+OOM kills, disk failures, kernel panics, or network-related errors. If you're
+still able to log in and investigate, you might be able to bring the machine
+back online. Otherwise, see the subsections below for how to perform hard
+resets.
 
 If the physical host is not responding or is empty (in which case it
 *is* a physical host), you need to file a ticket with the upstream
-provider. This information is available in [Nagios][]:
+provider. This information is available in Nagios:
 
 1. search for the server name in the search box
 2. click on the server
diff --git a/service/prometheus.md b/service/prometheus.md
index cf901686690a4e25b3804d2a6019c877c7894cef..66af83a17e0bb4a5dfcbbfb4cbfeb8b481ce314d 100644
--- a/service/prometheus.md
+++ b/service/prometheus.md
@@ -1948,49 +1948,6 @@ exporter configuration. Look in `tor-puppet.git`, the
 `hiera/common/prometheus.yaml`, where credentials should be defined
 (although they should actually be stored in Trocla).
 
-### Host reported as unreachable
-
-Servers that stop responding can have multiple different causes. To bring it
-back online as soon as possible, we need to identify what's preventing it from
-responding and then to address that problem.
-
-#### Node exporter stopped or crashed
-
-See the section about [Job Down errors](#exporter-job-down-warnings)
-
-#### Can't connect with SSH
-
-This situation can be caused by:
-
-1. a network outage at our provider
-2. RAM and swap being full
-3. the host being offline
-
-If the host in question is a VM in our clusters try to reach the ganeti node
-containing the instance. From there you can use the [serial
-console](howto/ganeti#accessing-serial-console) to identify what's happening
-with the instance.
-
-If the host is _not_ an instance in our ganeti clusters, then reach out for the
-console at the corresponding provider's site.
-
-If the machine is running but the network is unreachable, check with our
-hosting provider if any known network issues are currently known and if not open
-a support ticket with them.
-
-If the machine is running but you have difficulty even logging into the TTY, try
-and figure out what's happening like if you can see messages on the console
-about processes getting OOM-Killed, disk failures, kernel panics or other
-critical problems.
-
-Once you have some information, if the errors you see are not related to disk
-failures you'll want to forcefully restart the machine, either with ganeti if
-it's an instance or with the help of the hosting provider's website or support.
-
-If the errors are related to disk failures, you'll want to enlist the help of
-our hosting providers to get a disk replacement and fix any RAID arrays that are
-now degraded.
-
 ### Apache exporter scraping failed
 
 If you get the error `Apache Exporter cannot monitor web server on