Nagios/Icinga service for Tor Project infrastructure

[[_TOC_]]

# How-to

## Getting status updates

- Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
- On IRC: /j #tor-nagios
- Over email: Add your email address to `tor-nagios/config/static/objects/contacts.cfg`

## How to run a Nagios check manually on a host (TARGET.tpo)

Run the following from a checkout of the `tor-nagios` repository, replacing `THE-SERVICE-TEXT-FROM-WEB` with the service text shown in the web interface:

    NCHECKFILE=$(grep -E -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
    NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
    # NCMD is the command that's being run. If it looks sane, run it. With --verbose if you like more output.
    ssh -t TARGET.tpo "$NCMD" --verbose

## Changing the Nagios configuration

Hosts and services are managed in the `config/nagios-master.cfg` YAML configuration file, kept in the `nagiosadm@nagios.torproject.org:/home/nagiosadm/tor-nagios` repository. Make changes with a normal text editor, then commit and push:

    $EDITOR config/nagios-master.cfg
    git commit -a
    git push

Carefully watch the output of the `git push` command! If there is an error, your changes won't show up, even though the commit is still accepted.

## Forcing a rebuild of the configuration

If the Nagios configuration seems out of sync with the YAML config, a rebuild of the configuration can be forced with this command on the Nagios server:

    touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config

Alternatively, changing the `.cfg` file and pushing a new commit should trigger this as well.

## Batch jobs

You can run batch commands from the web interface, thanks to Icinga's changes to the UI. There is also a command-line client called [icli](https://tracker.debian.org/pkg/icli), which can do the same from a shell on the Icinga server.
This, for example, will queue recheck jobs on all problem hosts:

    icli -z '!o,!A,!S,!D' -a recheck

This will run the `dsa-update-apt-status` command on all problem hosts:

    cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep '^[a-z]' | sed 's/$/.torproject.org or/') false" dsa-update-apt-status

It's kind of an awful hack -- take some time to appreciate the quoting required for those `!` -- which might not be necessary with later Icinga releases. Icinga 2 has a [REST API](https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/) and its own [command line console](https://icinga.com/docs/icinga-2/latest/doc/11-cli-commands/#cli-command-console), which make `icli` completely obsolete.

## Adding a new admin user

When a user needs to be added to the admin group, follow the steps below in the `tor-nagios.git` repository:

1. Create a new contact for the user in `config/static/objects/contacts.cfg`:

   ```
   define contact{
          contact_name                    <username>
          alias                           <username>
          service_notification_period     24x7
          host_notification_period        24x7
          service_notification_options    w,u,c,r
          host_notification_options       d,r
          service_notification_commands   notify-service-by-email
          host_notification_commands      notify-host-by-email
          email                           <email>+nagios@torproject.org
   }
   ```

2. Add the user to `authorized_for_full_command_resolution` and `authorized_for_configuration_information` in `config/static/cgi.cfg`:

   ```
   authorized_for_full_command_resolution=user1,foo,bar,<new user>
   authorized_for_configuration_information=user1,foo,bar,<new user>
   ```

## Pager playbook

### What is this alert anyways?

Say you receive a mysterious alert and you have no idea what it's about. Take, for example, [tpo/tpa/team#40795](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40795):

    09:35:23 <nsa> tor-nagios: [gettor-01] application service - gettor status is CRITICAL: 2: b[AUTHENTICATIONFAILED] Invalid credentials (Failure)

To figure out what triggered this error, follow this procedure:

1. log into the Nagios web interface at https://nagios.torproject.org
2. find the broken service, for example by listing all [unhandled problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems)
3. click on the actual service name to see details
4. find the "executed command" field and click on "Command Expander"
5. this will show you the "Raw commandline" that Nagios runs to do this check; in this case it is an NRPE check that calls `tor_application_service` on the other end
6. if it's an NRPE check, log on to the remote host and run the command there; otherwise, the command is run on the Nagios host

In this case, the error can be reproduced with:

    root@gettor-01:~# /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
    2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'

Here, the status file appears to be under the control of the service administrator, who should be contacted for follow-up.

# Reference

## Design

### Config generation

The Nagios/Icinga configuration gets generated from the `config/nagios-master.cfg` YAML configuration file stored in the `tor-nagios.git` repository. The generation works like this:

1. operator pushes changes to the git repository on the Nagios server (in `/home/nagiosadm/tor-nagios`)
2. the `post-receive` hook calls `make` in the `config` sub-directory, which calls `./build-nagios` to generate the files in `~/tor-nagios/config/generated/`
3. the hook then calls `make install`, which:
4. deploys the config file (using `rsync`) in `/etc/icinga/from-git`...
5. pushes the NRPE config to the [Puppet server](puppet) in `nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg`
6. reloads Icinga
7. and finally mirrors the repository to GitLab (<https://gitlab.torproject.org/tpo/tpa/tor-nagios>)
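
The flow above can be approximated by a hook along the following lines. This is a minimal sketch, not the actual hook shipped in the `tor-nagios` repository: the `run` helper, the `rebuild_config` function name, and the `REPO` and `DRY_RUN` variables are assumptions introduced for illustration. Only the two `make` invocations and the repository path come from the description above. By default it only prints what it would do.

```shell
#!/bin/sh
# Hypothetical post-receive sketch; the real hook in tor-nagios may differ.
set -e

# Repository path taken from the text; overridable for testing (assumption).
REPO="${REPO:-/home/nagiosadm/tor-nagios}"
# Assumption: default to a dry run so the sketch only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

rebuild_config() {
    # Step 2: generate the files under config/generated/ (via ./build-nagios).
    run make -C "$REPO/config"
    # Steps 3-7: deploy with rsync, push the NRPE config, reload Icinga,
    # and mirror to GitLab, all driven by the install target.
    run make -C "$REPO/config" install
}

rebuild_config
```

Setting `DRY_RUN=0` would make the sketch execute the two `make` targets instead of printing them, which is roughly what the forced rebuild command in the how-to section does by hand for the first target.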