Nagios/Icinga service for Tor Project infrastructure
NOTE: the Nagios server was retired in 2024. This documentation is kept for historical reference only; see TPA-RFC-33.
How-to
Getting status updates
- Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
- On IRC: /j #tor-nagios
- Over email: Add your email address to tor-nagios/config/static/objects/contacts.cfg
How to run a nagios check manually on a host (TARGET.tpo)
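# Find the NRPE check for the service (assumes you are in a tor-nagios checkout and the service is defined in config/nagios-master.cfg)
NCHECKFILE=$(grep -E -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
# Look up how that check is defined on the target host
NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
# NCMD is the command that's being run. If it looks sane, run it. With --verbose if you like more output.
ssh -t TARGET.tpo "$NCMD" --verbose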
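To see what the first step extracts, here is a minimal sketch that runs the same filter against a made-up service entry. The entry layout, the plugin path and the function name are assumptions for illustration; the real nagios-master.cfg may be laid out differently.

```shell
# Hypothetical helper: read a config on stdin, take the 4 lines that follow
# the service text, keep the nrpe: line, and strip spaces, pipes and quotes.
extract_nrpe_check() {
    grep -E -A 4 "$1" | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"'
}

# Made-up service entry, not taken from the real configuration.
extract_nrpe_check 'example service' <<'EOF'
services:
  -
    name: example service
    hostgroups: example-hosts
    nrpe: "/usr/lib/nagios/plugins/example-check"
EOF
```

Against the real file this would be called as `extract_nrpe_check 'THE-SERVICE-TEXT-FROM-WEB' < config/nagios-master.cfg`.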
Changing the Nagios configuration
Hosts and services are managed in the config/nagios-master.cfg YAML configuration file, kept in the nagiosadm@nagios.torproject.org:/home/nagiosadm/tor-nagios repository. Make changes with a normal text editor, commit and push:
$EDITOR config/nagios-master.cfg
git commit -a
git push
Carefully watch the output of the git push command! If there is an error, your changes won't show up, even though the commit itself is still accepted.
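Since a failed build does not reject the push, one way to avoid missing it is to pipe the push output through a small filter. This is only a sketch: the helper name is made up, and matching on the word "error" (case-insensitive) is an assumption, since the hook's actual wording may differ.

```shell
# Hypothetical helper: read `git push` output on stdin, echo it back,
# and return non-zero if any line mentions an error.
push_output_ok() {
    log=$(cat)
    printf '%s\n' "$log"
    if printf '%s\n' "$log" | grep -qi 'error'; then
        echo "WARNING: the push output mentions an error, the config may not have been rebuilt" >&2
        return 1
    fi
    return 0
}

# Usage (git relays the remote hook output on stderr, prefixed with "remote:"):
#   git push 2>&1 | push_output_ok
```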
Forcing a rebuild of the configuration
If the Nagios configuration seems out of sync with the YAML config, a rebuild of the configuration can be forced with this command on the Nagios server:
touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config
Alternatively, changing the .cfg file and pushing a new commit should trigger this as well.
Batch jobs
You can run batch commands from the web interface, thanks to Icinga's changes to the UI. There is also a command-line client called icli which can do the same on the Icinga server.
This, for example, will queue recheck jobs on all problem hosts:
icli -z '!o,!A,!S,!D' -a recheck
This will run the dsa-update-apt-status command on all problem hosts:
cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep ^[a-z] | sed 's/$/.torproject.org or/') false" dsa-update-apt-status
It's kind of an awful hack -- take some time to appreciate the quoting required for those ! -- which might not be necessary with later Icinga releases. Icinga 2 has a REST API and its own command line console, which make icli completely obsolete.
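The one-liner is easier to follow when the local half of the pipeline is pulled out into a function. The sketch below only builds the Cumin query from short host names on standard input; the function name is made up, and it assumes icli prints one host name per line starting with a lowercase letter, which is what the grep in the one-liner implies.

```shell
# Hypothetical helper: turn short host names (one per line on stdin) into
# a Cumin query of the form "a.torproject.org or b.torproject.org or false".
hosts_to_cumin_query() {
    {
        grep '^[a-z]' | sed 's/$/.torproject.org or/'
        echo false
    } | tr '\n' ' ' | sed 's/ $//'
}

# Example with made-up host names:
printf 'alpha-01\nbeta-02\n' | hosts_to_cumin_query
echo
```

With this helper, only the ssh part still needs the nested quoting: `cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | hosts_to_cumin_query)" dsa-update-apt-status`.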
Adding a new admin user
When a user needs to be added to the admin group, follow the steps below in the tor-nagios.git repository:
- Create a new contact for the user in config/static/objects/contacts.cfg:
define contact{
contact_name <username>
alias <username>
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,r
service_notification_commands notify-service-by-email
host_notification_commands notify-host-by-email
email <email>+nagios@torproject.org
}
- Add the user to authorized_for_full_command_resolution and authorized_for_configuration_information in config/static/cgi.cfg:
authorized_for_full_command_resolution=user1,foo,bar,<new user>
authorized_for_configuration_information=user1,foo,bar,<new user>
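Both cgi.cfg lines can be updated in one pass. This is a minimal sketch, assuming each setting sits on a single line as shown above; the function name and the temporary-file approach are illustrative and not part of tor-nagios.git.

```shell
# Hypothetical helper: append a user to both authorization lists in a cgi.cfg file.
# Usage: add_cgi_admin <username> <path/to/cgi.cfg>
add_cgi_admin() {
    user=$1
    cfg=$2
    tmp=$(mktemp) || return 1
    status=0
    # Append ",<username>" to the end of the two matching lines only.
    if sed -E "s/^(authorized_for_(full_command_resolution|configuration_information)=.*)\$/\1,${user}/" \
        "$cfg" > "$tmp"; then
        # Write back in place so the file keeps its permissions.
        cat "$tmp" > "$cfg"
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
```

The helper is not idempotent (running it twice adds the user twice) and expects a plain user name without characters that are special to sed, so review the result with git diff before committing.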
Pager playbook
What is this alert anyways?
Say you receive a mysterious alert and you have no idea what it's about. Take, for example, tpo/tpa/team#40795:
09:35:23 <nsa> tor-nagios: [gettor-01] application service - gettor status is CRITICAL: 2: b[AUTHENTICATIONFAILED] Invalid credentials (Failure)
To figure out what triggered this error, follow this procedure:
- log into the Nagios web interface at https://nagios.torproject.org
- find the broken service, for example by listing all unhandled problems
- click on the actual service name to see details
- find the "executed command" field and click on "Command Expander"
- this will show you the "Raw commandline" that Nagios runs to do this check; in this case it is an NRPE check that calls tor_application_service on the other end
- if it's an NRPE check, log on the remote host and run the command; otherwise, the command is run on the Nagios host
In this case, the error can be reproduced with:
root@gettor-01:~# /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'
In this case, it seems like the status file is under the control of the service administrator, who should be contacted for follow-up.
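When reproducing a check by hand, keep in mind that Nagios decides the state from the plugin's exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN, following the usual Nagios plugin convention). The wrapper below is a small sketch that runs any check command and names the resulting state; the function name is made up.

```shell
# Hypothetical wrapper: run a check command, show its output,
# then print the state that its exit status maps to.
run_check() {
    rc=0
    "$@" || rc=$?
    case $rc in
        0) state=OK ;;
        1) state=WARNING ;;
        2) state=CRITICAL ;;
        3) state=UNKNOWN ;;
        *) state="UNKNOWN (non-standard exit status)" ;;
    esac
    echo "exit status $rc: $state"
    return $rc
}

# Usage on the remote host:
#   run_check /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
```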
Reference
Design
Config generation
The Nagios/Icinga configuration gets generated from the config/nagios-master.cfg YAML configuration file stored in the tor-nagios.git repository. The generation works like this:
- operator pushes changes to the git repository on the Nagios server (in /home/nagiosadm/tor-nagios)
- the post-receive hook calls make in the config sub-directory, which calls ./build-nagios to generate the files in ~/tor-nagios/config/generated/
- the hook then calls make install, which:
  - deploys the config file (using rsync) in /etc/icinga/from-git ...
  - pushes the NRPE config to the Puppet server in nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg
  - reloads Icinga
  - and finally mirrors the repository to GitLab (https://gitlab.torproject.org/tpo/tpa/tor-nagios)
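The hook side of this flow can be pictured roughly as below. This is a sketch reconstructed from the steps above, not the actual hook from tor-nagios.git: the run helper, the function name, and the DRY_RUN and CONFIG_DIR variables are assumptions added for illustration.

```shell
# Hypothetical outline of the post-receive steps described above.
CONFIG_DIR=${CONFIG_DIR:-${HOME:-/home/nagiosadm}/tor-nagios/config}

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_and_install() {
    # 1. generate the files under config/generated/ (make calls ./build-nagios)
    run make -C "$CONFIG_DIR" || return 1
    # 2. rsync the config, push the NRPE config to Puppet, reload Icinga, mirror to GitLab
    run make -C "$CONFIG_DIR" install
}

# Preview the steps without touching anything:
( DRY_RUN=1; rebuild_and_install )
```

Stopping after a failed first step mirrors the idea that a broken build should never be installed, although whether the real hook behaves this way is not stated here.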