Nagios/Icinga service for Tor Project infrastructure
NOTE: the Nagios server was retired in 2024. This documentation is
kept for historical reference only; see [TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring).
## Getting status updates
- Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
- On IRC: `/j #tor-nagios`
- Over email: Add your email address to `tor-nagios/config/static/objects/contacts.cfg`
## How to run a Nagios check manually on a host (TARGET.tpo)

    # Run from a tor-nagios checkout; config/nagios-master.cfg is assumed to be the file to search.
    NCHECKFILE=$(egrep -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | egrep '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
    NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
    # NCMD is the command that is being run. If it looks sane, run it. Add --verbose if you would like more output.
    ssh -t TARGET.tpo "$NCMD" --verbose

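
The first pipeline above can be tried offline before touching a real host. The following is a minimal sketch, assuming the service entry lists an `nrpe:` key within four lines of the matched text; the function name `nrpe_check_name` and the sample entry are hypothetical, and the real `nagios-master.cfg` layout may differ.

```shell
# Sketch: extract the NRPE check name for a service from a YAML file.
# Usage: nrpe_check_name SERVICE-TEXT CONFIG-FILE
nrpe_check_name() {
    grep -E -A 4 -- "$1" "$2" | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"'
}

# Hypothetical sample entry, not taken from the real configuration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
  -
    name: "example service - foo status"
    hosts: example-01
    nrpe: "/usr/lib/nagios/plugins/example-check"
EOF

nrpe_check_name 'example service - foo status' "$sample"
rm -f "$sample"
```

Note that `tr -d ' |"'` also strips spaces, so a value containing arguments would be collapsed into a single word.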
Hosts and services are managed in the `config/nagios-master.cfg` YAML
configuration file, kept in the `nagiosadm@nagios.torproject.org:/home/nagiosadm/tor-nagios`
repository. Make changes with a normal text editor, commit and push:

    $EDITOR config/nagios-master.cfg
    git commit -a
    git push

Carefully watch the output of the `git push` command! If there is an
error, your changes won't take effect, even though the commit is still accepted.
## Forcing a rebuild of the configuration
If the Nagios configuration seems out of sync with the YAML config, a
rebuild of the configuration can be forced with this command on the
Nagios server:

    touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config

Alternatively, changing the `.cfg` file and pushing a new commit
should trigger this as well.
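
The `touch` matters because `make` regenerates a target only when one of its prerequisites has a newer modification time. The sketch below illustrates that comparison with the shell's `-nt` test on throwaway files. It does not run the real Makefile, and the function name `needs_rebuild` and the file names are hypothetical.

```shell
# Sketch: why `touch` forces a rebuild. make regenerates a target when a
# prerequisite is newer than it; `-nt` performs the same timestamp comparison.
needs_rebuild() {
    # $1 = source (prerequisite), $2 = generated target
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

dir=$(mktemp -d)
touch -t 202001010000 "$dir/nagios-master.cfg"   # old source
touch -t 202001020000 "$dir/generated.cfg"       # newer generated file

needs_rebuild "$dir/nagios-master.cfg" "$dir/generated.cfg" && echo "rebuild" || echo "up to date"

touch "$dir/nagios-master.cfg"                   # source is now the newest file
needs_rebuild "$dir/nagios-master.cfg" "$dir/generated.cfg" && echo "rebuild" || echo "up to date"
rm -rf "$dir"
```

The first check reports that nothing needs to be done; after the `touch`, the same check asks for a rebuild.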
## Batch jobs
You can run batch commands from the web interface, thanks to Icinga's
changes to the UI. There is also a command-line client called
[icli](https://tracker.debian.org/pkg/icli), which can do the same from a shell on the Icinga
server.

This, for example, will queue recheck jobs on all problem hosts:

    icli -z '!o,!A,!S,!D' -a recheck

This will run the `dsa-update-apt-status` command on all problem
hosts:

    cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep ^[a-z] | sed 's/$/.torproject.org or/') false" dsa-update-apt-status

It's kind of an awful hack -- take some time to appreciate the quoting
required for those `!` -- which might not be necessary with later
Icinga releases. Icinga 2 has a [REST API](https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/) and its own [command
line console](https://icinga.com/docs/icinga-2/latest/doc/11-cli-commands/#cli-command-console), which make `icli` completely obsolete.
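
To make the `cumin` one-liner easier to follow, here is a sketch of just the query-building step: keep the lines that start with a lowercase letter, append the domain and an `or` to each, then close the dangling `or` with `false`. The function name `hosts_to_cumin_query` and the sample input are hypothetical, and the shape of the `icli` output is an assumption inferred from the `grep ^[a-z]` filter.

```shell
# Sketch: turn a list of short host names into a Cumin host query.
# Reads icli-style output on stdin and keeps only lines starting with a
# lowercase letter (assumed to be host names), as the one-liner does.
hosts_to_cumin_query() {
    query=$(grep '^[a-z]' | sed 's/$/.torproject.org or/' | tr '\n' ' ')
    # The trailing "false" terminates the dangling final "or".
    printf '%sfalse\n' "$query"
}

# Hypothetical icli output: host lines at column 0, indented service lines.
printf 'web-01\n    apt status  WARNING\ndb-02\n    apt status  CRITICAL\n' | hosts_to_cumin_query
```

Building the query in a separate step lets you inspect the host list before handing it to `cumin`.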
## Adding a new admin user
When a user needs to be added to the admin group, follow the steps below in the `tor-nagios.git` repository:
1. Create a new contact for the user in `config/static/objects/contacts.cfg`:
```
define contact{
    contact_name <username>
    alias <username>
    service_notification_period 24x7
    host_notification_period 24x7
    service_notification_options w,u,c,r
    host_notification_options d,r
    service_notification_commands notify-service-by-email
    host_notification_commands notify-host-by-email
    email <email>+nagios@torproject.org
}
```
2. Add the user to `authorized_for_full_command_resolution` and `authorized_for_configuration_information` in `config/static/cgi.cfg`:
```
authorized_for_full_command_resolution=user1,foo,bar,<new user>
authorized_for_configuration_information=user1,foo,bar,<new user>
```
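
The second step can also be scripted. The following is a minimal sketch, assuming both settings sit on a single `key=value` line each, as in the excerpt above; the helper name `add_cgi_admin` is hypothetical and not part of the repository. It appends the user only when not already listed, so it is safe to run twice.

```shell
# Sketch: append a user to the two authorization lists in a cgi.cfg file.
# Usage: add_cgi_admin USERNAME CGI_CFG
add_cgi_admin() {
    cgi_tmp=$(mktemp)
    # Append ",USERNAME" to each matching line unless the user is already listed.
    awk -v user="$1" -F= '
        $1 == "authorized_for_full_command_resolution" ||
        $1 == "authorized_for_configuration_information" {
            n = split($2, members, ",")
            found = 0
            for (i = 1; i <= n; i++) if (members[i] == user) found = 1
            if (!found) $0 = $0 ($2 == "" ? "" : ",") user
        }
        { print }
    ' "$2" > "$cgi_tmp" && cat "$cgi_tmp" > "$2"
    rm -f "$cgi_tmp"
}

# Demonstration on a throwaway copy, not the real config/static/cgi.cfg.
demo=$(mktemp)
printf 'authorized_for_full_command_resolution=user1,foo,bar\nauthorized_for_configuration_information=user1,foo,bar\n' > "$demo"
add_cgi_admin alice "$demo"
cat "$demo"
rm -f "$demo"
```

Review the resulting diff before committing, since a mistake in `cgi.cfg` affects who can see configuration details in the web interface.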
## Pager playbook
### What is this alert anyways?
Say you receive a mysterious alert and you have no idea what it's
about. Take, for example, [tpo/tpa/team#40795](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40795):

    09:35:23 <nsa> tor-nagios: [gettor-01] application service - gettor status is CRITICAL: 2: b[AUTHENTICATIONFAILED] Invalid credentials (Failure)

To figure out what triggered this error, follow this procedure:
1. log into the Nagios web interface at https://nagios.torproject.org
2. find the broken service, for example by listing all [unhandled
problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems)
3. click on the actual service name to see details
4. find the "executed command" field and click on "Command Expander"
5. this will show you the "Raw commandline" that Nagios runs for
this check; in this case it is an NRPE check that calls
`tor_application_service` on the other end
6. if it's an NRPE check, log on to the remote host and run the command;
otherwise, the command is run on the Nagios host

In this case, the error can be reproduced with:

    root@gettor-01:~# /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
    2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'

Here, the status file appears to be under the control of
the service administrator, who should be contacted for follow-up.
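
When reading such output, it helps to know the standard Nagios plugin return codes: 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The sketch below maps a code to its name. It assumes that the number before the first colon in the check output is the state code, which matches the CRITICAL alert above but is not confirmed by the plugin's documentation; the function name `nagios_state_name` is hypothetical.

```shell
# Sketch: map a Nagios plugin state code to its name.
# 0 to 3 are the standard Nagios plugin return codes.
nagios_state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "INVALID($1)" ;;
    esac
}

# Assumption: the text before the first colon is the state code.
output="2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'"
nagios_state_name "${output%%:*}"
```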
# Reference
## Design
### Config generation
The Nagios/Icinga configuration gets generated from the
`config/nagios-master.cfg` YAML configuration file stored in the
`tor-nagios.git` repository. The generation works like this:
1. operator pushes changes to the git repository on the Nagios server
(in `/home/nagiosadm/tor-nagios`)
2. the `post-receive` hook calls `make` in the `config` sub-directory,
which calls `./build-nagios` to generate the files in
`~/tor-nagios/config/generated/`
3. the hook then calls `make install`, which:
4. deploys the config file (using `rsync`) in
`/etc/icinga/from-git`...
5. pushes the NRPE config to the [Puppet server](puppet) in
`nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg`