Nagios/Icinga service for Tor Project infrastructure

[[_TOC_]]

# How-to

## Getting status updates

- Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
- On IRC: /j #tor-nagios
- Over email: Add your email address to `tor-nagios/config/static/objects/contacts.cfg`

## How to run a nagios check manually on a host (TARGET.tpo)

    NCHECKFILE=$(grep -E -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
    NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
    # NCMD is the command that's being run. If it looks sane, run it, adding --verbose if you want more output.
    ssh -t TARGET.tpo "$NCMD" --verbose
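
The lookup in the first line of that recipe can be wrapped in a small function, which makes it easier to try against a local checkout before touching the target host. This is only a sketch: the function name is hypothetical, and it assumes each service's `nrpe:` key sits within four lines of the matching text in the YAML, as the `-A 4` in the recipe implies.

```shell
# Hypothetical helper: print the NRPE check configured for a service.
# $1: text identifying the service (as shown in the web interface)
# $2: path to the YAML file (config/nagios-master.cfg in a checkout)
nrpe_check_for() {
    grep -E -A 4 -- "$1" "$2" \
        | grep -E '^ *nrpe:' \
        | cut -d : -f 2 \
        | tr -d ' |"'
}
```

It would be used as `NCHECKFILE=$(nrpe_check_for 'THE-SERVICE-TEXT-FROM-WEB' config/nagios-master.cfg)`.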

## Changing the Nagios configuration

Hosts and services are managed in the `config/nagios-master.cfg` YAML
configuration file, kept in the `admin/tor-nagios.git`
repository. Make changes with a normal text editor, commit and push:

    $EDITOR config/nagios-master.cfg
    git commit -a
    git push

Carefully watch the output of the `git push` command! If it reports an
error, the commit is still accepted but your changes won't take effect.
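
One way to avoid missing a failed build is to keep a copy of the push output and scan it afterwards. The following is a minimal sketch: the function names are hypothetical, and it assumes failures on the server reach you as `remote:` lines containing "error" or "fail", which may not match what the real hook prints.

```shell
# Hypothetical helper: succeed if a saved push log contains a remote error line.
push_log_has_errors() {
    grep -Eiq '^remote:.*(error|fail)' "$1"
}

# Push, show the output as usual, and warn if the remote side reported a problem.
push_and_check() {
    log=$(mktemp)
    git push 2>&1 | tee "$log"
    if push_log_has_errors "$log"; then
        echo "WARNING: the remote reported an error; the config was probably not rebuilt" >&2
        rm -f "$log"
        return 1
    fi
    rm -f "$log"
}
```

Running `push_and_check` instead of `git push` keeps the normal output visible; it is a convenience, not a substitute for reading that output.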

## Forcing a rebuild of the configuration

If the Nagios configuration seems out of sync with the YAML config, a
rebuild of the configuration can be forced with this command on the
Nagios server:

    touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config

Alternatively, changing the `.cfg` file and pushing a new commit
should trigger this as well.
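
To judge whether a rebuild is needed, a rough heuristic is to compare modification times on the Nagios server. The sketch below uses a hypothetical function name and assumes that a successful build rewrites at least one file under the `generated/` directory; it only looks at timestamps, not at file contents.

```shell
# Hypothetical helper: succeed if no generated file is newer than the YAML file.
# $1: path to nagios-master.cfg
# $2: path to the directory holding the generated files
config_is_stale() {
    yaml=$1
    generated_dir=$2
    newer=$(find "$generated_dir" -type f -newer "$yaml" 2>/dev/null | head -n 1)
    [ -z "$newer" ]
}
```

For example, `config_is_stale ~/tor-nagios/config/nagios-master.cfg ~/tor-nagios/config/generated && echo "rebuild needed"`.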

## Batch jobs

You can run batch commands from the web interface, thanks to Icinga's
changes to the UI. There is also a command-line client called
[icli](https://tracker.debian.org/pkg/icli), which can do the same from a shell on the Icinga
server.

This, for example, will queue recheck jobs on all problem hosts:

    icli -z '!o,!A,!S,!D' -a recheck

This will run the `dsa-update-apt-status` command on all problem
hosts:

    cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep ^[a-z] | sed 's/$/.torproject.org or/') false" dsa-update-apt-status

It's kind of an awful hack (take some time to appreciate the quoting
required for those `!`), which might not be necessary with later
Icinga releases. Icinga 2 has a [REST API](https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/) and its own [command
line console](https://icinga.com/docs/icinga-2/latest/doc/11-cli-commands/#cli-command-console), which make `icli` completely obsolete.

## Adding a new admin user

When a user needs to be added to the admin group, follow the steps below in the `tor-nagios.git` repository:

1. Create a new contact for the user in `config/static/objects/contacts.cfg`:

```
define contact{
       contact_name                    <username>
       alias                           <username>
       service_notification_period     24x7
       host_notification_period        24x7
       service_notification_options    w,u,c,r
       host_notification_options       d,r
       service_notification_commands   notify-service-by-email
       host_notification_commands      notify-host-by-email
       email                           <email>+nagios@torproject.org
       }
```

2. Add the user to `authorized_for_full_command_resolution` and `authorized_for_configuration_information` in `config/static/cgi.cfg`:

```
authorized_for_full_command_resolution=user1,foo,bar,<new user>
authorized_for_configuration_information=user1,foo,bar,<new user>
```

# Reference

## Design

### Config generation

The Nagios/Icinga configuration gets generated from the
`config/nagios-master.cfg` YAML configuration file stored in the
`admin/tor-nagios.git` repository. The generation works like this:

 1. the [git server](git) has a post-receive hook (in
    `/srv/git.torproject.org/git-helpers/post-receive-per-repo.d/admin%tor-nagios/trigger-nagios-build`)

 2. ... which launches a "trigger" on the Nagios server, like so:

        ssh -i ~/.ssh/gitweb -l nagiosadm hetzner-hel1-01 -- -trigger-

 3. that SSH key, deployed from Puppet (so in
    `/etc/ssh/puppetkeys/nagiosadm`), calls the
    `/home/nagiosadm/bin/from-git-rw` script, which then...

 4. creates or updates (`git clone` or `git pull`) the git repository
    in `~/tor-nagios/config`...
    
 5. then calls `make` in the directory, which calls `./build-nagios`
    to generate the files in `~/tor-nagios/config/generated/`

 6. then calls `make install` in the `config` directory, which deploys
    the config file (using `rsync`) in `/etc/icinga/from-git` and also
    pushes the NRPE config to the [Puppet server](puppet) in
    `nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg`

 7. then finally reloads Icinga
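
The server-side half of the steps above (update the checkout, build, install, reload) can be pictured roughly as follows. This is an illustrative sketch, not the contents of `from-git-rw`: the function names, the `DRY_RUN` switch, the assumption that the checkout's top level is the parent of the `config` directory, and the `service icinga reload` command are all assumptions.

```shell
# Hypothetical sketch of the server-side rebuild; NOT the real from-git-rw script.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1: URL of the tor-nagios repository (assumption: supplied by the caller)
# $2: checkout directory, e.g. "$HOME/tor-nagios"
rebuild_nagios_config() {
    repo_url=$1
    checkout=$2

    # Create or update the checkout.
    if [ -d "$checkout/.git" ]; then
        run git -C "$checkout" pull || return 1
    else
        run git clone "$repo_url" "$checkout" || return 1
    fi

    # Build the generated files, then deploy them.
    run make -C "$checkout/config" || return 1
    run make -C "$checkout/config" install || return 1

    # Assumption: the reload goes through the init system; the real command may differ.
    run service icinga reload
}
```

With `DRY_RUN=1` set, calling `rebuild_nagios_config` only prints the commands it would run, which is a safe way to read through the sequence.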