Nagios/Icinga service for Tor Project infrastructure

NOTE: the Nagios server was retired in 2024. This documentation is
kept for historical reference only; see [TPA-RFC-33](https://gitlab.torproject.org/tpo/tpa/team/-/wikis/policy/tpa-rfc-33-monitoring).

[[_TOC_]]

# How-to

## Getting status updates

- Using a web browser: https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems&sortobject=services&sorttype=1&sortoption=2
- On IRC: /j #tor-nagios
- Over email: Add your email address to `tor-nagios/config/static/objects/contacts.cfg`

## How to run a nagios check manually on a host (TARGET.tpo)

    # Assumption: run from a tor-nagios.git checkout, so the first grep reads the YAML config.
    NCHECKFILE=$(grep -E -A 4 THE-SERVICE-TEXT-FROM-WEB config/nagios-master.cfg | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"')
    NCMD=$(ssh -t TARGET.tpo grep "$NCHECKFILE" /etc/nagios -r)
    # NCMD is the command that's being run. If it looks sane, run it. With --verbose if you like more output.
    ssh -t TARGET.tpo "$NCMD" --verbose
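
To see what the first pipeline extracts, here is a minimal sketch run against a made-up YAML fragment. The layout of the fragment (a `name:` line followed within four lines by an `nrpe:` line), the service name and the plugin path are assumptions for illustration, not copied from the real `nagios-master.cfg`. Both function names are hypothetical.

```shell
# Hypothetical helper wrapping the extraction pipeline.
# $1: service text as shown in the web interface, $2: YAML config file.
extract_nrpe_check() {
    grep -F -A 4 -- "$1" "$2" | grep -E '^ *nrpe:' | cut -d : -f 2 | tr -d ' |"'
}

# Made-up fragment, only to exercise the helper (assumed layout).
write_example_config() {
    cat > "$1" <<'EOF'
services:
  -
    name: example service
    hosts: example-01
    nrpe: "/usr/lib/nagios/plugins/example-check"
EOF
}
```

Note that `cut -d : -f 2` keeps only the text between the first and second colon, so a check command that itself contains a colon would be truncated.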

## Changing the Nagios configuration

Hosts and services are managed in the `config/nagios-master.cfg` YAML
configuration file, kept in the `nagiosadm@nagios.torproject.org:/home/nagiosadm/tor-nagios`
repository. Make changes with a normal text editor, commit and push:

    $EDITOR config/nagios-master.cfg
    git commit -a
    git push

Carefully watch the output of the `git push` command! If there is an
error, your changes won't take effect, even though the commit is still accepted.
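
Since a failed build does not reject the push, it can help to keep the push output and scan it afterwards. This is a minimal sketch: `push_output_has_error` is a hypothetical helper, and it assumes that failures reported by the remote hook contain the word "error" in some capitalization. The actual wording of the hook output is not documented here.

```shell
# Hypothetical helper: exit 0 if the saved push output mentions an error.
# Assumption: hook failures contain the word "error" (any case).
push_output_has_error() {
    grep -qi 'error' "$1"
}

# Example use, from a tor-nagios.git checkout:
#   git push 2>&1 | tee /tmp/push.log
#   push_output_has_error /tmp/push.log && echo "build failed, read the output above"
```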

## Forcing a rebuild of the configuration

If the Nagios configuration seems out of sync with the YAML config, a
rebuild of the configuration can be forced with this command on the
Nagios server:

    touch /home/nagiosadm/tor-nagios/config/nagios-master.cfg && sudo -u nagiosadm make -C /home/nagiosadm/tor-nagios/config

Alternatively, changing the `.cfg` file and pushing a new commit
should trigger this as well.
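
The `touch` works because `make` compares modification times: a target is rebuilt when a prerequisite is newer than it. The sketch below expresses that same comparison as a hypothetical `is_stale` helper. Which generated file to compare against is left as a placeholder, since the file names under `config/generated/` are not listed here.

```shell
# Hypothetical helper: exit 0 if the target is missing or older than the source,
# which is the condition under which make would rebuild the target.
is_stale() {
    src=$1
    target=$2
    [ ! -e "$target" ] || [ "$src" -nt "$target" ]
}

# Example use on the Nagios server (GENERATED-FILE is a placeholder):
#   cd /home/nagiosadm/tor-nagios/config
#   is_stale nagios-master.cfg generated/GENERATED-FILE && echo "rebuild needed"
```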

## Batch jobs

You can run batch commands from the web interface, thanks to Icinga's
changes to the UI. There is also a command-line client called
[icli](https://tracker.debian.org/pkg/icli), which can do the same from a shell on the Icinga
server.

This, for example, will queue recheck jobs on all problem hosts:

    icli -z '!o,!A,!S,!D' -a recheck

This will run the `dsa-update-apt-status` command on all problem
hosts:

    cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | grep ^[a-z] | sed 's/$/.torproject.org or/') false" dsa-update-apt-status

It's kind of an awful hack -- take some time to appreciate the quoting
required for those `!` -- which might not be necessary with later
Icinga releases. Icinga 2 has a [REST API](https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/) and its own [command
line console](https://icinga.com/docs/icinga-2/latest/doc/11-cli-commands/#cli-command-console), which together make `icli` completely obsolete.
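
The query-building half of that one-liner can be separated from the quoting. Here is a minimal sketch, assuming (as the `grep` in the original suggests) that host names are the lines starting with a lowercase letter. `build_cumin_query` is a hypothetical name, and `example-02` is a made-up host.

```shell
# Hypothetical helper: read one short host name per line on stdin and print
# a cumin query of the form "a.torproject.org or b.torproject.org or false".
# The trailing "false" absorbs the dangling "or" after the last host.
build_cumin_query() {
    grep '^[a-z]' | sed 's/$/.torproject.org or/' | tr '\n' ' '
    printf 'false\n'
}

# Example use (the icli invocation is the one from the text above):
#   cumin "$(ssh hetzner-hel1-01.torproject.org "icli -z'"'!o,!A,!S,!D'"'" | build_cumin_query)" dsa-update-apt-status
```

Running the `sed` locally instead of inside the `ssh` command does not remove the need to quote the `!` characters, but it keeps the remote command short.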

## Adding a new admin user

When a user needs to be added to the admin group, follow the steps below in the `tor-nagios.git` repository:

1. Create a new contact for the user in `config/static/objects/contacts.cfg`:

```
define contact{
       contact_name                    <username>
       alias                           <username>
       service_notification_period     24x7
       host_notification_period        24x7
       service_notification_options    w,u,c,r
       host_notification_options       d,r
       service_notification_commands   notify-service-by-email
       host_notification_commands      notify-host-by-email
       email                           <email>+nagios@torproject.org
       }
```

2. Add the user to `authorized_for_full_command_resolution` and `authorized_for_configuration_information` in `config/static/cgi.cfg`:

```
authorized_for_full_command_resolution=user1,foo,bar,<new user>
authorized_for_configuration_information=user1,foo,bar,<new user>
```
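
Step 2 can also be scripted. This is a minimal sketch with a hypothetical `add_cgi_admin` helper. It assumes each of the two settings sits on a single line and that the user name contains only plain characters (no `/`, `&` or `\`). It is not idempotent, so running it twice adds the user twice.

```shell
# Hypothetical helper: append ",USER" to the two authorized_for_* lines.
# $1: path to cgi.cfg, $2: user name to add.
add_cgi_admin() {
    cfg=$1
    user=$2
    tmp=$(mktemp) || return 1
    # Rewrite into a temporary file, then copy back to keep the file's permissions.
    sed -E "s/^(authorized_for_(full_command_resolution|configuration_information)=.*)\$/\\1,${user}/" "$cfg" > "$tmp" &&
        cat "$tmp" > "$cfg"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example use, from a tor-nagios.git checkout:
#   add_cgi_admin config/static/cgi.cfg newuser && git diff config/static/cgi.cfg
```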

## Pager playbook

### What is this alert anyways?

Say you receive a mysterious alert and you have no idea what it's
about. Take, for example, [tpo/tpa/team#40795](https://gitlab.torproject.org/tpo/tpa/team/-/issues/40795):

    09:35:23 <nsa> tor-nagios: [gettor-01] application service - gettor status is CRITICAL: 2: b[AUTHENTICATIONFAILED] Invalid credentials (Failure)

To figure out what triggered this error, follow this procedure:

 1. log into the Nagios web interface at https://nagios.torproject.org

 2. find the broken service, for example by listing all [unhandled
    problems](https://nagios.torproject.org/cgi-bin/icinga/status.cgi?allunhandledproblems)

 3. click on the actual service name to see details

 4. find the "executed command" field and click on "Command Expander"

 5. this will show you the "Raw commandline" that Nagios runs to do
    this check; in this case, it is an NRPE check that calls
    `tor_application_service` on the other end

 6. if it's an NRPE check, log on to the remote host and run the command
    there; otherwise, the command is run on the Nagios host

In this case, the error can be reproduced with:

    root@gettor-01:~# /usr/lib/nagios/plugins/dsa-check-statusfile /srv/gettor.torproject.org/check/status
    2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'

Here, the status file seems to be under the control of
the service administrator, who should be contacted for follow-up.
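
The leading `2` in that output lines up with the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). This is a minimal sketch for decoding such a line. It assumes the text before the first colon is the numeric state, which is inferred from the example above rather than from the plugin's documentation. Both function names are hypothetical.

```shell
# Map a Nagios plugin return code to its state name.
# Anything outside 0-2 is reported as UNKNOWN (3 is the standard UNKNOWN code).
nagios_state() {
    case $1 in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Assumption: a status line looks like "CODE: message".
status_line_state() {
    nagios_state "${1%%:*}"
}

# Example use:
#   status_line_state "2: b'[AUTHENTICATIONFAILED] Invalid credentials (Failure)'"
```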

# Reference

## Design

### Config generation

The Nagios/Icinga configuration gets generated from the
`config/nagios-master.cfg` YAML configuration file stored in the
`tor-nagios.git` repository. The generation works like this:

 1. operator pushes changes to the git repository on the Nagios server
    (in `/home/nagiosadm/tor-nagios`)
    
 2. the `post-receive` hook calls `make` in the `config` sub-directory,
    which calls `./build-nagios` to generate the files in
    `~/tor-nagios/config/generated/`

 3. the hook then calls `make install`, which:
 
 4. deploys the config file (using `rsync`) in
    `/etc/icinga/from-git`...
 
 5. pushes the NRPE config to the [Puppet server](puppet) in
    `nagiospush@pauli.torproject.org:/etc/puppet/modules/nagios/files/tor-nagios/generated/nrpe_tor.cfg`

 6. reloads Icinga

 7. and finally mirrors the repository to GitLab
    (<https://gitlab.torproject.org/tpo/tpa/tor-nagios>)
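
The flow above can be sketched as a dry-run shell function. This is not the actual `post-receive` hook: `run` and `deploy_config` are hypothetical names, and the sketch assumes `make install` is invoked in the same `config` sub-directory as `make`. The `rsync`, NRPE push, reload and mirroring steps are left as comments because their exact commands are not shown here.

```shell
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical outline of what the hook does after a push.
deploy_config() {
    repo=${REPO:-/home/nagiosadm/tor-nagios}
    # Step 2: build the generated files via ./build-nagios.
    run make -C "$repo/config"
    # Steps 3-6: rsync the config, push nrpe_tor.cfg to Puppet, reload Icinga.
    run make -C "$repo/config" install
    # Step 7: the repository is then mirrored to GitLab (command not shown here).
}

# Example use, printing the commands without running them:
#   DRY_RUN=1 deploy_config
```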