This is a TSA host, so it already has a bunch of ping and NRPE checks. Application-specific checks mostly look at the index file:
- That there is an index file that parses, and:
  - it was recently updated
  - it contains a recent run for:
    - bridge descriptors
    - relay descriptors
    - exit lists
The old check uses bushel's CollecTor index parser, but we could equally hack up a single Python script that works on the JSON at a lower level. In the end it would look a lot like the Onionoo plugin on the TSA Nagios.
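A low-level script along those lines might look like the sketch below. It is not the existing check. It assumes the index JSON has an `index_created` field and nested `directories` and `files` entries with `path` and `last_modified` values formatted as `YYYY-MM-DD HH:MM` in UTC. The `recent/...` directory names and the three-hour threshold are also assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed directory names and timestamp format; adjust to the real index.
REQUIRED = (
    "recent/bridge-descriptors",
    "recent/relay-descriptors",
    "recent/exit-lists",
)
TIME_FORMAT = "%Y-%m-%d %H:%M"


def parse_time(text):
    """Parse an index timestamp and mark it as UTC."""
    return datetime.strptime(text, TIME_FORMAT).replace(tzinfo=timezone.utc)


def walk_files(node, prefix=""):
    """Yield (full_path, last_modified) for every file below node."""
    for entry in node.get("files", []):
        yield prefix + entry["path"], parse_time(entry["last_modified"])
    for directory in node.get("directories", []):
        yield from walk_files(directory, prefix + directory["path"] + "/")


def check_index(index, now, max_age=timedelta(hours=3)):
    """Return (nagios_status, message) for an already parsed index document."""
    try:
        created = parse_time(index["index_created"])
        files = list(walk_files(index))
    except (KeyError, ValueError, TypeError, AttributeError) as exc:
        return "CRITICAL", f"index does not parse: {exc!r}"
    if now - created > max_age:
        return "CRITICAL", f"index last created {index['index_created']}"
    for prefix in REQUIRED:
        times = [t for path, t in files if path.startswith(prefix + "/")]
        if not times or now - max(times) > max_age:
            return "CRITICAL", f"no recent file under {prefix}"
    return "OK", "index is fresh and all descriptor types are recent"
```

A wrapper would fetch the index, pass it through `json.loads`, call `check_index(index, datetime.now(timezone.utc))`, and map the status string to the matching Nagios exit code.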
We have a Python script that runs on the TSA Nagios to check Onionoo.
This would be a quick win for someone with some time. I had started extending it to check a relay's status (with a relay ops hat on):
- Onionoo is unhappy => UNKNOWN (because we're monitoring the relay not Onionoo)
- Tor version number not recommended => WARN
- Last changed address recently => WARN
- BadExit flag is present => WARN
- Not running => CRIT
- Rate of change of consensus weight is large => WARN
- Rate of change of bandwidth usage is large => WARN
- Otherwise => OK
If it's OK, output the current set of flags alphabetically sorted (or at least consistently sorted) and include the current consensus weight and bandwidth values in Nagios performance data format.
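The mapping above could be sketched roughly as follows. This is not the script that runs on the TSA Nagios. It assumes the input is an already parsed Onionoo details response and uses the `running`, `flags`, `recommended_version`, `consensus_weight`, `observed_bandwidth` and `last_changed_address_or_port` fields. The `previous_weight` parameter and both thresholds are hypothetical, `observed_bandwidth` stands in for bandwidth usage, and the bandwidth rate-of-change check is left out because it would need the history documents.

```python
from datetime import datetime, timedelta, timezone


def check_relay(details, now, previous_weight=None,
                address_age=timedelta(days=1), max_change=0.5):
    """Map one Onionoo details response to (nagios_status, message)."""
    relays = (details or {}).get("relays") or []
    if not relays:
        # We are monitoring the relay, not Onionoo, so this is not CRITICAL.
        return "UNKNOWN", "no usable answer from Onionoo"
    relay = relays[0]
    flags = sorted(relay.get("flags", []))
    weight = relay.get("consensus_weight", 0)
    bandwidth = relay.get("observed_bandwidth", 0)
    perfdata = f"consensus_weight={weight} observed_bandwidth={bandwidth}"

    # The worst state wins, so check for CRITICAL before any warnings.
    if not relay.get("running", False):
        return "CRITICAL", "relay is not running | " + perfdata

    warnings = []
    if relay.get("recommended_version") is False:
        warnings.append("Tor version not recommended")
    if "BadExit" in flags:
        warnings.append("BadExit flag is present")
    changed = relay.get("last_changed_address_or_port")
    if changed:
        changed_at = datetime.strptime(changed, "%Y-%m-%d %H:%M:%S")
        if now - changed_at.replace(tzinfo=timezone.utc) < address_age:
            warnings.append("address changed recently")
    if previous_weight:
        # Hypothetical: previous_weight comes from an earlier run of the check.
        if abs(weight - previous_weight) / previous_weight > max_change:
            warnings.append("consensus weight changed quickly")

    if warnings:
        return "WARNING", "; ".join(warnings) + " | " + perfdata
    return "OK", " ".join(flags) + " | " + perfdata
```

Checking CRITICAL before the warnings differs from the order of the list, but it keeps a stopped relay from being reported as a mere warning when several conditions hold at once.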
The primary issue with OnionPerf hosts is that they run out of disk space. A decent set of ping and NRPE checks should cover most of the common issues we've had.
Application-specific checks would include:
- that a file is available in the webserver root for the last analysis run
- that there is something listening on the tgen connect port
  - also on the onion service
- that the HTTPS certificate is valid and not about to expire (on port 8443)
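The port and certificate checks above could be sketched with the Python standard library as below. This is only an illustration: the function names and timeouts are made up here, the tgen port number is left to the caller, and the onion service check is not covered because it would need a connection through a Tor SOCKS proxy.

```python
import socket
import ssl
import time


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's 'notAfter' time string."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(host, port=8443, timeout=5.0):
    """Fetch the 'notAfter' string of the certificate served on host:port.

    The default context verifies the chain and hostname, so an invalid
    certificate raises ssl.SSLError here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A plugin built on this might return CRITICAL when `port_is_open` is false or `fetch_not_after` raises, and WARNING when `days_until_expiry` drops below a threshold such as 14 days (an arbitrary value).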