We currently have a metrics-specific Nagios host that we want to shut down soon. One of its checks is to see whether CollecTor's files are becoming unavailable or stale. This check is not easily transferable to Tor's Nagios host, because it depends on a code base that is not being maintained anymore and that we do not want to deploy on Tor's Nagios host. That's why I rewrote this check as a simple Python script to be deployed on Tor's Nagios instance.
Questions:
anarcat and/or weasel: do you have any concerns about deploying this check in Tor's Nagios host alongside the Onionoo check?
irl: do you spot any checks in this Python script that are way off, or other checks that are missing?
atagar, other Python people: do you mind reviewing the Python code for general code improvements? The goal is to have a single, self-contained, easy-to-read Python script that produces just the data we need for Nagios to send out alerts.
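To make the idea of the check concrete for reviewers, here is a minimal sketch of a staleness check. It is not the attached script. It assumes that index.json is a nested tree of `directories` and `files` entries and that each file carries a `last_modified` timestamp like `2020-04-17 12:02` in UTC; the directory name, the thresholds, and the function names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def newest_per_directory(node, prefix=""):
    """Map each directory path to the newest last_modified of its files.

    The "files"/"directories"/"last_modified" layout is an assumption
    about the index format, not taken from the attached script.
    """
    newest = {}
    for entry in node.get("files", []):
        ts = datetime.strptime(entry["last_modified"], "%Y-%m-%d %H:%M")
        ts = ts.replace(tzinfo=timezone.utc)
        if prefix not in newest or ts > newest[prefix]:
            newest[prefix] = ts
    for directory in node.get("directories", []):
        child = f"{prefix}/{directory['path']}" if prefix else directory["path"]
        newest.update(newest_per_directory(directory, child))
    return newest


def check_staleness(index, expected, now):
    """Return (exit_code, message) for a dict of {directory: max_age}."""
    newest = newest_per_directory(index)
    problems = []
    for directory, max_age in sorted(expected.items()):
        if directory not in newest:
            problems.append(f"{directory}: no files found")
        elif now - newest[directory] > max_age:
            problems.append(f"{directory}: newest file is older than {max_age}")
    if problems:
        return CRITICAL, "CRITICAL: " + "; ".join(problems)
    return OK, f"OK: {len(expected)} directories are fresh"


# Synthetic index used only for illustration.
index = {"directories": [{"path": "recent", "directories": [
    {"path": "exit-lists", "files": [
        {"path": "2020-04-17-12-02-00", "last_modified": "2020-04-17 12:02"}]}]}]}
now = datetime(2020, 4, 17, 18, 0, tzinfo=timezone.utc)
print(check_staleness(index, {"recent/exit-lists": timedelta(hours=3)}, now))
```

Passing `now` in explicitly keeps the comparison easy to test without touching the network or the system clock.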
Those checks are more comprehensive than the ones I had; I wasn't checking archive paths. I didn't check it in detail, but it looks good to me. I assume you got all the comparison operators the right way round, etc.
anarcat and/or weasel: do you have any concerns about deploying this check in Tor's Nagios host alongside the Onionoo check?
I reviewed the code quickly, and it looks reasonable. Assuming performance is acceptable, this should be fine.
irl: do you spot any checks in this Python script that are way off, or other checks that are missing?
atagar, other Python people: do you mind reviewing the Python code for general code improvements? The goal is to have a single, self-contained, easy-to-read Python script that produces just the data we need for Nagios to send out alerts.
I would add to that "runs fast". The way Nagios schedules checks makes it suffer if there's a check that takes too long. Think "open TCP port" instead of "make a full HTTP request that downloads a 3MB file" or "... renders a complex report". :) We have some leeway of course, but if it can be optimized, it's a definite plus.
I would also mention there's a "nagiosplugin" python module that could be used instead of rolling our own behavior.
It might be overkill for this simple plugin, but could be useful if you want to actually send metrics like age and so on and have them processable on the other side (which we don't currently do, mind you).
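For context, here is a rough sketch of the plain-text convention that a plugin follows when it rolls its own output: a status line, optionally followed by `|` and performance data in the form `label=value[UOM];warn;crit`. This is not the nagiosplugin API, and the metric name and thresholds below are made up for illustration.

```python
# Nagios plugin exit codes and their textual prefixes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def format_result(code, text, perfdata=None):
    """Build the single output line a Nagios plugin prints.

    perfdata maps a label to (value, unit, warn, crit). Labels that
    contain spaces would need quoting, which this sketch does not handle.
    """
    line = f"{STATES[code]}: {text}"
    if perfdata:
        metrics = " ".join(
            f"{label}={value}{unit};{warn};{crit}"
            for label, (value, unit, warn, crit) in sorted(perfdata.items())
        )
        line += " | " + metrics
    return line


# Hypothetical metric: age in seconds of the oldest expected file.
print(format_result(0, "all descriptors fresh",
                    {"oldest_age": (5400, "s", 10800, 21600)}))
```

The plugin would then exit with `code` so that Nagios picks up the state.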
anarcat and/or weasel: do you have any concerns about deploying this check in Tor's Nagios host alongside the Onionoo check?
I reviewed the code quickly, and it looks reasonable. Assuming performance is acceptable, this should be fine.
The script runs in under a second here, where most of the time is spent on downloading the 1 MiB index.json file.
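Since the download dominates the run time, one way to keep a slow server from stalling the Nagios scheduler is an explicit timeout on the fetch, reporting UNKNOWN when it fails. The following is only a sketch under that assumption; the function name, the default timeout, and the injectable `opener` parameter are illustrative and not part of the attached script.

```python
import io
import json
import urllib.request

UNKNOWN = 3  # Nagios exit code for "could not determine state"


def fetch_index(url, timeout=10, opener=urllib.request.urlopen):
    """Download and parse index.json, giving up after `timeout` seconds.

    Returns (index, None) on success or (None, message) on failure.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return json.load(response), None
    except (OSError, ValueError) as err:
        # URLError and socket timeouts are OSError subclasses;
        # malformed JSON raises a ValueError subclass.
        return None, f"UNKNOWN: Error fetching {url}: {err}"


# Offline demonstration with a stand-in opener instead of a real request.
def fake_opener(url, timeout):
    return io.BytesIO(b'{"files": []}')


index, error = fetch_index(
    "https://collector.torproject.org/index/index.json", opener=fake_opener)
print(index, error)
```

On failure the caller would print the message and exit with `UNKNOWN`.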
irl: do you spot any checks in this Python script that are way off, or other checks that are missing?
atagar, other Python people: do you mind reviewing the Python code for general code improvements? The goal is to have a single, self-contained, easy-to-read Python script that produces just the data we need for Nagios to send out alerts.
I would add to that "runs fast". The way Nagios schedules checks makes it suffer if there's a check that takes too long. Think "open TCP port" instead of "make a full HTTP request that downloads a 3MB file" or "... renders a complex report". :) We have some leeway of course, but if it can be optimized, it's a definite plus.
Makes sense.
I would also mention there's a "nagiosplugin" python module that could be used instead of rolling our own behavior.
It might be overkill for this simple plugin, but could be useful if you want to actually send metrics like age and so on and have them processable on the other side (which we don't currently do, mind you).
That looks useful. Is that module available on the Tor Nagios host? I agree that it might be overkill for this plugin, but it might be useful for future plugins we write, and then we could go back and simplify the existing scripts for Onionoo and CollecTor.
I'm attaching a fixed version of the script where I removed a superfluous comma that somehow slipped in when doing the final cleanup.
Can you deploy this script on Tor's Nagios for collector.torproject.org (not for collector2.torproject.org, though)?
That looks useful. Is that module available on the Tor Nagios host? I agree that it might be overkill for this plugin, but it might be useful for future plugins we write, and then we could go back and simplify the existing scripts for Onionoo and CollecTor.
The module is not installed yet but installing it is trivial.
Will install on monday, as friday is sacred "don't deploy on friday" day. :)
Trac: Owner: tpa to anarcat Status: merge_ready to accepted
That looks useful. Is that module available on the Tor Nagios host? I agree that it might be overkill for this plugin, but it might be useful for future plugins we write, and then we could go back and simplify the existing scripts for Onionoo and CollecTor.
The module is not installed yet but installing it is trivial.
Will install on monday, as friday is sacred "don't deploy on friday" day. :)
however, it's not possible to change the target host without patching the check. i didn't realize that, but it would be preferable to make that configurable on the commandline.
it is possible to pass a host on the commandline, from what i can tell, but passing the machine hostname fails with a TLS error:
# /usr/lib/nagios/plugins/tor-check-collector -s colchicifolium.torproject.org
UNKNOWN: Error fetching https://colchicifolium.torproject.org/index/index.json: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1056)
the "typical" way this works in nagios is, for example:
 -H, --hostname=ADDRESS
    Host name argument for servers using host headers (virtual host)
    Append a port to include it in the header (eg: example.com:5000)
 -I, --IP-address=ADDRESS
    IP address or name (use numeric address if possible to bypass DNS lookup).
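A check could mirror that convention with argparse. This is a sketch only: the deployed check currently takes `-s`, so the `-H`/`-I` flags, the default host name, and the `address` attribute below are assumptions about a possible future interface.

```python
import argparse


def parse_args(argv=None):
    """Parse options in the style of the standard Nagios plugins."""
    parser = argparse.ArgumentParser(
        description="Check a CollecTor instance for missing or stale files")
    parser.add_argument(
        "-H", "--hostname", default="collector.torproject.org",
        help="host name used for the virtual host and TLS verification")
    parser.add_argument(
        "-I", "--IP-address", dest="address", default=None,
        help="address or machine name to connect to (bypasses DNS for -H)")
    return parser.parse_args(argv)


args = parse_args(["-H", "collector.torproject.org", "-I", "192.0.2.10"])
print(args.hostname, args.address)
```

When `-I` is omitted, `args.address` is `None` and the check would fall back to resolving the host name as usual.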
Thanks for deploying the check! Can you change this line to contacts: +metrics, so that alerts don't go out just to me but to the metrics-alerts@ mailing list?
I'll move away a file on colchicifolium now to trigger the alert and back afterwards. Just to see if it's working.
I'll also look into the parameters and using argparse next week. Unfortunately, the check wouldn't work for corsicum right now anyway, because that CollecTor instance does not archive all descriptor types. It would just keep shouting about timestamps being missing. Maybe we'll need to add another option to only complain about outdated timestamps, not about missing timestamps. Added to my list.
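Such an option might look roughly like the sketch below, which assumes the check already has a mapping from directory to newest timestamp. The `ignore_missing` flag, the function name, and the sample directories are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_problems(newest, expected, now, ignore_missing=False):
    """List problems per expected directory.

    newest:   {directory: datetime of newest file}; absent if none found
    expected: {directory: maximum acceptable age as timedelta}
    With ignore_missing=True, directories without any timestamp are
    skipped and only outdated ones are reported.
    """
    problems = []
    for directory, max_age in sorted(expected.items()):
        last = newest.get(directory)
        if last is None:
            if not ignore_missing:
                problems.append(f"{directory}: missing")
        elif now - last > max_age:
            problems.append(f"{directory}: outdated")
    return problems


now = datetime(2020, 4, 17, 18, 0, tzinfo=timezone.utc)
newest = {"recent/exit-lists": now - timedelta(hours=6)}
expected = {
    "recent/exit-lists": timedelta(hours=3),
    "recent/torperf": timedelta(hours=24),
}
print(find_problems(newest, expected, now))
print(find_problems(newest, expected, now, ignore_missing=True))
```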
Trac: Status: closed to reopened Resolution: fixed to N/A
I'll move away a file on colchicifolium now to trigger the alert and back afterwards. Just to see if it's working.
Yup, this worked. One thing I noticed is that the alert says "global/collector" whereas the Onionoo checks say things like "omeiense/network service - onionoo varnish". Is it possible to rename the CollecTor check to something like "colchicifolium/collector"?
Thanks for deploying the check! Can you change this line to contacts: +metrics, so that alerts don't go out just to me but to the metrics-alerts@ mailing list?
Of course, consider it done.
I'll move away a file on colchicifolium now to trigger the alert and back afterwards. Just to see if it's working.
Definitely got that ring here. :)
I'll also look into the parameters and using argparse next week.
Good.
Unfortunately, the check wouldn't work for corsicum right now anyway, because that CollecTor instance does not archive all descriptor types. It would just keep shouting about timestamps being missing.
That's fine: the point is to make sure we check on a specific host instead of delegating this to DNS or whatever. Keep in mind this means you need to bypass DNS while still making HTTPS verification work! It's tricky stuff... But since you're already using urlopen(), it's possible you can implement such a hack.
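One possible shape for such a hack, sketched with `http.client` rather than `urlopen()`: open the TCP connection to a fixed address, but keep using the service host name for SNI, certificate verification, and the Host header. The class name, the `address` attribute, and the helper are made up for illustration, and this has not been tried against the actual hosts.

```python
import http.client
import socket
import ssl


class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """HTTPS connection that dials a fixed address but verifies `host`."""

    def __init__(self, host, address, timeout=10):
        self.tls_context = ssl.create_default_context()
        super().__init__(host, timeout=timeout, context=self.tls_context)
        self.address = address

    def connect(self):
        # Skip DNS for `host`: connect to the given address instead.
        sock = socket.create_connection(
            (self.address, self.port), self.timeout)
        # SNI and certificate hostname checks still use `host`.
        self.sock = self.tls_context.wrap_socket(
            sock, server_hostname=self.host)


def fetch(host, address, path="/index/index.json"):
    """Fetch `path` from `address`, presenting `host` as the virtual host."""
    conn = PinnedHTTPSConnection(host, address)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

Because the default context is used, a certificate that does not match `host` or does not chain to a trusted CA still fails verification; only the name lookup is bypassed.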
Maybe we'll need to add another option to only complain about outdated timestamps, not about missing timestamps. Added to my list.
No idea about that. ;)
Yup, this worked. One thing I noticed is that the alert says "global/collector" whereas the Onionoo checks say things like "omeiense/network service - onionoo varnish". Is it possible to rename the CollecTor check to something like "colchicifolium/collector"?
That would be confusing, at this stage: because we do not actually control which host we're probing, I prefer to keep the check "global" because that's effectively what it is. When we can specify the host, I'll update the label, if you don't mind.
Not sure which status to set here. I'll just reassign it to you, feel free to resolve. :)
Trac: Owner: anarcat to phw Status: reopened to assigned
Yup, this worked. One thing I noticed is that the alert says "global/collector" whereas the Onionoo checks say things like "omeiense/network service - onionoo varnish". Is it possible to rename the CollecTor check to something like "colchicifolium/collector"?
That would be confusing, at this stage: because we do not actually control which host we're probing, I prefer to keep the check "global" because that's effectively what it is. When we can specify the host, I'll update the label, if you don't mind.
Okay, works for me.
Not sure which status to set here. I'll just reassign it to you, feel free to resolve. :)
I had already opened #34029 (moved) for the remaining changes. I'll resolve this ticket and will open a new one once there's an updated script to deploy.
Thanks!
Trac: Resolution: N/A to fixed Status: assigned to closed
Thanks for deploying the check! Can you change this line to contacts: +metrics, so that alerts don't go out just to me but to the metrics-alerts@ mailing list?
Note that I set it up this way because you were marked as a contact for the hosts (corsicum and colchicifolium), should those be changed to metrics-alerts@ as well?
Thanks for deploying the check! Can you change this line to contacts: +metrics, so that alerts don't go out just to me but to the metrics-alerts@ mailing list?
Note that I set it up this way because you were marked as a contact for the hosts (corsicum and colchicifolium), should those be changed to metrics-alerts@ as well?
No, you can leave those unchanged. These are so low-level, or system-specific, that they shouldn't go out to the world at this point. Really, I should fix the underlying problem that the disk runs full by writing code that deletes files we don't need anymore. That just didn't happen yet for lack of dev time. But that's on my list for May. For now, just the new alerts should go to the mailing list. Thanks!