Increase number of cycles for felix bridges

Status: Merged. anadahz requested to merge anadahz/monit-configuration:fix/alerts into main.

Increase the number of cycles used by the connection timeout checks for default-bridge-felix-1 and default-bridge-felix-2, as they have been generating too many alerts.

Activity

  • Cool, did you test this locally first? You can do that by modifying the email server configuration for monit.

  • After looking at the change, I'd be surprised if this solved the issue. As I understand the syntax used here, monit checks the connection once per cycle (2 minutes). As written before, it would send an alert if the check failed 3 times within a span of 5 cycles. With this change, it will send an alert if the check fails 3 times within a span of 8 cycles. Any 3 failures that fit in 5 cycles also fit in 8, so the new condition triggers at least as often. Isn't this strictly worse than before?
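
    The "X times within Y cycles" reasoning above can be checked with a small simulation. This is only a sketch in Python, assuming monit evaluates the condition over a sliding window of the most recent Y cycles; the function name first_alert_cycle and the outcome sequence are hypothetical and not taken from monit or from this merge request.

    ```python
    from collections import deque


    def first_alert_cycle(results, times, cycles):
        """Return the 1-based cycle at which the condition first matches, or None.

        results: iterable of booleans, True meaning the check failed in that cycle.
        Assumption: the condition matches once at least `times` failures fall
        inside the most recent `cycles` cycles.
        """
        window = deque(maxlen=cycles)  # keeps only the most recent `cycles` outcomes
        for cycle, failed in enumerate(results, start=1):
            window.append(failed)
            if sum(window) >= times:
                return cycle
        return None


    # Hypothetical outcomes: the check fails at cycles 1, 4 and 7.
    outcomes = [True, False, False, True, False, False, True, False]

    print(first_alert_cycle(outcomes, times=3, cycles=5))  # old condition from the comment
    print(first_alert_cycle(outcomes, times=3, cycles=8))  # widened window from the comment
    print(first_alert_cycle(outcomes, times=6, cycles=6))  # failures required in every one of 6 cycles
    ```

    Under that assumption, widening only the window can never make an alert fire later for the same outcomes. Raising the number of required failures is what makes the condition stricter.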

  • anadahz added 1 commit

  • anadahz (Author, Contributor)

    @cohosh true, I mixed up the number of times and cycles. Here is how I tested it:

    $ cat monit-test.conf

    check host testing with address 127.0.0.1
        if failed port 34567
          for 6 times within 6 cycles
        then exec "/usr/bin/netcat -l 127.0.0.1 34567"

    $ monit -d 1 -I -c monit-test.conf

    Starting Monit 5.27.2 daemon
    'host' Monit 5.27.2 started
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' failed protocol test [DEFAULT] at [127.0.0.1]:34567 [TCP/IP] -- Connection refused
    'testing' exec: '/usr/bin/netcat -l 127.0.0.1 34567'
    'testing' connection succeeded to [127.0.0.1]:34567 [TCP/IP]
    Edited by anadahz
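
    The local test above relies on a port that refuses connections until a listener appears. The same two states can be reproduced without monit or netcat. This is a rough Python sketch, assuming a plain TCP connect is a fair stand-in for the port check; port_is_open is a hypothetical helper, and monit's own check is not implemented this way as far as this thread shows.

    ```python
    import socket


    def port_is_open(host, port, timeout=1.0):
        """Return True if a TCP connection to (host, port) succeeds, else False."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Covers "Connection refused" as well as timeouts.
            return False


    # Stand-in for `netcat -l`: listen on an ephemeral port chosen by the OS
    # (port 0) so the sketch does not depend on 34567 being free.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    print(port_is_open("127.0.0.1", port))  # probe while the listener is up
    listener.close()
    print(port_is_open("127.0.0.1", port))  # probe after the listener is gone
    ```

    In the test configuration, six consecutive refused probes are what trigger the exec action that starts the listener.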
  • Okay, thanks for the update. This looks reasonable to me. I'll ping @phw about deploying it.

  • I'm fine with deploying this but why not fix the root cause and ask Felix to take a look at his bridge's reliability? After all, this doesn't appear to be an issue with monit.

  • I've reached out. The metrics for the bridge look reasonable to me. It's possible that these network issues are a symptom of overload, in which case it is on our side to find out why the flag isn't being applied, and possibly to do some load balancing of the default bridges.
