Skip to content
Snippets Groups Projects

draft: notify auto-dns during reboots

Closed anarcat requested to merge autodns-notif into master
2 unresolved threads

Paired with the mininag retirement (team#41734 (closed)), this will ensure hosts get removed from rotation when we reboot them.

There are many problems here:

  • the zonefile gets rebuilt too often
  • code is duplicated between the ganeti and non-ganeti cases (an already existing problem
  • not sure i like the good=True API
  • we should do this for all hosts, as this will take forever, we need a way to check if a host is a mirror, presumbaly in LDAP

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • anarcat requested review from @lelutin

    requested review from @lelutin

  • @lelutin would love to hear your thoughts on how i could improve this code.

  • assigned to @anarcat

    • if I understand correctly, those auto_dns state actions are really only required for a handful of hosts which are in a round-robin rotation for the same service as others. so maybe we should delay merging this until we have some way to check whether or not the state changes are required.

      one thing that I saw is that there's currently no call to auto_dns_set_status with good=True, only False. should we set state to good once we know that the hosts are back online (I'm guessing that would be after if not wait_for_shutdown() ... or is it rather what's intended in the call towards the end of the function right before return True)?

      I agree that the zone rebuild should be split out and done only once before we wait for the shutdowns to happen, and then once more after setting state back to good.

    • if I understand correctly, those auto_dns state actions are really only required for a handful of hosts which are in a round-robin rotation for the same service as others. so maybe we should delay merging this until we have some way to check whether or not the state changes are required.

      agreed.

      one thing that I saw is that there's currently no call to auto_dns_set_status with good=True, only False. should we set state to good once we know that the hosts are back online (I'm guessing that would be after if not wait_for_shutdown() ... or is it rather what's intended in the call towards the end of the function right before return True)?

      there is one auto_dns_set_status(primary_dns_con, con.host, True) line there, but the last one is incorrect. that API sucks, we should just have good/bad directly in the function name.

      I agree that the zone rebuild should be split out and done only once before we wait for the shutdowns to happen, and then once more after setting state back to good.

      yep.


      what do you think of the duplication between ganeti.stop_instances and shutdown_and_wait? part of the messiness here is related to that...

      (i mean the whole thing is a pile of ugly spaghetti code now, but that's a particilarly yucky copy-paste, imho).

    • Please register or sign in to reply
    • hmm well the function shutdown_and_wait is pretty long and it has long if blocks and apparent code duplication in certain parts, so it's hard to understand what's happening with it.

      now is it this MR's fault? I don't think so.. maybe we should refactor the function at some point to make it easier to read and maintain. but I don't think this idea should prevent the addition of the auto_dns stuff.

    • the problem is specifically how we stop instances in ganeti. right now we need to add wrapper code in shutdown_and_wait to cover for that, but maybe that's alright and we can refactor all of that later.

      god i wish we had unit tests for this...

    • fwiw fabric does have some convenience classes and functions to help out with writing tests: https://docs.fabfile.org/en/stable/api/testing.html

      maybe we can start with creating tests for the shutdown_and_wait() function and then we'll be able to refactor it with more confidence

    • not a bad idea! i've always been meaning to get started with fabric tests, but never found the courage to figure out their test suite...

    • Please register or sign in to reply
  • mentioned in issue team#41766 (closed)

  • i don't think this is relevant anymore. in TPA-RFC-67 (#41734), we simply retired mininag and the next step is probably to rewrite it with a prometheus backend going forward, see team#41766 (closed) in any case.

  • closed

  • mentioned in issue team#41811 (closed)

Please register or sign in to reply
Loading