Open puppet fleet doesn't converge
  • Open Issue created by anarcat

    While building a Grafana dashboard over the Prometheus Puppet metrics, I noticed we have a solid 6-8 hosts that never converge.

    https://grafana.torproject.org/d/fe1vcz4hlvgu8b/puppet-health?orgId=1&from=now-30d&to=now&timezone=utc&refresh=auto&viewPanel=panel-2

    Notice the minimum on that panel: 6 hosts.

    I made a panel showing which hosts those are. Currently:

    • anonticket-01.torproject.org
    • crm-int-01.torproject.org
    • puppetdb-01.torproject.org
    • tb-build-02.torproject.org
    • tb-build-03.torproject.org
    • weather-01.torproject.org

    Each of those hosts should be inspected and tweaked so that it converges. I believe most of those hosts (except tb-build*) have this issue:

    root@weather-01:~# pat
    Info: Using environment 'production'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Warning: /Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]: Unable to mark 'unless' as sensitive: unless is a parameter and not a property, and cannot be automatically redacted.
    Info: Caching catalog for weather-01.torproject.org
    Info: Applying configuration version '1732290058'
    Notice: /Stage[main]/Profile::Weather/Postgresql::Server::Db[torweather]/Postgresql::Server::Role[torweather]/Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]/command: changed [redacted] to [redacted]
    Notice: Applied catalog in 12.38 seconds

    crm-int-01 doesn't run psql of course, but I think it's a similar issue with MySQL passwords.

    For the tb-build servers, it's more like this:

    root@tb-build-03:~# pat
    Info: Using environment 'production'
    Info: Retrieving pluginfacts
    Info: Retrieving plugin
    Info: Loading facts
    Info: Caching catalog for tb-build-03.torproject.org
    Info: Applying configuration version '1732290110'
    Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: 
    
    Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
    Notice: Applied catalog in 4.79 seconds
    root@tb-build-03:~# 
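
To triage runs like the two above, it can help to pull out just the resources that reported a change. The following is a minimal sketch, assuming the agent output has been captured as plain text in the format shown in this issue; the function name `changed_resources` and the `sample` variable are made up for illustration and are not part of Puppet.

```python
import re

# A "Notice: /Stage[...]/...: message" line reports a change to one resource.
# The lazy match stops at the first colon followed by a space or end of line,
# so the "::" inside class names such as Profile::Weather is skipped.
CHANGE_RE = re.compile(r"Notice: (/.+?):(?: |$)")


def changed_resources(agent_output: str) -> list[str]:
    """Return the distinct resources that reported a change, in first-seen order."""
    seen: dict[str, None] = {}
    for raw in agent_output.splitlines():
        match = CHANGE_RE.match(raw.strip())
        if match is None:
            continue  # Info/Warning lines and "Notice: Applied catalog ..."
        # Drop the trailing property name (e.g. "/content" or "/command").
        resource, _, _prop = match.group(1).rpartition("/")
        seen.setdefault(resource)
    return list(seen)


# Sample copied from the tb-build-03 run in this issue.
sample = """\
Info: Applying configuration version '1732290110'
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content:
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
Notice: Applied catalog in 4.79 seconds
"""

for resource in changed_resources(sample):
    print(resource)
```

A host whose list is non-empty on every run, with the same resources each time, is a candidate for the kind of non-convergence described here.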

    It would be nice to resolve those so we could have an alert that warns us when a host stops converging.

    (Honestly, I'm not sure this is actually an issue: is it really a bad thing that hosts don't converge? It can be a sign of a serious problem: for example, we could be constantly trying to install a package and failing, and we should probably notice that. But I'm not sure this is high priority, which is why this is Icebox and TPA-RFC-33-C: we definitely were not monitoring this before Prometheus came in, let alone alerting on it.)
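
One possible way to frame the alert condition: a host "never converges" if every run in the observation window changed at least one resource. The sketch below illustrates that idea only. The `history` shape (host name mapped to per-run counts of changed resources), the `min_runs` threshold, and the `example-*` host names are all hypothetical, and this is not how the Prometheus metrics or the Grafana panel are actually defined.

```python
def never_converging(history: dict[str, list[int]], min_runs: int = 3) -> list[str]:
    """Return hosts whose every recorded run changed at least one resource.

    `history` maps a host name to the number of changed resources in each
    of its recent runs (hypothetical data shape, for illustration only).
    Hosts with fewer than `min_runs` runs are skipped to avoid flagging
    freshly installed machines.
    """
    return sorted(
        host
        for host, changes in history.items()
        if len(changes) >= min_runs and min(changes) > 0
    )


# Hypothetical run history: only the first host changes on every run.
history = {
    "weather-01": [1, 1, 1, 1],   # same resource "changes" each run
    "example-ok-01": [0, 0, 0, 0],
    "example-flaky-01": [2, 0, 1, 0],
    "example-new-01": [3],        # too few runs to judge
}

print(never_converging(history))
```

Using the minimum over the window rather than the latest value keeps a host that legitimately changed once (a real deployment) from being flagged.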

    2 of 6 checklist items completed · Edited by Jérôme Charaoui