puppet fleet doesn't converge

While building a Grafana dashboard over the Prometheus Puppet metrics, I noticed we have a solid 6-8 hosts that never converge.

image

https://grafana.torproject.org/d/fe1vcz4hlvgu8b/puppet-health?orgId=1&from=now-30d&to=now&timezone=utc&refresh=auto&viewPanel=panel-2

Notice the min on that: 6 hosts.

I made a panel showing which hosts those are. Currently:

  • anonticket-01.torproject.org
  • crm-int-01.torproject.org
  • puppetdb-01.torproject.org
  • tb-build-02.torproject.org
  • tb-build-03.torproject.org
  • weather-01.torproject.org
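One way to think about the selection behind such a list, as a rough sketch only: a host "never converges" if every Puppet run in the window reported at least one changed resource. The input shape below (pairs of host and changed-resource count) and the `web-01` host are hypothetical, chosen for illustration; this is not the actual exporter format or the query behind the panel.

```python
def never_converged(samples):
    """Return hosts whose every sampled run reported at least one change.

    `samples` is an iterable of (host, changed_resources) pairs, one pair
    per Puppet run in the window. This input shape is an assumption made
    for the sketch, not the real metric format.
    """
    min_changed = {}
    for host, changed in samples:
        # Track the smallest change count seen for each host.
        min_changed[host] = min(changed, min_changed.get(host, changed))
    # A host that converged at least once has a minimum of 0.
    return sorted(host for host, low in min_changed.items() if low > 0)


if __name__ == "__main__":
    samples = [
        ("weather-01", 1), ("weather-01", 1), ("weather-01", 1),
        ("web-01", 0), ("web-01", 3), ("web-01", 0),  # hypothetical host
    ]
    print(never_converged(samples))
```

A host with a one-off change (like the hypothetical `web-01` above) is not reported, since it converged on at least one run.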

Each of those hosts should be inspected and tweaked so that it converges. I believe most of them (except tb-build*) have this issue:

root@weather-01:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Warning: /Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]: Unable to mark 'unless' as sensitive: unless is a parameter and not a property, and cannot be automatically redacted.
Info: Caching catalog for weather-01.torproject.org
Info: Applying configuration version '1732290058'
Notice: /Stage[main]/Profile::Weather/Postgresql::Server::Db[torweather]/Postgresql::Server::Role[torweather]/Postgresql_psql[ALTER ROLE torweather ENCRYPTED PASSWORD ****]/command: changed [redacted] to [redacted]
Notice: Applied catalog in 12.38 seconds

crm-int-01 doesn't run psql of course, but i think it's a similar issue with mysql passwords.

for the tb-build servers, it's more like this:

root@tb-build-03:~# pat
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for tb-build-03.torproject.org
Info: Applying configuration version '1732290110'
Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: 

Notice: /Stage[main]/Profile::Torbrowser_build::Runner/File[/usr/local/bin/update-apps-repos.sh]/content: content changed '{mtime}2024-11-22 15:10:45 +0000' to '{mtime}2024-11-22 15:41:59 +0000'
Notice: Applied catalog in 4.79 seconds
root@tb-build-03:~# 
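To find which resources keep a host from converging, it can help to compare the output of several agent runs and keep only the resources that change every time. The following is a minimal sketch, assuming the runs were saved as plain text in the format shown above; the function names and the regular expression are illustrative and not part of Puppet.

```python
import re

# Matches change notices such as
#   "Notice: /Stage[main]/.../File[/path]/content: content changed ..."
# and captures the resource path up to the first colon that is followed
# by a space or the end of the line. Lines like
# "Notice: Applied catalog in ..." do not start with "/" and are skipped.
CHANGE_RE = re.compile(r"^Notice: (/.+?):(?: |$)")


def changed_resources(log_text):
    """Return the set of resource paths reported as changed in one run."""
    resources = set()
    for line in log_text.splitlines():
        match = CHANGE_RE.match(line.strip())
        if match:
            resources.add(match.group(1))
    return resources


def recurring_changes(*logs):
    """Return resource paths that changed in every one of the given runs."""
    sets = [changed_resources(log) for log in logs]
    return sorted(set.intersection(*sets)) if sets else []
```

Feeding it two consecutive runs from the same host should leave only the resources that are applied on every run, which are the ones worth inspecting first.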

It would be nice to resolve those so we could have an alert that warns us when a host stops converging.

(Honestly, I'm not sure this is actually an issue: is it really a bad thing that hosts don't converge? It can be a sign of a serious issue: for example, we could be constantly trying to install a package and failing, and we should probably notice that. But I'm not sure this is high priority, which is why this is Icebox and TPA-RFC-33-C: we definitely were not monitoring this before Prometheus came in, let alone alerting on it.)

Edited by Jérôme Charaoui